This will be used in upcoming commits to varlinkify `systemd-sysupdate`;
it will need a way to identify targets over varlink, and the existing
way with a `Target` over D-Bus seems to work quite well.
`process_image()` has historically used `umount_and_freep` to clean up
the mounted directory locally, but callers to it have used
`umount_and_rmdir_and_freep`.
No directory is created after any of the error return paths in
`process_image()`, so it should probably be using
`umount_and_rmdir_and_freep` too.
This is another step towards varlinkifying the program, as it means the
various verb implementations are no longer relying on global state from
the command line.
As part of this, move init of the `Context` struct into a new
`context_from_cmdline()` function.
Additionally pass some context into config parsing `userdata` arguments,
as various config parsers were using `arg_root` via a sneaky `extern`.
This introduces no functional changes.
There’s no need for it to be heap allocated — there’s only ever one
instance of it, and it’s allocated for the lifetime of a `verb_*()`
function.
Simplify things a bit by making it stack allocated. This will also help
with upcoming commits where we introduce derived context structs to help
with varlinkifying sysupdate. By allowing `Context` to be stack
allocated we can include it in the derived context structs.
As part of this, rename `context_make_{offline,online}()` to
`context_load_{offline,online}()` for clarity (since they no longer init
the struct).
This introduces no functional changes.
`process_image()` is always called immediately before (almost) every
`context_make_online()` or `context_make_offline()`, and the structures
it allocates have the same lifetime as `Context`, so we might as well
factor them all together to reduce duplication.
This will also simplify the following commit, which changes heap
allocation of `Context`s, and simplify upcoming changes to factor out
`arg_*` handling.
The call in `verb_pending_or_reboot()` is safe because it already
validates that `arg_image` is `NULL`, hence `process_image()` will bail
out early.
This introduces no functional changes.
This makes it like all the other verbs and therefore easier to refactor.
At the same time, remove the separate `component` argument and instead
use the `component` set on the `Context`. This guards against bugs, as
various parts of the `Context` state depend on the component (for
example, `installdb_fd`) and overriding the component without also
overriding its dependent variables will lead to bugs.
If the FD has already been opened, return 1 as if opening was
successful, rather than returning 0 as if it gave `ENOENT`.
This fixes doing multiple installdb operations on a single `Context`.
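
The return convention described above can be sketched as follows. This is a minimal illustration, not the sysupdate code: `DemoContext` and `demo_open_installdb()` are hypothetical names, and only the `installdb_fd` field is taken from the text.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical stand-in for the relevant part of Context. */
typedef struct DemoContext {
        int installdb_fd; /* -1 while the database has not been opened */
} DemoContext;

/* Returns 1 if the database is open (newly or already), 0 if it does not
 * exist, and a negative errno value on any other failure. */
static int demo_open_installdb(DemoContext *c, const char *path) {
        if (c->installdb_fd >= 0)
                return 1; /* already open: report success, not "missing" */

        int fd = open(path, O_RDONLY | O_CLOEXEC);
        if (fd < 0)
                return errno == ENOENT ? 0 : -errno;

        c->installdb_fd = fd;
        return 1;
}
```

With this shape a second call on the same context is indistinguishable from a first successful open, which is what callers performing several operations in a row need.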
Signed-off-by: Philip Withnall <pwithnall@gnome.org>
The example states that the first /bin/echo invocation (using ${ONE})
receives the argument 'one' (with literal single quotes). However,
Environment=ONE='one' strips the syntactic single quotes during
unquoting — see systemd.syntax(7), "Quotes themselves are removed" —
so ONE holds the value one, and ${ONE} (exact-value substitution,
always a single argument) yields the argument one without quotes.
Fixes #42442
Signed-off-by: Wang Yu <wangyu@uniontech.com>
Previously the key was unknown. Add the correct mapping, as this model does not follow the
general case for MSI laptops.
[ 192.562000] atkbd serio0: Unknown key released (translated set 2, code 0xd7 on isa0060/serio0).
[ 192.562011] atkbd serio0: Use 'setkeycodes e057 <keycode>' to make it known.
For now, add it as a definition specific to this model; it can be generalized to other MSI
laptops if the issue turns up elsewhere too.
When running in a mkosi namespaced environment, tmpfiles fails to set ACLs:
Running create action for entry a /buildroot/var/log/journal
Setting access ACL u::rwx,g::r-x,g:adm:r-x,m::r-x,o::r-x on /buildroot/var/log/journal
Setting access ACL "u::rwx,g::r-x,g:adm:r-x,m::r-x,o::r-x" on /buildroot/var/log/journal failed: Invalid argument
If EINVAL is returned and we are in a chroot, skip gracefully via
EOPNOTSUPP. The ACLs will be set on first boot.
event_log_record_extract_firmware_description() walks the device path
of a UEFI_IMAGE_LOAD_EVENT taken from the firmware TPM2 measurement log.
The per-node loop checks the remaining bytes against the node and its
declared length, but never that dp->length covers the 4-byte node header
offsetof(packed_EFI_DEVICE_PATH, path).
For a Media/File-Path node with length 3, the file-name extraction
computes dp->length - offsetof(packed_EFI_DEVICE_PATH, path) == 3 - 4,
which wraps to SIZE_MAX. utf16_to_utf8() treats SIZE_MAX as unbounded
and runs char16_strlen() over dp->path, reading past the log buffer; a
length of 0 also leaves dp non-advancing.
efi_get_boot_option() in src/shared/efi-api.c already rejects such nodes
with "if (dpath->length < 4) break;"; do the same here.
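
The check can be illustrated with a simplified walker. This is a sketch under assumptions, not the pcrlock code: it parses raw bytes instead of using packed_EFI_DEVICE_PATH, and `demo_count_nodes()` and the `DEMO_*` constants are made-up names. It relies only on the node layout described above (a 4-byte header holding type, sub-type and a little-endian 16-bit length that includes the header).

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_DP_HEADER_SIZE 4u  /* type, sub-type, 16-bit length */
#define DEMO_DP_TYPE_END 0x7fu  /* end-of-device-path node type */

/* Counts the well-formed nodes in buf, stopping at the end node or at the
 * first malformed node. */
static size_t demo_count_nodes(const uint8_t *buf, size_t size) {
        size_t n = 0;

        while (size >= DEMO_DP_HEADER_SIZE) {
                uint8_t type = buf[0];
                size_t length = (size_t) buf[2] | ((size_t) buf[3] << 8);

                if (length < DEMO_DP_HEADER_SIZE)
                        break; /* node cannot even cover its own header */
                if (length > size)
                        break; /* declared length runs past the buffer */
                if (type == DEMO_DP_TYPE_END)
                        break;

                /* From here on, length - DEMO_DP_HEADER_SIZE cannot wrap,
                 * and the cursor always advances by at least 4 bytes. */
                n++;
                buf += length;
                size -= length;
        }

        return n;
}
```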
Re-enables `--set-credential=` / `--load-credential=` under
`--coco=sev-snp` by packaging credentials into a cpio appended to the
initrd, mirroring what `systemd-stub` does for ESP-sourced credentials.
The initrd is covered by the launch measurement via `kernel-hashes=on`,
so the credentials are too.
Tested end-to-end on an SNP-capable host: credentials passed via
`--set-credential=` land in `/run/credentials/@encrypted/` inside the
guest.
Use sd_json_variant_is_blank_array() instead of is_blank_object() for
p.addresses and p.names, which are declared as JSON arrays. The wrong
predicate never triggered, allowing empty arrays to bypass the guards:
for p.names this caused a size_t underflow leading to an out-of-bounds
heap write; for p.addresses it returned success with no addresses.
Add explicit n_addresses == 0 guards after the family-filter loops so
entries with unsupported families also return NOTFOUND rather than
crashing on a NULL dereference.
In gethostbyname3_r (family-specific entry point), return NO_DATA for
all zero-address results — both blank array and all-filtered — since
both mean "name resolved, no record of the requested family". Keep
HOST_NOT_FOUND in gethostbyname4_r (both-families) where a blank or
all-unsupported result genuinely means the name was not found.
Signed-off-by: dongshengyuan <dongshengyuan@uniontech.com>
Co-developed-by: Claude Opus 4.8 <noreply@anthropic.com>
In Meta production we have been considering using journald more widely
for some time. One of the blockers to doing that which I have noticed is
that often journald seems to have vastly less data after lockups/power
failures compared to plain files, which is not great when debugging
outages.
At low write rates this tends to be hard to reproduce, but when writing
thousands of messages a second, an unclean shutdown can leave behind an
active journal file whose header records an arena larger than the data
that actually reached disk. journalctl then discards the entire file(!),
completely ignoring the huge amount of data in it that is actually
perfectly readable.
The reason for that is that the journal header is updated on every
append, while the file size and newly written arena contents are only
made durable on the filesystem's own schedule. After a crash, the header
can therefore describe writes which were logically completed by journald
but whose backing data or file metadata never reached disk.
Take the following example of how this can happen at high log rates:
1. journald appends objects into an mmap()ed arena, periodically growing
the file with fallocate() in FILE_SIZE_INCREASE (8M) steps and advancing
the header's arena_size tail pointers as it goes along.
2. The header is dirtied on every append, and its arena_size is advanced
at each fallocate(). It is, from the kernel's perspective, an ordinary
data page and is only made durable by the kernel's periodic page cache
writeback on its own schedule. The file's length, by contrast, is
metadata, made durable only when the filesystem commits a transaction
(or on an fsync(), which journald does not issue between sync
intervals).
3. journald marks journals NOCOW, so the header's data block is
overwritten in place and is decoupled from the size metadata. Nothing
orders the two with respect to each other. Writeback therefore can
routinely persist a header whose arena_size has run ahead of the file
length recorded on disk.
4. Power is lost. On the next boot the persisted header reflects an
arena_size and tail pointers which have been advanced for appends.
However, their payload and the file metadata were never committed, so
header_size + arena_size now points well past the end of the file as it
exists on disk.
5. journal_file_verify_header() then rejects this with -ENODATA:
    if (... || header_size + arena_size > (uint64_t) f->last_stat.st_size)
            return -ENODATA;
That is correct when opening for writing, because we must not append to
a file whose recorded state we cannot trust, and the caller must rotate
it away. But the same check also runs on read only opens, where it is
actively harmful. In the case of journalctl, the entire file is skipped,
even though the data hash table, the field hash table, and the head of
the array all are present and fully intact, and the great majority of
entries are physically present. In fact, only a very small part of the
most recently written tail is missing, but everything before is
readable. This results in mistakenly rejecting the entire file as
corrupt.
This happens extremely frequently on machines with high write rates
during power cuts or lockups. In testing writing ~7500 msg/s through
journald and then cutting power, I reproduced it in ten out of ten
attempts across different machines.
In each case, the header was left claiming ~296M of arena while only
~192-208M had reached disk. In this case, journalctl reports that it has
recovered 0 of ~335000 messages. Whether a given crash trips the
condition depends on where it falls relative to the header's writeback,
but when it does, the loss today is total. After this patch the vast
majority of messages can be retrieved.
Let's fix this by keeping the rejection for writing, but for read-only
opens, let's just clamp the arena to the real file size and skip the
consistency checks on the now unreliable tail pointers. The reader will
walk the entry array chain from its intact head and stop at the
truncation point thanks to the bounds check that already exists, so
nothing more is needed there.
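
The policy can be sketched as below. This is an illustration only: `DemoHeader` and `demo_check_arena()` are hypothetical, the sketch works on an in-memory copy of the two sizes, and the extra check for a truncated header is an assumption added to keep the subtraction safe.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory copy of the two header fields involved. */
typedef struct DemoHeader {
        uint64_t header_size;
        uint64_t arena_size;
} DemoHeader;

/* Returns 0 if the arena fits into the file, 1 if it was clamped for a
 * read-only open, and -ENODATA if the file must be rejected. */
static int demo_check_arena(DemoHeader *h, uint64_t file_size, bool writable) {
        if (h->header_size > file_size)
                return -ENODATA; /* not even the header is complete */

        if (h->arena_size <= file_size - h->header_size)
                return 0;

        if (writable)
                return -ENODATA; /* never append to an untrustworthy file */

        /* Read-only: expose only what actually exists on disk. */
        h->arena_size = file_size - h->header_size;
        return 1;
}
```

Comparing against `file_size - header_size` instead of adding the two sizes avoids any wrap-around in the check itself.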
`systemd-analyze security --user foo.service` currently flags units
without `User=` as running as root. For user manager instances this is
impossible: per systemd.exec(5), switching user identity is not
permitted there, so the service always runs under the calling user's
UID.
Track the runtime scope inside SecurityInfo and short-circuit
security_info_runs_privileged() and assess_user() for
RUNTIME_SCOPE_USER, so that User=/DynamicUser=, SupplementaryGroups=
and RemoveIPC= are no longer marked as if the service ran as root in
both the bus-backed and --offline paths.
Fixes #40292
Signed-off-by: Shihao Ren <renshihao.rsh@bytedance.com>
unit_name_mangle_with_suffix() is quite benevolent by default and allows
the unit to "transition" into a different unit type than what's
requested via its suffix argument. For example, calling
unit_name_mangle_with_suffix() with "/foo/bar" as a unit name and
".service" as a suffix would give you "foo-bar.mount", without any
warning or error.
This could then lead to quite confusing errors in certain situations:
```
~# systemd-run --remain-after-exit --unit /foo/bar true
Failed to start transient service unit: Cannot set property RemainAfterExit, or unknown property.
```
Given we can't change the default behaviour of
unit_name_mangle_with_suffix() as some parts of systemd already depend
on its "benevolence" (like systemctl), let's introduce a new flag -
UNIT_NAME_MANGLE_STRICT - that checks if the mangled/resolved unit
name's suffix matches the requested one and errors out if not.
With the flag used throughout systemd-run's code, the error in the above
case is now a bit more clear:
```
~# build/systemd-run --remain-after-exit --unit /foo/bar true
Path "/foo/bar" resolves to unit type "mount", but "service" is expected as unit.
Failed to mangle unit name: Invalid argument
```
Resolves: #39996
In home_unlocking_finish(), the success path calls operation_result_unref()
with the local variable r and the uninitialized error object. If either
user_record_good_authentication() or home_save_record() fails (both are
logged as "ignoring"), r is left negative and the D-Bus caller receives
an error reply despite the home having been unlocked successfully.
This causes PAM to reject the session even though the home directory is
mounted and accessible.
Fix by passing 0 and NULL — consistent with every other success path in
the file (home_locking_finish(), home_activation_finish(), etc.).
Signed-off-by: dongshengyuan <dongshengyuan@uniontech.com>
The data and field hash table chains have the same problem the previous
commit fixed for entry array chains. New data and field objects are
linked at the tail of their hash bucket by patching the previous tail
object's next_hash_offset in place, so after a crash a persisted
predecessor (or the bucket head) can point at an object whose body never
reached disk.
journal_file_find_data_object_with_hash() and
journal_file_find_field_object_with_hash() walk those chains while
resolving matches, and on -EADDRNOTAVAIL/-EBADMSG from
journal_file_move_to_object() they simply return the error directly.
That propagates up to real_journal_next(), which discards the whole file
from the query.
Give those two lookups the same tolerance: on a read-only file, treat an
unreadable chain node as the end of the bucket chain.
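
A toy model of that tolerance, assuming a chain stored in an array where only the first n_readable nodes "reached disk". `DemoNode` and `demo_find()` are hypothetical and much simpler than the journal's offset-based objects; the point is only where the read-only and writable cases diverge.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Links are 1-based indices into nodes[]; 0 terminates the chain. */
typedef struct DemoNode {
        uint64_t hash;
        size_t next;
} DemoNode;

/* Returns 1 and sets *ret if a node with the hash is found, 0 if it is not
 * found, and a negative errno value if the chain cannot be trusted. */
static int demo_find(const DemoNode *nodes, size_t n_readable, size_t head,
                     uint64_t hash, bool writable, size_t *ret) {
        for (size_t i = head; i != 0; i = nodes[i - 1].next) {
                if (i > n_readable) {
                        if (writable)
                                return -EADDRNOTAVAIL;
                        return 0; /* read-only: treat as end of the chain */
                }

                if (nodes[i - 1].hash == hash) {
                        *ret = i;
                        return 1;
                }
        }

        return 0;
}
```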
generic_array_get(), which is used for the unfiltered iteration path in
the previous commit, treats a chain pointer that resolves past the end of
the file as the end of the chain. In that case, moving to the missing
array object returns -EADDRNOTAVAIL (or -EBADMSG), and it either stops
(going downwards) or steps back to the previous array (going upwards).
However, generic_array_bisect(), which is used for filtered or seeking
reads, does not. On -EADDRNOTAVAIL/-EBADMSG from
journal_file_move_to_object(), it instead returns the error directly to
the caller, which propagates out through
sd_journal_next()/sd_journal_previous() and aborts the query.
The per-data entry array chain has the same issue as the global one,
since n_entries and entry_array_offset are (re)written in place as
entries are linked, and thus after a crash they can reference more
arrays than actually reached the disk. In practical terms, a journal
recovered for reading by the previous commit could
nevertheless still drop matching entries from `journalctl FIELD=value`,
and a seqnum or time seek into the lost region could fail outright.
Let's give generic_array_bisect() the same tolerance generic_array_get()
already has. That is, when moving to an entry array object fails, treat
the chain as ending at the previous array. This means that the result
matches what generic_array_get() would yield for the same file.
request_handler() owns the hostname var and passes it by value to
request_meta(), which hands it to source_new(), which stores it in
source->importer.name without copying. If build_accept_encoding()
then fails, the hostname var is freed, and then the caller's
_cleanup_free_ frees it a second time.
Follow-up for 9ff48d0982
request_reader_entries() negated m->n_skip in signed context before
casting to uint64_t, which is undefined behaviour for
n_skip == INT64_MIN.
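
A minimal sketch of a well-defined alternative: convert to uint64_t first and negate in unsigned arithmetic, which is defined modulo 2^64. `demo_magnitude()` is a hypothetical helper, not the function used in the actual fix.

```c
#include <stdint.h>

/* Returns the magnitude of a negative value. Expects n < 0. The conversion
 * to uint64_t happens before the negation, so nothing overflows even for
 * INT64_MIN. */
static uint64_t demo_magnitude(int64_t n) {
        return -(uint64_t) n;
}
```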
Follow-up for 77ad3b93de
Follow-up for a7bfb9f76b
* f7762b7143 sandbox: Preserve net caps across user namespace before unsharing net
* 582eadee34 Revert "Put build history into the output directory"
* 5ef262bc53 action: don't fail if apk cannot be downloaded
* bdd341ff9b Lock the package cache during package manager invocations
* da49fe976c Put build history into the output directory
* 1c392f1918 tests: Use unique machine names
* e4f4026e30 tests: Reduce VM RAM size
* de41a5e03e Don't leak gpg-agent when signing with gpg
* 1bc5d61e1d ci: Pin openSUSE to second-to-last Tumbleweed snapshot
* c4d565a009 test: Use the main build's snapshot for extension builds
* 718b06c866 tests: ignore masked units in check-and-shutdown
* 0dc5ecbc02 ci: enable postmarketOS in integration testing
* d4c6761ad3 action: install apk to /usr/bin
* 9980f31309 mkosi-vm: add systemd-efistub to postmarketOS config
* 5640ace38f mkosi.conf: add grub to postmarketOS
* 6741b440c0 mkosi-initrd: add sulogin, device-mapper to postmarketOS initrd
* c3575c035c mkosi-tools: add missing packages to postmarketOS tools tree
* 0774bc2498 mkosi-tools: add apk-tools to tools trees for Arch and OpenSuSE
| * bb87e48401 curl: Retry on failures
|/
* 41fea1dd8d dnf: Work around librepo rejecting valid repomd signatures cross-distro
* 647e3b610b dnf: Proper repository metadata signature requirement
* 46d907cce2 dnf: Don't skip unavailable repositories during makecache
* a91e89c3b7 run_locale_gen: noop if output_format is confext
* 30329e401b tests: Make integration tests runnable locally
* be549f04db config: Don't propagate $MKOSI_DNF when using a tools tree
* 42ed648981 build(deps): bump actions/upload-artifact from 7.0.0 to 7.0.1
* fd5eedd62b build(deps): bump aws-actions/configure-aws-credentials
* 86733c703d tree: check for root when copying SELinux attributes as well
* de2256f8fe Skip security.ima xattrs when copying tree as non-root
| * 08ebf6d678 vmspawn: Exclude secure-boot unless requested
|/
* 1d3c51e36d obs workflow: do not build aarch64/i586
When 'homectl deactivate' is called immediately after a preceding
operation, the umount inside systemd-homework can fail with EBUSY
because something briefly holds a reference to the home mount (e.g. a
concurrent inspect). systemd-homed already handles this gracefully
by moving the home into the 'lingering' state and retrying deactivation
after 15 seconds, but the bus reply for the original DeactivateHome
call returns the org.freedesktop.home1.HomeBusy error immediately,
which makes TEST-46-HOMED flaky.
Fix homectl to follow homed and retry for up to 30 seconds on HomeBusy
and add a test case trying to make the issue more reproducible.
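
The retry logic can be sketched generically as below. Assumptions: -EBUSY stands in for the HomeBusy bus error, the retry interval is a free parameter (the text only gives the 30 second limit), and all `demo_*` names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

typedef int (*demo_op_t)(void *userdata);

static uint64_t demo_now_usec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t) ts.tv_sec * 1000000u + (uint64_t) ts.tv_nsec / 1000u;
}

/* Calls op until it returns something other than -EBUSY or until the
 * timeout has elapsed. Returns the last result of op. */
static int demo_retry_while_busy(demo_op_t op, void *userdata,
                                 uint64_t timeout_usec, uint64_t interval_usec) {
        uint64_t deadline = demo_now_usec() + timeout_usec;

        for (;;) {
                int r = op(userdata);
                if (r != -EBUSY)
                        return r;
                if (demo_now_usec() >= deadline)
                        return -EBUSY;

                struct timespec ts = {
                        .tv_sec = (time_t) (interval_usec / 1000000u),
                        .tv_nsec = (long) ((interval_usec % 1000000u) * 1000u),
                };
                nanosleep(&ts, NULL);
        }
}

/* Example operation: reports -EBUSY until the counter reaches zero. */
static int demo_busy_countdown(void *userdata) {
        unsigned *left = userdata;

        if (*left == 0)
                return 0;
        (*left)--;
        return -EBUSY;
}
```

Using a monotonic clock keeps the deadline unaffected by wall-clock adjustments during the wait.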
Let's do the standard thing. The 'static const' variable takes up space
and results in less efficient code (a load from memory instead of an
inline constant). This doesn't matter much, but let's follow the
standard pattern.
Follow-up for 93e9c2c974.
setup_swtpm() decided whether a software TPM had already been
manufactured by checking whether the state directory was empty. But
manufacture_swtpm() writes swtpm's config files before forking
swtpm_setup, so an interrupted manufacture leaves the directory
non-empty yet without a usable TPM. The next boot then mistook it for a
complete TPM and started swtpm against a broken state directory.
Keying off a swtpm state file like tpm2-00.permall is no better, as
swtpm_setup gives no guarantee any single one is written atomically or
last. Instead, have manufacture_swtpm() write a marker (.manufactured)
as its very last step, once swtpm_setup has exited successfully, and
gate on it: re-manufacture when it is missing in the initrd, and refuse
rather than start a broken TPM outside it.
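
The marker logic can be sketched as follows. Only the marker name .manufactured comes from the text; the `demo_*` functions, the no-op manufacturing callback and the -ENOENT returned when refusing outside the initrd are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

#define DEMO_MARKER ".manufactured"

/* Returns 1 if the marker exists, 0 if not, negative errno otherwise. */
static int demo_is_manufactured(int dir_fd) {
        if (faccessat(dir_fd, DEMO_MARKER, F_OK, 0) >= 0)
                return 1;
        return errno == ENOENT ? 0 : -errno;
}

/* Creates the marker; only call this after setup has fully succeeded. */
static int demo_mark_manufactured(int dir_fd) {
        int fd = openat(dir_fd, DEMO_MARKER, O_WRONLY | O_CREAT | O_CLOEXEC, 0600);
        if (fd < 0)
                return -errno;

        int r = fsync(fd) < 0 ? -errno : 0;
        close(fd);
        return r;
}

/* Placeholder for the real manufacturing step (hypothetical). */
static int demo_manufacture_noop(int dir_fd) {
        (void) dir_fd;
        return 0;
}

static int demo_prepare(int dir_fd, bool in_initrd, int (*manufacture)(int dir_fd)) {
        int r = demo_is_manufactured(dir_fd);
        if (r < 0)
                return r;
        if (r > 0)
                return 0; /* complete state, safe to start */
        if (!in_initrd)
                return -ENOENT; /* refuse; error code chosen for illustration */

        r = manufacture(dir_fd);
        if (r < 0)
                return r; /* no marker written: the next attempt starts over */

        return demo_mark_manufactured(dir_fd);
}
```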
Signed-off-by: Paul Meyer <katexochen0@gmail.com>
Open the swtpm state directory once and write the three config files
relative to that fd with WRITE_STRING_FILE_ATOMIC, rather than by path
with a plain truncating write. Writing atomically ensures a crash or a
concurrent reader never observes a half-written config file, and
operating through a single directory fd lets later steps reuse it.
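
For readers unfamiliar with the technique, here is a generic sketch of an atomic write relative to a directory fd using plain POSIX calls: write to a temporary name, then renameat() over the target. It is not systemd's WRITE_STRING_FILE_ATOMIC implementation; `demo_write_string_atomic_at()` and the temporary naming scheme are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int demo_write_string_atomic_at(int dir_fd, const char *name, const char *text) {
        char tmp[256];
        int n = snprintf(tmp, sizeof tmp, ".%s.tmp%ld", name, (long) getpid());
        if (n < 0 || (size_t) n >= sizeof tmp)
                return -ENAMETOOLONG;

        int fd = openat(dir_fd, tmp, O_WRONLY | O_CREAT | O_EXCL | O_CLOEXEC, 0600);
        if (fd < 0)
                return -errno;

        int r = 0;
        const char *p = text;
        size_t left = strlen(text);
        while (left > 0) {
                ssize_t k = write(fd, p, left);
                if (k < 0) {
                        if (errno == EINTR)
                                continue;
                        r = -errno;
                        break;
                }
                p += k;
                left -= (size_t) k;
        }

        if (r >= 0 && fsync(fd) < 0)
                r = -errno;
        if (close(fd) < 0 && r >= 0)
                r = -errno;

        /* The rename is the commit point: readers see either the old file
         * or the complete new one, never a partial write. */
        if (r >= 0 && renameat(dir_fd, tmp, dir_fd, name) < 0)
                r = -errno;
        if (r < 0)
                (void) unlinkat(dir_fd, tmp, 0);

        return r;
}
```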
Signed-off-by: Paul Meyer <katexochen0@gmail.com>
The new polkit will return a new detail regarding a successful
authentication: the actual result type, which we can use to
see whether the user authenticated as admin. This can be used
to grant additional privileges.
Apply sandboxing. The plain backend needs a writable StateDirectory and
/dev/urandom for key generation. The service must stay root (the
private key is root-only), but everything else is locked down.
Signed-off-by: Paul Meyer <katexochen0@gmail.com>
client_context_read_extra_fields() reads a 64-bit field length v from
the per-unit log-extra-fields file. n = sizeof(uint64_t) + v overflows
when v is near UINT64_MAX, so the "left < n" check is bypassed and the
following memchr() scans v bytes past the buffer. Bound v against the
remaining bytes instead, which cannot overflow.
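
The difference between the two checks, as a standalone sketch. `demo_field_fits()` is a hypothetical helper; it only assumes the record layout described above (an 8-byte length prefix followed by v payload bytes).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if a record consisting of an 8-byte length prefix plus v
 * payload bytes fits into the left remaining bytes. The subtraction only
 * happens after left is known to be large enough, so nothing can wrap. */
static bool demo_field_fits(size_t left, uint64_t v) {
        if (left < sizeof(uint64_t))
                return false;

        return v <= left - sizeof(uint64_t);
}
```

By contrast, computing `sizeof(uint64_t) + v` first wraps to a small number for v near UINT64_MAX, which is how the original check was bypassed.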
uid_range_partition() filled the grown entries[] buffer backwards in
place. The backward-fill invariant (the write cursor stays above the
read index) only holds when every source entry contributes at least
one partition; an entry with nr < size contributes zero, so the cursor
stalls while the read index keeps descending. A later multi-part
entry's writes then overwrite the still-live zero-part slot, the
corrupted slot is re-read as a one-part entry, and the next
range->entries[--t] underflows.
Add a forward compaction first pass that drops the zero-part entries
before the backward fill.
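
A simplified sketch of the two-pass approach. It is not the uid-range.c code: `DemoEntry` and `demo_partition()` are hypothetical, the buffer is assumed to be large enough already, and any remainder of a range that does not fill a whole partition is simply dropped, which is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct DemoEntry {
        uint32_t start, nr;
} DemoEntry;

/* Splits each entry into partitions of exactly size IDs, in place. entries[]
 * must have room for the sum of nr / size over all n entries, and size must
 * be non-zero. Returns the new number of entries. */
static size_t demo_partition(DemoEntry *entries, size_t n, uint32_t size) {
        size_t kept = 0, total = 0;

        /* Pass 1 (forward): drop entries too small for even one partition. */
        for (size_t i = 0; i < n; i++) {
                if (entries[i].nr < size)
                        continue;
                total += entries[i].nr / size;
                entries[kept++] = entries[i];
        }

        /* Pass 2 (backward): every remaining entry yields at least one
         * partition, so the write cursor t never drops below the read
         * index i and no unread entry is overwritten. */
        size_t t = total;
        for (size_t i = kept; i-- > 0; ) {
                DemoEntry e = entries[i];

                for (uint32_t k = e.nr / size; k-- > 0; )
                        entries[--t] = (DemoEntry) { .start = e.start + k * size, .nr = size };
        }

        return total;
}
```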
Follow-up for 025439faaa
Co-Authored-by: Paul Meyer <katexochen0@gmail.com>