Linux 7.0 added the ability to mark socket inodes with xattrs. Let's use
that to clearly mark all our Varlink sockets as being varlink related.
This is then used to implement a very useful new command "varlinkctl
list-sockets" which lists all varlink entrypoint sockets marked this
way.
By marking not just the entrypoint inodes but also the connection
sockets properly, we can one day add an eBPF-based "varlinkctl trace"
command that watches varlink sockets for traffic. But that's material
for a later PR.
When the test suite is run in the "standalone" mode, the minimal
container might not contain the test-fdstore binary that's needed for a
couple of tests. Since installing systemd-tests into the minimal
container pulls in a lot of other dependencies, let's just skip the
affected tests instead to avoid this.
This commit exposes the last 10 high priority logs as metrics so that
the systemd-report reports them. The entries are reported as
`io.systemd.Journal.HighPriorityMessage` and include all fields that are
printable as strings.
This is achieved via a new socket-activated unit that listens on
/run/systemd/report/io.systemd.Journal
With EnqueueUnitJobMany(), one anchor can collapse to NOP (inactive
unit + try-restart) while another anchor pulls that same unit in as a
regular start/restart job, leaving a NOP and a regular job in one
unit's transaction list, hitting an assert:
#11 0x00007f3fd2a446dc in __assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>,
function=<optimized out>) at ./assert/assert.c:127
#12 0x00007f3fd326e872 in job_type_lookup_merge (a=<optimized out>, b=<optimized out>) at ../src/core/job.c:428
#13 0x00007f3fd32e5641 in job_type_merge_and_collapse (a=0x7ffc7dda2430, b=<optimized out>, u=0x557bb11434c0)
at ../src/core/job.c:523
#14 0x00007f3fd335e4b3 in transaction_ensure_mergeable (tr=tr@entry=0x557bb0f6d150,
matters_to_anchor=matters_to_anchor@entry=true, e=e@entry=0x7ffc7dda33e0) at ../src/core/transaction.c:241
#15 0x00007f3fd3360242 in transaction_merge_jobs (tr=0x557bb0f6d150, e=0x7ffc7dda33e0)
at ../src/core/transaction.c:273
#16 transaction_activate (tr=0x557bb0f6d150, m=0x557bb0dd9c10, mode=JOB_REPLACE, affected_jobs=0x0, e=0x7ffc7dda33e0)
at ../src/core/transaction.c:797
#17 0x00007f3fd33091ed in manager_add_jobs (m=<optimized out>, type=<optimized out>, names=<optimized out>,
reload_if_possible=false, mode=JOB_REPLACE, extra_flags=0, affected_jobs=0x0, reterr_error=0x7ffc7dda33e0,
ret_jobs=0x557bb0fe8790) at ../src/core/manager.c:2386
Follow-up for 7d3b32daef
This commit exposes the last 10 high priority logs as metrics
so that the systemd-report reports them. The entries are
reported as `io.systemd.Journal.HighPriorityMessage` and
include all fields as the new METRIC_FAMILY_TYPE_OBJECT.
Individual fields from a journal entry that are unprintable
(invalid utf-8) are skipped.
This is achieved via a new socket-activated unit that listens on
/run/systemd/report/io.systemd.Journal
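As a rough illustration of the bounded buffer described above, the following Python sketch keeps the last 10 entries and drops fields that are not valid UTF-8. The priority threshold (`MAX_PRIORITY = 3`, i.e. "err" and more severe) and all function names are assumptions made for illustration; they are not taken from the commit.

```python
from collections import deque

KEEP = 10          # number of entries retained, as stated above
MAX_PRIORITY = 3   # assumption: "high priority" means syslog level err (3) or more severe

recent = deque(maxlen=KEEP)  # oldest entries fall out automatically


def printable_fields(entry: dict[str, bytes]) -> dict[str, str]:
    """Return only the fields whose values decode as valid UTF-8."""
    out = {}
    for key, raw in entry.items():
        try:
            out[key] = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # unprintable field: skip it, keep the rest of the entry
    return out


def record(entry: dict[str, bytes]) -> None:
    """Remember the entry if its PRIORITY is high enough."""
    fields = printable_fields(entry)
    try:
        priority = int(fields.get("PRIORITY", ""))
    except ValueError:
        return  # no usable priority: ignore the entry
    if priority <= MAX_PRIORITY:
        recent.append(fields)
```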
Currently only a single job for a single unit can be enqueued
atomically, so there is no guarantee that, e.g., starting a unit and its
socket at the same time will happen in the same transaction. That forces
callers to 'know' the right order in which to start new units being
installed, or failures will occur. It also means some ordering
constraints are ignored, in case the separate calls are done
in the wrong manual order.
Add a new EnqueueUnitJobMany() D-Bus method that takes a list of units
to start.
Currently the partition list is ordered like this: First come the
partitions that exist as definition files (could be pre-existing
partitions or could be new ones), then come the pre-existing partitions
that aren't matched to a definition file.
This ordering is visible to the user when we print our partition table,
and it doesn't really make sense from a UX perspective: Partition tables
are usually either presented in order of the partition indices, or in
order of the partition offsets. Arguably the latter would be nicer here,
since the visualization below is already ordered by physical offsets.
So reorder the list after we have assigned the new partitions to their
respective free areas, according to the physical offset (or, for
partitions that are yet to be created, the order in which we will
allocate them).
Another potential upside of this is that the code can now rely more on
the partition order, too.
To ensure it keeps working, also add a test in the integration tests for
it.
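The ordering rule can be sketched in Python as follows. This is only an illustration under assumptions: the `Partition` fields (`offset`, `area_start`, `alloc_index`) are hypothetical names and do not mirror the actual data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Partition:
    label: str
    offset: Optional[int] = None      # physical offset of a pre-existing partition
    area_start: Optional[int] = None  # start of the free area a new partition was assigned to
    alloc_index: int = 0              # position within that free area's allocation order


def sort_key(p: Partition) -> tuple[int, int]:
    # Existing partitions sort by their own offset; new ones sort by the
    # start of their free area, then by allocation order within it.
    if p.offset is not None:
        return (p.offset, 0)
    return (p.area_start, p.alloc_index)


def reorder(partitions: list[Partition]) -> list[Partition]:
    """Return the partitions in the order they appear (or will appear) on disk."""
    return sorted(partitions, key=sort_key)
```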
Screenshot before:
<img width="2853" height="686" alt="Screenshot From 2026-06-05 00-58-07"
src="https://github.com/user-attachments/assets/7f24b527-7d79-49c4-916b-52faa892d4eb"
/>
Screenshot after:
<img width="2853" height="686" alt="Screenshot From 2026-06-05 00-58-16"
src="https://github.com/user-attachments/assets/4505ec5e-cab4-4ac1-95f0-b5af3991509e"
/>
Transfer files might come and go, components might be enabled and
disabled. Patterns might change. Let's keep track of what we install, so
that we can automatically gc everything no longer owned by any enabled
transfer.
Closes #42693. Specifiers are now expanded in symlink targets
(previously, they were only expanded in the source) - this is
technically a breaking change, but I'd be very surprised if anyone was
relying on this.
No other simplification is applied to the target (unlike the source,
which goes through `path_simplify_and_warn`).
Also a few minor changes:
- rename local `path` variable to `source` to match documentation
convention
- document that `MakeSymlinks=` accepts specifiers
- fix error message to print `MakeSymlinks=` option instead of
`Subvolumes=`
This commit uses the abstractions added in the previous commit to
add a bunch more properties to io.systemd.StartTransient(),
to showcase how straightforward this is now.
New helpers for tristate bools and an init helper are added. A
dedicated dispatcher for LogLevelMax parses the string-form name
("info", "debug" etc.) declared in the varlink IDL.
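To illustrate what a string-form dispatcher does, here is a minimal Python sketch that maps level names to the standard syslog numbers. The function name is hypothetical, and the exact set of names the IDL accepts is an assumption; only "info" and "debug" are confirmed above.

```python
from typing import Optional

# Standard syslog severity names and their numeric values.
LOG_LEVELS = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}


def dispatch_log_level_max(value: Optional[str]) -> Optional[int]:
    """Map a string-form level to its number; None means "not set"."""
    if value is None:
        return None
    try:
        return LOG_LEVELS[value]
    except KeyError:
        raise ValueError(f"invalid log level: {value!r}") from None
```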
The new properties are: DynamicUser, IgnoreSIGPIPE, LockPersonality,
MemoryDenyWriteExecute, NoNewPrivileges, OOMScoreAdjust, RemoveIPC,
RestrictRealtime, RestrictSUIDSGID, RootEphemeral, UMask.
The remaining ProtectKernel*, Private*, ProtectClock properties are
declared as STRING in the varlink IDL (matching the modern *Ex/enum
form) so a bool dispatcher does not pass schema validation. Those
need a string-parsing dispatcher and will be added in a follow-up.
This brings us closer to parity with the D-Bus code (still a long
way to go though).
This commit adds support to /etc/hostname for substitution of $ from
wordlists located in {/etc,/run,/usr/lib}/systemd/. Each $ is resolved to
a number (1,2,3...) and the corresponding file "1" is opened to acquire
the word. With that we can do a petname [1] style hostname in systemd,
e.g. below a possible expansion for a hostname template:
$-$-$-???? -> wildly-happy-octopus-92a9
The substitution of words is stable (based on machine-id) but if the
wordlist changes the hostname would change. We could pick it once and
cache it but Lennart did not like this so this version instead always
picks it (based on offset of the file so the operation is cheap). To
persist it one can use the `firstboot.hostname` credential. On a live
system this will be expanded and then written in the expanded form to
/etc/hostname.
This also includes a wordlist from the "petname" project that can be
optionally installed.
Thanks to Dustin Kirkland for this wonderful project.
[1] https://github.com/dustinkirkland/petname
---
I'm a bit unsure if this should include the word lists (I think it's nice
to have them though) and, if so, whether they should be their own commit.
When placing new partitions and there's space left because the new partitions
aren't occupying the whole free space, context_grow_partitions_on_free_area()
is supposed to distribute the free space between the new partitions. If still
no partition wants the free space, the free space ends up becoming padding.
Currently that padding is allocated to the partition preceding the FreeArea
(i.e. a->after). This obviously means that the new partitions now end up at
the *end* of the free area rather than at the beginning, which is somewhat
unexpected given how partition placement usually is done.
Fix it by finding the last partition that belongs to the free area, and then
allocating the padding to that partition, so that the new partitions end up
getting aligned with the beginning of the free area, not the end.
Because the span might not be rounded to grain if there's a pre-allocated
a->after partition before the free area, we need to round it down ourselves
(otherwise the "left >= p->new_padding" assertion in context_place_partitions()
is going to fail).
Also ensure the fix works as expected by adding a test.
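The placement rule can be sketched in Python as below. This is an illustration under assumptions, not the actual code: the function names, the 4096-byte default grain, and the premise that the area start is already grain-aligned are all hypothetical.

```python
def round_down(value: int, grain: int) -> int:
    """Round value down to a multiple of grain."""
    return value - (value % grain)


def place_in_free_area(start: int, span: int, sizes: list[int], grain: int = 4096):
    """Return (offset, size, padding) for each new partition in a free area.

    Leftover space becomes padding on the *last* new partition, so the
    partitions stay packed at the beginning of the area.
    """
    usable = round_down(span, grain)   # the span may not be grain-aligned
    left = usable - sum(sizes)
    if left < 0:
        raise ValueError("partitions do not fit into the free area")
    placed, offset = [], start
    for i, size in enumerate(sizes):
        padding = left if i == len(sizes) - 1 else 0
        placed.append((offset, size, padding))
        offset += size + padding
    return placed
```

With the padding on the last partition, the first new partition always starts at `start`, which is the behaviour the fix aims for.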
Now that we support the `$` we want to also make this available
inside the system.hostname and firstboot.hostname credentials and
the firstboot --hostname option. This commit adds it (and also `?`).
This commit adds support to /etc/hostname for substitution
of $ wordlists from {/etc,/run,/usr/lib}/systemd/hostname-wordlist.
The first $ will lookup hostname-wordlist/1, the next
hostname-wordlist/2 and so on.
With that we can do a petname [1] style hostname in systemd, e.g.
below a possible expansion for a hostname template:
$-$-$-???? -> wildly-happy-octopus-92a9
The substitution of words is stable (based on machine-id) but
not persisted, it is picked on every boot via a stable file
offset so the operation is cheap. But this means that if the
wordlist changes the hostname would change. The next commit
will add the pattern to the firstboot.hostname credential, which
is persisted with the resolved names to avoid this issue.
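A minimal Python sketch of a stable template expansion follows. It is a stand-in under stated assumptions: it picks words by hashing the machine ID with SHA-256 rather than via a file offset as the commit does, it takes wordlists as in-memory lists instead of reading hostname-wordlist/1, /2, ..., and the function names are hypothetical.

```python
import hashlib


def pick(machine_id: str, salt: str, n: int) -> int:
    """Derive a stable number in range(n) from the machine ID and a salt."""
    digest = hashlib.sha256(f"{machine_id}:{salt}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


def expand(template: str, machine_id: str, wordlists: dict[int, list[str]]) -> str:
    """Replace each $ with a word and each ? with a hex digit, deterministically."""
    out, dollar = [], 0
    for pos, ch in enumerate(template):
        if ch == "$":
            dollar += 1                      # first $ uses list 1, the next list 2, ...
            words = wordlists[dollar]
            out.append(words[pick(machine_id, f"word{dollar}", len(words))])
        elif ch == "?":
            out.append("0123456789abcdef"[pick(machine_id, f"hex{pos}", 16)])
        else:
            out.append(ch)
    return "".join(out)
```

The same machine ID and wordlists always give the same hostname, but editing a wordlist can change the result, which is the trade-off described above.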
This also includes a wordlist from the "petname" project
that can be optionally installed.
Thanks to Dustin Kirkland for this wonderful project.
[1] https://github.com/dustinkirkland/petname
We already support exponential backoff for automatic restarts via
RestartSec=/RestartSteps=/RestartMaxDelaySec=, but there is no way to
randomize the restart delay. When many instances of a service fail at
the same time (e.g. because a shared resource briefly went away) they
are all restarted in lockstep, creating a thundering herd problem.
So this commit adds a simple `RestartRandomizedDelaySec=` service
option which is similar to the timer `RandomizedDelaySec=` and
adds a randomized restart delay.
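The effect of the jitter can be sketched in Python as follows. How the random component is combined with the base delay is an assumption here (uniform in [0, RestartRandomizedDelaySec=], added on top, by analogy with the timer setting); the function name is hypothetical.

```python
import random
from typing import Optional


def restart_delay(base_sec: float, randomized_delay_sec: float,
                  rng: Optional[random.Random] = None) -> float:
    """Base restart delay plus a uniformly distributed random extra delay."""
    rng = rng or random.Random()
    return base_sec + rng.uniform(0.0, randomized_delay_sec)
```

With many instances failing at once, each draws a different extra delay, so the restarts spread out over the window instead of happening in lockstep.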
test_wildcard() was never executed: run_testcases() only picks up
functions named testcase_*, so this test never ran. This commit
makes it run and fixes two issues in the test:
1. /etc/hostname is absent in the test image so we need to guard
for that.
2. The pattern check was written as [[ "$P" == "$H" ]] with both
sides quoted, but the right-hand side needs to be unquoted to be
treated as a pattern, as otherwise the comparison will always be false.
The networkd metrics interface already reports a lot of interesting
metrics. With this commit it also reports network addresses.
Each ready address is emitted as one record per (interface, address)
pair:
- object: ifname
- value: address in CIDR notation
- fields: { family: "ipv4"|"ipv6", scope: "global"|"link"|"host"|... }
Loopback addresses are not reported, as they are just noise.
Example output:
```
root@localhost:~# varlinkctl --more --json=short call /run/systemd/report/io.systemd.Network io.systemd.Metrics.List '{}'
{"name":"io.systemd.Network.Address","object":"enp0s1","value":"fe80::5054:ff:fe12:3456/64","fields":{"family":"ipv6","scope":"link"}}
{"name":"io.systemd.Network.Address","object":"enp0s1","value":"fec0::5054:ff:fe12:3456/64","fields":{"family":"ipv6","scope":"site"}}
{"name":"io.systemd.Network.Address","object":"enp0s1","value":"10.0.2.15/24","fields":{"family":"ipv4","scope":"global"}}
```
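The record shape can be sketched in Python as below. Note the assumption: the scope is guessed here from the address class via the `ipaddress` module, whereas the real value presumably comes from the kernel's address scope, so this is only an approximation. The function name is hypothetical.

```python
import ipaddress
from typing import Optional


def address_metric(ifname: str, cidr: str) -> Optional[dict]:
    """Build one metric record for an (interface, address) pair."""
    iface = ipaddress.ip_interface(cidr)
    ip = iface.ip
    if ip.is_loopback:
        return None  # loopback addresses are not reported
    if ip.is_link_local:
        scope = "link"
    elif ip.version == 6 and ip.is_site_local:
        scope = "site"
    else:
        scope = "global"
    return {
        "name": "io.systemd.Network.Address",
        "object": ifname,
        "value": iface.with_prefixlen,
        "fields": {"family": f"ipv{ip.version}", "scope": scope},
    }
```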
This commit adds the boot timeline (MANAGER_TIMESTAMP_KERNEL/USERSPACE/FINISH) as
metrics. The kernel CLOCK_MONOTONIC value is 0 by definition, so only its
.Realtime is reported. For userspace and finish, both .Realtime and
.Monotonic are reported. The naming follows D-Bus.
Up to now we recommended using TARGET.upholds/ symlinks to start units
when an extension is loaded. However, this has some drawbacks. First,
for services that should not be started again and again, we have
to resort to hiding them behind a target that gets upheld and then
uses regular .wants/ for the actual service. Second, we actually leak
services on extension unload even if the unit has disappeared with the
extension. Third, to affect an already running service through a drop-in
or a config change from a confext/sysext, we need a way to
restart/reload it instead of just starting it.
Similar to EXTENSION_RELOAD_MANAGER=1, add an EXTENSION_RESTART_UNITS=
and an EXTENSION_RELOAD_OR_RESTART_UNITS= setting to the
extension-release metadata file, carrying a whitespace-separated list
of units to restart/reload on merge/refresh/unmerge after the daemon
reload. Also detect when the unit has vanished which is normally the
case when the unit was part of the unmerged extension, and stop it
explicitly to prevent it leaking. When the extension itself ships the
binary it should use EXTENSION_RESTART_UNITS= to make sure the new
binary is picked up. Since starting through this setting does not work
when the extension is mounted from the initrd, extensions should still
ship at least a .wants/ symlink to start at boot but can also continue
to ship a .upholds/ symlink for backwards compatibility without any
drawback and still benefit from the unit stopping triggered by the new
setting. While there are cases where one could want to set
EXTENSION_RESTART_UNITS= without requiring a daemon reload (e.g., an
env var file change instead of a unit drop-in), we now do an implicit
daemon reload whenever we have to restart units. This way we know we
work on the right state, and users don't have to remember to set the
reload setting in addition to avoid running into this issue.
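As an illustration of the metadata format, the following Python sketch extracts the unit list from extension-release text. It assumes the file uses the usual os-release style (KEY=VALUE lines with optional shell-style quoting); the function name is hypothetical.

```python
import shlex


def units_from_release(text: str, key: str = "EXTENSION_RESTART_UNITS") -> list[str]:
    """Extract the whitespace-separated unit list for `key`, or [] if unset."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        for token in shlex.split(line):       # strips shell-style quotes
            name, sep, value = token.partition("=")
            if sep and name == key:
                return value.split()
    return []
```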
Move the --reboot/--component= rejection into parse_argv() alongside the
other cross-option checks, and tighten TEST-72 to assert the specific
guard message rather than merely a non-zero exit.
The `pending` and `reboot` verbs, as well as the `--reboot` switch, compare
the newest installed version against the booted OS version (IMAGE_VERSION= from
os-release). When a component is selected via --component=, this compares the
component's version against the unrelated host OS version, although the two by
design live in separate version spaces. The result is arbitrary reboot decisions: depending
on the relative version strings sysupdate would either always or never reboot.
Refuse the combination with a clear error instead of silently performing a
bogus comparison. Correctly tracking a per-component booted version is left as a
future feature.
Fixes: https://github.com/systemd/systemd/issues/42330
This adds the following:
1. systemd-report gains a new --sign= option, taking a boolean. If true,
"systemd-report generate" and "systemd-report upload" produce a
signed report instead of a regular one. The signatures are collected
from Varlink-based backends.
2. One such backend is added, which implements a simple Ed25519-based
signing scheme.
3. A new metrics source is added which just reports text files
symlinked into a special dir as metrics. This is used to report the
Ed25519 public key as a metric, by default, if it exists.
4. Finally, systemd-report itself is turned into a Varlink service. This
is useful for example for extracting a report from a system coming in
via the varlink/http bridge.
I thought a long time about the format for signing reports. Initially
I intended to do this like homed's user record signing, i.e. require
normalization of the record, then normalize the record, write it out
in dense form, and sign the result. Finally insert the resulting hash into
the user record itself. People have pointed me to the inherent messiness
of signing JSON this way though, as it requires any participant that
wishes to sign/authenticate records this way to implement the exact same
normalization/formatting rules, and in particular in the area of
floating point numbers (of which metrics presumably will have many) this
is quite problematic.
This signing hence goes a different way. Instead of expecting
signer+verifier to independently come to the same normalized text form
of the JSON data, let's instead output a JSON-SEQ sequence, where the
first object is the report, and any subsequent objects are one signature
each. The signatures are supposed to cover the precise binary
representation of the first element in the JSON-SEQ stream (i.e. from
the RS to the NL).
Or in other words: a verifier would receive the JSON-SEQ stream, split
it up before each RS. Then it would leave object 1 unparsed for the
moment, and parse objects 2…n. It would then authenticate object 1's
precise binary representation with objects 2…n. Once that checks out, it
would parse object 1, and use it as report.
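The verifier flow can be sketched in Python as below. Several things are assumptions made for illustration: the signed bytes are taken to include both the RS and the NL, every signature object must check out, and an HMAC-SHA256 over a hypothetical "mac" field stands in for the real signature scheme, which is not in the Python standard library.

```python
import hashlib
import hmac
import json

RS, NL = b"\x1e", b"\n"


def split_records(stream: bytes) -> list[bytes]:
    """Split a JSON-SEQ stream before each RS; each record keeps its RS..NL bytes."""
    return [RS + chunk for chunk in stream.split(RS)[1:]]


def verify_report(stream: bytes, check) -> dict:
    """Authenticate record 1 byte-for-byte with records 2..n, then parse it.

    `check(payload, signature_obj) -> bool` is supplied by the caller.
    """
    records = split_records(stream)
    if len(records) < 2:
        raise ValueError("need a report and at least one signature")
    payload, signatures = records[0], records[1:]
    for raw in signatures:
        sig = json.loads(raw[1:])            # drop the leading RS before parsing
        if not check(payload, sig):
            raise ValueError("signature mismatch")
    return json.loads(payload[1:])           # only now parse the report itself


def hmac_check(key: bytes):
    """Stand-in for a real signature check, for demonstration only."""
    def check(payload: bytes, sig: dict) -> bool:
        want = hmac.new(key, payload, hashlib.sha256).hexdigest()
        return hmac.compare_digest(want, sig.get("mac", ""))
    return check
```

Because the check runs over the raw bytes, a re-serialization that merely changes number formatting (say, `1.50` to `1.5`) fails verification, without either side having to agree on a normalization.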
Acquiring a LUO session from /dev/liveupdate requires privileges,
and also the device is a single-owner driver so only a single
process can open it at any given time.
Add a LUOSession= service setting that allows units running
without privileges to get a session assigned to them.
The kernel imposes a 64-character limit on session names, which is
too short to avoid clashes, so derive a hash from joining the
unit name with the parameter name. That way two units using
the same setting don't clash.
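A minimal Python sketch of such a derivation follows. The hash function (SHA-256), the NUL separator, and the truncation to 63 characters are all assumptions for illustration; the commit does not state which are used, nor whether the 64-character limit includes a terminator.

```python
import hashlib

MAX_SESSION_NAME = 64  # kernel limit mentioned above


def session_name(unit: str, parameter: str) -> str:
    """Derive a fixed-length session name from unit name and parameter."""
    joined = (unit + "\x00" + parameter).encode()   # separator avoids ambiguous joins
    digest = hashlib.sha256(joined).hexdigest()
    return digest[:MAX_SESSION_NAME - 1]            # stay strictly below the limit
```

The result has the same length no matter how long the unit name is, and two units passing the same parameter name still get different session names.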
663f0bf5cb stopped reusing the original block device fd whenever
partition scanning was requested (LO_FLAGS_PARTSCAN) but couldn't be
enabled on the device, so that nested partition tables on devices the
kernel won't scan (e.g. the pmOS/android case) get exposed via a real
loop device.
However that also forced a pointless loop device for any partition that
carries a file system directly, e.g. a btrfs subvolume mounted via
MountImages=. For multi-device btrfs this is fatal: the kernel rejects
seeing the same member via both the original partition and the loop
device, and the mount fails.
A loop device is only ever needed here to expose a nested partition
table. So only refuse the shortcut when the device actually carries one,
probed via gpt_probe(), instead of whenever partition scanning is
disabled. Devices carrying a file system directly (or nothing) take the
shortcut as before.
Add an integration test to cover the failure scenario of the original
issue.
Fixes: https://github.com/systemd/systemd/issues/42520
Replaces: https://github.com/systemd/systemd/pull/42576
Follow-up for 663f0bf5cb
Co-Authored-By: Luca Boccassi <luca.boccassi@gmail.com>
Co-developed-by: Claude Opus 4.8 <noreply@anthropic.com>
unhexmem_full() ignores whitespace, so a 64-byte manifest digest field
can decode to fewer than 32 bytes. Reject that while parsing instead.
[ 83.883087] TEST-72-SYSUPDATE.sh[5995]: systemd-sysupdate: ../src/src/sysupdate/sysupdate-resource.c:581: resource_load_from_web: Assertion `h.iov_len == sizeof(instance->metadata.sha256sum)' failed.
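The same pitfall can be shown in Python, whose `bytes.fromhex()` also skips whitespace. This is only an analogy to illustrate the length check; the function name is hypothetical and this is not the sysupdate code.

```python
SHA256_LEN = 32


def parse_sha256_field(field: str) -> bytes:
    """Decode a hex digest and insist on exactly 32 bytes."""
    raw = bytes.fromhex(field)   # like unhexmem_full(), this skips whitespace
    if len(raw) != SHA256_LEN:
        raise ValueError(f"expected {SHA256_LEN} bytes, got {len(raw)}")
    return raw
```

A field that is 64 characters long but contains two spaces decodes to only 31 bytes, so checking the field length before decoding is not enough.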
Follow-up for 43cc7a3ef4
Assisted-by: kres (claude-opus-4-7)
Co-developed-by: Claude Opus 4.8 <noreply@anthropic.com>
The flags parameter is parsed into a global variable, which means
when there are multiple consecutive calls it is reused. Switch to
a local copy.
Follow-up for 066f6bfb62
Assisted-by: kres (claude-opus-4-7)
Co-developed-by: Claude Opus 4.8 <noreply@anthropic.com>
The have_updatectl variable is meant to gracefully handle the case where
updatectl is missing. But, because the script runs with -e, it fails
immediately in that case instead. Moreover, expanding $have_updatectl
when it is present actually executes updatectl, rather than simply
checking for its existence.
Refactor this check so that it does handle a missing updatectl.
When libdevmapper is built without UDEV_SYNC_SUPPORT (e.g. on Alpine/postmarketOS),
it creates a device node under /dev/mapper/ instead of relying on udev to create a symlink.