kexec-tools has a --reuse-cmdline option which is very convenient
when doing a lot of reboots; add the same to systemctl.
Dedup options, letting the last one win in case of duplicates,
so that 'systemctl kexec --reuse-cmdline' can be chained many times
without continuously expanding the cmdline with duplicates from
the boot entry.
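The "last one wins" deduplication can be sketched as follows. This is an illustration only, not the systemctl implementation: it assumes that two options are duplicates when the text before the first '=' matches, that surviving options keep their relative order, and that no value contains quoted whitespace. The function name dedup_cmdline() is hypothetical.

```python
def dedup_cmdline(cmdline: str) -> str:
    """Keep only the last occurrence of each kernel option.

    Assumption: the option key is everything before the first '=' and
    values never contain whitespace (no quoting support in this sketch).
    """
    tokens = cmdline.split()
    last_index = {}
    for i, tok in enumerate(tokens):
        last_index[tok.split("=", 1)[0]] = i
    kept = [
        tok
        for i, tok in enumerate(tokens)
        if last_index[tok.split("=", 1)[0]] == i
    ]
    return " ".join(kept)


if __name__ == "__main__":
    boot_entry = "root=/dev/vda2 quiet loglevel=3"
    reused = "root=/dev/vda2 quiet loglevel=3 loglevel=7"
    # Chaining boot entry options and the reused cmdline does not grow the result.
    print(dedup_cmdline(boot_entry + " " + reused))
```

Options that legitimately repeat (for example several console= entries) would need an exception list in a real implementation; this sketch does not handle that case.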
This is meant to mirror sudo's -k/--reset-timestamp and
-K/--remove-timestamp options, which revoke the temporary authorization
provided by the timestamp files in /var/run/sudo/ts.
To achieve the same effect in run0, we ask polkit to revoke our
temporary authorization. If used with a command, run0 will revoke the
temporary auth and then immediately authorize the user again, just like
sudo -k. All the bus calls are completed synchronously, as they need to
complete before authorizing the user anyway.
Like sudo, the effect of -k/--reset-timestamp is to revoke only the
tmpauthz that polkit would have used to authorize the command, if
available. The -K/--remove-timestamp option will revoke all temporary
authorizations across all ttys.
machinectl bind-volume MACHINE PROVIDER:VOLUME[:CONFIG][:K=V,...]
machinectl unbind-volume MACHINE PROVIDER:VOLUME
For bind-volume, machinectl parses the SPEC with the shared
bind_volume_parse(), acquires the storage volume from the named
provider on the machinectl side, locates the target machine's
io.systemd.MachineInstance control socket via
machine_get_control_address(), pushes the fd across, and calls
io.systemd.MachineInstance.AddStorage with name='<provider>:<volume>'
and the user-supplied config string.
For unbind-volume, machinectl just forwards the name string to
io.systemd.MachineInstance.RemoveStorage.
Volumes attached at machine startup (e.g. via systemd-vmspawn's
--bind-volume=) are rejected with StorageImmutable when the user
attempts to unbind them at runtime.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
systemd-vmspawn --bind-volume=PROVIDER:VOLUME[:CONFIG][:K=V,...]
For each --bind-volume passed at startup, vmspawn calls Acquire() on
the named StorageProvider and attaches the resulting fd to the VM as
an additional drive. The drive is identified by the user-visible name
'<provider>:<volume>' on the bridge — that is also the handle used
later when machinectl unbind-volume detaches drives at runtime
(though boot-time drives like these are NOT removable; that is the
StorageImmutable behaviour added earlier).
The colon grammar is parsed by the shared bind_volume_parse() helper.
The 3rd 'config' field selects the guest device type from the
disk_type_table[] vocabulary (virtio-blk, virtio-scsi, nvme, scsi-cd);
empty defaults to virtio-blk per the TASK grammar.
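A rough sketch of how the PROVIDER:VOLUME[:CONFIG][:K=V,...] grammar could be split is shown below. It is not the bind_volume_parse() helper: the class and function names are hypothetical, the option keys in the usage example are made up, and treating the fourth field as a comma-separated K=V list is an assumption read off the synopsis.

```python
from dataclasses import dataclass, field

# Disk type vocabulary as listed in the commit message.
DISK_TYPES = ("virtio-blk", "virtio-scsi", "nvme", "scsi-cd")


@dataclass
class BindVolumeSketch:
    provider: str
    volume: str
    config: str = "virtio-blk"
    options: dict = field(default_factory=dict)

    @property
    def name(self) -> str:
        # User-visible handle, '<provider>:<volume>'.
        return f"{self.provider}:{self.volume}"


def parse_bind_volume_spec(spec: str) -> BindVolumeSketch:
    parts = spec.split(":", 3)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected PROVIDER:VOLUME, got {spec!r}")
    # An empty or missing third field falls back to virtio-blk.
    config = parts[2] if len(parts) > 2 and parts[2] else "virtio-blk"
    if config not in DISK_TYPES:
        raise ValueError(f"unknown disk type {config!r}")
    options = {}
    if len(parts) > 3 and parts[3]:
        for item in parts[3].split(","):
            key, sep, value = item.partition("=")
            if not sep or not key:
                raise ValueError(f"expected K=V, got {item!r}")
            options[key] = value
    return BindVolumeSketch(parts[0], parts[1], config, options)


if __name__ == "__main__":
    # 'fs', 'data' and 'ro' are placeholder names for illustration.
    print(parse_bind_volume_spec("fs:data::ro=yes"))
```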
Wiring lives next to the existing --extra-drive setup: parse_argv()
appends a parsed BindVolume to arg_bind_volumes, and prepare_device_info()
hands the array to vmspawn_bind_volume_prepare_boot() which acquires
each volume and pushes a DriveInfo onto the existing drives array.
PCIe port assignment (assign_pcie_ports()) and the QMP setup loop pick
them up automatically.
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
Among other things this changes tracking of the location of resources
during GC to use the BootEntrySource enum rather than a path, since
we have that and it is more efficient and easier to grok.
CLI for inspecting and using storage providers. Scans
/run/systemd/io.systemd.StorageProvider/ (or the user-mode equivalent)
for AF_UNIX sockets and talks to each one over Varlink. Verbs:
"volumes" lists volumes across all providers, "templates" lists
supported creation templates, "providers" lists the endpoints
themselves.
Also installed as a mount.storage helper, so
'mount -t storage PROVIDER:VOLUME /mnt' (or 'mount -t storage.<fstype>'
to put a fresh filesystem on a block volume) acquires the volume and
mounts it. Ships with bash/zsh completions and a man page.
Add --forward-journal=FILE|DIR to forward the container's journal
entries to the host via systemd-journal-remote. When specified,
nspawn starts systemd-journal-remote listening on a Unix socket,
bind-mounts it into the container at /run/host/journal/socket, and
passes a journal.forward_to_socket credential pointing to it.
Add --forward-journal-max-use=, --forward-journal-keep-free=,
--forward-journal-max-file-size=, and --forward-journal-max-files=
to configure disk usage limits for the forwarded journal.
Consolidate nspawn's per-machine on-disk state under a single runtime
directory at /run/systemd/nspawn/<machine>/. The container rootdir
mount point moves from /tmp/nspawn-root-XXXXXX to <runtime_dir>/root,
the unix-export directory moves from
/run/systemd/nspawn/unix-export/<machine> to <runtime_dir>/unix-export,
and the journal-remote socket lives at
<runtime_dir>/journal-remote-socket. Update ssh-generator and
ssh-proxy to follow the new unix-export path layout.
Extract fork_journal_remote() into fork-notify.{c,h} as a shared
helper used by both nspawn and vmspawn, replacing vmspawn's
start_systemd_journal_remote().
Extract runtime_directory_make() into path-lookup.{c,h} as a shared
helper used by both nspawn and vmspawn, replacing vmspawn's inline
runtime directory creation logic.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Add options to vmspawn to configure journal-remote disk usage limits
when forwarding journal entries from the VM. These are passed through
as --max-use=, --keep-free=, --max-file-size=, and --max-files=
command-line arguments to systemd-journal-remote.
Add --max-use=, --keep-free=, --max-file-size=, and --max-files=
command-line options to systemd-journal-remote to allow overriding the
corresponding settings from the configuration file.
Add $SYSTEMD_JOURNAL_REMOTE_CONFIG_FILE environment variable support
to systemd-journal-remote. When set, the specified file is used
instead of the default configuration file and drop-in directories.
When set to the empty string or /dev/null, configuration file parsing
is skipped entirely. vmspawn sets this to /dev/null in the child
process to avoid inheriting the host's journal-remote configuration.
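The three cases for $SYSTEMD_JOURNAL_REMOTE_CONFIG_FILE described above can be summarized in a small decision function. This is a sketch of the described behaviour, not the code in systemd-journal-remote; the function name and the returned labels are hypothetical.

```python
ENV_NAME = "SYSTEMD_JOURNAL_REMOTE_CONFIG_FILE"


def config_source(environ):
    """Return (mode, path) for the given environment mapping.

    mode is 'default' (regular config file plus drop-ins), 'skip'
    (no configuration parsing at all) or 'file' (only the named file).
    """
    value = environ.get(ENV_NAME)
    if value is None:
        return ("default", None)
    if value in ("", "/dev/null"):
        return ("skip", None)
    return ("file", value)


if __name__ == "__main__":
    # What the vmspawn child would see after setting the variable to /dev/null.
    print(config_source({ENV_NAME: "/dev/null"}))
```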
Make fork_notify() argv parameter optional. When NULL is passed,
fork_notify() returns 0 in the child (with $NOTIFY_SOCKET set) and
lets the caller run custom code before exec. Returns 1 in the parent.
This allows vmspawn to set environment variables in the child without
polluting the parent process.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Allows appending kernel command line arguments, like
kexec-tools does. This is especially needed for the integration
tests, as mkosi adds a bunch of options that are needed for the
test suite to work, and it breaks without them.
Add a new --restrict-address-families= command line option and
corresponding RestrictAddressFamilies= setting for .nspawn files to
restrict which socket address families may be used inside a container.
Many address families such as AF_VSOCK and AF_NETLINK are not
network-namespaced, so restricting access to them in containers
improves isolation. The option supports allowlist and denylist modes
(via ~ prefix), as well as "none" to block all families, matching the
semantics of RestrictAddressFamilies= in unit files.
The address family parsing logic is extracted into a shared
parse_address_families() helper in parse-helpers.c, which is now also
used by config_parse_address_families() in load-fragment.c.
This is currently opt-in. In a future version, the default will be
changed to restrict address families to AF_INET, AF_INET6 and AF_UNIX.
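The allowlist/denylist semantics can be illustrated with a short sketch. It is not the parse_address_families() helper from parse-helpers.c: the function names are hypothetical, and the whitespace-separated list syntax is borrowed from RestrictAddressFamilies= in unit files and assumed to apply here as well.

```python
def parse_family_policy(value: str):
    """Return (is_allowlist, families) for a RestrictAddressFamilies-style value.

    'none' is an allowlist with no entries, i.e. every family is blocked.
    A leading '~' turns the list into a denylist.
    """
    value = value.strip()
    if value == "none":
        return True, frozenset()
    is_allowlist = True
    if value.startswith("~"):
        is_allowlist = False
        value = value[1:]
    families = frozenset(value.split())
    for family in families:
        if not family.startswith("AF_"):
            raise ValueError(f"not an address family name: {family!r}")
    return is_allowlist, families


def is_permitted(family: str, policy) -> bool:
    is_allowlist, families = policy
    # Allowlist: permitted only if listed. Denylist: permitted unless listed.
    return (family in families) == is_allowlist


if __name__ == "__main__":
    policy = parse_family_policy("AF_INET AF_INET6 AF_UNIX")
    print(is_permitted("AF_VSOCK", policy))
```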
Drop the --cxl= option and unconditionally enable cxl=on on the QEMU
machine type whenever the host architecture supports it (x86_64 and
aarch64). The flag was only added for testing parity with mkosi's CXL=
setting and there is no reason to leave it as an opt-in toggle: with no
pxb-cxl device or cxl-fmw window attached, enabling it on the machine
only reserves a small MMIO region and emits an empty CEDT, so the cost
is negligible while removing one knob users would otherwise have to
flip explicitly to exercise the CXL code paths in QEMU.
Rename nspawn's --user=NAME option to --uid=NAME for selecting the
container user. The -u short option is preserved. --user=NAME and
--user NAME are still accepted but emit a deprecation warning. A
pre-parsing step stitches the space-separated --user NAME form into
--user=NAME before getopt sees it, preserving backwards compatibility
despite --user now being an optional_argument.
Repurpose --user (without argument) and --system as standalone
switches for selecting the runtime scope (user vs system service
manager).
Replace all uses of the arg_privileged boolean with
arg_runtime_scope comparisons throughout nspawn. The default scope
is auto-detected from the effective UID.
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Move register_machine() and unregister_machine() from
vmspawn-register.{c,h} into shared machine-register.{c,h} so both
nspawn and vmspawn can use the same implementation.
The unified register_machine() uses varlink first (for richer
features like SSH support and unit allocation) with a D-Bus
RegisterMachineWithNetwork fallback for older machined. The
interface adds a class parameter ("vm" or "container") and
local_ifindex for nspawn's network interface support.
The unified unregister_machine() similarly tries varlink first
(io.systemd.Machine.Unregister) before falling back to D-Bus.
Both register_machine() and unregister_machine() only log at debug
level internally, leaving error/notice logging to callers.
Add register_machine_with_fallback() which tries system and/or user
scope registration based on a RuntimeScope parameter
(_RUNTIME_SCOPE_INVALID for both), and
unregister_machine_with_fallback() as its counterpart. Both use
RET_GATHER() to collect errors from each scope.
Make --register= a tristate (yes/no/auto) defaulting to auto. When
set to auto, registration failures are logged at notice level and
ignored. When set to yes, failures are fatal.
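The caller-side handling of the --register= tristate might look roughly like the sketch below. All names are hypothetical, OSError stands in for whatever error the registration call reports, and Python's INFO level stands in for the notice level, which the logging module does not have.

```python
import logging

log = logging.getLogger("register-sketch")


def handle_registration(mode: str, register) -> bool:
    """Run register() according to a yes/no/auto tristate.

    Returns True if the machine was registered. With 'auto', failures are
    logged and ignored; with 'yes', they propagate to the caller.
    """
    if mode not in ("yes", "no", "auto"):
        raise ValueError(f"invalid tristate {mode!r}")
    if mode == "no":
        return False
    try:
        register()
    except OSError as e:
        if mode == "yes":
            raise  # fatal for the caller
        log.info("Failed to register machine, ignoring: %s", e)
        return False
    return True


def always_fails():
    # Placeholder for a registration attempt against an unavailable machined.
    raise OSError("machined not reachable")


if __name__ == "__main__":
    print(handle_registration("auto", always_fails))
```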
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Add --cxl=BOOL option to enable CXL (Compute Express Link) support in
the virtual machine. CXL is a high-speed interconnect standard that
allows CPUs to access memory attached to devices such as accelerators
and memory expanders, enabling flexible memory pooling and expansion
beyond what is physically installed on the motherboard. When enabled,
adds cxl=on to the QEMU machine configuration. Only supported on x86_64
and aarch64 architectures.
This is added for testing purposes and for feature parity with mkosi's
CXL= setting.
Extend --ram= to accept an optional maximum size for memory hotplug,
using the syntax --ram=SIZE[:MAXSIZE] (e.g. --ram=2G:8G). When a
maximum is specified, the maxmem key is added to the QEMU memory
configuration section to enable memory hotplug up to the given limit.
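Parsing of the SIZE[:MAXSIZE] syntax could be sketched like this. The helper names are hypothetical, the suffixes are assumed to be base-1024 K/M/G/T, and rejecting a maximum below the initial size is an assumption rather than something the commit message states.

```python
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> int:
    """Parse e.g. '512M' or '2G' into bytes (base 1024, assumed)."""
    text = text.strip()
    if not text:
        raise ValueError("empty size")
    suffix = text[-1].upper() if text[-1].isalpha() else ""
    digits = text[:-1] if suffix else text
    if suffix not in _UNITS or not digits.isdigit():
        raise ValueError(f"invalid size {text!r}")
    return int(digits) * _UNITS[suffix]


def parse_ram(value: str):
    """Return (size, max_size); max_size is None when no hotplug limit is given."""
    size_text, sep, max_text = value.partition(":")
    size = parse_size(size_text)
    max_size = parse_size(max_text) if sep else None
    if max_size is not None and max_size < size:
        # Assumption: a maximum below the initial size is an error.
        raise ValueError("maximum size is smaller than initial size")
    return size, max_size


if __name__ == "__main__":
    size, max_size = parse_ram("2G:8G")
    # Only a hotplug-enabled configuration would carry a maxmem value.
    print(size, max_size)
```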
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
The varlink spec supports protocol upgrades and they are very
useful, e.g. to transfer binary data directly via varlink. So
far varlinkctl/sd-varlink did not support this. This commit
adds support for it in varlinkctl by using the new code in
sd-varlink and the generalized socket-forward code.
It's useful to be able to check what firmware description vmspawn
will select. In particular, this will allow me to figure out the
nvram template file that will be picked up so I can pick it up in
mkosi and operate on it to pass a modified version of it to vmspawn
with --efi-nvram-template=.
Add --efi-nvram-template=PATH to specify a custom firmware variables
file to copy and use as the initial EFI NVRAM state instead of the
default template from the firmware definition.
Add --firmware-features=FEATURE[,FEATURE...] to require or exclude
specific firmware features during automatic firmware discovery.
Features prefixed with "!" are excluded. If a feature appears in both
the included and excluded lists, inclusion takes priority. Firmware
with the "enrolled-keys" feature is excluded by default.
Refactor --secure-boot= to operate on the firmware features sets
instead of maintaining a separate tristate. --secure-boot=yes adds
"secure-boot" to the include set, --secure-boot=no adds it to the
exclude set, and --secure-boot=auto removes it from both.
Generalize find_ovmf_config() to accept include/exclude feature sets
instead of a secure boot tristate, removing the special-cased
enrolled-keys and secure-boot filtering logic.
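The include/exclude feature sets and their interaction with --secure-boot= can be sketched as below. This is an illustration of the rules described in this message, not the find_ovmf_config() code; all function names are hypothetical and "amd-sev" in the usage example is just a sample feature name.

```python
DEFAULT_EXCLUDE = frozenset({"enrolled-keys"})


def parse_firmware_features(value, include=frozenset(), exclude=DEFAULT_EXCLUDE):
    """Merge FEATURE[,FEATURE...] into include/exclude sets; '!' means exclude."""
    include, exclude = set(include), set(exclude)
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("!"):
            exclude.add(item[1:])
        else:
            include.add(item)
    return include, exclude


def apply_secure_boot(mode, include, exclude):
    """Map --secure-boot=yes/no/auto onto the feature sets."""
    include, exclude = set(include), set(exclude)
    if mode == "yes":
        include.add("secure-boot")
    elif mode == "no":
        exclude.add("secure-boot")
    elif mode == "auto":
        include.discard("secure-boot")
        exclude.discard("secure-boot")
    else:
        raise ValueError(f"invalid mode {mode!r}")
    return include, exclude


def firmware_matches(features, include, exclude) -> bool:
    features = set(features)
    # Inclusion takes priority over exclusion for features listed in both.
    effective_exclude = set(exclude) - set(include)
    return set(include) <= features and not (effective_exclude & features)


if __name__ == "__main__":
    inc, exc = parse_firmware_features("secure-boot,!amd-sev")
    print(firmware_matches({"secure-boot"}, inc, exc))
```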
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
vmspawn previously hardcoded virtio-blk for all drives. This adds
--image-disk-type= to select the root disk type (virtio-blk,
virtio-scsi, or nvme) and allows per-drive overrides via a
colon-separated prefix on --extra-drive=. The format and disk type
prefixes can appear in any order since their value sets don't overlap.
For virtio-scsi, a single shared controller is created with drives
attached as scsi-hd devices. For nvme, each drive gets its own
controller. Both have serial number length limits (30 and 20 characters
respectively), so long filenames are replaced with a truncated SHA-256
hex digest.
Extend --image-disk-type= and the --extra-drive= disk type prefix to
support nvme in addition to virtio-blk and virtio-scsi:
systemd-vmspawn --image-disk-type=nvme --image=image.raw
systemd-vmspawn --image=image.raw --extra-drive=nvme:data.raw
The NVMe serial number is limited to 20 characters by the NVMe spec.
If the image filename exceeds this, it is hashed with SHA-256 and
truncated to 20 hex characters via the disk_serial() helper introduced
in the previous commit.
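The serial number fallback can be sketched as follows. This mirrors the behaviour described above but is not the disk_serial() helper itself; whether the real helper hashes the bare filename or a full path, and whether it measures length in bytes or characters, are assumptions here.

```python
import hashlib

NVME_SERIAL_MAX = 20
SCSI_SERIAL_MAX = 30


def disk_serial_sketch(filename: str, max_len: int) -> str:
    """Return filename if it fits, else a truncated SHA-256 hex digest."""
    if len(filename) <= max_len:
        return filename
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return digest[:max_len]


if __name__ == "__main__":
    print(disk_serial_sketch("image.raw", NVME_SERIAL_MAX))
    print(disk_serial_sketch("a-very-long-image-file-name.raw", NVME_SERIAL_MAX))
```

Truncating the digest keeps the serial stable across runs for the same filename, at the cost of a small collision probability that grows as the limit shrinks.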
Signed-off-by: Christian Brauner <brauner@kernel.org>
Add --image-disk-type= to select the disk type for the root disk, and
allow specifying the disk type as a colon-separated prefix on
--extra-drive=:
systemd-vmspawn --image-disk-type=virtio-scsi --image=image.raw
systemd-vmspawn --image=image.raw --extra-drive=virtio-scsi:data.raw
For --extra-drive=, the format and disk type prefixes can appear in any
order since the value sets don't overlap:
--extra-drive=raw:virtio-scsi:/path
--extra-drive=virtio-scsi:raw:/path
Extra drives inherit --image-disk-type= by default unless overridden
with an explicit prefix.
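Because the format and disk type vocabularies are disjoint, the prefixes can be peeled off one at a time in whatever order they appear. The sketch below shows the idea; it is not vmspawn's parser, the function name is hypothetical, and the set of image formats ("raw", "qcow2") is an assumption.

```python
DISK_TYPES = {"virtio-blk", "virtio-scsi", "nvme"}
IMAGE_FORMATS = {"raw", "qcow2"}  # assumed vocabulary for this sketch


def parse_extra_drive(value: str, default_disk_type: str = "virtio-blk") -> dict:
    """Split optional 'FORMAT:' and 'DISKTYPE:' prefixes from a drive path."""
    fmt = None
    disk_type = None
    rest = value
    while ":" in rest:
        head, tail = rest.split(":", 1)
        if head in IMAGE_FORMATS and fmt is None:
            fmt = head
        elif head in DISK_TYPES and disk_type is None:
            disk_type = head
        else:
            break  # not a known prefix, so the remainder is the path
        rest = tail
    if not rest:
        raise ValueError("missing drive path")
    return {
        "path": rest,
        "format": fmt,
        # Extra drives inherit the --image-disk-type= value unless overridden.
        "disk_type": disk_type or default_disk_type,
    }


if __name__ == "__main__":
    print(parse_extra_drive("raw:virtio-scsi:/path"))
    print(parse_extra_drive("virtio-scsi:raw:/path"))
```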
vmspawn originally used virtio-scsi for all drives but switched to
virtio-blk in 1f24a954e4 for simplicity and direct kernel boot
compatibility. This makes virtio-scsi available again as an explicit
option for cases where a SCSI storage topology is desired.
For virtio-scsi, a shared virtio-scsi-pci controller is created and
drives are attached as scsi-hd devices. The SCSI serial number is
limited to 30 characters, so filenames exceeding this are hashed with
SHA-256.
Signed-off-by: Christian Brauner <brauner@kernel.org>
These two completers are written in a stacked _arguments style, and some
generic options are valid before or after the verb. If the toplevel
_arguments is permitted to match options after the verb, it will halt
completion prematurely, so stop toplevel matching after the verb.
This corrects the following error:
$ userdbctl --output=class user <TAB> # completes users
$ userdbctl user --output=class <TAB> # completes nothing
Closes #40883. As described in the issue, it's not "jobs" that are
marked, and also the name is unnecessarily long.
I think we don't need any compatibility measures here. At least in the
rpm world, package upgrade scripts go through the helper which is part
of the package so the new systemctl and the new helper are upgraded
together.
When the extensions for the final system are already set up from the
initrd we should avoid disrupting the boot process with the remount
(which currently isn't atomic) and the daemon reload for
systemd-confext and systemd-sysext. Similarly, when sysupdate ran and
updated extensions it's best to avoid the remount and daemon reload if
no changes are found.
To do this, encode the current extension state in more detail than
before, when only the names of the extensions were encoded in the
overlay mount. This can also be used to provide more details about the
extension origin in "systemd-sysext status (--json=)". During the
refresh add a check whether the old state matches the new state and in
this case skip the refresh unless the user provides a flag to always
refresh. Besides the extension name and the resolved path the best
method for identification is the verity hash but that is not available
for plain image files or directories. Therefore, also include data to
check for file/directory replacements. The creation/modification times
are not always real on reproducible images or extracted archive content.
The file handle together with the unique mount ID is the next best
identifier we can use when we have no verity hash. Fall back to an inode
when we get no handle. With the creation/modification time and the path
this should be good enough. Using a unique mount ID is important (with
a fallback to the regular non-unique mount ID) instead of st_dev because
st_dev gets reused too easily, e.g., by a loop device mount and the
mount ID helps to catch this. For the mount ID to be valid it has to be
resolved before we enter the new mount namespace. Thus, it gets provided
by the image dissect logic and handed over to the sysext subprocess
which runs in a new mount namespace.
Luckily, we can rule out online modification of directories or image
files because this is anyway not well supported with overlay mounts, so
we don't do a file checksum nor do we recurse into a directory to look
for the most recently touched files. But, as said, with the
always-refresh flag one can force a reload.
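The state comparison described above could be modelled as in the following sketch. The record layout, field names and function names are hypothetical and do not reflect the actual encoding stored with the overlay mount; comparing the extension lists in order is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExtensionState:
    name: str
    path: str
    verity_hash: Optional[str] = None
    mount_id: Optional[int] = None      # unique mount ID, resolved before unsharing
    file_handle: Optional[bytes] = None
    inode: Optional[int] = None
    btime: Optional[int] = None
    mtime: Optional[int] = None

    def identity(self):
        # Prefer the verity hash; otherwise use file handle (or inode as a
        # fallback) together with the mount ID and timestamps.
        if self.verity_hash is not None:
            return ("verity", self.name, self.path, self.verity_hash)
        if self.file_handle is not None:
            ident = ("handle", self.file_handle)
        else:
            ident = ("inode", self.inode)
        return ("file", self.name, self.path, self.mount_id, ident,
                self.btime, self.mtime)


def needs_refresh(old, new, always_refresh: bool = False) -> bool:
    """Skip the refresh only if the old and new state match exactly."""
    if always_refresh:
        return True
    return [s.identity() for s in old] != [s.identity() for s in new]


if __name__ == "__main__":
    a = ExtensionState("foo", "/run/extensions/foo.raw", verity_hash="abc123")
    print(needs_refresh([a], [a]))
```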