701 Commits

Author SHA1 Message Date
Simran Singh
4de3f59774 man: EnvironmentFile= honors %h, not $HOME 2026-05-05 17:31:16 +02:00
Daan De Meyer
158f2d50bf docs: Update MEMORY_PRESSURE.md => PRESSURE.md
Make the doc more generic and mention all pressure types, not just
memory.
2026-04-09 22:47:10 +02:00
Daan De Meyer
594659da06 core: Add I/O pressure support 2026-04-09 22:47:10 +02:00
Daan De Meyer
316d17fcbd core: Add support for CPU pressure notifications
Works the same way as memory pressure notifications. Code is refactored
to work on enum arrays to reduce duplication.
2026-04-09 22:47:10 +02:00
Luca Boccassi
c751714d8c man: document that with RuntimeDirectoryPreserve= dirs are under /run/private/
This is not immediately obvious so document it explicitly.

Follow-up for 40cd2ecc26
2026-03-16 20:26:33 +01:00
Lennart Poettering
3df88e836c man: document explicitly that ProtectHome= has no effect on non-standard homedir locations
Fixes: #41045
2026-03-12 17:57:21 +00:00
Zbigniew Jędrzejewski-Szmek
0a17bb5c34 core: simplify requirements in unit_get_private_var_tmp() to just After=
As in the previous commit, checking for both requirements and ordering seems
unnecessary. In practical cases, the mount will be pulled in by the rest
of the transaction, so ordering is the part that matters. (The setup is
racy without the ordering.)  If we drop the second check, the admin can
just use After=tmp.mount to achieve the desired behaviour, without needing
to explicitly pull in the unit. This is easier to configure and more robust.

This changes the implementation introduced in
6156bec7a4.

Also actually describe the implemented behaviour in the man page.
2026-02-25 12:38:11 +01:00
Zbigniew Jędrzejewski-Szmek
fa33eef344 core: upgrade /tmp when PrivateTmp=yes/DefaultDeps=no to disconnected
In https://github.com/systemd/systemd/issues/28515, multiple people report that
services that have PrivateTmp=yes and DefaultDependencies=no fail to create the
temporary directories under /tmp, when /tmp is e.g. a bind mount or some other
kind of mount that takes more time.

Before PrivateTmp=disconnected was added, we didn't have a nice solution:
DefaultDependencies=no is used to start services very early, so we wouldn't
want to add a dependency on /tmp automatically. With PrivateTmp=disconnected we
have a fairly nice solution. Let's "upgrade" to this mode automatically.
Strictly speaking, it is a small compat break, but in practice it's unlikely to
matter for early-boot services whether their /tmp is private or disconnected.

The dependency on /tmp that is checked is After. I think this is enough,
since any tmp.mount would be pulled in by local-fs.target and the rest of
the transaction anyway, so we don't need to check more than After.

The asserts are relaxed, because the two settings can now diverge
in either way.

Resolves https://github.com/systemd/systemd/issues/28515.

[yhndnzj: fix unit_add_exec_dependencies() to handle the new
          combination; add a comment in exec_needs_sys_admin()]
2026-02-25 12:38:11 +01:00
Yu Watanabe
5329f4bf76 man: fix typo
Follow-up for 6b22ac31af.
2026-02-20 01:18:59 +09:00
Lennart Poettering
eb581ff6d9 man: document everything we just added 2026-02-19 15:08:20 +01:00
Lennart Poettering
fe487d3670 namespace: extend bind mount ignore field to permission issues
A later commit will add transient allocation of user namespaces with
dynamic UID range assignment. That creates certain permission issues.
Let's hence allow them to be handled gracefully in case the 'ignore'
field is set for a mount.
2026-02-19 15:07:19 +01:00
Lennart Poettering
6b22ac31af core: add PrivateUsers=managed 2026-02-19 15:05:15 +01:00
Mike Yuan
24f458da81 man: document RefreshOnReload= 2026-02-10 21:54:13 +01:00
Luca Boccassi
3e2a5dc2e1 dissect: support mount options when going through mountfsd, requiring privileges via polkit (#39394)
RootImageOptions=/ExtensionImages=/MountImages= all support custom
mount options, but mountfsd does not support it. Add varlink
parameters to allow callers to specify mount options so that
those directives can work as expected. Require additional privs via
polkit.
2026-01-16 14:55:02 +00:00
Yu Watanabe
5366dbdbd4 core: fix typo
Follow-up for 32614b9aab.
2026-01-08 12:20:22 +09:00
Luca Boccassi
3a8759e5d4 dissect: support mount options when going through mountfsd
RootImageOptions=/ExtensionImages=/MountImages= all support
custom mount options, use the new mountfsd parameters to
configure them if they are specified.

This requires additional privileges via polkit due to security
implications of mount options, so document an example policy
that allows to use the nosuid mount option.
2026-01-07 00:47:53 +01:00
Yu Watanabe
17ea504efb core: change mount options settings so that last defined wins (#39449) 2026-01-07 04:11:29 +09:00
Luca Boccassi
9de41f677c core: change mount options settings so that last defined wins
Currently mount options are handled in such a way that the first
definition for a given partition wins, and documented as such.
Change them so that they behave like other options, and the
last specified wins.
Applies to RootImageOptions=, MountImages= and ExtensionImages=.
Switch from a linked list to an array indexed by the partition
specifier to store them.
2026-01-06 17:59:10 +01:00
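
The last-defined-wins behaviour described in 9de41f677c can be illustrated with a minimal sketch. This is not the systemd implementation (which is C); the partition specifier names and the `parse_mount_options` helper are hypothetical and only show the idea of an array indexed by partition specifier where later assignments overwrite earlier ones.

```python
# Hypothetical partition specifiers, chosen for illustration only.
PARTITIONS = ("root", "usr", "home")


def parse_mount_options(assignments):
    """Return per-partition mount options where the last definition wins.

    `assignments` is an iterable of (partition, options) pairs in the
    order they were specified.
    """
    # One slot per partition specifier, instead of a linked list of entries.
    slots = [None] * len(PARTITIONS)
    for partition, options in assignments:
        # A later assignment for the same partition overwrites the earlier one.
        slots[PARTITIONS.index(partition)] = options
    return {p: o for p, o in zip(PARTITIONS, slots) if o is not None}


if __name__ == "__main__":
    print(parse_mount_options([("root", "ro"), ("usr", "nosuid"), ("root", "ro,nodev")]))
```

With the earlier first-wins handling, the second `root` entry in the usage line would have been ignored; in this sketch it replaces the first.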
Usama Arif
32614b9aab core: introduce MemoryTHP= unit file setting
Transparent Hugepages (THP) is a Linux kernel feature that manages
memory using larger pages (2MB on x86, compared to the default 4KB).
The main goal is to improve memory management efficiency and system
performance, especially for memory-intensive applications.
However, it can cause drawbacks in some scenarios, such as memory
regression and latency spikes. THP policy is governed for the entire
system via /sys/kernel/mm/transparent_hugepage/enabled.
However, it can be overridden for individual workloads via prctl(2)
call.
MemoryTHP= is used to disable THPs at exec-invoke to stop
providing THPs for workloads where the drawbacks outweigh the advantages.
When set to "disable", MemoryTHP= disables THPs completely for the
process, irrespective of global THP controls.
When set to "madvise", MemoryTHP= disables THPs for the process except
when specifically madvised by the process with MADV_HUGEPAGE or MADV_COLLAPSE.
2026-01-06 03:26:14 -08:00
Matt Fleming
4dcbfbb1ad process-util: Add support for SCHED_EXT scheduling policy
Allow CPUSchedulingPolicy to be set to "ext". SCHED_EXT is a new
scheduling policy in Linux v6.12 that allows processes to be scheduled
using custom BPF schedulers instead of the default in-kernel ones.

Selectively setting the SCHED_EXT policy is useful for systems running
in "partial mode" where not all processes are run using a custom
scheduler.

Fall back to SCHED_OTHER and print an error message for systems where
SCHED_EXT isn't available.
2025-12-20 18:31:55 +01:00
Andrew Halaney
0927356f8e man/systemd.exec: Make EnvironmentFile error conditions more explicit
It is not entirely clear what happens when EnvironmentFile fails in the
prior wording. With the new wording it should now be clear that if it
fails to process the file the service will fail, and if it is prefixed
with "-" all errors are silently ignored.

Signed-off-by: Andrew Halaney <ahalaney@netflix.com>
2025-12-17 11:56:52 +01:00
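
The error handling that 0927356f8e documents can be sketched roughly as follows. This is an illustration under stated assumptions, not systemd's parser: `read_environment_file` is a hypothetical helper, and its line parsing is deliberately simplified (no quoting or escaping). The only point is the effect of the "-" prefix.

```python
def read_environment_file(spec):
    """Read KEY=VALUE lines from the file named by `spec`.

    If `spec` starts with "-", any failure to read the file is ignored and
    an empty mapping is returned; otherwise the error propagates, which
    corresponds to the service failing.
    """
    optional = spec.startswith("-")
    path = spec[1:] if optional else spec
    try:
        with open(path, encoding="utf-8") as handle:
            lines = handle.read().splitlines()
    except OSError:
        if optional:
            return {}
        raise

    env = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments (simplified handling).
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            env[key.strip()] = value.strip()
    return env
```

Calling `read_environment_file("-/no/such/file")` returns an empty mapping, while the same path without the prefix raises an error.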
Lennart Poettering
fc3adbbbcb man: always prefix links to uapi specs with their UAPI.XY spec number
Let's try to establish the spec numbers, by mentioning them in most doc
links.

Follow-up for: https://github.com/uapi-group/specifications/pull/187
2025-11-23 18:09:11 +01:00
Christoph Anton Mitterer
07f4718242 man: clarify what “failed” means
systemd.service(5)’s documentation of `ExecCondition=` uses “failed” with
respect to the unit active state.
In particular the unit won’t be considered failed when `ExecCondition=`’s
command exits with a status of 1 through 254 (inclusive). It will however, when
it exits with 255 or abnormally (e.g. timeout, killed by a signal, etc.).

The table “Defined $SERVICE_RESULT values” in systemd.exec(5) uses “failed”
however rather with respect to the condition.

Tests seem to have shown that, if the exit status of the `ExecCondition=`
command is one of 1 through 254 (inclusive), `$SERVICE_RESULT` will be
`exec-condition`, if it is 255, `$SERVICE_RESULT` will be `exit-code` (but
`$EXIT_CODE` and `$EXIT_STATUS` will be empty or unset), if it’s killed because
of `SIGKILL`, `$SERVICE_RESULT` will be `signal` and if it times out,
`$SERVICE_RESULT` will be `timeout`.

This commit clarifies the table at least for the case of an exit status of 1
through 254 (inclusive).
The others (signal, timeout and 255) are probably also still ambiguous (e.g.
`signal` uses “A service process”, which could be considered as the actual
service process only).

Signed-off-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name>
2025-11-06 10:47:06 +01:00
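
The observations in 07f4718242 can be summarised as a small lookup. This is only a restatement of what the commit message reports from testing, not code taken from systemd; `exec_condition_result` and its parameters are hypothetical names.

```python
def exec_condition_result(exit_status=None, sigkilled=False, timed_out=False):
    """Return the $SERVICE_RESULT value reported for an ExecCondition= command.

    Returns None for cases the commit message does not describe
    (for example an exit status of 0).
    """
    if timed_out:
        return "timeout"
    if sigkilled:
        return "signal"
    if exit_status is not None and 1 <= exit_status <= 254:
        # Condition not met; the unit is not considered failed.
        return "exec-condition"
    if exit_status == 255:
        # The unit is considered failed.
        return "exit-code"
    return None
```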
Quentin Deslandes
79dd24cf14 core: Add UserNamespacePath=
This allows a service to reuse the user namespace created for an
existing service, similarly to NetworkNamespacePath=. The configuration
of the initial user namespace (e.g. ID mapping) is preserved.
2025-11-04 10:55:04 +01:00
Luca Boccassi
e84aa21af8 man: RootImageOptions= is only supported for system services right now
Support via mountfsd is being worked on but will take more time,
fix the documentation to be correct in the meanwhile

Follow-up for fad01f798d
2025-10-22 17:22:03 +01:00
Daniel Foster
c7a444a9c1 tree-wide: extend $LISTEN_FDS protocol with $LISTEN_PIDFDID
Although extremely unlikely, there is a race present in solely checking the
$LISTEN_PID environment variable, due to PID recycling. Fix that by introducing
$LISTEN_PIDFDID, which contains the 64-bit ID of a pidfd for the child process
that is not subject to recycling.
2025-10-22 09:34:14 +02:00
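
A receiving process could combine the two checks from c7a444a9c1 along these lines. This is a minimal sketch, not the sd_listen_fds() implementation: `listen_fds_are_for_us` is a hypothetical helper, and the caller is assumed to supply its own pidfd ID (or None if it cannot be determined on the running kernel).

```python
def listen_fds_are_for_us(environ, pid, pidfd_id=None):
    """Decide whether passed file descriptors are intended for this process.

    `environ` is a mapping such as os.environ, `pid` is our PID, and
    `pidfd_id` is the 64-bit ID of a pidfd referring to us, if known.
    """
    try:
        if int(environ.get("LISTEN_PID", "")) != pid:
            return False
    except ValueError:
        return False

    expected = environ.get("LISTEN_PIDFDID")
    if expected is None or pidfd_id is None:
        # Variable not set, or ID unavailable: only the PID check is possible.
        return True
    try:
        # The pidfd ID is not subject to recycling, unlike the PID.
        return int(expected) == pidfd_id
    except ValueError:
        return False
```

Falling back to the plain `$LISTEN_PID` comparison when the new variable is absent is a design choice of this sketch, made so that it keeps working with a service manager that does not set `$LISTEN_PIDFDID`.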
Luca Boccassi
fad01f798d dissect: add support for verity-protected bare filesystems via mountfsd
Needed to implement support for RootHashSignature=/RootVerity=/RootHash=
and friends when going through mountfsd, for example with user units,
so that system and user units provide the same features at the same
level
2025-10-16 16:22:33 +01:00
Luca Boccassi
68b476a298 core: also enable PrivateUsers= for user services when using images via mountfsd
RootDirectory= and other options already implicitly enable PrivateUsers=
since 6ef721cbc7 if they are set in user
units, so that they can work out of the box.
Now with mountfsd support we can do the same for the images settings,
so enable them and document them.
2025-10-16 12:58:59 +01:00
Lennart Poettering
4be269563d core: if we cannot decode a TPM credential skip over it for ImportCredential=
Let's skip over credentials we cannot decode when they are found with
ImportCredential=. When installing an OS on some disk and using that
disk on a different machine than assumed we'll otherwise end up with a
broken boot, because the credentials cannot be decoded when starting
systemd-firstboot. Let's handle this somewhat gracefully.

This leaves handling for LoadCredential=/SetCredential= as it is (i.e.
failure to decrypt results in service failure), because it is a lot more
explicit and focussed as opposed to ImportCredential= which looks
everywhere, uses globs and so on and is hence very vague and unfocussed.

Fixes: #34740
2025-09-18 22:11:57 +02:00
Yu Watanabe
369f311686 man: fix typo
Follow-up for 7aefb194e7.
2025-07-11 14:11:04 +09:00
Matteo Croce
7aefb194e7 man/systemd.exec: explain how BPF token works
Add a small paragraph explaining how the BPF token works, how it is
created and its relationship with the BPF filesystem.
Move all the relevant documentation to the PrivateBPF= section and
point all the BPFDelegate* options to it.
2025-07-10 21:40:07 +02:00
Yu Watanabe
f436c64e61 man: fix typo
Follow-up for 7baf403430.
2025-07-10 14:02:00 +09:00
Yu Watanabe
1cf5b39d64 core: add 'DefaultRestrictSUIDSGID' config option (#38126)
closes #37602, see there for extra motivation and considered
alternatives.

On typical systems, only a few services need to create SUID/SGID files.
This often is limited to the user explicitly setting suid/sgid, the
`systemd-tmpfiles*` services, and the package manager. Allowing a
default to globally restrict creation of suid/sgid files makes it easier
to apply this restriction precisely.

## testing done
- built on aarch64-linux and x86_64-linux
- ran a VM test on x86_64-linux, checking for:
    - VM system boots successfully
    - defaults apply (both `yes`, `no`, and undefined)
    - systemd tmpfiles can set suid/sgid on journal log path
- Other services explicitly defining `RestrictSUIDSGID=no` can create
suid files
2025-07-10 13:30:07 +09:00
Matteo Croce
7baf403430 man/systemd.exec: update documentation for PrivateBPF=
Add a short description about what PrivateBPF=yes does
and how it can be useful.
2025-07-10 01:57:14 +02:00
Grimmauld
0316fb8219 core: document 'DefaultRestrictSUIDSGID' 2025-07-09 21:45:46 +02:00
Matteo Croce
ea9826eb94 core: add options to delegate BPFFS token creation
Add four new options BPFDelegate{Commands,Maps,Programs,Attachments}=
in order to delegate to a BPFFS instance the permission to create tokens.

The value is a list of options taken from:
https://github.com/torvalds/linux/blob/v6.14/include/uapi/linux/bpf.h#L922-L1121
The special value "any" means to allow every possible value.

More information about BPF tokens is available here:
https://lwn.net/Articles/947173/
2025-07-08 22:35:29 +02:00
Matteo Croce
3a47437fc9 core: Introduce PrivateBPF= to mount a private BPFFS
Add a new option PrivateBPF= to mount a new instance of bpffs within a
namespace.
PrivateBPF= can be set to "no" to use the host bpffs in readonly mode
and "yes" to do a new mount.
The mount is done with the new fsopen()/fsmount() API because in future
we'll hook some commands between the two calls.
2025-07-08 22:33:28 +02:00
Andres Beltran
26c6f3271a core: add quota support for State, Cache, and Log exec directories 2025-07-07 17:28:47 +00:00
Lennart Poettering
2be3a06bb2 core: when PrivateDevices= is enabled and we need to decrypt TPM2 credentials, go via IPC
Also, if a device ACL list is defined, also go via IPC (instead of
trying to patch it, as before).

The outcome is that the tighter rules continue to apply when configured.

Fixes: #35959
2025-06-24 22:16:01 +02:00
Anton Ryzhov
bd02e15710 man/systemd-creds: fix documentation typo in systemd.exec.xml 2025-06-03 07:42:44 +09:00
Zbigniew Jędrzejewski-Szmek
b082968d19 man: better tags, more links, minor grammar and formatting improvements
Closes https://github.com/systemd/systemd/issues/35751.
2025-05-28 15:35:53 +02:00
Luca Boccassi
6946eed3fa core: Also refresh confext extensions when reloading notify-reload service (#33995)
`ExtensionImages=` and `ExtensionDirectories=` now let you specify
vpick-named extensions; however, since they just get set up once when
the service is started, you can't see newer versions without restarting
the service entirely. Here, also reload confext extensions when you
reload a service. This allows you to deploy a new version of some
configuration and have it picked up at reload time without interruption
to your workload.

Right now, we would only reload confext extensions and leave the sysext
ones behind, since it didn't seem prudent to swap out what is likely
program code at reload. This is made possible by only going for the
`SYSTEMD_CONFEXT_HIERARCHIES` overlays (which only contains `/etc`).

This PR:
- Adjusts `service.c` to also refresh extensions when needed. 
- Adds integration tests to check that a confext reload actually
occurred.
- Adds to the `systemd.exec` man pages to document this behavior.

This is a follow up to #24864 and #31364. Thank you to @bluca and
@goenkam for help in getting this up.
2025-05-20 11:27:34 +01:00
maia x.
67ecc2c7fe man: document confext reload behavior for ExtensionDirectories/Images 2025-05-19 13:36:21 +01:00
Lennart Poettering
bfb1f9e2c9 core: pass the socket cookie to invoked per-connection service instances as $SO_COOKIE env var
The socket cookie is just too useful for identifying connections, let's
emphasize this a bit and pass it as environment variable.
2025-05-15 09:45:32 +02:00
Lennart Poettering
3bdcd994cd man: correct version information when $REMOTE_ADDR/$REMOTE_PORT were added
This was in commit 3b1c524154, i.e. in the
v220 cycle.
2025-05-15 09:45:19 +02:00
Yu Watanabe
8ac5b047fc man/systemd.exec: update documents for PrivateTmp= 2025-05-11 03:33:02 +09:00
Zbigniew Jędrzejewski-Szmek
2dc4e87849 man/systemd.exec: reword description of RestrictAddressFamilies=
The text is reordered and broken into more paragraphs.
A recommendation to combine RestrictAddressFamilies= with
SystemCallFilter=@service is added.
2025-05-06 21:14:03 +02:00
Zbigniew Jędrzejewski-Szmek
802d23fcfb man/systemd.exec: reword description of SystemCallFilter=
The existing text grew organically as features were added and was
not very organized. Reorder it and break into paragraphs grouped
by topic. The description of the :errno syntax is replaced by a short
reference to the SystemCallErrorNumber= setting. This makes the
text shorter and makes it easier to explain how the two settings combine.
2025-05-06 21:14:03 +02:00
Yu Watanabe
4db8663b81 tree-wide: fix typo 2025-04-27 10:36:12 +09:00
Daan De Meyer
ba77798bba unit: Make sure individual unit maximum log level always takes priority
Currently LogLevelMax= can only be used to decrease the maximum log level
for a unit but not to increase it. Let's make sure the latter works as
well, so LogLevelMax=debug can be used to enable debug logging for specific
units without enabling debug logging globally.
2025-04-23 14:46:12 +02:00