The data and field hash table chains have the same problem the previous
commit fixed for entry array chains. New data and field objects are
linked at the tail of their hash bucket by patching the previous tail
object's next_hash_offset in place, so after a crash a persisted
predecessor (or the bucket head) can point at an object whose body never
reached disk.
journal_file_find_data_object_with_hash() and
journal_file_find_field_object_with_hash() walk those chains while
resolving matches, and on -EADDRNOTAVAIL/-EBADMSG from
journal_file_move_to_object() they simply return the error directly.
That propagates up to real_journal_next(), which discards the whole file
from the query.
Give those two lookups the same tolerance: on a read-only file, treat an
unreadable chain node as the end of the bucket chain.
generic_array_get() which is used for the unfiltered iteration path in
the previous commit treats a chain pointer that resolves past the end of
the file as the end of the chain. In that case, moving to the missing
array object returns -EADDRNOTAVAIL (or -EBADMSG), and it either stops
(going downwards) or steps back to the previous array (going upwards).
However, generic_array_bisect(), which is used for filtered or seeking
reads does not. On -EADDRNOTAVAIL/-EBADMSG from
journal_file_move_to_object(), it instead returns the error directly to
the caller, which propagates out through
sd_journal_next()/sd_journal_previous() and aborts the query.
The per-data entry array chain has the same issue as the global one,
since n_entries and entry_array_offset are (re)written in place as
entries are linked, and thus after a crash they can reference more
arrays than actually reached the disk. That is to say in practical
terms, a journal recovered for reading by the previous commit could
nevertheless still drop matching entries from `journalctl FIELD=value`,
and a seqnum or time seek into the lost region could fail outright.
Let's give generic_array_bisect() the same tolerance generic_array_get()
already has. That is, when moving to an entry array object fails, treat
the chain as ending at the previous array. This means that the result
matches what generic_array_get() would yield for the same file.
In Meta production we have been considering using journald more widely
for some time. One of the blockers to doing that which I have noticed is
that often journald seems to have vastly less data after lockups/power
failures compared to plain files, which is not great when debugging
outages.
On small write rates this tends to be hard to reproduce, but when
writing thousands of messages a second, an unclean shutdown can result
in the end result being an active journal file with a header that
records an arena larger than the data that actually reached disk. What
happens is then that journalctl then discards the entire file(!),
completely ignoring that there is a huge amount of data which is
actually perfectly readable.
The reason for that is that the journal header is updated on every
append, while the file size and newly written arena contents are only
made durable on the filesystem's own schedule. After a crash, the header
can therefore describe writes which were logically completed by journald
but whose backing data or file metadata never reached disk.
Take the following example of how this can happen at high log rates:
1. journald appends objects into an mmap()ed arena, periodically growing
the file with fallocate() in FILE_SIZE_INCREASE (8M) steps and
advancing the header's arena_size tail pointers as it goes along.
2. The header is dirtied on every append, and its arena_size is advanced
at each fallocate(). It is, from the kernel's perspective, an
ordinary data page and is only made durable by the kernel's periodic
page cache writeback on its own schedule. The file's length, by
contrast, is metadata, made durable only when the filesystem commits
a transaction (or on an fsync(), which journald does not issue
between sync intervals).
3. journald marks journals NOCOW, so the header's data block is
overwritten in place and is decoupled from the size metadata. Nothing
orders the two with respect to each other. Writeback therefore can
routinely persist a header whose arena_size has run ahead of the file
length recorded on disk.
4. Power is lost. On the next boot the persisted header reflects an
arena_size and tail pointers which have been advanced for appends.
However their payload and the file metadata were never committed, so
header_size + arena_size now points well past the end of the file as
it exists on disk.
5. journal_file_verify_header() then rejects this with -ENODATA:
if (... || header_size + arena_size > (uint64_t) f->last_stat.st_size)
return -ENODATA;
That is correct when opening for writing, because we must not append to
a file whose recorded state we cannot trust, and the caller must rotate
it away. But the same check also runs on read only opens, where it is
actively harmful. In the case of journalctl, the entire file is skipped,
even though the data hash table, the field hash table, and the head of
the array all are present and fully intact, and the great majority of
entries are physically present. In fact, only a very small part of the
most recently written tail is missing, but everything before is
readable. This results in mistakenly rejecting the entire file as
corrupt.
This happens extremely frequently on machines with high write rates
during power cuts or lockups. In testing writing ~7500 msg/s through
journald and then cutting power, I reproduced it in ten out of ten
attempts across different machines.
In each case, the header was left claiming ~296M of arena while only
~192-208M had reached disk. In this case, journalctl reports that it has
recovered 0 of ~335000 messages. Whether a given crash trips the
condition depends on where it falls relative to the header's writeback,
but when it does, the loss today is total. After this patch the vast
majority of messages can be retrieved.
Let's fix this by keeping the rejection for writing, but for read-only
opens, let's just clamp the arena to the real file size and skip the
consistency checks on the now unreliable tail pointers. The reader will
walk the entry array chain from its intact head and stop at the
truncation point by the bounds check that already exists, so there's no
need to do any more than that there.
You might also wonder, why not address this on the write side? That
would be astronomically expensive and require an fsync() after every
fallocate().
In terms of improvements, when reading from a file affected in the way
described above, previously journalctl recovers 0 entries, and now it
can recover all actually intact entries.
No longer exists since latest spec changes, as the binaries have moved
to another package so it doesn't get autogenerated anymore
Follow-up for db9d7265f5
The previous policy was primarily written from a standpoint
that AI models are not very good and we didn't wanna waste any
time reviewing PRs generated by AI. Now that AI models have become
actually good and their output is just as good as regular contributions,
let's stop requiring the disclosure as its pointless to still have it,
it doesn't really matter anymore whether a patch was written with or
without
AI. It's up to the author to make sure they're not wasting our time by
submitting unreviewed, untested code upstream, regardless of whether
that
code is written by an AI or not.
The new policy is inspired by https://github.com/lxc/incus/pull/3506,
with
various removals to be less adverse to the usage of AI.
2026-06-17T10:11:08.3789573Z Untracked files:
2026-06-17T10:11:08.3790064Z (use "git add <file>..." to include in what will be committed)
2026-06-17T10:11:08.3790566Z systemd-network.pre
2026-06-17T10:11:08.3790908Z systemd-resolve.pre
Follow-up for db9d7265f5
The previous policy was primarily written from a standpoint
that AI models are not very good and we didn't wanna waste any
time reviewing PRs generated by AI. Now that AI models have become
actually good and their output is just as good as regular contributions,
let's stop requiring the disclosure as its pointless to still have it,
it doesn't really matter anymore whether a patch was written with or without
AI. It's up to the author to make sure they're not wasting our time by
submitting unreviewed, untested code upstream, regardless of whether that
code is written by an AI or not.
The new policy is inspired by https://github.com/lxc/incus/pull/3506, with
various removals to be less adverse to the usage of AI.
No longer exists since latest spec changes, as the binaries have
moved to another package so it doesn't get autogenerated anymore
Follow-up for db9d7265f5
into portrait mode.
The IMU is not active in lizard mode. The command below disables
lizard mode and activates the IMU.
echo N | sudo tee /sys/module/hid_steam/parameters/lizard_mode
Closes#42586
On hppa '.equ' is overridden, so even this workaround ('.set' is
overridden on alpha) causes a build failure:
cc -Isrc/basic/libbasic.a.p -Isrc/basic -I../src/basic -Isrc/fundamental -I../src/fundamental -Isrc/systemd -I../src/systemd -Isrc/version -I../src/version -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wextra -std=gnu17 -O0 -g -Wno-missing-field-initializers -Wno-unused-parameter -Wno-nonnull-compare -Warray-bounds -Warray-bounds=2 -Wdate-time -Wendif-labels -Werror=bool-compare -Werror=discarded-qualifiers -Werror=flex-array-member-not-at-end -Werror=format=2 -Werror=format-signedness -Werror=implicit-function-declaration -Werror=implicit-int -Werror=incompatible-pointer-types -Werror=int-conversion -Werror=missing-declarations -Werror=missing-parameter-name -Werror=missing-prototypes -Werror=overflow -Werror=override-init -Werror=pointer-sign -Werror=return-type -Werror=sequence-point -Werror=shift-count-overflow -Werror=shift-overflow=2 -Werror=strict-flex-arrays -Werror=undef -Wfloat-equal -Wimplicit-fallthrough=5 -Winit-self -Wlogical-op -Wmissing-include-dirs -Wmissing-noreturn -Wnested-externs -Wold-style-definition -Wpointer-arith -Wredundant-decls -Wshadow -Wstrict-aliasing=2 -Wstrict-prototypes -Wsuggest-attribute=noreturn -Wunterminated-string-initialization -Wunused-function -Wwrite-strings -Wzero-as-null-pointer-constant -Wzero-length-bounds -fdiagnostics-show-option -fexcess-precision=standard -fno-common -fstack-protector -fstack-protector-strong -fstrict-flex-arrays=3 -fno-math-errno --param=ssp-buffer-size=4 -Wno-unused-result -Werror=shadow -fPIC -fno-strict-aliasing -fstrict-flex-arrays=1 -fvisibility=hidden -fno-omit-frame-pointer -include config.h -isystem../src/include/glibc -isystem../src/include/override -isystemsrc/include/override -isystem../src/include/uapi -fvisibility=default -MD -MQ src/basic/libbasic.a.p/compress.c.o -MF src/basic/libbasic.a.p/compress.c.o.d -o src/basic/libbasic.a.p/compress.c.o -c ../src/basic/compress.c
/tmp/ccxm7Waj.s: Assembler messages:
/tmp/ccxm7Waj.s:2085: Error: bad or irreducible absolute expression; zero assumed
/tmp/ccxm7Waj.s:2085: Error: junk at end of line, first unrecognized character is `,'
/tmp/ccxm7Waj.s:2268: Error: bad or irreducible absolute expression; zero assumed
/tmp/ccxm7Waj.s:2268: Error: junk at end of line, first unrecognized character is `,'
/tmp/ccxm7Waj.s:2544: Error: bad or irreducible absolute expression; zero assumed
/tmp/ccxm7Waj.s:2544: Error: junk at end of line, first unrecognized character is `,'
/tmp/ccxm7Waj.s:2800: Error: bad or irreducible absolute expression; zero assumed
/tmp/ccxm7Waj.s:2800: Error: junk at end of line, first unrecognized character is `,'
/tmp/ccxm7Waj.s:2956: Error: bad or irreducible absolute expression; zero assumed
/tmp/ccxm7Waj.s:2956: Error: junk at end of line, first unrecognized character is `,'
'.equiv' works on all architecures, but breaks on CentOS 9 due to binutils
2.35. Use an ifdef. Can be dropped and switch to '.equiv' once binutils 2.36
is the baseline.
Follow-up for 7590eb0c74
Co-developed-by: Claude Opus 4.8 <noreply@anthropic.com>
sysext: utimensat() failure was logged with stale r (which is 0 after
the preceding successful write_backing_file call). Pass errno instead
so the actual failure reason is recorded and returned.
sysusers: rename() failure in make_backup() returned the raw positive
errno value. All callers check 'if (r < 0)', so the error was silently
ignored, allowing execution to continue after a failed backup. Return
-errno instead.
Signed-off-by: dongshengyuan <dongshengyuan@uniontech.com>
Co-developed-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* 462bd9f5ea Update systemd to version 260.2 / rev 469 via SR 1356344
* 28967f9151 Update systemd to version 260.1 / rev 468 via SR 1353801
* 086bdf7ca5 Update systemd to version 260.1 / rev 467 via SR 1348897
* 8e7d3d3067 Update systemd to version 259.5 / rev 466 via SR 1338788
* 069ac9826b Update systemd to version 259.3 / rev 465 via SR 1336527
* 7ed02aefd6 Update systemd to version 258.5 / rev 464 via SR 1335466
* 811b7f2076 Update systemd to version 258.4 / rev 463 via SR 1332808
* 45a28d7f95 Update systemd to version 258.3 / rev 462 via SR 1329291
* 37342ddc36 Update systemd to version 257.9 / rev 461 via SR 1324470
* 7eafa80da7 Update systemd to version 258.3 / rev 460 via SR 1323386
* 29c9ee6b49 Update systemd to version 257.9 / rev 459 via SR 1321158
* 39613f8d2e Update systemd to version 258.2 / rev 458 via SR 1320482
* c235f1dcf5 Update systemd to version 257.9 / rev 457 via SR 1305565
The suse spec now uses:
%{_mandir}/man3/*.3%{?ext_man}
which defaults to .gz and fails, as we disable compression.
Redefine it to avoid a build failure:
Processing files: systemd-doc-260.2-00.noarch
error: File not found: /var/tmp/BUILD/systemd-260.2-build/BUILDROOT/usr/share/man/man3/*.3.gz
The condition on line 1206 checks other->load_state != UNIT_STUB to
decide whether to call the vtable done() callback, but the state was
already overwritten to UNIT_MERGED on line 1198, making the condition
always true.
Save the original load_state before overwriting it, so that units in
UNIT_STUB state (which never went through a load attempt) correctly
skip the done() call.
Signed-off-by: dongshengyuan <dongshengyuan@uniontech.com>
Co-developed-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The only runtime trigger for registry cleanup is the BPF kprobe that fires
on user namespace destruction; when it is missed (ring buffer overflow,
kprobe missing, fdstore entry dropped), the dead namespace's registry entry
survives and keeps its UID/GID ranges blocked until the manager restarts and
its startup sweep runs. The allocation hot path checked whether a candidate
range was already taken but never whether the namespace holding it was still
alive, so a single dead namespace could permanently starve an allocation.
This is most visible when a parent delegates its entire container UID window
to a child that then dies: every subsequent allocation from the parent fails
with NoDynamicRange even though the ranges are reclaimable.
Add userns_registry_reap_if_dead(), which probes a registered namespace's
liveness via the kernel namespace identifier recorded at allocation time and,
if it is authoritatively dead, releases its registry entry — restoring any
ranges it received via delegation to their ancestors. Call it from the
allocation availability check for both transient registrations and delegated
ranges, walking a chain of dead ancestors in the delegation case. This
mirrors the existing inode-slot stale cleanup and makes allocation
self-healing without waiting for a restart.
The startup sweep grew the same load-probe-release logic, so route it through
the new helper too; its errno return distinguishes alive, no-recorded-id, and
unprobeable-environment cases so the sweep keeps its early-out when lookup by
id isn't possible at all.
Co-developed-by: Claude Opus 4.8 <noreply@anthropic.com>
We might want to add more state to the LUO session json payload,
so add a version (to allow clean compat breaks if needed) and nest
the current fdstore contents under a 'units' object, so that more
top-level data can be added in the future without breaking
backward compatibility.
Follow-up for 257c35c1a3
fstab-generator: pass k instead of r to bus_error_message() so the
fallback error string reflects the actual bus call failure, not the
accumulated result that was reset to 0 earlier.
networkd-ndisc: return -ENOMEM when newdup() fails, since r is 0 at
that point and the OOM would otherwise be reported as success.
storagetm: add missing NULL check after strndup() for attr_model,
matching the pattern already used for attr_firmware and attr_serial.
Signed-off-by: dongshengyuan <dongshengyuan@uniontech.com>
Co-developed-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Avoid trying to query for our LUO session on reexec/softreboot/reload/etc.
Currently /dev/liveupdate is only accessible to root so it's not a big
issue, but this might change in the future, so make sure nobody can
play games with us.
Follow-up for 257c35c1a3
Link ParticleOS images in the workflow subproject for the PR,
so that they can be enabled with a click when needed.
But keep disabled by default, as they take a lot of resources,
especially disk space.
When generator_write_initrd_root_device_deps() fails, the error was
swallowed by returning 0 (success) instead of r. The two subsequent
calls in the same block correctly return r on failure.
Signed-off-by: dongshengyuan <dongshengyuan@uniontech.com>
Co-developed-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
control_command and control_command_id were cleared before being passed
to unit_log_process_exit(), so the log always showed an invalid/unknown
command name.
Move both clears after the log call, matching the ordering in
socket_sigchld_event() and service_sigchld_event().
Signed-off-by: dongshengyuan <dongshengyuan@uniontech.com>
Co-developed-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
663f0bf5cb stopped reusing the original block device fd whenever
partition scanning was requested (LO_FLAGS_PARTSCAN) but couldn't be
enabled on the device, so that nested partition tables on devices the
kernel won't scan (e.g. the pmOS/android case) get exposed via a real
loop device.
However that also forced a pointless loop device for any partition that
carries a file system directly, e.g. a btrfs subvolume mounted via
MountImages=. For multi-device btrfs this is fatal: the kernel rejects
seeing the same member via both the original partition and the loop
device, and the mount fails.
A loop device is only ever needed here to expose a nested partition
table. So only refuse the shortcut when the device actually carries one,
probed via gpt_probe(), instead of whenever partition scanning is
disabled. Devices carrying a file system directly (or nothing) take the
shortcut as before.
Add an integration test to cover the failure scenario of the original
issue.
Fixes: https://github.com/systemd/systemd/issues/42520
Replaces: https://github.com/systemd/systemd/pull/42576
Follow-up for 663f0bf5cb
Co-Authored-By: Luca Boccassi <luca.boccassi@gmail.com>
Co-developed-by: Claude Opus 4.8 <noreply@anthropic.com>
We really want to use io.systemd.Report for the interface provided by
systemd-report itself, not by its backend. hence, rename the interface
that uploading plugins shall implement to io.systemd.Report.Uploader.
Note that we ideally should have a varlink interface definition for that
interface. if we had, we'd have noticed that earlier.
let's name the dir "/run/systemd/report.upload/" (rather than
"/run/systemd/metrics-upload/"). After all, these are reports that we
upload, not indiviudual metrics. And it would be particular confusing
since the dir to pick up metrics is called /run/systemd/report/, rather
than /run/systemd/metrics/. Hence the thing that deals with reports is
nmamed metrics, and the thing that deals in metrics is named reports...
We really want to use io.systemd.Report for the interface
provided by systemd-report itself, not by its backend. hence, rename the
interface that uploading plugins shall implement to
io.systemd.Report.Uploader.
Note that we ideally should have a varlink interface definition for that
interface. if we had, we'd have noticed that earlier.
probe_gpt_boot_disk_needs_loop() sets ID_PART_GPT_AUTO_ROOT_DISK_NEEDS_LOOP
for any whole disk that holds the boot ESP/XBOOTLDR but whose partition table
the kernel cannot parse. Until now the udev rule turned that into a
systemd-loop@.service for every block device.
That is too broad: device-mapper devices also report kernel partition
scanning as disabled, but their partitions are managed in userspace by kpartx
(see 66-kpartx.rules). Setting up a loop device on top of them re-exposes the
same partition table a second time and only causes trouble.
Restrict the rule to optical drives, the one class that genuinely needs a
kernel-side loop device (El Torito GPT sector size mismatch, or drives that do
not support partition scanning) and that has no userspace partition manager of
its own.
Co-developed-by: Claude Fable 5 <noreply@anthropic.com>
Since 4e0eabd401 ("udev: also trigger loop device for boot disk when
partition scanning is unsupported"), builtin_blkid() bails out entirely as
soon as probe_gpt_boot_disk_needs_loop() reports that a loop device is
needed, skipping all superblock probing. As a result whole-disk properties
such as ID_PART_TABLE_UUID and ID_FS_* are no longer set.
This regresses any whole disk whose partitions the kernel cannot expose
itself but which is otherwise perfectly probeable, most notably
device-mapper multipath disks: kernel partition scanning is disabled on them
(their partitions are managed in userspace by kpartx), so they are now
flagged as needing a loop device and lose their ID_PART_TABLE_UUID.
The early return was never necessary. The original intent was only to skip
root partition discovery on the device, and that already happens on the loop
device instead: find_gpt_root() bails when the kernel can't scan partitions,
blkid probes at the device's own logical sector size so a GPT written for a
different sector size is simply not detected, and PART_ENTRY_* is only
emitted for partitions the kernel actually registered, of which a
loop-needing whole disk has none. So keep probing the device for its
whole-disk properties unconditionally and let partition and root discovery
happen on the loop device.
Co-developed-by: Claude Fable 5 <noreply@anthropic.com>
Now that the fast path performs a deep copy identical to the general
loop (when n_changes_attached==0, found stays false for all entries),
the block is redundant. Remove it and let the general loop handle this
case.
Signed-off-by: dongshengyuan <dongshengyuan@uniontech.com>
Co-developed-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>