I distinctly recall the 7 a.m. morning when I walked into work and found a production server with all 16 cores pinned at 100 percent. The monitoring dashboard was a wall of red, and my on-call phone had already gone off twice before I arrived. SSH-ing into the server was painfully slow, with every command taking several seconds to respond, and htop revealed the culprit: systemd-journald. The daemon accounted for close to 90 percent of the server's total CPU usage and was writing to /var/log/journal every second at a rate that made it sound like the disks were about to give out. I knew I had less than ten minutes to resolve the problem before the help desk calls from my colleagues started.
The tight time frame was not the only constraint on this incident. First, the machine was a database server covered by a five-nines Service Level Agreement, so rebooting it was off the table. Second, the database developers needed the most recent hour of logs to track down the bug that likely triggered the incident, so I could not simply delete log files wholesale and still keep the newest entries. The very component I had assumed was built for availability and reliability was the one taking the server down. My goal, then, was a precise fix: resolve the problem quickly without losing the most recent log files and without patching the live kernel.
Once I collected my thoughts and pushed the anxiety aside, I worked out a sequence of commands that I have executed many times since that first morning: flush the journald logs, rotate out the corrupt journal segments, and apply hard limits on disk space and memory. The commands below are the complete playbook I developed from that incident and several others since. This guide will give you the tools to diagnose systemd-journald's high CPU usage, repair corrupted journal files, and lock down your configuration so the problem doesn't happen again.
Quick Summary
- Diagnose CPU spikes by examining live log spam and I/O bottlenecks.
- Fix corrupted journal files without losing your most current entries.
- Implement memory cgroup overrides to work around a memory leak in systemd-journald.
- Use journald.conf to enforce a permanent disk space limit for journalctl.
- Gracefully stop and restart the systemd-journald service.
Environment used for testing
All steps were tested on Ubuntu 22.04 LTS with systemd version 249.11. The concepts apply to any modern distribution running systemd, though package versions and file paths differ slightly between, say, RHEL 9 and Arch.
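To confirm your environment is comparable before following along, print the systemd version; the first line of output names the release:
systemctl --version | head -n 1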
Understanding systemd-journald Failures
Core Architecture and Log Routing
systemd-journald sits at the center of a busy subsystem: every service writing to stdout or stderr reaches it through the socket at /run/systemd/journal/stdout. Once a message arrives, journald indexes it, compresses it, and writes it to a structured binary file under /var/log/journal (if a persistent directory exists). If ForwardToSyslog=yes is configured, journald also forwards the message to syslog. When any part of this pipeline stalls, whether from slow disk I/O, a frozen filesystem, or one application producing huge volumes of messages, everything queues up in journald's memory. The longer journald is unable to write to disk, the more CPU it burns compressing and rotating the backlog. Left in this state long enough, the daemon hammers the I/O subsystem until the kernel starts reporting that a process has "hung" or become unresponsive.
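If you want to watch this routing in action, a harmless way is to push a test message through the stdout transport with systemd-cat and read it straight back; the routing-test identifier below is just an arbitrary tag I chose:
echo "journald routing test" | systemd-cat -t routing-test -p info
journalctl -t routing-test -n 1 --no-pager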
Signs of corrupted journal files Linux systems encounter
In most cases, the kernel ring buffer shows signs of corruption before you notice any CPU spikes. Immediately after experiencing a slowdown, run dmesg; you should see something like the following:
[ 853.412101] systemd-journald[312]: Failed to write entry (29 items, 773 bytes), ignoring: Bad message
[ 853.412345] systemd-journald[312]: Journal corrupted, rotating.
[ 853.412997] systemd-journald[312]: Journal header limits reached or header out of date, rotating.
[ 854.112989] systemd-journald[312]: Failed to rotate /var/log/journal/b8c6...c0f: File too large
The most telling indicator of journal file corruption is the "Bad message" error, which means journald hit a broken record in a binary log file it had previously written. The usual causes are a sudden shutdown from power loss, a full disk, or a filesystem that can no longer accept writes. When the daemon then tries to rotate past the corrupted entry, it can get stuck in a constant retry loop that burns CPU time and drives up system load.
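Before assuming corruption, you can ask journald to check its own files. journalctl ships a built-in consistency checker that prints a PASS or FAIL verdict for each journal file it scans:
sudo journalctl --verify
Any file reported as failed is a candidate for the rotation and vacuum steps later in this guide.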
Steps to troubleshoot systemd-journald high CPU usage
When a server is pinned by high CPU utilization, a few quick steps narrow the problem down fast. The first thing I do when I log in to a server under heavy load is quick triage, which typically means keeping two SSH sessions open.
Verifying Active Log Spam
I usually start my troubleshooting by determining whether a misbehaving service is flooding the journal with excessive log entries. To verify this while the load is active, run:
journalctl -f -n 50 --no-pager
Watch for the same message from a single process ID (PID) repeating several times per second. For example, I once captured the following output from a misconfigured cron job that kept triggering the same D-Bus policy error over and over:
Mar 14 09:23:14 srv01 dbus-daemon[1423]: [system] Connection ":1.78" is not allowed to own the service "com.example.agent" due to security policies in the configuration file
Mar 14 09:23:14 srv01 dbus-daemon[1423]: [system] Connection ":1.78" is not allowed to own the service "com.example.agent" due to security policies in the configuration file
Mar 14 09:23:14 srv01 dbus-daemon[1423]: [system] Connection ":1.78" is not allowed to own the service "com.example.agent" due to security policies in the configuration file
Mar 14 09:23:15 srv01 dbus-daemon[1423]: [system] Connection ":1.78" is not allowed to own the service "com.example.agent" due to security policies in the configuration file
To give you an idea of how bad it was, this spam drove dbus-daemon to produce roughly 8,000 log entries every minute, and journald was spending all of its CPU compressing nearly identical entries that no one would ever read. Running systemctl stop dbus-broker.service dropped the CPU load from 94% to 15% almost instantly. If you cannot stop the offending service, you can still cap its output with rate limiting in journald.conf, which I cover later in this guide.
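To quantify spam like this yourself, a quick tally of the most repeated messages over the last few minutes usually points straight at the offender; this one-liner assumes the duplicate messages are byte-identical:
journalctl --since "10 min ago" -o cat | sort | uniq -c | sort -rn | head -n 5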
Diagnosing I/O Bottlenecks
If the journal daemon is consuming a lot of CPU and is also sitting in uninterruptible sleep (the D state in htop or ps), you are looking at a disk problem. Check which processes are generating heavy I/O:
sudo iotop -o -b -n 5
A snippet from a sick server looked like this:
Total DISK READ: 0.00 B/s | Total DISK WRITE: 4.55 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
312 be/4 root 0.00 B/s 4.48 M/s 0.00 % 99.99 % systemd-journald [journal-offline]
When systemd-journald consumes close to 100% of the available I/O and the journal-offline thread is active, the daemon is trying to reconcile the corrupt journal files it has on disk. Those writes carry no valid log data; they are just repeated failed attempts to produce a complete, functional journal file. You need to step in before the disk fills up or the filesystem gives out entirely.
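At this point I also check how close the journal is to exhausting the disk, since that decides how much time you have to act:
journalctl --disk-usage
df -h /var/log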
What Didn’t Work for Me
The first time I experienced an I/O storm like this, I tried raising SyncIntervalSec to 5 minutes in journald.conf, but the corruption forced the daemon to ignore the interval and keep flushing continuously. I also tried remounting the filesystem holding /var/log/journal with the sync option disabled; I/O dropped somewhat, but the disk was still being thrashed by the ongoing corruption. The only fix that actually worked was removing the corrupted segments from the journals and imposing hard size limits on the /var/log/journal directories.
Resolving /var/log/journal corruption
Executing journalctl flush rotate
First, back up your journal directory:
sudo cp -a /var/log/journal /var/log/journal.bak.$(date +%s)
Next, flush all the logs to disk and force the log files to rotate:
sudo journalctl --flush
sudo journalctl --rotate
The --flush flag tells systemd-journald to write all buffered entries to the currently open journal files, and --rotate tells it to close the old files and open fresh ones. This alone can cut journald's CPU load noticeably, particularly when several gigabytes of journal files have accumulated. The old files remain on disk; we will clean those up in the next step.
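For what it's worth, the same two operations are also reachable through the signal interface documented in man systemd-journald, which can help when journalctl itself is responding slowly:
sudo kill -USR1 "$(pidof systemd-journald)"
sudo kill -USR2 "$(pidof systemd-journald)"
SIGUSR1 triggers the flush and SIGUSR2 the rotation.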
Cleaning Orphaned Files
Rotating the journal closed the corrupt files so they are no longer being written to, but they still take up disk space and can skew journal statistics. A time-based purge removes them safely:
sudo journalctl --vacuum-time=1d
The --vacuum-time option deletes all archived journal files older than one day. Adjust the value if you are short on space, but be careful not to delete logs that are still under investigation. Vacuuming only touches archived files, so the journal file currently being written to is never affected.
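If you need to reclaim a fixed amount of space rather than cut by age, the size-based variant works the same way and, per the journalctl man page, the two options may be combined; 500M here is just an example target:
sudo journalctl --vacuum-size=500M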
Safely restart systemd-journald service
With the journal cleaned, we can restart systemd-journald safely. Never use kill -9 on the journald process: systemd will restart it anyway, and any journal entries that had not been flushed to disk at the moment you killed it are lost for good:
sudo systemctl restart systemd-journald.service
sudo systemctl status systemd-journald.service
The status output should report the service as "active (running)", and journald's CPU usage should settle around 0.3%. If the restart fails because of an unresponsive PID, clear the unit's failure state and try again: sudo systemctl reset-failed systemd-journald.service.
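To double-check that the daemon is healthy rather than crash-looping, I like to inspect the unit's restart counter and its live CPU share; pidof resolves the daemon's PID:
systemctl show systemd-journald -p NRestarts,ActiveState
top -b -n 1 -p "$(pidof systemd-journald)" | tail -n 2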
Handling Edge Cases: The systemd-journald memory leak
Some systemd deployments, particularly hosts churning through large numbers of short-lived containers or user sessions, exhibit a slow memory leak in systemd-journald's hash tables. Over time the daemon's Resident Set Size (RSS) creeps upward until the kernel's out-of-memory (OOM) killer is invoked.
Before that happens, as free memory runs low, journald leans heavily on swap, which shows up as exactly the kind of CPU thrashing described earlier.
Identifying Namespace and Cgroup Limitations
The damage is made worse by the fact that journald runs in the root cgroup by default, with no explicit memory limit. Without a cgroup cap, RSS can climb toward 8 GB and eventually take down the entire host. Monitor the journald process over the course of a week for early warning signs:
ps -eo pid,rss,comm | grep journald
If the RSS climbs steadily day after day, you are likely hitting this known memory leak.
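A minimal way to collect that week of data, assuming you are fine with a small daily cron job (the script path and log file below are hypothetical, pick your own):
#!/bin/sh
# /etc/cron.daily/journald-rss: append journald's RSS in kB with a timestamp
echo "$(date -Is) $(ps -o rss= -p "$(pidof systemd-journald)")" >> /var/log/journald-rss.log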
The leak traces back to fragmentation of journald's interned string tables, which occurs when large numbers of logging namespaces are created and rotated around the same time, for example, Docker containers being started and stopped continuously, as often as once per second.
Applying Workarounds via systemd Overrides
Until a systemd update fixes the underlying issue, you can apply a drop-in override to the journald service that caps the daemon's memory. First, create the drop-in directory if it does not exist:
sudo mkdir -p /etc/systemd/system/systemd-journald.service.d
Next, create the override file:
sudo tee /etc/systemd/system/systemd-journald.service.d/override.conf <<EOF
[Service]
MemoryHigh=512M
MemoryMax=1G
EOF
MemoryHigh sets a soft throttling threshold of 512 MB for journald, while MemoryMax is the hard ceiling of 1 GB, beyond which the kernel terminates the process. Because systemd restarts the daemon automatically, a forced kill under memory pressure resets the leaked heap without requiring a reboot of the entire node. To apply the changes, reload systemd and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart systemd-journald.service
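You can confirm the limits landed by querying the unit's memory properties; MemoryCurrent shows live usage against the new caps:
systemctl show systemd-journald -p MemoryHigh,MemoryMax,MemoryCurrent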
In a container-heavy environment, this override has kept journald's RSS consistently under 600 MB and eliminated the nightly OOM events.
Best Practices to limit journalctl disk space
Tuning journald.conf SystemMaxUse
Back up the original configuration before making changes:
sudo cp /etc/systemd/journald.conf /etc/systemd/journald.conf.bak
Open the /etc/systemd/journald.conf file and limit the maximum amount of disk space used by the journal:
[Journal]
SystemMaxUse=500M
This line ensures that journald keeps at most 500 MB of journal data on disk, no matter how much logging the system generates. On busy servers with 100 GB of available disk space I usually set the limit to 2G or 4G, but never above 10G, because the daemon's performance degrades as the journal grows. Note that journald does not re-read journald.conf on a signal (SIGUSR1 only flushes runtime logs to persistent storage), so after saving the file, restart the service to apply the change:
sudo systemctl restart systemd-journald
The restart completes in well under a second on a healthy system, and buffered messages are flushed first, so in practice no log data is lost.
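To confirm which values are actually in effect, including any drop-ins, systemd-analyze can print the merged configuration:
systemd-analyze cat-config systemd/journald.conf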
Enforcing Rate Limits on Noisy Services
Some noisy applications emit the same error thousands of times per second. You can rein them in without modifying their source code by configuring rate limiting in journald.conf. Add the following lines under the [Journal] section:
RateLimitIntervalSec=10s
RateLimitBurst=100
These settings cap each service at 100 messages per 10-second interval. Once a service exceeds the burst, the journal daemon drops the excess and records a single summary line (e.g. "Suppressed 17 messages"). This is an aggressive configuration, so test it in a staging environment first to confirm it tames runaway logging without suppressing the messages you need to debug the underlying application bug.
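If a global limit feels too blunt, the same knobs exist per unit via the LogRateLimitIntervalSec= and LogRateLimitBurst= service settings, which lets you clamp one noisy service while leaving everything else alone. A sketch, using a hypothetical chatty.service:
sudo mkdir -p /etc/systemd/system/chatty.service.d
sudo tee /etc/systemd/system/chatty.service.d/ratelimit.conf <<EOF
[Service]
LogRateLimitIntervalSec=10s
LogRateLimitBurst=100
EOF
sudo systemctl daemon-reload
sudo systemctl restart chatty.service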
Frequently Asked Questions
Will deleting corrupted journal files break my Linux system?
No, removing files from the /var/log/journal directory is safe; they contain only log entries. You can delete them with rm, provided systemd-journald has been stopped first, but the graceful approach is journalctl --vacuum-time=1d or --vacuum-size=100M, which avoids throwing away recent diagnostic data you might still need. Either way, the OS and running services will not crash when journal files are removed; you simply lose the record of what happened before their removal.
How often should I manually rotate systemd-journald logs?
If SystemMaxUse and RuntimeMaxUse are set properly, you will rarely need to rotate logs manually. I review my high-traffic servers once a month and let the automated vacuuming remove old files. If an incident floods the logs mid-month, I run journalctl --rotate right after the event to close the current journal files so the next vacuum pass can reclaim them.
Why does journalctl take so long to load historical logs?
journalctl slows down on historical queries because the journal directory fragments as thousands of small files accumulate over time. To produce output in order, journalctl must open every one of those files and interleave their entries. The same vacuum commands that remove old files also reduce this fragmentation; with a one-day vacuum window and a modest maximum file size, I keep the journal at a couple of dozen files per server, and loading historical logs stays fast.
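To gauge how fragmented your own journal is, count the files; the *.journal* glob also catches the ~-suffixed files left behind by corruption-triggered rotation:
find /var/log/journal -name '*.journal*' | wc -l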