The first time I lost my MariaDB instance, it was around 3 AM. I woke up to a monitoring dashboard that was completely flat. The server logs gave no hint of unusual access attempts; in terms of activity they showed nothing at all. The only evidence was that the database server had been “panting” right before it was killed by the OOM (out-of-memory) killer.
I had built a strong server with 64GB of RAM for web applications, and I assumed that running out of memory would never be a concern.
The constraints on the production server were harsh. I could not add more RAM; I could only add swap space, and adding swap made things worse, turning the server into a brain-dead, swap-thrashing zombie. What haunted me at the time was that the OOM killer consistently terminated the database process while leaving the numerous Apache worker processes, all of them idle at the time, unscathed. I wanted to know why the kernel chose the database over the Apache workers.
There was no single magic command that fixed the problem. It took a combination of combing through the kernel’s ring buffer like an investigator, tuning the kernel’s overcommit policy, and explicitly marking the MariaDB process as “do not kill”. As soon as I stopped treating the OOM killer as an unpredictable force and started reading the kernel’s logs to understand its decisions, the fix came quickly. The following guide walks through that investigative process step by step: a complete guide to OOM forensics on a Linux system.
Summary in Short:
- Use dmesg to extract OOM events from the kernel’s ring buffer, and decode the “Out of memory” log block.
- Cross-reference the OOM kills against the persistent syslog to build a complete timeline.
- Learn why swap thrashing often obscures the underlying memory issue.
- Harden MariaDB (and any other important daemon) against unintended out-of-memory (OOM) termination by adjusting oom_score_adj, either directly or via a systemd unit override.
- Tune vm.overcommit_memory so that processes are not killed unpredictably when physical memory runs short.
OS environment – Ubuntu 22.04 LTS, kernel 5.15.0-86-generic.
Understanding the Linux OOM Killer Architecture
The OOM killer is not a user-space daemon. It is part of the Linux kernel and is invoked when the kernel determines that there are no longer enough physical pages available to satisfy an allocation request. When this happens, the kernel calls out_of_memory() and runs select_bad_process() to choose which process to terminate based on its “badness” score.
The Role of Virtual Memory and Overcommit
Every process on Linux sees its own virtual address space as a single contiguous block of memory. The kernel manages memory in a clever way: it makes a “promise” of memory to an application without immediately backing that promise with physical RAM pages. This practice of promising memory before actually allocating it is known as overcommitting memory.
When an application requests memory via a call to malloc(), the allocation will normally succeed regardless of whether any free pages are currently available. This works because the kernel bets that the application will not actually touch every page it has been promised.
When too many of those promises come due at once, the kernel reaches a point where it can no longer supply enough physical pages to meet demand. With the default vm.overcommit_memory=0 heuristic, the kernel keeps making promises until the OOM killer is forced to choose a process to terminate. Understanding this accounting is the only way to make sense of the kernel’s log output (dmesg) later.
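A quick way to see how far the kernel has overcommitted on your own machine is to compare the total of its promises (Committed_AS) against physical RAM. The figures below are illustrative, not taken from the incident server:

$ grep -E 'MemTotal|Committed_AS' /proc/meminfo
MemTotal:       65851236 kB
Committed_AS:   71203412 kB

A Committed_AS value larger than MemTotal means the kernel has promised more memory than the machine physically has, which is perfectly normal under mode 0, but it is exactly the gap the OOM killer exists to close.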
Why the Kernel Sacrifices Processes
Once the kernel determines that recovery is no longer an option, it immediately executes select_bad_process() and selects the victim of the OOM condition using oom_score. On modern kernels this score is driven almost entirely by the process’s memory footprint, counting its resident set, swap usage, and page tables, plus any adjustment applied through oom_score_adj. The process with the highest score receives a SIGKILL, which immediately releases all memory the kernel had allocated to it.
The logic behind this selection is sound, but in practice the default algorithm tends to pick database daemons and other cache-heavy workloads, because they hold large amounts of anonymous memory, which the kernel cannot simply drop the way it can drop file-backed cache.
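You can inspect the score the kernel has currently assigned to your own database. This is a quick check, assuming the MariaDB server process is named mysqld; the higher the oom_score, the more likely the process is to be chosen:

# The live badness score and any manual adjustment applied to it
$ cat /proc/$(pidof mysqld)/oom_score
$ cat /proc/$(pidof mysqld)/oom_score_adj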
In my own situation, I first tried increasing swap space from 2 GB to 16 GB. That did not solve the problem; it only masked it for about ten minutes of heavy disk thrashing before the OOM killer was invoked anyway. Later I tried disabling overcommit entirely (vm.overcommit_memory=2 without raising the ratio), which caused immediate allocation failures for my applications under normal load. The real way forward was to study the OOM killer’s logs and adjust the kernel’s settings, not simply to add more swap.
How to debug linux oom killer dmesg Terminations
The kernel ring buffer is the primary source of information regarding OOM killer terminations. In order to properly utilize this information, you will need to extract the relevant kill messages from the kernel’s ring buffer before they are wiped out, and then decode all fields.
Triggering and Reading the Ring Buffer
The easiest command to use is dmesg -T | egrep -i 'killed process'. The -T flag converts the kernel’s ring buffer timestamps into human-readable dates, while egrep filters for the message about the process that was killed. Here is what a typical emergency looks like as the killer acts:
$ dmesg -T | egrep -i 'killed process'
[Wed Oct 4 02:17:23 2023] Out of memory: Killed process 1423 (mysqld) total-vm:2048000kB, anon-rss:1500000kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:12000kB oom_score_adj:0
That single line tells you who was killed (mysqld, PID 1423), how much virtual memory it was using, and how much anonymous resident memory the kernel recovered, in this case roughly 1.5 GB of heap. The oom_score_adj:0 indicates that nobody had adjusted the process’s score; the kernel killed it purely on its default badness calculation.
Decoding the “Out of memory:” Log Block
When the only thing you see logged is “Killed process”, that is just the verdict. The full explanation is reported earlier in the log. Use dmesg -T | grep -A 20 -B 5 'invoked oom-killer' to see the entire chain of events leading up to the “Out of memory:” line. That block lists the allocation request that triggered the killer (gfp_mask) and the state of each memory node at the moment of crisis. The per-process score table is usually dumped right there in the buffer, showing exactly why mysqld scored higher than Apache’s httpd.
Look for the line that reads “[ …] mysqld invoked oom-killer:”. This indicates that the system was under so much stress that even mysqld‘s request for memory was the last straw.
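The same block ends with a per-process table. On this kernel generation its header looks roughly like the line below (columns can vary slightly between kernel versions), and each row shows the rss and oom_score_adj the kernel weighed for every candidate:

[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name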
Log Consolidation: Identifying Victims Beyond the Kernel Ring
The ring buffer is ephemeral. To keep a record of OOM events over several weeks, you must rely on persistent logs. Syslog captures the full OOM traces and often stores additional historical context, such as cgroup names.
Commands to identify oom killed process IDs
If you want to see every killed PID recorded in the syslog since boot, you can review the syslog directly. On Ubuntu it lives at /var/log/syslog. Use the following command:
grep -ioP 'Killed process \K\d+' /var/log/syslog
This command will return only the PIDs of the processes that were killed by the OOM-Killer. You can then use the PIDs to cross-reference with either ps or the individual program’s log to discover what the program was doing leading up to its termination.
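When there have been several kills, a small loop, a sketch assuming the default Ubuntu syslog path, pulls the surrounding context for each PID:

# Show a few lines of context around every recorded OOM kill
for pid in $(grep -ioP 'Killed process \K\d+' /var/log/syslog); do
    grep -B 3 "Killed process $pid" /var/log/syslog
done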
Grepping the out of memory killer syslog Output
When I need the entire timeline of an OOM kill, I run a case-insensitive grep for “oom” across the whole syslog:
$ grep -i oom /var/log/syslog
Oct 4 02:17:23 node01 kernel: [12345.678] mysqld invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Oct 4 02:17:23 node01 kernel: [12345.678] [<...>] out_of_memory+0x.../0x...
Oct 4 02:17:23 node01 kernel: [12345.678] Out of memory: Killed process 1423 (mysqld) total-vm:...
This output contains the same information as seen in the dmesg command, but it contains persistent entries instead of transient entries. Therefore, I can use the syslog output to correlate the timing of the kills with cron job timestamps or application error logs.
Steps to analyze var log messages oom Entries
On older or RHEL-based distributions, the OOM messages land in /var/log/messages instead, and the analysis procedure is the same. Running journalctl around the time of a suspected OOM kill can provide additional context, since the journal is persistent and carries the surrounding kernel messages as well. The command journalctl -k --grep=oom retrieves the same information as the syslog search without any need to parse and maintain flat log files.
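If the machine rebooted after the incident, the journal can also reach back to the previous boot, something flat log files cannot always do; this assumes persistent journaling is enabled on the host:

# Kernel messages mentioning "oom" from the boot before this one
$ journalctl -k -b -1 --grep=oom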
The Silent Swap Thrash: When OOM Fails to Trigger Immediately
Occasionally the box will not log an OOM kill at all because it is stuck in extreme swap thrash. The load average climbs and climbs, SSH stops responding, and the disk activity lights glow solid. This happens when the kernel is continuously shuffling pages between RAM and swap while the OOM killer has not yet been triggered, because pages are still being reclaimed, just at a glacial speed.
Identifying High IO Wait States
You can monitor iowait time with top and/or vmstat 1; it appears in the column labeled “wa”. You are likely in a thrash scenario if that column sits at or above 50% and the system is nearly frozen while free -h shows swap usage close to 100%. In that state the kernel ring buffer may show page allocation failures instead of a kill, because the OOM heuristic assumes memory is still being released and therefore never fires.
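If you can still get a shell, a minimal way to catch the thrash in progress is to watch the swap columns alongside iowait:

# si/so = pages swapped in/out per second, wa = percentage of CPU time waiting on IO
$ vmstat 1 5

Sustained non-zero si/so values combined with a high wa column are the signature of a box paging itself to death rather than doing useful work.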
Forcing a Manual OOM Invocation via Sysrq
If you are watching the system from a console and it is choking on memory reclaim to the point that you cannot log in gracefully, you can force the OOM killer to run by using the magic SysRq interface.
This should only be done as a last resort, and only once you are sure the system is hard-deadlocked. Invoking the OOM killer will kill the worst offender, so proceed with caution: killing a process is irreversible.
# Force the OOM killer to execute immediately
echo f > /proc/sysrq-trigger
By executing this command, you will issue the ‘f’ trigger to the sysrq interface, at which time the kernel will execute its out_of_memory() function and you should receive a new, fresh “Out of Memory: Killed Process” message in your dmesg console output.
Do not use this trigger as a routine crutch; it is intended solely as a diagnostic tool for when the automated heuristics fail to act in time.
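One caveat worth checking in advance: the keyboard combination (Alt+SysRq+f) only works if the SysRq bitmask allows process signalling, while the /proc/sysrq-trigger write shown above is a root-only interface that is not governed by that mask.

# On Ubuntu this typically prints 176, a bitmask that does not include the
# process-signalling bit (64) needed to trigger the OOM killer from the keyboard
$ cat /proc/sys/kernel/sysrq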
Advanced linux memory tuning and Policies
Once you understand the criteria by which the kernel selects processes to terminate and why, you can recalibrate how it commits memory, reducing the chance of memory starvation by enforcing stricter allocation policies.
Applying the vm.overcommit_memory configuration
The memory overcommitment policy may be configured in /etc/sysctl.conf using the vm.overcommit_memory directive. The three overcommitment policy modes are described in detail in the kernel’s overcommit accounting documentation. Mode 2 is the “fire extinguisher” option which I always use for database servers that experience frequent OOM incidents.
# /etc/sysctl.conf addition
vm.overcommit_memory=2
vm.overcommit_ratio=80
Always back up the existing configuration first so you can restore it if needed: sudo cp /etc/sysctl.conf /etc/sysctl.conf.bak. And open a second root shell before modifying any sysctl settings.
In mode 2, the kernel denies any malloc() request that would push the total committed memory above swap plus RAM multiplied by the ratio divided by 100. With a ratio of 80, for example, allocations are allowed up to 80% of physical RAM plus all of swap. This does not guarantee that your applications will never receive ENOMEM, since they may legitimately need more memory than the limit allows; it does, however, stop the kernel from making guarantees it cannot keep, the kind that end in terminations later.
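As a worked example under the assumptions above, a 64 GB server with 2 GB of swap and a ratio of 80 ends up with a commit ceiling of roughly 64 x 0.8 + 2 ≈ 53 GB. The kernel reports the exact figure it enforces:

# CommitLimit = (RAM * overcommit_ratio / 100) + swap; Committed_AS is the running total of promises
$ grep -E 'CommitLimit|Committed_AS' /proc/meminfo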
Applying and Verifying Sysctl Changes
After making the changes, reload the values and verify them:
sudo sysctl -p
cat /proc/sys/vm/overcommit_memory
# Should output 2
cat /proc/sys/vm/overcommit_ratio
# Should output 80
Before deploying to production, test the new setting in a staging environment to ensure the ratio is not so low that large workloads crash during start-up. If the production server is already under memory pressure, reduce the ratio gradually while monitoring for allocation failures.
How to prevent oom kill mariadb Terminations
When there is a process that you need to keep running at all costs, such as a database with active transactions, you must override the kernel’s calculation of that process’s “badness”. For MariaDB the recommended approach is to adjust the OOM score directly, and the change can be made persistent through systemd.
Implementing oom_score_adj tuning for Daemons
The /proc/<pid>/oom_score_adj procfs knob sets the adjustment factor used in the OOM calculation. A value of -1000 excludes a process from OOM kills entirely. Setting it by hand is fragile, because the value resets when the process restarts, so you should bake it into the service definition instead.
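For reference, the manual (non-persistent) version looks like the following; it assumes the MariaDB server process is named mysqld and has to be repeated after every restart, which is exactly why the systemd route below is preferable:

# -1000 exempts the process from OOM kills until it next restarts
echo -1000 | sudo tee /proc/$(pidof mysqld)/oom_score_adj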
Modifying Systemd Service Overrides
According to the official systemd.exec documentation, systemd’s OOMScoreAdjust= setting applies this value to the service’s processes every time the unit is started. You can set it for MariaDB by creating an override file:
sudo systemctl edit mariadb
This will launch a temporary text editor window. Copy and paste in:
[Service]
OOMScoreAdjust=-1000
Afterwards, close and save it. Finally, to apply the changes, run:
sudo systemctl daemon-reload
sudo systemctl restart mariadb
Be sure to perform the above steps during a scheduled maintenance window; restarting a live database that is servicing requests should always be done with caution. After the restart, verify that MariaDB’s oom_score_adj is set to -1000 by running cat /proc/$(pidof mysqld)/oom_score_adj. Once that is confirmed, the kernel will pass over MariaDB even under memory pressure. Do not set every daemon you run to -1000; leave some processes available for the OOM killer to use as relief valves.
Frequently Asked Questions
Why does the OOM killer destroy my database instead of Apache?
It is not a personal decision; it is arithmetic. The database process normally holds a very large amount of anonymous memory (its heap and caches), while idle Apache workers (in prefork mode) retain a larger share of file-backed pages, which the kernel can easily reclaim and later re-read from storage. Anonymous memory cannot simply be dropped, so it is penalized more heavily in the “badness” calculation. The oom_score_adj tuning detailed above flips this logic for the daemon you want to protect.
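You can see the split for yourself: /proc/<pid>/status breaks the resident set into its anonymous and file-backed portions (assuming the server process is named mysqld):

$ grep -E 'RssAnon|RssFile' /proc/$(pidof mysqld)/status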
Can I completely disable the OOM killer in Linux?
There is no switch that simply disables the OOM killer; it is part of the memory reclaim path. If you set vm.panic_on_oom=1, the kernel will panic rather than kill a process, which is worse for your uptime. The better option is vm.overcommit_memory=2 with a generous ratio, so the system never reaches a state where the OOM killer is needed. Note, however, that an out-of-memory condition created by cgroup (memcg) limits can still trigger an OOM kill at the cgroup level.
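If you suspect a cgroup-level limit rather than system-wide pressure, a quick first check is whether the unit has a memory ceiling configured at all (a sketch assuming a systemd-managed MariaDB unit):

# "infinity" means no MemoryMax limit is set on the unit's cgroup
$ systemctl show mariadb -p MemoryMax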
Does adding more swap space stop OOM terminations?
No. Swap only delays the moment the kernel runs out of places to put anonymous pages; it does not fix the underlying shortage of physical memory. The OOM killer is invoked when the kernel cannot reclaim enough memory to satisfy an allocation, and if your workload keeps touching more memory than the machine can hold in RAM, swap fills up and you still hit a termination. Swap is simply a buffer against constant overcommit, not a cure.