Ansible Server Configuration

Configuring Persistent System Logging and Log Rotation via Ansible Playbooks

About a month ago, I woke up on a freezing cold Monday morning to a production outage: our app had gone offline and the monitoring dashboards offered no relevant information about why. It took me far too long to find the cause: /var/log/MYSERVER had completely filled the disk because log rotation had silently failed after a recent package upgrade. No alerts fired and nothing in the logs pointed at the problem, just a dead, broken server and a ton of angry users.

I don’t want anyone else to end up in the shoes I was in during that outage, so I created a set of Ansible playbooks that configure persistent system logging and log rotation in a way that is idempotent, testable, and completely hands-off after the initial rollout. I will lay out the whole approach here, including the ugly design issue I ran into later, when rsyslog kept a stale file descriptor open to a rotated-out log file.

Quick Summary

  • Template rsyslog.conf with Jinja2 so every server in your environment forwards its logs to a single central location
  • Enforce a fleet-wide disk retention policy using the community.general.logrotate Ansible module
  • Restart the rsyslog service safely with Ansible handlers and pre-flight syntax validation
  • Diagnose silent log rotation failures caused by stale file descriptors with lsof, and fix them with a postrotate script
  • Use rsyslog disk-assisted queues to ride out temporary network interruptions

Understanding Centralized Logging Architecture

Objective: forward all system logs from every server under your management to one central location, so you no longer have to worry about someone overwriting logs in /var/log/messages on an individual server. With all logs managed in one place, you can review, archive, and safely reference them.

Facility Levels and Priorities

Syslog classifies every message along two axes: a facility, such as ‘auth’, ‘daemon’, ‘kern’, and ‘local0’ through ‘local7’, and a severity level: debug, info, notice, warning, err, crit, alert, and emerg. Your rsyslog rules match on facility and severity to decide where each message goes. For example, you could send all ‘kern.*’ messages to the central server, while forwarding only ‘mail.err’ and above from the mail subsystem. I always keep a copy of the rsyslog documentation labelled ‘cheat sheet’ at hand when I make changes like this.
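As a concrete sketch, the classic selector syntax combines facility and severity as facility.severity, where a severity matches that level and everything above it (the IP and file paths here are illustrative):

```
# Everything from the kernel, at any severity, to the central server (TCP)
kern.*      @@10.0.20.15:514

# Only err and above from the mail subsystem, kept locally
mail.err    /var/log/mail-errors.log

# local0 is free for application use; info and above only
local0.info /var/log/myapp.log
```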

The Role of Rsyslog and Logrotate

Rsyslog forwards local log streams to a remote syslog server over TCP or UDP, while logrotate manages the local disk: compressing, deleting, and recreating rotated log files so you don’t fill the disk with logs. The two form a pair you want to keep in sync. If rsyslog keeps writing to a file that logrotate has just rotated out from under it, messages are lost in between. I’ll discuss how to avoid this problem in a moment.

Prerequisites for Playbook Execution

SSH Keys and Privilege Escalation

To let the controller node SSH into each remote target node without a password prompt, I use an ed25519 key loaded into ssh-agent. In the playbook you will set become: yes, because writing to /etc/rsyslog.conf and restarting services requires root access. Note that the remote user must be able to sudo without being prompted for a password; otherwise your handler will hang waiting for a password prompt that nothing can answer.
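A minimal play header along these lines covers both requirements (the host group and role name are illustrative):

```yaml
- name: Configure persistent logging and rotation
  hosts: all
  become: yes          # root is needed for /etc/rsyslog.conf and service restarts
  vars:
    log_server: 10.0.20.15
  roles:
    - logging          # hypothetical role holding the logging tasks
```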

Inventory and Destination Server Mapping

The central log server’s IP address is saved in a group variable in order to change it easily. An inventory file looks something like this:

all:
  hosts:
    web1:
    db1:
  vars:
    log_server: 10.0.20.15

The rsyslog template uses the group variable as a target IP. If the destination server needs to be changed, you only modify a single line and reapply the playbook. You won’t have to look through numerous configurations.

Automate System Logging Configuration with Ansible

This is where the hard work pays off: you will now automate the entire logging configuration with Ansible.

Writing the rsyslog.conf Jinja2 Template

I use an rsyslog.conf.j2 template file to replace the default rsyslog configuration. This template file will define the base module loading, global directives, and a forward rule to use the log server variable set during the previous section. The basic section of the rsyslog configuration file for forwarding to the logging server is as follows:

# Forward all messages to the central server
*.*  @@{{ log_server }}:514

# But also keep a local copy
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
*.* /var/log/messages

The double @@ tells rsyslog to forward over TCP, whereas a single @ would use UDP. Avoid UDP for anything important, because dropped packets are lost silently. Since the forwarding rule lives in the template, you can also change the port or enable TLS wrapping in one place instead of editing each machine’s rsyslog configuration file.

Executing Template Deployment via Playbook

Running the template pushing task is very easy:

- name: Deploy rsyslog configuration
  ansible.builtin.template:
    src: rsyslog.conf.j2
    dest: /etc/rsyslog.conf
    owner: root
    group: root
    mode: '0644'
    validate: 'rsyslogd -N1 -f %s'
  notify: restart rsyslog

The validate parameter checks the syntax of the new rsyslog configuration before the file is moved into place, so a misconfigured rsyslog.conf never lands on disk. The notify statement invokes my handler only if the file actually changed, so the playbook never restarts the service unnecessarily.

Triggering a Service Restart Using Handlers

The handler is defined one time, and can then be reused by multiple tasks:

handlers:
  - name: restart rsyslog
    ansible.builtin.service:
      name: rsyslog
      state: restarted
    listen: restart rsyslog

Thanks to the listen topic, the same restart can be triggered from a logrotate change as well, without creating multiple handler names across different roles.
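For example, a task in another role can notify the same topic (the template and destination names here are illustrative):

```yaml
- name: Deploy logrotate configuration for rsyslog
  ansible.builtin.template:
    src: rsyslog-logrotate.j2        # hypothetical template name
    dest: /etc/logrotate.d/rsyslog
    owner: root
    group: root
    mode: '0644'
  notify: restart rsyslog            # resolved via the handler's listen topic
```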

Why I Ultimately Chose This Route

After a period of managing rsyslog configurations with shell scripts and a git repository, I kept running into config drift after manual emergency fixes, with no reasonable way to tell which servers were up to date. Ansible solved both problems: it is idempotent, and its inventory-driven templates give me a single source of truth, with per-OS differences handled by simple Jinja2 conditionals. The template module documentation is easy to follow, and the built-in validation hook was the safety net my raw scp approach never had.

Implementing Disk Management and Log Rotation

Forwarding logs to a central server is great, but huge local log files can still fill the disk. A consistent retention policy across the fleet is critical.

Defining a Standard Retention Policy

I retain logs for roughly 30 days: logs are rotated weekly or whenever they reach 100 MB, whichever comes first, and everything older than one rotation is compressed. This policy is aggressive enough to keep servers from running low on disk space, yet leaves plenty of history for investigating the lead-up to an incident. Across my fleet, the baseline directives are weekly, rotate 4, compress, delaycompress, missingok, and notifempty.
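Put together as a logrotate stanza, that baseline looks like this; the maxsize directive implements the 100 MB trigger alongside the weekly schedule:

```
/var/log/messages {
    weekly
    maxsize 100M
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
}
```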

Applying Configuration via the Logrotate Module

Ansible provides a dedicated module to automatically create the /etc/logrotate.d files for log rotation. I call the module for all applications that write logs as follows:

- name: Configure logrotate for application logs
  community.general.logrotate:
    name: myapp
    path: /var/log/myapp/*.log
    options:
      - daily
      - rotate 7
      - compress
      - delaycompress
      - missingok
      - notifempty
      - create 640 root adm
  notify: restart rsyslog

Using the community.general.logrotate module eliminates the copy/paste syntax errors that hand-edited logrotate files invite. The notify triggers the rsyslog handler, so the daemon reopens its output and starts writing into the newly created log file.

Edge Case: Handling Stale File Handles After Log Rotation

This was the real kicker for me. Logrotate ran, all files were rotated and compressed correctly, and the playbook finished successfully, yet rsyslog stopped shipping logs to the central server. It was still writing to a deleted file through a stale file descriptor.

Diagnosing Missing Logs with lsof

When rsyslog can’t send logs, and there’s no data on the wire, you can use open file handles to find the problem. I ran the following:

sudo lsof -p $(pgrep rsyslogd) | grep deleted

The output from the command showed the following:

rsyslogd 12345 root    5w   REG   8,1    0 1048576 /var/log/messages (deleted)  <-- stale descriptor

The (deleted) marker means the inode no longer has a directory entry, but rsyslogd is still writing to it. The result was massive data loss: rsyslog kept writing into the orphaned inode while the freshly created /var/log/messages stayed empty. No error, no warning, just silent loss of data.
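You can reproduce the stale-descriptor problem in miniature with nothing but a shell: a writer that keeps a rotated file open sends new lines to the old inode, while the recreated file stays empty. This is a minimal sketch using a temporary directory, not rsyslog itself:

```shell
#!/bin/sh
# Simulate rsyslogd holding a log file open across a logrotate-style rename.
tmpdir=$(mktemp -d)
logfile="$tmpdir/messages"

exec 3>>"$logfile"            # the writer opens the log, like rsyslogd does
echo "before rotation" >&3

mv "$logfile" "$logfile.1"    # logrotate renames the file...
: > "$logfile"                # ...and recreates an empty one

echo "after rotation" >&3     # still lands in the OLD inode (messages.1)
exec 3>&-

echo "new file bytes: $(($(wc -c < "$logfile")))"
echo "old file lines: $(($(wc -l < "$logfile.1")))"
rm -rf "$tmpdir"
```

Running it shows the new file at zero bytes while both lines sit in messages.1, exactly the pattern lsof revealed above.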

Configuring copytruncate vs. postrotate Scripts

There are two ways to fix this in logrotate. The lazy approach is copytruncate, which copies the file and then truncates the original in place, so the inode never changes. It works, but a few log lines written during the copy can be lost. My preferred method is a postrotate script that signals the daemon with SIGHUP (or restarts it outright), causing rsyslog to reopen its log file under the new inode. My logrotate configuration section looks like this:

/var/log/messages {
    weekly
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        /bin/kill -HUP $(cat /var/run/rsyslogd.pid 2>/dev/null) 2> /dev/null || true
    endscript
}

The sharedscripts directive ensures the postrotate script runs only once, no matter how many log files the stanza matches. This matters because the Ansible handler fires only when the configuration changes; a rotation triggered outside Ansible, by cron or by someone running logrotate by hand, would otherwise leave rsyslog pointing at the deleted inode with nothing to nudge it. The postrotate signal covers those cases too.

Frequently Asked Questions

How do I validate my rsyslog.conf syntax before restarting the service with Ansible?

I validate the syntax as part of the template task itself, via the validate parameter with the command rsyslogd -N1 -f %s, where %s is replaced with the path to the staged configuration file. If the syntax is invalid, the Ansible task fails before the file ever reaches /etc/rsyslog.conf. You can also create a separate validation task if you want an extra level of comfort, but I consider inline validation sufficient.
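If you do want that separate check, a standalone validation task can be sketched like this; it assumes the configuration is already deployed at /etc/rsyslog.conf:

```yaml
- name: Validate the live rsyslog configuration
  ansible.builtin.command: rsyslogd -N1 -f /etc/rsyslog.conf
  changed_when: false        # a syntax check never changes the system
```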

Can Ansible manage logging facility levels conditionally based on the operating system family?

Absolutely. Inside the Jinja2 template I use a simple condition such as {% if ansible_os_family == 'Debian' %} to pick the right modules or facility filters for Debian-family hosts. The OS-specific values live in Ansible inventory group variables and are referenced from the template, so the conditionals never complicate the playbook itself.
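A short excerpt from such a template gives the idea; the choice of input modules here is an illustrative sketch, not my exact configuration:

```jinja
{% if ansible_os_family == 'RedHat' %}
module(load="imjournal" StateFile="imjournal.state")   # read from journald
{% else %}
module(load="imuxsock")                                # classic /dev/log socket
{% endif %}
*.* @@{{ log_server }}:514
```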

What is the best way to handle temporary network disconnections to the destination server?

Rsyslog supports the use of disk-assisted queues to hold messages while the remote server is down. I define a queue right in the forwarding action:

*.* action(type="omfwd" target="{{ log_server }}" port="514" protocol="tcp"
        queue.filename="fwdRule1" queue.type="LinkedList"
        queue.saveonshutdown="on" queue.maxdiskspace="1g"
        queue.dequeuebatchsize="256" )

Disk-assisted queueing spools logs to local disk during a link outage and replays them to the destination once the link comes back up. It is the only method I have found that keeps logging intact through something like a switch reboot. If you need to customize the queue further, every option is covered in the rsyslog action documentation.

That’s it. You can roll this recipe out to a hundred servers in a matter of minutes, with peace of mind that you will never again walk into a server complaining “No space left on device”.
