Automating Logical Volume Management (LVM) Disk Expansion via Ansible Playbooks

I remember staring at a dashboard at midnight, watching a critical application go down because /var/lib/mysql had hit 100% disk usage. Months earlier I had cobbled together a Bash LVM resize script, and it had failed silently because I mistakenly omitted the pvresize step for the newly added disk. That night cost the business real money and cost me a week of credibility. Since then I have sworn off ad-hoc storage fixes; I now handle every LVM expansion with a fully idempotent Ansible playbook that behaves the same way every time, no matter how tired I am.

Quick Summary

  • A playbook to automate LVM expansion safely and without drama.
  • Uses the community.general.lvg, lvol, and filesystem modules.
  • Works online (i.e. no unmounting), with both XFS and ext4, without the risk of kernel panics.
  • Identifies unusual superblock conditions and details recovery procedures.
  • All logic is idempotent; you can run the playbook repeatedly and it will always converge to the same result.

The Need for DevOps LVM Automation

Resizing by hand does not just waste time; at scale it introduces configuration drift, typos, and the kind of late-night emergencies that leave both engineers and managers stressed.

The Danger of Manual Storage Interventions

I once watched a junior sysadmin run lvextend against the wrong logical volume on a production database node because two volumes had confusingly similar names. Fortunately they never ran the filesystem resize, but the volume group metadata was left with a half-applied extent mapping and we had to restore from backup. Manual operations have no safety net: mistype a command or mix up two logical volumes and you own the consequences.

Even when you type the correct commands, you run into shell-level inconsistencies: lvresize may be aliased on one server, while another server ships an e2fsprogs version too old to support online resizing. Small differences like these turn into large incidents when stretched across a 300-node environment.

Transitioning from Bash Scripts to Idempotent Playbooks

A Bash script that simply SSHes to servers and runs pvcreate; vgextend; lvextend; resize2fs is fragile. It has no way to inspect the current state, no way to handle partial failures, and it cannot be rerun without producing errors.

So I moved everything to Ansible. The playbook checks how much free space the volume group currently has, compares it with the desired size, and only touches the system when a change is actually needed. Because it is idempotent, you can schedule the same playbook to run overnight and not worry about it. A minimal pre-flight check in that spirit is sketched below.
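
To illustrate the "check before you touch anything" idea, here is a minimal pre-flight sketch, not the full production playbook; the host group and volume group names are the ones used later in this article:

- name: Pre-flight check before any LVM change
  hosts: lvm_expand
  become: yes
  tasks:
    - name: Read the free space currently available in the volume group
      ansible.builtin.command: vgs --noheadings --units g -o vg_free vg_data
      register: vg_free
      changed_when: false
      failed_when: false   # the VG may not exist yet on a fresh node

    - name: Report what an expansion run would have to work with
      ansible.builtin.debug:
        msg: "Free space in vg_data: {{ vg_free.stdout | trim }}"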

Why I Chose This Method

I first tried wrapping lvextend with Python Fabric, but keeping up with the edge cases was painful, and Chef and Puppet were out of scope for my team. What sold me on Ansible was the community.general.lvg module and its sibling modules: they expose a declarative interface, so the playbooks read like specifications rather than scripts, and their built-in idempotency means any on-call engineer can look at the YAML and immediately understand what will happen.

Prerequisites for Linux Disk Provisioning Automation

Automating disk provisioning on Linux needs two things: a control node that can reach the target nodes, and target nodes whose storage is in a known, clean state.

Target Node Storage Requirements

The target node must have at least one available block device that can be added to the existing volume group. I perform a quick check of the node’s available block devices using the command lsblk to see if there is an unpartitioned device.

$ lsblk
NAME               MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                  8:0    0   60G  0 disk
├─sda1               8:1    0    1G  0 part /boot
└─sda2               8:2    0   59G  0 part
  ├─vg_sys-lv_root  253:0  0   20G  0 lvm  /
  └─vg_sys-lv_var   253:1  0   10G  0 lvm  /var
sdb                  8:16   0  200G  0 disk  <-- Completely free, unpartitioned

If I see a disk at /dev/sdb with no partitions on it, I can confidently hand that disk to the volume group. Depending on your layout, the Ansible playbook will either grow the existing vg_sys volume group with this disk or create a new volume group.
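
As a belt-and-braces step, a guard task can refuse to run the expansion if the disk is not actually clean. This is a minimal sketch that assumes facts have been gathered and that the new disk really is /dev/sdb:

- name: Refuse to continue if /dev/sdb is missing or already partitioned
  ansible.builtin.assert:
    that:
      - "'sdb' in ansible_facts.devices"
      - ansible_facts.devices.sdb.partitions | length == 0
    fail_msg: "/dev/sdb is not a clean, unpartitioned disk; aborting the LVM expansion"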

Ansible Control Node Setup

On the control node, I build an inventory of the target nodes that need disk provisioning. A minimal inventory looks like this:

[lvm_expand]
web-prod-01   ansible_host=10.42.10.21
web-prod-02   ansible_host=10.42.10.22
db-prod-01    ansible_host=10.42.10.51

[lvm_expand:vars]
ansible_user=deploy
ansible_python_interpreter=/usr/bin/python3

As with any Ansible setup, I make sure the deploy user can authenticate with an SSH key and has passwordless sudo for the LVM operations. The control node only needs Ansible and this inventory; no agents or daemons are required on the target nodes.
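
Before the first real run, I sanity-check connectivity and privilege escalation with a couple of ad-hoc commands; the inventory file name here is my own assumption:

# Confirm SSH connectivity to every node in the group
ansible -i inventory.ini lvm_expand -m ansible.builtin.ping

# Confirm passwordless sudo by listing volume groups as root
ansible -i inventory.ini lvm_expand -b -m ansible.builtin.command -a "vgs --noheadings"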

How to Automate LVM Expansion with an Ansible Playbook

Next, let’s see how the Ansible playbook works to automate the process of adding physical disk space to a volume group.

Utilizing the Ansible LVG Module

Before we can grow the logical volume, the volume group has to know about the new disk. The community.general collection provides the lvg module, which adds physical disks to an existing volume group. Once the new disk is in the volume group, the next step is to extend the logical volume into the new space, and the final step is to grow the filesystem to fill the expanded logical volume. If the disk has not yet been initialized as a physical volume (PV), the module takes care of that automatically.

- name: Ensure volume group vg_data has the new disk
  community.general.lvg:
    vg: vg_data
    pvs: /dev/sdb
    state: present
  become: yes

This task adds /dev/sdb to vg_data; if vg_data does not exist yet, it is created from scratch. The task is idempotent and can be run repeatedly without harm.

A Practical lvcreate Ansible Example

With the volume group grown, the next step is to extend the logical volume using the community.general.lvol module. I use size: "+100%FREE" so the logical volume absorbs all of the newly available space. I also run the playbook in check mode (--check) first, so I know exactly what this task will change before it touches anything.

- name: Extend logical volume to use all free space in the VG
  community.general.lvol:
    vg: vg_data
    lv: lv_storage
    size: "100%FREE"
    resizefs: no
  become: yes

I set resizefs: no because I handle the filesystem step separately. Keeping the two operations apart makes it easier to see exactly what LVM changed and to observe the filesystem step on its own.

Implementing the Ansible Filesystem Module for Resizing

Finally, I grow the filesystem online, without unmounting, using the community.general.filesystem module. Given the filesystem type, the module calls the right tool (resize2fs for ext4, xfs_growfs for XFS). Setting resizefs: yes tells it to extend the filesystem to the full size of the underlying block device.

- name: Online resize of filesystem on /dev/vg_data/lv_storage
  community.general.filesystem:
    fstype: "{{ ansible_facts.lvm.vgs.vg_data.lvs.lv_storage.fstype | default('ext4') }}"
    dev: /dev/vg_data/lv_storage
    resizefs: yes
  become: yes

The playbook looks up the actual filesystem type from the mount facts (ansible_facts.mounts), so the same playbook works whether the volume is ext4 or XFS; you do not need a separate playbook per server.
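
With the three tasks in place, I preview the run first and then apply it. The playbook and inventory file names below are my own conventions, not anything required:

# Preview what would change, without touching the nodes
ansible-playbook -i inventory.ini expand_lvm.yml --check --diff

# Apply for real once the preview looks right
ansible-playbook -i inventory.ini expand_lvm.yml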

Achieving Dynamic Volume Resizing Without Downtime

When the mount point is used for production purposes, unmounting the volume would halt all production traffic. Fortunately, modern kernels have support for dynamically resizing ext4 and XFS file systems while they are mounted.

XFS vs. Ext4 Expansion Logic

XFS can only be grown; it cannot be shrunk. Growing it online is straightforward: xfs_growfs reads the new geometry from the block device and expands the filesystem in place, with no unmounting or downtime. Ext4 can also be grown online, provided the block size and filesystem features support it; the filesystem module makes that determination for you, so there is no need for separate code paths for ext4 and XFS.
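
For reference, these are roughly the manual commands the filesystem module wraps, using the example device and mount point from this article:

# XFS grows through the mount point
xfs_growfs /data

# ext4 grows through the block device and supports online growth
resize2fs /dev/vg_data/lv_storage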

Triggering a Resize2fs Automated Run

After the playbook runs, I want hard confirmation that the resize happened and nothing was damaged in the process. The verbose output below shows the exact resize2fs invocation and the new block count, which is conclusive evidence that the filesystem now spans the entire logical volume.

TASK [Resize filesystem on /dev/vg_data/lv_storage] *****************************
changed: [web-prod-01] => changed=true
  msg: filesystem resized successfully
  cmd: resize2fs /dev/vg_data/lv_storage
  stdout: |-
    resize2fs 1.46.2 (28-Feb-2021)
    Filesystem at /dev/vg_data/lv_storage is mounted on /data; on-line resizing required
    old_desc_blocks = 2, new_desc_blocks = 4
    The filesystem on /dev/vg_data/lv_storage is now 52428800 (4k) blocks long.

The "on-line resizing required" line followed by the new block count is precisely what I want to see; it means the filesystem was grown while /data stayed mounted and there was no service impact.
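
Even with that output, I like to spot-check the result from the shell afterwards; a quick manual verification using this article's example names looks like:

# The logical volume and the mounted filesystem should now agree on the new size
lvs vg_data
df -h /data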

Edge Case: Rescuing Corrupted Superblocks Post‑Expansion

Every once in a while, the block device gets expanded but the filesystem's superblocks or metadata end up out of step with it. This can happen after an abrupt crash mid-resize, or when the storage layer suddenly presents a smaller device after a snapshot incident; either way, the kernel gets confused about the state of the filesystem.

Diagnosing “Bad Magic Number” Errors

If you try to mount the volume after a misaligned expansion, syslog fills with messages like the example below. The <-- marker I added points at the superblock complaint.

kernel: XFS (dm-4): bad magic number - superblock does not match  <--
kernel: XFS (dm-4): Internal error xfs_sb_read_verify at line 730.
kernel: XFS (dm-4): Corruption detected. Unmount and run xfs_repair.

These messages mean the kernel found garbage where it expected a superblock. When your filesystem becomes inaccessible like this, there is no need to panic; the data is usually still there, and you simply need to point the tools at a backup superblock.

Restoring from Backup Superblocks

For most ext4 filesystems, running e2fsck with the -b option and a backup superblock number (typically 32768) does the trick; a sketch of the ext4 procedure appears after the XFS walkthrough below. For XFS, you run xfs_repair and let it scan for a secondary superblock. The output below shows what a successful xfs_repair looks like after the block device fell out of alignment.

# xfs_repair -n /dev/vg_data/lv_storage
Phase 1 - find and verify superblock...
superblock read failed, offset 0, size 131072, ag 0, rval -1
fatal error -- Invalid argument
# xfs_repair /dev/vg_data/lv_storage
Phase 1 - find and verify superblock...
found candidate secondary superblock...
superblock valid, filesystem was not unmounted cleanly
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
done.

After xfs_repair finishes, I can run the filesystem module again; the volume mounts without error and all of the data is back. I have learned the hard way to make sure the block device is consistent before expanding.
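
For the ext4 case mentioned above, the recovery is similar in spirit; block 32768 is the usual first backup superblock on 4K-block ext4 filesystems, but verify the locations on your own filesystem before relying on them:

# Unmount before repair; e2fsck must not run against a mounted filesystem
umount /data

# Point e2fsck at a backup superblock (32768 is typical for 4K blocks;
# other common locations are 98304, 163840, and 229376)
e2fsck -b 32768 /dev/vg_data/lv_storage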

Frequently Asked Questions

Can I shrink an LVM volume using the same Ansible modules?

Not safely with these modules alone. Shrinking requires you to unmount the filesystem and shrink it first, and only then reduce the logical volume; the lvol module's size parameter will not orchestrate that sequence for you, so a shrink attempt will simply fail or put data at risk. If you genuinely need to shrink, script the steps explicitly with the command module and test the playbook thoroughly first; a minimal manual sketch is shown below. Most people just add more disk and never need to shrink anything.
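
For completeness, this is roughly the manual sequence such a playbook would have to script for an ext4 volume; XFS cannot be shrunk at all, and the size here is purely illustrative:

# Shrinking ext4 requires the filesystem to be offline
umount /data
e2fsck -f /dev/vg_data/lv_storage          # mandatory integrity check before shrinking
resize2fs /dev/vg_data/lv_storage 80G      # shrink the filesystem first
lvreduce -L 80G /dev/vg_data/lv_storage    # then reduce the logical volume to match
mount /data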

How does Ansible handle physical volume (PV) discovery automatically?

Out of the box, Ansible does not discover physical volumes for you; the lvg module simply takes a list of devices as input. To automate discovery, use the hardware facts: loop over ansible_facts.devices, keep the disks whose partitions dict is empty and whose size exceeds a threshold you choose, and pass the resulting list into the lvg task, as sketched below. That is how you get true Linux disk provisioning automation without hardcoding device names.
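
A hedged sketch of that discovery pattern is shown here; the device-name pattern and the choice to only report the list (rather than feed it straight into lvg) are my own:

# Note: a disk already used as a whole-disk PV also has no partitions; filter further if needed.
- name: Build a list of unpartitioned candidate disks from the hardware facts
  ansible.builtin.set_fact:
    free_disks: >-
      {{ ansible_facts.devices | dict2items
         | selectattr('key', 'match', '^(sd|vd|nvme)')
         | selectattr('value.partitions', 'equalto', {})
         | map(attribute='key')
         | map('regex_replace', '^', '/dev/')
         | list }}

- name: Show what would be handed to the lvg task
  ansible.builtin.debug:
    var: free_disks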

Will resizing an active root partition via Ansible cause a kernel panic?

Extending the root filesystem will NOT cause a kernel panic for either ext4 or XFS; both support growing while mounted. Ansible's filesystem module performs an online grow and does NOT unmount the partition, so you can safely extend the root filesystem from a playbook while the server keeps running. Just make sure the physical volume and logical volume are extended before the filesystem module runs; that is the order your playbook should follow, as in the sketch below. Shrinking the root filesystem, on the other hand, is one of the best ways to ruin a morning!
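
As a sketch of that ordering, using the vg_sys/lv_root layout from the lsblk output earlier (the /dev/sda2 device name and the ext4 assumption are illustrative):

# 1. Grow the physical volume after the underlying partition or disk was enlarged
- name: Resize the physical volume to use the enlarged partition
  ansible.builtin.command: pvresize /dev/sda2
  register: pvresize_out
  changed_when: "'changed' in pvresize_out.stdout"   # heuristic; adjust for your LVM version's output
  become: yes

# 2. Extend the logical volume into the new space
- name: Extend the root logical volume
  community.general.lvol:
    vg: vg_sys
    lv: lv_root
    size: "+100%FREE"
    resizefs: no
  become: yes

# 3. Grow the filesystem online, last
- name: Grow the root filesystem
  community.general.filesystem:
    fstype: ext4          # assumption; use xfs if your root filesystem is XFS
    dev: /dev/vg_sys/lv_root
    resizefs: yes
  become: yes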
