
Resolving Unreachable Host Errors and SSH Connection Refused in Ansible Inventories

After a kernel patch was installed, my deployment pipeline started failing with Ansible errors. Every playbook returned the same unreachable message when run against a host: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Connection refused", "unreachable": true}. Even though I could SSH to the host manually from my laptop, Ansible still reported the connection as refused. After several hours of digging through odd SSH multiplexing leftovers, I eventually figured out that the host was not "down" at all; the connection was merely misconfigured on my end.

What started as a simple troubleshooting evening quickly turned into an extensive investigation into all the different reasons Ansible can throw an unreachable error. I am publishing this article so that you can finish your troubleshooting session in less than half the time it took me.

Quick Summary

  • The unreachable error means Ansible never established an SSH session with the host; it is a connection-level failure, not a failed task.
  • The most common causes of silent dropouts are firewall rules blocking port 22, strict host key checking, and stale persistent SSH control sockets.
  • The ansible_ssh_common_args variable and Ansible's timeout settings let you route connections through jump hosts and survive slow network paths.
  • Detailed -vvvv logs and deleting stale ControlSockets resolve the stealthier edge cases.

Understanding the Ansible Unreachable Host Error

Ansible marks a host as unreachable when it cannot establish an SSH connection to run a module. Unlike a failed task, this means no handshake was ever completed between the control node and the target node. Ansible can report a host as unreachable for many reasons, including blocked ports, DNS issues, and SSH misconfiguration. The challenge is to identify the specific point at which the SSH handshake fails. Ansible uses the native SSH (Secure Shell) client already present in your operating system, so if the ssh client can reach the remote machine, Ansible can make the same connection, provided your inventory settings and the state of your control sockets are correct.
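
Because Ansible rides on the same ssh binary, a quick sanity check is to run the ping module against the suspect host and compare it with a plain SSH call from the same control node (the host and user below are placeholders):

$ ansible db-east.internal -i inventory -m ansible.builtin.ping
$ ssh ansible@db-east.internal 'echo ok'

If the raw ssh command succeeds while the module fails, the fault lies in your inventory variables or control socket state rather than the network.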

What Didn’t Work For Me

When I first tried to troubleshoot this connection issue, my initial move was to increase the timeout value in ansible.cfg to 300 seconds. All that did was make the playbook hang longer before it failed. My next attempt was to turn off host key verification globally, which silenced the warnings but did not touch the underlying cause: the firewalld configuration on the target had its interface assigned to a zone that did not allow the ssh service from the Ansible control machines. The picture became much clearer once I started breaking the connection down into the layers that could fail: network, authentication, and configuration.

Common Causes of SSH Connection Refused

There are several reasons you may receive an SSH connection refused message, but the most common is a firewall blocking access to port 22. On Red Hat based systems, firewalld's default public zone normally allows the ssh service, but hardened images and custom zone assignments frequently remove it. I have also seen cloud security groups that permit SSH only from an allow-list of public IP addresses, and your Ansible control machine may simply not be on that list.
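
If you suspect firewalld, you can verify and fix the zone configuration directly on the target; the commands below are a sketch and assume the interface sits in the public zone:

$ sudo firewall-cmd --get-active-zones
$ sudo firewall-cmd --zone=public --list-services
$ sudo firewall-cmd --zone=public --add-service=ssh --permanent
$ sudo firewall-cmd --reload

The first two commands show which zone your interface belongs to and whether ssh is allowed there; the last two open the service persistently and apply the change.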

You can check to see if the SSH daemon is actually listening on the appropriate network interface and port by running the following command on the target machine (or ask someone else to run it):

$ sudo ss -tlnp | grep :22
LISTEN  0  128  0.0.0.0:22  0.0.0.0:*  users:(("sshd",pid=812,fd=3))

If all you see is 127.0.0.1:22, the SSH daemon is bound only to the localhost interface, so connection attempts from the Ansible control machine will never succeed. If the command produces no output at all, the sshd service is not running on the target system.
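
To confirm how sshd is actually bound, dump its effective configuration; the 127.0.0.1 value shown is just the problematic example from above:

$ sudo sshd -T | grep -i listenaddress
listenaddress 127.0.0.1:22

Changing ListenAddress in /etc/ssh/sshd_config to 0.0.0.0 (or the correct interface address) and restarting sshd resolves the localhost-only binding.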

Strict Host Key Checking Failures

When SSH establishes a connection to a remote host, it verifies the remote host's key against known_hosts. If verification fails, SSH rejects the connection with a message like "Host key verification failed", and Ansible's high-level output usually reports this type of failure as "unreachable".

I have run into this when recycling cloud instances. For example, if a new instance is created and gets the same IP address as a previous instance (which has since been decommissioned), the new instance will generate a new host key that will differ from the old host key. Therefore, even though the new instance is on the same network and has the same IP address as the old instance, the old host key will still be in the Ansible local known_hosts file and the connection will be rejected.

One of the quickest ways to work around this is to pass ansible_ssh_common_args='-o StrictHostKeyChecking=no', but for production use, I recommend creating a known_hosts file in advance using either the ssh-keyscan utility or the ansible.builtin.known_hosts module.
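
Here is a minimal sketch of that pre-population step as a play run on the control node; the db_servers group matches the inventory example later in this article, and the ed25519 key type is an assumption you may need to adjust:

- name: Pre-populate known_hosts on the control node
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Record each db server's host key
      ansible.builtin.known_hosts:
        name: "{{ item }}"
        key: "{{ lookup('ansible.builtin.pipe', 'ssh-keyscan -t ed25519 ' ~ item) }}"
      loop: "{{ groups['db_servers'] }}"

Run it once before the real playbook and host key verification can stay fully enabled.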

High Network Latency and Routing Drops

When an SSH handshake takes longer than the configured connection timeout, Ansible marks the host as unreachable. High network latency (the round-trip delay between client and server) from satellite links, cross-region VPNs, or overloaded jump hosts can add seconds to connection setup. Ansible exposes a timeout setting that controls how long the SSH client waits while establishing the connection to the target host. If the connection drops intermittently, packet loss during the SSH key exchange can cause the TCP connection to be reset, which some clients surface as a "Connection refused" error; this pattern can also indicate a routing black hole.
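
When I know a path is slow, I raise the connect timeout for just that group instead of globally; this inventory fragment is a sketch with example names and values:

[satellite_sites]
remote-site-1.internal

[satellite_sites:vars]
ansible_ssh_common_args='-o ConnectTimeout=60 -o ServerAliveInterval=30'

ServerAliveInterval also keeps long-lived sessions from being dropped by intermediate devices.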

How to Troubleshoot Ansible Unreachable Host SSH Connection

So how do you start troubleshooting Ansible unreachable host SSH connections in a systematic manner? I suggest taking a layered approach to your troubleshooting process.

Verifying SSH Service Status on the Target Node

The first step is to check whether the SSH daemon is actually running. You cannot rely on Ansible's error message for this; it will only report "refused", not the daemon's status. Instead, connect through a separate out-of-band method (a console, an IPMI connection, or a cloud shell) and execute the following:

$ systemctl status sshd
● sshd.service - OpenSSH Daemon
   Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled)
   Active: active (running) since Mon 2026-04-28 14:05:22 UTC
 Main PID: 812 (sshd)
   CGroup: /system.slice/sshd.service

If the SSH daemon is not running or is inactive, restart it with systemctl and get back to using Ansible. If the daemon is running and Ansible still cannot connect, the problem is most likely a firewall blocking the connection, so turn your attention to firewall rules or routing.
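
For reference, the restart and a quick check of the daemon's recent log output look like this:

$ sudo systemctl restart sshd
$ journalctl -u sshd --since "10 minutes ago"

If sshd refuses to start, the journal output usually names the offending sshd_config line.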

Adjusting Inventory Timeout Settings

Ansible's default connection timeout (10 seconds) can be too short for a bad network path. For this reason, I edit ansible.cfg to raise it to an acceptable level. On the control machine this file typically lives at /etc/ansible/ansible.cfg, though a project-local ansible.cfg takes precedence if one exists.

[defaults]
timeout = 30

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
pipelining = True
timeout = 30

The timeout key under [defaults] is Ansible's default connection timeout and should be set to a duration you consider acceptable. The timeout under [ssh_connection] applies specifically to the SSH connection plugin and overrides the default; it determines how long the SSH client waits for the initial connection before giving up. I set this to around 30 or 60 seconds for high-latency paths: long enough to establish a connection over a slow link, but short enough that Ansible never hangs indefinitely waiting for one.
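
You can also raise the connection timeout for a single run without touching ansible.cfg, which is handy when only one playbook crosses the slow link:

$ ansible-playbook -i inventory playbook.yml -T 60

The -T (or --timeout) flag overrides the configured connection timeout, in seconds, for that invocation only.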

Routing Traffic via ansible_ssh_common_args

If you need to connect through a jump host, or route the connection over a path that standard routing will not take, the ansible_ssh_common_args variable is very helpful. Set it in your inventory file and use a ProxyCommand to establish the connection.

[db_servers]
db-east.internal ansible_host=10.0.40.12 ansible_ssh_common_args='-o ProxyCommand="ssh -W %h:%p bastion.example.com"'

The ProxyCommand entry tells the SSH client underlying Ansible to tunnel all traffic through bastion.example.com. Every other connection parameter (user, key, timeout) is supplied exactly as it would be without the proxy. For ephemeral build agents I also add -o StrictHostKeyChecking=accept-new in this context. The Behavioral Inventory Parameters documentation lists every option you can tweak if you need a more exotic configuration.
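
If every host in a group sits behind the same bastion, you can set the variable once at the group level instead of per host; this fragment is a sketch reusing the bastion from the example above:

[db_servers:vars]
ansible_ssh_common_args='-o ProxyCommand="ssh -W %h:%p bastion.example.com" -o StrictHostKeyChecking=accept-new'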

Advanced Debugging and Edge Case Workarounds

Once you’ve ruled out all of the usual suspects, it’s time to get serious with your debugging tools!

Unmasking Errors with the -vvvv Flag

Ansible's -vvvv option dumps the entire SSH interaction between Ansible and the remote machine, showing exactly which command was run and where it failed. Whenever I get an error message that is too generic, I append this option:

$ ansible-playbook -i inventory playbook.yml -vvvv
...
<10.0.40.12> ESTABLISH SSH CONNECTION FOR USER: ansible
<10.0.40.12> SSH: EXEC ssh -C -o ControlMaster=auto -o ControlPersist=60s -o Port=22 -o 'IdentityFile="/home/ansible/.ssh/id_rsa"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=publickey -o PasswordAuthentication=no -o User=ansible -o ConnectTimeout=10 -o ControlPath=/home/ansible/.ansible/cp/1f3a8b92a4 10.0.40.12 '/bin/sh -c '"'"'echo ~ && sleep 0'"'"''
<10.0.40.12> (255, '', 'ssh: connect to host 10.0.40.12 port 22: Connection refused\r\n')  <-- Connection refused at TCP level

This output tells you exactly which target machine, port, and options are in play. If you see Connection refused, the failure is at the TCP layer, and your focus should shift to network and firewall issues rather than SSH configuration. If you instead see reusing existing SSH connection, you are hitting a stale ControlSocket.

Resolving Stale ControlSockets After Network Changes

By default Ansible multiplexes multiple simultaneous connections with ControlMaster. A ControlSocket is stored in ~/.ansible/cp/ and it is re-used for the same remote machine across multiple plays. When the network environment changes (for example, switching VPNs or the IP address of the target changes), those ControlSockets become invalid. If you attempt to connect to the target machine after the network change, Ansible will attempt to re-use the stale ControlSocket, will hang while waiting for a response, and then declare the target unreachable with no explanation.

The solution is simple: delete the old ControlSockets and allow the next Ansible run to recreate them:

$ rm -rf ~/.ansible/cp/*

Do this before digging into SSH keys or firewall rules whenever the network environment has just changed. I now run a cleanup step that removes stale ControlSockets at the start of any connectivity-related playbook.
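
Here is a rough sketch of that cleanup as a play you can prepend to connectivity-sensitive playbooks; the path assumes the default control path directory shown in the -vvvv output earlier:

- name: Clear stale SSH control sockets
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Remove the multiplexing socket directory
      ansible.builtin.file:
        path: "{{ lookup('ansible.builtin.env', 'HOME') }}/.ansible/cp"
        state: absent

Ansible recreates the directory on the next connection, so deleting it is safe.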

Best Practices to Prevent Connection Dropouts

Automating Authorized Key Distribution

Hardcoding passwords is a recipe for disaster when automating your infrastructure. I use the ansible.posix.authorized_key module in a bootstrapping play to deposit the control node's public key onto each target node; every connection made afterwards uses key-based authentication.
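
A minimal bootstrap play along those lines; the remote user name and key path are assumptions you will need to adjust, and the very first run still needs password authentication (invoke with -k so Ansible prompts for it):

- name: Bootstrap key-based authentication
  hosts: all
  become: true
  tasks:
    - name: Install the control node's public key
      ansible.posix.authorized_key:
        user: ansible
        state: present
        key: "{{ lookup('ansible.builtin.file', '/home/ansible/.ssh/id_ed25519.pub') }}"

After this play succeeds once, every subsequent connection authenticates with the key.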

Standardizing SSH Multiplexing Configurations

Rather than letting Ansible set the defaults, I specify ControlMaster and ControlPersist in the control user's SSH configuration file, so that both manual SSH sessions and Ansible share the same multiplexed connections.

Host *
  ControlMaster auto
  ControlPath ~/.ssh/cm/%r@%h:%p
  ControlPersist 300
  IdentitiesOnly yes
  ServerAliveInterval 60

This keeps both Ansible and my terminal sessions fast and consistent. Because Ansible drives the system ssh client, it picks up this configuration automatically, so you don't have to repeat it in ansible_ssh_common_args every time.
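
One caveat: the ssh client will not create the ControlPath directory for you, so create it once before the first multiplexed connection:

$ mkdir -p ~/.ssh/cm

Without it, every connection fails immediately with a "cannot bind to path" style error.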

Frequently Asked Questions

Why does Ansible report a node as unreachable when manual SSH works?

A common cause is a difference in environment variables or configuration files. Your manual session may have used an SSH agent that Ansible does not know about, or relied on a known_hosts entry in your home directory while Ansible runs as a different user. Running ansible hostname -m ping -vvvv alongside ssh -vvv user@hostname shows exactly where the paths diverge.

How do I safely bypass host key verification for dynamic cloud inventories?

Only set host_key_checking = False in ansible.cfg for inventories made up of ephemeral hosts. A better approach is ansible_ssh_common_args="-o StrictHostKeyChecking=accept-new", which records keys for brand-new hosts automatically while still rejecting changed keys. You keep the convenience of automatic enrollment without disabling host key verification entirely.

What is the technical difference between a connection refused and a connection timeout?

A "connection refused" means the remote host actively answered the client's TCP SYN with a RST (reset) packet, usually because nothing is listening on port 22 or a firewall REJECT rule is filtering it. A "connection timeout" means the SYN never received any response, and the client gave up after the timeout period. The first tells you the host is reachable but the port is closed; the second points to network latency, a DROP rule, or a routing problem that prevented reaching the IP address at all. Understanding this distinction saves you from pursuing the wrong layer.
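
You can observe the two behaviors directly with netcat before involving Ansible at all (the hostname is a placeholder; exact wording varies by netcat flavor):

$ nc -vz db-east.internal 22

An immediate "Connection refused" means the RST came back and the port is closed; a long pause ending in a timeout means the SYN vanished into a DROP rule or a routing hole.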
