Ansible Patch Management: Production Checklist and Maturity Roadmap

Eran Goldman-Malka · November 27, 2025

Ansible DevOps

The difference between a patching automation that survives its first production incident and one that doesn’t usually comes down to what you verified before you started—not the playbook itself. Checklists are not bureaucracy; they are the accumulated cost of previous failures encoded as procedure. This final post consolidates the series into an operational reference.

Pre-Patching Checklist

Work through this before every scheduled patch window. For emergency patches, compress the timeline—but skip nothing.

Inventory and connectivity:

Inventory is current and matches the authoritative CMDB or cloud source
ansible all -m ping returns SUCCESS for all in-scope hosts
No hosts are UNREACHABLE that should be reachable
Dynamic inventory (if used) is returning the expected host count
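
The connectivity and host-count items above can also be run as a small pre-flight play. This is a minimal sketch rather than a playbook from this series: the file name and the `expected_host_count` variable are hypothetical, and the count would come from your CMDB or cloud source. It assumes Linux hosts; Windows hosts would need `ansible.windows.win_ping` instead.

```yaml
# preflight.yml (hypothetical file name)
- name: Pre-flight inventory and connectivity check
  hosts: all
  gather_facts: false
  tasks:
    - name: Confirm the inventory returned the expected number of hosts
      ansible.builtin.assert:
        that:
          - ansible_play_hosts_all | length == expected_host_count | int
        fail_msg: >-
          Inventory has {{ ansible_play_hosts_all | length }} hosts,
          expected {{ expected_host_count }}.
      run_once: true

    - name: Confirm every host answers over the normal connection path
      ansible.builtin.ping:
```

One way to run it is `ansible-playbook preflight.yml -i inventory/production/hosts.yml -e expected_host_count=<N>`, where `<N>` is the count from your source of truth.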

Backups and snapshots:

VM snapshots taken for critical hosts within the past 24 hours
Database backups verified and tested (not just triggered)
Rollback procedure documented and accessible offline

Change management:

Maintenance window approved and communicated to stakeholders
Downstream dependencies identified (load balancer draining, dependent services)
On-call engineer confirmed available for the duration of the window

Playbook readiness:

Playbook tested against staging environment within the past week
--check --diff run against production inventory reviewed and approved
serial batch size set appropriately (never patch all at once)
patch_reboot_allowed variable set correctly per group
Ansible Vault password available on the control node
CI/CD secrets (SSH key, vault password) rotated if they are past your rotation policy
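
To illustrate the `serial` and `patch_reboot_allowed` items, a rolling play might look like the sketch below. The `linux` group name, the batch sizes, and the Debian-style reboot marker file are assumptions for illustration; only the `patch_reboot_allowed` variable name comes from this checklist.

```yaml
- name: Patch Debian/Ubuntu hosts in rolling batches
  hosts: linux              # assumed group name
  become: true
  serial:                   # one canary host first, then 25% at a time
    - 1
    - "25%"
  max_fail_percentage: 0    # abort the run if any host in a batch fails
  tasks:
    - name: Apply available package upgrades
      ansible.builtin.apt:
        update_cache: true
        upgrade: safe

    - name: Check whether the distribution flagged a reboot
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_marker

    - name: Reboot only where the group allows it
      ansible.builtin.reboot:
        reboot_timeout: 900
      when:
        - patch_reboot_allowed | default(false) | bool
        - reboot_marker.stat.exists
```

Defaulting `patch_reboot_allowed` to false means a group that forgot to set it is patched but never rebooted, which is the safer failure mode.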

During the Patch Run

First 5 minutes:

Run --check --diff one final time immediately before applying. Confirm no unexpected changes appeared since the pre-check.
Start with the canary group or a single representative host before the full batch.

During execution:

Monitor the first batch; do not walk away
Confirm services are healthy after each batch before proceeding
Watch for failed or unreachable hosts — stop and investigate before continuing
Keep the maintenance window communication channel open
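
The per-batch health confirmation can be automated so that a bad batch stops the run on its own. The fragment below is a sketch under stated assumptions: the `nginx` service, port 8080, and the `/healthz` path are hypothetical stand-ins for whatever your application exposes. Placed in the `post_tasks` of a play that uses `serial`, the checks run for each batch before the next one starts.

```yaml
# Fragment: add to a play that already sets `serial`
post_tasks:
  - name: Gather service state
    ansible.builtin.service_facts:

  - name: Fail the batch if the critical service is not running
    ansible.builtin.assert:
      that:
        - ansible_facts.services['nginx.service'].state == 'running'
      fail_msg: "nginx is not running on {{ inventory_hostname }}"

  - name: Wait for the application health endpoint to return 200
    ansible.builtin.uri:
      url: "http://{{ inventory_hostname }}:8080/healthz"
      status_code: 200
    register: health
    retries: 5
    delay: 10
    until: health.status == 200
    delegate_to: localhost   # probe from the control node, as a client would
```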

If something fails:

# Retry only failed hosts from the last run
ansible-playbook patch-linux.yml -i inventory/production/hosts.yml \
  --limit @/tmp/patch-linux.retry

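Ansible writes a .retry file named after the playbook when hosts fail, but only if retry files are enabled; the feature is off by default in current releases. Set retry_files_enabled = True in ansible.cfg, and set retry_files_save_path = /tmp if you want the file at the path shown above. Without a save path, the file is written next to the playbook.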

Post-Patching Checklist

Immediate (within 30 minutes of completion):

All hosts returned ok or changed — no remaining failed or unreachable
Critical services validated on every patched host
Reboot-required hosts have rebooted and are back online
Application health checks passing (HTTP endpoints, synthetic monitors)

Patch level verification:

# Confirm no package updates are pending (Debian/Ubuntu); each host should print 0
ansible all -i inventory/production/hosts.yml \
  -m shell -a "apt-get --just-print upgrade 2>/dev/null | grep ^Inst | wc -l" \
  --become

# Check the running kernel version (confirms the reboot picked up the new kernel)
ansible all -i inventory/production/hosts.yml \
  -m setup -a "filter=ansible_kernel"
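
The same verification can live in a playbook, so that it fails loudly instead of relying on someone reading the counts. This is a sketch for Debian/Ubuntu only; it reuses the ad-hoc pipeline above and adds a check for the `/var/run/reboot-required` marker file those distributions write.

```yaml
- name: Verify patch level after the run
  hosts: all
  become: true
  gather_facts: false
  tasks:
    - name: Count packages that would still be upgraded
      ansible.builtin.shell:
        cmd: "apt-get --just-print upgrade 2>/dev/null | grep ^Inst | wc -l"
      register: pending
      changed_when: false    # read-only check, never report a change

    - name: Check for a pending-reboot marker
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_marker

    - name: Fail if updates or a reboot are still outstanding
      ansible.builtin.assert:
        that:
          - pending.stdout | int == 0
          - not reboot_marker.stat.exists
        fail_msg: "{{ inventory_hostname }} still has pending updates or needs a reboot"
```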

Documentation and reporting:

Patch run log archived (JSON output or AWX job record)
Compliance report generated and submitted to the relevant team
Any exceptions or deferred hosts documented with reason and remediation timeline
SIEM/ticketing system updated with patch status

Common Pitfalls

Breaking changes from dist-upgrade

The apt module's upgrade: dist option (equivalent to apt-get dist-upgrade) can install new packages or remove existing ones to satisfy dependencies. Always review the --check --diff output before applying to production. Pin packages that must not change using apt-mark hold.
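
Holds can also be applied from the playbook itself, before the upgrade task runs. A minimal sketch using `ansible.builtin.dpkg_selections`, which sets the dpkg hold state on each package; the `patch_held_packages` variable name is hypothetical, and the package names in the comment are placeholders.

```yaml
- name: Hold packages that must not change during patching
  ansible.builtin.dpkg_selections:
    name: "{{ item }}"
    selection: hold
  loop: "{{ patch_held_packages | default([]) }}"
  # Example group_vars entry (placeholder package names):
  #   patch_held_packages: [postgresql-14, docker-ce]
```

Keeping the list in group_vars puts the holds under version control alongside the playbook, instead of leaving them as undocumented state on individual hosts.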

Package dependency conflicts

A partially applied patch can leave a system with broken dependencies. If a run fails midway through a batch, repair the affected host before retrying: run dpkg --configure -a on Debian/Ubuntu, or dnf distro-sync on RHEL-family systems.
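
If you prefer to run the repair through Ansible rather than by hand, the two commands can be wrapped in tasks keyed on the OS family. A sketch, assuming the play gathers facts and uses `become: true`; both tasks always report `changed` because the commands give no reliable signal of whether anything was repaired.

```yaml
- name: Finish configuring interrupted packages (Debian family)
  ansible.builtin.command: dpkg --configure -a
  when: ansible_facts['os_family'] == 'Debian'

- name: Resynchronise installed packages with the repositories (RedHat family)
  ansible.builtin.command: dnf -y distro-sync
  when: ansible_facts['os_family'] == 'RedHat'
```

Limit the run to the broken host with `--limit <hostname>` so healthy hosts are not touched.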

Credential and privilege issues

The remote user must have passwordless sudo for the commands Ansible runs under become, or you must supply the become password from a Vault-encrypted variable (for example ansible_become_password). Test this explicitly on a new host before adding it to a production group.
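
One way to make that test explicit is a short play that escalates and checks the resulting user ID. This is a sketch; the `new_hosts` group is a hypothetical staging group for hosts that have not yet been vetted.

```yaml
- name: Verify privilege escalation before onboarding
  hosts: new_hosts          # hypothetical staging group
  gather_facts: false
  become: true
  tasks:
    - name: Run a harmless command through become
      ansible.builtin.command: id -u
      register: become_check
      changed_when: false

    - name: Confirm the command ran as root
      ansible.builtin.assert:
        that:
          - become_check.stdout == "0"
        fail_msg: "become did not escalate to root on {{ inventory_hostname }}"
```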

Inventory drift

Hosts decommissioned in infrastructure but not removed from inventory will appear as UNREACHABLE. This generates noise that masks real failures. Automate inventory from a source of truth (cloud API, CMDB) to prevent this.
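
As one example of sourcing inventory from a cloud API, the sketch below configures the `amazon.aws.aws_ec2` inventory plugin. It assumes AWS, the `amazon.aws` collection, and `boto3` on the control node; the region and the `PatchGroup` tag are placeholders.

```yaml
# inventory/production/aws_ec2.yml
# (the plugin requires the file name to end in aws_ec2.yml or aws_ec2.yaml)
plugin: amazon.aws.aws_ec2
regions:
  - eu-west-1                      # placeholder region
filters:
  instance-state-name: running     # stopped or terminated instances never enter inventory
keyed_groups:
  - key: tags.PatchGroup           # hypothetical tag; yields groups such as patch_web
    prefix: patch
```

Running `ansible-inventory -i inventory/production/aws_ec2.yml --graph` shows the resulting groups, which is also a quick way to check the host count before a patch window.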

WinRM authentication failures mid-run

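Windows hosts can temporarily reject WinRM connections during update processing. Set retries: 3 and delay: 60 on the win_updates task (covered in part six) to handle transient failures without aborting the entire play. Add an explicit until condition on the registered result as well, because older Ansible releases ignore retries without one.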
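
A sketch of such a task is shown here. The update categories are an illustrative choice, and the `win_update_result` variable name is hypothetical; the `retries` and `delay` values are the ones named above.

```yaml
- name: Install Windows updates, retrying on transient failures
  ansible.windows.win_updates:
    category_names:
      - SecurityUpdates
      - CriticalUpdates
    reboot: "{{ patch_reboot_allowed | default(false) | bool }}"
  register: win_update_result
  retries: 3
  delay: 60
  until: win_update_result is not failed   # explicit condition so retries apply on any version
```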

Security Considerations

Risk	Mitigation
SSH key exposure	Store keys in a secrets manager; rotate after personnel changes
Vault password in CI logs	Write to a temp file, never echo; delete it in an `always` step
Over-privileged become	Scope sudo rules to the specific commands Ansible needs
Unchecked playbook execution	Require `--check` output review before production runs
Supply chain: collection integrity	Pin collection versions in `requirements.yml`; verify checksums
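
For the collection-integrity row, a pinned `requirements.yml` might look like the sketch below. The collections listed and their version numbers are placeholders; pin whichever versions you actually tested in staging.

```yaml
# requirements.yml
collections:
  - name: ansible.windows
    version: "2.5.0"        # placeholder: exact version, not a range
  - name: community.general
    version: "9.5.0"        # placeholder
```

Install with `ansible-galaxy collection install -r requirements.yml`, then run `ansible-galaxy collection verify -r requirements.yml` to compare the installed files against the checksums published by the Galaxy server.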

Maturity Roadmap

Use this as a benchmark for where your organisation is and what the next step looks like.

Level 1 — Manual

Patches applied via RDP/SSH manually
No audit trail
No consistent timing
Reactive: patches applied only after incidents

Level 2 — Scripted

Bash/PowerShell scripts or ad-hoc Ansible commands
Some consistency, but no idempotency or error handling
Inventory managed in spreadsheets

Level 3 — Automated (where you should be after this series)

Playbooks in version control
Structured inventory (group_vars, dynamic sources)
Vault for secrets, roles for reuse
Scheduled runs with retry logic
JSON logs, basic compliance reporting

Level 4 — Integrated

AWX or Ansible Automation Platform with RBAC
CI/CD triggers with approval gates for production
SIEM integration, compliance dashboards
Drift detection running continuously

Level 5 — Compliance-driven

Patch SLAs enforced automatically (critical CVEs patched within N days)
Automated evidence generation for audits
Exception workflow integrated with ticketing system
Full traceability from CVE publication to patch verification

Series Index

Getting Started with Ansible for Patch Management
Must-Know Ansible Commands and Core Concepts
Ansible Architecture and Best Practices
Advanced Ansible Usage for Enterprise Environments
Linux Server Patching with Ansible
Windows Server Patching with Ansible
Day-to-Day Automation and Reporting
Production Checklist and Maturity Roadmap (this post)
