The difference between a patching automation that survives its first production incident and one that doesn’t usually comes down to what you verified before you started—not the playbook itself. Checklists are not bureaucracy; they are the accumulated cost of previous failures encoded as procedure. This final post consolidates the series into an operational reference.
Pre-Patching Checklist
Work through this before every scheduled patch window. For emergency patches, compress the timeline—but skip nothing.
Inventory and connectivity:
- Inventory is current and matches the authoritative CMDB or cloud source
- `ansible all -m ping` returns SUCCESS for all in-scope hosts
- No hosts are UNREACHABLE that should be reachable
- Dynamic inventory (if used) is returning the expected host count
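The connectivity items can be turned into a quick gate before the window opens. The following is a minimal sketch, assuming the one-line output format produced by `ansible -o`; the function name `unreachable_hosts` is illustrative and not part of Ansible.

```shell
# Print every host whose ping result is not SUCCESS.
# Reads `ansible all -m ping -o` output on stdin and assumes each line
# starts with "<host> | <STATUS>", for example:
#   web01 | SUCCESS => {"changed": false, "ping": "pong"}
#   db01 | UNREACHABLE!: Failed to connect to the host via ssh
unreachable_hosts() {
    awk -F' [|] ' 'NF >= 2 && $2 !~ /^SUCCESS/ { print $1 }'
}

# Usage:
#   ansible all -m ping -o | unreachable_hosts
```

An empty result means every in-scope host answered; anything printed should be explained before the run starts.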
Backups and snapshots:
- VM snapshots taken for critical hosts within the past 24 hours
- Database backups verified and tested (not just triggered)
- Rollback procedure documented and accessible offline
Change management:
- Maintenance window approved and communicated to stakeholders
- Downstream dependencies identified (load balancer draining, dependent services)
- On-call engineer confirmed available for the duration of the window
Playbook readiness:
- Playbook tested against staging environment within the past week
- `--check --diff` run against production inventory reviewed and approved
- `serial` batch size set appropriately (never patch all at once)
- `patch_reboot_allowed` variable set correctly per group
- Ansible Vault password available on the control node
- CI/CD secrets (SSH key, vault pass) rotated if not rotated recently
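The `serial` item is easy to forget when a playbook is copied or refactored. One possible guard, sketched here as a plain text check rather than a real YAML parse, is to refuse to run a playbook that has no `serial` key at all. The function name `playbook_has_serial` is hypothetical.

```shell
# Succeed if the playbook file contains a `serial:` key with a value.
# This is a text match only: it cannot tell whether the value is
# sensible, and it misses `serial` supplied through an imported file.
playbook_has_serial() {
    grep -Eq '^[[:space:]-]*serial:[[:space:]]*[^[:space:]#]' "$1"
}

# Usage:
#   playbook_has_serial patch-linux.yml || echo "no serial batch size set"
```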
During the Patch Run
First 5 minutes:
- Run `--check --diff` one final time immediately before applying. Confirm no unexpected changes appeared since the pre-check.
- Start with the canary group or a single representative host before the full batch.
During execution:
- Monitor the first batch; do not walk away
- Confirm services are healthy after each batch before proceeding
- Watch for `failed` or `unreachable` hosts; stop and investigate before continuing
- Keep the maintenance window communication channel open
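Watching for failures can be backed by a small check on the run output. The sketch below assumes the standard `PLAY RECAP` layout printed at the end of an `ansible-playbook` run, with colour disabled so the counters contain no escape codes; `recap_problem_hosts` is an illustrative name.

```shell
# Print hosts whose PLAY RECAP line shows a non-zero failed or
# unreachable counter. Reads ansible-playbook output on stdin and
# assumes recap lines of the form:
#   web01 : ok=5 changed=2 unreachable=0 failed=0 skipped=1
recap_problem_hosts() {
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap {
            bad = 0
            for (i = 1; i <= NF; i++)
                if ($i ~ /^(failed|unreachable)=[1-9][0-9]*$/) bad = 1
            if (bad) print $1
        }
    '
}

# Usage:
#   ansible-playbook patch-linux.yml -i inventory/production/hosts.yml \
#       | tee patch-run.log | recap_problem_hosts
```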
If something fails:
# Retry only failed hosts from the last run
ansible-playbook patch-linux.yml -i inventory/production/hosts.yml \
--limit @/tmp/patch-linux.retry
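Ansible writes a `.retry` file named after the playbook when hosts fail, but only if `retry_files_enabled = True` is set in `ansible.cfg`; the feature is disabled by default in current releases. The `/tmp` path above also assumes `retry_files_save_path = /tmp`.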
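A minimal sketch of preparing the shell for retry files and guarding the retry command follows. The two environment variables are Ansible's documented equivalents of the `retry_files_enabled` and `retry_files_save_path` settings; the `/tmp` location matches the command above, and `retry_limit_arg` is a hypothetical helper.

```shell
# Enable retry files for this shell session and save them under /tmp.
export ANSIBLE_RETRY_FILES_ENABLED=True
export ANSIBLE_RETRY_FILES_SAVE_PATH=/tmp

# Print "--limit @FILE" only when the retry file exists and is not
# empty; print nothing and return 1 otherwise.
retry_limit_arg() {
    retry_file=$1
    if [ -s "$retry_file" ]; then
        printf '%s\n' "--limit @$retry_file"
    else
        return 1
    fi
}

# Usage:
#   retry_limit_arg /tmp/patch-linux.retry || echo "nothing to retry"
```

Checking that the file exists first avoids rerunning against a stale or missing host list.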
Post-Patching Checklist
Immediate (within 30 minutes of completion):
- All hosts returned `ok` or `changed`, with no remaining `failed` or `unreachable`
- Critical services validated on every patched host
- Reboot-required hosts have rebooted and are back online
- Application health checks passing (HTTP endpoints, synthetic monitors)
Patch level verification:
# Count pending package upgrades per host (Debian/Ubuntu); expect 0.
# This counts every upgradable package, not only security updates.
ansible all -i inventory/production/hosts.yml \
-m shell -a "apt-get --just-print upgrade 2>/dev/null | grep ^Inst | wc -l" \
--become
# Check installed kernel version
ansible all -i inventory/production/hosts.yml \
-m setup -a "filter=ansible_kernel"
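With many hosts, the per-host counts from the pending-upgrade command are easier to review when filtered down to the non-zero ones. This sketch assumes that command is run with `-o` and that the one-line output for the `shell` module looks like `web01 | CHANGED | rc=0 | (stdout) 3`; `hosts_with_pending` is an illustrative name.

```shell
# Print "host count" for every host that still reports pending upgrades.
# Reads one-line ansible output on stdin, assuming lines like:
#   web01 | CHANGED | rc=0 | (stdout) 3
hosts_with_pending() {
    awk -F' [|] ' '
        $NF ~ /^\(stdout\) [0-9]+$/ {
            n = $NF
            sub(/^\(stdout\) /, "", n)
            if (n + 0 > 0) print $1, n
        }
    '
}
```

No output means every host that answered reported zero pending upgrades.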
Documentation and reporting:
- Patch run log archived (JSON output or AWX job record)
- Compliance report generated and submitted to the relevant team
- Any exceptions or deferred hosts documented with reason and remediation timeline
- SIEM/ticketing system updated with patch status
Common Pitfalls
Breaking changes from dist-upgrade
`apt-get dist-upgrade` (`upgrade: dist` in the Ansible `apt` module) can install new packages or remove existing ones to satisfy dependencies. Always review the diff output before applying to production. Pin packages that must not change using `apt-mark hold`.
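One way to spot removals before they happen is to simulate the upgrade and list what would be removed. A minimal sketch, relying on the `Remv` lines that `apt-get -s` prints; the helper name `planned_removals` is hypothetical.

```shell
# List packages that a simulated dist-upgrade would remove.
# Reads `apt-get -s dist-upgrade` output on stdin; removals appear as
# lines whose first field is "Remv".
planned_removals() {
    awk '$1 == "Remv" { print $2 }'
}

# Usage on a Debian/Ubuntu host:
#   apt-get -s dist-upgrade | planned_removals
#   sudo apt-mark hold <package>   # pin a package that must not change
```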
Package dependency conflicts
A partially applied patch can leave a system with broken dependencies. If a run fails midway through a batch, run `dpkg --configure -a` (Debian/Ubuntu) or `dnf distro-sync` (RHEL-family) on the affected host before retrying.
Credential and privilege issues
The remote user that Ansible connects as must have passwordless sudo for patching commands, or you must supply the become password via Vault. Test this explicitly on a new host before adding it to a production group.
Inventory drift
Hosts decommissioned in infrastructure but not removed from inventory will appear as UNREACHABLE. This generates noise that masks real failures. Automate inventory from a source of truth (cloud API, CMDB) to prevent this.
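Until inventory is fully generated from the source of truth, a periodic comparison catches stale entries. The sketch below assumes two plain-text files with one hostname per line, one exported from the Ansible inventory and one from the cloud API or CMDB; how those files are produced is outside its scope, and `stale_inventory_hosts` is an illustrative name.

```shell
# Print hosts that appear in the inventory list but not in the
# source-of-truth list. Both arguments are files with one hostname
# per line; order and duplicates do not matter.
stale_inventory_hosts() {
    inv_sorted=$(mktemp) || return 1
    truth_sorted=$(mktemp) || { rm -f "$inv_sorted"; return 1; }
    sort -u "$1" > "$inv_sorted"
    sort -u "$2" > "$truth_sorted"
    comm -23 "$inv_sorted" "$truth_sorted"   # lines unique to the inventory
    rm -f "$inv_sorted" "$truth_sorted"
}

# Usage:
#   stale_inventory_hosts inventory-hosts.txt cmdb-hosts.txt
```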
WinRM authentication failures mid-run
Windows hosts can temporarily reject WinRM connections during update processing. Set `retries: 3` and `delay: 60` on the `win_updates` task (covered in part six) to handle transient failures without aborting the entire play. On older Ansible versions, `retries` has no effect unless the task also has an `until` condition.
Security Considerations
| Risk | Mitigation |
|---|---|
| SSH key exposure | Store keys in a secrets manager; rotate after personnel changes |
| Vault password in CI logs | Write to a temp file, never echo; add a cleanup step that always runs to delete it |
| Over-privileged become | Scope sudo rules to the specific commands Ansible needs |
| Unchecked playbook execution | Require --check output review before production runs |
| Supply chain: collection integrity | Pin collection versions in requirements.yml; verify checksums |
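
The vault password mitigation can be sketched as a small wrapper. This assumes the CI system exposes the secret in an environment variable, here called `VAULT_PASS` (an assumed name), and that the wrapped command accepts the password file path as its final argument, as `ansible-playbook --vault-password-file` does. The helper `with_vault_pass_file` is hypothetical.

```shell
# Run a command with the vault password in a private temp file, then
# delete the file whether the command succeeded or failed.
# The temp file path is appended as the last argument of the command.
with_vault_pass_file() {
    pass_file=$(mktemp) || return 1
    chmod 600 "$pass_file"
    # Redirected to the file, so the secret is never written to stdout.
    printf '%s' "$VAULT_PASS" > "$pass_file"
    status=0
    "$@" "$pass_file" || status=$?
    rm -f "$pass_file"
    return "$status"
}

# Usage:
#   with_vault_pass_file ansible-playbook patch-linux.yml --vault-password-file
```

The wrapper does not cover a job killed by a signal; a CI-level cleanup step that always runs is still worth keeping.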
Maturity Roadmap
Use this as a benchmark for where your organisation is and what the next step looks like.
Level 1 — Manual
- Patches applied via RDP/SSH manually
- No audit trail
- No consistent timing
- Reactive: patches applied only after incidents
Level 2 — Scripted
- Bash/PowerShell scripts or ad-hoc Ansible commands
- Some consistency, but no idempotency or error handling
- Inventory managed in spreadsheets
Level 3 — Automated (where you should be after this series)
- Playbooks in version control
- Structured inventory (group_vars, dynamic sources)
- Vault for secrets, roles for reuse
- Scheduled runs with retry logic
- JSON logs, basic compliance reporting
Level 4 — Integrated
- AWX or Automation Platform with RBAC
- CI/CD triggers with approval gates for production
- SIEM integration, compliance dashboards
- Drift detection running continuously
Level 5 — Compliance-driven
- Patch SLAs enforced automatically (critical CVEs patched within N days)
- Automated evidence generation for audits
- Exception workflow integrated with ticketing system
- Full traceability from CVE publication to patch verification
Series Index
- Getting Started with Ansible for Patch Management
- Must-Know Ansible Commands and Core Concepts
- Ansible Architecture and Best Practices
- Advanced Ansible Usage for Enterprise Environments
- Linux Server Patching with Ansible
- Windows Server Patching with Ansible
- Day-to-Day Automation and Reporting
- Production Checklist and Maturity Roadmap (this post)
