Overview
A multi-environment configuration drift detection platform for a major Taiwanese bank's OpenShift estate. Before this system, release engineers manually compared resource manifests across SIT, UAT, and production clusters during every release window — eyeballing YAML diffs across oc get outputs, copying into spreadsheets, and filing tickets for mismatches. The process was error-prone and blocked releases on a single person's availability.
The system replaces that workflow with Ansible Tower schedules that run a library of 30+ purpose-built Ansible Roles against every OCP environment, collect the relevant Kubernetes resource kinds (namespaces, deployments, services, RBAC, secrets, network policies, and more), diff them pairwise, and auto-commit the resulting comparison report to a Gitlab repository. The release-gate review now consists of reading one commit instead of running oc commands by hand.
Scope: OCP core resources, Quay image catalog, Kafka topic config, Elasticsearch cluster state, and DB connectivity. Any environment difference that mattered operationally was covered by a dedicated check_* role.
My Role
On-site consultant embedded with the client's architecture team:
- Designed the split between `get_*` roles (fetch state) and `check_*` roles (compare state) so the pull and the diff could evolve independently
- Authored the full role library: 30+ roles covering the OCP core kinds plus platform-specific checks (Quay, Kafka, Elasticsearch, DB)
- Designed custom Ansible Tower credential types (`quay_image_compare_user`, `ocp_robot_account`) with field-level secret flags so per-environment tokens could be injected into runs without hardcoding
- Built the Ansible Tower project / template / schedule wiring and the survey-driven "which environments, which kinds" parameterization
- Integrated `git_acp` so every run's output is auto-committed to a versioned Gitlab repo; release engineers read diffs as Gitlab commits
- Iterated on scope and delivery artifacts across multiple review rounds as the client expanded requirements
Tech Stack
Automation
Platforms
Workflow
Architecture
The system is a library of small, composable Ansible Roles orchestrated by Ansible Tower, with Gitlab as the audit trail:
Tower Layer: Custom credential types hold per-environment OCP robot accounts and Quay tokens. Job Templates are parameterized by survey (which environments to compare, which kinds to check). Schedules run the full comparison at a chosen cadence — typically daily plus on-demand before each release.
Role Library (the core):
- `get_*` roles pull state from OCP using `k8s_info`: `get_project`, `get_dc`, `get_service`, `get_secret`, `get_serviceaccount`, `get_rolebinding`, `get_networkpolicy`, `get_pod`, `get_user`, `get_bc`
- `check_*` roles compare collected state across environments and emit a structured diff: `check_project`, `check_dc`, `check_service`, `check_secret`, `check_serviceaccount`, `check_rolebinding`, `check_networkpolicy`, `check_pod`, `check_port`, `check_user`, `check_system`, `check_element`
- Platform-specific checks: `check_quay` (image catalog parity), `check_kafka` (topic / partition config), `check_el` (Elasticsearch index and template), an OCP-CHECK-DB role (connectivity and schema)
- Utility roles: `ocp-login`, `k8s_definition`, `k8s_info`, `set_fact`, `change_login_url`, `git_acp`
Driver Playbooks wire the library for Tower to call:
- `get_ocp_info.yml`: collect state from one environment
- `check_ocp_info.yml`: compare collected state across environments
- `check_quay.yml`, `check_system.yml`, `replica_set.yml`: platform-specific top-level flows
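As an illustration of how a driver playbook can wire the library together, here is a minimal sketch of what `get_ocp_info.yml` might look like. The `kinds` variable and its default value are assumptions made for this example; only the role names come from the library described above, and the real playbook may be organized differently.

```yaml
# Hypothetical shape of get_ocp_info.yml (illustrative, not the delivered playbook).
# "kinds" is an assumed variable name; a Tower survey answer arrives as an
# extra var and therefore overrides the play-level default below.
- name: Collect state from one environment
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kinds: [project, dc, service, secret]
  tasks:
    - name: Log in to the target cluster
      ansible.builtin.include_role:
        name: ocp-login

    - name: Run one collector role per requested kind
      ansible.builtin.include_role:
        name: "get_{{ item }}"
      loop: "{{ kinds }}"
```

Because the role name is derived from the kind, supporting a new kind in this sketch means adding a `get_<kind>` role and extending the list, with no change to the playbook logic.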
Persistence Layer: Every comparison report is auto-committed to a dedicated Gitlab repository via git_acp. Gitlab becomes the audit trail — release engineers read the commit history ("what changed between today's audit and yesterday's") instead of running oc commands.
Key Challenges
1. The Comparison Scope Is Broad and Heterogeneous
OCP has a long list of resource kinds that matter for configuration parity (Projects, DCs, Services, Secrets, RoleBindings, NetworkPolicies, Pods, Users, BuildConfigs, ServiceAccounts). Each kind has its own comparison semantics — a Secret diff cares about keys but not values, a Pod diff cares about spec but not status, a RoleBinding diff cares about subjects and role refs. One monolithic "diff everything" approach would be unreadable and incorrect.
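To make the per-kind semantics concrete, the following is a minimal sketch of a key-only Secret comparison. The variable names `sit_secret` and `uat_secret` and the output fact `secret_diff` are hypothetical, and the actual `check_secret` role may be structured differently.

```yaml
# Hypothetical sketch: compare Secret key sets and never touch the values.
# Assumes sit_secret and uat_secret each hold one Secret object in the form
# returned by k8s_info (a dict with "metadata" and "data").
- name: Compare Secret key sets between SIT and UAT
  ansible.builtin.set_fact:
    secret_diff:
      name: "{{ sit_secret.metadata.name }}"
      only_in_sit: "{{ sit_keys | difference(uat_keys) }}"
      only_in_uat: "{{ uat_keys | difference(sit_keys) }}"
  vars:
    # default({}, true) also covers a Secret whose data field is null
    sit_keys: "{{ (sit_secret.data | default({}, true)).keys() | list }}"
    uat_keys: "{{ (uat_secret.data | default({}, true)).keys() | list }}"
```

Since only key names are read, the resulting report can be committed to Git without exposing secret material.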
2. Credential Management Across Environments
SIT / UAT / production each have their own OCP robot account and their own Quay token. Baking these into Playbooks is a security disaster; passing them through survey extra vars is operationally brittle. Tower's credential system is the right tool, but it requires designing credential types that know which fields are secrets.
3. Platform-Specific Comparison Logic
The OCP core is the easy part — k8s_info returns it uniformly. Quay image catalogs, Kafka topic configs, and Elasticsearch cluster health all have their own APIs, their own auth models, and their own diff semantics. They cannot be jammed into the same k8s_info-based approach.
4. Comparison Output Has to be Consumable by Humans
A raw YAML diff between two clusters is unreadable during a release window. The output must surface "here is what actually differs, in the categories that matter" — not a wall of whitespace and key-reorder noise.
5. Iteration Speed
Four rounds of delivery documents over four months — scope grew (new resource kinds, new platforms), diff semantics got refined based on real release experience, role interfaces stabilized. Any architecture that made it expensive to add a new check_* would have broken down.
Solutions & Design Decisions
Split get_* from check_*
The single most impactful design decision. Each get_* role owns one resource kind's collection logic and returns a normalized fact. Each check_* role consumes those facts and emits diffs in its own format. Adding a new kind is two small roles, not a modification to one giant role. Adding a new diff dimension is a new check_*, not a risky edit to the collector.
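A collector in this style could look roughly like the sketch below. It assumes the current `kubernetes.core.k8s_info` module name, the `community.general` collection (with the `jmespath` Python library) for `json_query`, and hypothetical variables `ocp_host`, `ocp_api_key`, `env_name`, and `collected_service`; none of these names are confirmed by the original roles.

```yaml
# Hypothetical roles/get_service/tasks/main.yml (illustrative only).
# ocp_host, ocp_api_key, and env_name are assumed to be set by the caller.
- name: Fetch all Services from one cluster
  kubernetes.core.k8s_info:
    host: "{{ ocp_host }}"
    api_key: "{{ ocp_api_key }}"
    api_version: v1
    kind: Service
  register: service_result

- name: Keep only the fields a checker needs, keyed by environment
  ansible.builtin.set_fact:
    collected_service: "{{ collected_service | default({}) | combine({env_name: normalized}) }}"
  vars:
    # JMESPath projection; requires community.general and jmespath
    service_fields: "[].{namespace: metadata.namespace, name: metadata.name, type: spec.type, ports: spec.ports, selector: spec.selector}"
    normalized: "{{ service_result.resources | community.general.json_query(service_fields) }}"
```

Running the role once per environment accumulates one entry per cluster in `collected_service`, which is the kind of normalized fact a `check_service` role could then compare.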
Custom Tower Credential Types
Defined two credential types in Tower with field-level secret: true flags:
- `quay_image_compare_user`: SIT and UAT Quay tokens, both marked secret, exposed to Playbooks as `sit_quay_token` and `uat_quay_token` extra vars
- `ocp_robot_account`: username + password pair, password marked secret, exposed as `ocp_username` and `ocp_password`
Playbook source never contains the raw credential; Tower injects the values as extra vars at run time. Credential rotation is a one-place change.
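On the consuming side, a login step only has to reference the injected variable names. The sketch below assumes the `community.okd.openshift_auth` module and a hypothetical `ocp_api_url` variable; the project's own `ocp-login` role may authenticate in a different way.

```yaml
# Hypothetical login step (the real ocp-login role may differ).
# ocp_username and ocp_password are supplied by the Tower credential injector;
# ocp_api_url is an assumed variable naming the cluster API endpoint.
- name: Log in to the target cluster with the injected robot account
  community.okd.openshift_auth:
    host: "{{ ocp_api_url }}"
    username: "{{ ocp_username }}"
    password: "{{ ocp_password }}"
  register: ocp_login
  no_log: true

- name: Keep the session token for later k8s_info calls
  ansible.builtin.set_fact:
    ocp_api_key: "{{ ocp_login.openshift_auth.api_key }}"
  no_log: true
```

Setting `no_log: true` on both tasks keeps the password and the resulting token out of the Tower job output.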
Platform-Specific Roles With The Same Interface Shape
check_quay, check_kafka, check_el, and the OCP-CHECK-DB role each speak to their own platform API internally but emit diffs in the same structural format as the OCP check_* family. The orchestrating playbook doesn't care that Quay is not Kubernetes — the output looks the same.
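As a rough illustration of "different API, same output shape", the sketch below lists repositories from two Quay registries over HTTP and reduces them to the same only-in-one-side structure a Kubernetes check could emit. The variables `sit_quay_url`, `uat_quay_url`, and `quay_namespace`, the `quay_diff` fact, and the use of Quay's repository listing endpoint without pagination are all assumptions for this example, not the delivered `check_quay` role.

```yaml
# Hypothetical sketch of a Quay parity check (illustrative only).
# Pagination of the repository listing is ignored for brevity.
- name: List repositories in each Quay registry
  ansible.builtin.uri:
    url: "{{ item.url }}/api/v1/repository?namespace={{ quay_namespace }}"
    headers:
      Authorization: "Bearer {{ item.token }}"
  loop:
    - { env: sit, url: "{{ sit_quay_url }}", token: "{{ sit_quay_token }}" }
    - { env: uat, url: "{{ uat_quay_url }}", token: "{{ uat_quay_token }}" }
  register: quay_repos
  no_log: true

- name: Emit the diff in the same shape as the OCP checks
  ansible.builtin.set_fact:
    quay_diff:
      kind: QuayRepository
      only_in_sit: "{{ sit_names | difference(uat_names) }}"
      only_in_uat: "{{ uat_names | difference(sit_names) }}"
  vars:
    sit_names: "{{ quay_repos.results[0].json.repositories | map(attribute='name') | list }}"
    uat_names: "{{ quay_repos.results[1].json.repositories | map(attribute='name') | list }}"
```

The orchestrating playbook can then treat `quay_diff` like any other check result when it assembles the report.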
Diff Reports Structured for Human Review
Comparison output is grouped by resource kind, then by object name, then by changed field — with same-named objects that match entirely collapsed into a single line. A release engineer looking at "what changed" reads a three-level tree, not a 10,000-line unified diff.
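The three-level grouping can be pictured with a small made-up report. Every namespace, object name, and value below is invented for illustration, and the real report format may differ in syntax.

```yaml
# Hypothetical report layout: kind -> object -> changed field.
# All names and values here are invented examples, not real audit output.
DeploymentConfig:
  example-ns/example-api:
    spec.replicas: { sit: 2, uat: 3 }
    spec.template.spec.containers[0].image:
      sit: "registry.example/example-api:1.0"
      uat: "registry.example/example-api:1.1"
  example-ns/example-worker: identical
Secret:
  example-ns/example-db-credentials:
    keys_only_in_sit: [DB_REPLICA_HOST]
```

A fully matching object collapses to the single `identical` line, so the reader's attention goes to the entries that list changed fields.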
Gitlab Auto-Commit as Audit History
After the comparison runs, git_acp adds the report, commits with a timestamped message, and pushes to a dedicated Gitlab repo. The commit log becomes the audit trail — "what was the delta on 2021-06-18 at the pre-release window" answers itself with one git log.
Results & Impact
Release Process Impact
- Manual multi-hour audit eliminated from every release window
- Comparison now runs on a schedule plus on-demand; release engineers read the Gitlab diff instead of running `oc` commands
- Scope coverage exceeded what a human could realistically check by hand (every kind, every namespace, every environment)
Platform Coverage
- 30+ Ansible Roles spanning OCP core kinds, Quay, Kafka, Elasticsearch, DB
- Roles are orthogonal: the `get_*` vs `check_*` separation lets the library grow without rewrites
Operational Maturity
- Tower credential types keep secrets out of Playbooks and out of Git
- Scheduled templates mean continuous drift visibility, not just release-window spot checks
- Gitlab auto-commit gives every run a permanent, diff-able audit trail
Learnings
Collector / Checker Split Is The Real Architecture
I initially drafted this with one role per "compare this kind" that did both fetch and diff. That made adding a new environment require editing every role. Splitting get_* from check_* turned the system into a library — each role does one small thing and composition happens at the Playbook level. Everything else downstream followed from this one decision.
Tower Credential Types Are Underused
Most Ansible / Tower implementations I see treat credentials as "just extra vars with a secret flag." Defining a proper credential type — with named fields, required flags, and explicit extra-var mapping — makes the credential layer part of the platform contract, not a per-playbook convention. Rotation, auditing, and RBAC all become trivial.
Git Was the Right Audit Model for This Workflow
Writing to a database is flexible but hard to review. Writing to files on disk is reviewable but ephemeral. Writing to a Git repo is both reviewable and permanent: every run is a commit, the diff between runs is git diff, and the full history is retained. For a workflow whose output is "what changed," Git is the natural storage.
Delivery Iterations Revealed the Real Scope
The scope in the March doc and the scope in the July doc were different shapes. Platform-specific additions (Quay, Kafka, Elasticsearch) came from client feedback during rollout — "the OCP parity is great but we also need to know the Quay catalogs are synced." Locking scope early would have shipped the wrong tool.
Deep Dive
Role Library Shape
Orthogonal By Design
The get_* / check_* split is the system's defining shape. A get_* role fetches one resource kind from one cluster and returns a normalized fact. A check_* role consumes facts from multiple clusters and emits a diff. Adding a new environment does not touch any existing role; adding a new kind is two new small roles.
Resource Collection — get_* Roles
| Role | Kubernetes kind |
|---|---|
| `get_project` | Project / Namespace |
| `get_dc` | DeploymentConfig |
| `get_service` | Service |
| `get_secret` | Secret |
| `get_serviceaccount` | ServiceAccount |
| `get_rolebinding` | RoleBinding / ClusterRoleBinding |
| `get_networkpolicy` | NetworkPolicy |
| `get_pod` | Pod |
| `get_bc` | BuildConfig |
| `get_user` | User / Group |
Comparison — check_* Roles
| Role | What it compares |
|---|---|
| `check_project` | Namespace presence · labels · quotas |
| `check_dc` | Deployment spec · replicas · resources · probes |
| `check_service` | Port spec · selector · service type |
| `check_port` | Exposed ports at cluster edge |
| `check_secret` | Key set (not values) · annotation drift |
| `check_serviceaccount` | Name · secret refs · image pull secrets |
| `check_rolebinding` | Subjects · roleRef |
| `check_networkpolicy` | Ingress / egress rules |
| `check_pod` | Running image versions across environments |
| `check_user` | User / group membership |
| `check_system` | Cluster-level parity (nodes · operators) |
Platform-Specific Roles
| Role | Platform | Comparison scope |
|---|---|---|
| `check_quay` | Red Hat Quay | Image repository parity · tag sets |
| `check_kafka` | Kafka | Topic config · partition / replication factor |
| `check_el` | Elasticsearch | Indices · templates · cluster health |
| `check_element` | Generic | Catch-all for ad-hoc element diffs |
| OCP-CHECK-DB | PostgreSQL / Oracle | Connectivity · schema parity |
Tower Credential Types
    # Ansible Tower credential type definition: ocp_robot_account
    # Input configuration
    fields:
      - id: ocpusername
        type: string
        label: OCPUsername
      - id: ocppassword
        type: string
        label: OCPPassword
        secret: true
    required:
      - ocpusername
      - ocppassword
    # Injector configuration
    extra_vars:
      ocp_username: '{{ ocpusername }}'
      ocp_password: '{{ ocppassword }}'

    # Ansible Tower credential type definition: quay_image_compare_user
    # Input configuration
    fields:
      - id: sitquaytoken
        type: string
        label: SITquaytoken
        secret: true
      - id: uatquaytoken
        type: string
        label: UATquaytoken
        secret: true
    # Injector configuration
    extra_vars:
      sit_quay_token: '{{ sitquaytoken }}'
      uat_quay_token: '{{ uatquaytoken }}'
Why Two Tokens In One Credential?
SIT and UAT Quay tokens are always used together by check_quay — you cannot diff one without the other. Packaging them as a single credential type makes that coupling explicit, prevents "ran with one but not the other" mistakes, and gives rotation a single touch point.
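The Quay credential type as shown does not declare a `required` list, so a role-level guard is one way to enforce the pairing at run time. The task below is a sketch of such a guard and an assumption for this example; it is not taken from the delivered `check_quay` role.

```yaml
# Hypothetical guard at the top of a Quay comparison role.
# Only variable names appear in a failure message, never the token values.
- name: Fail early unless both Quay tokens were injected
  ansible.builtin.assert:
    that:
      - sit_quay_token is defined and sit_quay_token | length > 0
      - uat_quay_token is defined and uat_quay_token | length > 0
    fail_msg: "check_quay needs both tokens from the quay_image_compare_user credential"
    quiet: true
```

Failing before any API call makes a half-configured credential visible immediately instead of producing a one-sided diff.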
Driver Playbook Layout
| Playbook | Purpose |
|---|---|
| `get_ocp_info.yml` | Collect state from a target environment (SIT / UAT / prod) |
| `check_ocp_info.yml` | Diff the collected OCP-core state across environments |
| `check_quay.yml` | Compare Quay image catalogs |
| `check_system.yml` | Cluster-level parity checks |
| `replica_set.yml` | DeploymentConfig replica-count comparison |
Output → Gitlab Via git_acp
    - name: Commit comparison report to Gitlab
      git_acp:
        path: "{{ report_dir }}"
        comment: "OCP comparison {{ ansible_date_time.iso8601 }} · {{ envs | join(' vs ') }}"
        url: "{{ gitlab_repo_url }}"
        branch: main
The Audit Log Is `git log`
Every Tower run produces a commit. "What changed between the Friday run and Monday's run" is answered by one git diff. For a drift-detection system, Git is the right audit model — diffable, permanent, human-readable, and reusable as a release-gate artifact.