OCP Cluster Comparison Automation

An Ansible Tower-based configuration drift detection system for a major Taiwanese bank's multi-environment OpenShift (OCP) estate. More than 30 Ansible Roles pull resource manifests from the SIT, UAT, and production clusters, compare them side by side, and commit the resulting diff reports to GitLab, replacing manual release-window audits with a scheduled, repeatable workflow.

Role: Ansible / OCP Consultant
Period: 2021

Overview

A multi-environment configuration drift detection platform for a major Taiwanese bank's OpenShift estate. Before this system, release engineers manually compared resource manifests across SIT, UAT, and production clusters during every release window — eyeballing YAML diffs across oc get outputs, copying into spreadsheets, and filing tickets for mismatches. The process was error-prone and blocked releases on a single person's availability.

The system replaces that workflow with Ansible Tower schedules that run a library of 30+ purpose-built Ansible Roles against every OCP environment, collect the relevant Kubernetes resource kinds (namespaces, deployments, services, RBAC, secrets, network policies, and more), diff them pairwise, and auto-commit the resulting comparison report to a GitLab repository. A release-gate review now means reading one commit instead of running oc commands by hand.

Scope: OCP core resources, Quay image catalog, Kafka topic config, Elasticsearch cluster state, and DB connectivity. Any environment difference that mattered operationally was covered by a dedicated check_* role.

My Role

On-site consultant embedded with the client's architecture team:

  • Designed the split between get_* roles (fetch state) and check_* roles (compare state) so the pull and the diff could evolve independently
  • Authored the full role library — 30+ roles covering the OCP core kinds plus platform-specific checks (Quay, Kafka, Elasticsearch, DB)
  • Designed custom Ansible Tower credential types (quay_image_compare_user, ocp_robot_account) with field-level secret flags so per-environment tokens could be injected into runs without hardcoding
  • Built the Ansible Tower project / template / schedule wiring and the survey-driven "which environments, which kinds" parameterization
  • Integrated git_acp so every run's output is auto-committed to a versioned GitLab repo, letting release engineers read diffs as GitLab commits
  • Iterated on scope and delivery artifacts across multiple review rounds as the client expanded requirements

Tech Stack

Automation

  • Ansible Tower (survey · credential types · scheduled templates)
  • 30+ Ansible Roles (get_* · check_* · utility)
  • kubernetes.core modules (k8s_info · k8s_definition)
  • git_acp for auto-commit to GitLab

Platforms

  • OpenShift Container Platform (SIT · UAT · prod)
  • Red Hat Quay (container registry)
  • Kafka
  • Elasticsearch

Workflow

  • Scheduled template execution (daily / release-aligned)
  • Robot account credentials per environment
  • Per-relationship custom credential types in Tower

Architecture

The system is a library of small, composable Ansible Roles orchestrated by Ansible Tower, with GitLab as the audit trail:

Tower Layer: Custom credential types hold per-environment OCP robot accounts and Quay tokens. Job Templates are parameterized by survey (which environments to compare, which kinds to check). Schedules run the full comparison at a chosen cadence — typically daily plus on-demand before each release.
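As a rough sketch of how survey answers could drive a run: the play below assumes the survey exposes two variables, target_env and kinds (both names are hypothetical; this write-up does not show the real survey fields), and dispatches to the matching get_* role for each selected kind. Tower passes survey answers as extra vars, which take precedence over the defaults under vars.

```yaml
# Hypothetical driver play; variable names are illustrative, not the project's.
- name: Collect state for the kinds selected in the Tower survey
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    target_env: sit                  # assumed survey field: sit / uat / prod
    kinds: [project, dc, service]    # assumed multi-select survey field
  tasks:
    - name: Run the collector role for each selected kind
      ansible.builtin.include_role:
        name: "get_{{ kind }}"       # resolves to get_project, get_dc, get_service
      loop: "{{ kinds }}"
      loop_control:
        loop_var: kind
```

Because the role name is built from the loop variable, adding a kind to the survey choices only requires that a role with the matching get_ prefix exists.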

Role Library (the core):

  • get_* roles pull state from OCP using k8s_info: get_project, get_dc, get_service, get_secret, get_serviceaccount, get_rolebinding, get_networkpolicy, get_pod, get_user, get_bc
  • check_* roles compare collected state across environments and emit a structured diff: check_project, check_dc, check_service, check_secret, check_serviceaccount, check_rolebinding, check_networkpolicy, check_pod, check_port, check_user, check_system, check_element
  • Platform-specific checks: check_quay (image catalog parity), check_kafka (topic / partition config), check_el (Elasticsearch index and template), an OCP-CHECK-DB role (connectivity and schema)
  • Utility roles: ocp-login, k8s_definition, k8s_info, set_fact, change_login_url, git_acp
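A get_* role's task file might look roughly like the following. This is a sketch rather than the project's code: the variables ocp_api_url and ocp_api_token, the fact name dc_facts, and the choice of fields kept during normalization are all assumptions.

```yaml
# roles/get_dc/tasks/main.yml (illustrative sketch)
- name: Fetch DeploymentConfigs from one environment
  kubernetes.core.k8s_info:
    api_version: apps.openshift.io/v1
    kind: DeploymentConfig
    host: "{{ ocp_api_url }}"        # assumed variable: API URL of the target cluster
    api_key: "{{ ocp_api_token }}"   # assumed variable: token from a prior login step
  register: dc_info

- name: Reduce each object to the fields a checker would need
  ansible.builtin.set_fact:
    dc_facts: >-
      {{ dc_facts | default([]) + [{
           'namespace': item.metadata.namespace,
           'name': item.metadata.name,
           'replicas': item.spec.replicas | default(none)
         }] }}
  loop: "{{ dc_info.resources }}"
  loop_control:
    label: "{{ item.metadata.namespace }}/{{ item.metadata.name }}"
```

Omitting namespace from k8s_info asks for objects across all namespaces the account can read; a real role would probably scope this more tightly.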

Driver Playbooks wire the library for Tower to call:

  • get_ocp_info.yml — collect state from one environment
  • check_ocp_info.yml — compare collected state across environments
  • check_quay.yml, check_system.yml, replica_set.yml — platform-specific top-level flows

Persistence Layer: Every comparison report is auto-committed to a dedicated GitLab repository via git_acp. GitLab becomes the audit trail: release engineers read the commit history ("what changed between today's audit and yesterday's") instead of running oc commands.

Key Challenges

1. The Comparison Scope Is Broad and Heterogeneous

OCP has a long list of resource kinds that matter for configuration parity (Projects, DCs, Services, Secrets, RoleBindings, NetworkPolicies, Pods, Users, BuildConfigs, ServiceAccounts). Each kind has its own comparison semantics: a Secret diff cares about keys but not values, a Pod diff cares about spec but not status, a RoleBinding diff cares about subjects and role refs. A single monolithic "diff everything" pass would be both unreadable and incorrect, because it would flag differences such as Secret values and Pod status that are expected to vary between environments.
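To make the per-kind semantics concrete, here is a minimal sketch of a key-set comparison for one Secret. It assumes sit_secret and uat_secret hold the same-named Secret object as returned by k8s_info from each environment (both variable names are hypothetical). Only key names are compared, never values.

```yaml
- name: Compare the key sets of one Secret across SIT and UAT
  vars:
    # default({}, true) also covers Secrets whose data field is null
    sit_keys: "{{ (sit_secret.data | default({}, true)).keys() | list }}"
    uat_keys: "{{ (uat_secret.data | default({}, true)).keys() | list }}"
  ansible.builtin.set_fact:
    secret_key_diff:
      only_in_sit: "{{ sit_keys | difference(uat_keys) | sort }}"
      only_in_uat: "{{ uat_keys | difference(sit_keys) | sort }}"
  no_log: true   # the inputs contain Secret data even though the result does not
```

A Pod or RoleBinding check would follow the same pattern but select different fields, which is why each kind gets its own role.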

2. Credential Management Across Environments

SIT / UAT / production each have their own OCP robot account and their own Quay token. Baking these into Playbooks is a security disaster; passing them through survey extra vars is operationally brittle. Tower's credential system is the right tool, but it requires designing credential types that know which fields are secrets.

3. Platform-Specific Comparison Logic

The OCP core is the easy part — k8s_info returns it uniformly. Quay image catalogs, Kafka topic configs, and Elasticsearch cluster health all have their own APIs, their own auth models, and their own diff semantics. They cannot be jammed into the same k8s_info-based approach.

4. Comparison Output Has To Be Consumable By Humans

A raw YAML diff between two clusters is unreadable during a release window. The output must surface "here is what actually differs, in the categories that matter" — not a wall of whitespace and key-reorder noise.

5. Iteration Speed

Four rounds of delivery documents over four months — scope grew (new resource kinds, new platforms), diff semantics got refined based on real release experience, role interfaces stabilized. Any architecture that made it expensive to add a new check_* would have broken down.

Solutions & Design Decisions

Split get_* from check_*

The single most impactful design decision. Each get_* role owns one resource kind's collection logic and returns a normalized fact. Each check_* role consumes those facts and emits diffs in its own format. Adding a new kind is two small roles, not a modification to one giant role. Adding a new diff dimension is a new check_*, not a risky edit to the collector.

Custom Tower Credential Types

Defined two credential types in Tower with field-level secret: true flags:

  • quay_image_compare_user — SIT and UAT Quay tokens, both marked secret, exposed to Playbooks as sit_quay_token and uat_quay_token extra vars
  • ocp_robot_account — username + password pair, password marked secret, exposed as ocp_username and ocp_password

Playbook source never contains the raw credential. Tower injects the values as extra vars at run time, so credential rotation is a one-place change.
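A sketch of how a playbook might consume the injected extra vars. The ocp-login role's actual implementation is not shown in this write-up; the example uses the community.okd.openshift_auth module as a stand-in, and ocp_api_url is an assumed variable.

```yaml
- name: Exchange the injected robot account for an API token
  community.okd.openshift_auth:
    host: "{{ ocp_api_url }}"          # assumed variable: one API URL per environment
    username: "{{ ocp_username }}"     # injected by the ocp_robot_account credential
    password: "{{ ocp_password }}"     # injected by the ocp_robot_account credential
  register: ocp_login
  no_log: true                         # keep the password and token out of the job log

- name: Use the token for a read-only query
  kubernetes.core.k8s_info:
    api_version: v1
    kind: Namespace
    host: "{{ ocp_api_url }}"
    api_key: "{{ ocp_login.openshift_auth.api_key }}"
  register: namespace_info
```

Nothing in these tasks names a specific environment's account, so the same playbook can run against SIT, UAT, or production depending on which credential Tower attaches to the job.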

Platform-Specific Roles With The Same Interface Shape

check_quay, check_kafka, check_el, and the OCP-CHECK-DB role each speak to their own platform API internally but emit diffs in the same structural format as the OCP check_* family. The orchestrating playbook doesn't care that Quay is not Kubernetes — the output looks the same.
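One way to picture the shared interface: a platform role calls its own API, then reduces the result to the same record shape an OCP check_* role would emit. The sketch below does this for Elasticsearch cluster health. The variables sit_es_url and uat_es_url and the record fields kind, name, and changed_fields are assumptions made for illustration, not the project's actual format, and authentication is omitted.

```yaml
- name: Read cluster health from each Elasticsearch environment
  ansible.builtin.uri:
    url: "{{ item }}/_cluster/health"
    return_content: true
  loop:
    - "{{ sit_es_url }}"               # assumed variable
    - "{{ uat_es_url }}"               # assumed variable
  register: es_health

- name: Emit a diff record in the same shape as the OCP checks
  vars:
    fields: [status, number_of_nodes]  # the health fields chosen for comparison
    sit: "{{ es_health.results[0].json | dict2items | selectattr('key', 'in', fields) | list }}"
    uat: "{{ es_health.results[1].json | dict2items | selectattr('key', 'in', fields) | list }}"
  ansible.builtin.set_fact:
    es_diff:
      kind: ElasticsearchCluster
      name: cluster_health
      # key/value pairs present in SIT but not identical in UAT
      changed_fields: "{{ sit | difference(uat) | map(attribute='key') | list }}"
```

Downstream reporting then only needs to understand one record shape, regardless of which platform produced it.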

Diff Reports Structured for Human Review

Comparison output is grouped by resource kind, then by object name, then by changed field — with same-named objects that match entirely collapsed into a single line. A release engineer looking at "what changed" reads a three-level tree, not a 10,000-line unified diff.
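The grouping described above could be rendered along these lines. The record shape (kind, name, changed_fields), the sample records, and the output file name are made up for illustration; only report_dir appears elsewhere in this write-up.

```yaml
- name: Render a kind / object / field report (hypothetical layout)
  vars:
    diffs:                             # made-up sample records
      - { kind: Service, name: web, changed_fields: [] }
      - { kind: Service, name: api, changed_fields: [spec.type] }
  ansible.builtin.copy:
    dest: "{{ report_dir }}/report.txt"
    content: |
      {% for kind, objects in diffs | groupby('kind') %}
      {{ kind }}
      {% for obj in objects %}
      {% if obj.changed_fields | length == 0 %}
        {{ obj.name }}: identical
      {% else %}
        {{ obj.name }}
      {% for field in obj.changed_fields %}
          {{ field }}
      {% endfor %}
      {% endif %}
      {% endfor %}
      {% endfor %}
```

Objects with no changed fields collapse to one line, so the reader's attention goes to the entries that expand.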

GitLab Auto-Commit as Audit History

After the comparison runs, git_acp adds the report, commits with a timestamped message, and pushes to a dedicated GitLab repo. The commit log becomes the audit trail: "what was the delta on 2021-06-18 at the pre-release window" is answered by one git log.

Results & Impact

Release Process Impact

  • Manual multi-hour audit eliminated from every release window
  • Comparison now runs on a schedule plus on-demand; release engineers read the GitLab diff instead of running oc commands
  • Scope coverage exceeded what a human could realistically check by hand (every kind, every namespace, every environment)

Platform Coverage

  • 30+ Ansible Roles spanning OCP core kinds, Quay, Kafka, Elasticsearch, DB
  • Roles are orthogonal — get_* vs check_* separation lets the library grow without rewrites

Operational Maturity

  • Tower credential types keep secrets out of Playbooks and out of Git
  • Scheduled templates mean continuous drift visibility, not just release-window spot checks
  • GitLab auto-commit gives every run a permanent, diff-able audit trail

Learnings

Collector / Checker Split Is The Real Architecture

I initially drafted this with one role per "compare this kind" that did both fetch and diff. Because each role then carried its own per-environment fetch logic, adding a new environment meant editing every role. Splitting get_* from check_* turned the system into a library: each role does one small thing and composition happens at the Playbook level. Everything else downstream followed from this one decision.

Tower Credential Types Are Underused

Most Ansible / Tower implementations I see treat credentials as "just extra vars with a secret flag." Defining a proper credential type, with named fields, required flags, and explicit extra-var mapping, makes the credential layer part of the platform contract rather than a per-playbook convention. Rotation, auditing, and RBAC all get simpler because they attach to one named credential object instead of to variables scattered across playbooks.

Git Was The Right Audit Model For This Workflow

Writing to a database is flexible but hard to review. Writing to files on disk is reviewable but ephemeral. Writing to a Git repo is both reviewable and durable: every run is a commit, the diff between runs is git diff, and the history is permanent. For a workflow whose output is "what changed," Git is the natural storage.

Delivery Iterations Revealed the Real Scope

The scope in the March doc and the scope in the July doc were different shapes. Platform-specific additions (Quay, Kafka, Elasticsearch) came from client feedback during rollout — "the OCP parity is great but we also need to know the Quay catalogs are synced." Locking scope early would have shipped the wrong tool.

Deep Dive

Role Library Shape

Orthogonal By Design

The get_* / check_* split is the system's defining shape. A get_* role fetches one resource kind from one cluster and returns a normalized fact. A check_* role consumes facts from multiple clusters and emits a diff. Adding a new environment does not touch any existing role; adding a new kind is two new small roles.

Resource Collection — get_* Roles

Role               | Kubernetes kind
-------------------|---------------------------------
get_project        | Project / Namespace
get_dc             | DeploymentConfig
get_service        | Service
get_secret         | Secret
get_serviceaccount | ServiceAccount
get_rolebinding    | RoleBinding / ClusterRoleBinding
get_networkpolicy  | NetworkPolicy
get_pod            | Pod
get_bc             | BuildConfig
get_user           | User / Group

Comparison — check_* Roles

Role                 | What it compares
---------------------|------------------------------------------------
check_project        | Namespace presence · labels · quotas
check_dc             | Deployment spec · replicas · resources · probes
check_service        | Port spec · selector · service type
check_port           | Exposed ports at cluster edge
check_secret         | Key set (not values) · annotation drift
check_serviceaccount | Name · secret refs · image pull secrets
check_rolebinding    | Subjects · roleRef
check_networkpolicy  | Ingress / egress rules
check_pod            | Running image versions across environments
check_user           | User / group membership
check_system         | Cluster-level parity (nodes · operators)

Platform-Specific Roles

Role          | Platform            | Comparison scope
--------------|---------------------|------------------------------------------------
check_quay    | Red Hat Quay        | Image repository parity · tag sets
check_kafka   | Kafka               | Topic config · partition / replication factor
check_el      | Elasticsearch       | Indices · templates · cluster health
check_element | Generic             | Catch-all for ad-hoc element diffs
OCP-CHECK-DB  | PostgreSQL / Oracle | Connectivity · schema parity

Tower Credential Types

Custom credential type: ocp_robot_account
# Ansible Tower credential type definition
# Input configuration: the fields Tower prompts for and stores
fields:
  - id: ocpusername
    type: string
    label: OCPUsername
  - id: ocppassword
    type: string
    label: OCPPassword
    secret: true          # stored encrypted and masked in the UI
required:
  - ocpusername
  - ocppassword
# Injector configuration: how the stored fields reach the playbook
extra_vars:
  ocp_username: '{{ ocpusername }}'
  ocp_password: '{{ ocppassword }}'

Custom credential type: quay_image_compare_user
# Input configuration: both tokens are marked secret
fields:
  - id: sitquaytoken
    type: string
    label: SITquaytoken
    secret: true
  - id: uatquaytoken
    type: string
    label: UATquaytoken
    secret: true
# Injector configuration: both tokens are injected together
extra_vars:
  sit_quay_token: '{{ sitquaytoken }}'
  uat_quay_token: '{{ uatquaytoken }}'

Why Two Tokens In One Credential?

SIT and UAT Quay tokens are always used together by check_quay — you cannot diff one without the other. Packaging them as a single credential type makes that coupling explicit, prevents "ran with one but not the other" mistakes, and gives rotation a single touch point.
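For illustration, a check along these lines could use both injected tokens in a single pass. The variables sit_quay_host, uat_quay_host, and repo, and the fact name quay_tag_diff, are assumptions; the endpoint is Quay's repository tag listing API, and pagination is ignored for brevity.

```yaml
- name: List active tags for one repository in each Quay registry
  ansible.builtin.uri:
    url: "https://{{ item.host }}/api/v1/repository/{{ repo }}/tag/?onlyActiveTags=true"
    headers:
      Authorization: "Bearer {{ item.token }}"
    return_content: true
  loop:
    - { host: "{{ sit_quay_host }}", token: "{{ sit_quay_token }}" }
    - { host: "{{ uat_quay_host }}", token: "{{ uat_quay_token }}" }
  register: quay_tags
  no_log: true                         # the loop items carry the tokens

- name: Diff the two tag sets
  vars:
    sit_tags: "{{ quay_tags.results[0].json.tags | map(attribute='name') | list }}"
    uat_tags: "{{ quay_tags.results[1].json.tags | map(attribute='name') | list }}"
  ansible.builtin.set_fact:
    quay_tag_diff:
      only_in_sit: "{{ sit_tags | difference(uat_tags) | sort }}"
      only_in_uat: "{{ uat_tags | difference(sit_tags) | sort }}"
```

If either token were missing, the first task would fail before any comparison is attempted, which is the coupling the single credential type is meant to enforce.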

Driver Playbook Layout

PlaybookPurpose
get_ocp_info.ymlCollect state from a target environment (SIT / UAT / prod)
check_ocp_info.ymlDiff the collected OCP-core state across environments
check_quay.ymlCompare Quay image catalogs
check_system.ymlCluster-level parity checks
replica_set.ymlDeploymentConfig replica-count comparison

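Output → GitLab Via git_acp

Per-run commit to the audit repo
# ansible_date_time is only defined when facts have been gathered for the host
- name: Commit comparison report to GitLab
  git_acp:
    path: "{{ report_dir }}"          # local clone that holds the generated report
    comment: "OCP comparison {{ ansible_date_time.iso8601 }} · {{ envs | join(' vs ') }}"
    url: "{{ gitlab_repo_url }}"      # remote audit repository
    branch: main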

The Audit Log Is git log

Every Tower run produces a commit. "What changed between the Friday run and Monday's run" is answered by one git diff. For a drift-detection system, Git is the right audit model — diffable, permanent, human-readable, and reusable as a release-gate artifact.