OpenShift SRE ❤️ GitOps + Policy as Code

24 December 2023

Tags : acm, gitops, policy, openshift

Within the enterprise - deploying and managing a fleet of OpenShift clusters can be a challenge. There are multiple ways and means to achieve your goals. I will lay out my favourite patterns and methods and a few tips and tricks I commonly use. In particular the methodology around using GitOps and Policy as Code.

GitOps, Everything as Code and Kubernetes Native

Everything as Code is the practice of treating all parts of the systems as code. This means storing the configuration in a Source Code repository such as git. By storing the configuration as code, environments can be life-cycled and recreated whenever they are needed. So why go to this effort ?

(1) Traceability - storing your config in git implies controls are in place to track who made a config change and why. Changes can be applied and reverted, and each change can be traced to the single user who made it.

(2) Repeatable - moving from one cloud provider to another should be simple in modern application development. Picking a deployment target can be like shopping around for the best price that week. By storing all things as code, systems can be re-created quickly in various providers.

(3) Tested - infrastructure and code can be rolled out, validated, promoted into production environments with confidence and assurance it behaves as expected.

(4) Phoenix Server - no more fear of a server's configuration drifting. If a server needs to be patched or just dies, it’s OK. We can recreate it again from the stored configuration.

(5) Shared Understanding - when cross-functional teams use Everything as Code to describe the parts of the Product they are developing together, they increase the shared understanding between Developers and Operations: they speak the same language and use the same frameworks.

So How do we do it - GitOps ?

GitOps is a pattern to manage the flow of work from development to production through Git operations. The concept behind GitOps is quite straightforward.

  • Everything as Code: Git is always the source of truth on what happens in the system

  • Deployments, tests, rollbacks are always controlled through a Git flow

  • No manual deployments/changes: If you need to make a change, you need to make a Git operation such as commit + push, or raise a pull request.

The most popular GitOps tools in use today are ArgoCD and Flux. We use ArgoCD as the GitOps controller in OpenShift. This is supported as the "Red Hat OpenShift GitOps Operator". We can align how our teams use and set up GitOps and their tooling - we are following patterns written about here.

When using OpenShift, we have a strong desire to stick to Kubernetes native methods of configuring the cluster, the middleware that runs upon it, and the applications. I won’t cover deploying resources outside a cluster all that much - this usually needs other tools to help provision them. Some can be configured using the Operator Pattern, some may need tools like Crossplane to provision against cloud APIs. For now, we will assume that we have a hybrid or public cloud that provides storage, compute and networking services - all made available to us.

Code Structure

Take some time to organise your code. When you scale out your configuration to multiple environments/clusters/clouds - you need to be able to scale out individual bits of your repository, especially using folders. We use Kustomize heavily - and its use of bases and overlays encourages folders as the main mechanism for growth. Helm templating is also in use - because we need the flexibility to template applications, even though there is always a level of fungibility with templating languages.

Our main goal is to keep the code maintainable and discoverable. We need new developers to be able to easily on-board to using the code repo, as we would like to keep the burden of making changes very low. There is a continual tension between having one version of a piece of code that is shared across all your environments, (making it easy to maintain) with the trade-off that the blast radius can be large if an erroneous change is made that causes failures. As our codebase matures - we can code and transition around this tension. For example, we may use copy-and-paste reuse heavily at the start of our efforts to keep the blast radius low (to a single cluster) and gradually migrate the code to a single shared artifact as we become happy with its performance over time.

I like to keep my configuration repo as a git monorepo initially. Code is stored in one simple hierarchy.

gitops-monorepo
├── applications      | Infrastructure and application configurations
├── app-of-apps       | Top level pattern to define environments/clusters
├── bootstrap-acm     | Bootstrap our HUB cluster
├── policy-collection | All day#2 config is stored as configuration policy
└── README.md         | Always provide some Help !

The top level is kept quite simple. Applications that may be deployed to different clusters are stored in the applications folder. We use the ArgoCD app-of-apps pattern to describe Applications that are deployed to each HUB cluster. We have a bootstrap folder for our HUB cluster (which is not GitOps) - we deploy ACM and ArgoCD from here. We could make this GitOps as well - however there are often manual steps required to get the environment ready for use e.g. creating an external Vault integration, creating cloud credentials etc. All other use cases e.g. spoke cluster creation, day#2 configuration, application deployments - are done via GitOps.

A common folder structure for a Kustomize based application is shown below for the infrastructure risk compliance application. Here we configure the OpenShift Compliance Operator for all our clusters. We use the base folder for deployment artifacts common to all clusters - including the operator, operator group, scan settings and tailored profiles. We can then use the overlay folder to specify environment (develop | nonprod) and cluster (east | west) specific configuration. In this case we are using the PolicyGenerator in each application definition, which is configured to pull secrets from different locations in vault.

applications/compliance/
├── input
│   ├── base
│   │   ├── kustomization.yaml
│   │   ├── namespace.yaml
│   │   ├── operatorgroup.yaml
│   │   ├── scan-setting-binding.yaml
│   │   ├── subscription.yaml
│   │   └── tailored-profile.yaml
│   └── overlay
│       ├── develop
│       │   ├── east
│       │   │   ├── input
│       │   │   │   └── kustomization.yaml
│       │   │   ├── kustomization.yaml
│       │   │   └── policy-generator-config.yaml
│       │   └── west
│       │       ├── input
│       │       │   └── kustomization.yaml
│       │       ├── kustomization.yaml
│       │       └── policy-generator-config.yaml
│       └── nonprod
│           ├── east
│           │   ├── input
│           │   │   └── kustomization.yaml
│           │   ├── kustomization.yaml
│           │   └── policy-generator-config.yaml
│           └── west
│               ├── input
│               │   └── kustomization.yaml
│               ├── kustomization.yaml
│               └── policy-generator-config.yaml
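As a rough sketch of how one overlay in this tree could be wired together (the file contents below are assumptions for illustration, not copied from the repository): the cluster-level kustomization.yaml hands policy-generator-config.yaml to Kustomize as a generator, and the nested input/kustomization.yaml pulls in the shared base. The two documents are separate files, shown together for brevity.

```yaml
# applications/compliance/input/overlay/develop/east/kustomization.yaml (assumed content)
# Runs the PolicyGenerator plugin using the config file that sits in this folder.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
generators:
  - policy-generator-config.yaml
---
# applications/compliance/input/overlay/develop/east/input/kustomization.yaml (assumed content)
# Reuses the common manifests; cluster-specific patches would be layered on here.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../../../base
```

With this layout the policy-generator-config.yaml would list the local input/ folder as its manifest path, so each cluster overlay generates its own Policy from the same base.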

The one time I break this pattern - is when considering the Production environment. Often in highly regulated industries, production must be treated separately and often has stricter change control requirements surrounding it. This may include different git flows. For small, high trust teams, trunk based development is one of the best methods to keep the flow of changes coming! Often the closer you get to production though, a change in git flow techniques is required. So for Production, pull requests only.

NonProduction repo - no PR's required, trunk based development in place.
Production repo - PR's required.

The trade-off is of course you now have two repositories, often with duplicate code, and must merge from one to the other often. You also have to handle emergency fixes etc. In practice, this is manageable as long as you follow a Software Delivery Lifecycle where changes are made in lower environments first. Your quality and change failure frequency will be better off by doing this. A common pattern is to split out separate applications into separate git repos - and include them as remote repos once they become mature and stable enough.

Hub and Spoke

Open Cluster Management is an open source community that supports managing Kubernetes clusters at scale. Red Hat productises this as "Advanced Cluster Management (ACM)". One of the key concepts is the support of Configuration Policy and Placement on clusters using a hub and spoke design.

The biggest benefit of deploying a HUB cluster with Spokes (managed clusters) is that scale can be achieved by decoupling policy based computation and decisions - which happen on the HUB cluster - from execution - which happens on the target cluster. Execution is completely off-loaded onto the managed cluster itself. Spoke managed clusters do the work and pull configuration from the HUB independently. This means a HUB does not become a single point of failure during steady state operations, and Spoke clusters can number in the hundreds or thousands.

By introducing ArgoCD onto the HUB cluster - we can use it to deploy any application or configuration. The primary method is to package all the code as Configuration Policy. By doing this, we have fantastic visibility into each cluster, we control configuration with Git and drift is kept to zero using GitOps - we like to say "if it’s not in git, it’s not real !"

Another benefit of using ArgoCD is to hydrate secrets from external vault providers like Hashicorp Vault (many others are supported). That way, any and all configuration (not just Kubernetes Secrets that can be mounted in pods) can be hydrated with values from our secrets vault provider, thus keeping secret values outside of Git itself.

There are more complex ArgoCD/ACM models available e.g. the multi-cluster pull and push models. However, the benefit here is one of simplicity - we have fewer moving parts to manage, so it is more anti-fragile. For each environment (develop | nonprod | prod) we deploy separate HUB clusters. That way we can test and promote configuration from the lower environments first (develop | nonprod) before getting to production.

Policy as Code

Policies are one key way for organisations to ensure software is high quality, easy to use and secure. Policy as code automates the decision-making process to codify and enforce policies in our environment. There are generally two types of policies:

  • Configuration Policy

  • Constraint Policy

ACM supports both types of policy. Because OpenShift is architected securely out of the box - there are many day#2 configurations that can be used to manage the platform in the manner required within your organisation.

Managing Operator configurations is one key way, as is applying MachineConfiguration to your cluster or introducing third party configurations. You can get a long way to configuring a secure, spec-compliant cluster without needing to use any Constraint Policy at all. The OpenSource leader in constraint policy is undoubtedly Open Policy Agent (OPA) which uses the rego language to encode constraint policy. There are many other choices that do not require the adoption of a specific language, but rather are pure yaml - Kyverno has wide adoption.
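To make the distinction concrete, here is a minimal sketch of a constraint policy written in pure YAML with Kyverno. The policy name, the label key and the choice to audit rather than block are all assumptions made for illustration; it is not taken from any policy set discussed here.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label            # hypothetical policy name
spec:
  validationFailureAction: Audit      # report violations only; Enforce would reject the request
  background: true                    # also evaluate resources that already exist
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Namespace
      validate:
        message: "Namespaces must carry a 'team' label."
        pattern:
          metadata:
            labels:
              team: "?*"              # hypothetical label key; "?*" means any non-empty value
```

A configuration policy states what must exist on the cluster, whereas a constraint like this one judges objects as they are created or changed.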

There is an open source repository that hosts example policies for Open Cluster Management.

This is a huge benefit as it provides a way to share policies from the community and vendors, as well as removing the burden of having to write many custom policies yourself. Policies are organised under the NIST Special Publication 800-53 specification definitions.

If you follow this naming and grouping convention in your Policy annotations - then you can use the Governance Dashboard in ACM to graphically show you this structure as well.
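A minimal sketch of a Policy carrying these annotations is shown here, wrapping a single ConfigurationPolicy. The policy name, the namespaces and the object being enforced are hypothetical; the annotation keys are the ones the Governance view groups by.

```yaml
apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: policy-example-namespace             # hypothetical
  namespace: open-cluster-management-global-set
  annotations:
    policy.open-cluster-management.io/standards: NIST SP 800-53
    policy.open-cluster-management.io/categories: CM Configuration Management
    policy.open-cluster-management.io/controls: CM-2 Baseline Configuration
spec:
  remediationAction: enforce
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: policy-example-namespace
        spec:
          remediationAction: enforce
          severity: low
          object-templates:
            - complianceType: musthave       # the object must exist as described
              objectDefinition:
                apiVersion: v1
                kind: Namespace
                metadata:
                  name: example-team-ns      # hypothetical namespace to enforce
```

In practice this boilerplate is rarely written by hand; the PolicyGenerator produces it from the categories, controls and standards settings.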

In the above picture we have five OpenShift clusters in our environment using the NIST 800-53 convention for configuration policy. It becomes easy to get an overview of an environment and check for configuration drift. SREs can easily determine that their environment configuration is healthy. They can drill down into individual clusters, or areas of configuration across their entire fleet.

Configuration drift nearly becomes a thing of the past, as GitOps and ACM ensure configuration policy is applied to all clusters and environments. Troubleshooting configuration management can generally be performed by exception only, saving a lot of time and effort.

Even with hundreds of policies applied across multiple clusters, the NIST grouping and policy search allows an SRE to easily find individual policies. So if we wanted to check an Access Control policy - we can see if it is applied in multiple dimensions both across clusters and down to individual cluster level.

Writing policy boilerplate can be very time-consuming. I make heavy use of the awesome PolicyGenerator tool that allows you to specify YAML config using Kustomize (or if you compile this PR you can use Helm via Kustomize as well!) and have the policy generated for you. You can see a number of PolicySets that use the PolicyGenerator that can be used straight away in your code base.

App of Apps

In our mono repo, I like to use the ArgoCD App Of Apps pattern to declaratively specify all the applications that exist in each HUB cluster. You can then drop ArgoCD Application YAML definition files into the folder to easily deploy any number of applications.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: develop-app-of-apps
  namespace: open-cluster-management-global-set
  labels:
    rht-gitops.com/open-cluster-management-global-set: policies
spec:
  destination:
    namespace: open-cluster-management-global-set
    server: 'https://kubernetes.default.svc'
  project: default
  source:
    path: app-of-apps/develop/my-dev-hub-cluster-01
    directory:
      include: "*.yaml"
    repoURL: https://git/gitops-monorepo.git
    targetRevision: main
  syncPolicy:
    automated:
      selfHeal: true
    syncOptions:
    - Validate=true

One thing to note is the careful use of syncPolicy options. I explicitly do not set prune: true, for example, so ArgoCD deletion stays turned off. Instead, tune deletion behaviour using Policy, in particular the PolicyGenerator setting pruneObjectBehavior, which can take values such as None|DeleteAll. It is also worth setting policyAnnotations: {"argocd.argoproj.io/compare-options": "IgnoreExtraneous"} in the PolicyGenerator so that ArgoCD shows the correct sync status.
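For illustration, a child Application dropped into the app-of-apps folder might look like the sketch below. The Application name is hypothetical, and the path and plugin name are assumptions that reuse the compliance overlay and the Kustomize vault plugin described elsewhere in this post.

```yaml
# app-of-apps/develop/my-dev-hub-cluster-01/compliance.yaml (hypothetical file)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: compliance-develop-east            # hypothetical
  namespace: open-cluster-management-global-set
spec:
  project: default
  destination:
    namespace: open-cluster-management-global-set
    server: 'https://kubernetes.default.svc'   # Policy is created on the HUB itself
  source:
    repoURL: https://git/gitops-monorepo.git
    targetRevision: main
    path: applications/compliance/input/overlay/develop/east
    plugin:
      name: argocd-vault-plugin-kustomize      # assumed plugin choice
  syncPolicy:
    automated:
      selfHeal: true                           # no prune here; deletion is left to Policy
```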

ArgoCD Vault Plugin

Managing secrets is an important concern from day zero. The two main methods in popular use today take different approaches. The first keeps encrypted secrets in the codebase. The second - my preferred - is to keep secret values out of our code base altogether by using a secrets vault. There are many ways to achieve this depending on the type of vault in use and the integration points needed at scale. For the GitOps model I drew out earlier, we can make use of the ArgoCD Vault Plugin and the sidecar pattern to hydrate secret values in all of our configuration. This has the benefit of being able to hydrate secret values into Policy code directly as well as creating secrets for pods to mount.

My sidecar configMap for ArgoCD contains the three methods I use to call the AVP plugin using helm, Kustomize or via straight YAML. Note that Kustomize has the helm plugin enabled using these flags --enable-alpha-plugins --enable-helm build:

  helm-plugin.yaml: |
    apiVersion: argoproj.io/v1alpha1
    kind: ConfigManagementPlugin
    metadata:
      name: argocd-vault-plugin-helm
    spec:
      init:
        command: [sh, -c]
        args: ["helm dependency build"]
      generate:
        command: ["bash", "-c"]
        args: ['helm template "$ARGOCD_APP_NAME" -n "$ARGOCD_APP_NAMESPACE" -f <(echo "$ARGOCD_ENV_HELM_VALUES") . | argocd-vault-plugin generate -s open-cluster-management-global-set:team-avp-credentials -']
  kustomize-plugin.yaml: |
    apiVersion: argoproj.io/v1alpha1
    kind: ConfigManagementPlugin
    metadata:
      name: argocd-vault-plugin-kustomize
    spec:
      generate:
        command: ["sh", "-c"]
        args: ["kustomize --enable-alpha-plugins --enable-helm build . | argocd-vault-plugin -s open-cluster-management-global-set:team-avp-credentials generate -"]
  vault-plugin.yaml: |
    apiVersion: argoproj.io/v1alpha1
    kind: ConfigManagementPlugin
    metadata:
      name: argocd-vault-plugin
    spec:
      generate:
        command: ["sh", "-c"]
        args: ["argocd-vault-plugin -s open-cluster-management-global-set:team-avp-credentials generate ./"]
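The -s flag in each plugin points at a Secret (namespace:name) that holds the plugin's connection settings. A minimal sketch of such a Secret for Hashicorp Vault with Kubernetes auth is shown here; the role name and Vault address are hypothetical and the real values depend on how your vault is set up.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: team-avp-credentials
  namespace: open-cluster-management-global-set
type: Opaque
stringData:
  AVP_TYPE: vault                              # secret backend type
  AVP_AUTH_TYPE: k8s                           # authenticate with the repo server's service account token
  AVP_K8S_ROLE: argocd-repo                    # hypothetical Vault role bound to that service account
  VAULT_ADDR: https://vault.example.com:8200   # hypothetical Vault address
```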

And from our ArgoCD ApplicationSet or Application all you need to do is specify the plugin name:

        plugin:
          name: argocd-vault-plugin-kustomize

You can read more about it here.
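For the helm variant of the plugin, values can be passed in through the plugin env block. This sketch assumes a recent ArgoCD, which exposes plugin env entries to the command with an ARGOCD_ENV_ prefix, so HELM_VALUES arrives as the ARGOCD_ENV_HELM_VALUES variable read by the helm plugin above. The chart path, the value name and the vault path are hypothetical.

```yaml
  source:
    repoURL: https://git/gitops-monorepo.git
    path: applications/my-app/chart            # hypothetical chart location
    targetRevision: main
    plugin:
      name: argocd-vault-plugin-helm
      env:
        - name: HELM_VALUES                    # seen by the plugin as ARGOCD_ENV_HELM_VALUES
          value: |
            # hypothetical chart value, replaced by the vault plugin at generate time
            dbPassword: <path:kv/data/path/to/app-secret#password>
```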

Hope you Enjoy! 🔫🔫🔫


OpenShift Install, Semi-Connected Registries and Mirror by Digest Images

12 April 2023

Tags : openshift, gitops, registries, disconnected

I have been working with disconnected OpenShift clusters quite a lot recently. One of the things you need to deal with is disconnected registries and mirror by digest images.

Quay Transparent Proxy-Pull Through Cache

There are a couple of general approaches to configuring registries when disconnected. The product documentation has great depth of detail about using a Quay Mirror Registry. This is the right approach when you want to be fully disconnected. The downside when you are testing things out in a lab is that the mirror import process is both time-consuming and uses a lot of disk space.

One approach I have become fond of is what I call a semi-connected method, where your clusters use a Quay Transparent Proxy-Pull Through Cache to speed things up. This still uses disk space, but you don’t need to import all the images before installing a cluster.

After you install the quay mirror registry on the provisioning host, set this in your config.yaml and restart the quay pods or service:

FEATURE_PROXY_CACHE: true

This setup mimics what you would need to do when disconnected i.e. we always pull from the mirror registry when installing - but it is quicker to test as the mirror registry is connected. When configuring the OpenShift install method, the pull secret I use is just to the mirror. More on that below.

If you also set the cache timeout for your Organisations to be months or even years, then your images will hang around for a long time.

For installing OpenShift, you really need (at a minimum) two mirror organisations. I set up these two (admin is a default):

Where each Organisation points to these registries:

registry-redhat-io -> registry.redhat.io
ocp4-mirror -> quay.io/openshift-release-dev

One nice trick is that you can base64 decode your Red Hat pull-secret (you download this from cloud.redhat.com) and use those credentials in the Organisation mirror registry setup for authentication.

OCP Install Configuration

Now comes the tricky part - configuring your OpenShift installer setup. There are several ways to do this. The one you use depends on your install method and how you wish to control the registries.conf that gets configured for your cluster nodes.

I have been working with the Agent-based installer method for Bare Metal (I fake it on libvirt with sushy) - you can check out all the code here.

The issue I think everyone quickly discovers is that the OpenShift installer sets all mirrors to be by digest i.e. mirror-by-digest-only = true. If you check the installer code, it's here:

Setting mirror by digest to true is intentional, it helps stop image spoofing or getting an image from a moving tag.

Unfortunately not all Operators pull by digest either. In fact the deployments that are part of the openshift-marketplace do not. So after a cluster install we see Image Pull errors like this:

$ oc get pods -n openshift-marketplace
NAME                                   READY   STATUS             RESTARTS      AGE
certified-operators-d2nd9              0/1     ImagePullBackOff   0             15h
certified-operators-pqrlz              0/1     ImagePullBackOff   0             15h
community-operators-7kpbm              0/1     ImagePullBackOff   0             15h
community-operators-k662l              0/1     ImagePullBackOff   0             15h
marketplace-operator-84457bfc9-v22db   1/1     Running            4 (15h ago)   16h
redhat-marketplace-kjrt9               0/1     ImagePullBackOff   0             15h
redhat-marketplace-sqch2               0/1     ImagePullBackOff   0             15h
redhat-operators-4m4gt                 0/1     ImagePullBackOff   0             15h
redhat-operators-62z6x                 0/1     ImagePullBackOff   0             15h

And checking one of the pods we see it is trying to pull by tag:

$ oc describe pod certified-operators-d2nd9
Normal  BackOff  2m2s (x4179 over 15h)  kubelet  Back-off pulling image "registry.redhat.io/redhat/certified-operator-index:v4.12"

Unfortunately you cannot configure ImageContentSourcePolicy for mirror-by-digest-only = false, so (currently) the only solution is to apply MachineConfig after your install as a day#2 thing, as documented in this Knowledge Base Article.

Hopefully in an upcoming OpenShift release (4.13 or 4.14) we will be able to use the new APIs, the ImageDigestMirrorSet and ImageTagMirrorSet CRDs - see the Allow mirroring images by tags RFE for more details on these changes.
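As a sketch of what the tag-based API looks like on releases that ship it, an ImageTagMirrorSet for the operator index images could be written as below. The resource name is hypothetical and the mirror location simply reuses the mirror registry host from this post.

```yaml
apiVersion: config.openshift.io/v1
kind: ImageTagMirrorSet
metadata:
  name: redhat-operator-index                  # hypothetical
spec:
  imageTagMirrors:
    - source: registry.redhat.io/redhat        # images pulled by tag from here...
      mirrors:
        - quay.eformat.me:8443/registry-redhat-io/redhat   # ...are tried against the mirror first
```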

For now though, I use butane and MachineConfig as per the KB article at post install time to configure mirror-by-digest-only = false for my mirror registries that need it. From my git repo:

butane 99-master-mirror-by-digest-registries.bu -o 99-master-mirror-by-digest-registries.yaml
oc apply -f 99-master-mirror-by-digest-registries.yaml
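The butane file itself is not reproduced here. As a rough sketch only, one way such a file could look is shown below, assuming the setting is delivered as a drop-in under /etc/containers/registries.conf.d and that the butane openshift variant for 4.12 is in use. The drop-in file name is hypothetical; check the KB article for the exact content it recommends.

```yaml
variant: openshift
version: 4.12.0
metadata:
  name: 99-master-mirror-by-digest-registries
  labels:
    machineconfiguration.openshift.io/role: master   # change to worker (or another role) as needed
storage:
  files:
    # hypothetical drop-in path; assumed to take precedence for this registry prefix
    - path: /etc/containers/registries.conf.d/99-master-mirror-by-digest-registries.conf
      mode: 0644
      overwrite: true
      contents:
        inline: |
          [[registry]]
            prefix = ""
            location = "registry.redhat.io/redhat"
            mirror-by-digest-only = false
            [[registry.mirror]]
              location = "quay.eformat.me:8443/registry-redhat-io/redhat"
```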

This will reboot your nodes to apply the MachineConfig. You may add or change the butane template(s) and yaml to suit the nodes you need to target e.g. masters or workers (or any other) node role. In my case it’s targeting a SNO cluster so master is fine.

All going well, your marketplace pods should now pull images and run OK:

$ oc get pods -n openshift-marketplace
NAME                                   READY   STATUS    RESTARTS   AGE
certified-operators-d2nd9              1/1     Running   0          16h
community-operators-k662l              1/1     Running   0          16h
marketplace-operator-84457bfc9-v22db   1/1     Running   5          16h
redhat-marketplace-kjrt9               1/1     Running   0          16h
redhat-operators-62z6x                 1/1     Running   0          16h

A word of warning when using the Assisted Installer / Agent Installer method. If you try to set mirror-by-digest-only = false registries in your AgentServiceConfig using the provided ConfigMap e.g. something like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: quay-mirror-config
  namespace: multicluster-engine
  labels:
    app: assisted-service
data:
  LOG_LEVEL: "debug"
  ca-bundle.crt: |
    -----BEGIN CERTIFICATE-----
    ! Put your CA for your mirror registry here !
    -----END CERTIFICATE-----

  registries.conf: |
    unqualified-search-registries = ["registry.redhat.io", "registry.access.redhat.com", "docker.io"]

    [[registry]]
      prefix = ""
      location = "registry.redhat.io/redhat"
      mirror-by-digest-only = false
      [[registry.mirror]]
        location = "quay.eformat.me:8443/registry-redhat-io/redhat"

The registry mirror setting will get reset to mirror-by-digest-only = true by the installer.

Similarly, if you try and set MachineConfig in the ignitionConfigOverride in the InfraEnv e.g.

apiVersion: agent-install.openshift.io/v1beta1
kind: InfraEnv
...
  # Used to modify ignition during discovery
  ignitionConfigOverride: '{"ignition": {"version": "3.1.0"}, "storage": {"files": [{"path": "/etc/containers/registries.conf", "mode": 420, "overwrite": true, "user": { "name": "root"},"contents": {"source": "data:text/plain;base64,dW5xd..."}}]}}'

it also gets overridden by the installer. I tried both these methods and failed 😭😭

Summary

For now, the only way to configure mirror-by-digest-only = false is via MachineConfig post-install.

You can always try to mirror images by digest only, just remember that various operators and components may not be configured to work this way.

The future looks bright with the new APIs, as this has been a long-standing issue now.

🏅Good luck installing out there !!


ACM & ArgoCD for Teams

17 February 2023

Tags : openshift, argocd, acm, gitops

Quickly deploying ArgoCD ApplicationSets using RHACM’s Global ClusterSet

I have written about how we can align our Tech to setup GitOps tooling so that it fits with our team structure.

How can we make these patterns real using tools like Advanced Cluster Management (ACM) that help us deploy to a fleet of Clusters ? ACM supports Policy based deployments so we can track compliance of our clusters to the expected configuration management policy.

The source code is here - https://github.com/eformat/acm-gitops - git clone it so you can follow along.

Global ClusterSets

When a cluster is managed in ACM there are several resources created out of the box - you can read about them here in the documentation. This includes a namespace called open-cluster-management-global-set. We can quickly deploy ApplicationSets in this global namespace that generate Policy to create our team based ArgoCD instances.

We can leverage the fact that ApplicationSets can be associated with a Placement - that way we can easily control where our Policy and Team ArgoCDs are deployed across our fleet of OpenShift clusters by using simple label selectors, for example.
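The generic way to tie an ApplicationSet to a Placement is the clusterDecisionResource generator, sketched below. This is the general form rather than the file from the repository: the ApplicationSet name, the Placement name (placement-global), the source path and the use of the acm-placement ConfigMap are assumptions you would adapt to your setup.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: example-appset                         # hypothetical
  namespace: open-cluster-management-global-set
spec:
  generators:
    - clusterDecisionResource:
        configMapRef: acm-placement            # tells the generator how to read PlacementDecisions
        labelSelector:
          matchLabels:
            cluster.open-cluster-management.io/placement: placement-global   # hypothetical Placement name
        requeueAfterSeconds: 180
  template:
    metadata:
      name: 'example-{{name}}'                 # one Application per selected cluster
    spec:
      project: default
      source:
        repoURL: https://github.com/eformat/acm-gitops.git
        targetRevision: main
        path: team-gitops-policy               # assumed path
      destination:
        server: '{{server}}'
        namespace: open-cluster-management-global-set
      syncPolicy:
        automated:
          selfHeal: true
```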

Bootstrap a Cluster Scoped ArgoCD for our Policies

We are going to bootstrap a cluster-scoped ArgoCD instance into the open-cluster-management-global-set namespace.

We will deploy our Team ArgoCDs using ACM Policy that is generated using the PolicyGenerator tool, which you can read about here from its reference file.

Make sure to label the clusters where you want to deploy to with useglobal=true.

oc apply -f bootstrap-acm-global-gitops/setup.yaml

This deploys the following resources:

  • Subscription Resource - The GitOps operator Subscription, including disabling the default ArgoCD and setting cluster-scoped connections for our namespaces - see the ARGOCD_CLUSTER_CONFIG_NAMESPACES env.var that is part of the Subscription object. If your namespace is not added here, you will get namespace scoped connections for your ArgoCD, rather than all namespaces.

  • GitOpsCluster Resource - This resource provides a Connection between ArgoCD-Server and the Placement (where to deploy exactly the Application).

  • Placement Resource - We use a Placement resource for this global ArgoCD which deploys to a fleet of Clusters, where the Clusters need to be labeled with useglobal=true.

  • ArgoCD Resource - The CR for our global ArgoCD where we will deploy Policy. We configure ArgoCD to download the PolicyGenerator binary, and configure kustomize to run with the setting:

kustomizeBuildOptions: --enable-alpha-plugins
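To give a feel for the first three resources in that list, here is a trimmed-down sketch. It is not the content of setup.yaml: the resource names, the operator channel and the install namespace are assumptions, and only the fields discussed above are shown.

```yaml
# GitOps operator Subscription: disable the default ArgoCD and grant
# cluster-scoped access to the ArgoCD in our global namespace.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-gitops-operator
  namespace: openshift-operators               # assumed install namespace
spec:
  channel: latest                              # assumed channel
  name: openshift-gitops-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  config:
    env:
      - name: DISABLE_DEFAULT_ARGOCD_INSTANCE
        value: "true"
      - name: ARGOCD_CLUSTER_CONFIG_NAMESPACES
        value: open-cluster-management-global-set
---
# Placement: select every managed cluster labeled useglobal=true.
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: placement-global                       # hypothetical
  namespace: open-cluster-management-global-set
spec:
  clusterSets:
    - global
  predicates:
    - requiredClusterSelector:
        labelSelector:
          matchLabels:
            useglobal: "true"
---
# GitOpsCluster: register the clusters chosen by the Placement with ArgoCD.
apiVersion: apps.open-cluster-management.io/v1beta1
kind: GitOpsCluster
metadata:
  name: global-gitops-cluster                  # hypothetical
  namespace: open-cluster-management-global-set
spec:
  argoServer:
    cluster: local-cluster
    argoNamespace: open-cluster-management-global-set
  placementRef:
    apiVersion: cluster.open-cluster-management.io/v1beta1
    kind: Placement
    name: placement-global
```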

Deploy the Team Based ArgoCD using Generated Policy

We are going to deploy ArgoCD for two teams now using the ACM PolicyGenerator.

The PolicyGenerator runs using kustomize. We specify the generator-input/ folder - that holds our YAML manifests for each ArgoCD - in this case one for fteam, one for zteam.

You can run the PolicyGenerator from the CLI to test it out before deploying - download it using the instructions here e.g.

kustomize build --enable-alpha-plugins team-gitops-policy/

We specify the placement rule placement-team-argo - where the Clusters need to be labeled with teamargo=true.

We add some default compliance and control labels for grouping purposes in ACM Governance.

We also set pruneObjectBehavior: "DeleteAll" so that if we delete the ApplicationSet, the generated Policy is deleted and all objects are removed. For this to work, we must also set the remediationAction to enforce for our Policies.

One last configuration is to set the ArgoCD IgnoreExtraneous compare option - as Policy is generated we do not want ArgoCD to be out of sync for these objects.

apiVersion: policy.open-cluster-management.io/v1
kind: PolicyGenerator
metadata:
  name: argocd-teams
placementBindingDefaults:
  name: argocd-teams
policyDefaults:
  placement:
    placementName: placement-team-argo
  categories:
    - CM Configuration Management
  complianceType: "musthave"
  controls:
    - CM-2 Baseline Configuration
  consolidateManifests: false
  disabled: false
  namespace: open-cluster-management-global-set
  pruneObjectBehavior: "DeleteAll"
  remediationAction: enforce
  severity: medium
  standards:
    - generic
  policyAnnotations: {"argocd.argoproj.io/compare-options": "IgnoreExtraneous"}
policies:
  - name: team-gitops
    manifests:
      - path: generator-input/

Make sure to label the clusters where you want to deploy to with teamargo=true.

To create our Team ArgoCDs run:

oc apply -f applicationsets/team-argo-appset.yaml

To delete them, remove the AppSet:

oc delete appset team-argo

Summary

You can now take this pattern and deploy it across multiple clusters that are managed by ACM. You can easily scale out the number of Team Based ArgoCDs and have fine grained control over their individual configuration, including third party plugins like Vault. ACM offers a single pane of glass to check if your clusters are compliant with the generated policies, and if not - take remedial action.

You can see the code in action in this video.

🏅Enjoy !!


Patterns with ArgoCD - Vault

04 November 2022

Tags : argocd, gitops, patterns, vault, security

Team Collaboration with ArgoCD

I have written before about collaborating using GitOps, ArgoCD and Red Hat GitOps Operator. How can we better align our deployments with our teams ?

The array of patterns and the helm chart are described in a fair bit of detail here. I want to talk about using one of these patterns at scale - hundreds of apps across multiple clusters.

Platform Cluster ArgoCD, Tenant Team ArgoCDs

For Product Teams working in large organisations that have a central Platform Team - this pattern is probably the most natural, I think.

Figure - Platform ArgoCD, Namespaced ArgoCD per Team

It allows the platform team to control cluster and elevated privileges, activities like controlling namespaces, configuring cluster resources etc, in their Cluster Scoped ArgoCD, whilst Product Teams can control their namespaces independently of them.

  • The Red Hat GitOps Operator (cluster scoped)

  • Platform Team (cluster scoped) ArgoCD instance

  • Team (namespace scoped) ArgoCD instances

When doing multi-cluster, i usually prefer to have ArgoCD "in the cluster" rather than remotely controlling a cluster. This seems better from an availability / single point of failure point of view. Of course if its a more edge use case, remote cluster connections may make sense.

OK, so the Tenant ArgoCD is deployed in namespaced mode and controls multiple namespaces belonging to a team. For each Team, a single ArgoCD instance per cluster normally suffices. You can scale up and shard the argo controllers, run in HA - not usually necessary at team scale (100 apps) - see the argocd doco if you need to do this though. There may be multiple non-production clusters - dev, test, qa etc and then you will have multiple production clusters (prod + dr etc) - each cluster have their own ArgoCD instances per Team.

All of this is controlled via gitops. A sensible code split is one git repo per team, so one for the Platform Team, one for each Product Team - i normally start with a mono repo and split later based on need or scale.

It is worth pointing out that any elevated cluster RBAC permissions needed by the Product Teams are requested via git PRs into the Platform Team's gitops repo. Once configured, the Tenant team is in control of their namespaces and can get on with managing their own products and tooling.

Secrets Management with ArgoCD Vault Plugin

To make this work at scale and in production within an organisation, the "batteries" for secrets management must be included! They are table stakes really. It’s fiddly, but worth the effort.

There are many ways to do secrets management beyond k8s secrets - KSOPS, External Secrets Operator etc. The method I want to talk about uses the ArgoCD Vault Plugin, which I will abbreviate to AVP. It supports multiple secret backends, by the way. In this case, I am going to use HashiCorp Vault and the Kubernetes auth method. Setting up Vault is dealt with separately, but it can be done on-cluster or off-cluster.

To get AVP working, you basically deploy the ArgoCD repo server with a ServiceAccount and use that account's token to authenticate to Hashi Vault using the k8s auth method. This way each ArgoCD instance authenticates with the token of its own service account. Note that in OpenShift 4.11+, creating a new service account (SA) no longer automatically generates a service account token secret.
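
Because the token secret is no longer created for you, one option is to request a long-lived token yourself by creating a Secret of type kubernetes.io/service-account-token that points at the service account. A minimal sketch, assuming the secret is named after the service account with a -token suffix (my naming, not something the operator requires):

```shell
# Print a Secret manifest asking Kubernetes to populate a long-lived
# token for an existing service account.
# Usage: sa_token_secret <namespace> <service-account>
sa_token_secret() {
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: ${2}-token
  namespace: ${1}
  annotations:
    kubernetes.io/service-account.name: ${2}
type: kubernetes.io/service-account-token
EOF
}

# Example: review the manifest, then pipe it to "oc apply -f -".
sa_token_secret team-ci-cd argocd-repo-vault
```

Once applied, the token and CA certificate can be read straight from that secret by name.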

Once done, our app secrets can be easily referenced from within source code using either annotations:

kind: Secret
apiVersion: v1
metadata:
  name: example-secret
  annotations:
    avp.kubernetes.io/path: "path/to/app-secret"
type: Opaque
data:
  password: <password-vault-key>

or directly via the full path:

  password: <path:kv/data/path/to/app-secret#password-vault-key>

We can also reference the secrets directly from our ArgoCD Application definitions. Here is an example of using helm (kustomize and plain yaml are also supported).

  source:
    repoURL: https://github.com/eformat/my-gitrepo.git
    path: gitops/my-app/chart
    targetRevision: main
    plugin:
      name: argocd-vault-plugin-helm
      env:
        - name: HELM_VALUES
          value: |
            image:
              repository: image-registry.openshift-image-registry.svc:5000/my-namespace/my-app
              tag: "1.2.3"
            password: <path:kv/data/path/to/app-secret#password-vault-key>

I also use a pattern to pass the vault annotation path down to the helm chart from the ArgoCD Application. To keep things clean (and you sane!) I normally have a Vault secret per-application (containing many KV2 - key:value pairs).

    plugin:
      name: argocd-vault-plugin-helm
      env:
        - name: HELM_VALUES
          value: |
            resources:
              limits:
                cpu: 500m         # a non secret value
            avp:
              secretPath: "kv/data/path/to/app-secret"  # use this in the annotations

This allows you to control the path to your secrets in Vault, which can be set by convention, e.g. kv/data/cluster/namespace/app.
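
To make the convention concrete, here is a small sketch with two hypothetical helpers (avp_secret_path and avp_placeholder are names I made up, not part of AVP) that build the path and an inline placeholder from the cluster, namespace and app names:

```shell
# Build the Vault KV2 path by convention: kv/data/<cluster>/<namespace>/<app>
avp_secret_path() {
  echo "kv/data/${1}/${2}/${3}"
}

# Build an inline AVP placeholder for one key stored at a given path.
avp_placeholder() {
  echo "<path:${1}#${2}>"
}

# Hypothetical cluster, namespace and app names.
path="$(avp_secret_path dev-1 my-namespace my-app)"
echo "$path"                                   # kv/data/dev-1/my-namespace/my-app
avp_placeholder "$path" password-vault-key     # <path:kv/data/dev-1/my-namespace/my-app#password-vault-key>
```

Generating the path rather than typing it keeps every Application in a team pointing at a predictable place in Vault.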

ArgoCD Configuration - The Gory Details

OK, great. But how do I get there with my Team ArgoCD ? Let’s take a look in depth at the argocd-values.yaml file you might pass into the gitops-operator helm chart to bootstrap your ArgoCD.

The important bit for AVP integration is to mount the token from a service account that we have created - in this case the service account is called argocd-repo-vault and we set mountsatoken to "true".

Next, we use an initContainer to download the AVP go binary and save it to a custom-tools directory. If you are doing this disconnected, the binary needs to be made available offline.

argocd_cr:
  statusBadgeEnabled: true
  repo:
    mountsatoken: true
    serviceaccount: argocd-repo-vault
    volumes:
    - name: custom-tools
      emptyDir: {}
    initContainers:
    - name: download-tools
      image: registry.access.redhat.com/ubi8/ubi-minimal:latest
      command: [sh, -c]
      env:
        - name: AVP_VERSION
          value: "1.11.0"
      args:
        - >-
          curl -Lo /tmp/argocd-vault-plugin https://github.com/argoproj-labs/argocd-vault-plugin/releases/download/v${AVP_VERSION}/argocd-vault-plugin_${AVP_VERSION}_linux_amd64 && chmod +x /tmp/argocd-vault-plugin && mv /tmp/argocd-vault-plugin /custom-tools/
      volumeMounts:
      - mountPath: /custom-tools
        name: custom-tools
    volumeMounts:
    - mountPath: /usr/local/bin/argocd-vault-plugin
      name: custom-tools
      subPath: argocd-vault-plugin

We need to create the glue between our ArgoCD Applications and how they call/use the AVP binary. This is done using the configManagementPlugins stanza. Note we define three plugins: one for plain YAML, one for helm charts and one for kustomize. The plugin name: is what we reference from our ArgoCD Application.

  configManagementPlugins: |
    - name: argocd-vault-plugin
      generate:
        command: ["sh", "-c"]
        args: ["argocd-vault-plugin -s team-ci-cd:team-avp-credentials generate ./"]
    - name: argocd-vault-plugin-helm
      init:
        command: [sh, -c]
        args: ["helm dependency build"]
      generate:
        command: ["bash", "-c"]
        args: ['helm template "$ARGOCD_APP_NAME" -n "$ARGOCD_APP_NAMESPACE" -f <(echo "$ARGOCD_ENV_HELM_VALUES") . | argocd-vault-plugin generate -s team-ci-cd:team-avp-credentials -']
    - name: argocd-vault-plugin-kustomize
      generate:
        command: ["sh", "-c"]
        args: ["kustomize build . | argocd-vault-plugin -s team-ci-cd:team-avp-credentials generate -"]

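The helm plugin relies on environment variables so that the name, namespace and helm values from the ArgoCD Application are passed through correctly: ArgoCD supplies ARGOCD_APP_NAME and ARGOCD_APP_NAMESPACE, and the HELM_VALUES entry from the Application's plugin env reaches the command as ARGOCD_ENV_HELM_VALUES (recent ArgoCD versions add the ARGOCD_ENV_ prefix). See the AVP documentation for full details of usage.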

One thing to note is the team-ci-cd:team-avp-credentials secret (namespace:name). This specifies how the AVP binary connects and authenticates to Hashi Vault. It is a secret that you need to set up. An example follows for a simple Hashi Vault in-cluster deployment:

export AVP_TYPE=vault
export VAULT_ADDR=https://vault-active.hashicorp.svc:8200   # vault url
export AVP_AUTH_TYPE=k8s                              # kubernetes auth
export AVP_K8S_ROLE=argocd-repo-vault                 # vault role (service account name)
export VAULT_SKIP_VERIFY=true
export AVP_MOUNT_PATH=auth/$BASE_DOMAIN-$PROJECT_NAME

cat <<EOF | oc apply -n ${PROJECT_NAME} -f-
---
apiVersion: v1
stringData:
  VAULT_ADDR: "${VAULT_ADDR}"
  VAULT_SKIP_VERIFY: "${VAULT_SKIP_VERIFY}"
  AVP_AUTH_TYPE: "${AVP_AUTH_TYPE}"
  AVP_K8S_ROLE: "${AVP_K8S_ROLE}"
  AVP_TYPE: "${AVP_TYPE}"
  AVP_K8S_MOUNT_PATH: "${AVP_MOUNT_PATH}"
kind: Secret
metadata:
  name: team-avp-credentials
  namespace: ${PROJECT_NAME}
type: Opaque
EOF
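
One thing the heredoc above will not tell you is when a variable is unset: it happily creates a secret with empty values, and BASE_DOMAIN and PROJECT_NAME in particular are expected to be set beforehand. A small guard, sketched here as a hypothetical require_vars helper, can be run first:

```shell
# Return non-zero and name the culprits if any listed variable is unset or empty.
# Usage: require_vars VAR_NAME...
require_vars() {
  missing=""
  for name in "$@"; do
    # Indirect lookup; the names come from our own fixed list, so eval is safe here.
    eval "value=\${${name}:-}"
    [ -n "$value" ] || missing="${missing} ${name}"
  done
  if [ -n "$missing" ]; then
    echo "missing:${missing}" >&2
    return 1
  fi
}

# Typical use before applying the secret:
#   require_vars BASE_DOMAIN PROJECT_NAME VAULT_ADDR AVP_TYPE AVP_AUTH_TYPE AVP_K8S_ROLE AVP_MOUNT_PATH || exit 1
```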

I am leaving out the gory details of Vault/ACL setup, which are documented elsewhere. However, to configure the Kubernetes auth mount in Vault from the argocd-repo-vault ServiceAccount token, I use this shell script:

export SA_TOKEN=$(oc -n ${PROJECT_NAME} get sa/${APP_NAME} -o yaml | grep ${APP_NAME}-token | awk '{print $3}')
export SA_JWT_TOKEN=$(oc -n ${PROJECT_NAME} get secret $SA_TOKEN -o jsonpath="{.data.token}" | base64 --decode; echo)
export SA_CA_CRT=$(oc -n ${PROJECT_NAME} get secret $SA_TOKEN -o jsonpath="{.data['ca\.crt']}" | base64 --decode; echo)

vault write auth/$BASE_DOMAIN-${PROJECT_NAME}/config \
  token_reviewer_jwt="$SA_JWT_TOKEN" \
  kubernetes_host="$(oc whoami --show-server)" \
  kubernetes_ca_cert="$SA_CA_CRT"
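
For orientation only, the Vault steps either side of that config write usually look something like the following: enabling a Kubernetes auth mount at the matching path beforehand, and creating a role bound to the service account afterwards. The mount name, policy name and TTL below are placeholders I made up; the real values depend on your Vault ACL setup. The helper prints the commands rather than running them.

```shell
# Print the Vault commands that enable a k8s auth mount and bind a role
# to the ArgoCD repo server's service account.
# Usage: vault_k8s_auth_cmds <mount> <namespace> <service-account> <policy>
vault_k8s_auth_cmds() {
  echo "vault auth enable -path=${1} kubernetes"
  echo "vault write auth/${1}/role/${3} bound_service_account_names=${3} bound_service_account_namespaces=${2} policies=${4} ttl=1h"
}

# Hypothetical mount and policy names.
vault_k8s_auth_cmds example.com-team-ci-cd team-ci-cd argocd-repo-vault team-ci-cd-read
```

The role name matches AVP_K8S_ROLE, and the mount matches the part of AVP_MOUNT_PATH after auth/.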

Why Do All of This ?

The benefit of all this gory configuration stuff:

  • we can now store secrets safely in a backend vault at enterprise scale

  • all of our ArgoCD instances use these secrets consistently with gitops in a multi-tenanted manner

  • we keep secret values out of our source code

  • we can control all of this with gitops

It also means that the platform and product teams can manage secrets in a safe, consistent manner - but separately, i.e. each team manages its own secrets and space in Vault. This method also works if you are using the enterprise Hashi Vault that uses namespaces - you can just set the environment variable in your ArgoCD Application like so.

    plugin:
      name: argocd-vault-plugin-kustomize
      env:
        - name: VAULT_NAMESPACE
          value: "my-team-apps"

Tenant teams are now fully in control of their namespaces and secrets and can get on with managing their own applications, products and tools !
