Cross-Cluster NetworkPolicy

EKS does not support NetworkPolicy natively; it requires a plugin such as Calico. Even with NetworkPolicy support, a policy only takes effect within a single EKS cluster. When multiple EKS clusters need to communicate with each other (via VPC Peering or a transit gateway), the traffic between them still cannot be controlled. For example, how do we allow clusterA => nsA => podA to talk only to clusterB => nsB => podB, and to no other Pods?

CiliumNetworkPolicy extends the Kubernetes NetworkPolicy and supports cross-cluster network policies. Here is an example:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-cross-cluster
  namespace: my-client
spec:
  endpointSelector:
    matchLabels:
      app: my-client
      io.cilium.k8s.policy.cluster: cluster-1
  egress:
    - toEndpoints:
        - matchLabels:
            app: my-server
            io.cilium.k8s.policy.cluster: cluster-2
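
The policy above is a namespaced resource and is applied like any other Kubernetes object. A minimal sketch of using it, assuming Cilium Cluster Mesh already connects the two clusters (the io.cilium.k8s.policy.cluster labels correspond to each cluster's configured cluster name), and treating the kubectl context name and file name below as placeholders:

# apply the policy in cluster-1, where the my-client Pods run
kubectl --context cluster-1 -n my-client apply -f allow-cross-cluster.yaml

# confirm the policy object exists
kubectl --context cluster-1 -n my-client get ciliumnetworkpolicy allow-cross-cluster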

Launch a new EKS cluster with Cilium CNI

This is the easiest way to apply Cilium CNI: start with a clean EKS cluster where only the default EKS add-ons, including Amazon VPC CNI, exist.

There is no worry about downtime when deleting the VPC CNI (aws-node Daemonset) and then installing Cilium CNI on the cluster. For more details, refer to the EKS section in https://docs.cilium.io/en/stable/gettingstarted/k8s-install-helm/.
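
For reference, here is a minimal sketch of that flow on a clean cluster, using the same ENI-mode Helm values that appear later in this post (verify the flags against the linked documentation for your Cilium version):

# remove Amazon VPC CNI so it no longer manages pod networking
kubectl -n kube-system delete daemonset aws-node

# install Cilium in ENI mode
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
    --version 1.12.4 \
    --namespace kube-system \
    --set eni.enabled=true \
    --set ipam.mode=eni \
    --set egressMasqueradeInterfaces=eth0 \
    --set tunnel=disabled

# wait until all Cilium components report ready
cilium status --wait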

The downside is that applications have to be migrated from the EKS cluster running Amazon VPC CNI to the new one running Cilium CNI. Most of the time this is a fairly complicated process if zero downtime is required, especially for production environments.

CNI chaining

Another option is to install Cilium CNI in CNI chaining mode, which allows Cilium CNI to work together with Amazon VPC CNI.

As the Cilium documentation describes, the AWS VPC CNI plugin is responsible for setting up the virtual network devices as well as for IP address management (IPAM) via ENIs. After the initial networking is set up for a given pod, the Cilium CNI plugin is called to attach eBPF programs to the network devices set up by the AWS VPC CNI plugin in order to enforce network policies, perform load-balancing and provide encryption.
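
A rough sketch of such an installation, based on the Helm values from the chaining guide linked at the end of this section (treat the exact flag names as something to verify for your Cilium version):

# install Cilium in CNI chaining mode alongside Amazon VPC CNI
helm install cilium cilium/cilium \
    --version 1.12.4 \
    --namespace kube-system \
    --set cni.chainingMode=aws-cni \
    --set enableIPv4Masquerade=false \
    --set tunnel=disabled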

Compared to the third option of replacing Amazon VPC CNI with Cilium CNI on a running EKS cluster, this one only requires refreshing all the running Pods, instead of all the nodes, after installing Cilium CNI. It takes less effort, and it is easier to keep all applications zero-downtime while restarting Pods, since they have multiple replicas and are guarded by PodDisruptionBudgets as well.
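
Refreshing the running Pods can be done with a rolling restart per workload; a small sketch with placeholder names:

# restart a workload so its Pods are recreated and managed by Cilium
kubectl -n <namespace> rollout restart deployment <deployment_name>
kubectl -n <namespace> rollout status deployment <deployment_name>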

The drawback of this option is that some advanced Cilium features may be limited in chaining mode; check the Cilium documentation linked below for the details that apply to your version.

For more detailed instructions on installing Cilium in CNI chaining mode with Amazon VPC CNI, see https://docs.cilium.io/en/v1.12/gettingstarted/cni-chaining-aws-cni/

Replacing Amazon VPC CNI

Replacing Amazon VPC CNI with Cilium CNI on a running EKS cluster is a bit more complicated than the other two approaches.

This approach was inspired by how Meltwater migrated their production Kubernetes clusters from the AWS VPC CNI plugin to Cilium.

The idea is to constrain Amazon VPC CNI (the aws-node Daemonset) to the existing nodes of the EKS cluster and ensure Cilium CNI (the cilium Daemonset) runs only on newly launched nodes.

Then, the nodes running Amazon VPC CNI are gradually drained and removed from the EKS cluster, and the Pods get rescheduled to the newly launched nodes where only Cilium CNI runs.

Once all the nodes with Amazon VPC CNI are removed, Cilium CNI is the only CNI left in the EKS cluster, and the Amazon VPC CNI (aws-node Daemonset) can then be deleted safely.

The biggest benefit of this alternative is that all Cilium CNI features are available without any limits. Of course, it needs more manual steps since all the nodes need to be refreshed.

Steps for Replacing Amazon VPC CNI

Label Existing Nodes

First, all running nodes in the EKS cluster should be labelled with cni-plugin=aws so that the aws-node Daemonset can be constrained to run only on these nodes in the next step.

Here is my script to add the label to all nodes in a loop, checking whether the label has already been added.

# loop over all nodes and add the label cni-plugin=aws if it is missing
for NODE in $(kubectl get node --output=jsonpath='{.items..metadata.name}'); do
    LABELLED=$(kubectl get node $NODE -o json | jq '.metadata.labels | has("cni-plugin")')
    if [ "$LABELLED" = "true" ]; then
        LABEL_VALUE=$(kubectl get node $NODE -o json | jq -r '.metadata.labels["cni-plugin"]')
        echo "Node $NODE already labelled: cni-plugin=$LABEL_VALUE"
    else
        kubectl label nodes $NODE cni-plugin=aws
    fi
done

After running the script, it can be verified with the following commands that all nodes are labelled correctly.

# verify all nodes are labelled correctly
LABELLED_NODE_COUNT=$(kubectl get node -l cni-plugin=aws -o json | jq '.items | length')
ALL_NODE_COUNT=$(kubectl get node -o json | jq '.items | length')

if [ ! "$LABELLED_NODE_COUNT" = "$ALL_NODE_COUNT" ]; then
    echo "Not all nodes labelled."
    echo "Labelled nodes: $LABELLED_NODE_COUNT, total: $ALL_NODE_COUNT"
    exit 1
fi

echo "All nodes are labelled."

Patch aws-node Daemonset

The nodeAffinity field in the Daemonset spec can be used to specify which nodes the Daemonset Pods can run on, so the aws-node Daemonset can be patched as:

kubectl patch daemonset aws-node -n kube-system --patch "$(cat aws-node-patch.yaml)"

Here is the content of aws-node-patch.yaml:

spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  # This is to limit aws-node only running on the nodes
                  # with label cni-plugin=aws
                  - key: cni-plugin
                    operator: In
                    values:
                      - aws
                  # These below are what aws-node already has
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - linux
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - amd64
                      - arm64
                  - key: eks.amazonaws.com/compute-type
                    operator: NotIn
                    values:
                      - fargate
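
After applying the patch, a quick check in the same style as the earlier validation snippet is to compare the desired pod count of the aws-node Daemonset with the number of labelled nodes (a sketch, not part of the original migration scripts):

# aws-node should now be scheduled only on nodes labelled cni-plugin=aws
kubectl -n kube-system rollout status daemonset aws-node

AWS_NODE_DESIRED=$(kubectl get daemonset aws-node -n kube-system -o json | jq '.status.desiredNumberScheduled')
LABELLED_NODE_COUNT=$(kubectl get node -l cni-plugin=aws -o json | jq '.items | length')

if [ ! "$AWS_NODE_DESIRED" = "$LABELLED_NODE_COUNT" ]; then
    echo "aws-node desired pods: $AWS_NODE_DESIRED, labelled nodes: $LABELLED_NODE_COUNT"
    exit 1
fi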

Install Cilium CNI

When replacing Amazon VPC CNI, Cilium CNI needs to do a similar job to what VPC CNI does, allocating AWS ENI IP addresses for each Pod, so eni.enabled=true and tunnel=disabled need to be set.

Meanwhile, the Daemonset of Cilium CNI needs to run on the nodes without the label cni-plugin=aws, the opposite of Amazon VPC CNI.

helm install cilium cilium/cilium \
    --version 1.12.4 \
    --namespace kube-system \
    -f cilium-values.yaml

Here is the content of cilium-values.yaml.

eni:
  enabled: true
ipam:
  mode: eni
egressMasqueradeInterfaces: eth0
tunnel: disabled
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # This is to limit cilium only running on the nodes
        # WITHOUT label cni-plugin=aws
        - key: cni-plugin
          operator: NotIn
          values:
          - aws

For more installation options, please refer to https://docs.cilium.io/en/v1.12/gettingstarted/k8s-install-helm/

After running helm install, the cilium command-line tool can be used to check the installation status.

cilium status --wait

At this point, the pod count of the cilium Daemonset is supposed to be 0, because all existing nodes carry the cni-plugin=aws label; it might be a different value if a new node is launched in the meantime. So I use the following commands to double-check:

# validate cilium daemonset only runs on unlabelled nodes
DESIRED_COUNT=$(kubectl get daemonset cilium -n kube-system -o json | jq '.status.desiredNumberScheduled')
UNLABELLED_NODE_COUNT=$(kubectl get nodes -l '!cni-plugin' -o json | jq '.items | length')

if [ ! "$DESIRED_COUNT" = "$UNLABELLED_NODE_COUNT" ]; then
    echo "cilium daemonset should ONLY run on unlabelled nodes."
    echo "Desired pods: $DESIRED_COUNT, unlabelled nodes: $UNLABELLED_NODE_COUNT"
    exit 1
fi

Refresh Nodes

Now the nodes with the label cni-plugin=aws can be deleted one by one, and their Pods will be rescheduled automatically.

To remove a node from a Kubernetes cluster safely, each node with the label cni-plugin=aws needs to be cordoned, drained and deleted with kubectl, and then terminated in its Auto Scaling group, using the following commands.

NODE="<node_name>"
REGION="<EKS_cluster_region>"

echo "cordon the node: $NODE"
kubectl cordon $NODE

echo "drain the node: $NODE"
kubectl drain $NODE --delete-emptydir-data --ignore-daemonsets --force

echo "remove the node from EKS cluster: $NODE"
kubectl delete node $NODE

NODE_ID=$(aws ec2 describe-instances \
    --region=$REGION \
    --filters Name=private-dns-name,Values=$NODE --no-cli-pager \
    | jq -r ".Reservations[0].Instances[0].InstanceId")

echo "take the node out of autoscaling group: $NODE ($NODE_ID)"
aws autoscaling terminate-instance-in-auto-scaling-group \
    --region=$REGION \
    --instance-id $NODE_ID \
    --should-decrement-desired-capacity \
    --no-cli-pager

Several things need to be verified during the process of refreshing nodes (a few helper commands are sketched after this list):

  • Check that the pod count of the aws-node Daemonset decreases after deleting a labelled node;
  • Check that the pod count of the cilium Daemonset increases if a new node is launched and added to the cluster;
  • Run cilium status to check the status of the cilium Daemonset, cilium-operator, cluster Pods, etc.;
  • Check the status of the application Pods rescheduled to the new nodes.
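
Most of these checks are one-liners; here is a rough set of helpers (the field selector for non-running Pods is just one way to spot unhealthy applications):

# pod counts of both CNI Daemonsets at a glance
kubectl -n kube-system get daemonset aws-node cilium

# overall health of the Cilium agents and operator
cilium status --wait

# application Pods that are not Running or Completed yet
kubectl get pods --all-namespaces \
    --field-selector=status.phase!=Running,status.phase!=Succeeded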

Delete aws-node Daemonset

When all labelled nodes have been deleted from the EKS cluster, the Amazon VPC CNI (aws-node Daemonset) can be deleted as well.

# validate aws-node daemonset has 0 pods
DESIRED_COUNT=$(kubectl get daemonset aws-node -n kube-system -o json | jq '.status.desiredNumberScheduled')

if [ ! "$DESIRED_COUNT" = "0" ]; then
    echo "aws-node still runs on some nodes, please referesh all nodes before deleting it."
    exit 1
fi

echo "No pod of aws-node is running, safely delete it."
kubectl delete daemonset aws-node -n kube-system

As Amazon VPC CNI is an add-on of the EKS cluster, it can now be removed from the AWS console as well.
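
If the CLI is preferred over the console, the managed add-on can be removed with the AWS CLI as well; a sketch with placeholder cluster name and region (this only applies if VPC CNI was installed as a managed EKS add-on):

# remove the managed vpc-cni add-on from the EKS cluster
aws eks delete-addon \
    --cluster-name <EKS_cluster_name> \
    --addon-name vpc-cni \
    --region <EKS_cluster_region> \
    --no-cli-pager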

After completing all the migration steps, the EKS cluster runs on top of Cilium CNI only, and all Cilium features are at hand.
