Prefix Mode

Amazon VPC CNI supports Prefix Mode: instead of placing a single IP address in each secondary IP slot of an ENI, the CNI places a /28 prefix in that slot:

Illustration of two worker subnets, comparing ENI secondary IPs to ENIs with delegated prefixes

As in the default (secondary IP) mode, the CNI keeps a warm pool of addresses ready ahead of time when Prefix Mode is enabled.
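
The size of that warm pool is governed by the aws-node environment variables (WARM_IP_TARGET, MINIMUM_IP_TARGET, and the prefix-specific WARM_PREFIX_TARGET, which defaults to 1). As a quick check you can grep for them; if nothing is printed they are simply unset and the CNI defaults apply:

kubectl describe daemonset aws-node -n kube-system | grep -E 'WARM_IP_TARGET|MINIMUM_IP_TARGET|WARM_PREFIX_TARGET'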

Note that in Prefix Mode each allocated /28 block must be a contiguous range within the subnet. If subnet utilization is high and no contiguous /28 block is free, IP allocation fails with:

failed to allocate a private IP/Prefix address: InsufficientCidrBlocks: There are not enough free cidr blocks in the specified subnet to satisfy the request.
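
If you run into this fragmentation error, one mitigation is to set aside contiguous space up front with an EC2 subnet CIDR reservation of type prefix, so /28 blocks are always carved out of a reserved range. A minimal sketch; the subnet ID and CIDR below are placeholders for your own values:

# Reserve a contiguous range in the subnet for prefix allocation (placeholder values)
aws ec2 create-subnet-cidr-reservation \
  --subnet-id subnet-0123456789abcdef0 \
  --cidr 192.168.32.0/20 \
  --reservation-type prefix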

Next we will run a Prefix Mode experiment. With Prefix Mode disabled, we create 30 Pods on a single m5.large node; because an m5.large supports at most 29 Pods, some of them end up in the Pending state. We then enable Prefix Mode and observe that all 30 Pods can be scheduled on that node.
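
The per-instance-type Pod limits used below can be reproduced with the max-pods-calculator.sh script from the amazon-eks-ami repository (URL and flags as documented at the time of writing; the CNI version string is just an example):

curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/files/max-pods-calculator.sh
chmod +x max-pods-calculator.sh
./max-pods-calculator.sh --instance-type m5.large --cni-version 1.9.0-eksbuild.1                                  # prints 29
./max-pods-calculator.sh --instance-type m5.large --cni-version 1.9.0-eksbuild.1 --cni-prefix-delegation-enabled  # prints 110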

Experiment

1. Deploy an application with 30 Pod replicas on a single **m5.large** node


kongpingfan:~/environment/app $ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-192-168-10-88.us-west-2.compute.internal 
NAMESPACE                  NAME                                                   READY   STATUS    RESTARTS   AGE    IP              NODE                                          NOMINATED NODE   READINESS GATES
default                    redis-replica-txf8g                                    2/2     Running   0          77d    192.168.2.155   ip-192-168-10-88.us-west-2.compute.internal   <none>           <none>
istio-system               kiali-5db6985fb5-2gd8n                                 1/1     Running   0          75d    192.168.12.37   ip-192-168-10-88.us-west-2.compute.internal   <none>           <none>
kube-system                aws-node-bm9lm                                         1/1     Running   0          284d   192.168.10.88   ip-192-168-10-88.us-west-2.compute.internal   <none>           <none>
kongpingfan:~/environment/app $ kubectl label node ip-192-168-10-88.us-west-2.compute.internal   prefixmode=on
node/ip-192-168-10-88.us-west-2.compute.internal labeled
kubectl create ns apps
cat <<EOF >./deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder
  namespace: apps
spec:
  replicas: 30
  selector:
    matchLabels:
      app: placeholder
  template:
    metadata:
      labels:
        app: placeholder
    spec:
      terminationGracePeriodSeconds: 0
      nodeSelector:
        prefixmode: 'on'
      containers:
        - name: placeholder
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.6
          resources:
            requests:
              cpu: 50m
              memory: 100Mi
EOF

🚀 Deploy the application:

kubectl apply -f deploy.yaml

🚀 View the Pods:

export NODE_NAME=$(kubectl get nodes -l prefixmode='on' -o json | jq -r '.items[].metadata.name')
kubectl get pods -A -o wide | grep -E "${NODE_NAME}|$"

After a short delay, you should see that most of the 30 replicas (25+) were placed on the target m5.large node, while the rest remain in a Pending state.
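
To count just the Pending replicas in the apps namespace, a field selector works (a convenience sketch; expect a value of roughly 3 to 5):

kubectl get pods -n apps --field-selector status.phase=Pending --no-headers | wc -l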


🚀 You can verify the node instance type by calling:

kubectl get nodes -l prefixmode='on' -o json | \
 jq -r '.items[].metadata.labels["node.kubernetes.io/instance-type"]'

Counting the Pods on the node confirms that it is full:

kongpingfan:~/environment/app $ kubectl get pods -A -o wide | grep  "${NODE_NAME}" | wc -l
29

There are a few (3 to 5) Pending replicas because an m5.large allows at most 29 Pods to be scheduled, and the node already runs one instance each of the VPC CNI (aws-node) and kube-proxy, plus possibly up to 2 CoreDNS replicas. That leaves room for only 25 to 27 additional Pods. You can inspect the Pods on the node by executing:

export NODE_NAME=$(kubectl get nodes -l prefixmode='on' -o json | jq -r '.items[].metadata.name')
kubectl describe node ${NODE_NAME}

You should see the node's Pods in the Non-terminated Pods section of the output.
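
If you only want the total, the describe output includes a "Non-terminated Pods" header with the count, so a simple grep is enough:

kubectl describe node ${NODE_NAME} | grep 'Non-terminated Pods'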

🚀 Verify the number of allocatable Pods on the m5.large node (should be 29):

kubectl get nodes -l prefixmode='on' -o json | jq -r '.items[].status.allocatable.pods'

🚀 Verify the number of ENIs attached to the instance (should be 3):

export INSTANCE_ID=$(kubectl get nodes -l prefixmode='on' -o json | \
 jq -r '.items[].spec.providerID | split("/") | last')

aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | \
 jq -r '.Reservations[].Instances[].NetworkInterfaces | length'

🚀 Verify the number of allocated IP addresses on the instance (should be 3 * 10 = 30):

export INSTANCE_ID=$(kubectl get nodes -l prefixmode='on' -o json | \
 jq -r '.items[].spec.providerID | split("/") | last')

aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | \
 jq -r '.Reservations[].Instances[].NetworkInterfaces[].PrivateIpAddresses | length'

As expected, all 29 allocatable Pod slots have been filled, which required 3 ENIs with 10 IP addresses each; out of these 30 addresses, 1 is reserved for the instance's primary IP address.
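
For a per-ENI breakdown of the same describe-instances response (purely a convenience view), you can group the counts with jq:

aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | \
 jq -r '.Reservations[].Instances[].NetworkInterfaces[] | "\(.NetworkInterfaceId): \(.PrivateIpAddresses | length) IPs"'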

2. Configure VPC CNI to allow all 30 Pods to fit on a single **m5.large** node

🚀 Verify that ENABLE_PREFIX_DELEGATION in the VPC CNI configuration is still set to false by executing:

kubectl describe daemonset -n kube-system aws-node | grep 'ENABLE_PREFIX_DELEGATION'
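
Prefix delegation requires VPC CNI version 1.9.0 or later. One way to check the running version is to read the image tag of the aws-node DaemonSet:

kubectl describe daemonset aws-node -n kube-system | grep amazon-k8s-cni: | cut -d : -f 3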

🚀 Update the VPC CNI configuration:

kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
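
Changing the environment variable rolls the aws-node DaemonSet; you can wait for the restart to finish before continuing:

kubectl rollout status daemonset aws-node -n kube-system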

🚀 Verify that ENABLE_PREFIX_DELEGATION is set to true:

kubectl describe daemonset -n kube-system aws-node | grep 'ENABLE_PREFIX_DELEGATION'

Prefix delegation only takes effect on newly launched nodes (a node's max-pods value is fixed when its kubelet starts), so to apply the new setting to our workload we have to replace the existing node group. To avoid disrupting the application, we first create a new node group under a new name and delete the old one afterwards.

If you manage node groups through a config.yml file and want it to reflect the state of the cluster, update it by renaming the node group:


# ... rest of the cluster configuration ...
managedNodeGroups:
  # ... other node group configuration ...
  - name: apps-large-mng-001 # <-- HERE
    privateNetworking: true
    instanceType: m5.large
    minSize: 0
    desiredCapacity: 1
    maxSize: 3
    labels:
      role: apps
      type: large
  # ... other node group configuration ...

🚀 Trigger creation/deletion of the node groups:

eksctl create nodegroup --config-file=config.yml --include=apps-large-mng-001
eksctl delete nodegroup --cluster=${CLUSTER_NAME} --name=apps-large-mng --wait
Alternatively (and this is what we do here), create the new node group directly with the eksctl CLI. Note the --node-labels flag, which gives the new nodes the prefixmode=on label that the Deployment's nodeSelector expects:

eksctl create nodegroup \
  --cluster eks-kpf \
  --region us-west-2 \
  --name my-mng \
  --node-type m5.large \
  --nodes 1 \
  --nodes-min 1 \
  --nodes-max 2 \
  --node-labels prefixmode=on

Take the old node out of scope for the Deployment's nodeSelector, then remove the old Pods:

kubectl label node ip-192-168-10-88.us-west-2.compute.internal   prefixmode=off  --overwrite

kubectl delete -f deploy.yaml
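
Since the Deployment was just deleted, re-create it once the new node group is up so the 30 replicas land on the prefix-enabled node:

kubectl apply -f deploy.yaml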

Creating the new node group takes a couple of minutes. You can observe its progress by executing kubectl get nodes and waiting for the node count to go up by one as the new node group comes online (and drop back down once the old node group has been deleted).

🚀 Once the creation and deletion are done, verify that all Pods have been scheduled:

export NODE_NAME=$(kubectl get nodes -l prefixmode='on' -o json | jq -r '.items[].metadata.name')

kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=${NODE_NAME}


All 30 application Pods should now be listed as Running on the new node, alongside the usual per-node system Pods.
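
To double-check the count, filter to Running application Pods on that node (a convenience sketch; the result should be 30):

kubectl get pods -n apps -o wide --field-selector spec.nodeName=${NODE_NAME},status.phase=Running --no-headers | wc -l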

🚀 Verify the number of allocatable Pods on the new node group's single instance (should be 110):

kubectl get nodes -l prefixmode='on' -o json | jq -r '.items[].status.allocatable.pods'

🚀 Verify the number of ENIs attached to the instance (should be 1):

export INSTANCE_ID=$(kubectl get nodes -l prefixmode='on' -o json | \
 jq -r '.items[].spec.providerID | split("/") | last')

aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | \
 jq -r '.Reservations[].Instances[].NetworkInterfaces | length'

🚀 Verify the number of /28 prefixes delegated to the instance (should be 3):

export INSTANCE_ID=$(kubectl get nodes -l prefixmode='on' -o json | \
 jq -r '.items[].spec.providerID | split("/") | last')

aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | \
 jq -r '.Reservations[].Instances[].NetworkInterfaces[].Ipv4Prefixes | length'
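
To see the actual /28 CIDRs that were delegated to the ENI, list them from the same response:

aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | \
 jq -r '.Reservations[].Instances[].NetworkInterfaces[].Ipv4Prefixes[].Ipv4Prefix'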

With prefix delegation enabled, a single ENI on an m5.large instance can hold 9 /28 prefixes (1 of its 10 address slots is reserved for the ENI's primary IP address). This means that a single ENI can sustain up to 9 * 16 = 144 Pods.

Since the recommended Pod limit for instances with fewer than 30 vCPUs is capped at 110, there should never be a need for an additional ENI on an m5.large instance once prefix delegation is enabled.