Amazon VPC CNI supports Prefix Mode (prefix delegation). With it, each secondary IP slot on a network interface is no longer assigned a single IP address but a whole /28 prefix (16 addresses).
Just as without prefix mode, the CNI still pre-warms an IP pool ahead of time when prefix mode is enabled.
Note that the /28 blocks assigned in prefix mode must be contiguous address ranges. If IP utilization in the subnet is high and no free contiguous /28 block is available, the CNI reports an error:
failed to allocate a private IP/Prefix address: InsufficientCidrBlocks: There are not enough free cidr blocks in the specified subnet to satisfy the request.
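If the subnet is likely to become fragmented, one way to avoid this error is to reserve a contiguous block for prefix delegation ahead of time with a subnet CIDR reservation. This is only a sketch; the subnet ID and CIDR below are placeholders, substitute values from your own VPC:
# Reserve a contiguous range that the CNI can carve /28 prefixes from
# (subnet-0123456789abcdef0 and 192.168.16.0/24 are placeholder values).
aws ec2 create-subnet-cidr-reservation \
  --subnet-id subnet-0123456789abcdef0 \
  --reservation-type prefix \
  --cidr 192.168.16.0/24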
Next we will experiment with Prefix Mode. With Prefix Mode disabled, we create 30 Pods on a single m5.large node; because an m5.large supports at most 29 Pods (limited by its ENI IP capacity), some of them end up in the Pending state. We then enable Prefix Mode and see that all 30 Pods can be created on that node.
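Before running the experiment, you can estimate both limits with the max-pods-calculator.sh script from the awslabs/amazon-eks-ami repository. The download path and CNI version below are assumptions; check the repository for the current ones:
# Download the EKS max-pods calculator (path and CNI version may have moved;
# verify against the awslabs/amazon-eks-ami repository).
curl -O https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/files/max-pods-calculator.sh
chmod +x max-pods-calculator.sh
# Without prefix delegation: 3 ENIs * (10 - 1) secondary IPs + 2 = 29
./max-pods-calculator.sh --instance-type m5.large --cni-version 1.12.0-eksbuild.1
# With prefix delegation: capped at the 110-Pod recommendation for instances with fewer than 30 vCPUs
./max-pods-calculator.sh --instance-type m5.large --cni-version 1.12.0-eksbuild.1 --cni-prefix-delegation-enabled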
1. Deploy an application with 30 Pod replicas on a single m5.large node
kongpingfan:~/environment/app $ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-192-168-10-88.us-west-2.compute.internal
NAMESPACE      NAME                     READY   STATUS    RESTARTS   AGE    IP              NODE                                          NOMINATED NODE   READINESS GATES
default        redis-replica-txf8g      2/2     Running   0          77d    192.168.2.155   ip-192-168-10-88.us-west-2.compute.internal   <none>           <none>
istio-system   kiali-5db6985fb5-2gd8n   1/1     Running   0          75d    192.168.12.37   ip-192-168-10-88.us-west-2.compute.internal   <none>           <none>
kube-system    aws-node-bm9lm           1/1     Running   0          284d   192.168.10.88   ip-192-168-10-88.us-west-2.compute.internal   <none>           <none>
kongpingfan:~/environment/app $ kubectl label node ip-192-168-10-88.us-west-2.compute.internal prefixmode=on
node/ip-192-168-10-88.us-west-2.compute.internal labeled
kubectl create ns apps
cat <<EOF >./deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: placeholder
  namespace: apps
spec:
  replicas: 30
  selector:
    matchLabels:
      app: placeholder
  template:
    metadata:
      labels:
        app: placeholder
    spec:
      terminationGracePeriodSeconds: 0
      nodeSelector:
        prefixmode: 'on'
      containers:
        - name: placeholder
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.6
          resources:
            requests:
              cpu: 50m
              memory: 100Mi
EOF
🚀 Deploy the application:
kubectl apply -f deploy.yaml
🚀 View the Pods:
export NODE_NAME=$(kubectl get nodes -l prefixmode='on' -o json | jq -r '.items[].metadata.name')
kubectl get pods -A -o wide | grep -E "${NODE_NAME}|$"
After some delay you should see that, of the 30 replicas, most (25+) were successfully placed on the target m5.large node, while the rest remain in a Pending state.
🚀 You can verify the node instance type by calling:
kubectl get nodes -l prefixmode=on -o json | \
  jq -r '.items[].metadata.labels["node.kubernetes.io/instance-type"]'
kongpingfan:~/environment/app $ kubectl get pods -A -o wide | grep "${NODE_NAME}" | wc -l
29
There are some (3 to 5) Pending replicas because an m5.large allows at most 29 Pods to be scheduled, and each node already runs an instance of the VPC CNI (aws-node) and kube-proxy, plus possibly up to 2 CoreDNS replicas. This leaves room for 25-27 additional Pods. You can observe the Pods on the node by executing:
export NODE_NAME=$(kubectl get nodes -l prefixmode=on -o json | jq -r '.items[].metadata.name')
kubectl describe node ${NODE_NAME}
You should see the node's Pods listed under the Non-terminated Pods section of the output.
🚀 Verify the number of allocatable Pods on the m5.large node (should be 29):
kubectl get nodes -l prefixmode=on -o json | jq -r '.items[].status.allocatable.pods'
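The 29 value comes directly from the instance type's ENI limits; as a cross-check, you can pull them from the EC2 API and apply the standard max-pods formula, ENIs * (IPs per ENI - 1) + 2:
# m5.large exposes 3 ENIs with 10 IPv4 addresses each
aws ec2 describe-instance-types --instance-types m5.large \
  --query 'InstanceTypes[].NetworkInfo.{MaxENIs:MaximumNetworkInterfaces,IPv4PerENI:Ipv4AddressesPerInterface}' \
  --output table
# max pods = 3 * (10 - 1) + 2 = 29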
🚀 Verify the number of ENIs attached to the instance (should be 3):
export INSTANCE_ID=$(kubectl get nodes -l prefixmode=on -o json | \
  jq -r '.items[].spec.providerID | split("/") | last')
aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | \
jq -r '.Reservations[].Instances[].NetworkInterfaces | length'
🚀 Verify the number of private IP addresses allocated on the instance (should be 3 * 10 = 30):
export INSTANCE_ID=$(kubectl get nodes -l prefixmode=on -o json | \
  jq -r '.items[].spec.providerID | split("/") | last')
aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | \
  jq -r '[.Reservations[].Instances[].NetworkInterfaces[].PrivateIpAddresses[]] | length'
As expected, all 29 allocatable Pod slots have been filled, which required 3 ENIs with 10 IPv4 addresses each. Of those 30 addresses, each ENI's primary address is reserved for the interface itself, and the two host-network Pods (aws-node and kube-proxy) need no secondary IPs, which is exactly the max-pods formula 3 * (10 - 1) + 2 = 29.
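To see how the 30 addresses are spread across the interfaces, the same describe-instances call can be broken down per ENI (a small variation on the queries above, reusing INSTANCE_ID):
aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | \
  jq -r '.Reservations[].Instances[].NetworkInterfaces[] | "\(.NetworkInterfaceId): \(.PrivateIpAddresses | length) IPs"'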
2. Configure VPC CNI to allow all 30 Pods to fit on a single m5.large node
🚀 Verify that the VPC CNI configuration ENABLE_PREFIX_DELEGATION is still set to false by executing:
kubectl describe daemonset -n kube-system aws-node | grep 'ENABLE_PREFIX_DELEGATION'
🚀 Update the VPC CNI configuration:
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
🚀 Verify that ENABLE_PREFIX_DELEGATION is now set to true:
kubectl describe daemonset -n kube-system aws-node | grep 'ENABLE_PREFIX_DELEGATION'
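Prefix mode warms the pool in whole /28 blocks rather than individual IPs. If a full spare prefix per node is more than you need, the pool can be tuned with the CNI's warm-pool variables; the value below is only illustrative (1 is already the default):
# WARM_PREFIX_TARGET keeps one spare /28 attached per node; WARM_IP_TARGET and
# MINIMUM_IP_TARGET offer finer-grained control if needed.
kubectl set env daemonset aws-node -n kube-system WARM_PREFIX_TARGET=1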
For the configuration to take effect on our apps-large-mng node group, its nodes have to be replaced, since existing nodes keep their original secondary-IP allocation and max-pods value. To avoid disrupting the application, we first create a new node group under a new name and delete the current one afterwards.
Since we want our config.yml file to reflect the state of the cluster, update it by renaming the node group:
# ... rest of the cluster configuration ...
managedNodeGroups:
  # ... other node group configuration ...
  - name: apps-large-mng-001 # <-- HERE
    privateNetworking: true
    instanceType: m5.large
    minSize: 0
    desiredCapacity: 1
    maxSize: 3
    labels:
      role: apps
      type: large
    # ... other node group configuration ...
🚀 Trigger creation/deletion of the node groups:
eksctl create nodegroup --config-file=config.yml --include=apps-large-mng-001
eksctl delete nodegroup --cluster=${CLUSTER_NAME} --name=apps-large-mng --wait
If you are not managing the cluster through a config file, you can instead create the replacement node group directly and label its nodes for prefix mode:
eksctl create nodegroup \
  --cluster eks-kpf \
  --region us-west-2 \
  --name my-mng \
  --node-type m5.large \
  --nodes 1 \
  --nodes-min 1 \
  --nodes-max 2 \
  --node-labels prefixmode=on
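Once eksctl finishes, confirm the new node carries the prefixmode=on label that the Deployment's nodeSelector expects (the original node will also be listed until it is relabelled in the next step):
kubectl get nodes -l prefixmode=on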
Take the original node out of the experiment and remove the old replicas:
kubectl label node ip-192-168-10-88.us-west-2.compute.internal prefixmode=off --overwrite
kubectl delete -f deploy.yaml
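With the original node opted out and the old replicas removed, re-apply the manifest from step 1 so that all 30 replicas schedule onto the new prefix-enabled node:
kubectl apply -f deploy.yaml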
Creating the new node group takes a couple of minutes. You can follow its progress by running kubectl get nodes and waiting for the new node to register; once the old node group is deleted, its nodes disappear from the list again.
🚀 Once the creation and deletion are done, verify that all Pods have been scheduled:
export NODE_NAME=$(kubectl get nodes -l prefixmode='on' -o json | jq -r '.items[].metadata.name')
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=${NODE_NAME}
All 30 replicas, along with the node's aws-node and kube-proxy Pods, should be listed as Running on the new node.
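You can also count just the application replicas on the node, which should return 30:
kubectl get pods -n apps -o wide --field-selector spec.nodeName=${NODE_NAME} --no-headers | wc -l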
🚀 Verify the number of allocatable Pods on the new node group's single instance (should be 110):
kubectl get nodes -l prefixmode=on -o json | jq -r '.items[].status.allocatable.pods'
🚀 Verify the number of ENIs attached to the instance (should be 1):
export INSTANCE_ID=$(kubectl get nodes -l prefixmode=on -o json | \
  jq -r '.items[].spec.providerID | split("/") | last')
aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | \
jq -r '.Reservations[].Instances[].NetworkInterfaces | length'
🚀 Verify the number of /28 prefixes allocated to the instance (should be 3):
export INSTANCE_ID=$(kubectl get nodes -l prefixmode=on -o json | \
  jq -r '.items[].spec.providerID | split("/") | last')
aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | \
jq -r '.Reservations[].Instances[].NetworkInterfaces[].Ipv4Prefixes | length'
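To see the actual /28 blocks backing these Pods, list the prefixes themselves:
aws ec2 describe-instances --instance-ids ${INSTANCE_ID} | \
  jq -r '.Reservations[].Instances[].NetworkInterfaces[].Ipv4Prefixes[].Ipv4Prefix'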
With prefix delegation enabled, a single ENI on an m5.large instance can hold 9 prefixes (one of its 10 address slots remains reserved for the ENI's primary IP address), so a single ENI can back up to 9 * 16 = 144 Pods.
Since the recommended Pod limit for instances with fewer than 30 vCPUs is capped at 110, there should never be a need for an additional ENI on an m5.large instance.