Amazon Elastic Kubernetes Service (Amazon EKS) helps customers move their container-based workloads to the AWS Cloud. Amazon EKS manages the Kubernetes control plane so customers don’t need to worry about scaling and maintaining Kubernetes components, such as etcd and the application programming interface (API) servers. As a declarative and reconciling system, Kubernetes publishes various events to keep users informed of activity in the cluster, such as spinning up and tearing down pods, deployments, namespaces, and more. Amazon EKS keeps the Kubernetes upstream default event time to live (TTL) of 60 minutes, which can’t be changed. This default is a good balance: it stores enough history for immediate troubleshooting without filling up the etcd database and risking degraded API server performance.
Events in any distributed management system carry a lot of useful information. In the world of container orchestration, and especially with Kubernetes, events are specific objects that show what is happening inside a pod, namespace, or container. There are use cases where it is helpful to have a history of Kubernetes events beyond the default 60-minute window. We also cover how to filter the Kubernetes events (such as only pod events, node events, and so on) that are exported to Amazon CloudWatch, because not every type of Kubernetes resource needs to be kept for a longer duration. You can take this example and modify it based on your requirements.
As mentioned, the Amazon EKS event TTL is set to the upstream default of 60 minutes. Some customers have shown interest in increasing this setting to keep a longer history for debugging purposes. However, Amazon EKS does not allow this setting to be changed, because events filling up the etcd database can cause stability and performance issues. Instead, etcd purges events that are older than 60 minutes. In this post, we provide a solution to capture events beyond 60 minutes, using Amazon CloudWatch as the example destination. We also cover an example use case for examining events stored in CloudWatch.
You need the following to complete the walkthrough:
Let’s start by setting a few environment variables:
export EKS_CLUSTER_NAME=controlplane-events-cluster
export AWS_REGION=<region>
Create a cluster using eksctl:
eksctl create cluster --name $EKS_CLUSTER_NAME --region $AWS_REGION --managed
Creating a cluster can take up to 10 minutes. When the cluster is ready, proceed to the next steps.
Once the Amazon EKS cluster is up and running, it’s time to manage the control plane events. As explained in the previous section, this post provides various examples that show how to manage control plane events with Amazon EKS. We achieve this by creating a Kubernetes deployment whose underlying pod tracks activity from the Amazon EKS control plane and persists the events in CloudWatch. All the source code is provided for you to try and modify based on your needs (such as changing the Python-based control plane events application, the types of events, and so on). The source code and dependencies are containerized using the Dockerfile, and the resulting container image is pushed to a public repository (Amazon Elastic Container Registry (Amazon ECR) in this case). In this post, the container image from the public Amazon ECR repository is deployed to the Kubernetes cluster. As stated above, the solution can be customized to capture only selected events of interest, if needed.
The following diagram provides an overview of the solution architecture:
Based on the previous diagram, the sequence of steps required to achieve the result is as follows:
Let’s work on these step by step.
Step 1: Get the source code from GitHub.
All the source code is available in this GitHub repo.
mkdir control-plane-events-app && cd $_
git clone https://github.com/aws-samples/eks-event-watcher.git
cd eks-event-watcher
As explained in the source code, we connect to the Kubernetes API server and watch for events. The events include pod, namespace, node, service, and persistent volume claim events. The Python script prints out the event results, which eventually get pushed to CloudWatch by Container Insights (which we discuss later). If you want to change the Python script to suit your needs, you can containerize it with the Dockerfile, as explained in the README file in the GitHub repo. You have the flexibility to customize this solution to selected events of interest.
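To get a feel for the data the script collects, you can approximate its behavior manually with kubectl. The following commands are only an illustration of similar API queries, not part of the solution itself:
# List recent events for pods only (the script applies similar per-kind filtering)
kubectl get events --all-namespaces --field-selector involvedObject.kind=Pod
# Stream events for all resource kinds as they occur
kubectl get events --all-namespaces --watch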
How does this deployment capture the events under the hood?
As you can see from the event_watcher Python script, the control plane events application (a pod) loads the Kubernetes configuration, and the script then checks for events every 60 minutes by default. You can change this interval in the Helm chart deployment in the GitHub repo. The script captures the control plane events and pushes them to CloudWatch with Container Insights, thus completing the persistence layer. The high-level flow of this approach is given in the following diagram:
Step 2: Set up Container Insights.
CloudWatch Container Insights is used here to collect, aggregate, and summarize metrics and logs from containerized applications. Container Insights, which is available for Amazon EKS, collects performance data at every layer of the performance stack. Follow this link to enable Container Insights with Fluent Bit.
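For reference, the quick start from that documentation can be applied in a single step, provided the worker node IAM role is allowed to write to CloudWatch (for example, via the CloudWatchAgentServerPolicy managed policy). The commands below are condensed from the AWS quick start for Container Insights with Fluent Bit; the manifest URL and template placeholders may change over time, so verify them against the linked documentation before running:
ClusterName=$EKS_CLUSTER_NAME
RegionName=$AWS_REGION
FluentBitHttpPort='2020'
FluentBitReadFromHead='Off'
[[ ${FluentBitReadFromHead} = 'On' ]] && FluentBitReadFromTail='Off' || FluentBitReadFromTail='On'
[[ -z ${FluentBitHttpPort} ]] && FluentBitHttpServer='Off' || FluentBitHttpServer='On'
curl https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluent-bit-quickstart.yaml | sed 's/{{cluster_name}}/'${ClusterName}'/;s/{{region_name}}/'${RegionName}'/;s/"{{http_server_toggle}}"/"'${FluentBitHttpServer}'"/;s/"{{http_server_port}}"/"'${FluentBitHttpPort}'"/;s/"{{read_from_head}}"/"'${FluentBitReadFromHead}'"/;s/"{{read_from_tail}}"/"'${FluentBitReadFromTail}'"/' | kubectl apply -f -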
Step 3: Deploy the container image to the Kubernetes cluster and validate the deployment.
Let’s deploy the control plane events application by using the following Helm command. As stated in the previous section, this creates the necessary cluster role, cluster role binding, and deployment to ensure the events are accessible to the default service account. It also pulls the container image from the public Amazon ECR repository and deploys it to the Kubernetes/Amazon EKS cluster:
helm install cpe-chart helm/cpe-chart
Now the event_watcher should be in action and will start to collect control plane events every hour from now on. Let’s validate the deployment:
kubectl get deployment cpe
The typical output is similar to the following:
NAME READY UP-TO-DATE AVAILABLE AGE
cpe 1/1 1 1 108s
Step 4: Perform operations on the cluster.
Let’s perform some operations on the cluster to see if those events persisted (as verified in Step 5 through CloudWatch).
Let’s create some nginx pods:
kubectl run nginx --image=nginx
pod/nginx created
kubectl run nginx98 --image=nginx
pod/nginx98 created
Let’s expose one of the deployed pods as a service:
kubectl expose pod nginx98 --port=80
service/nginx98 exposed
Let’s delete one of the pods as well:
kubectl delete pod nginx
pod "nginx" deleted
Let’s try to deploy a mega-pod that cannot be scheduled due to resource constraints. First, run the following command to see the CPU counts on the cluster nodes:
kubectl get nodes -o json|jq -Cjr '.items[] | .metadata.name," ",.metadata.labels."beta.kubernetes.io/instance-type"," ",.status.capacity.cpu, "cpu\n"'|sort -k3 -r
Here we use three t3.small instances with two vCPUs each, as shown in the following output. Note that the node names in your output might be different:
ip-192-168-164-222.us-east-2.compute.internal t3.small 2cpu
ip-192-168-128-39.us-east-2.compute.internal t3.small 2cpu
ip-192-168-112-200.us-east-2.compute.internal t3.small 2cpu
Now try to deploy a resource-heavy pod that requests five vCPUs, as provided in the following code. This pod stays in a perpetual Pending state due to insufficient resources. We can check why it failed to schedule in Step 5:
kubectl apply -f k8s_utils/mega-pod.yaml
pod/mega-pod created
As a final step, let’s try some taints and tolerations. First, let’s taint all the nodes with key1=value1. The following command prints the required taint commands; copy its output and run it from your terminal:
kubectl get nodes --no-headers|awk ' {print "kubectl taint node " $1 " key1=value1:NoSchedule" }'
Let’s deploy an intolerable pod that doesn’t tolerate the above taint. This pod also stays in a perpetual Pending state:
kubectl apply -f k8s_utils/intolerable-pod.yaml
pod/intolerable-pod created
The typical status of all the pods looks like the following:
kubectl get pod
NAME READY STATUS RESTARTS AGE
cpe-6cb595544c-k7s6w 1/1 Running 0 9m
intolerable-pod 0/1 Pending 0 8s
mega-pod 0/1 Pending 0 28s
nginx98 1/1 Running 0 59s
Step 5: Verify control plane events with container insights through CloudWatch.
Now it’s time to verify that the events from the operations in Step 4 (and other events from the cluster) were persisted. Under the hood, as previously discussed, we achieve this using Fluent Bit with Container Insights to ship logs to Amazon CloudWatch.
Log in to the AWS console and head to Amazon CloudWatch. Select Log groups. There should be three log groups for this cluster (application, data plane, and host under /aws/containerinsights/<cluster-name>) as shown in the following diagram.
Check the container logs under the application log group to verify the events’ persistence. Find the cpe-xx log stream to see the control plane events application in action. In the following diagram, we see that the nginx pod creation and deletion events are captured.
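The same records can also be retrieved without the console by querying the application log group with the AWS CLI. This is an optional illustration; it assumes the default Container Insights log group naming shown above and credentials that can read CloudWatch Logs:
# Search the application log group for the nginx pod events captured by event_watcher
aws logs filter-log-events \
  --log-group-name /aws/containerinsights/$EKS_CLUSTER_NAME/application \
  --filter-pattern nginx \
  --region $AWS_REGION \
  --max-items 20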
Also, we see some node events as shown in the following diagram. In a typical real-world scenario, this could be a scale-down event where the node is removed, as provided in the example.
As we discussed in the previous step, the mega-pod could not be scheduled due to resource constraints and the intolerable-pod could not be scheduled due to the taints. You can search for FailedScheduling in the text box to see these in action, as shown in the following diagram.
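The same search works from the CLI as well, using the application log group assumed above:
aws logs filter-log-events \
  --log-group-name /aws/containerinsights/$EKS_CLUSTER_NAME/application \
  --filter-pattern FailedScheduling \
  --region $AWS_REGION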
We can see the service event as well. This is the service that was exposed in Step 4.
Also, some namespace and persistent volume claim events are captured here. If you install Prometheus, you can see events like these as well, as shown in the following example.
To test how well the control plane events application persists events when hundreds of pods are running, we deploy 100 nginx pods in addition to the existing workloads. As you can see from the following Grafana snapshot, there are more than 160 pods running in this example Amazon EKS cluster.
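One simple way to generate that load is a shell loop such as the following. This is only an illustration; a pod name that already exists from Step 4 (such as nginx98) will produce an "already exists" error for that one name, which can be ignored:
# Create 100 additional nginx pods named nginx1 ... nginx100
for i in $(seq 1 100); do kubectl run nginx$i --image=nginx; done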
When checking the pod logs of our control plane events application, we see the last nginx pod event, as shown in the following code:
$ kubectl logs cpe-7dd548bf76-vhmv5
Event: ADDED Pod nginx100
We get the same information from CloudWatch Container Insights as shown in the following diagram.
Since the payload is very small, this event capture mechanism works even with hundreds of pods running in the system.
To avoid incurring future charges, delete the resources:
helm uninstall cpe-chart
kubectl delete service nginx98
kubectl delete po intolerable-pod mega-pod nginx98
eksctl delete cluster --name $EKS_CLUSTER_NAME --region $AWS_REGION
In this post, we showed you how to capture Amazon EKS control plane events without being affected by the 60-minute event TTL limitation. This was accomplished by creating a custom deployment that queries the API server for events every hour. Consequently, there is minimal load on the API server control plane, and the events are persisted in CloudWatch.