A Life of a Packet in Amazon EKS

Hemant Rawat
11 min read · Jun 1, 2024


Understanding the flow of a packet in Amazon Elastic Kubernetes Service (EKS) can help in building a well-architected application and can also aid in troubleshooting.

Kubernetes Architecture

Kubernetes Network Model

Kubernetes does not provide a native Pod networking solution. It only specifies the network requirements:

  • Every pod gets its own unique cluster-wide IP address
  • Pods can communicate with all other pods on any other node without Network Address Translation (NAT)
  • Agents on a node (e.g., system daemons, Kubelet) can communicate with all pods on that node

Containers within a pod communicate with each other over the loopback interface, which is also called localhost communication.

Inter-container communications
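
As a quick illustration, assuming a pod named mypod with a main container listening on port 8080 and a sidecar container that has curl available (all hypothetical names), the sidecar reaches the main container over localhost:

$ kubectl exec -it mypod -c sidecar -- curl http://127.0.0.1:8080

The request never leaves the pod's shared network namespace.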

When a pod is created, a network namespace is created for it. To enable communication, the pod network namespace is then connected to the root/host network namespace.

Pod and Root Network Namespace connection

The pod network namespace is connected to the Linux root namespace through a virtual Ethernet (veth) pair, with one end inside the pod and the corresponding veth interface in the root namespace.

The way a veth pair connects to the Linux IP stack varies across solutions, so Kubernetes does not own the configuration of pod connectivity; instead, it delegates this function to the Container Network Interface (CNI). There are default CNI plugins such as loopback, bridge, and IPVLAN, as well as third-party plugins such as Calico, Cilium, etc.

A CNI plugin is basically an executable file that runs on each node and is responsible for several tasks: it adds/deletes network interfaces on the pod, calls the IPAM plugin to assign an IP to the pod's network interface, and sets up the veth pair to connect the pod network namespace to the host.
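
On a worker node, the plugin binaries and their configuration live in the standard CNI locations (the paths are the CNI defaults; the listing below is illustrative and trimmed):

$ ls /opt/cni/bin/
aws-cni  loopback  ...
$ ls /etc/cni/net.d/
10-aws.conflist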

In Amazon EKS, Amazon handles the management of the Kubernetes control plane components, while customers deploy their pods on worker nodes within their VPC.

Amazon EKS

EKS worker nodes are EC2 instances.

EKS uses the Amazon VPC CNI plugin, which performs the following key tasks:

  • sets up / decommissions pod connectivity
  • leverages the EC2 secondary IP mode or EC2 prefix mode feature to assign an IP address to the pod from the VPC CIDR (pod IPs are mapped to an ENI as secondary IPs)
  • manages the ENIs of the node to make sure IPs are available for pods
  • configures the respective routing tables, routing entries, and policy-based routing rules in the node's root network namespace
  • configures routing / ARP entries in the pod network namespace

(The ARP entry for the default gateway IP is configured as a static entry in the pod by the VPC CNI.)

VPC CNI
# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
...
3: eth0@if39: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
...
inet 192.168.1.161/32 ...
...
# ip route
default via 169.254.1.1 dev eth0 proto kernel
169.254.1.1 dev eth0 proto kernel scope link ...
...
# arp -n
Address HWtype HWaddress
169.254.1.1 ether 06:5e:c8:ae:85:be

When we check the interfaces on the node, the MAC address that the pod's ARP table lists for the default gateway actually belongs to the veth interface within the node's root namespace.

$ ip a | grep eni69... -A 3
39: eni69ce....if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ..
link/ether 06:5e:c8:ae:85:be

(The interface index on the node is the actual veth peer: eth0@if39 in the pod corresponds to interface 39 in the node's root namespace.)

Scenario I: Traffic within node (Pod-to-Pod on same node)

Pod 51 generates traffic destined for Pod 61 on the same node.

In the figure below, Pod 51's IP is a secondary IP on eni1, and Pod 61's IP is a secondary IP on eni2.

Pod-to-pod communication within the same node

For a packet destined for Pod 61, the source MAC is Pod 51's MAC (MAC A) and the destination MAC is the veth MAC (MAC B). The node performs a lookup in its policy-based routing rules and matches the entry that directs it to the main routing table.

$ ip rule show
0: from all lookup local
512: from all to 192.168.1.51 lookup main
512: from all to 192.168.1.61 lookup main
...
1536:
$ ip route show table main
default via 192.168.1.1 dev eth0 proto kernel
169.254.169.254 dev eth0 ...
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.50
192.168.1.51 ...
192.168.1.61 dev eni69ec... scope link

In the main routing table, the traffic matches the /32 entry for Pod 61 and is forwarded through the respective veth interface, with the source MAC as the veth MAC (MAC C) and the destination MAC as the pod MAC (MAC D).
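
A quick way to confirm which interface the node uses for a given pod IP is a route lookup from the node's root namespace (the pod IP and interface name are the illustrative values from this example):

$ ip route get 192.168.1.61
192.168.1.61 dev eni69ec... src 192.168.1.50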

Scenario II: Traffic generated by a pod destined for a pod on another node within the same VPC (Pod-to-pod across nodes)

Pod-to-pod communication across nodes

The node performs a lookup in its policy-based routing rules and matches an entry that directs it to a route lookup in the main routing table.

$ ip rule show
0: from all lookup local
512: from all to 192.168.1.51 lookup main
$ ip route show table main
default
...
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.50
...

The Pod 81 IP is on the same subnet as the eni1/eth0 interface of the node, so the traffic is forwarded by the node (the source and destination MACs are the ENIs' MACs).

The receiving node performs a lookup in its policy-based routing rules, matches an entry, performs a route lookup in the main routing table, and the traffic gets forwarded.

$ ip rule show
0: from all lookup local
512: from all to 192.168.1.81 lookup main
$ ip route show table main
default
...
192.168.1.81 dev enid3c0... scope link
...

Return Flow (Pod 81 responds to the request from Pod 51)

The node performs a route lookup in its policy-based routing table:

$ ip rule show
...
1536: from all to 192.168.1.81 lookup 2
$ ip route show table 2
default via 192.168.1.1 dev eth1

If the Pod 81 IP is a secondary IP of a secondary ENI, the traffic will always be sent to the default gateway (the VPC implicit router), even if the source and destination are on the same subnet.

The default gateway then forwards the traffic to the other node, where the receiving node performs a route lookup in its policy routing table and matches the traffic to the main routing table.

SERVICE

Pods are ephemeral and may get a new IP every time they start. A Kubernetes Service is an abstraction that groups pods based on a label selector.

A Kubernetes Service provides a front end for the pods, which are referred to as endpoints in Service terminology. The endpoints controller is responsible for maintaining the list of all these endpoints.

Kubernetes Services support TCP, UDP, and SCTP.

Kubernetes Service

Any inbound request to the DNS name or the IP address of the service is forwarded to one of the pods that is part of that service, just like a load balancer.

The kube-proxy agent running on each node watches the K8s API for new Services and Endpoints. It then programs forwarding rules that redirect traffic destined for a Service to one of its endpoints. Each Service gets a unique virtual IP, allocated by the API server from the service CIDR and persisted in etcd. By default the Service data plane uses iptables to route traffic, but it can leverage other technologies to provide access to the resources.

Labels determine which pods receive traffic from a Service. These labels can be dynamically updated on an object, which may affect which pods continue to receive traffic from the Service.
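
A minimal sketch of a ClusterIP Service that selects pods by label (the service name, label, and ports here are illustrative, not taken from the article):

apiVersion: v1
kind: Service
metadata:
  name: service1
spec:
  selector:
    app: web
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080

The endpoints it currently fronts can be listed with kubectl get endpoints service1.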

There are different types of Kubernetes Service:

  • ClusterIP
  • NodePort
  • LoadBalancer

Each of them addresses a different communication pattern.

  1. ClusterIP is the default Service type. It is used to expose an application on a virtual IP address that is internal to the cluster. This virtual IP is the same on each node and is not announced externally, hence it is internal to the cluster and access to the service is confined within it. When we post a ClusterIP manifest to the Kubernetes API, kube-proxy on each node watches the API and detects that a new service is to be configured. It then configures the virtual IP address of the service and a set of forwarding rules in Linux iptables on each node. At this stage the service becomes accessible from within the cluster.
ClusterIP Service Type

K8s has its own embedded DNS, which automatically assigns an internal DNS name to this service using a predefined format. (K8s DNS is itself implemented as a ClusterIP service backed by two DNS pods.)
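
The predefined format is <service-name>.<namespace>.svc.cluster.local, with cluster.local being the default cluster domain. For example, assuming a Service named service1 in the default namespace (illustrative), it can be resolved from a throwaway pod:

$ kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup service1.default.svc.cluster.local

The answer returns the service's ClusterIP (VIP), not a pod IP.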

When a request arrives at the service virtual IP on a given node, it is destination-NATed to the IP address of one of the pods belonging to that service. With the default iptables backend, the endpoint is chosen pseudo-randomly with equal probability. (Kube-proxy can also be implemented with other data planes such as IPVS or eBPF instead of iptables, in order to enhance the load-balancing algorithm or decision.)
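
These rules can be inspected on any node in the iptables NAT table programmed by kube-proxy. The KUBE-SERVICES, KUBE-SVC-* and KUBE-SEP-* chains are standard kube-proxy chains; the chain hashes, service IP, and probabilities below are illustrative output, trimmed for readability:

$ sudo iptables -t nat -L KUBE-SERVICES -n | grep 10.100.23.77
KUBE-SVC-ABCDEF1234567890  tcp  --  0.0.0.0/0  10.100.23.77  /* default/service1 cluster IP */ tcp dpt:80
$ sudo iptables -t nat -L KUBE-SVC-ABCDEF1234567890 -n
KUBE-SEP-AAAAAAAAAAAAAAAA  all  --  0.0.0.0/0  0.0.0.0/0  statistic mode random probability 0.50000000000
KUBE-SEP-BBBBBBBBBBBBBBBB  all  --  0.0.0.0/0  0.0.0.0/0

Each KUBE-SEP-* chain performs the DNAT to one backing pod IP.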

Packet Flow for ClusterIP

Pod 51 initiates traffic destined for the service VIP. (Pod 51 does not know the service VIP in advance; it first has to resolve the service's DNS name to its VIP. Thus, Pod 51 generates a DNS request for the service's DNS name. This DNS request is sent to the K8s DNS service VIP, which is then load balanced to one of the DNS pods backing the DNS service. Finally, Pod 51 obtains the VIP of the K8s service.)

iptables on Node 51 performs three tasks:

i - load balances traffic to pod 71

ii - applies DNAT (swaps virtual IP with Pod 71 IP)

iii - marks the packet (for return traffic)

(Note: iptables does not differentiate between local and remote pods that are part of the same service.)

iptables is a stateful engine, so it marks the traffic. Node 51 then forwards the traffic to Node 71.

On node 71, the traffic gets forwarded locally to pod 71.

For the return traffic, iptables on Node 51 does the following:

i- identifies this traffic as an existing flow

ii- applies SNAT (service VIP)

2. NodePort is used to expose a service to external resources. It is built on top of ClusterIP and inherits all its characteristics while incorporating additional features.

apiVersion: v1
kind: Service
spec:
  type: NodePort

When we post a manifest with a NodePort service to the API server, kube-proxy watches the API server and configures forwarding rules on each node, along with NAT rules. Additionally, it configures a few rules for two reasons:

1- to process incoming requests on a specific port on the node

2- to forward those requests to the actual service

The configured port is the same on all the nodes and has to come from a specific range (30000-32767 by default).
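
A fuller NodePort sketch (the selector, ports, and the explicit nodePort value are illustrative; if nodePort is omitted, Kubernetes allocates one from the range automatically):

apiVersion: v1
kind: Service
metadata:
  name: service1
spec:
  type: NodePort
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080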

NodePort Service

A request comes in from an external client:

Client source port → NodePort

Client IP → Node 51 IP

VPC implicit router MAC → MAC W

iptables on Node 51 performs four tasks:

  1. load balances the traffic to Pod 71 (this can happen even if there is a local pod that is part of the service)
  2. applies DNAT (swaps Node 51 IP with Pod 71 IP)
  3. applies SNAT (swaps the client IP with Node 51 IP)
  4. marks the packet (for return traffic)

Node 51 forwards the traffic to Node 71

Node source port → Pod port

Node 51 IP → Pod 71 IP

MAC W → MAC Y

Node 71 forwards the traffic locally to pod 71.

Node Source port → Pod port

Node 51 IP → Pod 71 IP

MAC F → MAC E

Return traffic follows the same path:

1)

Pod port → Node source port

Pod 71 IP → Node 51 IP

MAC E → MAC F

2)

Pod port → node source port

Pod 71 IP → Node 51 IP

MAC Y → MAC W

iptables on Node 51:

  1. Identifies this traffic as an existing flow
  2. applies DNAT (swaps Node 51 IP with client IP)
  3. applies SNAT (swaps Pod 71 IP with Node 51 IP)

Node 51 sends traffic back to client.

The overall downside of NodePort is that clients are supposed to send requests explicitly to a node IP and the node port (hence they need to keep track of node IPs).
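
For example, an external client would have to call a worker node directly on the NodePort (the node address is a placeholder; port 30080 follows the sketch above):

$ curl http://<worker-node-ip>:30080/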

3. ServiceType: Load Balancer

Built on top of NodePort, it exposes services externally.

In this case K8s expects an external component to detect the new service and configure a load balancer on the fly, so that the LB starts forwarding requests from clients to the nodes.

ServiceType: LoadBalancer

That external component is a custom controller provided by the respective solution provider, e.g., the AWS Load Balancer Controller.

A K8s Service is a Layer 4 construct; it does not address L7 load balancing.

So, whenever we configure a Service of type LoadBalancer, the AWS Load Balancer Controller provisions an AWS NLB. The NLB performs health checks against the NodePort on each node, and its behavior can be customized by adding metadata to the manifest in the form of annotations.

apiVersion: v1
kind: Service
metadata:
  name: service1
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-name: nlb
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
spec:
  type: LoadBalancer

You can configure NLB with specific parameters like name, public/private, enabling logging, etc.

A request comes in from an external client or service:

Client Source Port → NLB Listener port

Client IP → NLB Listener IP

The load balancer has all the nodes in its target group and performs health checks on the NodePort of each node.
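
The registered targets and their health can be checked with the AWS CLI (the target group ARN is a placeholder):

$ aws elbv2 describe-target-health --target-group-arn <target-group-arn>

With the instance target type, the output lists the worker node instances and their current health state.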

Assume the NLB forwards the traffic to Node 51, which receives it on its NodePort.

NLB Source Port → Node Port

Client IP → 192. (Node IP)

NLB MAC → MAC W

The iptables flow remains the same: 1/ load balancing, 2/ DNAT, 3/ SNAT, 4/ packet marking.

Therefore, a form of traffic tromboning happens here because of how K8s works. How do we handle such sub-optimal routing when the node does not have any local pod associated with that service?

Also, the traffic gets SNATed to achieve flow symmetry (this can cause cross-AZ traffic).

The external traffic policy is a K8s feature that can be configured in the spec section of the Service manifest. When set to Local, the load balancer sends traffic only to the nodes that have pods belonging to that service, and the traffic is only forwarded to pods local to the node. The traffic is also no longer SNATed, allowing the application to see the client IP.
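
A minimal sketch of the relevant part of the Service spec (only the fields shown are assumed; the rest of the manifest stays as before):

spec:
  type: LoadBalancer
  externalTrafficPolicy: Local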

Flow: the client sends a request to the NLB. This time the NLB performs an additional health check on a dedicated health-check port to verify whether a node has any pods that are part of the service. If it does not, the health check fails for that node and it does not receive any traffic from the load balancer.

iptables is still involved here and applies DNAT. How can we make this more efficient?

apiVersion: v1
kind: Service
metadata:
  name:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip

In this mode, the pods that are part of the service become targets in the NLB's target group. Neither NodePort nor iptables is used.

(The NLB's client IP preservation attribute can be used to retain the source IP.)

INGRESS

Ingress is a different mechanism from a K8s Service. It implements L7 load balancing and routing rules, such as host-name or path-based routing definitions.

It allows exposing existing K8s services through HTTP/S rules.
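
A minimal Ingress sketch with host- and path-based rules (the host, service name, and ingress class are hypothetical; on EKS, an Ingress with the alb class is typically reconciled by the AWS Load Balancer Controller into an Application Load Balancer):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  ingressClassName: alb
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: service1
            port:
              number: 80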

Conclusion

Amazon EKS VPC CNI networking provides a combination of features and components to ensure efficient communication within the cluster and with external resources, providing flexibility, scalability, and reliability for containerized applications.
