Kubernetes HA under X-ray

Fabrizio Pandini
Heptio
Apr 12, 2018

<Guest Post from Fabrizio Pandini>

Setting up a reliable, highly available Kubernetes cluster requires:

  • Creating a redundant, reliable storage layer with clustered etcd
  • Starting a replicated, load balanced Kubernetes control plane

This post focuses on the second task, looking closely at two possible alternatives for creating a load balanced Kubernetes control plane.

The first option we consider is the solution implemented in Kubernetes The Hard Way (KTHW) by Kelsey Hightower, one of the best-known guides for setting up a Kubernetes cluster.

Then we discuss an alternative approach that allows the control plane to be easily scaled up and down, the trade-offs it implies, and how to test it using Wardroom, a great tool available from the Heptio Labs repository.

High Availability in Kubernetes the Hard Way

The API server

One of the key parameters for the kube-apiserver is the advertise-address flag.

In KTHW each instance of the kube-apiserver uses its own IP address to advertise itself to other nodes in the cluster; in other words, each kube-apiserver uses its IP address as its own identity (see here).

Then, to implement a reliable, load balanced control plane, all the kube-apiserver instances are added to the target-pool for the Kubernetes front-end load balancer, as shown in the following diagram:

The glue that makes this configuration work is the Kubernetes API server serving certificate, which is passed to every instance of kube-apiserver with --tls-cert-file=/var/lib/kubernetes/kubernetes.pem.
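
Putting these flags together, the relevant portion of each kube-apiserver invocation in KTHW looks roughly like the sketch below. INTERNAL_IP stands for that node's own address, the key file path is assumed to follow the guide's naming, and the many remaining flags are omitted:

# Excerpt of the kube-apiserver flags on each KTHW master; ${INTERNAL_IP} is
# that node's own address. All the other flags required by the guide are omitted.
kube-apiserver \
  --advertise-address=${INTERNAL_IP} \
  --tls-cert-file=/var/lib/kubernetes/kubernetes.pem \
  --tls-private-key-file=/var/lib/kubernetes/kubernetes-key.pem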

In KTHW this certificate is configured to accept requests addressed to all the IP addresses shown in the diagram above. In fact, if you look at the command that creates the certificate (see here), you will see the following list of hostnames/addresses:

${LB_ADDRESS},
10.240.0.10,10.240.0.11,10.240.0.12,
127.0.0.1,
10.32.0.1,kubernetes.default
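
For reference, KTHW generates this certificate with cfssl; the command looks roughly like the sketch below (file names follow the guide's conventions, and the hostname list is the one shown above):

cfssl gencert \
  -ca=ca.pem \
  -ca-key=ca-key.pem \
  -config=ca-config.json \
  -hostname=10.32.0.1,10.240.0.10,10.240.0.11,10.240.0.12,${LB_ADDRESS},127.0.0.1,kubernetes.default \
  -profile=kubernetes \
  kubernetes-csr.json | cfssljson -bare kubernetes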

As you might have already noticed, the serving certificate is configured to accept requests addressed to three additional hostnames/addresses beyond the load balancer and the master IPs. Why?

Scheduler and Controller-manager

In KTHW the instances of kube-scheduler and kube-controller-manager on each master use --master=http://127.0.0.1:8080 to interact specifically with the local instance of the kube-apiserver.

The last bit needed for a reliable scheduler and a reliable controller-manager is to set the --leader-elect=true flag on both of them; this ensures that only one instance of the scheduler and one instance of the controller-manager are actively scheduling/controlling at any given time (see here and here).
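
Putting the two settings together, each master runs something along these lines (a sketch; all remaining flags are omitted):

# On every master: talk only to the local kube-apiserver and rely on leader
# election so that a single instance is active at any given time.
kube-scheduler \
  --leader-elect=true \
  --master=http://127.0.0.1:8080

kube-controller-manager \
  --leader-elect=true \
  --master=http://127.0.0.1:8080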

The Kubernetes Service

There are two entries into the kube-apiserver serving certificate not analyzed yet, the IP address 10.32.0.1 and the DNS name, kubernetes.default.

Well, as you might already know, Kubernetes automatically creates a service named kubernetes in the default namespace, that is kubernetes.default in the internal DNS notation; this service provides a well-known, portable, in-cluster way for pods to access the API server.

The Kubernetes Service is assigned the first address in the service address space defined by the --service-cluster-ip-range flag of the kube-apiserver (see here), which in KTHW is 10.32.0.1.
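
For reference, a sketch of the relevant flag, assuming the 10.32.0.0/24 service range used by the guide:

# Excerpt of the kube-apiserver flags; the first address in this range,
# 10.32.0.1, is reserved for the kubernetes Service.
--service-cluster-ip-range=10.32.0.0/24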

If you look at the definition of the Kubernetes Service you can see that this IP is used; you will also notice that this is a service without selectors, and in fact in KTHW there are no Pods behind this service, only systemd-managed binaries; the endpoints of this service are instead configured directly by the API server.

$ kubectl describe service kubernetes
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP: 10.32.0.1
Port: https 443/TCP
TargetPort: 6443/TCP
Endpoints: 10.240.0.10:6443,10.240.0.11:6443,10.240.0.12:6443
Session Affinity: ClientIP
Events: <none>

Recap of part 1

As discussed, one of the pivotal elements of the KTHW highly available Kubernetes cluster is the API server certificate.

To properly define this certificate you need to know the target layout of your cluster in advance (e.g. the IPs of the master nodes).

The downside of this approach is that certificates can’t be changed easily. If the IP addresses or DNS names of your control plane nodes change, you will have to recreate the API server certificate, copy it to all of the control plane nodes, and then restart all of the kube-apiserver instances.

This can be an acceptable trade-off in many cases, but it probably isn’t the best solution if you prefer more dynamic management of your control plane nodes because:

  • you want to be able to quickly replace a control plane node in the case of failure
  • you would like to easily replace your control plane nodes when a new release of Kubernetes is available, instead of managing the risk and/or complexity of performing in-place upgrades of the existing nodes
  • you want to increase or decrease the number of master nodes over time, providing the ability to adapt your cluster to the relevance/criticality/size of workloads

But how can we achieve a more dynamic, highly available configuration in Kubernetes?

Another approach to High Availability in Kubernetes

The API server

In order to make the cluster control plane dynamic, we should reduce the number of elements that are hard-coded in the API server certificate, and more specifically remove the IP addresses of the individual kube-apiserver instances. Accordingly, we will create the certificate with the following hostnames/IP addresses only:

${LB_ADDRESS},
127.0.0.1,
10.32.0.1,kubernetes.default

This provides a certificate that can be used without any changes even if the layout of your control plane changes by adding, replacing or removing control plane instances.
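
If, as in KTHW, you generate certificates with cfssl, the only change to the command shown earlier is the shorter hostname list (a sketch under that assumption):

cfssl gencert \
  -ca=ca.pem \
  -ca-key=ca-key.pem \
  -config=ca-config.json \
  -hostname=10.32.0.1,${LB_ADDRESS},127.0.0.1,kubernetes.default \
  -profile=kubernetes \
  kubernetes-csr.json | cfssljson -bare kubernetes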

But this is not enough.

To completely remove the master IP addresses from our Kubernetes setup we have to configure --advertise-address=$LB_ADDRESS on every kube-apiserver; in other words, we give all the kube-apiserver instances the same identity.
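
Compared with the KTHW excerpt above, the change boils down to the advertised identity; a minimal sketch of the flags on every master (remaining flags again omitted):

# Every master advertises the load balancer address and keeps listening on all
# of its local interfaces; the serving certificate is the reduced one from above.
kube-apiserver \
  --advertise-address=${LB_ADDRESS} \
  --bind-address=0.0.0.0 \
  --tls-cert-file=/var/lib/kubernetes/kubernetes.pem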

The resulting diagram is the following:

Let’s clarify some elements about this configuration:

1) The --advertise-address flag should be set to a stable address or VIP.

2) Each instance of kube-apiserver will continue to listen on its own addresses, because the --bind-address=0.0.0.0 flag value means listening on all the local network interfaces. Accordingly, e.g.

  • kube-apiserver on master 1 will be listening for external calls on 10.240.0.10, as in KTHW, even though it now advertises itself as $LB_ADDRESS.
  • kube-apiserver on master 2 will be listening on 10.240.0.11
  • kube-apiserver on master 3 will be listening on 10.240.0.12

As a consequence, it is still possible to interact with each local instance of the kube-apiserver through the IP addresses of the control plane nodes, but for the request to be validated against the new API serving certificate, the Host header of any request must be set to a hostname that is valid for the certificate. This is an acceptable trade-off because most load balancers can be configured to do this for you.
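
For example, assuming client credentials named as in KTHW (admin.pem and admin-key.pem, hypothetical here), the difference shows up in certificate verification:

# Verification succeeds: ${LB_ADDRESS} is listed in the serving certificate.
curl --cacert ca.pem --cert admin.pem --key admin-key.pem \
  https://${LB_ADDRESS}:6443/version

# Verification fails: 10.240.0.10 is no longer listed in the certificate,
# even though the kube-apiserver on master 1 still listens on that address.
curl --cacert ca.pem --cert admin.pem --key admin-key.pem \
  https://10.240.0.10:6443/version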

3) The new API server certificate, with its reduced list of hostnames, can no longer also be used for securing etcd as it is in KTHW. As a consequence, a dedicated certificate for etcd is required when implementing this approach to high availability.

This is an acceptable trade-off, and if your objective is to implement a highly dynamic control plane setup, you should consider moving etcd to a dedicated set of machines, decoupling the lifecycle of etcd from the lifecycle of the master nodes.

Scheduler and Controller-manager

The scheduler and kube-controller-manager can be configured the same way as they are in KTHW, that is:

--leader-elect=true
--master=http://127.0.0.1:8080

A possible small variation of this approach is to have the scheduler and controller-manager talk to the kube-apiserver on the secure address https://127.0.0.1:6443, and to omit the --insecure-bind-address argument. Be aware that using a secure connection requires the scheduler and controller-manager to have their own credentials, provided via dedicated kubeconfig files.
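
In that case the --master flag is replaced by a --kubeconfig flag pointing at the secure endpoint; a minimal sketch, with hypothetical file paths:

# Each kubeconfig sets its server to https://127.0.0.1:6443 and carries the
# component's own client certificate and key.
kube-scheduler \
  --leader-elect=true \
  --kubeconfig=/var/lib/kube-scheduler/kube-scheduler.kubeconfig

kube-controller-manager \
  --leader-elect=true \
  --kubeconfig=/var/lib/kube-controller-manager/kube-controller-manager.kubeconfig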

The Kubernetes service

The Kubernetes service is managed by the API server.

If your cluster is configured with the same --advertise-address=$LB_ADDRESS for all instances of the kube-apiserver, the Kubernetes Service will register only one endpoint.

$ kubectl describe svc kubernetes
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP: 10.32.0.1
Port: https 443/TCP
TargetPort: 6443/TCP
Endpoints: $LB_ADDRESS:6443
Session Affinity: ClientIP
Events: <none>

What are the effects of this configuration?

As you might have guessed, with the above configuration all communication to the Kubernetes Service is sent to the front-end load balancer, rather than being balanced by local iptables rules as it is for other services.

Whether this is an acceptable trade-off mostly depends on the type of workloads. In many cases, generic workloads deployed inside Kubernetes do not interact with the API server and so they are not impacted at all.

In turn, if you rely on Kubernetes-specific components that leverage the Kubernetes Service, such as custom controllers, you should consider whether the reliance on an external load balancer to access the Kubernetes Service is acceptable or not. If it is not, then further work will be needed when designing your network infrastructure.
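
You can verify this from any node: the Service has a single endpoint, and the rules that kube-proxy programs for it point at the load balancer rather than at individual masters (a sketch, assuming the default iptables proxy mode):

# The kubernetes Service is backed by a single endpoint, the load balancer.
kubectl get endpoints kubernetes

# The corresponding NAT rules generated by kube-proxy (identified by their
# "default/kubernetes" comment) therefore all lead to $LB_ADDRESS:6443.
sudo iptables -t nat -S | grep 'default/kubernetes'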

Demo time! Enter HeptioLabs and Wardroom

Heptio Labs is a collection of experimental projects used by Heptio engineers in their day-to-day work, and, in accordance with the open source DNA of Heptio, such projects are developed openly and collaboratively with the community.

Among the many other little gems you can find in the Heptio Labs repository there is Wardroom, a tool that leverages Packer and Ansible to create golden images for Kubernetes deployments across a variety of operating systems and image formats (currently CentOS 7.4 and Ubuntu for operating systems and AMI for image formats, with more to follow).

For the sake of this post, we are going to use only a small part of Wardroom: the contents of the wardroom/swizzle directory, a sample implementation that showcases how one might further leverage the Wardroom Ansible playbooks to quickly deploy a local, highly available Kubernetes cluster on Vagrant. To bootstrap your cluster, simply run:

git clone https://github.com/heptiolabs/wardroom.git
cd wardroom/swizzle
python provision.py

After creating the necessary Vagrant boxes, 3 masters and 1 worker node, the provisioning process starts and the Ansible playbook executes the following macro steps:

  • Initializes a randomly chosen primary master node using kubeadm, configured as defined above.
  • Scales up the control plane by copying the certificates created by kubeadm to the other masters, and then completes the bootstrap of the second and third master nodes by invoking kubeadm again (in this case kubeadm uses the provided certificates instead of creating new ones)
  • Joins a worker node

As soon as the provisioning script completes, you can connect to any master node, e.g.

vagrant ssh [master 1]

Once connected, run the following commands to set up kubectl for your user:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Then you can check that all the nodes in the new Kubernetes cluster are ready (this will take a few minutes):

kubectl get nodes --watch

Once everything is up and running, you can check the key settings of the approach to high availability in Kubernetes described in the previous section (please note that the following parameters should be identical on all master nodes):

  • The list of IPs in the API server certificate, which is expected to contain only the load balancer address, in this case 10.10.10.3, and the address of the kubernetes Service, 10.96.0.1; the IP addresses of the three master nodes should be missing.
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout | grep IP
  • The configuration of the --advertise-address flag in the API server manifest, which is expected to refer to the load balancer address, in this case 10.10.10.3
sudo cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep advertise-address
  • The list of endpoints of the kubernetes Service, which should contain only the address of the load balancer
kubectl describe service kubernetes
  • The configuration of the --leader-elect flags for the scheduler and the controller manager
sudo cat /etc/kubernetes/manifests/kube-scheduler.yaml | grep leader-elect
sudo cat /etc/kubernetes/manifests/kube-controller-manager.yaml | grep leader-elect

1) If you want to further investigate how leader election works in this cluster, take a look at Leader election in Kubernetes control plane — #HeptioProTip

2) Please note that in this demo cluster you won’t see the --master flag in the manifests above, because kubeadm sets up a secure connection between control plane components using dedicated kubeconfig files.

Conclusions

Designing highly available distributed systems is complex, but one of the strengths of Kubernetes is that it provides a high level of customisation, giving you the flexibility to implement solutions adapted to your needs.

For instance, the approach to high availability in Kubernetes described above shows one possible solution for implementing a dynamic, highly available setup of your control plane nodes, one that allows for the easy replacement of a control plane node and the ability to scale the number of control plane instances up and down.

Heptio Labs provides a really interesting set of tools and resources for day-to-day work with Kubernetes, such as Wardroom, which not only can help create golden images for Kubernetes deployments, but is also a source of many useful scripts for bootstrapping a cluster.

We hope you enjoyed this post! Stay tuned @heptio for more tips, information, and important announcements about Kubernetes.
