Optimize Kubernetes HA failover time

In this short article I want to show how you can reduce the failover time of an RKE2 Kubernetes HA cluster. In the default configuration of an RKE2 Kubernetes high-availability cluster, workloads are moved off a failed node only after 5 minutes. To bring the failover time down to about 30 seconds, we need to edit the config.yaml file as follows:

Repeat the following steps on ALL nodes in your RKE2 Kubernetes HA cluster!

(If the file or folder does not exist, create it first.)

sudo mkdir -p /etc/rancher/rke2/
sudo touch /etc/rancher/rke2/config.yaml
sudo nano /etc/rancher/rke2/config.yaml

APPEND the following code to the config.yaml file and save it:

kube-apiserver-arg:
  - '--default-not-ready-toleration-seconds=30'
  - '--default-unreachable-toleration-seconds=30'
kube-controller-manager-arg:
  - '--node-monitor-period=2s'
  - '--node-monitor-grace-period=16s'
  - '--pod-eviction-timeout=30s'  # deprecated; removed upstream in Kubernetes 1.27, omit this line on newer versions
kubelet-arg:
  - '--node-status-update-frequency=4s'
  - '--max-pods=200'
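
As a rough guide to how these values combine: the controller manager marks a node as not ready once it has missed status updates for node-monitor-grace-period, and pods are evicted only after the toleration seconds have also elapsed. The sketch below just adds the two numbers from the config above. The function name estimate_failover_seconds is made up for this illustration, and the result is an estimate that ignores the time the scheduler and the container runtime need to start the replacement pods.

```shell
#!/bin/sh
# Hypothetical helper (not part of RKE2 or kubectl): estimates how long after
# a node failure pod eviction starts.
#   $1 = node-monitor-grace-period in seconds
#   $2 = default-*-toleration-seconds
estimate_failover_seconds() {
  grace=$1
  toleration=$2
  echo $((grace + toleration))
}

# Values from the config.yaml above: 16s grace period + 30s toleration.
estimate_failover_seconds 16 30   # prints 46
```

With the default toleration of 300 seconds, the second term alone accounts for the 5 minutes mentioned at the start of the article.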

Please reboot the nodes one after another to apply the new settings.
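
After a node is back up, it can be worth checking that the new arguments really reached the running components. The following is a minimal sketch, assuming the control-plane components and the kubelet show up in the host's process list with their full command line. The helper name has_arg is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit code 0) if the command line given in $1
# contains the literal flag=value string given in $2.
has_arg() {
  cmdline=$1
  expected=$2
  # -F: match as a fixed string, -q: no output, --: end of grep options
  printf '%s\n' "$cmdline" | grep -qF -- "$expected"
}

# Example with a shortened, made-up command line:
sample='kube-controller-manager --node-monitor-period=2s --node-monitor-grace-period=16s'
if has_arg "$sample" '--node-monitor-grace-period=16s'; then
  echo "grace period applied"
fi
```

On a real node the first argument could come from the process list, for example has_arg "$(ps -eo args | grep '[k]ube-controller-manager')" '--node-monitor-grace-period=16s'. The same check works for kube-apiserver and kubelet with their respective flags.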
