In this short article I want to show how you can optimize the failover time of an RKE2 Kubernetes HA cluster. In the default configuration of an RKE2 Kubernetes High Availability cluster, workloads are moved off a failed node only after about 5 minutes. To drop the failover time to about 30 seconds, we need to edit the following config.yaml file.
Repeat the following steps on ALL nodes in your RKE2 Kubernetes HA cluster!
(If the file or folder does not exist, please create it.)
sudo mkdir -p /etc/rancher/rke2/
sudo touch /etc/rancher/rke2/config.yaml
sudo nano /etc/rancher/rke2/config.yaml
APPEND the following code to the config.yaml file and save it:
kube-apiserver-arg:
- '--default-not-ready-toleration-seconds=30'
- '--default-unreachable-toleration-seconds=30'
kube-controller-manager-arg:
- '--node-monitor-period=2s'
- '--node-monitor-grace-period=16s'
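# Note: --pod-eviction-timeout was removed from kube-controller-manager in Kubernetes 1.27; delete the next line on newer RKE2 releases.
- '--pod-eviction-timeout=30s'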
kubelet-arg:
- '--node-status-update-frequency=4s'
- '--max-pods=200'
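If you have many nodes, the manual steps above can be scripted. The following is only a minimal sketch: the function name `append_failover_config` is my own invention, and it assumes the file does not yet define any of the three keys. If it does, the function skips the append so that you can merge the values by hand instead of ending up with duplicate keys.

```shell
#!/bin/sh
# Hypothetical helper (not part of RKE2): append the failover settings
# from this article to an RKE2 config file.
# $1 = path to the config file, defaults to /etc/rancher/rke2/config.yaml
append_failover_config() {
    config_file="${1:-/etc/rancher/rke2/config.yaml}"

    # Create the folder and the file if they do not exist yet.
    mkdir -p "$(dirname "$config_file")"
    touch "$config_file"

    # Do not append if one of the keys is already defined, because
    # duplicate top-level keys would make the YAML ambiguous.
    if grep -Eq '^(kube-apiserver-arg|kube-controller-manager-arg|kubelet-arg):' "$config_file"; then
        echo "Keys already present in $config_file, nothing appended." >&2
        return 0
    fi

    # Make sure the existing content ends with a newline before appending.
    if [ -s "$config_file" ] && [ -n "$(tail -c 1 "$config_file")" ]; then
        echo >> "$config_file"
    fi

    cat >> "$config_file" <<'EOF'
kube-apiserver-arg:
- '--default-not-ready-toleration-seconds=30'
- '--default-unreachable-toleration-seconds=30'
kube-controller-manager-arg:
- '--node-monitor-period=2s'
- '--node-monitor-grace-period=16s'
# --pod-eviction-timeout was removed in Kubernetes 1.27; delete the next line there.
- '--pod-eviction-timeout=30s'
kubelet-arg:
- '--node-status-update-frequency=4s'
- '--max-pods=200'
EOF
}
```

Save the function in a file, source it in a root shell (the folder /etc/rancher/rke2/ is normally owned by root) and call `append_failover_config` without arguments to use the default path. For a dry run, pass a different path, for example `append_failover_config /tmp/config.yaml`.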
Please reboot the nodes one after another to apply the new settings.
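To check that the new settings are active, you can look at the tolerations of a running pod. The sketch below is only an illustration: the function name `show_pod_tolerations` is made up, and it assumes that `kubectl` is installed and configured for your cluster. Keep in mind that the API server adds the default tolerations when a pod is created, so pods that were created before the change keep their old value of 300 seconds until they are recreated.

```shell
#!/bin/sh
# Hypothetical helper: print every toleration key of a pod together
# with its tolerationSeconds value (one toleration per line).
# $1 = pod name, $2 = namespace (defaults to "default")
show_pod_tolerations() {
    pod="$1"
    namespace="${2:-default}"
    kubectl get pod "$pod" -n "$namespace" \
        -o jsonpath='{range .spec.tolerations[*]}{.key}{"\t"}{.tolerationSeconds}{"\n"}{end}'
}
```

Call it with `show_pod_tolerations my-pod my-namespace`, where both names are placeholders. For a pod created after the change, the lines for `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` should show 30. As far as I understand the eviction flow, this countdown only starts once the node has been marked NotReady, which can take up to the node-monitor-grace-period (16s here), so the whole failover is closer to 46 seconds plus the start time of the new pod.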