In this short article I want to show how you can reduce the failover time of an RKE2 Kubernetes HA cluster. In the default configuration of an RKE2 Kubernetes High Availability cluster, workloads are moved off a failed node only after 5 minutes. To bring the failover time down to about 30 seconds, we need to edit the RKE2 config.yaml file (/etc/rancher/rke2/config.yaml):
Repeat the following steps on ALL nodes in your RKE2 Kubernetes HA cluster!
(If the file or folder does not exist, please create it.)
sudo mkdir -p /etc/rancher/rke2/
sudo touch /etc/rancher/rke2/config.yaml
sudo nano /etc/rancher/rke2/config.yaml
APPEND the following settings to the config.yaml file and save it:
kube-apiserver-arg:
- '--default-not-ready-toleration-seconds=30'
- '--default-unreachable-toleration-seconds=30'
kube-controller-manager-arg:
- '--node-monitor-period=2s'
- '--node-monitor-grace-period=16s'
- '--pod-eviction-timeout=30s'
kubelet-arg:
- '--node-status-update-frequency=4s'
- '--max-pods=200'
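As a rough sanity check on these values, here is a minimal sketch of how the timings combine. It assumes that the controller manager first needs up to the node-monitor-grace-period to declare a node unreachable, and that the pods are then evicted once the toleration seconds have elapsed, so the two values add up. The function name `estimate_failover_seconds` is my own and not part of RKE2 or Kubernetes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RKE2): estimate the worst-case number of
# seconds before pods are evicted from a failed node.
# Assumption: delay = node-monitor-grace-period + unreachable toleration seconds.
estimate_failover_seconds() {
  local grace_period_s="$1"   # --node-monitor-grace-period, in seconds
  local toleration_s="$2"     # --default-unreachable-toleration-seconds
  echo $(( grace_period_s + toleration_s ))
}

# Values taken from the config above: 16s grace period, 30s toleration.
estimate_failover_seconds 16 30
```

Under this assumption the eviction starts at most about 46 seconds after the node stops reporting; the 30 seconds are the toleration part of that.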
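Note that the `--pod-eviction-timeout` flag was removed from kube-controller-manager in Kubernetes 1.27, so leave that line out on clusters running 1.27 or newer. Please reboot the nodes one after another, waiting for each node to become Ready again before you continue with the next one, to apply the new settings.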
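Because the settings have to be present on every node, it can help to check the file before rebooting. The following is only a sketch: the function name `check_rke2_failover_config` is hypothetical, and it does a plain text search for the setting names rather than validating the YAML.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RKE2): report which failover-related
# settings are missing from an RKE2 config file.
# This is a plain substring search; it does not parse or validate the YAML.
# Usage: check_rke2_failover_config [path-to-config.yaml]
check_rke2_failover_config() {
  local config_file="${1:-/etc/rancher/rke2/config.yaml}"
  local missing=0
  local flag
  for flag in \
    default-not-ready-toleration-seconds \
    default-unreachable-toleration-seconds \
    node-monitor-period \
    node-monitor-grace-period \
    node-status-update-frequency
  do
    if ! grep -q -- "${flag}=" "$config_file"; then
      echo "missing: ${flag}"
      missing=1
    fi
  done
  # Exit status 0 means every setting was found, 1 means at least one is missing.
  return "$missing"
}
```

Called without an argument on a node, the function checks /etc/rancher/rke2/config.yaml and prints one line for each setting it cannot find.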