Replace A Dead Node in My High Available Cluster

Posted on 01 Feb 2025, tagged Kubernetes, infrastructure, high availability

In my previous blogs [1][2], I introduced my high available cluster setup. It works really well: when one of the three nodes is down, the services either stay online or can be recovered rather quickly (in the cases where I set the service replica count to 1 to avoid wasting resources). However, at the beginning of this year, a node went down not because of regular updates or a temporary shutdown for maintenance, but because its system disk died. While it's annoying to replace the disk and bring the node back, it's also a good opportunity to verify that a dead node can be replaced in my setup. So I will note the steps down in this article. It will be a short one, but it shows how easy the process is.

What Was Lost?

The dead disk is the system disk. It holds the OS, but also the data for CockroachDB and ElasticSearch. However, since the data for CockroachDB and ElasticSearch is replicated across the cluster, it can be recovered from the other machines.

The machine also has a separate disk for CephFS, but that disk is not lost. The data in CephFS is replicated too, so it should be recoverable from the other machines even if that disk had died as well, though that may need additional setup, like changing the disk UUID in Rook's Kubernetes manifests.

Why Not Recover From Backup?

First of all, I don't back up that often, because I don't feel the need considering the data is replicated. Another reason is that I originally set up this machine for offsite online backup and later repurposed it for this HA cluster. I want to change the secure boot setup because the threat model is now different: the machine no longer needs such a complex boot setup, which is not supported very well by mainstream Linux without TPM 2.0.

Since the data can all be recovered from the other machines automatically, it is easier to just install a fresh OS and some basic infrastructure so that all the service deployments and data are recovered automatically. This also simulates a dead-node situation, which gives me more confidence in recovering from such failures in the future.

How to Recover?

Okay, here we are for the actual recovery steps. It’s very simple:

First, install the OS. Configure basic things like the network IP address, SSH, etc. Install things like prometheus-node-exporter if you are using it on the other machines.
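
On a Debian-based install, this boils down to something like the following sketch (the hostname is a placeholder, and prometheus-node-exporter is the Debian/Ubuntu package name; adjust both to match your other nodes):

$ sudo hostnamectl set-hostname node3                        # reuse the dead node's hostname (placeholder name)
$ sudo apt install openssh-server prometheus-node-exporter   # match the packages running on the other nodes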

The next step is to let the node join our Kubernetes cluster. Before that, we can remove the old dead node from the Kubernetes cluster with the command kubectl delete node ....
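
For example (node3 is a placeholder; the first command shows the real node name, with the dead node reported as NotReady):

$ kubectl get nodes                  # the dead node shows up as NotReady
$ kubectl delete node node3          # remove it so the reinstalled node can rejoin cleanly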

Then install k3s: copy the config file /etc/rancher/k3s/config.yaml from another machine and adjust the node IP and network interface config. Make sure the config has something like server: https://...:6443 so the node will join the existing cluster instead of creating a new one. Check the k3s versions on the other machines with kubectl get nodes -o wide, then install it with curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.31.1+k3s1 sh -, assuming v1.31.1+k3s1 is the version.
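
As a rough sketch, the adjusted config could look like this (the server address, token, node IP, and interface name below are placeholders, not my real values):

$ cat /etc/rancher/k3s/config.yaml
server: https://192.168.1.10:6443    # an existing server node, so this node joins instead of bootstrapping a new cluster
token: <copied from another node>
node-ip: 192.168.1.12                # this node's own address
flannel-iface: enp3s0                # the network interface to use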

After k3s is installed, the k3s service should be enabled by default and the node should join our cluster automatically. If the hostname and IP address are the same as the dead machine's, the Kubernetes cluster should automatically reschedule the services onto this machine. If there are failed containers, check the logs to see whether it's because a local directory for storage is missing. In my case, I needed to create the local directories for CockroachDB and ElasticSearch, and set the owner to 1000 for ElasticSearch.
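
That meant something like the following (the directory paths are only illustrative examples of local volumes; use whatever paths your persistent volumes actually point to):

$ sudo mkdir -p /data/cockroachdb /data/elasticsearch    # example paths for the local persistent volumes
$ sudo chown -R 1000:1000 /data/elasticsearch            # ElasticSearch runs as UID 1000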

Finally, we need to make sure CephFS is working. Check that the ceph and rbd kernel modules can be loaded with modprobe. If so, add them to /etc/modules-load.d so they are loaded on boot:

$ cat /etc/modules-load.d/ceph.conf
ceph
$ cat /etc/modules-load.d/rbd.conf
rbd
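
To double-check without waiting for a reboot, something like this should confirm the modules load (the lsmod output will vary):

$ sudo modprobe ceph
$ sudo modprobe rbd
$ lsmod | grep -E 'ceph|rbd'    # both modules should be listed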