We use the etcdctl and etcdutl tools to back up and restore the etcd database.
Official download page: https://github.com/etcd-io/etcd/releases
1. Backing Up the etcd Database
1.1. Bare-Metal (Physical Node) Deployment
1.1.1. Installing etcdctl and etcdutl from Binaries
Installation script: install_etcdctl.sh
#!/bin/bash
# Version to install
etcd_ver=v3.5.17
# Installation directory
etcd_dir=/software_path/etcd
DOWNLOAD_URL=https://github.com/etcd-io/etcd/releases/download
# Download
if [ ! -d "$etcd_dir" ]; then
    mkdir -p "$etcd_dir"
fi
wget ${DOWNLOAD_URL}/${etcd_ver}/etcd-${etcd_ver}-linux-amd64.tar.gz
mv etcd-${etcd_ver}-linux-amd64.tar.gz ${etcd_dir}
cd $etcd_dir
tar -xzvf ${etcd_dir}/etcd-${etcd_ver}-linux-amd64.tar.gz
# Install: symlink etcdctl and etcdutl onto the PATH
ln -s ${etcd_dir}/etcd-${etcd_ver}-linux-amd64/etcdctl /usr/local/sbin/etcdctl
ln -s ${etcd_dir}/etcd-${etcd_ver}-linux-amd64/etcdutl /usr/local/sbin/etcdutl
Verify the installation:
$ etcdctl version
etcdctl version: 3.5.17
API version: 3.5
$ etcdutl version
etcdutl version: 3.5.17
API version: 3.5
1.1.2. Backing Up with etcdctl
For a single-node Kubernetes cluster, we only need to take a snapshot backup of its etcd database. For a multi-master, multi-worker cluster, we back up the etcd on each master node in turn, to guard against the etcd data changing while a backup is being taken.
Run the following command on every etcd node:
# Export the current node's etcd data as a snapshot
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save etcdbackupfile.db
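Optionally, sanity-check the snapshot right after it is written. A minimal example with etcdutl (since etcd v3.5, `etcdctl snapshot status` is deprecated in favor of etcdutl):
# Print the snapshot's hash, revision, total keys, and size
etcdutl snapshot status etcdbackupfile.db --write-out=table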
Automated backup script: etcd_backup.sh
#!/bin/bash
# Backup directory
backup_dir="/var/lib/etcd_db_bak"
# Timestamp for the snapshot file name
DATE=$(date +"%Y%m%d%H%M")
# Create the backup directory if it does not exist
if [ ! -d "$backup_dir" ]; then
    mkdir -p "$backup_dir"
fi
# Take the snapshot
ETCDCTL_API=3 /usr/local/sbin/etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save $backup_dir/etcdbackupfile_$DATE.db
Schedule it with cron:
crontab -e
# Back up every day at 18:30
30 18 * * * /var/lib/etcd_db_bak/etcd_backup.sh
# At 23:50, delete snapshots older than 5 days
50 23 * * * find /var/lib/etcd_db_bak/ -mtime +5 -name "*.db" -exec rm -f {} \;
1.2. Docker Container Deployment
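When etcd runs as a Docker container, the same snapshot command can be run inside the container. A minimal sketch, assuming the container is named etcd and the certificate paths match those used above (adjust both for your deployment):
# Take the snapshot inside the container
docker exec etcd sh -c "ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /tmp/etcdbackupfile.db"
# Copy it out to the host backup directory
docker cp etcd:/tmp/etcdbackupfile.db /var/lib/etcd_db_bak/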
1.3. Kubernetes CronJob Deployment
# etcd-database-backup.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-database-backup
  annotations:
    descript: "Scheduled etcd database backup"
spec:
  schedule: "*/5 * * * *"  # run every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: etcdctl
            image: registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.5.5-0
            env:
            - name: ETCDCTL_API
              value: "3"
            - name: ETCDCTL_CACERT
              value: "/etc/kubernetes/pki/etcd/ca.crt"
            - name: ETCDCTL_CERT
              value: "/etc/kubernetes/pki/etcd/healthcheck-client.crt"
            - name: ETCDCTL_KEY
              value: "/etc/kubernetes/pki/etcd/healthcheck-client.key"
            command:
            - /bin/sh
            - -c
            - |
              export RAND=$RANDOM
              etcdctl --endpoints=https://192.168.12.107:2379 snapshot save /backup/etcd-107-${RAND}-snapshot.db
              etcdctl --endpoints=https://192.168.12.108:2379 snapshot save /backup/etcd-108-${RAND}-snapshot.db
              etcdctl --endpoints=https://192.168.12.109:2379 snapshot save /backup/etcd-109-${RAND}-snapshot.db
            volumeMounts:
            - name: "pki"
              mountPath: "/etc/kubernetes"
            - name: "backup"
              mountPath: "/backup"
            imagePullPolicy: IfNotPresent
          volumes:
          - name: "pki"
            hostPath:
              path: "/etc/kubernetes"
              type: "DirectoryOrCreate"
          - name: "backup"
            hostPath:
              path: "/storage/dev/backup"  # backup directory on the host
              type: "DirectoryOrCreate"
          nodeSelector:  # pin the Pod to a master node; otherwise the certificates would have to live on NFS shared storage reachable from every node
            node-role.kubernetes.io/master: ""
          restartPolicy: Never
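Apply the manifest and confirm the CronJob fires (note: if your control-plane nodes carry the node-role.kubernetes.io/master:NoSchedule taint, the Pod spec will also need a matching toleration):
kubectl apply -f etcd-database-backup.yaml
kubectl get cronjob etcd-database-backup
# Completed Jobs leave their snapshots under /storage/dev/backup on the node that ran the Pod
kubectl get jobs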
2. Restoring the etcd Database
Stop kube-apiserver and etcd on all master machines, then use the backup to restore each node's etcd data.
# Stop the kube-apiserver and etcd static Pods
mv /etc/kubernetes/manifests/ /etc/kubernetes/manifests-backup/
# Move aside /var/lib/etcd on this node
mv /var/lib/etcd /var/lib/etcd.bak
mkdir /var/lib/etcd
# Restore from the snapshot: of the backups taken on the different nodes, pick the largest one
# and use that single file to restore each node in turn; restoring different nodes from
# different backups can leave the etcd data inconsistent
ETCDCTL_API=3 etcdctl snapshot restore /var/lib/etcd_db_bak/etcdbackupfile.db --data-dir=/var/lib/etcd --name=k8s-master-01-c-201 --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --initial-cluster-token=etcd-cluster-0 --initial-cluster=k8s-master-01-c-201=https://192.168.2.201:2380,k8s-master-02-r-202=https://192.168.2.202:2380,k8s-master-03-u-203=https://192.168.2.203:2380 --initial-advertise-peer-urls=https://192.168.2.201:2380
ETCDCTL_API=3 etcdctl snapshot restore /var/lib/etcd_db_bak/etcdbackupfile.db --data-dir=/var/lib/etcd --name=k8s-master-02-r-202 --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --initial-cluster-token=etcd-cluster-0 --initial-cluster=k8s-master-01-c-201=https://192.168.2.201:2380,k8s-master-02-r-202=https://192.168.2.202:2380,k8s-master-03-u-203=https://192.168.2.203:2380 --initial-advertise-peer-urls=https://192.168.2.202:2380
ETCDCTL_API=3 etcdctl snapshot restore /var/lib/etcd_db_bak/etcdbackupfile.db --data-dir=/var/lib/etcd --name=k8s-master-03-u-203 --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --initial-cluster-token=etcd-cluster-0 --initial-cluster=k8s-master-01-c-201=https://192.168.2.201:2380,k8s-master-02-r-202=https://192.168.2.202:2380,k8s-master-03-u-203=https://192.168.2.203:2380 --initial-advertise-peer-urls=https://192.168.2.203:2380
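# Note: since etcd v3.5, `etcdctl snapshot restore` is deprecated in favor of etcdutl,
# which performs the same offline restore without needing TLS flags. A sketch of the
# equivalent command for the first node, mirroring the etcdctl call above:
etcdutl snapshot restore /var/lib/etcd_db_bak/etcdbackupfile.db --data-dir=/var/lib/etcd --name=k8s-master-01-c-201 --initial-cluster-token=etcd-cluster-0 --initial-cluster=k8s-master-01-c-201=https://192.168.2.201:2380,k8s-master-02-r-202=https://192.168.2.202:2380,k8s-master-03-u-203=https://192.168.2.203:2380 --initial-advertise-peer-urls=https://192.168.2.201:2380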
# Bring the kube-apiserver and etcd static Pods back
mv /etc/kubernetes/manifests-backup/ /etc/kubernetes/manifests/
Common etcdctl commands:
# Show the status of the etcd cluster's nodes, including which member is the leader
ETCDCTL_API=3 etcdctl endpoint status --endpoints=https://192.168.2.201:2379 --endpoints=https://192.168.2.202:2379 --endpoints=https://192.168.2.203:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out table
# List etcd cluster members
ETCDCTL_API=3 etcdctl member list --endpoints=https://192.168.2.201:2379 --endpoints=https://192.168.2.202:2379 --endpoints=https://192.168.2.203:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key
# Check endpoint health to spot unhealthy nodes
ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://192.168.2.201:2379 --endpoints=https://192.168.2.202:2379 --endpoints=https://192.168.2.203:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key
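To avoid repeating the TLS flags on every call, the same settings can be exported as ETCDCTL_* environment variables, the same mechanism the CronJob above uses:
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://192.168.2.201:2379,https://192.168.2.202:2379,https://192.168.2.203:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/peer.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/peer.key
etcdctl endpoint status --write-out=table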
3. Recovering a Failed Master Node in an HA Cluster
On a healthy master node, list the etcd members:
# List etcd cluster members
ETCDCTL_API=3 etcdctl member list --endpoints=https://192.168.2.201:2379 --endpoints=https://192.168.2.202:2379 --endpoints=https://192.168.2.203:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key
1f378db48f1e54f0, started, k8s-master-03-u-203, https://192.168.2.203:2380, https://192.168.2.203:2379, false
a64cd8ddf3d578de, started, k8s-master-02-r-202, https://192.168.2.202:2380, https://192.168.2.202:2379, false
f7c171b2cb4e7820, started, k8s-master-01-c-201, https://192.168.2.201:2380, https://192.168.2.201:2379, false
On a healthy master node, check the health of the etcd endpoints:
# Check endpoint health to spot unhealthy nodes
ETCDCTL_API=3 etcdctl endpoint health --endpoints=https://192.168.2.201:2379 --endpoints=https://192.168.2.202:2379 --endpoints=https://192.168.2.203:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key
https://192.168.2.203:2379 is healthy: successfully committed proposal: took = 10.685441ms
https://192.168.2.201:2379 is healthy: successfully committed proposal: took = 11.972997ms
https://192.168.2.202:2379 is unhealthy: failed to commit proposal: context deadline exceeded
On a healthy master node, remove the unhealthy etcd member:
ETCDCTL_API=3 etcdctl --key /etc/kubernetes/pki/apiserver-etcd-client.key --cert /etc/kubernetes/pki/apiserver-etcd-client.crt --cacert /etc/kubernetes/pki/etcd/ca.crt member remove a64cd8ddf3d578de
Clean up the failed master node's data:
sudo mv /etc/kubernetes/ /etc/kubernetes-backup/
sudo mkdir /etc/kubernetes/
# Move aside /var/lib/etcd on this node
sudo mv /var/lib/etcd /var/lib/etcd.bak
sudo mkdir /var/lib/etcd
Rejoin the cluster:
# Run the following on a healthy master node:
kubeadm init phase upload-certs --upload-certs
...
[upload-certs] Using certificate key:
ee7cdf97abe2993b9c66cbcfa175b468f4ce5a23e11d477c5b902775a7a36e77
kubeadm token create --print-join-command
kubeadm join api-server:8443 --token ab2u13.b20gyt91bdz5eqxy --discovery-token-ca-cert-hash sha256:efb37c407e4eaef751d402eed838e44b2defeb4c45e03b6d2151e62ca915e0f7
# On the failed node, run the printed join command plus the control-plane flags, using the certificate key from above:
kubeadm join api-server:8443 --token ab2u13.b20gyt91bdz5eqxy --discovery-token-ca-cert-hash sha256:efb37c407e4eaef751d402eed838e44b2defeb4c45e03b6d2151e62ca915e0f7 --control-plane --certificate-key ee7cdf97abe2993b9c66cbcfa175b468f4ce5a23e11d477c5b902775a7a36e77
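# Once the failed node has rejoined, verify from any healthy master:
kubectl get nodes  # the recovered master should report Ready
ETCDCTL_API=3 etcdctl member list --endpoints=https://192.168.2.201:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key  # all three members should show "started" again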
If joining the cluster fails, wipe the node completely and then join again:
kubeadm reset -f
cd /tmp  # switch directories first: a file in the current directory sharing a package's name can make the removal fail
rm -rf ~/.kube/
rm -rf /etc/kubernetes/
rm -rf /etc/cni
rm -rf /opt/cni
rm -rf /var/lib/etcd
rm -rf /var/etcd
rm -rf /run/flannel
rm -rf /etc/cni/net.d
rm -rf /run/xtables.lock
systemctl stop kubelet
yum remove kube* -y
# Unmount all kubelet mounts first, otherwise the next command cannot delete the directory
for i in $(df | grep kubelet | awk '{print $NF}'); do umount -l $i; done
rm -rf /var/lib/kubelet
rm -rf /etc/systemd/system/kubelet.service.d
rm -rf /etc/systemd/system/kubelet.service
rm -rf /usr/bin/kube*
iptables -F
reboot  # reboot and start fresh
yum install -y kubelet-1.30* kubeadm-1.30* kubectl-1.30*
systemctl enable kubelet && systemctl start kubelet && systemctl status kubelet
# After the reinstall, rerun the kubeadm join command from the previous step to bring the node back into the cluster