Tuesday, November 19, 2024

How to bring down a Kubernetes cluster?

To safely bring down a Kubernetes cluster, follow the steps below. The exact procedure varies with how the cluster was set up (e.g., **kubeadm**, a **managed Kubernetes** service, or a custom deployment). The following steps cover a cluster created with **kubeadm**, with additional notes for other environments:

---

### 1. **Prepare for Cluster Shutdown**

- Notify stakeholders and users about the planned downtime.

- Back up critical data such as etcd snapshots and configuration files (the recovery guide later in this post shows the full commands).

- Ensure no active workloads need to run during the shutdown (e.g., scale down critical apps or reschedule workloads).

---


### 2. **Scale Down Resources**

To avoid workload disruptions during a graceful shutdown:

- Scale down all deployments and statefulsets to zero replicas (`--all` applies only to the current namespace, so repeat this per namespace):

  ```bash
  kubectl scale deployment --all --replicas=0
  kubectl scale statefulset --all --replicas=0
  ```

- Optionally, delete non-critical resources:

  ```bash

  kubectl delete pod --all -n <namespace>

  ```

---

### 3. **Drain the Nodes**

Before shutting down nodes, drain them to ensure workloads are properly evicted:

- For each node in the cluster:

  ```bash

  kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

  ```

- Verify the nodes are drained; drained nodes show `Ready,SchedulingDisabled` in the STATUS column:

  ```bash
  kubectl get nodes
  ```


---


### 4. **Stop the Control Plane**

For clusters created with **kubeadm**:

- On the **control plane nodes** (masters), stop the kubelet and the container runtime:

  ```bash
  sudo systemctl stop kubelet
  sudo systemctl stop docker   # use "containerd" here if that is your runtime
  ```


---


### 5. **Shut Down Worker Nodes**

- Stop the kubelet and container runtime services on all worker nodes:

  ```bash
  sudo systemctl stop kubelet
  sudo systemctl stop docker   # use "containerd" here if that is your runtime
  ```

- Alternatively, power off the worker nodes directly if the Kubernetes services are not managed by systemd, as shown below.
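
For example, a minimal way to halt a worker once its services are stopped (run on each node):

```bash
sudo shutdown -h now   # halt the machine immediately
```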


---


### 6. **Delete the Cluster Configuration (Optional)**

- To permanently remove the cluster, run the reset on every node:

  ```bash
  sudo kubeadm reset
  ```

- Clean up configuration and cache files:

  ```bash

  sudo rm -rf /etc/kubernetes/ /var/lib/etcd/ /var/lib/kubelet/ ~/.kube/

  ```


---


### For Managed Kubernetes (e.g., GKE, EKS, AKS)

- Use the provider's interface to delete the cluster:

  - **GKE:**  

    ```bash
    gcloud container clusters delete <cluster-name> --zone <zone>   # or --region <region>
    ```

  - **EKS:**  

    ```bash

    eksctl delete cluster --name <cluster-name>

    ```

  - **AKS:**  

    ```bash

    az aks delete --name <cluster-name> --resource-group <resource-group>

    ```


---


### 7. **Verify Shutdown**

- Ensure all Kubernetes nodes are powered off.

- Check logs for any errors during shutdown:

  ```bash

  sudo journalctl -u kubelet

  ```


---


### 8. **Post-Shutdown Cleanup**

- Remove any associated resources (e.g., load balancers, storage volumes).

- Archive logs or etcd snapshots if needed for recovery or auditing.


By following these steps, you can safely and systematically bring down a Kubernetes cluster.


What is the safest technique to bring down and recover a Kubernetes cluster?

Bringing down and recovering a Kubernetes cluster safely requires careful planning to avoid data loss and to ensure workloads can be restored. Here is a structured approach:


---


### **Safest Technique to Bring Down a Kubernetes Cluster**


#### 1. **Pre-Shutdown Preparation**

   - **Notify Stakeholders**: Inform all users and teams about the planned downtime.

   - **Back Up Critical Data**:

     - Take an etcd snapshot (if using an etcd backend):

       ```bash

       ETCDCTL_API=3 etcdctl snapshot save snapshot.db \

         --endpoints=https://127.0.0.1:2379 \

         --cacert=/etc/kubernetes/pki/etcd/ca.crt \

         --cert=/etc/kubernetes/pki/etcd/server.crt \

         --key=/etc/kubernetes/pki/etcd/server.key

       ```

     - Back up Kubernetes configuration files and manifests:

       ```bash

       tar czvf k8s-backup.tar.gz /etc/kubernetes /var/lib/kubelet /var/lib/etcd ~/.kube/

       ```

     - Back up Persistent Volume (PV) data if necessary.

   - **Document Cluster Details**: Record node IPs, roles, and custom configurations, as in the sketch below.
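
A minimal sketch of capturing cluster details before shutdown (the file names are illustrative):

```bash
# Record node inventory, workload state, and versions for the recovery runbook
kubectl get nodes -o wide > nodes.txt
kubectl get all --all-namespaces -o yaml > cluster-state.yaml
kubectl version > versions.txt
```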


---


#### 2. **Gracefully Scale Down Workloads**

   - Scale down all workloads to avoid disruptions, recording the current replica counts first (see the sketch after this list):

     ```bash
     kubectl scale deployment --all --replicas=0
     kubectl scale statefulset --all --replicas=0
     ```

   - Safely delete non-essential pods:

     ```bash

     kubectl delete pod --all -n <namespace>

     ```
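
A minimal sketch for saving per-workload replica counts before scaling to zero, so they can be restored later (the file name `replica-counts.txt` is illustrative):

```bash
# Save namespace, name, and replica count for every deployment
kubectl get deployments --all-namespaces --no-headers \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,REPLICAS:.spec.replicas \
  > replica-counts.txt
```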


---


#### 3. **Drain Nodes**

   - Drain workloads from nodes to ensure proper eviction:

     ```bash

     kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

     ```

   - Repeat for all worker nodes, or loop over them as in the sketch below.
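
A hedged one-loop version that drains every worker while skipping the control plane (assumes nodes carry the standard `node-role.kubernetes.io/control-plane` label):

```bash
# Drain all nodes that do not carry the control-plane label
for node in $(kubectl get nodes --no-headers \
    -l '!node-role.kubernetes.io/control-plane' \
    -o custom-columns=NAME:.metadata.name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done
```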


---


#### 4. **Stop Kubernetes Services**

   - On the **worker nodes** first:

     ```bash
     sudo systemctl stop kubelet
     sudo systemctl stop docker   # or "containerd", depending on your runtime
     ```

   - Then on the **control plane nodes**:

     ```bash
     sudo systemctl stop kubelet
     sudo systemctl stop docker   # or "containerd", depending on your runtime
     ```


---


#### 5. **Verify Cluster Shutdown**

   - Ensure all nodes and services are stopped:

     ```bash
     sudo systemctl status kubelet
     sudo systemctl status docker   # or "containerd"
     ```

   - Confirm no workloads are still running, as in the check below.
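
A quick node-level sanity check that the processes are actually gone (run on each node):

```bash
# pgrep exits non-zero when no matching process exists
pgrep -a kubelet    || echo "kubelet stopped"
pgrep -a containerd || echo "container runtime stopped"
```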


---


### **Safest Technique to Recover a Kubernetes Cluster**


#### 1. **Prepare for Recovery**

   - Restore backups of etcd, configuration files, and Persistent Volume data if needed.

   - Ensure all hardware or virtual machine resources are operational.


---


#### 2. **Restore the Control Plane**

   - If using an etcd snapshot:

     ```bash

     ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \

       --data-dir=/var/lib/etcd

     ```

     Replace `/var/lib/etcd` with your etcd data directory. The restore target must not already contain data, so move any old directory aside first; see the sequence after this list.

   - Restore Kubernetes configuration files:

     ```bash

     tar xzvf k8s-backup.tar.gz -C /

     ```
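
Putting the restore together, a hedged end-to-end sequence for a kubeadm cluster where etcd runs as a static pod (paths assume kubeadm defaults):

```bash
# Stop the kubelet so the static etcd pod is not running during the restore
sudo systemctl stop kubelet
# Keep the old data directory as a fallback rather than deleting it
sudo mv /var/lib/etcd /var/lib/etcd.old
# Restore the snapshot into a fresh data directory
sudo ETCDCTL_API=3 etcdctl snapshot restore snapshot.db --data-dir=/var/lib/etcd
# Restart the kubelet; it brings the static control plane pods back up
sudo systemctl start kubelet
```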


---


#### 3. **Start Kubernetes Services**

   - Start the **control plane services**, container runtime first, then the kubelet:

     ```bash
     sudo systemctl start docker   # or "containerd"
     sudo systemctl start kubelet
     ```

   - Verify control plane status:

     ```bash

     kubectl get nodes

     ```


---


#### 4. **Rejoin Worker Nodes**

   - On each worker node, restart the services in the same order:

     ```bash
     sudo systemctl start docker   # or "containerd"
     sudo systemctl start kubelet
     ```

   - If necessary, rejoin the nodes to the cluster using the original `kubeadm join` command or create a new token on the control plane:

     ```bash

     kubeadm token create --print-join-command

     ```
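
The printed command is then run on each worker node; the values below are purely hypothetical placeholders showing its shape:

```bash
# Example shape of the join command (endpoint, token, and hash are hypothetical)
sudo kubeadm join 10.0.0.1:6443 --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:<hash>
```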


---


#### 5. **Scale Up Workloads**

   - Scale deployments and statefulsets back up. Note that `--all --replicas=<n>` sets every workload to the *same* count, so prefer replaying the per-workload counts recorded before shutdown (see the sketch after this list):

     ```bash
     kubectl scale deployment --all --replicas=<original-replica-count>
     kubectl scale statefulset --all --replicas=<original-replica-count>
     ```

   - Verify application functionality.
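
A minimal sketch that replays the counts saved in the hypothetical `replica-counts.txt` from the shutdown phase:

```bash
# Restore each deployment to its recorded replica count
while read -r ns name replicas; do
  kubectl scale deployment "$name" -n "$ns" --replicas="$replicas"
done < replica-counts.txt
```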


---


#### 6. **Post-Recovery Validation**

   - Ensure all nodes are healthy:

     ```bash

     kubectl get nodes

     kubectl get pods --all-namespaces

     ```

   - Validate application functionality and Persistent Volume mounts, as in the check below.
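
A quick check that Persistent Volumes and claims came back bound:

```bash
kubectl get pv                      # cluster-scoped volumes; expect STATUS "Bound"
kubectl get pvc --all-namespaces    # claims in every namespace
```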


---


### **Best Practices for Safety**

- Use automation tools like **Velero** for cluster and Persistent Volume backups (a minimal example follows this list).

- Regularly test disaster recovery procedures in a non-production environment.

- Keep Kubernetes and etcd versions consistent across backups and recovery.

- Document your cluster architecture and recovery steps.
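
As one illustration, a minimal Velero backup before a planned shutdown might look like this (assumes Velero is already installed and configured with an object-store backend; the backup name is arbitrary):

```bash
# Create a one-off backup of all namespaces
velero backup create pre-shutdown --include-namespaces '*'
# Confirm the backup completed
velero backup describe pre-shutdown
```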


By following this structured process, you can ensure a smooth and safe shutdown and recovery of your Kubernetes cluster.
