Kubernetes: Troubleshooting Scheduler

Overview

This morning a developer reported that he could not update the container image of our application, even though the CI/CD pipeline reported success. When I checked the cluster, the Deployment already referenced the desired image version, but the pods were still running the old one.
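
The mismatch described above can be made visible with a small helper. This is only a sketch: the function name `compare_images` is made up for illustration, it looks at the first container of each pod only, and the namespace, deployment name and label selector are whatever applies to your application.

```shell
# Compare the image a Deployment asks for with the images its pods run.
# Hypothetical helper; only the first container of each pod is inspected.
# Usage: compare_images <namespace> <deployment> <label-selector>
compare_images() {
  ns="$1"; deploy="$2"; selector="$3"
  # Image requested by the Deployment's pod template
  desired=$(kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.template.spec.containers[0].image}')
  echo "desired: $desired"
  # One "name<TAB>image" line per pod, flagged when it differs
  kubectl get pods -n "$ns" -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}' |
  while IFS="$(printf '\t')" read -r pod image; do
    if [ "$image" = "$desired" ]; then status="ok"; else status="STALE"; fi
    echo "$pod $image $status"
  done
}
```

Called for example as `compare_images develop myapp app=myapp` (placeholder names), it prints one line per pod and marks pods that still run a different image as STALE.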

Troubleshooting Process

In Kubernetes, the scheduler (kube-scheduler) is the component that assigns pods to nodes, so the pods created by an update cannot be placed without it. I checked its logs and found that the certificate it uses to connect to the API server had expired.

k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.CSINode: failed to list *v1.CSINode: Get certificate has expired or not yet valid kube-scheduler
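
To pull lines like this out of a noisy log, a filter along these lines may help. It is a sketch: `cert_errors` is a made-up name and the pattern only covers the wording seen above plus the usual x509 prefix. The usage comment assumes a kubeadm cluster, where the scheduler pod carries the label component=kube-scheduler.

```shell
# Keep only log lines that point at a certificate problem.
# Hypothetical helper; reads log text on stdin.
cert_errors() {
  # "|| true" keeps the exit status at 0 when nothing matches
  grep -iE 'x509|certificate has expired' || true
}

# Possible usage on a kubeadm cluster (assumed label):
#   kubectl logs -n kube-system -l component=kube-scheduler --tail=200 | cert_errors
```

If kubectl itself can no longer talk to the API server, the same filter can be fed from whatever log source is available on the control plane node.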

I checked the certificate expiration dates for the cluster with this command:

kubeadm certs check-expiration

The output showed that most of the Kubernetes component certificates had expired, so I renewed them all.

sudo kubeadm certs renew all
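
To see at a glance which certificates the check reports as expired, before or after the renewal, the output can be filtered. This sketch assumes that an expired certificate shows `<invalid>` in the RESIDUAL TIME column of the check-expiration table; verify that against the output of your kubeadm version. The name `expired_certs` is made up.

```shell
# Print the names of certificates marked "<invalid>" (assumed marker for
# an expired certificate in the RESIDUAL TIME column).
# Hypothetical helper; reads `kubeadm certs check-expiration` output on stdin.
expired_certs() {
  awk '/<invalid>/ { print $1 }'
}

# Possible usage:
#   sudo kubeadm certs check-expiration | expired_certs
```

After a successful renewal the filter should print nothing.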

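After that, it is important to restart all control plane components so that they use the new certificates. On a kubeadm cluster they run as static pods, so moving their manifests out of /etc/kubernetes/manifests and back makes the kubelet stop and recreate them.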

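  # Dedicated directory, so unrelated YAML files in /tmp are not moved back.
  sudo mkdir -p /tmp/k8s-manifests
  # sh -c lets the glob expand as root even if the directory is not world-readable.
  sudo sh -c 'mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests/'
  sleep 20
  sudo sh -c 'mv /tmp/k8s-manifests/*.yaml /etc/kubernetes/manifests/'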
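
The control plane needs a moment to come back after the manifests are restored. Instead of guessing, one could poll the API server's readiness endpoint. The following is a sketch with a made-up function name and arbitrary default values for the number of attempts and the delay.

```shell
# Poll the API server readiness endpoint until it answers or attempts run out.
# Hypothetical helper; defaults (30 attempts, 5 s apart) are arbitrary.
# Usage: wait_for_apiserver [attempts] [delay-seconds]
wait_for_apiserver() {
  attempts="${1:-30}"; delay="${2:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if kubectl get --raw='/readyz' >/dev/null 2>&1; then
      echo "API server ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "API server not ready after $attempts attempts" >&2
  return 1
}
```

Running `wait_for_apiserver` right after the second move returns as soon as the API server answers, or fails with a non-zero status if it never does.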

Once the control plane came back up, I checked that all of its components were running:

kubectl get pods -n kube-system
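
With many pods in kube-system, the unhealthy ones are easy to miss. A filter like the following sketch (the name `not_running` is made up) prints only pods whose STATUS column is neither Running nor Completed. It expects the output of the command above with `--no-headers` added.

```shell
# Print name and status of pods that are neither Running nor Completed.
# Hypothetical helper; reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
not_running() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Possible usage:
#   kubectl get pods -n kube-system --no-headers | not_running
```

An empty result means every pod in the namespace reports Running or Completed.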

Next I checked the worker nodes. Unfortunately, some of them were in the NotReady state, so I cordoned them and forced them to rejoin the cluster.

kubectl get nodes

kubectl cordon kube-node04 kube-node05 kube-node08
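
The list of nodes to cordon can also be derived from the node listing instead of being typed by hand. This is a sketch with a made-up function name; it treats every node whose STATUS does not start with Ready as affected, including ones that are already cordoned.

```shell
# Print the names of nodes whose status is not Ready.
# Hypothetical helper; reads `kubectl get nodes --no-headers` output on stdin
# (columns: NAME STATUS ROLES AGE VERSION).
not_ready_nodes() {
  # STATUS may carry a suffix such as "NotReady,SchedulingDisabled"
  awk '{ split($2, s, ","); if (s[1] != "Ready") print $1 }'
}

# Possible usage:
#   kubectl get nodes --no-headers | not_ready_nodes
```

The printed names can then be passed to `kubectl cordon`.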

Generate a new join token on the control plane node so that the cordoned nodes can rejoin:

kubeadm token create --print-join-command

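# on each cordoned node: wipe the old kubeadm state
kubeadm reset
# join the node to the cluster again, using the command printed by the previous step
kubeadm join 1.2.3.4:6443 --token ry7kio.i7k2 --discovery-token-ca-cert-hash sha256:250fd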

Once the nodes are Ready again, uncordon them so that pods can be scheduled on them.

kubectl uncordon kube-node04 kube-node05 kube-node08

Finally, create a test deployment to make sure the scheduler and the nodes are working.

kubectl create deployment mynginx --image=nginx --replicas=3 --namespace=develop
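
Creating the deployment only proves that the API server accepted it. To confirm that the pods were actually scheduled, one could wait for the rollout and look at where the pods landed. The sketch below uses a made-up function name and an arbitrary 120 second timeout, and relies on `kubectl create deployment` labelling the pods with app=<name>.

```shell
# Wait for the test deployment to roll out, then show which nodes its pods run on.
# Hypothetical helper; the timeout value is arbitrary.
# Usage: verify_test_deployment [name] [namespace]
verify_test_deployment() {
  name="${1:-mynginx}"; ns="${2:-develop}"
  # Fails if the replicas do not become available within the timeout
  kubectl rollout status "deployment/$name" -n "$ns" --timeout=120s || return 1
  # The NODE column shows where the scheduler placed each pod
  kubectl get pods -n "$ns" -l "app=$name" -o wide
}
```

If the pods show up on the rejoined nodes, scheduling works again, and the test deployment can be removed with `kubectl delete deployment mynginx -n develop`.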