kubernetes-reliability.md



Overview
Distributed systems such as Kubernetes are designed to be resilient to
failures. More details about Kubernetes High-Availability (HA) may be found in
Building High-Availability Clusters.
To keep things simple, most parts of HA are skipped here and only the
Kubelet<->Controller Manager communication is described.
By default the normal behavior looks like this:


Kubelet updates its status to apiserver periodically, as specified by
--node-status-update-frequency. The default value is 10s.


Kubernetes controller manager checks the status of Kubelet every
--node-monitor-period. The default value is 5s.


If the status was updated within the --node-monitor-grace-period,
Kubernetes controller manager considers the Kubelet healthy. The
default value is 40s.


Kubernetes controller manager and Kubelet work asynchronously. This means that
the delay may include any network latency, API Server latency, etcd latency,
latency caused by load on the master nodes and so on. So if
--node-status-update-frequency is set to 5s, in reality the update may appear
in etcd after 6-7 seconds, or even later when etcd cannot commit data to a
quorum of nodes.


Failure
Kubelet will make up to nodeStatusUpdateRetry attempts to post its status.
Currently nodeStatusUpdateRetry is a constant set to 5 in
kubelet.go.
Kubelet tries to update the status in the
tryUpdateNodeStatus
function. Kubelet uses Go's http.Client, but with no timeout specified.
Thus there may be some glitches when the API Server is overloaded while the
TCP connection is being established.
So, there will be nodeStatusUpdateRetry * --node-status-update-frequency
attempts to set the status of the node.