kubernetes-reliability.md
Nguyen Hai Truong authored
Although it is spelling mistakes, it might make affect while reading. Signed-off-by:
Nguyen Hai Truong <truongnh@vn.fujitsu.com>
Overview
Distributed systems such as Kubernetes are designed to be resilient to failures. More details about Kubernetes High-Availability (HA) may be found at Building High-Availability Clusters.
To keep the picture simple, most aspects of HA are skipped here and only the Kubelet<->Controller Manager communication is described.
By default, the normal behavior looks like this:

- Kubelet updates its status to the apiserver periodically, as specified by `--node-status-update-frequency`. The default value is 10s.
- The Kubernetes controller manager checks the status of each Kubelet every `--node-monitor-period`. The default value is 5s.
- If the status was updated within the last `--node-monitor-grace-period`, the Kubernetes controller manager considers the Kubelet healthy. The default value is 40s.

The Kubernetes controller manager and Kubelet work asynchronously. This means that the delay may include network latency, API Server latency, etcd latency, latency caused by load on the master nodes, and so on. So if `--node-status-update-frequency` is set to 5s, in reality the update may appear in etcd after 6-7 seconds, or even longer when etcd cannot commit data to a quorum of nodes.
Failure
Kubelet will try to make `nodeStatusUpdateRetry` post attempts. Currently `nodeStatusUpdateRetry` is a constant set to 5 in kubelet.go.
Kubelet tries to update the status in the `tryUpdateNodeStatus` function. Kubelet uses Go's `http.Client` type with no timeout specified, so there may be glitches when the API Server is overloaded while the TCP connection is being established.
So, there will be `nodeStatusUpdateRetry` * `--node-status-update-frequency` attempts to set the status of the node.