Unverified Commit 0e971a37 authored by Yuhao Zhang's avatar Yuhao Zhang Committed by GitHub

Offline control plane recover (#10660)

* ignore_unreachable for etcd dir cleanup

"ignore_errors" only ignores errors that occur within the "file" module itself.
However, when the target node is offline, the playbook still fails at this
task with the node in an "unreachable" state. Setting "ignore_unreachable: true"
lets the playbook bypass offline nodes and proceed with the recovery tasks on
the remaining online nodes.
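The pattern described above is standard Ansible; a minimal sketch of such a cleanup task (the task name and path are illustrative, not the exact Kubespray task):

```yaml
# Illustrative cleanup task (hypothetical name and path, not the exact
# Kubespray task). ignore_errors only swallows failures raised by the
# module itself; ignore_unreachable additionally lets the play continue
# past hosts that cannot be contacted at all.
- name: Remove etcd data directory on broken members
  file:
    path: /var/lib/etcd  # assumed data dir, for illustration only
    state: absent
  delegate_to: "{{ item }}"
  with_items: "{{ groups['broken_etcd'] }}"
  ignore_errors: true
  ignore_unreachable: true
```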

* Re-arrange control plane recovery runbook steps

* Remove suggestion to manually update IP addresses

The suggestion was added in 48a18284, 4 years ago. But a new task added
2 years ago, in ee0f1e9d, automatically updates the API server argument
with the updated etcd node IP addresses, so this suggestion is no longer
needed.
parent 4e52fb7a
```diff
@@ -3,11 +3,6 @@
 To recover from broken nodes in the control plane use the "recover-control-plane.yml" playbook.
 
-* Backup what you can
-* Provision new nodes to replace the broken ones
-* Place the surviving nodes of the control plane first in the "etcd" and "kube_control_plane" groups
-* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube_control_plane" groups
-
 Examples of what broken means in this context:
 
 * One or more bare metal node(s) suffer from unrecoverable hardware failure
```
```diff
@@ -19,8 +14,12 @@ __Note that you need at least one functional node to be able to recover using th
 
 ## Runbook
 
+* Backup what you can
+* Provision new nodes to replace the broken ones
 * Move any broken etcd nodes into the "broken_etcd" group, make sure the "etcd_member_name" variable is set.
 * Move any broken control plane nodes into the "broken_kube_control_plane" group.
+* Place the surviving nodes of the control plane first in the "etcd" and "kube_control_plane" groups
+* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube_control_plane" groups
 
 Then run the playbook with ```--limit etcd,kube_control_plane``` and increase the number of ETCD retries by setting ```-e etcd_retries=10``` or something even larger. The amount of retries required is difficult to predict.
```
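As an illustration of the re-arranged grouping steps, a hypothetical inventory fragment (host names invented; the inline `etcd_member_name` host vars are an assumed way of satisfying the "variable is set" requirement) could look like:

```ini
# Hypothetical inventory sketch (host names invented):
# the surviving control plane node is listed first, the new
# replacement nodes below it, and the broken nodes are moved
# into the broken_* groups.
[etcd]
node1
node4
node5

[kube_control_plane]
node1
node4
node5

[broken_etcd]
node2 etcd_member_name=etcd2
node3 etcd_member_name=etcd3

[broken_kube_control_plane]
node2
node3
```

With such an inventory, the invocation described above would be along the lines of `ansible-playbook recover-control-plane.yml -i inventory.ini --limit etcd,kube_control_plane -e etcd_retries=10` (inventory file name assumed).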
```diff
@@ -35,7 +34,6 @@ The playbook attempts to figure out it the etcd quorum is intact. If quorum is l
 
 ## Caveats
 
 * The playbook has only been tested with fairly small etcd databases.
-* If your new control plane nodes have new ip addresses you may have to change settings in various places.
 * There may be disruptions while running the playbook.
 * There are absolutely no guarantees.
```
```diff
@@ -39,6 +39,7 @@
       delegate_to: "{{ item }}"
       with_items: "{{ groups['broken_etcd'] }}"
       ignore_errors: true  # noqa ignore-errors
+      ignore_unreachable: true
       when:
         - groups['broken_etcd']
         - has_quorum
```