Unverified Commit 0e971a37 authored by Yuhao Zhang's avatar Yuhao Zhang Committed by GitHub

Offline control plane recover (#10660)

* ignore_unreachable for etcd dir cleanup

"ignore_errors" only ignores errors that occur within the "file" module itself.
However, when the target node is offline, the playbook still fails at this
task with the node in an "unreachable" state. Setting "ignore_unreachable: true"
lets the playbook bypass offline nodes and proceed with the recovery tasks on
the remaining online nodes.
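The pattern described above is standard Ansible; a minimal sketch of such a cleanup task (the task name and path are illustrative, not the exact Kubespray task):

```yaml
# Illustrative cleanup task (hypothetical name and path, not the exact
# Kubespray task). ignore_errors only swallows failures raised by the
# module itself; ignore_unreachable additionally lets the play continue
# past hosts that cannot be contacted at all.
- name: Remove etcd data directory on broken members
  file:
    path: /var/lib/etcd  # assumed data dir, for illustration only
    state: absent
  delegate_to: "{{ item }}"
  with_items: "{{ groups['broken_etcd'] }}"
  ignore_errors: true
  ignore_unreachable: true
```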

* Re-arrange control plane recovery runbook steps

* Remove suggestion to manually update IP addresses

The suggestion was added in 48a18284, 4 years ago. But a new task added
2 years ago, in ee0f1e9d, automatically updates the API server argument
with the updated etcd node IP addresses, so this suggestion is no longer
needed.
parent 4e52fb7a
```diff
@@ -3,11 +3,6 @@
 To recover from broken nodes in the control plane use the "recover-control-plane.yml" playbook.
 
-* Backup what you can
-* Provision new nodes to replace the broken ones
-* Place the surviving nodes of the control plane first in the "etcd" and "kube_control_plane" groups
-* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube_control_plane" groups
-
 Examples of what broken means in this context:
 
 * One or more bare metal node(s) suffer from unrecoverable hardware failure
```
```diff
@@ -19,8 +14,12 @@ __Note that you need at least one functional node to be able to recover using th
 
 ## Runbook
 
+* Backup what you can
+* Provision new nodes to replace the broken ones
 * Move any broken etcd nodes into the "broken_etcd" group, make sure the "etcd_member_name" variable is set.
 * Move any broken control plane nodes into the "broken_kube_control_plane" group.
+* Place the surviving nodes of the control plane first in the "etcd" and "kube_control_plane" groups
+* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube_control_plane" groups
 
 Then run the playbook with ```--limit etcd,kube_control_plane``` and increase the number of ETCD retries by setting ```-e etcd_retries=10``` or something even larger. The amount of retries required is difficult to predict.
```
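As an illustration of the re-arranged grouping steps, a hypothetical inventory fragment (host names invented; the inline `etcd_member_name` host vars are an assumed way of satisfying the "variable is set" requirement) could look like:

```ini
# Hypothetical inventory sketch (host names invented):
# the surviving control plane node is listed first, the new
# replacement nodes below it, and the broken nodes are moved
# into the broken_* groups.
[etcd]
node1
node4
node5

[kube_control_plane]
node1
node4
node5

[broken_etcd]
node2 etcd_member_name=etcd2
node3 etcd_member_name=etcd3

[broken_kube_control_plane]
node2
node3
```

With such an inventory, the invocation described above would be along the lines of `ansible-playbook recover-control-plane.yml -i inventory.ini --limit etcd,kube_control_plane -e etcd_retries=10` (inventory file name assumed).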
```diff
@@ -35,7 +34,6 @@ The playbook attempts to figure out it the etcd quorum is intact. If quorum is l
 
 ## Caveats
 
 * The playbook has only been tested with fairly small etcd databases.
-* If your new control plane nodes have new ip addresses you may have to change settings in various places.
 * There may be disruptions while running the playbook.
 * There are absolutely no guarantees.
```
```diff
@@ -39,6 +39,7 @@
       delegate_to: "{{ item }}"
       with_items: "{{ groups['broken_etcd'] }}"
       ignore_errors: true  # noqa ignore-errors
+      ignore_unreachable: true
       when:
         - groups['broken_etcd']
         - has_quorum
```