diff --git a/docs/recover-control-plane.md b/docs/recover-control-plane.md
index 0b80da271dc4e3f10f44cd69990e4800f6eab6a2..9174789cfa0fb78d2ce1d0b53a98e633da2aadb9 100644
--- a/docs/recover-control-plane.md
+++ b/docs/recover-control-plane.md
@@ -3,11 +3,6 @@
 
 To recover from broken nodes in the control plane use the "recover\-control\-plane.yml" playbook.
 
-* Backup what you can
-* Provision new nodes to replace the broken ones
-* Place the surviving nodes of the control plane first in the "etcd" and "kube\_control\_plane" groups
-* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube\_control\_plane" groups
-
 Examples of what broken means in this context:
 
 * One or more bare metal node(s) suffer from unrecoverable hardware failure
@@ -19,8 +14,12 @@ __Note that you need at least one functional node to be able to recover using th
 
 ## Runbook
 
+* Backup what you can
+* Provision new nodes to replace the broken ones
 * Move any broken etcd nodes into the "broken\_etcd" group, make sure the "etcd\_member\_name" variable is set.
 * Move any broken control plane nodes into the "broken\_kube\_control\_plane" group.
+* Place the surviving nodes of the control plane first in the "etcd" and "kube\_control\_plane" groups
+* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube\_control\_plane" groups
 
 Then run the playbook with ```--limit etcd,kube_control_plane``` and increase the number of ETCD retries by setting ```-e etcd_retries=10``` or something even larger. The amount of retries required is difficult to predict.
 
@@ -35,7 +34,6 @@ The playbook attempts to figure out it the etcd quorum is intact. If quorum is l
 ## Caveats
 
 * The playbook has only been tested with fairly small etcd databases.
-* If your new control plane nodes have new ip addresses you may have to change settings in various places.
 * There may be disruptions while running the playbook.
 * There are absolutely no guarantees.
 
diff --git a/roles/recover_control_plane/etcd/tasks/main.yml b/roles/recover_control_plane/etcd/tasks/main.yml
index 66dbc8b6deca2a76502880b93cd3afbad03ca291..599f56b15060611d1098369db234e9af52ffe1dd 100644
--- a/roles/recover_control_plane/etcd/tasks/main.yml
+++ b/roles/recover_control_plane/etcd/tasks/main.yml
@@ -39,6 +39,7 @@
   delegate_to: "{{ item }}"
   with_items: "{{ groups['broken_etcd'] }}"
   ignore_errors: true  # noqa ignore-errors
+  ignore_unreachable: true
   when:
     - groups['broken_etcd']
     - has_quorum