How to enable Disaster Recovery for MetalSoft¶
The Global Controller can be configured in an active-standby manner to support a disaster recovery scenario.
RPO and RTO¶
The current setup provides an RPO of 1 hour (if the backup cron runs hourly) and an RTO of 15 minutes. Note that during this time the services under management (servers, switches, etc.) are not affected; however, telemetry information will not be collected.
Note also that any data loss is normally corrected by MetalSoft on the next run. For example, if a configuration was applied directly on a switch, such as a VLAN being created with no corresponding object in MetalSoft, upon the next execution MetalSoft will delete the unwanted VLAN objects from the switch, bringing the switches back in sync with the new Global Controller state.
Steady state¶
In the “steady state” setup the DNS record of the site points to the controller in DC-A. The controller in DC-B has all of its Kubernetes “deployments” (containers) scaled down, with the exception of traefik (the gateway microservice), which forwards traffic to site A. This compensates for DNS converging too slowly, so that stray requests that still reach site B in the event of a failure are forwarded to site A.
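For illustration, the steady-state shape of the DC-B cluster could be prepared with plain kubectl commands like the following; the namespace metalsoft and the deployment name traefik are assumptions, not fixed MetalSoft names:
# Scale every deployment in DC-B down to zero replicas...
kubectl --namespace metalsoft scale deployment --all --replicas=0
# ...then bring only traefik back up so stray requests can be forwarded to DC-A.
kubectl --namespace metalsoft scale deployment traefik --replicas=1
# Only traefik should report ready replicas.
kubectl --namespace metalsoft get deployments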
Failover state¶
In the “failover state” the DNS record points to the second datacenter, the “deployments” (containers) are up, the database is loaded from backup and the site controllers reconnect to the second site.
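As a quick check that the failover state has been reached, commands like the following could be run on a DC-B head node; the namespace metalsoft and the record controller.example.com are placeholders:
# All deployments should report ready replicas.
kubectl --namespace metalsoft get deployments
# The site DNS record should now resolve to the DC-B entry point.
dig +short controller.example.com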
Sync procedure¶
The setup uses the same periodic backup files to ‘sync’ the remote database. MetalSoft provides backup scripts that dump the databases; the resulting dump files can then be shipped over to the secondary site.
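As an illustration only, shipping the dump files could be as simple as an hourly rsync from the primary to the stand-by controller; the path /var/backups/metalsoft and the host dr-head-node are placeholders, and the dumps themselves are produced by the MetalSoft-provided backup scripts:
# Example crontab entry on the primary controller: every hour, copy the
# latest database dumps to the stand-by site so it never lags more than the RPO.
0 * * * * rsync -az --delete /var/backups/metalsoft/ dr-head-node:/var/backups/metalsoft/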
Triggering the failover¶
The failover is triggered manually by an administrator on one of the Site B Kubernetes head nodes, or it can be configured to happen automatically via a witness site. The dr_failover_to_dr_site.sh script is executed on the stand-by controller to bring up the deployments and switch the DNS record. MetalSoft provides the script together with the software binaries. Note that the DNS record switching is environment specific and should be done by the client (see the sketch after the usage example below).
Usage:
./dr_failover_to_dr_site.sh k8s_namespace <controller_ip> [kubectl_custom_command]
Example:
./dr_failover_to_dr_site.sh k8s_namespace 10.22.33.11
Where 10.22.33.11 is the IP of the local traefik.
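Because the DNS switch is environment specific, it is not performed by the script. As one possible sketch, on a BIND server that accepts dynamic updates the record could be repointed with nsupdate; the server, zone, record name and TTL below are placeholders:
nsupdate <<'EOF'
server ns1.example.com
zone example.com
update delete controller.example.com. A
update add controller.example.com. 300 A 10.22.33.11
send
EOF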
Testing the failover¶
To test the failover, the process is similar to the normal DR procedure, with the distinction that the main controller can also be brought down gracefully using the same dr_failover_to_dr_site.sh script.
Usage:
./dr_failover_to_dr_site.sh k8s_namespace -0 [kubectl_custom_command] - Scale the controller to 0 replicas to bring it down
./dr_failover_to_dr_site.sh k8s_namespace -1 [kubectl_custom_command] - Scale the controller to 1 replica to bring it back up as is, without importing any backups
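Putting the two modes together, a graceful DR test could look roughly like the following; the namespace metalsoft is a placeholder and 10.22.33.11 is the DR-site traefik IP from the example above. Switching the DNS record back and returning DC-B to its steady state afterwards remain environment-specific steps.
# On the primary (DC-A) head node: scale the controller down gracefully.
./dr_failover_to_dr_site.sh metalsoft -0
# On the stand-by (DC-B) head node: bring up the deployments and import the latest backup.
./dr_failover_to_dr_site.sh metalsoft 10.22.33.11
# After the test, on the primary head node: bring the controller back up as is, without importing any backups.
./dr_failover_to_dr_site.sh metalsoft -1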