Considerations for Disaster Recovery of Metworx Environments


Scope

This document provides considerations for customers when designing a Disaster Recovery solution for their Metworx Environment.

Definitions

Recovery Point Objective (RPO): The maximum targeted period in which data (transactions) might be lost from an IT service due to a major incident.

Recovery Time Objective (RTO): The targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.

AWS Availability Zone (AZ): A group of logical data centers in close proximity to one another.

AWS Region: A physical location around the world where AWS clusters data centers. Each AWS Region consists of multiple isolated and physically separate AZs within a geographic area.

Scopes of Disaster

Metworx environments operate fully in AWS and depend on AWS infrastructure. There are two types of outages to consider for disaster recovery activities:

  1. The AWS Availability Zone (AZ) that hosts Metworx workflows is out of service.
  2. The entire AWS Region that hosts the Metworx GUI and Metworx workflows is out of service due to a catastrophic event.

The following sections describe the high-level customer actions needed in the event of a disaster.

Availability Zone-Wide Outage

If the Availability Zone that hosts the customer's workflows goes down on the AWS side, the customer can configure a subnet equivalent to the one used by the existing workflows (in the same or a different VPC) and switch the organization's networking configuration to it. This requires OrgAdmin credentials in Metworx.

Data residing on disks that were not mounted at the time of the disaster will be unavailable.

Data residing on disks that were mounted at the time of the disaster can be restored from the most recent snapshot backups.
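As an illustration of "most recent snapshot", the sketch below picks the newest completed snapshot for a given volume. It assumes the backups are EBS snapshots and that their metadata is shaped like the records returned by the EC2 DescribeSnapshots API (SnapshotId, VolumeId, State, StartTime). The helper name and all IDs are hypothetical, and the example works on in-memory records rather than calling AWS.

```python
from datetime import datetime, timezone


def latest_completed_snapshot(snapshots, volume_id):
    """Return the newest completed snapshot for volume_id, or None if there is none."""
    candidates = [
        s for s in snapshots
        if s["VolumeId"] == volume_id and s["State"] == "completed"
    ]
    # Snapshots still in the "pending" state are skipped because they may be incomplete.
    return max(candidates, key=lambda s: s["StartTime"], default=None)


# Hypothetical snapshot records; the IDs and timestamps are made up for illustration.
snapshots = [
    {"SnapshotId": "snap-0a1", "VolumeId": "vol-1", "State": "completed",
     "StartTime": datetime(2024, 3, 1, 2, 0, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-0b2", "VolumeId": "vol-1", "State": "completed",
     "StartTime": datetime(2024, 3, 2, 2, 0, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-0c3", "VolumeId": "vol-1", "State": "pending",
     "StartTime": datetime(2024, 3, 3, 2, 0, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-0d4", "VolumeId": "vol-2", "State": "completed",
     "StartTime": datetime(2024, 3, 3, 2, 0, tzinfo=timezone.utc)},
]

chosen = latest_completed_snapshot(snapshots, "vol-1")
print(chosen["SnapshotId"] if chosen else "no completed snapshot found")
```

Filtering on the completed state is a design choice of this sketch: it trades a slightly older restore point for one that is known to be whole.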

Modifying AZ in Organization Configuration

It is highly advisable to pre-configure a standby subnet ahead of time, in the same VPC/Region but in a different AZ, so that workflows can run there if the primary AZ fails.
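A quick way to confirm that such a standby exists is to look for subnets in the same VPC whose AZ differs from the one currently in use. The sketch below does this over records shaped like the output of the EC2 DescribeSubnets API (SubnetId, VpcId, AvailabilityZone). The function name and all IDs are hypothetical, and the check runs on in-memory data rather than against AWS.

```python
def standby_subnets(subnets, vpc_id, active_az):
    """Return subnets in vpc_id that sit in an AZ other than active_az."""
    return [
        s for s in subnets
        if s["VpcId"] == vpc_id and s["AvailabilityZone"] != active_az
    ]


# Hypothetical subnet records; the IDs and zones are made up for illustration.
subnets = [
    {"SubnetId": "subnet-aaa", "VpcId": "vpc-1", "AvailabilityZone": "us-east-1a"},
    {"SubnetId": "subnet-bbb", "VpcId": "vpc-1", "AvailabilityZone": "us-east-1b"},
    {"SubnetId": "subnet-ccc", "VpcId": "vpc-2", "AvailabilityZone": "us-east-1c"},
]

# Workflows are assumed to run in us-east-1a of vpc-1 today.
candidates = standby_subnets(subnets, vpc_id="vpc-1", active_az="us-east-1a")
if candidates:
    print("Standby subnet(s):", [s["SubnetId"] for s in candidates])
else:
    print("No standby subnet in a different AZ; consider creating one.")
```

An empty result means an AZ outage would leave no pre-configured place to run workflows, so the subnet would have to be created during the incident.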

Region-Wide Outage

In the unlikely event of a complete and prolonged AWS Region outage, these are the high-level steps for disaster recovery:

  1. MetrumRG will recover the Metworx GUI in another AWS Region. The original DNS records that customers use to access Metworx will point to the recovered instance. Shared AWS environments will be restored to an alternate AWS Region of MetrumRG's choosing within the same general geographic area as the failed Region (e.g., if us-east-1 fails, the alternate environment will be within the US). Customers that have dedicated Metworx environments can choose an alternate Region.
  2. The customer will be able to create workflows. It is the customer's responsibility to recover their project data (MetrumRG does not have any access to the customer's data on the Metworx system). The exception is fully managed Metworx environments.

Responsibilities and RTO

Several components of the Metworx system must be recovered in the event of a failure.

| Component | Description | Who is responsible | RTO for AZ failure | RTO for Region failure | RPO |
| --- | --- | --- | --- | --- | --- |
| Metworx GUI Environment | Metworx Console application that is used to create/manage Metworx workflows | MetrumRG | N/A. Application is redundant within AZ | 24 hours | 4 hours |
| Project data | Project-related data; input/output files for projects stored on the /data filesystem | While Metworx maintains daily snapshots of the data, it is the customer's responsibility to also back up their project data within their repositories or a filesystem backup | Customer specific | Customer specific | N/A |
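To make the targets above concrete, the sketch below measures a recovery against the Metworx GUI objectives for a Region failure (24-hour RTO, 4-hour RPO). The data-loss window is taken as the time between the last good backup and the incident, and downtime as the time between the incident and restoration. The function names and the timestamps are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime, timedelta

# Targets taken from the Metworx GUI row above (Region failure).
GUI_RTO = timedelta(hours=24)
GUI_RPO = timedelta(hours=4)


def recovery_metrics(last_backup, incident, restored):
    """Return (data_loss_window, downtime) for a single incident."""
    data_loss = incident - last_backup   # work done after the last backup is at risk
    downtime = restored - incident       # how long the service was unavailable
    return data_loss, downtime


def meets_objectives(data_loss, downtime, rpo, rto):
    """True when both the data-loss window and the downtime are within target."""
    return data_loss <= rpo and downtime <= rto


# Hypothetical incident timeline, for illustration only.
last_backup = datetime(2024, 3, 1, 6, 0)
incident = datetime(2024, 3, 1, 9, 30)
restored = datetime(2024, 3, 2, 3, 30)

data_loss, downtime = recovery_metrics(last_backup, incident, restored)
print("Data-loss window:", data_loss)
print("Downtime:", downtime)
print("Within targets:", meets_objectives(data_loss, downtime, GUI_RPO, GUI_RTO))
```

The same calculation can be applied to project data once a customer has set their own RTO and RPO values, since those are customer specific.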