Considerations for Disaster Recovery of Metworx Environments


Overview

This document outlines considerations for customers designing a Disaster Recovery solution for their Metworx environment.

Definitions

Recovery Point Objective (RPO): the maximum targeted period in which data (transactions) might be lost from an IT service due to a major incident.

Recovery Time Objective (RTO): the targeted duration of time and service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.

AWS Availability Zone (AZ): a group of logical data centers in close proximity to one another.

AWS Region: a physical location in the world where AWS clusters data centers. Each AWS Region consists of multiple isolated, physically separate AZs within a geographic area.

Scopes of Disaster

Metworx environments run entirely in AWS and depend on AWS infrastructure. There are two types of outages to consider for disaster recovery activities:

  1. The AWS Availability Zone (AZ) that hosts Metworx workflows is out of service.
  2. The entire AWS Region that hosts the Metworx GUI and Metworx workflows is out of service due to a catastrophic event.

The sections below describe the high-level customer actions needed in the event of a disaster.

AZ-wide Outage

In the event that the availability zone hosting a customer's workflows is down on the AWS side, the customer can configure a subnet equivalent to the one used by existing workflows (in the same or a different VPC) and switch the networking configuration for the organization to it (this requires OrgAdmin credentials in Metworx).

Data on disks that were not mounted at the time of the disaster will remain available. Data on disks that were mounted at the time of the disaster can be restored from the most recent snapshot backups.

Modifying AZ in Organization Configuration

It is highly advisable to pre-configure a standby subnet ahead of time, in the same VPC/Region as the workflows but in a different AZ from the one the workflows run in.
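The standby-subnet recommendation above can be sanity-checked with a short script. The following is a minimal sketch, assuming subnet records shaped like the entries returned by the EC2 DescribeSubnets API (`SubnetId`, `VpcId`, `AvailabilityZone`). The function name `pick_standby_subnets` and the sample IDs are hypothetical, and the actual switch is still made by an OrgAdmin in the Metworx organization configuration.

```python
from typing import Dict, List


def pick_standby_subnets(subnets: List[Dict[str, str]], failed_az: str,
                         preferred_vpc: str = "") -> List[Dict[str, str]]:
    """Return subnets outside the failed AZ, same-VPC candidates first."""
    candidates = [s for s in subnets if s["AvailabilityZone"] != failed_az]
    # sorted() is stable: subnets in the preferred VPC move to the front,
    # and the original order is otherwise preserved.
    return sorted(candidates, key=lambda s: s["VpcId"] != preferred_vpc)


# Hypothetical inventory; real values would come from your own AWS account.
SUBNETS = [
    {"SubnetId": "subnet-aaa", "VpcId": "vpc-1", "AvailabilityZone": "us-east-1a"},
    {"SubnetId": "subnet-bbb", "VpcId": "vpc-2", "AvailabilityZone": "us-east-1b"},
    {"SubnetId": "subnet-ccc", "VpcId": "vpc-1", "AvailabilityZone": "us-east-1c"},
]

if __name__ == "__main__":
    for subnet in pick_standby_subnets(SUBNETS, "us-east-1a", preferred_vpc="vpc-1"):
        print(subnet["SubnetId"], subnet["AvailabilityZone"])
```

An empty result for any AZ that currently hosts workflows means no standby subnet has been prepared for that failure case.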

Region-Wide Outage

In the unlikely event of a complete, prolonged AWS Region outage, these are the high-level steps for Disaster Recovery:

  1. MetrumRG will recover the Metworx GUI in another AWS Region. The original DNS records that customers use to access Metworx will point to the recovered instance. Shared AWS environments will be restored to an alternate AWS Region of MetrumRG's choosing within the same general geographic area as the failed region (e.g., if us-east-1 fails, the alternate environment will be within the US). Customers that have dedicated Metworx environments can choose an alternate region.
  2. The customer will be able to create workflows. It is the customer's responsibility to recover their project data (MetrumRG does not have any access to the customer's data on the Metworx system). The exception is fully managed Metworx environments.
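Step 1 keeps the alternate region within the same general geography as the failed one. As an illustration only, the sketch below lists candidate regions that share the leading token of the failed region's code. This prefix match is a crude proxy and an assumption of this example, not MetrumRG's selection method; the function name and the short region list are hypothetical choices for the example.

```python
from typing import Iterable, List

# A few real AWS region codes, listed only for illustration; this is not
# the set of regions MetrumRG actually uses.
KNOWN_REGIONS = ["us-east-1", "us-east-2", "us-west-1", "us-west-2",
                 "eu-west-1", "eu-central-1"]


def same_geography_candidates(failed_region: str,
                              regions: Iterable[str] = KNOWN_REGIONS) -> List[str]:
    """List regions whose code shares the failed region's leading token."""
    geography = failed_region.split("-")[0]  # "us-east-1" -> "us"
    return [r for r in regions
            if r != failed_region and r.split("-")[0] == geography]


if __name__ == "__main__":
    print(same_geography_candidates("us-east-1"))
```

Customers with dedicated environments who have data-residency requirements may want a stricter rule than a prefix match, since a prefix such as `eu` spans several countries.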

Responsibilities and RTO

Several components of the Metworx system must be recovered in the event of a failure.

| Component | Description | Who is responsible | RTO for AZ failure | RTO for Region failure | RPO |
|---|---|---|---|---|---|
| Metworx GUI Environment | Metworx Console application used to create/manage Metworx workflows | MetrumRG | N/A (the application is redundant within AZ) | 24 hours | 4 hours |
| Project data | Project-related data; input/output files for projects stored on the /data filesystem | While Metworx maintains daily snapshots of the data, it is the customer's responsibility to also back up their project data within their repositories or a filesystem backup | Customer specific | Customer specific | N/A |
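To make the targets in the table concrete: a recovery meets its RPO when the last usable backup is no older than the RPO at the moment of the incident, and meets its RTO when service is restored within the RTO. The sketch below applies that arithmetic using the 24-hour Region-failure RTO and 4-hour RPO listed for the Metworx GUI. The function name and all timestamps are hypothetical, chosen only for the example.

```python
from datetime import datetime, timedelta


def meets_objectives(last_backup: datetime, incident: datetime,
                     restored: datetime, rpo: timedelta,
                     rto: timedelta) -> dict:
    """Compare an actual (or rehearsed) recovery against RPO and RTO targets."""
    data_loss_window = incident - last_backup  # work that could be lost
    downtime = restored - incident             # time the service was unavailable
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


if __name__ == "__main__":
    # Hypothetical timeline: backup at 06:00, outage at 09:30,
    # service restored at 04:00 the next day.
    result = meets_objectives(
        last_backup=datetime(2024, 1, 10, 6, 0),
        incident=datetime(2024, 1, 10, 9, 30),
        restored=datetime(2024, 1, 11, 4, 0),
        rpo=timedelta(hours=4),
        rto=timedelta(hours=24),
    )
    print(result["rpo_met"], result["rto_met"])
```

The same check can be run with customer-specific targets for project data, substituting the age of the most recent repository push or filesystem backup for `last_backup`.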