Skip to main content

Operations

Support, stability, and dependency info

High-availability Namespaces are in Public Preview for Temporal Cloud.

No audits, updates, intros, re-org. Converted to markdown automatically from Google Doc rich text, so there are many errors

How do you trigger failovers and observe Workflow Executions? This section provides how-to instructions for the following operations tasks:

Triggering failovers

Failovers happen automatically in Temporal when a regional outage or disaster affects a multi-region Namespace. You can also trigger a failover based on custom alerts or for testing purposes. This section explains how to manually trigger a failover and what to expect afterward.

Regular failover testing ensures your app can handle disruptions and continue running smoothly in production. Whether responding to incident warnings or conducting tests, follow the steps in the next sections to move your active Namespace to its standby region and learn how to handle failovers effectively.

For details on how Temporal detects conditions and triggers failovers automatically, see Failovers.

Check Your Replication Lag

Always check the metric replication lag before initiating a failover. A forced failover when there is a large replication lag has a higher likelihood of rolling back Workflow progress.

Performing manual failovers

You can trigger a failover manually using the Temporal Cloud Web UI or the tcld CLI, depending on your preference and setup. The following table outlines the steps for each method:

MethodInstructions
Temporal Cloud Web UI1. Visit the Namespace page on the Temporal Cloud Web UI.
2. Navigate to your Namespace details page and select the Trigger a failover option from the menu.
3. After confirming, the failover will be initiated.
Temporal tcld CLITo manually trigger a failover, run the following command in your terminal:
tcld namespace failover \
    --namespace <namespace_id>.<account_id> \
    --region <target_region>

Post-failover event information

After any failover, whether triggered by you or by Temporal, event information appears in both the Temporal Cloud Web UI (on the Namespace detail page) and in your audit logs. The audit log entry for Failover uses the "operation": "FailoverNamespace" event. After failover, the Namespace is active in the new region.

You don't need to monitor Temporal Cloud's failover response in real-time. Whenever there is a failover event, users with the Account Owner and Global Admin roles automatically receive an alert email.

Failbacks

After Temporal-initiated failovers, Temporal Cloud shifts Workflow Execution processing back to the original region that was active before the incident (a "failback") once the incident is resolved.

Reasons to test failing over

Microservices and external dependencies will fail at some point. Testing failovers ensures your app can handle these failures effectively. Temporal recommends regular and periodic failover testing for mission-critical applications in production. By testing in non-emergency conditions, you verify that your app continues to function even when parts of the infrastructure fail.

Safety First

If this is your first time performing a failover test, run it with a test-specific namespace and application. This helps you gain operational experience before applying it to your production environment. Practice runs help ensure the process runs smoothly during real incidents in production.

Trigger testing can:

  • Validate multi-region deployments: In multi-region setups, failover testing ensures your app can run from another region when the primary region experiences outages. This maintains high availability in mission-critical deployments. Manual testing confirms the failover mechanism works as expected, so your system handles regional outages or disasters effectively.

  • Assess replication lag: Monitoring replication lag between regions is crucial in multi-region setups. Check the lag before initiating a failover to avoid rolling back Workflow progress. Manual testing helps you practice this critical step and understand its impact. When there's no real incident, the switch over (recovery) should happen almost instantly.

  • Assess recovery time: Manual testing helps you measure actual recovery time. You can check if it meets your expected Recovery Time Objective (RTO) of 20 minutes or less, as stated in the Multi-region Namespace SLA.

  • Identify potential issues: Failover testing uncovers problems not visible during normal operation. This includes issues like backlogs and capacity planning and how external dependencies behave during a failover event.

  • Validate fault-oblivious programming: Temporal uses a "fault-oblivious programming" model, where your app doesn’t need to explicitly handle many types of failures. Testing failovers ensures that this model works as expected in your app.

  • Operational readiness: Regular testing familiarizes your team with the failover process, improving their ability to handle real incidents when they arise.

Testing failovers regularly ensures your Temporal-based applications remain resilient and reliable, even when infrastructure fails.

Metrics

Replication lag refers to the transmission delay of Workflow updates and history events from the active region to the standby region. A forced failover when there is a large replication lag has a higher likelihood of rolling back Workflow progress, so always check the metric replication lag before initiating a failover. Temporal Cloud emits three replication lag-specific metrics. The following samples demonstrate how you can use these metrics to explore replication lag.

P99 replication lag histogram

histogram_quantile(0.99, sum(rate(temporal_cloud_v0_replication_lag_bucket[$__rate_interval])) by (temporal_namespace, le))

Average replication lag

sum(rate(temporal_cloud_v0_replication_lag_sum[$__rate_interval])) by (temporal_namespace)
/
sum(rate(temporal_cloud_v0_replication_lag_count[$__rate_interval])) by (temporal_namespace)

Monitoring and observability

You can view and alert on key cloud metrics using the Web UI, the 'tcld' CLI utility, and Temporal Cloud APIs. For example, during the process of adding a region to a Namespace, you can see the progress of Workflow replication. Errors -- if any occur -- will also surface in the Namespace Web UI.

info

You may notice that multi-region Namespace shows twice (2x) the Action count in temporal_cloud_v0_total_action_count. This doubling happens due to regional replication.

Auditing operational events

Temporal Cloud provides several ways to audit events:

  • When Temporal triggers failovers, the audit log updates with details. Look specifically for "operation": "FailoverNamespace" in the logs.
  • You can set alerts for Temporal-initiated failover events.
  • After a failover, you can check that the Namespace is active in the new region using the Temporal Cloud Web UI.

Worker Deployment

Enabling the multi-region Namespace does not require specific Worker configuration. The process is invisible to the Workers. When a Namespace fails over to the standby region, the DNS redirection orchestrated by Temporal ensures that your existing Workers continue to poll the Namespace without interruption. More details are available in the Routing section below.

info
  • When a Namespace fails over to a standby region, Workers will be communicating cross-region.

  • In case of a complete regional outage, Workers in the original region may fail alongside the original Namespace. To keep Workflows moving during this level of outage, deploy a second set of Workers to your standby region.

Routing

When using multi-region for a Namespace, the Namespace's DNS record <ns>.<acct>.<tmprl_domain> targets a regional DNS record in the format <region>.region.<tmprl_domain>. In this format, <region> is the currently active region for your Namespace. Clients resolving the Namespace’s DNS record are directed to connect to the active region for that Namespace, thanks to the regional DNS record.

During failover, Temporal Cloud changes the target of the Namespace DNS record from one region to another. Namespace DNS records are configured with a 15 seconds TTL. Any DNS cache should re-resolve the record within this delay. As a rule of thumb, DNS reconciliation takes no longer than twice (2x) the TTL. Clients should converge to the newly targeted region within, at, most a 30-second delay.

info

Some networking configuration is required for failover to be transparent to clients and workers when using PrivateLink. This section describes how to configure routing for multi-region Namespaces for PrivateLink customers only.

PrivateLink customers may need to change certain configurations for multi-region Namespace use. Routing configuration depends on networking setup and use of PrivateLink. You may need to:

  • override a DNS zone; and
  • ensure the network connectivity between the two regions.

Customer side solution example

When using PrivateLink, you connect to Temporal Cloud using IP addresses local to your network. The region.<tmprl_domain> zone is configured in the Temporal systems as an independent zone. This allows you to override it to make sure traffic is routed internally for the regions in use. You can check the Namespace's active region using the Namespace record CNAME, which is public.

To set up the DNS override, you override specific regions to target the relevant IP addresses (e.g. aws-us-west-1.region.tmprl.cloud to target 192.168.1.2). Using AWS, this can be done using a private hosted zone in Route53 for region.<tmprl_domain>. Link that private zone to the VPCs you use for Workers. Private Link is not yet offered for GCP multi-region Namespaces.

When your Workers connect to the Namespace, they first resolve the <ns>.<acct>.<tmprl_domain> record. This targets <active>.region.<tmprl_domain> using a CNAME. Your private zone overrides that second DNS resolution, leading traffic to reach the internal IP you're using.

Consider how you'll configure Workers to run in this scenario. You might set Workers to run in both regions at all times. Alternately, you could establish connectivity between the regions to redirect Workers once failover occurs.

The following table lists Temporal's available regions, PrivateLink endpoints, and DNS record overrides. The sa-east-1 region listed here is not yet available for use with multi-region Namespaces.

RegionPrivateLink Service NameDNS Record Override
ap-northeast-1com.amazonaws.vpce.ap-northeast-1.vpce-svc-08f34c33f9fb8a48aaws-ap-northeast-1.region.tmprl.cloud
ap-northeast-2com.amazonaws.vpce.ap-northeast-2.vpce-svc-08c4d5445a5aad308aws-ap-northeast-2.region.tmprl.cloud
ap-south-1com.amazonaws.vpce.ap-south-1.vpce-svc-0ad4f8ed56db15662aws-ap-south-1.region.tmprl.cloud
ap-south-2com.amazonaws.vpce.ap-south-2.vpce-svc-08bcf602b646c69c1aws-ap-south-2.region.tmprl.cloud
ap-southeast-1com.amazonaws.vpce.ap-southeast-1.vpce-svc-05c24096fa89b0ccdaws-ap-southeast-1.region.tmprl.cloud
ap-southeast-2com.amazonaws.vpce.ap-southeast-2.vpce-svc-0634f9628e3c15b08aws-ap-southeast-2.region.tmprl.cloud
ca-central-1com.amazonaws.vpce.ca-central-1.vpce-svc-080a781925d0b1d9daws-ca-central-1.region.tmprl.cloud
eu-central-1com.amazonaws.vpce.eu-central-1.vpce-svc-073a419b36663a0f3aws-eu-central-1.region.tmprl.cloud
eu-west-1com.amazonaws.vpce.eu-west-1.vpce-svc-04388e89f3479b739aws-eu-west-1.region.tmprl.cloud
eu-west-2com.amazonaws.vpce.eu-west-2.vpce-svc-0ac7f9f07e7fb5695aws-eu-west-2.region.tmprl.cloud
sa-east-1com.amazonaws.vpce.sa-east-1.vpce-svc-0ca67a102f3ce525aaws-sa-east-1.region.tmprl.cloud
us-east-1com.amazonaws.vpce.us-east-1.vpce-svc-0822256b6575ea37faws-us-east-1.region.tmprl.cloud
us-east-2com.amazonaws.vpce.us-east-2.vpce-svc-01b8dccfc6660d9d4aws-us-east-2.region.tmprl.cloud
us-west-2com.amazonaws.vpce.us-west-2.vpce-svc-0f44b3d7302816b94aws-us-west-2.region.tmprl.cloud
Learn more about multi-region Namespaces

If you have more questions or feedback about this feature, reach out to the product team.