Ensure OpenSearch Domains Have Fault-Tolerant Data Nodes

Overview

This check verifies that your Amazon OpenSearch Service domains are configured for fault tolerance. A fault-tolerant domain requires:

  • At least 3 data nodes to maintain quorum and data availability
  • Zone Awareness enabled to distribute data across multiple Availability Zones

Without these settings, a single node or zone failure could cause data loss or service outages.
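
The two requirements above amount to a simple predicate over a domain's cluster configuration. The following is a minimal sketch, assuming the node count and Zone Awareness flag have already been read from the domain; the function name is_fault_tolerant is illustrative and not part of any AWS tool.

```shell
# Illustrative helper (not an AWS command): succeeds only when a domain
# meets both requirements of this check.
#   $1 = number of data nodes
#   $2 = "true" or "false" for Zone Awareness
is_fault_tolerant() {
  instance_count="$1"
  zone_awareness="$2"
  [ "$instance_count" -ge 3 ] && [ "$zone_awareness" = "true" ]
}

is_fault_tolerant 3 true  && echo "3 nodes, zone-aware: PASS"
is_fault_tolerant 2 true  || echo "2 nodes, zone-aware: FAIL"
is_fault_tolerant 3 false || echo "3 nodes, single zone: FAIL"
```

Both conditions must hold: three nodes in a single zone still fail the check, as do two nodes spread across zones.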

Risk

If your OpenSearch domain lacks fault tolerance:

  • A single node failure could make data shards unavailable
  • An Availability Zone outage could take down your entire cluster
  • Write operations may fail during node recovery
  • Data inconsistency can occur during rebalancing
  • Your search and analytics applications may experience downtime

Remediation Steps

Prerequisites

  • Access to the AWS Console with permissions to modify OpenSearch domains, OR
  • AWS CLI configured with appropriate credentials
  • Your domain must support at least 3 nodes (check instance type limits)

AWS Console Method

  1. Sign in to the AWS Console and navigate to Amazon OpenSearch Service
  2. Select your domain from the list
  3. Click Edit domain
  4. Under Cluster configuration:
    • Set Number of data nodes to 3 or more
    • Enable Zone Awareness
    • Set Availability Zones to 3 (recommended) or 2
  5. Review your changes and click Submit

Note: Configuration changes typically take 15-30 minutes to complete. The domain status shows "Processing" until the update finishes.

AWS CLI (optional)

Update your domain to enable fault tolerance:

aws opensearch update-domain-config \
  --domain-name <your-domain-name> \
  --cluster-config '{
    "InstanceCount": 3,
    "ZoneAwarenessEnabled": true,
    "ZoneAwarenessConfig": {
      "AvailabilityZoneCount": 3
    }
  }' \
  --region us-east-1

Replace <your-domain-name> with your actual domain name.

For 2 Availability Zones (if 3-AZ is not available in your region), use an even number of data nodes, because a two-zone deployment requires an even count:

aws opensearch update-domain-config \
  --domain-name <your-domain-name> \
  --cluster-config '{
    "InstanceCount": 4,
    "ZoneAwarenessEnabled": true,
    "ZoneAwarenessConfig": {
      "AvailabilityZoneCount": 2
    }
  }' \
  --region us-east-1
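
After submitting an update, the domain reports that it is processing until the change has been applied. The following is a rough sketch of a polling loop, assuming a configured AWS CLI v2; the function name wait_for_domain and the 60-second default interval are arbitrary choices for illustration. It reads the Processing flag from the describe-domain response.

```shell
# Sketch: poll until the domain finishes applying a configuration change.
#   $1 = domain name, $2 = region (default us-east-1),
#   $3 = seconds between checks (default 60)
wait_for_domain() {
  domain_name="$1"
  region="${2:-us-east-1}"
  interval="${3:-60}"
  while :; do
    # JSON output prints the boolean as lowercase true/false.
    processing=$(aws opensearch describe-domain \
      --domain-name "$domain_name" \
      --query 'DomainStatus.Processing' \
      --output json \
      --region "$region") || return 1
    [ "$processing" = "false" ] && break
    echo "Domain $domain_name is still processing; checking again in ${interval}s"
    sleep "$interval"
  done
  echo "Domain $domain_name has finished processing"
}

# Usage: wait_for_domain <your-domain-name> us-east-1
```

The function returns a non-zero status if the CLI call itself fails, so a script can tell "still processing" apart from a credentials or permissions problem.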

Check the current configuration:

aws opensearch describe-domain \
  --domain-name <your-domain-name> \
  --query 'DomainStatus.ClusterConfig' \
  --region us-east-1

CloudFormation (optional)

Use this template to create or update an OpenSearch domain with fault-tolerant configuration:

AWSTemplateFormatVersion: '2010-09-09'
Description: OpenSearch domain with fault-tolerant configuration

Parameters:
  DomainName:
    Type: String
    Description: Name of the OpenSearch domain
    Default: my-opensearch-domain

  InstanceType:
    Type: String
    Description: Instance type for data nodes
    Default: r6g.large.search

Resources:
  OpenSearchDomain:
    Type: AWS::OpenSearchService::Domain
    Properties:
      DomainName: !Ref DomainName
      EngineVersion: OpenSearch_2.11
      ClusterConfig:
        InstanceType: !Ref InstanceType
        InstanceCount: 3
        ZoneAwarenessEnabled: true
        ZoneAwarenessConfig:
          AvailabilityZoneCount: 3
        DedicatedMasterEnabled: true
        DedicatedMasterType: m6g.large.search
        DedicatedMasterCount: 3
      EBSOptions:
        EBSEnabled: true
        VolumeType: gp3
        VolumeSize: 100
      NodeToNodeEncryptionOptions:
        Enabled: true
      EncryptionAtRestOptions:
        Enabled: true
      DomainEndpointOptions:
        EnforceHTTPS: true
        TLSSecurityPolicy: Policy-Min-TLS-1-2-2019-07

Outputs:
  DomainArn:
    Description: ARN of the OpenSearch domain
    Value: !GetAtt OpenSearchDomain.Arn
  DomainEndpoint:
    Description: Endpoint of the OpenSearch domain
    Value: !GetAtt OpenSearchDomain.DomainEndpoint

Deploy the stack:

aws cloudformation deploy \
  --template-file opensearch-fault-tolerant.yaml \
  --stack-name opensearch-fault-tolerant \
  --parameter-overrides DomainName=<your-domain-name> \
  --region us-east-1

Terraform (optional)

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

variable "domain_name" {
  description = "Name of the OpenSearch domain"
  type        = string
  default     = "my-opensearch-domain"
}

variable "instance_type" {
  description = "Instance type for data nodes"
  type        = string
  default     = "r6g.large.search"
}

resource "aws_opensearch_domain" "main" {
  domain_name    = var.domain_name
  engine_version = "OpenSearch_2.11"

  cluster_config {
    instance_type          = var.instance_type
    instance_count         = 3
    zone_awareness_enabled = true

    zone_awareness_config {
      availability_zone_count = 3
    }

    dedicated_master_enabled = true
    dedicated_master_type    = "m6g.large.search"
    dedicated_master_count   = 3
  }

  ebs_options {
    ebs_enabled = true
    volume_type = "gp3"
    volume_size = 100
  }

  encrypt_at_rest {
    enabled = true
  }

  node_to_node_encryption {
    enabled = true
  }

  domain_endpoint_options {
    enforce_https       = true
    tls_security_policy = "Policy-Min-TLS-1-2-2019-07"
  }

  tags = {
    Environment = "production"
  }
}

output "domain_arn" {
  description = "ARN of the OpenSearch domain"
  value       = aws_opensearch_domain.main.arn
}

output "domain_endpoint" {
  description = "Endpoint of the OpenSearch domain"
  value       = aws_opensearch_domain.main.endpoint
}

Apply the configuration:

terraform init
terraform apply -var="domain_name=<your-domain-name>"

Verification

After making changes, verify your domain is fault-tolerant:

  1. In the AWS Console, navigate to Amazon OpenSearch Service
  2. Select your domain
  3. Under Cluster configuration, confirm:
    • Number of data nodes is 3 or more
    • Zone Awareness is enabled
    • Availability Zones shows 2 or 3

CLI verification

aws opensearch describe-domain \
  --domain-name <your-domain-name> \
  --query 'DomainStatus.ClusterConfig.{InstanceCount:InstanceCount,ZoneAwarenessEnabled:ZoneAwarenessEnabled,AvailabilityZoneCount:ZoneAwarenessConfig.AvailabilityZoneCount}' \
  --region us-east-1

Expected output for a fault-tolerant domain:

{
  "InstanceCount": 3,
  "ZoneAwarenessEnabled": true,
  "AvailabilityZoneCount": 3
}
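
To run the same verification against every domain in a region rather than one at a time, the describe-domain query can be wrapped in a loop. The following is a minimal sketch, assuming a configured AWS CLI v2 with read access to OpenSearch Service; the function name audit_domains and the PASS/FAIL report format are illustrative.

```shell
# Sketch: report every domain in a region that fails this check.
#   $1 = region (default us-east-1)
# Returns 0 if all domains pass, 1 if any fail, 2 if a CLI call fails.
audit_domains() {
  region="${1:-us-east-1}"
  failed=0
  domains=$(aws opensearch list-domain-names \
    --query 'DomainNames[].DomainName' --output text --region "$region") || return 2
  for domain in $domains; do
    config=$(aws opensearch describe-domain --domain-name "$domain" \
      --query 'DomainStatus.ClusterConfig.[InstanceCount,ZoneAwarenessEnabled]' \
      --output text --region "$region") || return 2
    # Text output is tab-separated: node count, then the boolean flag.
    count=$(printf '%s\n' "$config" | cut -f1)
    zone_aware=$(printf '%s\n' "$config" | cut -f2)
    # Text output may capitalize booleans, so accept both spellings.
    case "$zone_aware" in
      [Tt]rue) zone_ok=1 ;;
      *)       zone_ok=0 ;;
    esac
    if [ "$count" -ge 3 ] && [ "$zone_ok" -eq 1 ]; then
      echo "PASS $domain (nodes=$count, zone awareness=$zone_aware)"
    else
      echo "FAIL $domain (nodes=$count, zone awareness=$zone_aware)"
      failed=1
    fi
  done
  return "$failed"
}

# Usage: audit_domains us-east-1
```

The non-zero return value on any failure makes the function usable as a gate in a scheduled job or pipeline step.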

Notes

  • Node count multiples: Use node counts in multiples of 3 when using 3 Availability Zones (e.g., 3, 6, 9) to ensure even distribution across zones.
  • Replica shards: Configure your indices with at least 1 replica shard to take full advantage of fault tolerance.
  • Dedicated master nodes: For production workloads, enable dedicated master nodes (3 is recommended) to improve cluster stability.
  • VPC considerations: If your domain is in a VPC, ensure you have subnets in at least as many Availability Zones as your ZoneAwarenessConfig specifies.
  • Cost impact: Enabling fault tolerance increases costs due to additional nodes and cross-zone data transfer. Plan your capacity accordingly.
  • Update time: Domain configuration changes can take 15-30 minutes to complete. During this time, the domain remains available but may have reduced performance.
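
The node count guidance in the notes above comes down to integer division. The following is a small sketch of the arithmetic, assuming nodes are placed as evenly as possible across zones; which zone receives a leftover node is an assumption of this example, and nodes_per_zone is a hypothetical helper.

```shell
# Sketch: show how N data nodes spread across Z Availability Zones.
#   $1 = number of data nodes, $2 = number of Availability Zones
nodes_per_zone() {
  nodes="$1"
  zones="$2"
  base=$((nodes / zones))    # nodes every zone receives
  extra=$((nodes % zones))   # leftover nodes, one each for the first zones
  zone=1
  while [ "$zone" -le "$zones" ]; do
    if [ "$zone" -le "$extra" ]; then
      echo "zone $zone: $((base + 1)) nodes"
    else
      echo "zone $zone: $base nodes"
    fi
    zone=$((zone + 1))
  done
}

nodes_per_zone 6 3   # even: 2 nodes in each zone
nodes_per_zone 4 3   # uneven: one zone holds an extra node
```

With 4 nodes in 3 zones, losing the zone that holds 2 nodes removes half the cluster's capacity, which is why multiples of the zone count are preferred.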