SageMaker Training Jobs Network Isolation Enabled

Overview

This check verifies that Amazon SageMaker training jobs have network isolation enabled. When network isolation is turned on, training containers cannot make inbound or outbound network calls during execution. This is a critical security control for machine learning workloads.

Risk

Without network isolation, training code running in SageMaker can reach the internet or internal services. This creates several serious risks:

  • Data exfiltration - Attackers could steal your datasets and trained model artifacts
  • Supply-chain compromise - Malicious code could be downloaded from untrusted sources during training
  • Command and control - Compromised containers could establish connections to attacker infrastructure or abuse your compute resources

These threats affect the confidentiality and integrity of your ML workloads, with potential availability impact as well.

Remediation Steps

Prerequisites

  • AWS account access with permissions to create SageMaker training jobs
  • Understanding of your training job requirements (some jobs legitimately need network access)

Important: Network isolation cannot be added to existing training jobs. You must create new compliant jobs and retire non-compliant ones.

AWS Console Method

SageMaker training jobs are typically created through code (SDK, CLI, or notebooks) rather than the console. However, if you use SageMaker Studio or the console wizard:

  1. Sign in to the AWS Management Console
  2. Navigate to Amazon SageMaker
  3. When creating a new training job, look for the Network section
  4. Enable Network isolation (or "Isolate training container")
  5. Complete the rest of your training job configuration
  6. Submit the training job

For jobs created via notebooks or scripts, see the CLI and SDK sections below.

AWS CLI (optional)

List Existing Training Jobs

First, identify your training jobs to understand which ones need remediation:

# List all training jobs in us-east-1
aws sagemaker list-training-jobs \
  --region us-east-1 \
  --output table

Check Network Isolation Status

For each training job, check if network isolation is enabled:

# Replace <training-job-name> with your actual job name
aws sagemaker describe-training-job \
  --training-job-name <training-job-name> \
  --region us-east-1 \
  --query '{JobName: TrainingJobName, NetworkIsolation: EnableNetworkIsolation, Status: TrainingJobStatus}'

Create a New Training Job with Network Isolation

Since you cannot modify existing jobs, create a new one with network isolation enabled:

aws sagemaker create-training-job \
  --training-job-name my-isolated-training-job \
  --role-arn arn:aws:iam::123456789012:role/SageMakerExecutionRole \
  --algorithm-specification '{
    "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    "TrainingInputMode": "File"
  }' \
  --input-data-config '[{
    "ChannelName": "training",
    "DataSource": {
      "S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": "s3://my-bucket/training-data/",
        "S3DataDistributionType": "FullyReplicated"
      }
    }
  }]' \
  --output-data-config '{
    "S3OutputPath": "s3://my-bucket/output/"
  }' \
  --resource-config '{
    "InstanceType": "ml.m5.large",
    "InstanceCount": 1,
    "VolumeSizeInGB": 30
  }' \
  --stopping-condition '{
    "MaxRuntimeInSeconds": 86400
  }' \
  --enable-network-isolation \
  --region us-east-1

The key flag is --enable-network-isolation.

Batch Check All Training Jobs

To audit all training jobs for network isolation status:

# List all training jobs and check their network isolation status
for job in $(aws sagemaker list-training-jobs --region us-east-1 --query 'TrainingJobSummaries[].TrainingJobName' --output text); do
  echo "Checking: $job"
  aws sagemaker describe-training-job \
    --training-job-name "$job" \
    --region us-east-1 \
    --query '{Job: TrainingJobName, NetworkIsolation: EnableNetworkIsolation}'
done

Python SDK (Boto3)

Check Existing Training Jobs

import boto3

sagemaker = boto3.client('sagemaker', region_name='us-east-1')

# list_training_jobs is paginated, so use a paginator to cover every job
paginator = sagemaker.get_paginator('list_training_jobs')

for page in paginator.paginate():
    for job in page['TrainingJobSummaries']:
        job_name = job['TrainingJobName']
        details = sagemaker.describe_training_job(TrainingJobName=job_name)

        # Treat a missing field as "not isolated"
        isolation_enabled = details.get('EnableNetworkIsolation', False)
        print(f"{job_name}: Network Isolation = {isolation_enabled}")
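
The loop above prints every job. For an audit it is often more useful to collect only the non-compliant ones. The following is a minimal sketch under that assumption: `find_non_isolated` is a hypothetical helper (not part of boto3) that takes already-fetched `describe_training_job` responses as plain dictionaries, so it can be tried without AWS access. It treats a missing `EnableNetworkIsolation` field as `False`, mirroring the `.get(..., False)` default used above.

```python
from typing import Dict, Iterable, List


def find_non_isolated(job_details: Iterable[Dict]) -> List[str]:
    """Return names of jobs whose EnableNetworkIsolation is missing or False."""
    return [
        details["TrainingJobName"]
        for details in job_details
        if not details.get("EnableNetworkIsolation", False)
    ]


# Illustrative data shaped like describe_training_job responses (not real jobs)
sample = [
    {"TrainingJobName": "job-a", "EnableNetworkIsolation": True},
    {"TrainingJobName": "job-b", "EnableNetworkIsolation": False},
    {"TrainingJobName": "job-c"},
]
non_compliant = find_non_isolated(sample)
print(non_compliant)
```

In a real audit, the input would be the responses collected from `describe_training_job`, and the resulting list identifies the jobs to recreate with isolation enabled.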

Create Training Job with Network Isolation

import boto3

sagemaker = boto3.client('sagemaker', region_name='us-east-1')

response = sagemaker.create_training_job(
    TrainingJobName='my-isolated-training-job',
    RoleArn='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    AlgorithmSpecification={
        'TrainingImage': '123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest',
        'TrainingInputMode': 'File'
    },
    InputDataConfig=[{
        'ChannelName': 'training',
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': 's3://my-bucket/training-data/',
                'S3DataDistributionType': 'FullyReplicated'
            }
        }
    }],
    OutputDataConfig={
        'S3OutputPath': 's3://my-bucket/output/'
    },
    ResourceConfig={
        'InstanceType': 'ml.m5.large',
        'InstanceCount': 1,
        'VolumeSizeInGB': 30
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 86400
    },
    EnableNetworkIsolation=True  # Critical setting
)

print(f"Training job ARN: {response['TrainingJobArn']}")
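
If several scripts build `create_training_job` requests, one way to avoid forgetting the flag is to pass every request through a small guard before submission. The sketch below assumes requests are assembled as keyword-argument dictionaries. `require_network_isolation` is a hypothetical helper name, and raising on an explicit `False` (rather than silently overriding it) is a design choice, not a SageMaker requirement.

```python
def require_network_isolation(request: dict) -> dict:
    """Return a copy of the request with EnableNetworkIsolation=True.

    Raises ValueError if the caller explicitly set the flag to False,
    so a deliberate opt-out is surfaced instead of silently overridden.
    """
    if request.get("EnableNetworkIsolation") is False:
        name = request.get("TrainingJobName", "<unnamed>")
        raise ValueError(f"{name}: network isolation explicitly disabled")
    # Copy instead of mutating, so the caller's dictionary is left untouched
    return {**request, "EnableNetworkIsolation": True}


request = {"TrainingJobName": "my-isolated-training-job"}
guarded = require_network_isolation(request)
print(guarded["EnableNetworkIsolation"])  # True
```

The guarded dictionary, once it contains the remaining required fields, could then be submitted with `sagemaker.create_training_job(**guarded)`.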

Using SageMaker Python SDK (Higher-Level)

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri='123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest',
    role='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://my-bucket/output/',
    enable_network_isolation=True  # Enable network isolation
)

estimator.fit({'training': 's3://my-bucket/training-data/'})

CloudFormation (optional)

Note: AWS CloudFormation does not have a native AWS::SageMaker::TrainingJob resource because training jobs are ephemeral workloads, not persistent infrastructure. Training jobs are typically launched from code, notebooks, or CI/CD pipelines.

For infrastructure-as-code approaches with SageMaker, consider:

  1. Define compliant training job configurations in your codebase
  2. Use AWS Service Catalog to provide approved training job templates
  3. Implement guardrails via SCP or IAM policies (see below)
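
For the first option, a CI step can reject job definitions that omit the flag before they are ever submitted. The sketch below assumes job definitions are stored as JSON documents that use the same field names as the CreateTrainingJob API. That storage format, the file names, and the function name `violations` are assumptions for illustration only.

```python
import json
from typing import Dict, List


def violations(config_texts: Dict[str, str]) -> List[str]:
    """Given {filename: JSON text}, return sorted filenames that do not
    set EnableNetworkIsolation to true."""
    bad = []
    for name, text in config_texts.items():
        config = json.loads(text)
        if config.get("EnableNetworkIsolation") is not True:
            bad.append(name)
    return sorted(bad)


# Illustrative in-memory configs; a CI job would read these from the repository
configs = {
    "train_a.json": '{"TrainingJobName": "a", "EnableNetworkIsolation": true}',
    "train_b.json": '{"TrainingJobName": "b"}',
}
failed = violations(configs)
print(failed)
```

A pipeline could fail the build whenever the returned list is non-empty.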

Enforce Network Isolation via IAM Policy

You can create an IAM policy that denies creation of training jobs without network isolation:

AWSTemplateFormatVersion: '2010-09-09'
Description: IAM policy to enforce network isolation on SageMaker training jobs

Resources:
  EnforceNetworkIsolationPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: EnforceSageMakerNetworkIsolation
      Description: Denies creation of SageMaker training jobs without network isolation
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: DenyTrainingJobsWithoutNetworkIsolation
            Effect: Deny
            Action:
              - sagemaker:CreateTrainingJob
            Resource: '*'
            Condition:
              Bool:
                'sagemaker:NetworkIsolation': 'false'

Outputs:
  PolicyArn:
    Description: ARN of the enforcement policy
    Value: !Ref EnforceNetworkIsolationPolicy

Attach this policy to IAM roles used for creating SageMaker training jobs.

Terraform (optional)

Note: The AWS Terraform provider does not have an aws_sagemaker_training_job resource because training jobs are ephemeral workloads launched via API, not persistent infrastructure.

Enforce Network Isolation via IAM Policy

Use Terraform to create an IAM policy that enforces network isolation:

terraform {
  required_version = ">= 1.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# IAM policy that denies training jobs without network isolation
resource "aws_iam_policy" "enforce_network_isolation" {
  name        = "EnforceSageMakerNetworkIsolation"
  description = "Denies creation of SageMaker training jobs without network isolation"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "DenyTrainingJobsWithoutNetworkIsolation"
        Effect   = "Deny"
        Action   = ["sagemaker:CreateTrainingJob"]
        Resource = "*"
        Condition = {
          Bool = {
            "sagemaker:NetworkIsolation" = "false"
          }
        }
      }
    ]
  })
}

# Attach to the role that calls CreateTrainingJob (for notebook users this is
# often the SageMaker execution role)
resource "aws_iam_role_policy_attachment" "enforce_isolation" {
  role       = "YourSageMakerExecutionRole" # Replace with your role name
  policy_arn = aws_iam_policy.enforce_network_isolation.arn
}

output "policy_arn" {
  description = "ARN of the enforcement policy"
  value       = aws_iam_policy.enforce_network_isolation.arn
}

Service Control Policy (SCP) for Organization-Wide Enforcement

If you use AWS Organizations, you can enforce network isolation across all accounts:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireNetworkIsolationForSageMakerTrainingJobs",
      "Effect": "Deny",
      "Action": "sagemaker:CreateTrainingJob",
      "Resource": "*",
      "Condition": {
        "Bool": {
          "sagemaker:NetworkIsolation": "false"
        }
      }
    }
  ]
}

Apply this SCP to organizational units (OUs) or accounts where you want to enforce network isolation.
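
The same Deny statement appears in the IAM policy and in the SCP, so it can help to render both from a single definition. The following is a minimal sketch of that idea using only the standard library. `DENY_STATEMENT` and `build_policy` are illustrative names, and the statement content is copied from the SCP above.

```python
import json

# Single source of truth for the Deny statement (copied from the SCP above)
DENY_STATEMENT = {
    "Sid": "RequireNetworkIsolationForSageMakerTrainingJobs",
    "Effect": "Deny",
    "Action": "sagemaker:CreateTrainingJob",
    "Resource": "*",
    "Condition": {"Bool": {"sagemaker:NetworkIsolation": "false"}},
}


def build_policy(statements) -> str:
    """Wrap the given statements in a policy document and return it as JSON text."""
    document = {"Version": "2012-10-17", "Statement": list(statements)}
    return json.dumps(document, indent=2)


policy_json = build_policy([DENY_STATEMENT])
print(policy_json)
```

The printed JSON can be saved to a file and supplied as the policy content when creating the IAM policy or the SCP.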

Verification

After creating a training job with network isolation, verify the setting:

  1. Go to Amazon SageMaker in the AWS Console
  2. Navigate to Training > Training jobs
  3. Select your training job
  4. Check that Network isolation shows as Enabled

CLI Verification

aws sagemaker describe-training-job \
  --training-job-name <your-training-job-name> \
  --region us-east-1 \
  --query 'EnableNetworkIsolation'

Expected output: true

Notes

  • Immutable setting: Network isolation cannot be changed on existing training jobs. You must create new jobs with the correct setting.

  • Trade-offs: Enabling network isolation means your training code cannot:

    • Download packages from PyPI, npm, or other repositories during training
    • Access external APIs or data sources
    • Call other AWS services directly, including Amazon S3 (SageMaker itself still copies input data from S3 and uploads model artifacts on the job's behalf)
  • When network access is needed: If your training workflow legitimately requires network access, implement defense-in-depth controls instead:

    • Use VPC with private subnets
    • Configure VPC endpoints for AWS services
    • Restrict egress with security groups
    • Pre-package all dependencies in your container image
    • Apply least-privilege IAM permissions
  • Pre-package dependencies: The best practice is to include all required libraries and dependencies in your container image so network access is unnecessary during training.
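
Because an isolated container cannot install anything at run time, it can be useful to confirm that every dependency is already in the image before training starts. The sketch below is one possible pre-flight check, assuming it runs at the top of the training entry point. `missing_modules` and the `REQUIRED` list are hypothetical, and the check covers top-level module names only.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(required: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in required if importlib.util.find_spec(name) is None]


# Illustrative list; replace with the top-level modules your training script imports
REQUIRED = ["json", "csv"]

missing = missing_modules(REQUIRED)
if missing:
    # Fail fast with a clear message instead of a late ImportError mid-training
    raise SystemExit(f"Missing from container image: {', '.join(missing)}")
print("All required modules are present")
```

Running the same check during the image build catches a missing package before any training job is launched.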