SageMaker Training Jobs VPC Settings Configured

Overview

This check verifies that Amazon SageMaker training jobs are configured to run within a Virtual Private Cloud (VPC). When VPC settings are enabled, SageMaker places elastic network interfaces in your VPC subnets, ensuring that training data and model artifacts remain within your private network.

Without VPC configuration, training containers have outbound internet access through a network that SageMaker manages. This traffic bypasses your network security controls and can expose sensitive data.

Risk

If SageMaker training jobs are not configured with VPC settings:

  • Data exfiltration: Training data or model artifacts could be sent to unauthorized external destinations
  • Malware exposure: Training containers could download malicious code from the internet
  • Compliance violations: Sensitive data may traverse public networks, violating regulatory requirements
  • Reduced visibility: Network traffic cannot be monitored or controlled through VPC flow logs and security groups
  • Weakened security posture: You lose the ability to apply fine-grained network controls to ML workloads

Remediation Steps

Prerequisites

  • AWS account access with permissions to create SageMaker training jobs
  • An existing VPC with at least one private subnet
  • A security group configured for SageMaker workloads
  • (Optional) VPC endpoints for S3, ECR, and SageMaker API if using network isolation

VPC Endpoint Setup (recommended for production)

For training jobs that run in private subnets without internet access (including jobs that use network isolation), create VPC endpoints for the required services:

  1. S3 Gateway Endpoint: For accessing training data and storing outputs
  2. ECR Endpoints: For pulling container images (both ecr.api and ecr.dkr)
  3. SageMaker API Endpoint: For SageMaker service communication
  4. CloudWatch Logs Endpoint: For logging (optional but recommended)

You can create these endpoints in the VPC console under Endpoints.
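
When scripting endpoint creation, it helps to have the service names for the endpoints listed above in one place. The sketch below is a minimal illustration, not an official helper: `endpoint_service_names` is a hypothetical function name, and it assumes the standard `com.amazonaws.<region>.<service>` naming used by AWS PrivateLink in commercial regions.

```python
def endpoint_service_names(region: str) -> dict[str, str]:
    """Map each endpoint in the list above to its VPC endpoint service name.

    Assumes the usual com.amazonaws.<region>.<service> naming scheme.
    """
    prefix = f"com.amazonaws.{region}"
    return {
        "S3 (gateway)": f"{prefix}.s3",
        "ECR API": f"{prefix}.ecr.api",
        "ECR Docker": f"{prefix}.ecr.dkr",
        "SageMaker API": f"{prefix}.sagemaker.api",
        "CloudWatch Logs": f"{prefix}.logs",
    }


# Print the names to use when creating endpoints in the chosen region.
for label, service in endpoint_service_names("us-east-1").items():
    print(f"{label}: {service}")
```

Only the S3 entry is a gateway endpoint in this list; the others are interface endpoints and need subnets and a security group of their own.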

AWS Console Method

Note: SageMaker training jobs are one-time operations. You configure VPC settings when creating a new training job. Existing training jobs cannot be modified.

Creating a New Training Job with VPC Configuration

  1. Open the Amazon SageMaker console
  2. In the left navigation, choose Training > Training jobs
  3. Choose Create training job
  4. Fill in the basic job details:
    • Enter a Training job name
    • Select an IAM role with appropriate permissions
  5. Configure the algorithm and data settings as needed
  6. Scroll to the Network section
  7. For VPC, select your VPC from the dropdown
  8. For Subnet(s), select one or more private subnets
  9. For Security group(s), select security groups that allow required traffic
  10. (Optional) Enable Network isolation if you want to completely block internet access
  11. Complete the remaining configuration and choose Create training job

Checking Existing Training Jobs

  1. Open the Amazon SageMaker console
  2. In the left navigation, choose Training > Training jobs
  3. Select a training job to view its details
  4. In the Network section, verify that VPC, Subnets, and Security groups are configured
  5. If a job shows no VPC configuration, note it for future remediation (you cannot modify existing jobs)

AWS CLI (optional)

List Training Jobs

# List all training job names. This command does not filter on VPC settings;
# check each job's VpcConfig with describe-training-job.
aws sagemaker list-training-jobs \
  --region us-east-1 \
  --query 'TrainingJobSummaries[*].TrainingJobName' \
  --output text

Check VPC Configuration for a Specific Job

# Replace <training-job-name> with your actual job name
aws sagemaker describe-training-job \
  --training-job-name <training-job-name> \
  --region us-east-1 \
  --query '{JobName: TrainingJobName, VpcConfig: VpcConfig}'

If VpcConfig is null or empty, the job was not configured with VPC settings.
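
The same check can be applied programmatically to a `describe-training-job` response. This is a minimal sketch that works on the parsed response as a plain dictionary; `has_vpc_config` is a hypothetical helper name, and the sample dictionaries are illustrative rather than real output.

```python
def has_vpc_config(description: dict) -> bool:
    """Return True when a DescribeTrainingJob response carries VPC settings.

    A job counts as configured only if both Subnets and SecurityGroupIds
    are present and non-empty.
    """
    vpc = description.get("VpcConfig") or {}
    return bool(vpc.get("Subnets")) and bool(vpc.get("SecurityGroupIds"))


# Illustrative responses (shape only; values are placeholders).
with_vpc = {
    "TrainingJobName": "job-a",
    "VpcConfig": {
        "Subnets": ["subnet-xxxxxxxxxxxxxxxxx"],
        "SecurityGroupIds": ["sg-xxxxxxxxxxxxxxxxx"],
    },
}
without_vpc = {"TrainingJobName": "job-b"}

print(has_vpc_config(with_vpc), has_vpc_config(without_vpc))
```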

Create a Training Job with VPC Configuration

aws sagemaker create-training-job \
  --training-job-name my-vpc-training-job \
  --role-arn arn:aws:iam::<account-id>:role/<sagemaker-execution-role> \
  --algorithm-specification '{
    "TrainingImage": "<account-id>.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    "TrainingInputMode": "File"
  }' \
  --input-data-config '[{
    "ChannelName": "train",
    "DataSource": {
      "S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": "s3://<bucket-name>/training-data/",
        "S3DataDistributionType": "FullyReplicated"
      }
    }
  }]' \
  --output-data-config '{
    "S3OutputPath": "s3://<bucket-name>/output/"
  }' \
  --resource-config '{
    "InstanceType": "ml.m5.large",
    "InstanceCount": 1,
    "VolumeSizeInGB": 30
  }' \
  --vpc-config '{
    "Subnets": ["subnet-xxxxxxxxxxxxxxxxx"],
    "SecurityGroupIds": ["sg-xxxxxxxxxxxxxxxxx"]
  }' \
  --stopping-condition '{
    "MaxRuntimeInSeconds": 86400
  }' \
  --enable-inter-container-traffic-encryption \
  --region us-east-1

Replace placeholders:

  • <account-id>: Your AWS account ID
  • <sagemaker-execution-role>: IAM role for SageMaker
  • <bucket-name>: Your S3 bucket name
  • subnet-xxxxxxxxxxxxxxxxx: Your VPC subnet ID
  • sg-xxxxxxxxxxxxxxxxx: Your security group ID
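
If you create jobs from code rather than the CLI, the same request can be assembled as a dictionary and passed to the SDK. The sketch below mirrors the CLI example above and refuses to build a request without VPC settings; `build_training_job_request` is a hypothetical helper, and the instance type, volume size, and S3 prefixes are simply the values from the CLI example.

```python
def build_training_job_request(job_name, role_arn, image_uri, bucket,
                               subnets, security_group_ids):
    """Build CreateTrainingJob parameters that always include VpcConfig."""
    if not subnets or not security_group_ids:
        raise ValueError("VpcConfig needs at least one subnet and one security group")
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/training-data/",
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/output/"},
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "VpcConfig": {
            "Subnets": list(subnets),
            "SecurityGroupIds": list(security_group_ids),
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
        "EnableInterContainerTrafficEncryption": True,
    }


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   request = build_training_job_request(...)
#   boto3.client("sagemaker", region_name="us-east-1").create_training_job(**request)
```

Failing early on an empty subnet or security group list keeps a misconfigured job from ever being submitted.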

CloudFormation - IAM Policy Enforcement (optional)

Since SageMaker training jobs are operational resources (not infrastructure), CloudFormation cannot directly create them. However, you can enforce VPC configuration through IAM policies.

IAM Policy to Enforce VPC Settings

This policy denies the creation of training jobs that lack VPC configuration, and allows creation when every requested subnet and security group is on the approved lists:

AWSTemplateFormatVersion: '2010-09-09'
Description: IAM Policy to enforce VPC configuration for SageMaker Training Jobs

Parameters:
  AllowedSubnets:
    Type: CommaDelimitedList
    Description: List of allowed subnet IDs that training jobs must use
  AllowedSecurityGroups:
    Type: CommaDelimitedList
    Description: List of allowed security group IDs that training jobs must use

Resources:
  SageMakerVPCEnforcementPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: SageMaker-VPC-Enforcement-Policy
      Description: Enforces VPC configuration for SageMaker training jobs
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: DenyTrainingJobsWithoutVPC
            Effect: Deny
            Action:
              - sagemaker:CreateTrainingJob
            Resource: '*'
            Condition:
              'Null':
                'sagemaker:VpcSubnets': 'true'
          - Sid: AllowTrainingJobsWithApprovedVPC
            Effect: Allow
            Action:
              - sagemaker:CreateTrainingJob
            Resource: '*'
            Condition:
              ForAllValues:StringEquals:
                'sagemaker:VpcSubnets': !Ref AllowedSubnets
                'sagemaker:VpcSecurityGroupIds': !Ref AllowedSecurityGroups

Outputs:
  PolicyArn:
    Description: ARN of the VPC enforcement policy
    Value: !Ref SageMakerVPCEnforcementPolicy

Deploy the Policy

aws cloudformation deploy \
  --template-file sagemaker-vpc-enforcement.yaml \
  --stack-name sagemaker-vpc-enforcement \
  --parameter-overrides \
    AllowedSubnets="subnet-xxxxxxxxx,subnet-yyyyyyyyy" \
    AllowedSecurityGroups="sg-xxxxxxxxx" \
  --capabilities CAPABILITY_NAMED_IAM \
  --region us-east-1

Attach this policy to IAM roles or users who create SageMaker training jobs.
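
To reason about what the two statements in this policy do, it can help to model them in a few lines of code. The sketch below is a simplified model of this one policy only, not of IAM's full evaluation logic: `evaluate_policy` is a hypothetical function, and it ignores every other policy attached to the principal.

```python
def evaluate_policy(subnets, security_groups,
                    allowed_subnets, allowed_security_groups) -> str:
    """Model the outcome of the enforcement policy for one CreateTrainingJob call."""
    # Statement 1 (Null condition): explicit Deny when the request
    # carries no sagemaker:VpcSubnets key at all.
    if not subnets:
        return "Deny"
    # Statement 2 (ForAllValues:StringEquals): Allow when every requested
    # value appears in the corresponding approved list.
    subnets_ok = set(subnets) <= set(allowed_subnets)
    groups_ok = set(security_groups or []) <= set(allowed_security_groups)
    if subnets_ok and groups_ok:
        return "Allow"
    # Neither statement matched: this policy grants nothing.
    return "NoMatch"


approved_subnets = ["subnet-a", "subnet-b"]
approved_groups = ["sg-a"]
print(evaluate_policy([], [], approved_subnets, approved_groups))
print(evaluate_policy(["subnet-a"], ["sg-a"], approved_subnets, approved_groups))
print(evaluate_policy(["subnet-z"], ["sg-a"], approved_subnets, approved_groups))
```

The "NoMatch" case matters in practice: a request with an unapproved subnet is not explicitly denied, so it is blocked only if no other attached policy allows `sagemaker:CreateTrainingJob`.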

Terraform - IAM Policy Enforcement (optional)

IAM Policy to Enforce VPC Settings

variable "allowed_subnets" {
  description = "List of allowed subnet IDs for training jobs"
  type        = list(string)
}

variable "allowed_security_groups" {
  description = "List of allowed security group IDs for training jobs"
  type        = list(string)
}

data "aws_iam_policy_document" "sagemaker_vpc_enforcement" {
  statement {
    sid    = "DenyTrainingJobsWithoutVPC"
    effect = "Deny"
    actions = [
      "sagemaker:CreateTrainingJob"
    ]
    resources = ["*"]
    condition {
      test     = "Null"
      variable = "sagemaker:VpcSubnets"
      values   = ["true"]
    }
  }

  statement {
    sid    = "AllowTrainingJobsWithApprovedVPC"
    effect = "Allow"
    actions = [
      "sagemaker:CreateTrainingJob"
    ]
    resources = ["*"]
    condition {
      test     = "ForAllValues:StringEquals"
      variable = "sagemaker:VpcSubnets"
      values   = var.allowed_subnets
    }
    condition {
      test     = "ForAllValues:StringEquals"
      variable = "sagemaker:VpcSecurityGroupIds"
      values   = var.allowed_security_groups
    }
  }
}

resource "aws_iam_policy" "sagemaker_vpc_enforcement" {
  name        = "SageMaker-VPC-Enforcement-Policy"
  description = "Enforces VPC configuration for SageMaker training jobs"
  policy      = data.aws_iam_policy_document.sagemaker_vpc_enforcement.json
}

output "policy_arn" {
  description = "ARN of the VPC enforcement policy"
  value       = aws_iam_policy.sagemaker_vpc_enforcement.arn
}

Example tfvars

allowed_subnets = [
  "subnet-xxxxxxxxxxxxxxxxx",
  "subnet-yyyyyyyyyyyyyyyyy"
]

allowed_security_groups = [
  "sg-xxxxxxxxxxxxxxxxx"
]

Apply

terraform init
terraform plan
terraform apply

Attach the created policy to IAM roles or users who create SageMaker training jobs.

Verification

After creating a training job with VPC settings, verify the configuration:

  1. Open the SageMaker console
  2. Navigate to Training > Training jobs
  3. Select your training job
  4. In the Network section, confirm:
    • VPC shows your selected VPC
    • Subnets lists your private subnets
    • Security groups shows your configured security groups

CLI Verification

aws sagemaker describe-training-job \
  --training-job-name <training-job-name> \
  --region us-east-1 \
  --query 'VpcConfig'

Expected output (with VPC configured):

{
  "SecurityGroupIds": ["sg-xxxxxxxxxxxxxxxxx"],
  "Subnets": ["subnet-xxxxxxxxxxxxxxxxx"]
}

If the output is null, VPC settings are not configured.

Check Multiple Jobs

# List recent training jobs and their VPC status.
# With --output text, a job without VPC settings prints "None".
for job in $(aws sagemaker list-training-jobs --region us-east-1 --max-results 10 --query 'TrainingJobSummaries[*].TrainingJobName' --output text); do
  vpc=$(aws sagemaker describe-training-job --training-job-name "$job" --region us-east-1 --query 'VpcConfig' --output text)
  echo "Job: $job - VPC Config: $vpc"
done
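
For accounts with many jobs, the loop above can be expressed with the SDK's paginator so that results beyond the first page are not missed. The sketch below is one possible shape under stated assumptions: `find_jobs_without_vpc` is a hypothetical helper, `client` is expected to behave like a boto3 SageMaker client (it uses only `get_paginator("list_training_jobs")` and `describe_training_job`), and `max_jobs` is an illustrative cap to limit API calls.

```python
def find_jobs_without_vpc(client, max_jobs: int = 50) -> list[str]:
    """Return names of training jobs whose description has no VPC subnets.

    Inspects at most max_jobs jobs, in the order the service lists them.
    """
    missing = []
    inspected = 0
    paginator = client.get_paginator("list_training_jobs")
    for page in paginator.paginate():
        for summary in page["TrainingJobSummaries"]:
            if inspected >= max_jobs:
                return missing
            inspected += 1
            name = summary["TrainingJobName"]
            detail = client.describe_training_job(TrainingJobName=name)
            # A job without VPC settings has no VpcConfig in its description.
            if not (detail.get("VpcConfig") or {}).get("Subnets"):
                missing.append(name)
    return missing


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   client = boto3.client("sagemaker", region_name="us-east-1")
#   print(find_jobs_without_vpc(client))
```

Because existing jobs cannot be modified, the returned names are a list of jobs to recreate with VPC settings, not jobs to patch.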

Notes

  • Existing jobs cannot be modified: VPC settings must be configured at job creation time. If an existing training job lacks VPC configuration, you must create a new job with the correct settings.

  • Private subnets recommended: Use private subnets (without direct internet access) for training jobs. If internet access is needed, route through a NAT gateway.

  • Security group configuration: Ensure security groups allow:
    • Outbound HTTPS (443) to S3, ECR, and SageMaker endpoints
    • Traffic on all ports between members of the same security group, if you use distributed training

  • Network isolation: For maximum security, set EnableNetworkIsolation to true. Training containers then cannot make any outbound network calls, including calls to AWS services. SageMaker still transfers training data and model artifacts to and from S3 on the job's behalf, so a VPC without internet access needs an S3 VPC endpoint.

  • Cost considerations: VPC endpoints incur additional charges. Factor this into your architecture planning.

  • IAM enforcement: To prevent non-compliant training jobs, deploy the IAM policy from the CloudFormation or Terraform sections above. This ensures all future training jobs require VPC configuration.