
AWS Cost Management

Dual-layer cost monitoring combining budget thresholds with ML-based anomaly detection

#aws #terraform #cost #budget #anomaly-detection #cost-category #cli #finops #tagging

AWS Cost Management Architecture

Use both AWS Budgets and Cost Anomaly Detection for cost visibility.

┌─────────────────┐     ┌─────────────────────┐
│  AWS Budgets    │     │  Cost Anomaly       │
│  (Thresholds)   │     │  (ML Detection)     │
└────────┬────────┘     └──────────┬──────────┘
         │                         │
         └──────────┬──────────────┘
                    ▼
              ┌───────────┐
              │ SNS Topic │ → Slack/Email
              └───────────┘

Each layer catches what the other misses:

Scenario                          Budgets                    Anomaly
────────                          ───────                    ───────
Gradual cost creep toward limit   ✓ Alerts at 100%           ✗ Becomes “normal”
Sudden spike, still under budget  ✗ No threshold crossed     ✓ Detects deviation
Predictable seasonal increase     ✗ May trigger false alarm  ✓ ML learns the pattern

Budgets enforce hard limits. Anomaly detection catches the unexpected.

AWS Budget

Set static monthly limits with percentage-based alert thresholds.

Predictable and explicit. Requires an SNS topic for notifications (created separately).

locals {
  budget = 1000
}

data "aws_sns_topic" "this" {
  name = "notifications"
}

resource "aws_budgets_budget" "this" {
  name         = var.name
  budget_type  = "COST"
  limit_amount = local.budget
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    values = [format("UUID$%s", var.uuid)]
  }

  # 100% Forecasted
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "FORECASTED"
    subscriber_sns_topic_arns = [data.aws_sns_topic.this.arn]
  }

  # 100% Actual
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_sns_topic_arns = [data.aws_sns_topic.this.arn]
  }

  tags = {
    Name = var.name
  }
}

Budget limit: Expected baseline plus anticipated usage, with ~20% buffer. The buffer accounts for estimation uncertainty and variable month lengths (28-31 days).

Thresholds:

  • 100% forecasted → projected to exceed budget
  • 100% actual → budget exceeded
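
As a rough illustration of how the two notifications differ, the sketch below evaluates both thresholds against a limit. The function name `triggered_notifications` and its inputs are hypothetical; AWS Budgets performs this evaluation internally, and this is not its API.

```python
def triggered_notifications(limit, actual, forecast, threshold_pct=100.0):
    """Return the notification types whose percentage threshold is exceeded.

    Mirrors comparison_operator = "GREATER_THAN" with
    threshold_type = "PERCENTAGE" from the Terraform example above.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    fired = []
    # Forecasted: projected month-end spend as a percentage of the limit.
    if forecast * 100 / limit > threshold_pct:
        fired.append("FORECASTED")
    # Actual: spend accrued so far as a percentage of the limit.
    if actual * 100 / limit > threshold_pct:
        fired.append("ACTUAL")
    return fired


# Mid-month: spend is under the limit, but the projection is over it.
print(triggered_notifications(limit=1000, actual=600, forecast=1150))
```

The forecasted notification is the early warning; the actual notification confirms the limit has already been passed.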

Utilization and Coverage Budgets

Monitor commitment usage to avoid paying for unused RIs or Savings Plans.

Budget Type                Question it answers                 Alert when
───────────                ───────────────────                 ──────────
RI_UTILIZATION             Are we using purchased RIs?         Below target (e.g., <80%)
RI_COVERAGE                What % of usage is covered by RIs?  Below target
SAVINGS_PLANS_UTILIZATION  Are we using purchased SPs?         Below target (e.g., <80%)
SAVINGS_PLANS_COVERAGE     What % of usage is covered by SPs?  Below target

resource "aws_budgets_budget" "sp_utilization" {
  name         = "savings-plans-utilization"
  budget_type  = "SAVINGS_PLANS_UTILIZATION"
  limit_amount = "80.0"
  limit_unit   = "PERCENTAGE"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator       = "LESS_THAN"
    threshold                 = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_sns_topic_arns = [data.aws_sns_topic.this.arn]
  }
}

Low utilization means you’re paying for commitments you’re not using. Investigate:

  • Over-purchased capacity
  • Workload changes since purchase
  • Resources moved to different instance families (for EC2 RIs)
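
To make the utilization check concrete, here is a minimal sketch of the comparison. The helper names and the dollar figures are hypothetical; the 80% target matches the Terraform example above.

```python
def utilization_pct(used_commitment, total_commitment):
    """Percentage of a purchased commitment that was actually consumed."""
    if total_commitment <= 0:
        raise ValueError("total_commitment must be positive")
    return used_commitment * 100 / total_commitment


def below_target(used_commitment, total_commitment, target_pct=80.0):
    """True when utilization is under the target (a LESS_THAN comparison)."""
    return utilization_pct(used_commitment, total_commitment) < target_pct


# Hypothetical month: $120 of Savings Plans commitment, $90 of it used.
print(utilization_pct(90, 120), below_target(90, 120))
```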

See: Creating a Budget

Budget Actions

Automate responses when budget thresholds are crossed.

Budget Actions execute automatically or queue for approval when thresholds trigger.

Action Type        What it does
───────────        ────────────
APPLY_IAM_POLICY   Attach deny policy to users/roles/groups
APPLY_SCP_POLICY   Apply SCP at OU level (management account only)
RUN_SSM_DOCUMENTS  Stop EC2 or RDS instances

Approval Model  Behavior
──────────────  ────────
AUTOMATIC       Executes immediately when threshold crossed
MANUAL          Queues action, notifies via SNS, requires approval

Manual approval flow:

Budget exceeded → Action queued → SNS notification →
Human reviews → Approves via console/CLI → Action executes

Use case: Budget hits 100% → automatically apply deny policy → blocks new EC2 launches → prevents runaway spend.
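
The difference between the two approval models can be sketched as a small state function. The state names below are simplified and hypothetical; they are not the statuses that AWS reports for a budget action.

```python
from enum import Enum


class ApprovalModel(Enum):
    AUTOMATIC = "AUTOMATIC"
    MANUAL = "MANUAL"


class ActionState(Enum):
    STANDBY = "standby"    # threshold not crossed, nothing to do
    PENDING = "pending"    # queued, waiting for a human to approve
    EXECUTED = "executed"  # action has run


def next_state(model, threshold_crossed, approved=False):
    """Decide what happens to an action for a given approval model."""
    if not threshold_crossed:
        return ActionState.STANDBY
    # AUTOMATIC runs immediately; MANUAL runs only once approved.
    if model is ApprovalModel.AUTOMATIC or approved:
        return ActionState.EXECUTED
    return ActionState.PENDING


print(next_state(ApprovalModel.MANUAL, threshold_crossed=True))
```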

Constraints:

  • Requires IAM role granting Budgets permission to execute actions
  • SCPs apply at OU level only, not individual accounts
  • Actions can auto-reverse when budget returns to range

See: Configuring Budget Actions

Budget Estimation

Use Cost Explorer CLI to analyze historical costs before setting budget limits.

Step 1: Get monthly costs for the last 3 months to establish baseline:

# Try BSD date (macOS) first, then fall back to GNU date (Linux)
START=$(date -v-3m +%Y-%m-01 2>/dev/null || date -d "-3 months" +%Y-%m-01)

aws ce get-cost-and-usage \
  --time-period Start=$START,End=$(date +%Y-%m-01) \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --filter '{"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Simple Storage Service"]}}'

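Step 2: If a month looks anomalous, drill down to daily costs. The command below covers the current month to date; adjust Start and End to inspect an earlier month: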

aws ce get-cost-and-usage \
  --time-period Start=$(date +%Y-%m-01),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics UnblendedCost \
  --filter '{"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Simple Storage Service"]}}'

Step 3: Project monthly cost with buffer:

Daily average × 31 days = Projected monthly
Projected monthly × 1.2 = Budget limit (with 20% buffer)

The 20% buffer accounts for estimation uncertainty and variable month lengths (28-31 days).
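
The projection in Step 3 can be sketched as a small function. The name `budget_limit` and the sample daily costs are hypothetical; the 31-day month and 20% buffer come from the formula above.

```python
def budget_limit(daily_costs, days_in_month=31, buffer=0.20):
    """Project a monthly budget limit from observed daily costs.

    daily average x days_in_month = projected monthly
    projected monthly x (1 + buffer) = budget limit
    """
    if not daily_costs:
        raise ValueError("need at least one daily cost")
    daily_average = sum(daily_costs) / len(daily_costs)
    projected_monthly = daily_average * days_in_month
    return round(projected_monthly * (1 + buffer), 2)


# Hypothetical daily costs in USD, e.g. taken from the DAILY query in Step 2.
print(budget_limit([10, 12, 8]))
```

Using 31 days for every month slightly overestimates shorter months, which errs on the side of a higher limit and fewer false alarms.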

AWS Cost Anomaly Detection

Use ML-based monitors to detect deviations from learned spending patterns.

Alerts regardless of budget limits—catches spikes that stay under threshold.

resource "aws_ce_anomaly_monitor" "this" {
  name         = var.name
  monitor_type = "CUSTOM"

  monitor_specification = jsonencode({
    And            = null
    CostCategories = null
    Dimensions     = null
    Not            = null
    Or             = null
    Tags = {
      Key          = "user:UUID"
      Values       = [var.uuid]
      MatchOptions = ["EQUALS"]
    }
  })

  tags = {
    Name = var.name
  }
}

resource "aws_ce_anomaly_subscription" "this" {
  name             = var.name
  frequency        = "IMMEDIATE"
  monitor_arn_list = [aws_ce_anomaly_monitor.this.arn]

  subscriber {
    type    = "SNS"
    address = data.aws_sns_topic.this.arn
  }

  threshold_expression {
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
        match_options = ["GREATER_THAN_OR_EQUAL"]
        values        = [local.budget * 0.05]
      }
    }
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
        match_options = ["GREATER_THAN_OR_EQUAL"]
        values        = ["20"]
      }
    }
  }

  tags = {
    Name = var.name
  }
}

Thresholds (AND logic):

  • Absolute: ≥$50 impact (~5% of budget limit)
  • Percentage: ≥20% above ML-predicted cost

Both conditions MUST be true, filtering noise while catching real spikes.

Absolute threshold calibration: Set to ~5% of budget limit. This derives from error budget thinking—if your 20% buffer is your “cost error budget,” alert when a single anomaly threatens to consume ~25% of that buffer (0.20 × 0.25 = 0.05).
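
A minimal sketch of the calibration and the AND logic, assuming the 20% buffer and 25% share described above. The function names are hypothetical, and the real evaluation happens inside Cost Anomaly Detection, not in your code.

```python
def absolute_threshold(budget_limit, buffer=0.20, share_of_buffer=0.25):
    """Dollar impact at which one anomaly eats the given share of the buffer."""
    return budget_limit * buffer * share_of_buffer


def should_alert(impact_abs, impact_pct, budget_limit, pct_threshold=20.0):
    """Both conditions must hold, matching the two `and` blocks above."""
    return (
        impact_abs >= absolute_threshold(budget_limit)
        and impact_pct >= pct_threshold
    )


# With a $1000 budget the absolute threshold works out to about $50.
print(absolute_threshold(1000))
print(should_alert(impact_abs=60, impact_pct=25, budget_limit=1000))
```

A large percentage jump on a tiny cost (say $30 at +40%) fails the absolute test, and a large dollar change that is small in relative terms fails the percentage test, so neither one alerts on its own.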

AWS Cost Categories

Use Cost Categories to group costs by rules, not by tagging resources.

Cost Categories apply to cost line items, not resources. Define rules based on dimensions (account, service, tag, region), and AWS automatically categorizes all matching costs.

┌─────────────────────────────────────────────────────────────┐
│                      Cost Categories                        │
├─────────────────────────────────────────────────────────────┤
│  Tags                          Cost Categories              │
│  ────                          ───────────────              │
│  Applied to: Resources         Applied to: Cost line items  │
│  Requires: Tagging each        Requires: Rules              │
│  Retroactive: No (backfill)    Retroactive: Yes (in month)  │
└─────────────────────────────────────────────────────────────┘

Rule Dimensions

Cost Categories can group by:

Dimension      Example
─────────      ───────
Account        Account ID or name
Service        AmazonS3, AmazonEC2, AWSLambda
Region         us-east-1, eu-west-1
Tag            Any activated cost allocation tag
Charge Type    Usage, Tax, Fee, Refund
Cost Category  Another cost category (hierarchical)

Rule Types

Regular rules - static mapping:

rule {
  value = "Platform"
  rule {
    dimension {
      key           = "LINKED_ACCOUNT"
      values        = ["111111111111", "222222222222"]
      match_options = ["EQUALS"]
    }
  }
}

Inherited value rules - dynamic from tag values:

rule {
  type = "INHERITED_VALUE"
  inherited_value {
    dimension_name = "TAG"
    dimension_key  = "Team"
  }
}

Inherited rules automatically create category values from tag values. If resources have Team=alpha, Team=beta, the cost category gets values alpha, beta without manual rule updates.
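
The interaction between the two rule types can be sketched as follows. This is a simplified model, not the Cost Categories engine: it assumes rules are checked in order with the first match winning, and the line-item dictionary shape, `RULES` constant, and `categorize` function are all hypothetical.

```python
# Regular rules as (category value, dimension key, allowed values).
RULES = [
    ("Platform", "LINKED_ACCOUNT", {"111111111111", "222222222222"}),
]


def categorize(line_item, regular_rules, inherited_tag_key=None, default="Other"):
    """Assign a cost category value to one cost line item."""
    # Regular rules: static mapping from a dimension value to a category value.
    for value, dimension, allowed in regular_rules:
        if line_item.get(dimension) in allowed:
            return value
    # Inherited value rule: the tag's own value becomes the category value.
    if inherited_tag_key is not None:
        tag_value = line_item.get("tags", {}).get(inherited_tag_key)
        if tag_value:
            return tag_value
    return default


item = {"LINKED_ACCOUNT": "333333333333", "tags": {"Team": "beta"}}
print(categorize(item, RULES, inherited_tag_key="Team"))
```

A new team tag value such as Team=gamma would produce a gamma category value with no change to the rules.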

Service-Level Anomaly Detection

Use Cost Categories to scope anomaly monitors to a specific AWS service.

Anomaly monitors cannot filter by service directly. Create a Cost Category first, then reference it in the monitor specification.

resource "aws_ce_cost_category" "this" {
  name         = var.name
  rule_version = "CostCategoryExpression.v1"

  rule {
    value = "S3"

    rule {
      dimension {
        key           = "SERVICE_CODE"
        values        = ["AmazonS3"]
        match_options = ["EQUALS"]
      }
    }
  }

  default_value = "Other"
}

resource "aws_ce_anomaly_monitor" "this" {
  name         = var.name
  monitor_type = "CUSTOM"

  monitor_specification = jsonencode({
    And        = null
    Dimensions = null
    Not        = null
    Or         = null
    Tags       = null
    CostCategories = {
      Key          = aws_ce_cost_category.this.name
      Values       = ["S3"]
      MatchOptions = ["EQUALS"]
    }
  })

  tags = { Name = var.name }
}

Common service codes: AmazonS3, AmazonEC2, AmazonRDS, AWSLambda.

Constraints:

  • Cost Categories take up to 24 hours to populate after creation
  • Only management account can create/manage
  • Retroactive within current month only

See: Organizing Costs Using Cost Categories

AWS Data Exports

Use Data Exports (CUR 2.0) for granular cost and usage data in S3.

CUR 2.0 delivers detailed billing data to S3 for analysis with Athena, QuickSight, or custom pipelines.

Features:

  • Fixed schema with nested key-value pairs for tags, cost categories, product attributes
  • SQL-based column selection and row filtering
  • Split cost allocation data for ECS/EKS container costs

Setup

Billing Console → Data Exports → Create export →
Select table (CUR 2.0) → Choose S3 bucket → Configure columns

CLI allows full SQL: column selection, row filters, column renaming.

Constraints:

  • Parquet format (columnar, optimized for Athena queries)
  • No backfill—data starts from export creation date
  • Delivered to S3 (standard storage costs apply)

See: What is AWS Data Exports?

AWS Cost Allocation Tags

Tags on resources are NOT the same as tags in billing. Activation is required.

Resource tags and cost allocation tags are separate concepts. A tag applied to an EC2 instance does nothing for cost tracking until you explicitly activate it in the Billing console.

Tag Types

AWS provides two tag types, activated separately:

Type           Prefix  Source                        Scope
────           ──────  ──────                        ─────
AWS-generated  aws:    Created by AWS automatically  Limited services (no Lambda, RDS, SNS)
User-defined   user:   Created by you                All taggable resources

AWS-generated tags (e.g., aws:createdBy) auto-enable for all member accounts once activated. User-defined tags require manual application but offer full control over your business taxonomy.

Use both. AWS-generated catches what you forgot to tag. User-defined expresses your cost structure.

See: AWS-Generated vs User-Defined Cost Allocation Tags

Activation

Activate tags in the Billing console or via CLI before they appear in Cost Explorer or Budgets.

Console:

Billing Console → Cost Allocation Tags → Select tags → Activate

CLI:

# List inactive tags
aws ce list-cost-allocation-tags --status Inactive

# Activate tags (max 20 per request)
aws ce update-cost-allocation-tags-status \
  --cost-allocation-tags-status \
    TagKey=Environment,Status=Active \
    TagKey=Project,Status=Active \
    TagKey=UUID,Status=Active

Constraints:

  • Only management account can activate tags
  • Takes up to 24 hours to appear after activation
  • Maximum 500 active cost allocation tags
  • When moving accounts between organizations, tags lose “active” status—reactivate in new org

See: Activating User-Defined Cost Allocation Tags, update-cost-allocation-tags-status CLI

Retroactivity

Cost allocation tags are prospective by default.

Activating a tag today shows costs from today forward. Historical costs remain untagged.

Backfill (since March 2024) allows retroactive application up to 12 months:

Billing Console → Cost Allocation Tags → Backfill tags → Select month

Constraints:

  • Resource MUST have had the tag at that time - can’t invent history
  • Backfill date must be 1st of month (billing period start)
  • One backfill request per 24 hours
  • Updates Cost Explorer, Data Exports, CUR within 24 hours

Timeline example:

June 2024     - Tag "Project=X" applied to resource
November 2024 - Tag activated for cost allocation
December 2024 - Backfill requested from January 2024

Result:
- Jan-May 2024: No tag values (tag wasn't on resource)
- Jun-Dec 2024: Tag values visible in cost data
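
The timeline logic can be sketched as a date calculation. The function name is hypothetical, the tag is assumed to be already activated, and the mid-June day is an assumption because the timeline gives only the month.

```python
from datetime import date


def visible_months(tag_applied, backfill_from, through):
    """First-of-month dates for months where tag values appear after backfill."""
    if backfill_from.day != 1:
        raise ValueError("backfill date must be the 1st of a month")
    # Backfill cannot invent history: visibility starts at the later of the
    # backfill start and the month the tag was put on the resource.
    current = max(backfill_from, tag_applied.replace(day=1))
    end = through.replace(day=1)
    months = []
    while current <= end:
        months.append(current)
        if current.month == 12:
            current = date(current.year + 1, 1, 1)
        else:
            current = date(current.year, current.month + 1, 1)
    return months


months = visible_months(
    tag_applied=date(2024, 6, 15),   # assumed day within June 2024
    backfill_from=date(2024, 1, 1),
    through=date(2024, 12, 31),
)
print([m.isoformat()[:7] for m in months])
```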

See: Backfill Cost Allocation Tags

Tagging Strategy

Use hierarchical tags for aggregation and unique tags for isolation.

Cost allocation serves two purposes: aggregate costs for reporting (showback) and isolate costs for alerting (budgets, anomaly detection). Different tag types serve each purpose.

┌─────────────────────────────────────────────────────────────┐
│                         Workload: app                       │
│                         UUID: a1b2c3d4                      │
├─────────────────────────────┬───────────────────────────────┤
│   Component: database       │   Component: cache            │
├─────────────────────────────┼───────────────────────────────┤
│   Name: primary-db          │   Name: redis-1               │
│   Name: replica-db          │                               │
└─────────────────────────────┴───────────────────────────────┘

Tag        Scope                        Purpose
───        ─────                        ───────
Name       Per resource                 Human-readable identifier in console
Workload   Shared across deployment     Group of resources delivering business value
Component  Shared within component      Logical unit within workload
UUID       Workload or component level  Collision guardrail for precise filtering

Aggregation (shared tags):

  • Workload=app → total cost of the app across all deployments
  • Component=database → cost of database components

Isolation (unique tags):

  • UUID=a1b2c3d4 → cost of this specific deployment

Without UUID, generic tags match unrelated resources:

# Bad - matches all production resources
cost_filter {
  name   = "TagKeyValue"
  values = ["Environment$production"]
}

# Good - scoped to exact deployment
cost_filter {
  name   = "TagKeyValue"
  values = [format("UUID$%s", var.uuid)]
}

UUID enables:

  • Budget alerts scoped to specific infrastructure
  • Anomaly detection without cross-deployment noise
  • SSM associations targeting instances by deployment
  • AWS Resource Groups filtered to exact resources

See: Terraform Tagging for implementation with default_tags

Tag Key Format by Service

Tag key syntax differs across AWS cost services.

Service              Format                      Example
───────              ──────                      ───────
Cost Explorer        tag:KeyName                 tag:Environment
Budgets cost_filter  TagKeyValue with Key$Value  Environment$production
Anomaly Detection    user: prefix                user:Environment

This inconsistency causes silent failures. A filter that works in Cost Explorer won’t work in Budgets without reformatting.
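
One way to avoid reformatting by hand is a single helper that renders a tag for each service, following the formats listed in the table above. The function and the service labels (`cost_explorer`, `budgets`, `anomaly_detection`) are hypothetical names for this sketch, not AWS identifiers.

```python
def format_tag_filter(service, key, value=None):
    """Render a tag key (and value, where needed) in a service's expected form."""
    if service == "cost_explorer":
        return f"tag:{key}"
    if service == "budgets":
        # Budgets TagKeyValue filters join key and value with a dollar sign.
        if value is None:
            raise ValueError("Budgets TagKeyValue filters need a value")
        return f"{key}${value}"
    if service == "anomaly_detection":
        return f"user:{key}"
    raise ValueError(f"unknown service: {service}")


for svc in ("cost_explorer", "budgets", "anomaly_detection"):
    print(svc, format_tag_filter(svc, "Environment", "production"))
```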

See: Using Cost Allocation Tags

Split Cost Allocation

Use split cost allocation to attribute shared EC2 costs to individual containers.

EC2-backed ECS tasks and EKS pods share instance costs. Standard billing shows EC2 line items, not container-level breakdown. Split cost allocation calculates each container’s share based on CPU and memory consumption.

┌─────────────────────────────────────┐
│         EC2 Instance ($100)         │
├─────────────┬─────────────┬─────────┤
│  Pod A      │  Pod B      │  Pod C  │
│  40% CPU    │  35% CPU    │  25%    │
│  $40        │  $35        │  $25    │
└─────────────┴─────────────┴─────────┘
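
The diagram's arithmetic can be reproduced with a proportional split. This sketch is deliberately simplified: it divides by CPU share only, whereas the paragraph above notes that memory consumption is also a factor. The function name is hypothetical.

```python
def split_instance_cost(instance_cost, cpu_share_by_pod):
    """Divide an instance's cost across pods in proportion to CPU share."""
    total_share = sum(cpu_share_by_pod.values())
    if total_share <= 0:
        raise ValueError("total CPU share must be positive")
    return {
        pod: round(instance_cost * share / total_share, 2)
        for pod, share in cpu_share_by_pod.items()
    }


# Values from the diagram: a $100 instance shared by three pods.
print(split_instance_cost(100, {"Pod A": 40, "Pod B": 35, "Pod C": 25}))
```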

Opt-in required (two steps):

1. Cost Management Preferences → Split cost allocation data → Enable
2. CUR report → Edit → Report content → Split cost allocation data ✓

For EKS, AWS auto-generates cost allocation tags:

Tag                    Description
───                    ───────────
aws:eks:cluster-name   Cluster name
aws:eks:namespace      Kubernetes namespace
aws:eks:node           Node name
aws:eks:workload-type  ReplicaSet, StatefulSet, Job, DaemonSet
aws:eks:workload-name  Workload name
aws:eks:deployment     Parent deployment (ReplicaSets only)

EKS also supports importing Kubernetes labels as cost allocation tags (up to 50 per pod).

Constraints:

  • Data appears in CUR only, not Cost Explorer
  • Significant CUR volume increase (2-3 new line items per container per hour)
  • EKS accelerator support (GPU, Trainium, Inferentia) adds third line item
  • Fargate tasks already have discrete costs—split allocation not needed

See: Understanding Split Cost Allocation Data, Enabling Split Cost Allocation