AWS Cost Management
Dual-layer cost monitoring combining budget thresholds with ML-based anomaly detection
AWS Cost Management Architecture
Use both AWS Budgets and Cost Anomaly Detection for cost visibility.
┌─────────────────┐     ┌─────────────────────┐
│   AWS Budgets   │     │    Cost Anomaly     │
│  (Thresholds)   │     │   (ML Detection)    │
└────────┬────────┘     └──────────┬──────────┘
         │                         │
         └──────────┬──────────────┘
                    ▼
              ┌───────────┐
              │ SNS Topic │ → Slack/Email
              └───────────┘
Each layer catches what the other misses:
| Scenario | Budgets | Anomaly |
|---|---|---|
| Gradual cost creep toward limit | ✓ Alerts at 100% | ✗ Becomes “normal” |
| Sudden spike, still under budget | ✗ No threshold crossed | ✓ Detects deviation |
| Predictable seasonal increase | ✗ May trigger false alarm | ✓ ML learns the pattern |
Budgets enforce hard limits. Anomaly detection catches the unexpected.
AWS Budgets
Set static monthly limits with percentage-based alert thresholds.
Predictable and explicit. Requires an SNS topic for notifications (created separately).
locals {
  # Monthly limit in USD
  budget = 1000
}

# Existing SNS topic, created separately
data "aws_sns_topic" "this" {
  name = "notifications"
}

resource "aws_budgets_budget" "this" {
  name         = var.name
  budget_type  = "COST"
  limit_amount = local.budget
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    values = [format("UUID$%s", var.uuid)]
  }

  # 100% Forecasted
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "FORECASTED"
    subscriber_sns_topic_arns = [data.aws_sns_topic.this.arn]
  }

  # 100% Actual
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_sns_topic_arns = [data.aws_sns_topic.this.arn]
  }

  tags = {
    Name = var.name
  }
}
Budget limit: Expected baseline plus anticipated usage, with ~20% buffer. The buffer accounts for estimation uncertainty and variable month lengths (28-31 days).
Thresholds:
- 100% forecasted → projected to exceed budget
- 100% actual → budget exceeded
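The two thresholds compare different numbers against the same limit. AWS Budgets evaluates them itself; the sketch below only illustrates the comparison. The helper name `budget_alerts` and its arguments are hypothetical, not part of any AWS tool.

```shell
# Hypothetical helper: report which of the two notifications would fire.
# Usage: budget_alerts ACTUAL_USD FORECAST_USD LIMIT_USD
budget_alerts() {
  LC_ALL=C awk -v actual="$1" -v forecast="$2" -v limit="$3" 'BEGIN {
    # GREATER_THAN with threshold 100 and threshold_type PERCENTAGE
    if (forecast / limit * 100 > 100) print "FORECASTED"
    if (actual   / limit * 100 > 100) print "ACTUAL"
  }'
}

# Mid-month: spend is under the limit, but the forecast is above it
budget_alerts 620 1080 1000
```

In this example only the forecasted notification applies, which is the early warning that arrives before the budget is actually exceeded.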
Utilization and Coverage Budgets
Monitor commitment usage to avoid paying for unused RIs or Savings Plans.
| Budget Type | Question it answers | Alert when |
|---|---|---|
| RI_UTILIZATION | Are we using purchased RIs? | Below target (e.g., <80%) |
| RI_COVERAGE | What % of usage is covered by RIs? | Below target |
| SAVINGS_PLANS_UTILIZATION | Are we using purchased SPs? | Below target (e.g., <80%) |
| SAVINGS_PLANS_COVERAGE | What % of usage is covered by SPs? | Below target |
resource "aws_budgets_budget" "sp_utilization" {
  name         = "savings-plans-utilization"
  budget_type  = "SAVINGS_PLANS_UTILIZATION"
  limit_amount = "80.0"
  limit_unit   = "PERCENTAGE"
  time_unit    = "MONTHLY"

  # Alert when actual utilization drops below the target
  notification {
    comparison_operator       = "LESS_THAN"
    threshold                 = 80
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_sns_topic_arns = [data.aws_sns_topic.this.arn]
  }
}
Low utilization means you’re paying for commitments you’re not using. Investigate:
- Over-purchased capacity
- Workload changes since purchase
- Resources moved to different instance families (for EC2 RIs)
See: Creating a Budget
Budget Actions
Automate responses when budget thresholds are crossed.
Budget Actions execute automatically or queue for approval when thresholds trigger.
| Action Type | What it does |
|---|---|
| APPLY_IAM_POLICY | Attach deny policy to users/roles/groups |
| APPLY_SCP_POLICY | Apply SCP at OU level (management account only) |
| RUN_SSM_DOCUMENTS | Stop/terminate EC2 or RDS instances |
| Approval Model | Behavior |
|---|---|
| AUTOMATIC | Executes immediately when threshold crossed |
| MANUAL | Queues action, notifies via SNS, requires approval |
Manual approval flow:
Budget exceeded → Action queued → SNS notification →
Human reviews → Approves via console/CLI → Action executes
Use case: Budget hits 100% → automatically apply deny policy → blocks new EC2 launches → prevents runaway spend.
Constraints:
- Requires IAM role granting Budgets permission to execute actions
- SCPs apply at OU level only, not individual accounts
- Actions can auto-reverse when budget returns to range
See: Configuring Budget Actions
Budget Estimation
Use Cost Explorer CLI to analyze historical costs before setting budget limits.
Step 1: Get monthly costs for the last 3 months to establish baseline:
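# Run only the START line that matches your platform.
# Both pin the day to the 1st before subtracting months, so the result
# stays correct when run on the 29th-31st.
# macOS (BSD date)
START=$(date -v1d -v-3m +%Y-%m-01)
# Linux (GNU date)
START=$(date -d "$(date +%Y-%m-01) -3 months" +%Y-%m-01)

aws ce get-cost-and-usage \
  --time-period Start="$START",End="$(date +%Y-%m-01)" \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --filter '{"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Simple Storage Service"]}}'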
Step 2: If a month looks anomalous, drill down to daily costs:
aws ce get-cost-and-usage \
  --time-period Start=$(date +%Y-%m-01),End=$(date +%Y-%m-%d) \
  --granularity DAILY \
  --metrics UnblendedCost \
  --filter '{"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Simple Storage Service"]}}'
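To turn the daily figures into a single number, the amounts can be averaged with awk. This is a minimal sketch: `daily_average` is a hypothetical helper, and the sample amounts are made up. It assumes the query has no GroupBy, so each day's amount sits under `Total.UnblendedCost.Amount` in the response.

```shell
# Hypothetical helper: average whitespace-separated amounts read from stdin.
daily_average() {
  LC_ALL=C awk '
    { for (i = 1; i <= NF; i++) { sum += $i; n++ } }
    END { if (n > 0) printf "%.2f\n", sum / n }'
}

# Made-up stand-in for three days of costs. With real data, append
#   --query 'ResultsByTime[].Total.UnblendedCost.Amount' --output text
# to the daily command above and pipe its output into daily_average.
printf '24.10\t26.35\t30.19\n' | daily_average
```

Days with no usage still appear in the response with an amount of 0, so they pull the average down. Whether that is desirable depends on how steady the workload is.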
Step 3: Project monthly cost with buffer:
Daily average × 31 days = Projected monthly
Projected monthly × 1.2 = Budget limit (with 20% buffer)
The 20% buffer accounts for estimation uncertainty and variable month lengths (28-31 days).
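The Step 3 arithmetic can be scripted so the same rule is applied every time a budget is set. A minimal sketch, assuming the daily average has already been computed; `project_budget` and its optional arguments are hypothetical names.

```shell
# Hypothetical helper: project a monthly budget limit from a daily average.
# Usage: project_budget DAILY_AVERAGE [DAYS] [BUFFER_MULTIPLIER]
project_budget() {
  # Defaults follow Step 3: 31 days and a 20% buffer (multiplier 1.2)
  LC_ALL=C awk -v avg="$1" -v days="${2:-31}" -v buffer="${3:-1.2}" \
    'BEGIN { printf "%.2f\n", avg * days * buffer }'
}

# A daily average of $10 over 31 days with the default buffer
project_budget 10
```

Passing the day count and multiplier as arguments keeps the defaults visible while still allowing a tighter buffer for well-understood workloads.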
AWS Cost Anomaly Detection
Use ML-based monitors to detect deviations from learned spending patterns.
Alerts regardless of budget limits—catches spikes that stay under threshold.
resource "aws_ce_anomaly_monitor" "this" {
  name         = var.name
  monitor_type = "CUSTOM"

  # Scope the monitor to resources carrying this UUID tag
  monitor_specification = jsonencode({
    And            = null
    CostCategories = null
    Dimensions     = null
    Not            = null
    Or             = null
    Tags = {
      Key          = "user:UUID"
      Values       = [var.uuid]
      MatchOptions = ["EQUALS"]
    }
  })

  tags = {
    Name = var.name
  }
}

resource "aws_ce_anomaly_subscription" "this" {
  name             = var.name
  frequency        = "IMMEDIATE"
  monitor_arn_list = [aws_ce_anomaly_monitor.this.arn]

  subscriber {
    type    = "SNS"
    address = data.aws_sns_topic.this.arn
  }

  # Both conditions must hold before an alert is sent
  threshold_expression {
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
        match_options = ["GREATER_THAN_OR_EQUAL"]
        values        = [local.budget * 0.05]
      }
    }
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
        match_options = ["GREATER_THAN_OR_EQUAL"]
        values        = ["20"]
      }
    }
  }

  tags = {
    Name = var.name
  }
}
Thresholds (AND logic):
- Absolute: ≥$50 impact (~5% of budget limit)
- Percentage: ≥20% above ML-predicted cost
Both conditions MUST be true, filtering noise while catching real spikes.
Absolute threshold calibration: Set to ~5% of budget limit. This derives from error budget thinking—if your 20% buffer is your “cost error budget,” alert when a single anomaly threatens to consume ~25% of that buffer (0.20 × 0.25 = 0.05).
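The AND logic is easiest to see with a few concrete anomalies. The sketch below mirrors the two thresholds in shell. `should_alert` is a hypothetical helper, and it assumes the impact percentage is the impact divided by the expected cost; Cost Anomaly Detection does the real evaluation.

```shell
# Hypothetical helper mirroring the AND logic of the subscription thresholds.
# Usage: should_alert IMPACT_USD EXPECTED_USD BUDGET_LIMIT_USD
# Exit status 0 means both conditions hold.
should_alert() {
  LC_ALL=C awk -v impact="$1" -v expected="$2" -v limit="$3" 'BEGIN {
    absolute_ok   = (impact >= limit * 0.05)   # at least 5% of the budget limit
    percentage_ok = (expected > 0 && impact / expected * 100 >= 20)
    exit !(absolute_ok && percentage_ok)
  }'
}

should_alert 60 200 1000  && echo "alert"     # $60 impact, 30% over expected
should_alert 60 1000 1000 || echo "filtered"  # $60 impact, only 6% over expected
should_alert 10 20 1000   || echo "filtered"  # 50% over expected, only $10
```

The last two cases are the noise the AND condition removes: a large percentage on a trivial amount, and a meaningful amount that is a small deviation for a large workload.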
AWS Cost Categories
Use Cost Categories to group costs by rules, not by tagging resources.
Cost Categories apply to cost line items, not resources. Define rules based on dimensions (account, service, tag, region), and AWS automatically categorizes all matching costs.
┌─────────────────────────────────────────────────────────────┐
│                       Cost Categories                       │
├─────────────────────────────────────────────────────────────┤
│  Tags                         Cost Categories               │
│  ────                         ───────────────               │
│  Applied to: Resources        Applied to: Cost line items   │
│  Requires: Tagging each       Requires: Rules               │
│  Retroactive: No (backfill)   Retroactive: Yes (in month)   │
└─────────────────────────────────────────────────────────────┘
Rule Dimensions
Cost Categories can group by:
| Dimension | Example |
|---|---|
| Account | Account ID or name |
| Service | AmazonS3, AmazonEC2, AWSLambda |
| Region | us-east-1, eu-west-1 |
| Tag | Any activated cost allocation tag |
| Charge Type | Usage, Tax, Fee, Refund |
| Cost Category | Another cost category (hierarchical) |
Rule Types
Regular rules - static mapping:
rule {
  value = "Platform"
  rule {
    dimension {
      key           = "LINKED_ACCOUNT"
      values        = ["111111111111", "222222222222"]
      match_options = ["EQUALS"]
    }
  }
}
Inherited value rules - dynamic from tag values:
rule {
  type = "INHERITED_VALUE"
  inherited_value {
    dimension_name = "TAG"
    dimension_key  = "Team"
  }
}
Inherited rules automatically create category values from tag values. If resources have Team=alpha, Team=beta, the cost category gets values alpha, beta without manual rule updates.
Service-Level Anomaly Detection
Use Cost Categories to scope anomaly monitors to a specific AWS service.
Anomaly monitors cannot filter by service directly. Create a Cost Category first, then reference it in the monitor specification.
resource "aws_ce_cost_category" "this" {
  name         = var.name
  rule_version = "CostCategoryExpression.v1"

  # Costs from S3 get the value "S3"; everything else falls to the default
  rule {
    value = "S3"
    rule {
      dimension {
        key           = "SERVICE_CODE"
        values        = ["AmazonS3"]
        match_options = ["EQUALS"]
      }
    }
  }

  default_value = "Other"
}

resource "aws_ce_anomaly_monitor" "this" {
  name         = var.name
  monitor_type = "CUSTOM"

  # Reference the cost category instead of a tag
  monitor_specification = jsonencode({
    And        = null
    Dimensions = null
    Not        = null
    Or         = null
    Tags       = null
    CostCategories = {
      Key          = aws_ce_cost_category.this.name
      Values       = ["S3"]
      MatchOptions = ["EQUALS"]
    }
  })

  tags = { Name = var.name }
}
Common service codes: AmazonS3, AmazonEC2, AmazonRDS, AWSLambda.
Constraints:
- Cost Categories take up to 24 hours to populate after creation
- Only management account can create/manage
- Retroactive within current month only
See: Organizing Costs Using Cost Categories
AWS Data Exports
Use Data Exports (CUR 2.0) for granular cost and usage data in S3.
CUR 2.0 delivers detailed billing data to S3 for analysis with Athena, QuickSight, or custom pipelines.
Features:
- Fixed schema with nested key-value pairs for tags, cost categories, product attributes
- SQL-based column selection and row filtering
- Split cost allocation data for ECS/EKS container costs
Setup
Billing Console → Data Exports → Create export →
Select table (CUR 2.0) → Choose S3 bucket → Configure columns
CLI allows full SQL: column selection, row filters, column renaming.
Constraints:
- Parquet format (columnar, optimized for Athena queries)
- No backfill—data starts from export creation date
- Delivered to S3 (standard storage costs apply)
See: What is AWS Data Exports?
AWS Cost Allocation Tags
Tags on resources are NOT the same as tags in billing. Activation is required.
Resource tags and cost allocation tags are separate concepts. A tag applied to an EC2 instance does nothing for cost tracking until you explicitly activate it in the Billing console.
Tag Types
AWS provides two tag types, activated separately:
| Type | Prefix | Source | Scope |
|---|---|---|---|
| AWS-generated | aws: | Created by AWS automatically | Limited services (no Lambda, RDS, SNS) |
| User-defined | user: | Created by you | All taggable resources |
AWS-generated tags (e.g., aws:createdBy) auto-enable for all member accounts once activated. User-defined tags require manual application but offer full control over your business taxonomy.
Use both. AWS-generated catches what you forgot to tag. User-defined expresses your cost structure.
See: AWS-Generated vs User-Defined Cost Allocation Tags
Activation
Activate tags in the Billing console or via CLI before they appear in Cost Explorer or Budgets.
Console:
Billing Console → Cost Allocation Tags → Select tags → Activate
CLI:
# List inactive tags
aws ce list-cost-allocation-tags --status Inactive

# Activate tags (max 20 per request)
aws ce update-cost-allocation-tags-status \
  --cost-allocation-tags-status \
    TagKey=Environment,Status=Active \
    TagKey=Project,Status=Active \
    TagKey=UUID,Status=Active
Constraints:
- Only management account can activate tags
- Takes up to 24 hours to appear after activation
- Maximum 500 active cost allocation tags
- When moving accounts between organizations, tags lose “active” status—reactivate in new org
See: Activating User-Defined Cost Allocation Tags, update-cost-allocation-tags-status CLI
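With more than 20 tags to activate, the keys have to be split across several requests. A minimal sketch of the batching, assuming one tag key per line on stdin and keys without spaces or quotes; `tag_batches` and the file name `tag-keys.txt` are hypothetical.

```shell
# Hypothetical helper: read tag keys from stdin (one per line) and print one
# line of TagKey=...,Status=Active arguments per batch of up to 20 keys.
tag_batches() {
  sed 's/.*/TagKey=&,Status=Active/' | xargs -n 20
}

printf 'Environment\nProject\nUUID\n' | tag_batches

# Each output line holds the arguments for one request, for example:
#   tag_batches < tag-keys.txt | while read -r batch; do
#     aws ce update-cost-allocation-tags-status \
#       --cost-allocation-tags-status $batch
#   done
```

`$batch` is left unquoted in the commented loop on purpose, so the shell splits it back into separate arguments.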
Retroactivity
Cost allocation tags are prospective by default.
Activating a tag today shows costs from today forward. Historical costs remain untagged.
Backfill (since March 2024) allows retroactive application up to 12 months:
Billing Console → Cost Allocation Tags → Backfill tags → Select month
Constraints:
- Resource MUST have had the tag at that time - can’t invent history
- Backfill date must be 1st of month (billing period start)
- One backfill request per 24 hours
- Updates Cost Explorer, Data Exports, CUR within 24 hours
Timeline example:
June 2024 - Tag "Project=X" applied to resource
November 2024 - Tag activated for cost allocation
December 2024 - Backfill requested from January 2024
Result:
- Jan-May 2024: No tag values (tag wasn't on resource)
- Jun-Dec 2024: Tag values visible in cost data
See: Backfill Cost Allocation Tags
Tagging Strategy
Use hierarchical tags for aggregation and unique tags for isolation.
Cost allocation serves two purposes: aggregate costs for reporting (showback) and isolate costs for alerting (budgets, anomaly detection). Different tag types serve each purpose.
┌─────────────────────────────────────────────────────────────┐
│ Workload: app                                               │
│ UUID: a1b2c3d4                                              │
├─────────────────────────────┬───────────────────────────────┤
│ Component: database         │ Component: cache              │
├─────────────────────────────┼───────────────────────────────┤
│ Name: primary-db            │ Name: redis-1                 │
│ Name: replica-db            │                               │
└─────────────────────────────┴───────────────────────────────┘
| Tag | Scope | Purpose |
|---|---|---|
| Name | Per resource | Human-readable identifier in console |
| Workload | Shared across deployment | Group of resources delivering business value |
| Component | Shared within component | Logical unit within workload |
| UUID | Workload or component level | Collision guardrail for precise filtering |
Aggregation (shared tags):
- Workload=app → total cost of the app across all deployments
- Component=database → cost of database components
Isolation (unique tags):
- UUID=a1b2c3d4 → cost of this specific deployment
Without UUID, generic tags match unrelated resources:
# Bad - matches all production resources
cost_filter {
  name   = "TagKeyValue"
  values = ["Environment$production"]
}

# Good - scoped to exact deployment
cost_filter {
  name   = "TagKeyValue"
  values = [format("UUID$%s", var.uuid)]
}
UUID enables:
- Budget alerts scoped to specific infrastructure
- Anomaly detection without cross-deployment noise
- SSM associations targeting instances by deployment
- AWS Resource Groups filtered to exact resources
See: Terraform Tagging for implementation with default_tags
Tag Key Format by Service
Tag key syntax differs across AWS cost services.
| Service | Format | Example |
|---|---|---|
| Cost Explorer | tag:KeyName | tag:Environment |
| Budgets cost_filter | TagKeyValue with Key$Value | Environment$production |
| Anomaly Detection | user: prefix | user:Environment |
This inconsistency causes silent failures. A filter that works in Cost Explorer won’t work in Budgets without reformatting.
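One way to avoid hand-formatting mistakes is to generate each form from a single key and value. The sketch below encodes only the three rows of the table above; `tag_filter` and its service names are hypothetical.

```shell
# Hypothetical helper: render a tag in the format each service expects,
# following the table above.
# Usage: tag_filter SERVICE KEY [VALUE]
tag_filter() {
  case "$1" in
    cost-explorer) printf 'tag:%s\n' "$2" ;;
    budgets)       printf '%s$%s\n' "$2" "$3" ;;
    anomaly)       printf 'user:%s\n' "$2" ;;
    *)             echo "unknown service: $1" >&2; return 1 ;;
  esac
}

tag_filter cost-explorer Environment
tag_filter budgets Environment production
tag_filter anomaly Environment
```

An unknown service name returns a non-zero status instead of printing a guess, so a typo fails loudly rather than producing a filter that matches nothing.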
See: Using Cost Allocation Tags
Split Cost Allocation
Use split cost allocation to attribute shared EC2 costs to individual containers.
EC2-backed ECS tasks and EKS pods share instance costs. Standard billing shows EC2 line items, not container-level breakdown. Split cost allocation calculates each container’s share based on CPU and memory consumption.
┌─────────────────────────────────────┐
│         EC2 Instance ($100)         │
├─────────────┬─────────────┬─────────┤
│ Pod A       │ Pod B       │ Pod C   │
│ 40% CPU     │ 35% CPU     │ 25%     │
│ $40         │ $35         │ $25     │
└─────────────┴─────────────┴─────────┘
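The proportional split in the diagram can be reproduced with a few lines of shell. This is a simplified sketch: it takes one precomputed share per container, whereas AWS derives the share from both CPU and memory. `split_cost` and the pod names are hypothetical.

```shell
# Hypothetical helper: split an instance cost across containers by share.
# Usage: split_cost INSTANCE_COST NAME=SHARE_PERCENT ...
split_cost() {
  cost="$1"; shift
  for entry in "$@"; do
    # ${entry%%=*} is the name, ${entry#*=} is the share in percent
    LC_ALL=C awk -v c="$cost" -v n="${entry%%=*}" -v s="${entry#*=}" \
      'BEGIN { printf "%s %.2f\n", n, c * s / 100 }'
  done
}

# The $100 instance from the diagram
split_cost 100 pod-a=40 pod-b=35 pod-c=25
```

The shares are not checked to sum to 100. Any remainder corresponds to unused capacity on the instance.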
Opt-in required (two steps):
1. Cost Management Preferences → Split cost allocation data → Enable
2. CUR report → Edit → Report content → Split cost allocation data ✓
For EKS, AWS auto-generates cost allocation tags:
| Tag | Description |
|---|---|
| aws:eks:cluster-name | Cluster name |
| aws:eks:namespace | Kubernetes namespace |
| aws:eks:node | Node name |
| aws:eks:workload-type | ReplicaSet, StatefulSet, Job, DaemonSet |
| aws:eks:workload-name | Workload name |
| aws:eks:deployment | Parent deployment (ReplicaSets only) |
EKS also supports importing Kubernetes labels as cost allocation tags (up to 50 per pod).
Constraints:
- Data appears in CUR only, not Cost Explorer
- Significant CUR volume increase (2-3 new line items per container per hour)
- EKS accelerator support (GPU, Trainium, Inferentia) adds third line item
- Fargate tasks already have discrete costs—split allocation not needed
See: Understanding Split Cost Allocation Data, Enabling Split Cost Allocation