Amazon EC2
Quicklinks (Bookmark):
Cost Explorer: AWS EC2 by Instance type and Running hours
Reservation Coverage: AWS EC2 RI coverage
Savings Plan Coverage: AWS EC2 SP coverage
Compute Rightsizing: AWS Compute Optimizer Rightsizing
Idle Compute: AWS Compute Optimizer Idle
EC2 Pricing table: AWS EC2 Pricing
EC2 CUR Queries: Query CUR on Athena
Amazon EC2 is the backbone of AWS compute: scalable, customizable, and dangerously easy to overspend on.
Let's break it down by:
What you're using
What you're paying
What you should be doing
The AWS-native tools to make it happen
What is EC2?
Amazon Elastic Compute Cloud (EC2) provides resizable virtual servers in the cloud.
Available in every AWS region
Billed by the second or hour
Can run Linux, Windows, or custom AMIs
Comes in dozens of instance families across generations
Instance Families: Pick the Right Hammer
Family | Category | Typical workloads | Notes
t | Burstable | Dev, test, low-traffic apps | Great for mostly idle workloads
m | General Purpose | Web apps, small services | Good default starting point
c | Compute Optimized | High-CPU workloads | Well suited to encoding, ML inference
r / x | Memory Optimized | DBs, caches, SAP | Watch the memory:cost ratio
i / d | Storage Optimized | OLTP, NoSQL, logs, IOPS-heavy | High local instance-store throughput
g, inf, p | Accelerated | AI/ML, HPC | GPU- or accelerator-backed, very expensive
Instance Generations
Processor | Architecture | OS support | Notes
Graviton (m6g, t4g, etc.) | ARM | Linux only | No Windows; apps may need recompiling
Intel/AMD (m5, c6a, etc.) | x86 | Linux and Windows | More costly, but universal compatibility
Run Linux? Try Graviton. Run Windows or legacy binaries? Stick to x86.
Tenancy Options
Tenancy | When to use | Cost note
Shared | Default | Best for 90% of workloads
Dedicated Instance | You need isolation | Slightly more expensive
Dedicated Host | BYOL licensing | Most expensive; billed per host, with socket/core visibility for per-socket licenses
EC2 Rightsizing Strategy
Quick Wins
Find underutilized instances (e.g. CPU < 10%)
Same-Family Resize
Downsize within the current instance family (e.g. m5.2xlarge → m5.large)
No re-architecture needed
Cross-Family Change
Migrate to more cost-effective families (e.g. m5 → t3 or m5 → m6g)
Use Graviton for Linux (no Windows support)
Shut Down Idle
Stop non-prod or idle EC2 instances automatically during off-hours
Use tags + Instance Scheduler
Review and adjust sizing monthly: usage changes, and so should your provisioning.
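The quick-win rule above can be sketched as a small classifier. This is a minimal illustration, not a Compute Optimizer replacement: the `InstanceUsage` type, the `rightsizing_hint` name, and the 40% busy-day threshold are assumptions made for the example, and the daily CPU averages would come from your own CloudWatch export.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class InstanceUsage:
    instance_id: str
    instance_type: str
    daily_avg_cpu: list[float]  # one average CPU percentage per day


def rightsizing_hint(usage: InstanceUsage,
                     idle_cpu: float = 10.0,
                     busy_day_cpu: float = 40.0) -> str:
    """Classify an instance from its daily CPU averages (hypothetical thresholds)."""
    avg = mean(usage.daily_avg_cpu)
    busiest_day = max(usage.daily_avg_cpu)
    if avg < idle_cpu and busiest_day < busy_day_cpu:
        # Low on average and never busy: downsize or schedule a stop.
        return "downsize-or-stop"
    if avg < idle_cpu:
        # Low on average but with a busy day: needs a human look first.
        return "review"
    return "keep"
```

CPU alone is a weak signal for memory-bound or I/O-bound workloads, so treat the output as a shortlist to investigate, not a decision.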
Purchase Model Optimization
Model | Typical savings | Best for | Trade-off
On-Demand | 0% | Dev/test, unpredictable workloads | High cost
Savings Plans | 30-66% | Steady-state compute | 1- or 3-year commitment
Reserved Instances | 30-72% | Predictable, type-specific workloads | Less flexibility
Spot | 70-90% | Fault-tolerant, stateless apps | Can be interrupted at any time (two-minute notice)
Use Savings Plans for the baseline. Use Spot for scale-out workers.
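To see how a baseline-plus-burst split affects the bill, here is a rough arithmetic sketch. The hourly rate and the discount values in the usage note are illustrative assumptions taken from the low end of the ranges above, not quoted AWS prices, and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(baseline_instances: int,
                 burst_instance_hours: float,
                 on_demand_rate: float,
                 sp_discount: float,
                 spot_discount: float,
                 hours_in_month: int = 730) -> tuple[float, float]:
    """Return (all On-Demand cost, Savings Plan baseline + Spot burst cost)."""
    baseline_hours = baseline_instances * hours_in_month
    all_on_demand = (baseline_hours + burst_instance_hours) * on_demand_rate
    blended = (
        baseline_hours * on_demand_rate * (1 - sp_discount)
        + burst_instance_hours * on_demand_rate * (1 - spot_discount)
    )
    return all_on_demand, blended
```

With 10 baseline instances, 2,000 burst instance-hours, an assumed $0.10/hour rate, a 30% Savings Plan discount, and a 70% Spot discount, the sketch gives roughly $930 all On-Demand versus $571 blended.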
Scheduled Usage
Stop non-prod resources when not in use.
Tools:
Lambda + EventBridge + Tags
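A minimal sketch of the Lambda + EventBridge + Tags pattern might look like the following. The tag key `Schedule` and value `office-hours` are hypothetical, the EventBridge schedule rule that triggers the function is assumed to exist, and pagination of `describe_instances` is left out for brevity.

```python
def stop_tagged_instances(ec2, tag_key="Schedule", tag_value="office-hours"):
    """Stop running instances that carry the given tag.

    `ec2` is expected to be a boto3 EC2 client, i.e. boto3.client("ec2").
    """
    response = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids


def lambda_handler(event, context):
    # boto3 is bundled with the AWS Lambda Python runtime.
    import boto3

    return {"stopped": stop_tagged_instances(boto3.client("ec2"))}
```

Passing the client in as an argument keeps the stop logic testable without AWS credentials. A matching start function would call `start_instances` on a morning schedule.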
EC2 FinOps Toolbox
Cost View in Cost Explorer
In Cost Explorer:
Filter → Service = EC2
Group by → Instance Type, Region, Tag, or Linked Account
Use RI Coverage, SP Utilization, and Forecasting
Open Cost Explorer
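The same filter and grouping can be expressed through the Cost Explorer API. The sketch below only builds the request parameters; the function name is hypothetical, and the service name string is the value Cost Explorer normally uses for EC2 compute, which is worth confirming against your own account's dimension values.

```python
from datetime import date, timedelta


def ec2_cost_by_instance_type_request(end: date, days: int = 30) -> dict:
    """Build get_cost_and_usage parameters: EC2 cost grouped by instance type."""
    start = end - timedelta(days=days)
    return {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost", "UsageQuantity"],
        "Filter": {
            "Dimensions": {
                "Key": "SERVICE",
                "Values": ["Amazon Elastic Compute Cloud - Compute"],
            }
        },
        "GroupBy": [{"Type": "DIMENSION", "Key": "INSTANCE_TYPE"}],
    }
```

The result would be passed as `boto3.client("ce").get_cost_and_usage(**params)`. Swapping the `GroupBy` key for `REGION` or `LINKED_ACCOUNT` gives the other groupings listed above.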
Cost Explorer: Fast-Triage Usage Types
When you load EC2 in Cost Explorer or in CUR, watch for these usage types and what they often indicate:
BoxUsage:* | Base EC2 instance hours, the main compute cost bucket
CPUCredits:* | Surplus CPU credits charged to T-family instances running in Unlimited mode
EBSOptimized:* | Surcharge (under EC2-Other) for EBS optimization on instance types where it is not included
DataTransfer-* | Network transfer (inter-AZ, cross-region, internet egress)
ElasticIP:* | Idle or unattached Elastic IPs that are still incurring cost
Action Tips:
Filter for low CPU utilization but non-zero BoxUsage to find idle instances.
High CPUCredits charges suggest a T-class instance is bursting above its baseline for long periods; consider a larger size or a non-burstable family.
Use tag filters (project, team) to group and triage waste quickly.
Deep Dive with CUR
When querying CUR for EC2 insights, these are your go-to columns:
line_item_resource_id: the EC2 instance ID
product_instance_type: the instance family and size
line_item_usage_type: e.g. BoxUsage, CPUCredits, DataTransfer
line_item_operation: the operation behind the charge, e.g. RunInstances
resourceTags/*: your team/project tag dimensions
line_item_unblended_cost / line_item_blended_cost: cost values
Example Query Prompt:
Find t3 instances with high CPUCredits charges relative to their BoxUsage cost → candidates for a larger size or a non-burstable family. Find t3 instances with consistently low CPU utilization → candidates for downsizing or retirement.
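As a rough sketch of that kind of query, the snippet below totals BoxUsage and CPUCredits cost per resource ID from CUR rows that have already been loaded as dictionaries (for example from an Athena result export). The function names and the 25% ratio are assumptions for illustration. The substring match reflects that usage types usually carry a region prefix outside us-east-1.

```python
import re
from collections import defaultdict

# Matches t2/t3/t3a/t4g instance hours, with or without a region prefix.
T_FAMILY_BOX_USAGE = re.compile(r"BoxUsage:t\d")


def t_family_costs(rows):
    """Sum BoxUsage and CPUCredits cost per line_item_resource_id."""
    totals = defaultdict(lambda: {"box_usage": 0.0, "cpu_credits": 0.0})
    for row in rows:
        usage_type = row["line_item_usage_type"]
        cost = float(row["line_item_unblended_cost"])
        resource = row["line_item_resource_id"]
        if T_FAMILY_BOX_USAGE.search(usage_type):
            totals[resource]["box_usage"] += cost
        elif "CPUCredits:" in usage_type:
            totals[resource]["cpu_credits"] += cost
    return dict(totals)


def credit_heavy(totals, ratio=0.25):
    """Resource IDs whose credit cost is at least `ratio` of their BoxUsage cost."""
    return sorted(
        resource
        for resource, cost in totals.items()
        if cost["box_usage"] > 0 and cost["cpu_credits"] / cost["box_usage"] >= ratio
    )
```

Instances returned by `credit_heavy` deserve a sizing review. The ratio is only a starting point and should be tuned against your own fleet.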
CUR Setup
Data Transfer & EBS Callouts
Inter-AZ traffic between EC2 instances is billable; intra-AZ traffic over private IPs is free (still monitor it).
Cross-region transfers and internet egress can dominate cost in chatty applications.
EBS is tightly coupled to EC2: most storage cost lives under EBS volumes and snapshots. Migrate gp2 → gp3, right-size throughput/IOPS, and clean up orphaned volumes.
Co-locate high-traffic tiers (API + DB, worker + storage) in the same AZ, or use private link constructs, to reduce transfer cost.
Advanced Tactics
Tactic | Benefit
Graviton Migration | Save 20-40% for Linux workloads
Mixed-Instance ASG | Use the cheapest available instance types across AZs
Spot + On-Demand fallback | Scale with resilience
Instance Scheduler | Shut down dev/test on nights and weekends
Tagging | Enables showback by team/project
Convertible RIs | Switch instance types during the term
Auto Scaling done right | Prevent zombie capacity
Forecasting via CE | Plan future RI/SP purchases
Spot Strategy: next level
Use MixedInstancesPolicy in ASGs with multiple families and sizes to increase availability.
Define an interruption budget (e.g. allow 10% of capacity to be interrupted) to trade lower cost against reliability.
Use dynamic max price caps (e.g. set to 70-90% of On-Demand) and fall back to On-Demand when Spot is reclaimed.
Monitor Spot interruption events and automate graceful instance draining and shutdown.
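A MixedInstancesPolicy along these lines could be assembled as below. This is a sketch under assumptions: the launch template name, the instance type list, and the On-Demand split are placeholders, and the builder function itself is hypothetical. The dictionary keys follow the Auto Scaling API shape as I understand it.

```python
def mixed_instances_policy(launch_template_name: str,
                           instance_types: list[str],
                           on_demand_base: int = 2,
                           on_demand_pct_above_base: int = 20) -> dict:
    """Build a MixedInstancesPolicy with an On-Demand floor and Spot above it."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Several families and sizes widen the set of usable Spot pools.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }
```

The result would be passed as the `MixedInstancesPolicy` argument of `boto3.client("autoscaling").create_auto_scaling_group(...)`, typically alongside `CapacityRebalance=True`. No `SpotMaxPrice` is set here, so the default cap (the On-Demand price) applies.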
Security & Compliance for EC2
Keep AMIs on a regular patching cadence and automate image refresh.
Enforce IMDSv2 and disable IMDSv1 to reduce SSRF risk.
Limit public IP access; use NAT/Load Balancers + security groups.
Use SCPs / Guardrails to prevent unapproved instance types or regions.
Enforce SSM Patch Manager and logging agents for visibility and drift detection.
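The IMDSv2 item above lends itself to a quick audit. The sketch below assumes a boto3 EC2 client and a `describe_instances` response; the two helper names are hypothetical.

```python
def instances_allowing_imdsv1(describe_instances_response: dict) -> list[str]:
    """Return instance IDs whose metadata options do not require session tokens."""
    return [
        instance["InstanceId"]
        for reservation in describe_instances_response["Reservations"]
        for instance in reservation["Instances"]
        if instance.get("MetadataOptions", {}).get("HttpTokens") != "required"
    ]


def enforce_imdsv2(ec2, instance_ids: list[str]) -> None:
    """Require IMDSv2 tokens on each instance (ec2 is a boto3 EC2 client)."""
    for instance_id in instance_ids:
        ec2.modify_instance_metadata_options(
            InstanceId=instance_id,
            HttpTokens="required",
            HttpEndpoint="enabled",
        )
```

Before enforcing, it is worth checking that nothing on the instance still depends on IMDSv1, since older SDKs and agents may break once tokens are required.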
EC2 FinOps Checklist
EC2 Cost Optimization Challenges
A Q&A-style deep dive into the most persistent, high-impact AWS EC2 cost problems, with actionable solutions that go beyond "just rightsize it."
Q1: Why do EC2 bills spiral from over-provisioning or bad pricing choices?
Because workloads evolve, but instance sizes and pricing models don't. Teams keep On-Demand instances running 24/7, even when utilization hovers below 20%.
Solution:
Run AWS Compute Optimizer and Cost Explorer weekly.
Shift predictable loads to Savings Plans / Reserved Instances (up to 72% off).
Use Spot Instances for fault-tolerant or batch workloads (up to 90% off).
Implement instance schedules to stop non-prod workloads after hours.
Q2: Why is committing to Savings Plans or RIs so confusing?
Because predicting your baseline usage is part science, part art. Misjudging it either locks in waste or misses savings.
Solution:
Default to Compute Savings Plans for flexibility.
Use Reserved Instances only where you need guaranteed capacity (zonal RIs include a capacity reservation; regional RIs do not).
Monitor coverage vs utilization KPIs monthly and rebalance quarterly.
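The two KPIs are easy to confuse, so here is a simplified sketch of how they differ. The function and parameter names are hypothetical and the definitions are deliberately reduced: coverage asks how much eligible spend is under a commitment, while utilization asks how much of the purchased commitment is actually consumed.

```python
def commitment_kpis(eligible_spend: float,
                    covered_spend: float,
                    commitment_purchased: float,
                    commitment_used: float) -> dict:
    """Compute coverage and utilization as fractions between 0 and 1."""
    return {
        # Low coverage: On-Demand spend that a commitment could still discount.
        "coverage": covered_spend / eligible_spend,
        # Low utilization: commitment paid for but not used (locked-in waste).
        "utilization": commitment_used / commitment_purchased,
    }
```

For example, $700 covered out of $1,000 eligible spend is 70% coverage, and $450 used out of a $500 commitment is 90% utilization. High utilization with low coverage suggests room to commit more; the reverse suggests the baseline was overestimated.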
Q3: What's behind random slowdowns on burstable (T-family) instances?
CPU credits. Once burst credits run out, the instance is throttled to its baseline, silently killing performance.
Solution:
Monitor CPUCreditBalance via CloudWatch alarms.
Switch to Unlimited mode (with awareness of extra cost) or scale out horizontally.
Move sustained loads to M/C/R/Graviton families.
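The credit mechanics can be made concrete with a little arithmetic. One CPU credit equals one vCPU running at 100% for one minute. The sketch below estimates how long a balance lasts under sustained load; the function name is hypothetical, and the t3.micro figures in the usage note (2 vCPUs, 12 credits earned per hour, 288 maximum balance) are quoted from memory and should be checked against the current AWS burstable-instance table.

```python
from typing import Optional


def hours_until_credits_exhausted(balance: float,
                                  earn_rate_per_hour: float,
                                  vcpus: int,
                                  utilization_pct: float) -> Optional[float]:
    """Hours until the credit balance hits zero, or None if it never does."""
    # Credits spent per hour: each vCPU at 100% burns 60 credits per hour.
    spend_per_hour = vcpus * (utilization_pct / 100) * 60
    net_drain = spend_per_hour - earn_rate_per_hour
    if net_drain <= 0:
        return None  # at or below baseline, the balance does not shrink
    return balance / net_drain
```

Under those assumed figures, a t3.micro with a full balance of 288 credits running both vCPUs at 50% spends 60 credits per hour against 12 earned, so the balance lasts about 6 hours in standard mode.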
Q4: Why do EBS volumes cause unpredictable slowness and high costs?
Older gp2 volumes tie IOPS to size, forcing over-provisioning for performance.
Solution:
Migrate to gp3 (decouples size and performance).
Allocate precise IOPS/throughput.
For critical workloads, use io2 / io2 Block Express and enable EBS-optimized instances.
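The size-to-IOPS coupling on gp2 can be illustrated with a rough cost comparison. All prices below are assumed, illustrative per-month figures and will differ by region and over time. The 3 IOPS per GiB gp2 ratio and the gp3 baseline of 3,000 IOPS and 125 MiB/s are the documented behaviour as I understand it, and gp2 burst credits are ignored for simplicity.

```python
import math

# Assumed illustrative prices (USD per month); check current regional pricing.
GP2_PER_GIB = 0.10
GP3_PER_GIB = 0.08
GP3_PER_EXTRA_IOPS = 0.005   # per provisioned IOPS above 3,000
GP3_PER_EXTRA_MIBPS = 0.04   # per MiB/s above 125


def gp2_monthly(data_gib: int, target_iops: int) -> float:
    """gp2 must be sized up until 3 IOPS/GiB reaches the target."""
    size_for_iops = math.ceil(target_iops / 3)
    return max(data_gib, size_for_iops) * GP2_PER_GIB


def gp3_monthly(data_gib: int, target_iops: int, target_mibps: int = 125) -> float:
    """gp3 pays for capacity plus only the performance above its baseline."""
    extra_iops = max(0, target_iops - 3000)
    extra_mibps = max(0, target_mibps - 125)
    return (data_gib * GP3_PER_GIB
            + extra_iops * GP3_PER_EXTRA_IOPS
            + extra_mibps * GP3_PER_EXTRA_MIBPS)
```

For 500 GiB of data needing 6,000 IOPS, the gp2 volume has to grow to 2,000 GiB (about $200 under these assumed prices), while gp3 stays at 500 GiB plus 3,000 extra IOPS (about $55).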
Q5: How does using the wrong instance family burn money?
Running compute-heavy workloads on general-purpose (M-family) instances or vice versa leads to underutilization or overpayment.
Solution:
Let Compute Optimizer recommend the right family.
Benchmark using sysbench or internal metrics.
Try Graviton (ARM) instances for 15-40% better price-performance, after verifying compatibility.
Q6: Why does networking architecture silently inflate EC2 costs?
Cross-AZ chatter, poor placement, and hairpin NAT traffic increase latency and data transfer costs.
Solution:
Group chatty microservices in cluster Placement Groups.
Use VPC Endpoints (S3, DynamoDB) to bypass NAT.
Deploy Global Accelerator or CloudFront for edge proximity.
Q7: Why do memory-heavy workloads (ML/analytics) overrun budgets?
Memory leaks and over-sized R-instances hide behind "just working" apps.
Solution:
Choose R-family or Graviton memory-optimized instances.
Use CloudWatch memory metrics (published by the CloudWatch agent, since EC2 does not report memory by default) to rightsize.
For AI workloads, add KV caching, quantization, or batching.
Q8: How can I safely use Spot Instances without chaos from interruptions?
Spot can save 70-90%, but interruptions kill unprepared apps.
Solution:
Mix Spot + On-Demand in Auto Scaling Groups using attribute-based selection.
Implement checkpointing and handle 2-minute interruption notices.
Enable capacity rebalancing for smarter recovery.
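Handling the two-minute notice usually means polling the instance metadata service from inside the instance. The sketch below is one way to do that with the standard library only. The endpoint path and token headers are the IMDSv2 ones as I recall them from the EC2 documentation, so verify them before relying on this; the function names are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional
from urllib import error, request

IMDS = "http://169.254.169.254/latest"


def parse_instance_action(body: str, now: datetime) -> tuple[str, float]:
    """Return (action, seconds remaining) from a spot/instance-action document."""
    doc = json.loads(body)
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return doc["action"], (deadline - now).total_seconds()


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Fetch the interruption notice, or None when none is scheduled (HTTP 404)."""
    token_request = request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with request.urlopen(token_request, timeout=timeout) as resp:
        token = resp.read().decode()
    action_request = request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with request.urlopen(action_request, timeout=timeout) as resp:
            return resp.read().decode()
    except error.HTTPError as exc:
        if exc.code == 404:
            return None
        raise
```

A worker loop would call `fetch_instance_action()` every few seconds and, once it returns a document, checkpoint its state and stop taking new work within the remaining seconds reported by `parse_instance_action`.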
Q9: Why do self-managed databases on EC2 eat into cost savings?
DIY databases accumulate inefficiencies: missing indexes, old AMIs, I/O-heavy storage.
Solution:
Audit queries using Performance Insights or pg_stat_statements.
Move to Amazon RDS/Aurora when possible.
For EC2 DBs: use gp3/io2, tune auto-vacuum, and monitor read/write IOPS.
Q10: Why does Auto Scaling waste resources or fail to respond fast enough?
Bad scaling signals or cooldowns cause over-provisioning or late scaling events.
Solution:
Use Target Tracking policies with metrics like CPU, queue depth, or requests/sec.
Mix instance types with attribute-based selection and capacity rebalancing.
Add warm pools for near-instant scale-out.
Quick Wins
Migrate all gp2 volumes to gp3.
Cover steady baselines with Savings Plans.
Implement instance scheduling for non-prod.
Pilot Graviton instances for 20-30% better price/performance.
Add Spot diversification and cost alarms for accountability.