Amazon EC2

Amazon EC2 is the backbone of AWS compute: scalable, customizable, and dangerously easy to overspend on.

Let's break it down by:

  • What you're using

  • What you're paying

  • What you should be doing

  • And the AWS-native tools to make it happen.


🚀 What is EC2?

Amazon Elastic Compute Cloud (EC2) provides resizable virtual servers in the cloud.

  • Available in every AWS region

  • Billed per second (with a 60-second minimum) or per hour, depending on the OS and purchase option

  • Can run Linux, Windows, or custom AMIs

  • Comes in dozens of instance families across generations


⚙️ Instance Families: Pick the Right Hammer

| Family | Use Case | Notes |
| --- | --- | --- |
| t (Burstable) | Dev, test, low-traffic apps | Great for mostly idle workloads with occasional bursts |
| m (General Purpose) | Web apps, small services | Good default starting point |
| c (Compute Optimized) | High CPU workloads | Perfect for encoding, ML inference |
| r / x (Memory Optimized) | DBs, caches, SAP | Watch memory:cost ratio |
| i / d (Storage Optimized) | OLTP, NoSQL, logs, IOPS-heavy | High local instance-store throughput (NVMe SSD or HDD), not EBS |
| g, inf, p (Accelerated) | AI/ML, HPC | GPU- or Inferentia-backed, very expensive |


🧬 Instance Generations

| Generation | Architecture | OS Support | ⚠️ Caveats |
| --- | --- | --- | --- |
| Graviton (m6g, t4g, etc.) | ARM | ✅ Linux only | ❌ No Windows, may need recompiled apps |
| x86 Intel/AMD (m5, c6a, etc.) | x86 | ✅ Linux, ✅ Windows | More costly, but universal compatibility |

Run Linux? Try Graviton. Run Windows or legacy binaries? Stick to x86.
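To tell the two apart in an inventory export, the instance type name itself is a decent hint. The sketch below is a naming heuristic, not an official AWS grammar: it assumes that a `g` among the letters after the generation number marks a Graviton CPU (as in m6g or t4g), while a leading `g` (as in g6) is just the GPU family. The function names are made up for illustration.

```python
import re

# Heuristic pattern: family letters, generation digits, optional attribute
# letters, then ".size". Unusual names (e.g. "u-6tb1.metal") do not match.
_INSTANCE_TYPE = re.compile(r"^([a-z]+)(\d+)([a-z-]*)\.([a-z0-9-]+)$")


def parse_instance_type(name):
    """Split e.g. 'm6g.large' into family, generation, attributes, size."""
    match = _INSTANCE_TYPE.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    family, generation, attributes, size = match.groups()
    return {
        "family": family,
        "generation": int(generation),
        "attributes": attributes,
        "size": size,
    }


def is_graviton(name):
    """Assumption: a 'g' among the attribute letters marks a Graviton CPU."""
    return "g" in parse_instance_type(name)["attributes"]


for candidate in ("m6g.large", "t4g.micro", "m5.large", "c6a.xlarge", "g6.xlarge"):
    verdict = "Graviton (ARM)" if is_graviton(candidate) else "not Graviton"
    print(candidate, "->", verdict)
```

Treat the result as a first filter only; the architecture reported by the EC2 API is the authoritative answer.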


🏛️ Tenancy Options

| Tenancy Type | Use When | Notes |
| --- | --- | --- |
| Shared | Default | ✅ Best for 90% of workloads |
| Dedicated Instance | You need isolation | ⚠️ Slightly more expensive |
| Dedicated Host | BYOL licensing | 💸 Most expensive; billed per host, supports per-socket or per-core licenses |


🧠 EC2 Rightsizing Strategy

| Strategy | What to Do | Tools / Notes |
| --- | --- | --- |
| ✅ Quick Wins | Find underutilized instances (e.g. CPU < 10%) | |
| 🔁 Same-Family Resize | Downsize within current instance family (e.g. m5.2xlarge → m5.large) | No re-architecture needed |
| 🔄 Cross-Family Change | Migrate to cost-effective families (e.g. m5 → t3 or m5 → m6g) | Use Graviton for Linux (⚠️ no Windows support) |
| 💤 Shut Down Idle | Stop non-prod or idle EC2s automatically during off-hours | |

💡 Review and adjust sizing monthly: usage changes, so should your provisioning.
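A same-family resize is easy to script once you have utilization numbers. The following is a minimal sketch that assumes you already exported average and peak CPU per instance (for example from CloudWatch). The thresholds, the `suggest_downsize` name, and the sample fleet are illustrative, not AWS guidance.

```python
# m5 sizes, smallest to largest ("metal" omitted). Other families offer
# different sizes, so pass the list that matches the family under review.
M5_SIZES = ["large", "xlarge", "2xlarge", "4xlarge",
            "8xlarge", "12xlarge", "16xlarge", "24xlarge"]


def suggest_downsize(instance_type, avg_cpu, max_cpu, family_sizes,
                     avg_threshold=10.0, max_threshold=40.0):
    """Return the next size down in the same family, or None.

    avg_cpu and max_cpu are CPU utilization percentages over the review
    window. Both must be under their threshold to count as underutilized.
    """
    family, _, size = instance_type.partition(".")
    if size not in family_sizes:
        return None  # unknown size: leave for manual review
    index = family_sizes.index(size)
    underutilized = avg_cpu < avg_threshold and max_cpu < max_threshold
    if not underutilized or index == 0:
        return None  # busy enough, or already the smallest size
    return f"{family}.{family_sizes[index - 1]}"


# Hypothetical fleet: (instance type, average CPU %, peak CPU %)
fleet = [("m5.2xlarge", 4.2, 18.0), ("m5.large", 3.0, 9.0), ("m5.4xlarge", 55.0, 92.0)]
for instance_type, avg_cpu, max_cpu in fleet:
    print(instance_type, "->", suggest_downsize(instance_type, avg_cpu, max_cpu, M5_SIZES))
```

Stepping down one size at a time is deliberately conservative; CPU alone says nothing about memory or network pressure, so check those before acting.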


💸 Purchase Model Optimization

| Model | Savings | Best For | Risk |
| --- | --- | --- | --- |
| On-Demand | 0% | Dev/test, unpredictable workloads | 💸 High cost |
| Savings Plans | 30–66% | Steady-state compute | ⚠️ 1- or 3-year commitment |
| Reserved Instances | 30–72% | Predictable, type-specific workloads | ⚠️ Less flexibility |
| Spot | 70–90% | Fault-tolerant, stateless apps | ⚠️ Can be interrupted with two minutes' notice |

➡ Use Savings Plans for baseline.
➡ Use Spot for scale-out workers.
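How big should the baseline be? A rough rule: a commitment with discount d only beats On-Demand if you actually use more than 1 - d of the hours you committed to. The sketch below works that out in instance-hours, which is a simplification (real Savings Plans commit to dollars per hour). The rate and the function names are made-up placeholders.

```python
def break_even_utilization(discount):
    """Fraction of committed hours you must use for the commitment to pay off."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def monthly_cost(on_demand_rate, hours_used, hours_committed=0.0, discount=0.0):
    """Committed hours are paid for whether used or not; any usage beyond
    the commitment is billed at the On-Demand rate."""
    committed = hours_committed * on_demand_rate * (1 - discount)
    overflow = max(0.0, hours_used - hours_committed) * on_demand_rate
    return committed + overflow


RATE = 0.10   # hypothetical On-Demand $/hour
HOURS = 730   # roughly one month

print("break-even utilization at 30% discount:", break_even_utilization(0.30))
for used in (400, 511, 730):
    on_demand = monthly_cost(RATE, used)
    committed = monthly_cost(RATE, used, hours_committed=HOURS, discount=0.30)
    print(f"{used} h used: on-demand ${on_demand:.2f} vs committed ${committed:.2f}")
```

Below the break-even point the commitment is locked-in waste, which is why covering only the steady baseline is the safer default.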


⏱ Scheduled Usage

Stop non-prod resources when not in use.

Tools:


🔍 EC2 FinOps Toolbox

| Tool | Purpose |
| --- | --- |
| Cost Explorer | Analyze trends, tags, reservations |
| Compute Optimizer | Rightsize + Graviton tips |
| Trusted Advisor | Idle EC2, EBS, Elastic IPs |
| Savings Plans Console | Commit to usage, save up to 66% (Compute) or 72% (EC2 Instance) |
| Reserved Instances | Buy fixed-term EC2 savings |
| CUR + Athena | Deep cost analytics |


📉 Cost View in Cost Explorer

In Cost Explorer:

  • Filter → Service = EC2-Instances (add EC2-Other for EBS, NAT gateway, and data transfer charges)

  • Group by → Instance Type, Region, Tag, or Linked Account

  • Use the RI Coverage, Savings Plans Utilization, and Forecasting reports



Cost Explorer: Fast-Triage Usage Types 🔍

When you load EC2 in Cost Explorer or in CUR, watch for these usage types and what they often indicate:

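| Usage Type Pattern | Likely Meaning |
| --- | --- |
| BoxUsage:* | Base EC2 instance hours, the main compute cost bucket |
| CPUCredits:* | T-family instances in Unlimited mode paying for surplus CPU credits (sustained bursting above baseline) |
| EBSOptimized:* | Hourly surcharge for enabling EBS optimization on instance types where it is not included by default |
| DataTransfer-* | Network transfer (inter-AZ, inter-region, internet egress) |
| ElasticIP:* | Idle or unattached Elastic IPs incurring cost (since February 2024, in-use public IPv4 addresses are billed too) |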

Action Tips:

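  • Cost Explorer shows hours billed, not how busy an instance was. Cross-check non-zero BoxUsage against low CloudWatch CPUUtilization to find idle instances.

  • Recurring CPUCredits charges mean a T instance keeps bursting above its baseline: consider a larger size or a fixed-performance family. A credit balance that stays full in CloudWatch (CPUCreditBalance) suggests the opposite, an over-provisioned instance.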

  • Use tag filters (project, team) to group and triage waste quickly.
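For a quick offline triage of an exported usage report, the patterns in the table can be turned into a tiny classifier. This is a sketch under assumptions: the bucket labels and function names are invented here, and the leading wildcard is there because usage types outside us-east-1 usually carry a region prefix such as `USE2-`.

```python
from fnmatch import fnmatchcase

# (pattern, bucket) pairs following the usage type table; labels are ours.
USAGE_PATTERNS = [
    ("BoxUsage*", "instance hours"),
    ("CPUCredits:*", "T-family CPU credits"),
    ("EBSOptimized:*", "EBS optimization surcharge"),
    ("DataTransfer-*", "data transfer"),
    ("ElasticIP:*", "Elastic IP"),
]


def classify(usage_type):
    """Map a usage type string to a triage bucket, or 'other'."""
    for pattern, label in USAGE_PATTERNS:
        # The leading "*" tolerates a region prefix such as "USE2-".
        if fnmatchcase(usage_type, "*" + pattern):
            return label
    return "other"


def summarize(line_items):
    """Sum cost per bucket for (usage_type, cost) pairs."""
    totals = {}
    for usage_type, cost in line_items:
        label = classify(usage_type)
        totals[label] = totals.get(label, 0.0) + cost
    return totals


# Invented sample rows, for illustration only.
sample = [
    ("USE2-BoxUsage:m5.large", 70.0),
    ("USE2-CPUCredits:t3", 12.5),
    ("USE2-DataTransfer-Regional-Bytes", 8.0),
    ("USE2-EBS:VolumeUsage.gp3", 20.0),
]
print(summarize(sample))
```

Anything that lands in "other" (EBS volumes, snapshots, NAT gateways) deserves its own pass rather than being ignored.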


📊 Deep Dive with CUR

When querying CUR for EC2 insights, these are your go-to columns:

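  • line_item_resource_id: the EC2 instance ID (populated only when resource IDs are enabled in the report)

  • product_instance_type: the instance family and size

  • line_item_usage_type: e.g. BoxUsage, CPUCredits, DataTransfer

  • line_item_operation: the API operation behind the charge, e.g. RunInstances

  • resourceTags/*: your team/project tag dimensions (exposed as resource_tags_user_* columns in Athena)

  • line_item_unblended_cost / line_item_blended_cost: cost values

Example Query Prompt: Find t3 instances with recurring CPUCredits charges. They burst above baseline often enough to pay for surplus credits, so they are candidates for a larger size or a fixed-performance family rather than for downsizing.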
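If Athena is not set up yet, the same question can be asked of a CUR export with plain Python. The sketch below sums CPUCredits charges per instance using the column names listed above; the CSV rows are invented sample data and `credit_cost_by_instance` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Invented rows in CUR column naming, for illustration only.
SAMPLE_CUR = """\
line_item_resource_id,line_item_usage_type,line_item_unblended_cost
i-0aaa,USE2-BoxUsage:t3.medium,30.00
i-0aaa,USE2-CPUCredits:t3,12.50
i-0bbb,USE2-BoxUsage:t3.small,15.00
i-0ccc,USE2-BoxUsage:m5.large,70.00
i-0aaa,USE2-CPUCredits:t3,4.00
"""


def credit_cost_by_instance(cur_csv, family="t3"):
    """Total CPUCredits charges per resource ID for one T family."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(cur_csv)):
        # endswith() tolerates a region prefix on the usage type.
        if row["line_item_usage_type"].endswith(f"CPUCredits:{family}"):
            totals[row["line_item_resource_id"]] += float(row["line_item_unblended_cost"])
    return dict(totals)


print(credit_cost_by_instance(SAMPLE_CUR))
```

Sorting the result by cost gives a short list of instances whose credit charges are worth comparing against the price of the next size up.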



⚠️ Data Transfer & EBS Callouts

  • Inter-AZ traffic between EC2 instances is billable in both directions; intra-AZ traffic over private IPs is free (still monitor it, because traffic sent via public or Elastic IPs is charged).

  • Cross-region transfers and internet egress can dominate cost in chatty applications.

  • EBS is tightly coupled: most storage cost lives under EBS volumes and snapshots. Migrate gp2 → gp3, right-size throughput/IOPS, and clean up orphaned volumes.

  • Co-locate high-traffic tiers (API + DB, worker + storage) in the same AZ, or use VPC endpoints to keep traffic off NAT gateways and reduce transfer cost.


🔮 Advanced Tactics

| Strategy | Why It Matters |
| --- | --- |
| Graviton Migration | Up to 40% better price-performance for Linux workloads |
| Mixed-Instance ASG | Lets the ASG pick the cheapest available instance types across families and AZs |
| Spot + On-Demand fallback | Scale with resilience |
| Instance Scheduler | Shut down dev/test nights/weekends |
| Tagging | Enables showback by team/project |
| Convertible RIs | Switch types during term |
| Auto Scaling Right | Prevent zombie capacity |
| Forecasting via CE | Plan future RI/SP purchases |

Spot Strategy: next level

  • Use MixedInstancesPolicy in ASGs with multiple families and sizes to increase availability.

  • Define an interruption budget for the workload (e.g. tolerate losing 10% of capacity at once) to trade lower cost against reliability.

  • Leave the Spot max price at its default (the On-Demand price) unless you have a hard budget cap; a lower cap such as 70–90% of On-Demand makes interruptions more frequent. Fall back to On-Demand when Spot is reclaimed.

  • Monitor Spot interruption events and automate graceful draining and shutdown.
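Here is roughly what the first bullet looks like as configuration. The helper below builds a MixedInstancesPolicy dictionary in the shape the Auto Scaling API expects, which could then be passed to boto3's `create_auto_scaling_group(MixedInstancesPolicy=...)`. The template name, the instance types, and the On-Demand split are placeholders, and the helper itself is hypothetical.

```python
def mixed_instances_policy(launch_template_name, instance_types,
                           on_demand_base=1, on_demand_pct_above_base=20):
    """Build a MixedInstancesPolicy dict for an Auto Scaling group."""
    if not instance_types:
        raise ValueError("need at least one instance type")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type the group may launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Always keep this many On-Demand instances.
            "OnDemandBaseCapacity": on_demand_base,
            # Above the base, this percentage is On-Demand; the rest is Spot.
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


policy = mixed_instances_policy(
    "web-workers",  # placeholder launch template name
    ["m5.large", "m5a.large", "m6i.large", "c5.large"],
)
print(policy["InstancesDistribution"])
```

The wider the list of interchangeable types, the more Spot pools the group can draw from, which is what actually lowers the interruption rate.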


🚨 Security & Compliance for EC2

  • Keep AMIs on a regular patching cadence and automate image refresh.

  • Enforce IMDSv2 and disable IMDSv1 to mitigate SSRF risks.

  • Limit public IP exposure; put instances behind NAT gateways or load balancers, and restrict access with security groups.

  • Use SCPs / Guardrails to prevent unapproved instance types or regions.

  • Enforce SSM Patch Manager and logging agents for visibility and drift detection.
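The IMDSv2 bullet is easy to audit. The sketch below scans a response shaped like the output of boto3's `describe_instances` and lists instances that still accept IMDSv1, i.e. whose `MetadataOptions.HttpTokens` is not `required`. The sample response is hand-written and the function name is made up.

```python
def instances_allowing_imdsv1(describe_instances_response):
    """Return IDs of instances whose metadata service still accepts IMDSv1."""
    offenders = []
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            options = instance.get("MetadataOptions", {})
            if options.get("HttpEndpoint") == "disabled":
                continue  # metadata service is off entirely
            if options.get("HttpTokens") != "required":
                offenders.append(instance["InstanceId"])
    return offenders


# Hand-written sample in the shape of a describe_instances response.
sample_response = {
    "Reservations": [
        {"Instances": [
            {"InstanceId": "i-0aaa", "MetadataOptions": {"HttpTokens": "required", "HttpEndpoint": "enabled"}},
            {"InstanceId": "i-0bbb", "MetadataOptions": {"HttpTokens": "optional", "HttpEndpoint": "enabled"}},
        ]},
        {"Instances": [
            {"InstanceId": "i-0ccc", "MetadataOptions": {"HttpTokens": "optional", "HttpEndpoint": "disabled"}},
        ]},
    ]
}
print(instances_allowing_imdsv1(sample_response))
```

An audit like this finds existing offenders; preventing new ones is better handled by policy at launch time.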


✅ EC2 FinOps Checklist


🧠 EC2 Cost Optimization Challenges

A Q&A-style deep dive into the most persistent, high-impact AWS EC2 cost problems, with actionable solutions that go beyond "just rightsize it."


Q1: Why do EC2 bills spiral from over-provisioning or bad pricing choices?

Because workloads evolve, but instance sizes and pricing models don't. Teams keep On-Demand instances running 24/7, even when utilization hovers below 20%.

✅ Solution:

  • Run AWS Compute Optimizer and Cost Explorer weekly.

  • Shift predictable loads to Savings Plans / Reserved Instances (up to 72% off).

  • Use Spot Instances for fault-tolerant or batch workloads (up to 90% off).

  • Implement instance schedules to stop non-prod workloads after hours.
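The scheduling bullet is mostly arithmetic. A minimal sketch, assuming a weekday office-hours window (08:00 to 20:00, Monday to Friday) that you would tune per team; both function names are made up:

```python
from datetime import datetime


def should_run(now, start_hour=8, stop_hour=20, workdays=(0, 1, 2, 3, 4)):
    """True if a non-prod instance should be running at `now`.

    weekday() uses Monday=0; the window is start_hour <= hour < stop_hour.
    """
    return now.weekday() in workdays and start_hour <= now.hour < stop_hour


def weekly_savings_fraction(start_hour=8, stop_hour=20, workdays=(0, 1, 2, 3, 4)):
    """Share of the 168-hour week the instance is stopped."""
    running_hours = (stop_hour - start_hour) * len(workdays)
    return 1 - running_hours / 168


print(should_run(datetime(2024, 1, 8, 9, 0)))    # a Monday morning
print(should_run(datetime(2024, 1, 13, 9, 0)))   # a Saturday morning
print(f"compute hours saved: {weekly_savings_fraction():.0%}")
```

The saving applies to instance hours only; EBS volumes attached to stopped instances keep billing.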


Q2: Why is committing to Savings Plans or RIs so confusing?

Because predicting your baseline usage is part science, part art. Misjudging it either locks in waste or misses savings.

✅ Solution:

  • Default to Compute Savings Plans for flexibility.

  • Use zonal Reserved Instances (or On-Demand Capacity Reservations) only where you need guaranteed capacity; regional RIs give a discount but no capacity guarantee.

  • Monitor coverage vs utilization KPIs monthly and rebalance quarterly.
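The two KPIs pull in opposite directions, so it helps to compute them side by side. This is a simplified sketch: utilization is the share of the commitment you actually used, coverage is the share of eligible spend the commitment covered. The thresholds in `review` are illustrative, not AWS recommendations.

```python
def utilization(commitment_used, commitment_total):
    """Share of the purchased commitment that was actually consumed."""
    return commitment_used / commitment_total if commitment_total else 0.0


def coverage(covered_spend, on_demand_spend):
    """Share of eligible spend that ran under a commitment."""
    eligible = covered_spend + on_demand_spend
    return covered_spend / eligible if eligible else 0.0


def review(util, cov, min_util=0.95, target_cov=0.70):
    """Turn the two ratios into a next step (thresholds are assumptions)."""
    if util < min_util:
        return "over-committed: fix utilization before buying more"
    if cov < target_cov:
        return "room to commit more"
    return "hold"


util = utilization(commitment_used=950.0, commitment_total=1000.0)
cov = coverage(covered_spend=950.0, on_demand_spend=1050.0)
print(f"utilization {util:.0%}, coverage {cov:.1%}: {review(util, cov)}")
```

High utilization with low coverage is the comfortable place to be: every committed dollar is working, and there is headroom to add more.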


Q3: What's behind random slowdowns on burstable (T-family) instances?

CPU credits. In Standard mode, once the credit balance runs out the instance is throttled to its baseline, silently killing performance.

✅ Solution:

  • Monitor CPUCreditBalance via CloudWatch alarms.

  • Switch to Unlimited mode (with awareness of extra cost) or scale out horizontally.

  • Move sustained loads to M/C/R/Graviton families.
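The credit math is simple enough to forecast. One CPU credit is one vCPU running at 100% for one minute, so the sketch below estimates how long a balance lasts under sustained load. The t3.micro figures used in the example (2 vCPUs, 12 credits earned per hour) are quoted from memory of the AWS documentation and should be verified for your instance type.

```python
def hours_until_throttled(balance, vcpus, earn_per_hour, utilization):
    """Hours until the credit balance hits zero, or None if it never does.

    utilization is the average per-vCPU load as a fraction (0 to 1).
    """
    # One credit = one vCPU at 100% for one minute.
    spend_per_hour = vcpus * utilization * 60
    drain_per_hour = spend_per_hour - earn_per_hour
    if drain_per_hour <= 0:
        return None  # at or below baseline: the balance holds or grows
    return balance / drain_per_hour


# Assumed t3.micro figures: 2 vCPUs, 12 credits earned per hour.
print(hours_until_throttled(balance=144, vcpus=2, earn_per_hour=12, utilization=0.5))
print(hours_until_throttled(balance=144, vcpus=2, earn_per_hour=12, utilization=0.05))
```

A CloudWatch alarm on CPUCreditBalance set a few hours ahead of this estimate gives time to scale out or switch modes before throttling starts.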


Q4: Why do EBS volumes cause unpredictable slowness and high costs?

Older gp2 volumes tie IOPS to size, forcing over-provisioning for performance.

✅ Solution:

  • Migrate to gp3 (decouples size and performance).

  • Allocate precise IOPS/throughput.

  • For critical workloads, use io2 / io2 Block Express and enable EBS-optimized instances.
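To see why gp2 forces over-provisioning, compare what each volume type gives you for the same IOPS target. The sketch uses the published gp2 rule (3 IOPS per GiB, floor of 100, ceiling of 16,000) and the 3,000 IOPS that gp3 includes at any size; prices are left out on purpose because they vary by region, and the helper names are made up.

```python
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16000
GP3_INCLUDED_IOPS = 3000


def gp2_baseline_iops(size_gib):
    """Baseline IOPS a gp2 volume of this size delivers."""
    return max(GP2_MIN_IOPS, min(GP2_MAX_IOPS, GP2_IOPS_PER_GIB * size_gib))


def gp2_size_for_iops(target_iops):
    """Smallest gp2 size (GiB) whose baseline reaches target_iops."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("gp2 tops out at 16,000 IOPS")
    return max(1, -(-target_iops // GP2_IOPS_PER_GIB))  # ceiling division


def gp3_extra_iops(target_iops):
    """IOPS to provision on gp3 beyond the included baseline."""
    return max(0, target_iops - GP3_INCLUDED_IOPS)


data_gib, target = 500, 6000
print("gp2 baseline at 500 GiB:", gp2_baseline_iops(data_gib))
print("gp2 size needed for 6000 IOPS:", gp2_size_for_iops(target), "GiB")
print("gp3: keep 500 GiB, provision", gp3_extra_iops(target), "extra IOPS")
```

In this example gp2 needs four times the storage just to hit the IOPS target, while gp3 keeps the volume at the size the data actually requires.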


Q5: How does using the wrong instance family burn money?

Running compute-heavy workloads on general-purpose (M-family) instances or vice versa leads to underutilization or overpayment.

✅ Solution:

  • Let Compute Optimizer recommend the right family.

  • Benchmark using sysbench or internal metrics.

  • Try Graviton (ARM) instances for up to 40% better price-performance, after verifying compatibility.


Q6: Why does networking architecture silently inflate EC2 costs?

Cross-AZ chatter, poor placement, and hairpin NAT traffic increase latency and data transfer costs.

✅ Solution:

  • Group chatty microservices in cluster Placement Groups (these sit in a single AZ, so weigh the saving against availability).

  • Use gateway VPC Endpoints (S3, DynamoDB) to bypass NAT.

  • Deploy Global Accelerator or CloudFront for edge proximity.


Q7: Why do memory-heavy workloads (ML/analytics) overrun budgets?

Memory leaks and over-sized R-instances hide behind "just working" apps.

✅ Solution:

  • Choose R-family or Graviton memory-optimized instances.

  • Install the CloudWatch agent to collect memory metrics (EC2 does not publish them by default) and rightsize from those.

  • For AI workloads, add KV caching, quantization, or batching.


Q8: How can I safely use Spot Instances without chaos from interruptions?

Spot can save 70–90%, but interruptions kill unprepared apps.

✅ Solution:

  • Mix Spot + On-Demand in Auto Scaling Groups using attribute-based selection.

  • Implement checkpointing and handle 2-minute interruption notices.

  • Enable capacity rebalancing for smarter recovery.
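Handling the two-minute notice starts with reading it. The instance metadata path `spot/instance-action` returns a small JSON document once an interruption is scheduled; the sketch below parses that document and works out how much time is left. Fetching it from the metadata service (with an IMDSv2 token) is left out, and the function name is made up.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_json, now):
    """Return (action, seconds remaining) from an instance-action notice."""
    notice = json.loads(notice_json)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)  # the "Z" suffix means UTC
    return notice["action"], (deadline - now).total_seconds()


# Notice body in the documented format; `now` is fixed for the example.
notice = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
action, remaining = seconds_until_interruption(notice, now)
print(action, remaining)
```

Whatever the handler does next (checkpoint, deregister from the load balancer, drain the queue consumer) has to fit inside that remaining window.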


Q9: Why do self-managed databases on EC2 eat into cost savings?

DIY databases accumulate inefficiencies: missing indexes, old AMIs, I/O-heavy storage.

✅ Solution:

  • Audit queries using pg_stat_statements or the slow query log (Performance Insights is only available on RDS/Aurora).

  • Move to Amazon RDS/Aurora when possible.

  • For EC2 DBs: use gp3/io2, tune autovacuum, and monitor read/write IOPS.


Q10: Why does Auto Scaling waste resources or fail to respond fast enough?

Bad scaling signals or cooldowns cause over-provisioning or late scaling events.

✅ Solution:

  • Use Target Tracking policies with metrics like CPU, queue depth, or requests/sec.

  • Mix instance types with attribute-based selection and capacity rebalancing.

  • Add warm pools for near-instant scale-out.
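Target tracking is easier to reason about with the proportional rule behind it written out: scale capacity by the ratio of the observed metric to the target. The sketch below is an approximation for intuition only; the real service also applies warm-up periods and scales in more cautiously than it scales out. The function name is hypothetical.

```python
import math


def target_tracking_desired(current_capacity, metric_value, target_value,
                            min_size, max_size):
    """Approximate desired capacity that brings the metric back to target."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    desired = math.ceil(current_capacity * metric_value / target_value)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, desired))


# 4 instances at 80% CPU with a 50% target -> needs more capacity.
print(target_tracking_desired(4, 80, 50, min_size=2, max_size=20))
# 10 instances at 20% CPU with a 50% target -> can shed capacity.
print(target_tracking_desired(10, 20, 50, min_size=2, max_size=20))
```

If the metric does not move roughly in proportion to capacity (queue depth per instance does, raw queue depth does not), this rule and the policy built on it will over- or under-shoot.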


⚙️ Quick Wins

  • Migrate all gp2 → gp3 volumes.

  • Cover steady baselines with Savings Plans.

  • Implement instance scheduling for non-prod.

  • Pilot Graviton instances for up to 40% better price-performance.

  • Add Spot diversification and cost alarms for accountability.

