Cost Optimization Strategies for AWS ECS Fargate — Cut your bill in half
Introduction
In the ever-evolving landscape of cloud computing, managing costs while maintaining performance is a critical challenge — particularly for startups operating on lean budgets. AWS ECS Fargate provides a serverless container platform that removes the need to manage servers, but if you don’t optimize effectively, costs can quickly get out of hand.
This article offers an overview of cost-saving strategies for AWS ECS Fargate, from choosing the right region and networking setup to leveraging dynamic scaling and spot instances. Later articles will expand on each of these topics, providing deeper, step-by-step tutorials (complete with screenshots) for those who want a more hands-on guide.
Disclaimer: This is written from a lean, mean startup-machine perspective, running AWS in the EU, and is only an overview of the possibilities.
1. Understanding AWS Pricing Across Regions
1.1 Regional Pricing Variations
- Factors Influencing Pricing: AWS pricing varies by region due to factors like local infrastructure costs, taxes, and demand.
- Cost vs. Latency Trade-off: While latency is crucial, some applications can tolerate slight delays, making it worthwhile to consider less expensive regions.
Not all AWS regions offer the same feature set, so take that into account as well, in addition to regulatory requirements (for instance the GDPR in the EU).
1.2 Strategies for Region Selection
- Analyzing Workload Requirements: Identify applications where latency is less critical.
- Comparing Regional Costs: Use AWS’s pricing tools to compare costs across regions.
- Data Transfer Costs: Consider the impact of inter-region data transfer fees.
For a price comparison of EC2 instances per region, see: https://cloudprice.net/aws/regions
For a ping comparison grid between AWS regions check out: https://www.cloudping.co/grid
Select a region that meets your demands in terms of ping, features, regulatory requirements and pricing. For instance, Ireland (eu-west-1) would for many be a better choice than Frankfurt (eu-central-1) in terms of pricing in the EU, saving more than 10 % while keeping an acceptable ping for most of the area (depending on your use case).
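To make the comparison concrete, here is a small Python sketch comparing the monthly cost of an always-on 1 vCPU / 2 GB Fargate task in the two regions. The per-hour prices are illustrative approximations only; check the current AWS pricing pages yourself before deciding.

```python
# Rough Fargate cost comparison between two EU regions.
# Prices below are illustrative approximations, not live AWS prices.
PRICES = {  # region -> (USD per vCPU-hour, USD per GB-hour)
    "eu-west-1":    (0.04048, 0.004445),
    "eu-central-1": (0.04656, 0.005110),
}

def monthly_fargate_cost(region: str, vcpu: float, gb: float,
                         hours: float = 730) -> float:
    """Estimate the monthly cost of one always-on Fargate task."""
    vcpu_price, gb_price = PRICES[region]
    return (vcpu * vcpu_price + gb * gb_price) * hours

for region in PRICES:
    cost = monthly_fargate_cost(region, vcpu=1, gb=2)
    print(f"{region}: ~{cost:.2f} USD/month for a 1 vCPU / 2 GB task")
```

With these example prices the Ireland task comes out roughly 13 % cheaper per month, and the gap scales linearly with the number of always-on tasks.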
2. Optimizing Networking Costs
Many follow tutorials from sites like this (Medium) when setting up their AWS resources. There is nothing necessarily wrong with that, but security, price and so on are often not taken into consideration, since tutorial authors just delete the resources afterward anyway. I learnt that the hard way as a self-taught AWS man myself, so be aware.
Often they run ECS services in public subnets with public IPs. Public IPs incur additional charges, and you normally do not want your tasks directly reachable from the internet. The more cost-effective and secure way is to run tasks in a private subnet with a route to a NAT instance, then use a load balancer (an ALB, for instance) or API Gateway to expose your service to the internet.
2.1 Reducing IPv4 Address Expenses
- AWS Charges for Public IPv4: Public IP addresses incur additional costs.
- Disabling Public IPs for Tasks:
- Private Subnets: Run tasks in private subnets without public IPs.
- NAT Instances: Use NAT instances to allow outbound internet access.
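As a sketch, disabling public IPs is a single field in the service's awsvpc network configuration, as passed to `aws ecs create-service` or set in your infrastructure-as-code (the subnet and security group IDs below are placeholders):

```json
{
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": ["subnet-0aaa1111bbb22222c"],
      "securityGroups": ["sg-0ddd3333eee44444f"],
      "assignPublicIp": "DISABLED"
    }
  }
}
```

With `assignPublicIp` set to `DISABLED`, outbound traffic from the task follows the private subnet's route table, which is where the NAT instance comes in.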
I suggest AWS's brilliant blog article on the subject, which shows you how to check your IPv4 usage:
https://aws.amazon.com/blogs/networking-and-content-delivery/identify-and-optimize-public-ipv4-address-usage-on-aws/
I think many would be surprised by their unnecessary IPv4 usage, and the costs that follow. This hit us hard about a year ago when AWS decided to charge even more for IPv4 usage, so we now only use it where necessary, like load balancers (ALBs), NAT instances and bastion hosts. For internal communication inside the VPC there are cheaper alternatives.
2.2 Using NAT Instances Instead of NAT Gateways
- Cost Comparison:
- NAT Gateways: Managed service but can be costly with high data transfer.
- NAT Instances: Self-managed EC2 instances that can be more cost-effective.
- Setting Up NAT Instances:
- Launch EC2 Instance: Use a pre-configured AMI for NAT.
- Configure Routing: Update route tables to direct traffic through the NAT instance.
- Best Practices:
- Instance Size: Choose the right instance type based on bandwidth needs.
- High Availability: Use Auto Scaling and multiple instances for redundancy.
NAT instances are way cheaper than NAT Gateways, so consider them if being lean is a requirement. NAT Gateways scale automatically and are easier to maintain, but you probably need a lot of traffic before they make sense financially (and even then, you could often just use a larger EC2 instance type for your NAT instance). Follow these easy steps to create a NAT instance AMI you can use to create NAT instances:
https://docs.aws.amazon.com/vpc/latest/userguide/work-with-nat-instances.html#create-nat-ami
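A rough back-of-the-envelope comparison in Python makes the gap clear. The hourly and per-GB prices are illustrative approximations around eu-west-1 levels; check current AWS pricing before deciding.

```python
# Back-of-the-envelope: NAT Gateway vs. a small NAT instance.
# Prices are illustrative approximations, not live AWS prices.
HOURS_PER_MONTH = 730

def nat_gateway_cost(gb_processed: float,
                     hourly: float = 0.048, per_gb: float = 0.048) -> float:
    """NAT Gateway: hourly charge plus a per-GB data processing charge."""
    return HOURS_PER_MONTH * hourly + gb_processed * per_gb

def nat_instance_cost(hourly: float = 0.0047) -> float:
    """A small NAT instance (e.g. t4g.nano): hourly EC2 charge only.
    No per-GB processing fee, though normal EC2 data transfer
    rates still apply for traffic leaving AWS."""
    return HOURS_PER_MONTH * hourly

print(f"NAT Gateway, 100 GB/month: ~{nat_gateway_cost(100):.2f} USD")
print(f"NAT instance (t4g.nano):   ~{nat_instance_cost():.2f} USD")
```

Even before any data processing charges, the gateway's fixed hourly cost alone is roughly ten times the small instance's, which is the whole argument for the lean setup.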
3. Leveraging Spot Instances for Non-Critical Tasks
3.1 Overview of AWS Spot Instances
- How Spot Instances Work: Utilize unused AWS capacity at discounted rates, up to 70 % off On-Demand pricing.
- Interruption Handling: AWS can reclaim spot instances with a two-minute warning.
3.2 Implementing Spot Instances with ECS Fargate
- Suitable Use Cases:
- Staging Environments: Development and testing environments.
- Batch Processing: Tasks that can handle interruptions.
- Configuring Spot Capacity Providers:
- Create Capacity Provider: There are two options: set it up at the cluster level, creating a default strategy used for all services in the cluster, or set it up per service when creating a service. Select FARGATE_SPOT.
Unfortunately it is currently not possible to use Fargate On-Demand as a fallback when Spot capacity is unavailable without building a custom solution yourself; it has been a highly requested feature at AWS for years. Please give this a thumbs up: https://github.com/aws/containers-roadmap/issues/773
It may be a strategic decision from AWS though, since this feature would lower the risk for production services to use spot, making it a popular choice.
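As a sketch, a Spot-only strategy passed when creating a service could look like this (the shape is the same whether you set it at the cluster level or per service):

```json
{
  "capacityProviderStrategy": [
    { "capacityProvider": "FARGATE_SPOT", "weight": 1 }
  ]
}
```

With a single `FARGATE_SPOT` entry, every task in the service is placed on Spot capacity, which is exactly why the fault tolerance strategies below matter.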
- Fault Tolerance Strategies:
- Checkpointing: Save progress periodically.
- Auto Recovery: Design applications to restart gracefully.
Experiences with FARGATE_SPOT capacity provider strategy in staging environments
After using Spot in our test environment for a while, I would warn that services that use more than the minimum of CPU and RAM are more likely to be terminated when Spot capacity is not available. If that is a standalone service it might not be a problem, but for us it was our IAM solution, used for authentication in all services, so the impact was intolerable. We solved it by limiting its CPU and RAM in the test environment to the same values as our other tasks. Then AWS seems to randomly select which task to interrupt when Spot capacity is unavailable, which has worked out well for us.
Running all services on Spot could be problematic. I suggest you try it and make changes if interruption of tasks becomes a problem. The potential cost savings are huge.
4. Dynamic Scaling and Scheduling
4.1 Scaling Tasks to Zero During Off-Peak Hours
- Identifying Idle Services: Determine which services are not needed 24/7.
- Scheduled Events: Use CloudWatch Events or EventBridge to trigger functions.
- Scaling Actions: Lambda functions adjust the desired count of tasks.
- Implementation Steps:
- Write Lambda Functions: Scripts to scale services up or down.
- Set Up Schedules: Define cron expressions for scaling times.
We have closing hours in our test environment; there is no need to pay for tasks there when none of us are working. This can easily be handled with EventBridge, scaling your ECS tasks down to zero between, for instance, 22:00 and 06:00, and maybe to zero on weekends as well. I suggest you take this up with your team and find reasonable schedules. Remember, you pay for every hour a task runs, whether you use it or not. Time to eliminate that gap.
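As a sketch, the Lambda function itself can be a few lines of boto3; EventBridge invokes it on a cron schedule such as `cron(0 22 ? * MON-FRI *)` and passes the targets in the event. The cluster and service names are placeholders, and the optional `ecs` parameter exists only to make the function easy to test.

```python
def lambda_handler(event, context, ecs=None):
    # Example event from the EventBridge rule's input:
    # {"cluster": "staging", "services": ["api"], "desired_count": 0}
    if ecs is None:
        import boto3  # available by default in the AWS Lambda Python runtime
        ecs = boto3.client("ecs")
    for service in event["services"]:
        ecs.update_service(
            cluster=event["cluster"],
            service=service,
            desiredCount=event["desired_count"],
        )
    return {"scaled": event["services"], "to": event["desired_count"]}
```

A second EventBridge rule with a morning cron and `"desired_count": 1` (or whatever your normal count is) reuses the same function to scale back up.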
I have created an internal dashboard for this and am working on making it accessible via AWS Marketplace so it becomes a breeze for you guys as well. It enables an always-off strategy for acceptance test environments, since those activities do not necessarily happen all the time. You save a lot of money by being able to start a test environment, or a subset of tasks, with a click in a dashboard/app when you need it.
4.2 Auto Scaling Based on Demand
- ECS Service Auto Scaling:
- Scaling Policies: Based on CPU, memory, or custom CloudWatch metrics.
- Target Tracking: Maintain optimal resource utilization.
- Benefits:
- Cost Efficiency: Pay only for resources needed.
- Performance Optimization: Ensure sufficient capacity during peak times.
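For the target-tracking part, here is a sketch of the policy configuration you would pass to Application Auto Scaling (for instance via `aws application-autoscaling put-scaling-policy`); the 70 % target and the cooldown values are example numbers to tune for your workload:

```json
{
  "TargetValue": 70.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleInCooldown": 300,
  "ScaleOutCooldown": 60
}
```

A longer scale-in cooldown than scale-out cooldown, as sketched here, makes scaling react quickly to spikes but release capacity conservatively.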
If you need autoscaling to handle burst traffic, I suggest using a capacity provider strategy combining On-Demand and Spot. AWS supports that as long as at least one On-Demand task is running. AWS has written a blog article about this as well, with an example: https://aws.amazon.com/blogs/containers/optimizing-amazon-elastic-container-service-for-cost-using-scheduled-scaling/
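Such a combined strategy could look like this sketch: `base: 1` keeps one task on regular On-Demand Fargate, and the weights split the remaining tasks roughly 1:3 between On-Demand and Spot (the weights here are example values, not a recommendation):

```json
{
  "capacityProviderStrategy": [
    { "capacityProvider": "FARGATE",      "base": 1, "weight": 1 },
    { "capacityProvider": "FARGATE_SPOT", "base": 0, "weight": 3 }
  ]
}
```

The `base` task gives you a guaranteed On-Demand floor, while the Spot-weighted burst capacity keeps the scale-out cheap.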
If you have pretty stable traffic, you might be better off monitoring metrics like CPU and memory usage manually, or by utilizing CloudWatch alarms. If you are cost sensitive and afraid of costs skyrocketing out of control, I would also recommend this approach, even though you can set a maximum number of tasks and thereby control the worst-case auto scaling scenario.
5. Additional Cost Optimization Strategies
5.1 Resource Optimization
- Right-Sizing Tasks:
- AWS Compute Optimizer: a free-to-use feature of AWS; search for it in the console and activate it now. It gives great right-sizing recommendations for EC2 instances, ECS services, Lambda functions, EBS volumes, EC2 Auto Scaling groups and RDS databases.
A note about Compute Optimizer: I have found its recommendations on under-provisioned resources worth taking with a grain of salt. Check the resource's metrics yourself in CloudWatch and make your own decision before increasing spend (either by increasing the number of tasks or by increasing CPU/RAM). It seems like they want you to increase spend long before you actually need to. In my experience, Compute Optimizer hits a lot better with over-provisioned resources, where you overpay.
- Monitor Usage: Use CloudWatch to track CPU and memory.
I recommend utilizing CloudWatch Alarms, especially for critical services, for instance when CPU usage reaches 80 % or whatever value makes sense in your situation. Integrate with SNS topics to get an alert in your team's Slack channel, or just a direct email, or whatever fits your use case.
- Adjust Allocations: Modify task definitions to match actual needs.
- Container Density: Run multiple containers per task where appropriate.
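As a sketch of the alarm suggestion above, using a boto3 CloudWatch client: the alarm name, topic ARN and the 80 % threshold are examples, and the client is passed in as a parameter so the function is easy to test.

```python
def create_cpu_alarm(cloudwatch, cluster, service, topic_arn, threshold=80):
    """Alarm when a service's average CPU stays above the threshold
    for two consecutive 5-minute periods, notifying an SNS topic."""
    cloudwatch.put_metric_alarm(
        AlarmName=f"{service}-cpu-high",
        Namespace="AWS/ECS",
        MetricName="CPUUtilization",
        Dimensions=[
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=2,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[topic_arn],
    )
```

In real use you would call it with `boto3.client("cloudwatch")` and the ARN of an SNS topic that forwards to Slack or email.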
5.2 Using Savings Plans or Reserved Capacity
- AWS Savings Plans:
- Compute Savings Plans: Offer discounts for a committed usage over time.
- Evaluating Commitments: Analyze past usage to determine commitment levels.
- Reserved Capacity:
- Fargate Spot: Combine with Savings Plans for additional savings.
5.3 Container Optimization
- Minimize Image Sizes:
- Efficient Base Images: Use slim versions of base images. Alpine Linux can be a good choice here depending on your use case.
- Multi-Stage Builds: Reduce final image size by separating build and runtime dependencies.
- ARM Images: If you build ARM-based images, you can choose the ARM architecture when creating your ECS service. It saves you about 20 % per vCPU-hour (this can vary from region to region; the example is for eu-west-1).
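A minimal multi-stage Dockerfile sketch; a Go service is assumed here purely for illustration, and the `./cmd/server` path is a placeholder for your own entry point:

```dockerfile
# Build stage: full toolchain, never shipped to production
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY . .
RUN go build -o /app ./cmd/server

# Runtime stage: only the compiled binary on a slim base image
FROM alpine:3.20
COPY --from=build /app /app
ENTRYPOINT ["/app"]
```

Only the final stage is pushed to ECR, so compilers, caches and source code never inflate the image you pay to store and pull.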
5.4 ECR cleanup
You pay for old images in ECR; we cut our AWS bill by 5 % by cleaning up old builds. I suggest setting up lifecycle policies for your ECR repositories with reasonable values. We have set 180 days, so it keeps the latest images but deletes everything older. If you forget this when setting up your repositories, your ECR costs could become significant, since rapid deployments can produce a lot of images; gigabytes can become terabytes. You can see this on your AWS bill under ECR; it is basically money right out the window.
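Our 180-day rule can be expressed as a lifecycle policy like this sketch, attached to each repository (for instance with `aws ecr put-lifecycle-policy`):

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire images pushed more than 180 days ago",
      "selection": {
        "tagStatus": "any",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 180
      },
      "action": { "type": "expire" }
    }
  ]
}
```

Once attached, ECR applies the rule automatically; no cron jobs or manual cleanup needed.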
6. Monitoring and Continuous Improvement
6.1 Utilizing AWS Cost Management Tools
- AWS Cost Explorer: Analyze spending patterns and identify cost drivers.
This could be really useful to validate your setup. We had really high VPC costs due to unnecessary IPv4 charges, for instance. Always seek to understand what drives your costs.
- Billing Reports: Set up detailed reports for in-depth analysis.
6.2 Setting Up Cost Alerts and Budgets
- AWS Budgets: Define spending thresholds and receive alerts.
- Cost Anomalies Detection: Identify unexpected cost spikes.
6.3 Implementing Tagging for Cost Allocation
- Resource Tags: Assign tags to ECS tasks and services.
If you, like me, did not understand why AWS lets you tag all resources when you first started using it, I would highly recommend reconsidering. It is a really useful feature: tagging resources per team, per environment, per cost driver etc., whatever fits your use case. It is not just for cost allocation; tags can also be used in scripts, or in cost-saving apps from AWS Marketplace, for instance to schedule resources based on tags.
- Cost Allocation Reports: Break down costs by tags for granular insights.
7. Utilizing AWS Activate for Startups (credits, discounts etc.)
AWS's small business program via AWS Activate gives your startup 1 000 USD just to start you off, simply by submitting an application. And that is on top of the 12-month free tier for most services. In addition you can apply for up to 100 000 USD in credits (including those 1 000) through their partner program.
AWS Partner program
I have personal experience with partners Cloudvisor and Vestbee, cashing in an extra 10 000 USD worth of credits.
Cloudvisor also provides a WAFR (Well-Architected Framework Review), a free audit of your cloud infrastructure by certified AWS solution architects. You deploy a CloudFormation template giving them access to your resources for analysis, then work together to address the issues they discover via a shared Slack channel. You are free to choose which pillars to focus on: performance optimization, security, cost optimization etc. Really useful, and especially if you are self-taught, like me; learning best practices from experienced AWS consultants proved very valuable. When they are happy, you can feel confident about your AWS setup.
Vestbee required us to set up a profile on their site, showing off your investor case, helping you reach new investors etc. In addition to AWS (see https://aws.amazon.com/startups/offer), they also give you access to great discounts. Very useful for both parties in our case, in addition to the credits.
There are a lot of AWS partners that can grant significantly more in credits, but most of them are accelerator programs or incubators specialized in a field, which makes qualifying a bit more of a hassle. If you are already affiliated with an accelerator, check with them whether they are eligible to give you credits.
8. Conclusion
Optimizing costs on AWS ECS Fargate requires a multifaceted approach, balancing performance and expenses. By strategically selecting regions, optimizing networking, leveraging spot instances, implementing dynamic scaling, and continuously monitoring usage, businesses can significantly reduce their cloud expenditures. Regular reviews and adjustments are essential to maintain cost efficiency in the dynamic cloud environment.
Just let me know what you think in the comments. Is there any cost-saving subject you would like a more thorough guide on? Let me know, and I will see what I can do. I would also love to hear your cost-saving tips! I will start checking out the AWS Marketplace soon; there are many cost optimization solutions there that could be worth a look.