Spotify your AWS infrastructure and save cost - Part 1

Spot Fleet is an AWS managed service that manages a collection of Spot Instances and, optionally, On-Demand instances. In this post we’ll walk through the advantages of Spot Instances and the strategies an organization can follow to take full advantage of them and save significant infrastructure cost.

Spot Fleet makes it easier to manage a set of Spot Instances based on specific criteria. We can use Spot Fleet through the AWS console, the AWS API, or the AWS CLI. In part 2 of this article we will use the AWS CLI to launch a Spot Fleet and explore the various options available. In part 3 we will launch a Spot Fleet using Terraform.

To understand Spot Fleet, let’s first look at the difference between Spot and On-Demand instances.

On-Demand instance – As the name suggests, you launch an EC2 instance on demand and pay for compute capacity by the hour or by the second (minimum 60 seconds), with no long-term commitments.

Spot Instances – Spot works on a supply and demand model: AWS lets us use its unused compute capacity at a cheaper price, and the instances that run on this capacity are called Spot Instances. The major advantage of Spot Instances is that they are a lot cheaper than On-Demand instances. In many cases, using Spot Instances saves between 65% and 80% compared to On-Demand prices.

Let’s compare the key differences between On-Demand and Spot Instances.

| On-Demand Instance | Spot Instance |
| --- | --- |
| Pay by the hour or by the second, with a minimum of 60 seconds. The hourly price of an On-Demand instance is static | Pay per instance hour; however, the cost is up to 90% cheaper than On-Demand pricing. The hourly price of Spot Instances varies based on demand |
| AWS does not interrupt an On-Demand instance | Comes from unused compute capacity, which means AWS can interrupt and reclaim the Spot Instance after giving a 2-minute warning |
| Recommended for any kind of workload | Recommended for stateless, fault-tolerant, and flexible workloads. CI/CD, containerized workloads, and High Performance Computing (HPC) are some of the most common use cases |
| If capacity is not available when you make a launch request, you get an insufficient capacity error (ICE) | If capacity is not available, the Spot request automatically keeps making the launch request until capacity becomes available |

Let’s look at Spot pricing history to get an idea of Spot prices.

Picture 1: Spot pricing history

The graph above shows the Spot pricing history of the m4.xlarge instance type in different Availability Zones. It means that if we had been running our workload on m4.xlarge Spot Instances in us-east-1a for the last 3 months, our Spot Instances would have run without interruption while saving 70% of the cost.

How can organizations take advantage of Spot Instances?

Design your workloads to be fault tolerant

To take advantage of Spot Fleet, the most important thing is to design your application architecture to be fault tolerant, so that it can handle situations such as instance termination gracefully. At a minimum, implement the following strategies:

  • Design your application to handle instance termination gracefully

  • Design and implement your infrastructure automation scripts to be flexible in choosing instance types. This will allow you to use the instance types with the fewest interruptions

  • Use Elastic Load Balancing and Auto Scaling groups to distribute the load across EC2 instances

  • AWS provides a 2-minute instance interruption notice, so handle the instance termination CloudWatch events with custom logic that orchestrates workloads around any potential interruption

  • Use connection draining to remove instances from the load balancer gracefully, allowing in-flight requests a maximum of 2 minutes to finish their work

  • Deploying your application across many instance types will further enhance availability. Spot Fleet makes diversification across multiple instance types and Availability Zones easier
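
As an illustration of the custom logic mentioned in the list above, here is a minimal Python sketch. It assumes the interruption notice arrives as a CloudWatch Events event whose detail-type is "EC2 Spot Instance Interruption Warning" and whose detail carries the instance ID. The function names and the drain callback are hypothetical, and the sample instance ID is made up.

```python
from typing import Callable, Optional

# detail-type used by the Spot interruption warning event.
INTERRUPTION_DETAIL_TYPE = "EC2 Spot Instance Interruption Warning"


def interrupted_instance_id(event: dict) -> Optional[str]:
    """Return the instance ID if the event is a Spot interruption warning, else None."""
    if event.get("detail-type") != INTERRUPTION_DETAIL_TYPE:
        return None
    return event.get("detail", {}).get("instance-id")


def handle_event(event: dict, drain: Callable[[str], None]) -> bool:
    """Call drain(instance_id) for interruption warnings and report whether it ran."""
    instance_id = interrupted_instance_id(event)
    if instance_id is None:
        return False
    drain(instance_id)  # hypothetical hook: deregister from the load balancer, checkpoint work, ...
    return True


# Sample event shaped like an interruption warning (the instance ID is made up).
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
}

drained = []  # stands in for real draining logic
handle_event(sample_event, drained.append)
```

In a real setup the drain callback would have to finish its work inside the 2-minute window, which is why connection draining on the load balancer is capped at the same duration.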

Plan your workloads

The table below summarizes some of the workloads that can take advantage of Spot Instances.

| Workload Type | Strategy |
| --- | --- |
| CI/CD | Use Spot Fleet to manage compute capacity for your CI/CD workloads. If you are using Jenkins, you can take advantage of the Jenkins plugin for Spot Fleet to run EC2 Spot Instances as worker nodes |
| Test environments | Use Spot Instances to run your test environments. Organizations that design their infrastructure automation to be flexible enough to run the entire test environment on Spot Instances can save between 60% and 70% of the cost compared to On-Demand pricing |
| Analytics | One of the best use cases for Spot Instances is running complex analytics workloads at a lower cost. For example, run a Monte Carlo simulation across multiple Spot Instances to scale to a massive amount of capacity at the best possible price |
| Stateless microservices | Stateless microservices can use Spot Instances for their scaling requirements. Use Spot Fleet with the diversified strategy to manage the capacity of your microservices, and Spot Fleet will maintain that capacity for you using Spot Instances. You can also use Spot Fleet with a mix of On-Demand and Spot Instances to maintain a minimum capacity while using Spot Instances for any additional spikes, saving significant cost |
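
To get a feel for the mixed On-Demand and Spot approach from the last row, here is a rough Python sketch of the hourly cost. All numbers are hypothetical (an On-Demand price of $0.20 per hour and a 70% Spot discount), and the function name is ours, not an AWS API.

```python
def blended_hourly_cost(on_demand_count: int, spot_count: int,
                        on_demand_price: float, spot_discount: float) -> float:
    """Hourly cost of a fleet that mixes On-Demand and Spot capacity.

    spot_discount is the fraction saved versus On-Demand (0.70 means 70% cheaper).
    """
    spot_price = on_demand_price * (1 - spot_discount)
    return on_demand_count * on_demand_price + spot_count * spot_price


# Hypothetical fleet: 4 On-Demand instances as the guaranteed minimum,
# 6 Spot Instances to absorb spikes.
mixed = blended_hourly_cost(4, 6, on_demand_price=0.20, spot_discount=0.70)
all_on_demand = blended_hourly_cost(10, 0, on_demand_price=0.20, spot_discount=0.70)
saving = 1 - mixed / all_on_demand

print(f"mixed: ${mixed:.2f}/h, all On-Demand: ${all_on_demand:.2f}/h, saving: {saving:.0%}")
```

With these made-up numbers the mixed fleet costs $1.16 per hour instead of $2.00, a saving of 42%, while 40% of the capacity is never subject to interruption.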

Use AWS Spot Instance Advisor

Use the AWS Spot Instance Advisor to stay informed about instance interruption frequency and cost savings.

Let’s look at the top 10 instances on the AWS Spot Instance Advisor.

Picture 2: Spot instance advisor top 10 Spot instances

The important things to note in the picture above are the last column, “Frequency of interruption”, which is <5% for the instance types listed in the first column, and the expected cost savings in the 4th column. The recommended approach is to start with instance types that have a lower frequency of interruption and to add more instance types as you improve your application’s flexibility and fault tolerance. See the Best Practices section for more tips and tricks.

Key Spot fleet concepts

Spot Fleet request – There are two types of Spot Fleet request: one-time and maintain. With a one-time request, Spot Fleet places the required requests once and does not attempt to replenish Spot Instances if capacity is diminished. This also means that if capacity is not available, Spot Fleet does not submit requests in alternative Spot pools. With a maintain request, Spot Fleet manages the target capacity over time and automatically replenishes any interrupted instances.
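
The sketch below shows how the two request types might look in a request configuration. It builds a Python dict shaped like the `SpotFleetRequestConfig` argument of boto3’s `request_spot_fleet`, where a one-time request is expressed as `Type: "request"`. The helper name, role ARN, and AMI ID are placeholders, and the actual API call is left as a comment so the snippet runs without AWS credentials.

```python
# Maps the terms used in this article to the values the API expects.
REQUEST_TYPES = {"one-time": "request", "maintain": "maintain"}


def spot_fleet_config(request_type: str, target_capacity: int) -> dict:
    """Build a minimal Spot Fleet request configuration (hypothetical helper)."""
    if request_type not in REQUEST_TYPES:
        raise ValueError(f"unknown request type: {request_type!r}")
    return {
        # Placeholder role ARN: replace with a real Spot Fleet role.
        "IamFleetRole": "arn:aws:iam::123456789012:role/example-spot-fleet-role",
        "TargetCapacity": target_capacity,
        "Type": REQUEST_TYPES[request_type],
        "LaunchSpecifications": [
            # Placeholder AMI ID: replace with a real image.
            {"ImageId": "ami-EXAMPLE", "InstanceType": "m4.xlarge"},
        ],
    }


one_time = spot_fleet_config("one-time", target_capacity=10)
maintain = spot_fleet_config("maintain", target_capacity=10)

# Submitting the request would look roughly like this (needs boto3 and credentials):
#   import boto3
#   boto3.client("ec2").request_spot_fleet(SpotFleetRequestConfig=maintain)
```

Only the `Type` field differs between the two configurations; everything about replenishing interrupted capacity follows from that one setting.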

Spot Instance pool – A set of unused EC2 instances with the same instance type (for example, m4.xlarge), operating system, Availability Zone, and network platform.

Allocation strategy – Spot Fleet uses the allocation strategy specified in your Spot Fleet request to fulfill the desired capacity. The following allocation strategies can be specified in the Spot Fleet request:

lowestPrice – Allocates Spot Instances from the instance pool with the lowest price. This is the default strategy.

diversified – Distributes Spot Instances across all the instance pools specified in the request, with the exception of those where the current Spot price is above the On-Demand price.

The allocation strategy gives us the flexibility to choose how Spot Instances are allocated according to our goals. The table below summarizes the guidelines.

| lowestPrice | diversified |
| --- | --- |
| Suitable for small fleets with fewer instances | Suitable for large fleets with hundreds of instances |
| Impacted by changes in the Spot pool price. For example, you might pay a high price if the Spot pool price increases | Spot Fleet diversifies the allocation across multiple instance pools, so the overall cost is protected from Spot price fluctuations and most of the time averages around 65%-75% off On-Demand pricing |
| The entire fleet can be interrupted and then need replenishment (for example, when the Spot price becomes higher than the On-Demand price or capacity is not available) | Only a fraction of the fleet (1/Nth of total capacity) is subject to possible interruption and subsequent replenishment |
| Suitable for short-lived applications that are not time sensitive, such as scientific simulations and research computations | Suitable for long-running, time-sensitive applications such as front-facing web servers, CI/CD, HPC, and transcoding |
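
The difference in interruption exposure can be illustrated with a simplified Python model of the two strategies. This is not how Spot Fleet is implemented internally. It only shows the idea that lowestPrice concentrates capacity in one pool while diversified spreads it across N pools. The pool names and prices are made up.

```python
from typing import Dict


def allocate(pool_prices: Dict[str, float], target: int, strategy: str) -> Dict[str, int]:
    """Split `target` instances across Spot pools (simplified model, not AWS's algorithm)."""
    if strategy == "lowestPrice":
        cheapest = min(pool_prices, key=pool_prices.get)
        return {cheapest: target}
    if strategy == "diversified":
        pools = sorted(pool_prices, key=pool_prices.get)  # cheapest first
        base, extra = divmod(target, len(pools))
        # Any remainder goes to the cheapest pools (an assumption of this sketch).
        return {pool: base + (1 if i < extra else 0) for i, pool in enumerate(pools)}
    raise ValueError(f"unknown strategy: {strategy!r}")


def worst_case_loss(allocation: Dict[str, int]) -> float:
    """Fraction of the fleet lost if the single largest pool is interrupted."""
    return max(allocation.values()) / sum(allocation.values())


# Hypothetical pools (instance type / Availability Zone) with made-up hourly prices.
pools = {
    "m4.xlarge/us-east-1a": 0.06,
    "m4.xlarge/us-east-1b": 0.07,
    "c4.xlarge/us-east-1a": 0.05,
    "r3.xlarge/us-east-1b": 0.08,
}

lowest = allocate(pools, 100, "lowestPrice")
diversified = allocate(pools, 100, "diversified")
print(worst_case_loss(lowest), worst_case_loss(diversified))
```

With four pools, losing one pool takes out the whole lowestPrice fleet but only a quarter of the diversified fleet, which is the 1/Nth exposure described in the table.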

Spot Fleet instance weighting – Instance weighting is another feature you can use when you need capacity expressed in units such as the number of vCPUs or the amount of memory. By default, the Spot price you specify is per instance hour. When you use instance weighting, you specify the price per unit hour instead. Using the instance weighting feature of Spot Fleet requires the following:

  1. Set the target capacity when requesting a Spot Fleet. The target capacity can be expressed in number of instances or in other units such as vCPUs, memory, or storage.
  2. Set the price per unit. This can be calculated by dividing the price per instance hour by the number of units that the instance type represents.
  3. Specify the instance weight in the Spot Fleet launch configuration.

The following table provides examples of calculations to determine the price per unit for a Spot Fleet request with a target capacity of 10. We’ll discuss these in detail in the next part of this article.

| Instance type | Instance weight | Price per instance hour | Price per unit hour | Number of instances launched |
| --- | --- | --- | --- | --- |
| r3.xlarge | 2 | $0.05 | $0.025 ($0.05 divided by 2) | 5 (10 divided by 2) |
| r3.8xlarge | 8 | $0.10 | $0.0125 ($0.10 divided by 8) | 2 (10 divided by 8, result rounded up) |
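
The same arithmetic can be written as a short Python sketch. The function names are ours, and the inputs are the two example rows above for a target capacity of 10 units.

```python
import math


def price_per_unit_hour(price_per_instance_hour: float, weight: int) -> float:
    """Price per unit hour = price per instance hour divided by the instance weight."""
    return price_per_instance_hour / weight


def instances_to_launch(target_capacity: int, weight: int) -> int:
    """Number of instances needed to reach the target capacity, rounded up."""
    return math.ceil(target_capacity / weight)


TARGET_CAPACITY = 10  # units, as in the example above

# (instance type, instance weight, price per instance hour in USD)
examples = [("r3.xlarge", 2, 0.05), ("r3.8xlarge", 8, 0.10)]

for instance_type, weight, price in examples:
    unit_price = price_per_unit_hour(price, weight)
    count = instances_to_launch(TARGET_CAPACITY, weight)
    print(f"{instance_type}: ${unit_price:.4f} per unit hour, {count} instances")
```

Note the rounding up: two r3.8xlarge instances provide 16 units, so the fleet can end up above the requested capacity of 10.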