Lyft Increases Simulation Capacity, Lowers Costs Using Amazon EC2 Spot Instances

About Lyft Level 5

Lyft, one of the largest transportation networks in the United States and Canada, is on a mission: improve people’s lives with the world’s best transportation. Along with its focus on shared rides, bike-share systems, electric scooters, and public transit partnerships, Lyft launched its Level 5 autonomous vehicle (AV) division in 2017 as part of its effort to achieve this mission. Using petabytes of data gathered from its AV fleet, Lyft’s engineers run millions of simulations each year to improve the performance and safety of its self-driving system.

But those simulations are compute-intensive, and Lyft knew it would need massive computing power that could scale up and down at an affordable price. The company, which has been using Amazon Web Services (AWS) for its rideshare platform since the day it launched in 2012, turned to AWS again to boost its compute capacity and lower costs, ultimately choosing a combination of Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances and Amazon Elastic Kubernetes Service (Amazon EKS) for its AV simulation workload.

Running Simulations on Amazon EC2 Spot Instances

Running simulations on thousands of graphics processing units (GPUs) in parallel is critical to Level 5’s success in testing and improving how AVs respond to various driving situations. “Simulation is one of the key ways we improve the safety of our software before it goes anywhere—even a test track,” says Timothy Perrett, senior staff engineer at Lyft Level 5. Exploring the simulation space (such as varying the speed, position, or vehicle dynamics) requires repeated testing and thus a lot of computing flexibility.

Early on, it was clear that Level 5 would have very different computing needs than Lyft’s rideshare business. “Level 5 has different needs and constraints,” says Perrett. “Most of our computing needs are in servicing large, batch-style workloads that have a very spiky profile. We need the ability to burst up to high peak loads and then quickly turn everything down when we’re not using it.”

Lyft could have invested in on-premises central processing units and GPUs, but the Lyft team’s prior experience on AWS made the AWS Cloud its first choice. So the testing began. Level 5 engineers started by utilizing capacity from Amazon EC2 On-Demand Instances, in conjunction with Amazon EKS, the fully managed Kubernetes service that AWS offers.

After experimenting with running simulations using On-Demand Instances, Lyft’s Level 5 team quickly realized it could improve efficiency and reduce costs by shifting to Amazon EC2 Spot Instances. Now more than 90 percent of the simulations run on Amazon EC2 Spot Instances, including Amazon EC2 P3 Instances powered by NVIDIA V100 Tensor Core GPUs. This enables Lyft to take advantage of unused Amazon EC2 capacity in the AWS Cloud at up to a 90 percent discount compared to On-Demand pricing. “When we experimented with running on Amazon EC2 Spot Instances, we realized that as our program was growing quickly, there was an opportunity to significantly reduce our operational costs,” says Perrett.

Enabling Simulations to Run Efficiently

The Level 5 team distributes its simulation workload in what Perrett calls a “clever dance” to ensure that simulations still run even when Amazon EC2 Spot Instances aren’t available because of high demand. Engineering staff observed which clusters—and pools within those clusters—operated efficiently and took into account regional zone usage. “We became smarter about how we allocate work and how we relocate jobs in a given resource pool on a given day,” Perrett notes. The team used Amazon EKS to prioritize and scale resource pools so jobs were efficiently using instances.

The engineering team was also careful to design systems so that simulations would function on a variety of hardware, depending on what was available—something Lyft calls fleet diversity. Perrett explains, “We put a lot of effort into making our stack work on whichever type of instance is available—Amazon EC2 P3 Instances versus the Amazon EC2 P2 Instances, for example.” This flexibility helps Level 5 engineers avoid having to wait to schedule simulations, even when demand is high.
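The allocation idea described above (prefer Spot pools, accept whichever instance type has room, and spill over only when needed) can be illustrated with a minimal Python sketch. This is not Lyft's scheduler: the `Pool` class, the `allocate` function, the pool names, and the slot counts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A hypothetical resource pool: one instance type in one purchase mode."""
    name: str        # illustrative label only, e.g. "spot-p3"
    free_slots: int  # how many jobs the pool can accept right now


def allocate(num_jobs: int, pools: list[Pool]) -> dict[str, int]:
    """Greedily place jobs into pools in priority order.

    Pools earlier in the list are preferred; jobs spill over to later pools
    only when the earlier ones are full. Jobs that cannot be placed anywhere
    are reported under the key "unscheduled".
    """
    placement: dict[str, int] = {}
    remaining = num_jobs
    for pool in pools:
        if remaining == 0:
            break
        taken = min(remaining, pool.free_slots)
        if taken > 0:
            placement[pool.name] = taken
            remaining -= taken
    if remaining > 0:
        placement["unscheduled"] = remaining
    return placement


# Assumed priority order: Spot pools first, On-Demand as the last resort.
pools = [
    Pool("spot-p3", free_slots=600),
    Pool("spot-p2", free_slots=300),
    Pool("on-demand-p3", free_slots=200),
]
print(allocate(1000, pools))
# → {'spot-p3': 600, 'spot-p2': 300, 'on-demand-p3': 100}
```

Because the simulation stack runs on more than one instance type, a shortage in one pool only shifts work to the next pool instead of delaying it.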

Lyft also has to manage a massive amount of data gathered from simulations and from its AV fleet. It uses Amazon Simple Storage Service (Amazon S3) to store and access a dataset that keeps expanding as Lyft increases the number of sensors on its test vehicles. The information gathered from its AVs and simulations amounts to petabytes of data, and transferring that much data directly to the cloud, as the Level 5 team did in the early days, was costly. To reduce that cost, Lyft uses AWS Direct Connect, a dedicated network connection between its Level 5 engineering center and its cloud systems. “We have a very high-capacity network that connects to the places where we operate our AV fleet,” Perrett notes. “And then we upload the data for a much lower cost per petabyte.”

By carefully partitioning and directing its simulation traffic on Amazon EC2 Spot Instances, Lyft’s Level 5 engineering team reduced the cost of simulations to just pennies for each execution. “About 77 percent of our computing fleet across all Level 5 workloads—and over 90 percent of our AV simulation workload—is now on Amazon EC2 Spot Instances, and the cost savings overall has been around two-thirds,” says Perrett. “We were able to scale up our computing capacity significantly while reducing the overall cost of operation.”
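As a rough sanity check on these figures, the short Python sketch below relates the share of a fleet on Spot Instances to the overall savings. It rests on simplifying assumptions that the case study does not state: the baseline is an all-On-Demand fleet, the 77 percent share is measured in On-Demand-equivalent spend, and everything not on Spot pays the full On-Demand price. The function names are hypothetical.

```python
def blended_savings(spot_fraction: float, spot_discount: float) -> float:
    """Overall savings versus an all-On-Demand baseline.

    Assumes the non-Spot share of the fleet pays full On-Demand price.
    """
    return spot_fraction * spot_discount


def implied_discount(spot_fraction: float, overall_savings: float) -> float:
    """Average Spot discount needed to reach a given overall savings."""
    return overall_savings / spot_fraction


# Figures quoted in the text: about 77 percent on Spot, about two-thirds saved.
print(round(implied_discount(0.77, 2 / 3), 2))  # → 0.87
# Upper bound if every Spot Instance received the maximum 90 percent discount.
print(round(blended_savings(0.77, 0.90), 2))    # → 0.69
```

Under these assumptions, savings of about two-thirds correspond to an average Spot discount in the high-80-percent range, which is consistent with the stated maximum of 90 percent.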

Benefits of AWS

● Reduced compute costs by two-thirds
● Scaled up computing capacity significantly
● Increased velocity of development for AVs

AWS Services Used

Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers.

Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS Cloud. Spot Instances are available at up to a 90 percent discount compared to On-Demand prices. You can use Spot Instances for various stateless, fault-tolerant, or flexible applications.

Amazon EKS is a fully managed Kubernetes service. EKS runs upstream Kubernetes and is certified Kubernetes conformant, so you can use the open source tooling from the community.

Conclusion

Running millions of simulations at steep cost savings on AWS allows Lyft’s engineering team to run its tests from inside its offices, enabling staff to gain confidence in software changes prior to taking physical vehicles out in the real world. “Simulations are a more cost-effective means of validating software changes compared to taking a vehicle to the test track,” Perrett says. “This improves iteration time for engineering staff and helps improve safety and software quality on a shorter time horizon.”

Switching from On-Demand Instances to Spot Instances is one way for businesses to lower operating costs while keeping their workloads running. If your Amazon EC2 costs are a concern, consider Spot Instances for fault-tolerant or flexible workloads. As the Lyft case shows, the savings can be substantial.

About VTI Cloud

VTI Cloud is an Advanced Consulting Partner of AWS in Vietnam, with a team of more than 50 AWS-certified solution engineers. With the desire to support customers in their digital transformation journey and their move to the AWS Cloud, VTI Cloud is proud to be a pioneer in solution consulting, software development, and deployment of AWS infrastructure for customers in Vietnam and Japan.

Building secure, high-performance, flexible, and cost-optimized architectures for customers is VTI Cloud’s primary mission in enterprise technology.

Reference: https://aws.amazon.com/vi/solutions/case-studies/Lyft-level-5-spot/?nc1=f_ls