12 min read
Load testing streaming video at scale, part 1: Introducing the testbed and cloud instance evaluation
The cloud has become a pillar for the deployment of many streaming video services, which often run different functions in different parts of the network. Consider a typical Video on Demand (VOD) setup, for example, with a library of media assets hosted on remote object storage, dynamically packaged by a cluster of origins that has a load balancer in front of it and is protected by a shield cache, with the generated stream served out through a CDN.
Unfortunately, building your media delivery workflow and choosing the right components can become an overwhelming task, as many things need to be taken into account:
- What kind of hardware do I need?
- What media processing function should I choose (e.g., media content re-packaging, manifest generation, trans-rating)?
- What media format from the “jungle” of media specifications should I use?
- What kind of metrics do my third-party systems need for interoperability alignment?
- What is the maximum response time client devices are able to handle?
- … just to name a few.
These questions are crucial to answer, but currently there is no standardized method for testing distributed media workflows. Without standardized testing, what will you base your answers on?
That’s why this blog introduces a testbed to evaluate different setups. It takes a cloud-first approach and focuses on Real-Time Distributed Media Processing Workflows (DMPW) based on MPEG’s Network Based Media Processing (NBMP) standard.
The testbed’s objective is to reduce testing complexity when deploying distributed media processing in the cloud and to understand the minimum requirements of your setup.
NBMP
Network-based media processing refers to media processing that is carried out at different points in the network. Consider transcoding happening at a different point than the packaging and encryption, for example. MPEG initiated a standardization effort (ISO/IEC 23090-8:2020) around this approach, to specify media formats, workflow descriptors, APIs and a reference architecture for network- and cloud-based processing of media. The NBMP standard targets use cases such as delivery of immersive media, real-time processing and large-scale streaming.
We introduce this standard as a key enabler for testing and deploying distributed media processing workflows at scale with a standards-based configuration and deployment. The testbed we developed enables the careful study of different deployment and configuration aspects, like:
- Choice of instance type
- Choice of streaming protocol
- Benefit of edge-based deployment
Test setup
The specific use case of interest is real-time (on-request) distributed media processing workflows for online media delivery over HTTP(S). Contrary to batch processing, which happens “offline”, a client’s request for a new segment needs a response that is dynamically generated in near real-time (<400 ms), and many such requests may occur concurrently (scale).
Beyond measuring performance, the framework targets cost savings and horizontal scaling policies by automatically detecting whether a specific deployment configuration can support a certain workload.
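As a rough illustration of the kind of pass/fail criterion such a framework can automate, the sketch below marks a deployment as able to "support" a workload when no requests fail and response times stay within the near real-time budget mentioned above. The function name and inputs are illustrative assumptions, not the framework's actual implementation.

```python
# Illustrative sketch only: a simple pass/fail check for a tested configuration.
import math

def supports_workload(response_times_ms, failures, budget_ms=400.0):
    """Return True if the tested configuration handled the workload."""
    if failures > 0 or not response_times_ms:
        return False
    # 95th-percentile response time (nearest-rank method).
    ordered = sorted(response_times_ms)
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    return p95 <= budget_ms

# Example: five sampled response times (ms), no failed requests.
print(supports_workload([120, 180, 220, 390, 250], failures=0))  # True
```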
Enough of specs … show me the thing!
Figure 1 illustrates the NBMP-based testbed integration. Each component is described as follows:
- NBMP Source: Component in charge of triggering the Workflow Manager using the NBMP Workflow APIs (a minimal sketch of this trigger follows after this list).
- Workflow Manager: Instantiates and schedules media processing tasks and monitors each step of the DMPW.
- Function Repository: Contains the media processing images that the DMPW uses to process the media. It can use either publicly available images (e.g., through Docker Hub) or personalized local software images.
- Media Source: Contains the source media files. The testbed supports both live generated and stored media sources.
- Media Processing Entity (MPE): Responsible for running the Media Processing Function and executing NBMP tasks.
- Media Processing Function (MPF): origin function capable of completing one or more different processing tasks in real-time upon a client request. Some supported functions include:
- Media content re-packaging
- Manifest generation
- Segment (re-)encryption
- Trans-rating
- Media Sink (Workload Generator): Function that emulates video streaming requests and can be used to test Video On Demand (VOD) and live streaming setups.
- Data Storage & Visualization: The NBMP testbed uses an OpenTSDB database to store metrics resulting from executed tests, and Grafana to visualize these metrics.
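To make the NBMP Source's role more concrete, below is a minimal, heavily simplified sketch of how it could trigger the Workflow Manager by POSTing a workflow description to the NBMP Workflow API. The endpoint URL and the JSON fields shown are illustrative assumptions; the normative Workflow Description Document format and API are defined in ISO/IEC 23090-8.

```python
# Illustrative sketch only: the endpoint and the (heavily simplified) document
# below are assumptions, not the normative NBMP Workflow Description format.
import requests

WORKFLOW_API = "http://workflow-manager.example.com/v1/workflows"  # assumed endpoint

workflow_description = {
    "general": {"name": "vod-repackaging", "description": "DASH/HLS packaging on request"},
    "input": {"media-parameters": [{"protocol": "http", "location": "http://storage.example.com/sita/"}]},
    "processing": {"function": "unified-origin"},  # MPF image resolved via the Function Repository
    "output": {"media-parameters": [{"protocol": "http"}]},
}

response = requests.post(WORKFLOW_API, json=workflow_description)
print(response.status_code)
```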
Media Source
The testbed supports both live and stored media sources.
For our live media source we emulate a live encoder by producing a CMAF bitrate ladder on-the-fly, using HTTP POST requests to transmit the resulting media segments to the Media Processing Entity, as specified by the DASH-IF Live Media Ingest Protocol.
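As an illustration, the sketch below shows the basic shape of this push-based ingest: each generated CMAF segment is transmitted to the Media Processing Entity with an HTTP POST. The ingest URL and segment naming are hypothetical placeholders, not the exact path conventions of the DASH-IF specification.

```python
# Hypothetical ingest URL and file names, for illustration only.
import requests

INGEST_URL = "http://mpe.example.com/live/channel1.isml"  # assumed publishing point

# The init segment is posted first, followed by each media segment as it is produced.
for segment in ["4000k-init.cmfv", "4000k-00001.cmfv", "4000k-00002.cmfv"]:
    with open(segment, "rb") as f:
        requests.post(f"{INGEST_URL}/{segment}", data=f)
```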
For our VOD tests we use the stored media source detailed in Table 1. It is based on the publicly available Sita Sings the Blues content. The video is encoded with a GOP size of 48 frames at 24 fps, resulting in a fragment duration of two seconds and a total duration of 4891 seconds. Each media track was packaged into a CMAF media container and stored as a CMAF track on cloud-based storage that is accessible over HTTP(S) and supports byte-range requests.
| Track | Resolution | Bitrate (kbps) | Codec | Profile |
| --- | --- | --- | --- | --- |
| 600k.cmfv | 320x180 | 592 | avc1 | High@2.0 |
| 1200k.cmfv | 480x270 | 1176 | avc1 | High@2.1 |
| 2000k.cmfv | 960x540 | 1954 | avc1 | High@3.1 |
| 3000k.cmfv | 1920x1080 | 2919 | avc1 | High@4.0 |
| 4000k.cmfv | 1920x1080 | 3894 | avc1 | High@4.0 |
| eng.cmfa | N/A | 132 | aac | LC |
Table 1: Encoding specifications of Media Source: Sita Sings the Blues
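Because the packager only needs the bytes belonging to the requested fragment, the storage is accessed with HTTP byte-range requests rather than full-track downloads. Below is a minimal sketch of such a request; the bucket URL and byte offsets are made-up placeholders.

```python
# Placeholder URL and byte range, for illustration only.
import requests

track_url = "https://example-bucket.s3.amazonaws.com/sita/4000k.cmfv"  # assumed storage location
headers = {"Range": "bytes=0-524287"}  # fetch only the first 512 KiB of the track

response = requests.get(track_url, headers=headers)
print(response.status_code)   # 206 Partial Content when byte ranges are supported
print(len(response.content))  # number of bytes actually returned
```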
Workload Generator
The Workload Generator is based on Locust, an open-source tool for load testing of web applications. Locust uses an event-driven model, which allows it to emulate a large number of workers without needing a large number of threads. User behavior for load testing is defined through plain Python scripts.
For our experiments, the selected load generation configuration is a ramp-up from zero to 50 concurrent workers.
The workers behave as follows for VOD:
- For MPEG-DASH the MPD is requested first, then all audio and video segments are requested sequentially
- For HLS the Master Playlist is requested first, then the Media Playlists (.m3u8), and then all audio and video segments are requested sequentially
For both types of clients, the next segment is requested upon completion of the prior request.
For live streaming, DASH and HLS clients behave as described above, except that they also refresh the MPD and Media Playlists periodically.
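The sketch below shows what such a worker can look like as a Locust user class, following the VOD DASH behavior described above. The host, MPD path, segment naming and segment count are hypothetical placeholders rather than the testbed's actual scripts.

```python
# Minimal Locust sketch of a DASH VOD worker; URLs and counts are illustrative.
from locust import HttpUser, task, constant

NUM_SEGMENTS = 2446       # assumption: ~4891 s of content in 2 s fragments
VIDEO_TRACK = "4000k"     # assumption: highest video rung from Table 1
AUDIO_TRACK = "eng"

class DashVodUser(HttpUser):
    # Fire the next request as soon as the previous one completes.
    wait_time = constant(0)

    @task
    def play_stream(self):
        # The MPD is requested first ...
        self.client.get("/vod/sita.ism/.mpd", name="MPD")
        # ... then all audio and video segments are requested sequentially.
        for i in range(1, NUM_SEGMENTS + 1):
            self.client.get(f"/vod/sita.ism/dash/{AUDIO_TRACK}-{i}.cmfa", name="audio segment")
            self.client.get(f"/vod/sita.ism/dash/{VIDEO_TRACK}-{i}.cmfv", name="video segment")
```

Such a script is started with Locust's command-line interface, pointing it at the host under test and the desired number of concurrent workers.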
Upon any request, the media processing workflow is triggered to produce the target output and serve it as a response to the client. After receiving a response, a worker immediately fires off the next request, meaning that the lower the response times, the more requests will be triggered.
The workers in this test setup therefore introduce more load than a “normal” user watching a stream, whose player stops its very quick ramp-up of requests once its buffer has been filled.
Figure 2 provides an illustration of how the Workload Generator runs in a distributed mode based on a master/slave (client) model.
Media Processing Function
With the source of our streams and the workers defined, one crucial part of the testbed setup is still missing: the Media Processing Function.
The flexibility of the testbed framework described above means it is not tied to any MPF in particular, but in our case we relied on Unified Origin to fulfill this function. This dynamic packager, which runs as a plugin on an Apache web server, is one of the core products of Unified Streaming and is used in many large-scale streaming setups worldwide.
Cloud instance evaluation
With our testbed set up, we first tried to answer one of the most popular questions when it comes to running a media processing workflow in the cloud: which instance type should I use?
We considered different categories for the target DMPW. Specifically, we selected AWS’s c5, m5 and r5 cloud instance families, which use virtualization technology with near bare-metal performance (AWS Nitro), and the c4, r4 and i3 instances, which use a Xen-based hypervisor. In prior tests we have seen that smaller instance types, which suffer from higher virtualization overhead and limited resources, are not effective for real-time, large-scale request handling.
The details of the “large” AWS EC2 instance types that we compared are shown in Table 2. The second column lists the CPU technology used, based on Intel’s Xeon processor family or AMD’s EPYC 7x family. For more details on the processor types we refer to the AWS EC2 documentation.
EC2 instance types tested (size: large)*

| Type | CPU | Clock (GHz) | Mem. (GiB) | Net. Bandwidth (Gbps) | $/hr | Hypervisor |
| --- | --- | --- | --- | --- | --- | --- |
| m5 | Xeon_p8x | 3.1 | 8 | 10 | 0.115 | nitro |
| m5a | Epyc_7x | 2.5 | 8 | 10 | 0.104 | nitro |
| m5n | Xeon_Sc | 3.1 | 8 | 25 | 0.141 | nitro |
| c4 | Xeon_E5 | 2.9 | 3.75 | N/A** | 0.114 | xen |
| c5 | Xeon_p8x | 3.4 | 4 | 10 | 0.097 | nitro |
| c5a | Epyc_7x | 3.3 | 4 | 10 | 0.087 | nitro |
| c5n | Xeon_p8x | 3.4 | 5.25 | 25 | 0.123 | nitro |
| r5 | Xeon_p8x | 3.1 | 16 | 10 | 0.152 | nitro |
| r5a | Epyc_7x | 2.5 | 16 | 10 | 0.137 | nitro |
| r5n | Xeon_Sc | 3.1 | 16 | 25 | 0.178 | nitro |
| r4 | Xeon_E5 | 2.3 | 15.25 | 10 | 0.160 | xen |
| i3 | Xeon_E5 | 2.3 | 15.25 | 10 | 0.186 | xen |
Table 2: EC2 instance types tested. *Details provided by AWS for the Frankfurt datacenter, checked on 18 May 2021. **Network bandwidth specified as “moderate”; no numbers are given.
Test results per instance type
Figure 3 illustrates the average response time of each cloud instance to a request for a media segment. The bitrates of the video and audio segments are 3894 kbps and 132 kbps, respectively.
However, do note that the presented measurements are based on a basic setup of Unified Origin for a VOD scenario, which was not optimized for any specific instance type. Therefore, the average response times reported below are higher than what can be achieved in a fully optimized production setup.
For visualization purposes, we divided the twelve tested instances over the two radar plots shown in Figures 4 and 5. The scores are normalized to the best-scoring instance, with larger being better (i.e., the inverse is used for latency, memory usage, CPU usage, etc.), and the cost aspect (pricing) is included as well.
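For clarity, the snippet below illustrates this normalization: metrics where lower is better (latency, CPU, memory, cost) are inverted first, and every score is then divided by the best one. The numbers used are made-up placeholders, not measured results.

```python
# Made-up example values; only the normalization logic mirrors the text above.
def normalize(values, lower_is_better):
    """Scale per-instance scores so the best instance gets 1.0 and larger is better."""
    scores = {k: (1.0 / v if lower_is_better else v) for k, v in values.items()}
    best = max(scores.values())
    return {k: round(s / best, 2) for k, s in scores.items()}

latency_ms = {"c5a": 120.0, "c4": 310.0}        # lower is better, so inverted
throughput_mbps = {"c5a": 950.0, "c4": 400.0}   # higher is better

print(normalize(latency_ms, lower_is_better=True))       # {'c5a': 1.0, 'c4': 0.39}
print(normalize(throughput_mbps, lower_is_better=False))  # {'c5a': 1.0, 'c4': 0.42}
```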
Taking all of this into consideration, the compute-optimized c5a.large cloud instance obtained the highest throughput and lowest response times of all the tested cloud instances. In contrast, the c4.large instance obtained the worst performance in terms of total media delivered.
The difference in throughput correlates with the different virtualization technologies used by the cloud instances. Surprisingly, the c5a.large instance not only performed best but also featured the lowest cost per hour.
Therefore, the key points to take away here are:
- More expensive hardware does not always mean better performance!
- In a distributed VOD workflow, Unified Origin achieves higher performance on compute-optimized instances than on other cloud instance types
- Using near bare-metal hypervisors such as AWS Nitro can improve the overall performance of your workflow
- Cloud instances with higher CPU clock speeds can be considered more suitable for Unified Origin’s workloads
Conclusions
In this first part of our series on load testing streaming video at scale we introduced a testbed for the evaluation of Real-Time Distributed Media Processing Workflows based on MPEG’s NBMP standard. We described each component of the testbed and how it can be used to study different deployment and configuration aspects of a DMPW.
We also compared certain performance metrics of a media workload to select the best cloud instance and identify its key requirements.
In part 2 of this series we will show how this testbed can help compare different types of media processing configurations using Unified Origin, with a focus on achieving the best results when streaming VOD from remote object storage.