How fast can you decode videos into frames with FFmpeg? Part-1

Jc Huynh
4 min read · Mar 5, 2020

With the recent advances in computer vision, researchers and engineers are looking forward to building impactful applications that change our lives. Video is one of the most informative data sources that we can easily collect at scale on the Internet. However, many computer vision models operate on frame-level data, so people often have to decode the collected videos into frames before feeding them into those models. FFmpeg is one of the most widely used frameworks for this decoding job.

FFmpeg is the leading multimedia framework, able to decode, encode, transcode, mux, demux, stream, filter and play pretty much anything that humans and machines have created. It supports the most obscure ancient formats up to the cutting edge. No matter if they were designed by some standards committee, the community or a corporation. It is also highly portable: FFmpeg compiles, runs, and passes our testing infrastructure FATE across Linux, Mac OS X, Microsoft Windows, the BSDs, Solaris, etc. under a wide variety of build environments, machine architectures, and configurations. [1]

Given a large number of videos, have you ever wondered how long it would take to decode all of them? For example, if a user sends you an hour-long video, can you estimate beforehand how long it will take to analyze? Would it be better to decode on the CPU or on the GPU? How fast can your machines/servers actually decode a single video, or multiple videos concurrently? Let's dive into some experiments.

First, we need to download some videos for our benchmark. Personally, I prefer something fast to achieve my goal, so I used the pytube library to download a video at multiple resolutions from YouTube. You can take a look at my Python code below. Basically, the code uses pytube to download any video from a YouTube URL as mp4 files and saves them into a videos directory.
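The original code isn't embedded here, so below is a minimal sketch of what it might look like. It assumes a recent pytube where a stream query is directly iterable, and the URL is a hypothetical placeholder you should replace with your own:

from pathlib import Path
from pytube import YouTube

def download_all_resolutions(url, out_dir="videos"):
    # Download the same video at every available mp4 resolution.
    Path(out_dir).mkdir(exist_ok=True)
    yt = YouTube(url)
    for stream in yt.streams.filter(file_extension="mp4", only_video=True):
        print(f"Downloading {stream.resolution}...")
        # Prefix filenames with the resolution so the downloads don't collide.
        stream.download(output_path=out_dir, filename_prefix=f"{stream.resolution}_")

if __name__ == "__main__":
    download_all_resolutions("https://www.youtube.com/watch?v=...")  # hypothetical URL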

Note: if you are using Python 3, run "pip install pytube3" instead of "pip install pytube".

Second, we need FFmpeg. If you are going to run the benchmark only on the CPU, there is a simple solution:

For Ubuntu users: sudo apt install ffmpeg
For Fedora users: sudo dnf install ffmpeg

Otherwise, you can follow these instructions to compile FFmpeg with CUDA support yourself.
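Either way, you can quickly check which hardware accelerators your FFmpeg build supports; if the CUDA build succeeded, nvdec should show up in the list:

ffmpeg -hwaccels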

Now that you have FFmpeg and some video samples, let's write a simple program to benchmark decoding on a single video.
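My original code isn't embedded here either; the following is a minimal sketch of the same idea. It assumes the clips live in the videos directory, that ffmpeg and ffprobe are on your PATH, and that get_duration is a hypothetical helper I added to turn wall-clock time into decoding fps:

import subprocess
import time
from pathlib import Path

TARGET_FPS = [1, 2, 5, 10, 15, 30]

def get_duration(video_path):
    # Ask ffprobe for the clip length in seconds.
    out = subprocess.check_output([
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        str(video_path),
    ])
    return float(out.decode().strip())

def extract_frames(video_path, target_fps, use_gpu=False):
    # Decode at target_fps, discard the frames (-f null), and time the run.
    cmd = ["ffmpeg"]
    if use_gpu:
        cmd += ["-hwaccel", "nvdec"]
    cmd += ["-i", str(video_path), "-r", str(target_fps), "-f", "null", "/dev/null"]
    start = time.time()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=True)
    return time.time() - start

if __name__ == "__main__":
    for video in sorted(Path("videos").glob("*.mp4")):
        duration = get_duration(video)
        for fps in TARGET_FPS:
            elapsed = extract_frames(video, fps)
            # Decoding fps = number of output frames / wall-clock time.
            print(f"{video.name} target_fps={fps}: {duration * fps / elapsed:.1f} decoding fps")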

The code may look long, but basically it iterates through the videos directory and decodes each video at different target fps. For instance, we decode a 30 fps video to extract 1, 2, 5, 10, 15, or 30 frames every second and measure the number of frames per second we get as output. The core function, extract_frames, simply calls the FFmpeg binary from the command line to extract frames from the video without storing any data on the local disk. By doing so, we can measure the decoding latency precisely. To be specific, the two commands I use for decoding are:

FOR CPU: ffmpeg -i <video_path> -r <target_fps> -f null /dev/null

FOR GPU: ffmpeg -hwaccel nvdec -i <video_path> -r <target_fps> -f null /dev/null
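For a quick sanity check before running the full script, you can also time a single run from the shell (sample_1080p.mp4 is a hypothetical filename):

time ffmpeg -i videos/sample_1080p.mp4 -r 30 -f null /dev/null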

Below, for your reference, are the results from some machines/servers I have access to.

[Chart: CPU Benchmark 1]
[Chart: CPU Benchmark 2]
[Chart: CPU Benchmark 3]

I evaluate the throughput of our systems in terms of decoding fps, which measures the number of frames we can decode every second: decoding fps = (video duration × target fps) / wall-clock decoding time. All my videos use the H.264 codec.

From the three charts above, we can make a few observations.

  1. It's easy to notice that the decoding fps increases as the resolution of the video decreases. For example, at target fps 1 in CPU Benchmark 1, decoding the 720p video is about 1.68x faster than decoding the 1080p video (64 fps vs. 38 fps).
  2. We can also observe that the decoding fps at target fps 30 is approximately 2x higher than at target fps 15. That's because the amount of computation needed to extract 15 frames every second is roughly the same as the amount needed to extract all the frames: FFmpeg has to decode every frame either way and simply drops the irrelevant outputs. In the end, the total number of output frames is halved, so the measured decoding fps is halved as well (see the worked example below this list). The same principle applies to the other cases (e.g., target fps 10 vs. target fps 5).
  3. Another interesting observation is that the Azure VM is super fast on low-resolution videos (e.g., 240p and 144p). I'll have to run further experiments to find an explanation for this behavior.
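To make observation 2 concrete with some made-up numbers: suppose a 60-second, 30 fps clip (1,800 frames) takes 10 seconds to decode in full. At target fps 15, FFmpeg still decodes all 1,800 frames and merely drops half of them, so the run still takes roughly 10 seconds but emits only 900 frames. Measured as output frames per second, that is 1800 / 10 = 180 decoding fps at target fps 30 versus 900 / 10 = 90 at target fps 15, i.e., the 2x gap we see in the charts.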

During the benchmark, I noticed that the CPUs on those machines were not 100% utilized, leaving room for further improvements.

In the next part, we will take a look at some optimizations, such as using more threads for decoding or running the FFmpeg decoder on multiple videos concurrently. We will also measure decoding performance on the GPU and compare it with the CPU.

Stay tuned and see you in the next part.
