
So long, audio

“We are aligned.”

So many people, so many things in this world need to align with others for them to work well.

These things include audio tracks. With your video tracks.

Oh, and your text tracks, too.

Which leads us to our customer question of the month: “My video timeline is so spiffy and short, so how come, in my DASH manifest, my audio timeline is so long?”

You didn’t do the math, silly! If you want to go from …

<SegmentTimeline>
<S t="0" d="96256" r="2" />
<S d="95232" />
<S d="96256" r="2" />
<S d="95232" />
<S d="96256" r="2" />
<S d="95232" />
<S d="96256" r="2" />

….

to

<SegmentTimeline>
<S t="0" d="921600" r="37" />
<S d="210192" />
</SegmentTimeline>

… then you need to ensure your segment size fits both your audio and your video. If you don’t, then your audio is constantly having to compensate. Add a little bit of audio in one segment, take a little bit of audio off for the next segment, and so on. What you don’t have here are segment boundaries that are nicely aligned between audio and video.
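To see the compensation at work, here is a minimal Python sketch that expands the first few entries of the long timeline above. It assumes the durations are expressed in audio samples (the manifest's timescale is not shown here), and it closes the truncated snippet so that it parses. The `expand` helper and the constant names are ours, for illustration only.

```python
import xml.etree.ElementTree as ET

AAC_FRAME_SAMPLES = 1024  # samples per AAC frame

# First entries of the non-aligned timeline, closed here so the XML parses.
TIMELINE = """
<SegmentTimeline>
  <S t="0" d="96256" r="2" />
  <S d="95232" />
  <S d="96256" r="2" />
  <S d="95232" />
</SegmentTimeline>
"""

def expand(timeline_xml):
    """Return one duration per segment; each <S> counts r + 1 times."""
    durations = []
    for s in ET.fromstring(timeline_xml).iter("S"):
        repeat = int(s.get("r", "0"))
        durations.extend([int(s.get("d"))] * (repeat + 1))
    return durations

durations = expand(TIMELINE)
# Assumption: durations are in samples, so dividing by 1024 gives AAC frames.
frames = [d // AAC_FRAME_SAMPLES for d in durations]
print(frames)              # [94, 94, 94, 93, 94, 94, 94, 93]
print(sum(durations[:4]))  # 384000
```

Three segments of 94 frames followed by one of 93: that is the audio adding a little and then taking a little off. If the timescale is 48000, each group of four adds up to exactly 8 seconds.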

And we all know how important boundaries are!

[Figure: non-aligned versus aligned audio and video segment boundaries]

Segments, fragments, boundaries, confused? Well, no wonder!

Let’s focus on how things work when dynamically packaging your content (i.e., running Unified Origin).

Audio segments can only be a multiple of fragment length on disk — dictated by your live encoder or how you’ve prepared your on-demand assets on disk — which, in turn, can only be a multiple of an audio frame length (in AAC, 1024 samples).*

Video segments can only be a multiple of fragment length on disk, which, in sensible scenarios, can only be a multiple of a GOP (Group of Pictures) length.

What we mean by a segment is: what is referenced in the client manifest or sent by a live encoder.

What we mean by a fragment is: the fragments in which the content is stored in a fragmented MP4 file (hello ‘moof’, hello ‘mdat’!).

And just to confuse matters a bit more: in many scenarios, segments and fragments have the same length.

Now that we have a better understanding of the terminology, let’s align! First off, you need:

  • Video with an integer frame rate (i.e., not a fractional, drop-frame rate such as 29.97)
  • Audio with a sample rate of 48 kHz (i.e., not 44.1 kHz)
  • An encoder that supports non-integer segment durations

Then you need a calculator and a half sheet of paper. (For a handy GOP size calculator for segment alignment, you can go here.)

So … an AAC audio frame consists of 1024 samples. This means that one frame of AAC audio with a sample rate of 48 kHz is 1024/48000 seconds long.

The length of one frame of video is simply 1 divided by the frame rate (e.g., 1/25 seconds for a frame rate of 25).

And what is a sensible segment duration that is a multiple of both 1024/48000 and 1/25? First, we need to know the lowest common multiple of these two numbers. In this case, the lowest common multiple is 8/25, or 0.32 seconds, which is exactly 15 audio frames and 8 video frames.
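If you would rather not reach for the half sheet of paper, the same calculation can be sketched in Python with exact fractions. The helper names below are ours, not part of any Unified Streaming tool, and the 44.1 kHz line is only there to hint at why 48 kHz is on the shopping list.

```python
from fractions import Fraction
from math import gcd

def lcm_int(a, b):
    """Lowest common multiple of two positive integers."""
    return a * b // gcd(a, b)

def lcm_fraction(x, y):
    """Shortest duration that is a whole multiple of both x and y.

    Fraction keeps values in lowest terms, so
    lcm(a/b, c/d) = lcm(a, c) / gcd(b, d).
    """
    return Fraction(lcm_int(x.numerator, y.numerator),
                    gcd(x.denominator, y.denominator))

audio_frame = Fraction(1024, 48000)  # one AAC frame at 48 kHz (8/375 s)
video_frame = Fraction(1, 25)        # one video frame at 25 fps

shortest = lcm_fraction(audio_frame, video_frame)
print(shortest, float(shortest))                       # 8/25 0.32
print(shortest / audio_frame, shortest / video_frame)  # 15 8

# Same video, but AAC at 44.1 kHz: the shortest aligned duration balloons.
print(lcm_fraction(Fraction(1024, 44100), video_frame))  # 256/25
```

With 44.1 kHz audio the shortest duration that fits both tracks is 256/25, or 10.24 seconds, which is rather long for a segment.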

Then simply find a GOP length that fits your use case — one which is a multiple of the lowest common multiple that you calculated. A sensible duration in this case would be 1.92 seconds, for example. This equals 90 audio frames and 48 video frames.
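To pick a duration that suits your use case, it can help to list the candidates. The sketch below assumes the 48 kHz and 25 fps example from above and simply walks through the multiples of 0.32 seconds; `candidates` is a made-up helper name.

```python
from fractions import Fraction

AUDIO_FRAME = Fraction(1024, 48000)  # one AAC frame at 48 kHz
VIDEO_FRAME = Fraction(1, 25)        # one video frame at 25 fps
BASE = Fraction(8, 25)               # shortest aligned duration (0.32 s)

def candidates(max_seconds):
    """List (seconds, audio frames, video frames) for each aligned duration."""
    rows = []
    n = 1
    while n * BASE <= max_seconds:
        duration = n * BASE
        rows.append((float(duration),
                     int(duration / AUDIO_FRAME),
                     int(duration / VIDEO_FRAME)))
        n += 1
    return rows

for seconds, audio_frames, video_frames in candidates(4):
    print(f"{seconds:.2f} s = {audio_frames} audio frames = {video_frames} video frames")
```

The sixth line of the output is the 1.92 second option: 90 audio frames and 48 video frames.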

For VOD content, you can even use Unified Packager to adjust your fragment duration. For video, you will need to use a multiple of your GOP length, audio can be any multiple of its frame length, and text tracks can be set to whatever you like. Preferably, all these should be the same. Otherwise you won’t see that lovely neat timeline!

Now you’re aligned, and perfect alignment ensures maximum compatibility. Perfect alignment optimizes the media for streaming delivery in terms of bandwidth usage and manifest sizes.

It’s not just a pretty manifest you’re after: perfect alignment also helps when multiplexing tracks within a single MPEG‑2 transport stream, when creating Virtual subclips, and when using Unified Capture without frame accuracy (i.e., without transcoding). Without perfect alignment, a virtual subclip or captured clip may contain (small) gaps of audio or video at the start or end of the clip, resulting in potential playback issues.
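How small is small? As a rough illustration, the sketch below takes the first four audio segment durations from the non-aligned timeline earlier in this post and measures how far each audio boundary sits from the nearest video boundary. Two things are assumed rather than known: a timescale of 48000, and video segments of exactly 2 seconds.

```python
from itertools import accumulate

TIMESCALE = 48000              # assumption: units per second
VIDEO_SEGMENT = 2 * TIMESCALE  # assumption: 2-second video segments

# First four audio segment durations from the non-aligned timeline.
audio_durations = [96256, 96256, 96256, 95232]

# Where each audio segment ends, in timescale units.
audio_boundaries = list(accumulate(audio_durations))

# Distance between each audio boundary and the matching video boundary.
offsets_ms = [
    (boundary - (i + 1) * VIDEO_SEGMENT) * 1000 / TIMESCALE
    for i, boundary in enumerate(audio_boundaries)
]
print([round(o, 2) for o in offsets_ms])  # [5.33, 10.67, 16.0, 0.0]
```

Under those assumptions, a cut on a video boundary would only line up with the audio at every fourth segment; at the others the two tracks differ by up to 16 milliseconds.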

Thanks for reading. We wish all our readers perfect alignment!

*Here’s a tip for anyone whose audio tracks on disk are fragmented in a way that doesn't let you achieve segment alignment with video (e.g., ~2 seconds). Skip repackaging all these tracks and create dref MP4s for them instead. If you use these dref MP4s as input to Unified Origin, then Origin will be able to create segments that are a multiple of the audio frame length (e.g., 1024/48000 seconds) instead of fragment length (e.g., ~2 seconds).
