Spatial Audio for Cinematic VR and 360 Videos

Audio is a powerful way to fully immerse your audience in your VR experience. In this section, you'll learn the basics of spatial audio.


Written by Abesh Thakur, Product Manager, Immersive Audio, Facebook

Spatial audio is a powerful way to fully immerse a user and direct attention within a 360 video or VR experience. Much of our attention can be steered with audio cues, but a fully immersive experience requires a detailed spatial audio mix, not cues added as an afterthought. Spatial audio makes what we hear a believable auditory experience that matches what we see. For precisely this reason, sound design should be part of the creative brief from the very beginning: bad or misplaced audio design and cues can undermine an otherwise convincing experience.

Here we’ll cover the basics of spatial audio and provide some how-tos to help you make an awesome 360 VR experience.

What is Spatial Audio?

The human brain interprets auditory signals in a way that lets it make decisions about its surroundings. We use our two ears, in conjunction with the ability to move our heads in space, to localise the position of a sound source and to judge the environment the sound is in.

Spatial audio in virtual reality involves the manipulation of audio signals so they mimic acoustic behavior in the real world. An accurate sonic representation of a virtual world is a very powerful way to create a compelling and immersive experience. Spatial audio not only serves as a mechanism to complete the immersion but is also very effective as a UI element: audio cues can call attention to various focal points in a narrative or draw the user into watching specific parts of a 360 video, for example.

Spatial audio is best experienced over a normal pair of headphones. No special speakers, hardware, or multi-channel headphones are required. For more detail on auditory localisation, check out this write-up on the Oculus Developer Center.

Try it for yourself with these examples of spatialized audio. Make sure you have your headphones on!

Fuerza Imprevista (JauntVR)

Through the Ages (NatGeo)

Rapid Fire: A Brief History of Flight (Studio Transcendent)

Linear vs. Interactive Audio Design

While the playback and consumption of spatial audio is the same whether the content is an interactive experience, a 360 video, or a cinematic mixed-reality piece, the workflows for creating such content differ significantly. In particular, games and interactive experiences often rely on audio samples played from discrete sound sources and mixed in real time relative to the position of the camera. The Oculus Audio SDK adds high-quality spatialization to the middleware (such as FMOD and Wwise) and engines (Unity and Unreal) that game developers already use.

As the production and consumption of immersive panoramic and VR experiences grows, developers and infrastructure owners are re-examining the constraints of video container specifications. Better file compression, codec and streaming infrastructure support for multichannel spatial audio, and support for real-time interactive metadata (to influence traditional video playback, or to enable social reactions with live video) are all signs of the worlds of traditional video broadcasting and siloed game-like apps coming together.

Ambisonics

Ambisonic technology is a method of representing a full-sphere 3D sound field around a particular point in space. It is conceptually similar to 360 video, except the entire spherical sound field is audible and responds to changes in head rotation. There are many ways of rendering an Ambisonic field, but all of them rely on decoding to a binaural stereo output so the listener can perceive the spatial audio effect over a normal pair of headphones.

Ambisonic audio can be of any order n, comprising (n+1)² channels. More channels yield higher spatial quality, although the perceived improvement diminishes beyond 3rd-order Ambisonics (16 channels of audio). Regardless of the number of channels used to encode the original signal, the decoded binaural output is always two channels. As the listener moves their head, the content of the decoded output stream shifts and changes accordingly, producing the 3D spatial effect.
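
To make the channel math concrete, here is a minimal sketch (illustrative only, not Workstation code) of the order-to-channel-count relationship and of encoding a mono sample into first-order ambiX (ACN channel order, SN3D normalisation):

```python
import math

def ambisonic_channel_count(order: int) -> int:
    """Number of channels in a full-sphere Ambisonic mix of the given order."""
    return (order + 1) ** 2

print([ambisonic_channel_count(n) for n in (1, 2, 3)])  # [4, 9, 16]

def encode_first_order_ambix(sample: float, azimuth_deg: float, elevation_deg: float):
    """Encode one mono sample at the given direction into first-order ambiX
    (ACN channel order W, Y, Z, X with SN3D normalisation)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return [
        sample,                                 # W: omnidirectional component
        sample * math.sin(az) * math.cos(el),   # Y: left/right
        sample * math.sin(el),                  # Z: up/down
        sample * math.cos(az) * math.cos(el),   # X: front/back
    ]
```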

Ambisonics is not the only way to render spatial audio for 360 videos. Other solutions exist in the market as well, although effectiveness, feature set, toolchains, and final render quality vary between techniques:

  • Traditional surround sound such as 5.1 or 7.1, decoded over virtual speakers and rendered binaurally over headphones. Depending on the content, the rendered sound field may suffer from 'holes' between the speakers and lacks the same smoothness in spatial accuracy and resolution
  • Quad-binaural: four pairs of pre-rendered binaural stereo tracks at 0, 90, 180, and 270 degrees, crossfaded based on head rotation (see the sketch after this list)
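
As a rough illustration of the quad-binaural approach (a sketch of the general idea, not any particular player's implementation), the four pre-rendered pairs can be blended like this:

```python
def quad_binaural_weights(yaw_deg: float) -> dict:
    """Blend weights for the 4 pre-rendered binaural pairs at 0/90/180/270 degrees.
    Linearly crossfades between the two pairs adjacent to the current head yaw;
    a real renderer might use an equal-power fade instead."""
    yaw = yaw_deg % 360.0
    sector = int(yaw // 90)              # which 90-degree sector the head is in
    frac = (yaw - sector * 90.0) / 90.0  # position within that sector, 0..1
    angles = [0, 90, 180, 270]
    weights = {a: 0.0 for a in angles}
    weights[angles[sector]] = 1.0 - frac
    weights[angles[(sector + 1) % 4]] = frac
    return weights

# Example: head turned 45 degrees -> equal blend of the 0- and 90-degree pairs
print(quad_binaural_weights(45.0))  # {0: 0.5, 90: 0.5, 180: 0.0, 270: 0.0}
```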

Using Facebook 360 Spatial Workstation to Create 360 Linear VR

The FB 360 Spatial Workstation is an end-to-end pipeline that allows sound designers to drop in audio sources, pan and sync to scene elements, and render to a single ambisonic file that is played back on Facebook and Oculus video. Originally developed by Two Big Ears, Facebook 360 Spatial Workstation is now a free tool provided by the Audio 360 team at Facebook. Spatial Workstation is a collection of plugins for DAWs that include a Spatialiser, video player, Encoder, and Loudness meter, just to name a few. These plugins help authors create spatial audio content, encode it with platform-specific metadata (for Facebook, YouTube, etc.), and play it back in a client application.

The diagram above illustrates a typical end-to-end workflow focusing on sound design, asset preparation, mixing with final video, and publishing to Facebook, Oculus or other supported apps.

For most 3rd-party apps on Gear VR and other platforms using the Rendering SDK, the sound designer prepares a .tbe file, which is delivered separately. The 3rd-party application integrates the underlying Rendering SDK, which allows it to play back the .tbe file in sync with the video file. The documentation includes multiple APIs for synchronising to an external clock, as illustrated below.
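
The exact calls live in the Rendering SDK documentation; the following sketch only shows the general drift-correction pattern behind clock synchronisation, with hypothetical names (get_elapsed_seconds, seek_to) standing in for the real API:

```python
DRIFT_THRESHOLD = 0.05  # seconds of A/V drift tolerated before re-syncing

def sync_audio_to_video(engine, video_clock_seconds: float) -> None:
    """Keep .tbe playback locked to an external (video) clock.

    `engine` is a hypothetical wrapper around the Rendering SDK; the real
    API names differ -- see the SDK documentation shipped with the Workstation."""
    drift = abs(engine.get_elapsed_seconds() - video_clock_seconds)
    if drift > DRIFT_THRESHOLD:
        # Snap the audio engine to the video's position; small drift is left
        # alone to avoid audible glitches from constant seeking.
        engine.seek_to(video_clock_seconds)
```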

For Facebook and YouTube, the Facebook 360 Encoder application creates an upload-ready .mp4 file.

Content Creation

The Facebook 360 Spatial Workstation is a collection of plugins for creating interactive spatial mixes for 360 videos.

  • The Spatialiser plugin allows the sound designer to place a sound source in space. The source itself could be a mono source, an Ambisonics recording, or a multi-channel source such as a surround reverb. Non-mono sources act as a 'bed' while diegetic mono sources, such as dialogue and sound effects, are usually placed in a scene. Non-diegetic audio such as narration or background music is usually routed to the head-locked stereo bus. This makes it part of the final mix but not relative to head orientation.
  • The Control plugin acts as the command centre, controlling how all audio is routed for real-time binaural playback over headphones. This plugin also manages global settings of features such as early reflections and mix focus.
  • The Video player is a built-in 360 video player slaved to the DAW timeline, allowing the sound designer to preview the mix against the 360 video in real time, either in VR or on the desktop. In desktop mode, rotating the video with the keyboard or mouse rotates the sound field instantly, providing direct feedback during the authoring stage.
  • The Converter plugin is a utility that can rotate a mix after it has been created, or output to other formats such as 4-channel ambiX or 2 channel static binaural stereo.
  • The Loudness meter provides an overview of the loudness of the entire mix when looking in a particular direction. Loudness metering for spatial mixes differs considerably from the static-content metering built into DAWs; spatial audio for 360 videos is considerably more complex, and this meter gives data that helps prevent the final uploaded content from distorting when played back on the target device. A simplified sketch of level metering follows this list.
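
As a toy illustration of level metering (real loudness measurement, e.g. ITU-R BS.1770, is considerably more involved, and the Workstation's meter operates on the decoded spatial mix):

```python
import math

def rms_dbfs(samples):
    """RMS level in dBFS for a block of float samples in the range -1..1.
    A real spatial loudness meter would first decode the Ambisonic mix for a
    given look direction and apply perceptual weighting (e.g. ITU-R BS.1770)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12))

print(rms_dbfs([0.5, -0.5, 0.5, -0.5]))  # about -6.02 dBFS
```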

Features

Panning and Object tracking: The Spatialiser plugin treats audio sources in a mix as objects that can be positioned around a 360 equirectangular video frame. There is no limit on the number of audio sources that can be placed and moved in the scene. If a speaker in the footage is captured with a lavalier mic, that track can be keyframed to follow the speaker along the timeline; during playback, the sound follows the speaker and takes real-time head-tracking information into account. The sketch below shows the underlying mapping from a frame position to placement angles.
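
For instance, a position picked on an equirectangular frame maps to azimuth and elevation like this (a minimal sketch; the Spatialiser handles this internally):

```python
def equirect_pixel_to_angles(u: float, v: float, width: int, height: int):
    """Convert an equirectangular pixel (u right, v down, origin top-left)
    to azimuth (-180..180 deg, 0 = frame centre) and elevation (-90..90 deg)."""
    azimuth = (u / width - 0.5) * 360.0
    elevation = (0.5 - v / height) * 180.0
    return azimuth, elevation

# A speaker at the centre of a 4096x2048 frame sits straight ahead:
print(equirect_pixel_to_angles(2048, 1024, 4096, 2048))  # (0.0, 0.0)
```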

Early Reflections: The Spatial Workstation can also generate up to 3rd-order early reflections per source, which improves mix fidelity. Early reflections provide valuable cues about the room or space in which the action is happening and add a layer of immersion. They can then be chained with reverb plugins, either Ambisonic or multichannel, to add the reverb tail.
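
One classic way to compute early reflections is the image-source method, shown below for first-order reflections in a simple shoebox room. This is purely illustrative; the source does not specify the Workstation's actual algorithm.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, at room temperature

def first_order_reflections(source, listener, room):
    """First-order image-source reflections in a shoebox room.

    source/listener are (x, y, z) positions in metres inside the room;
    room is (width, depth, height) with one corner at the origin.
    Returns a (delay_seconds, distance_gain) pair for each of the 6 walls."""
    reflections = []
    for axis, size in enumerate(room):
        for wall in (0.0, size):                    # two opposite walls per axis
            image = list(source)
            image[axis] = 2.0 * wall - image[axis]  # mirror the source across the wall
            dist = math.dist(image, listener)
            reflections.append((dist / SPEED_OF_SOUND, 1.0 / max(dist, 1.0)))
    return reflections

# Example: a source and listener 2 m apart in a 5 x 4 x 3 m room
print(first_order_reflections((1.0, 2.0, 1.5), (3.0, 2.0, 1.5), (5.0, 4.0, 3.0)))
```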

Mix Focus: The mix focus feature is unique to the Spatial Workstation and the Facebook and Oculus video apps. Think of it as an acoustic torchlight: the designer defines a fixed field of view attached to the viewport, beyond which all sounds are attenuated to a defined level, while the sounds inside the field of view remain prominent. This pre-defined view area moves with the listener's/HMD's orientation and works as an acoustic focus area where only the sounds in view are heard clearly. You can use this to draw attention to a specific part of the 360 viewport instead of having sounds compete from all directions.
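
A minimal sketch of the idea (the real Focus parameters are authored in the Workstation and interpreted by the player; the width and attenuation below are arbitrary example values):

```python
import math

def focus_gain(source_az_deg, view_az_deg, focus_width_deg=90.0, off_focus_db=-12.0):
    """Linear gain for a source given the current look direction: full level
    inside the focus field of view, a fixed attenuation everywhere else."""
    # Smallest angular difference between source and view direction
    diff = abs((source_az_deg - view_az_deg + 180.0) % 360.0 - 180.0)
    if diff <= focus_width_deg / 2.0:
        return 1.0
    return 10.0 ** (off_focus_db / 20.0)  # dB -> linear gain

print(focus_gain(10.0, 0.0))   # 1.0   (inside the focus area)
print(focus_gain(170.0, 0.0))  # ~0.25 (behind the viewer, attenuated)
```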

Head-locked stereo: Another nifty feature of the Facebook and Oculus platforms is simultaneous support for both spatial and head-locked stereo tracks in the sound mix. For non-diegetic sounds such as background music, narration, or a 'voice of God' attributed to off-screen elements (i.e., where the viewer cannot see the source of the sound), the designer can route these tracks to a generic stereo bus; during playback, the spatial head-tracked elements and the non-spatial head-locked elements play in harmony. This gives the sound designer great flexibility: they are not forced to spatialise everything or nothing. Achieving a good balance and distribution is key to designing a believable, yet hyper-real, acoustic space.

Encoding & Asset Preparation

The Facebook 360 Encoder application takes a video file and muxes the audio files into the video container, ready for playback on Facebook and other supported platforms. It also allows adding metadata to the file describing values for the Focus feature.

The Encoder injects the relevant metadata into the output file, making the final asset ready for upload to supported platforms.

Supported Applications for Sound Designers

The Spatial Workstation plugins come in both VST and AAX formats. The recommended digital audio workstations with multichannel Ambisonic support are Pro Tools HD (AAX plugins) and Reaper. Steinberg's Nuendo is also supported with older versions of the plugins as of this writing; native Ambisonic support in Nuendo is expected later this year (alongside v3.0 and above of the plugins).

Supported Platforms for Playback

  1. Facebook 360 video format (8- or 10-channel audio): viewable in Facebook News Feed, or in Oculus Video on Gear VR
  2. .tbe format: 3rd-party apps with the Rendering SDK (JauntVR, Whirligig media player, Ecco VR, With.in)
  3. YouTube 360
  4. Other platforms with support for Ambisonic or quad-binaural formats. Note that these platforms have their own instructions for preparing assets, such as the Samsung VR player

Preview and Publishing

In summary, designing spatial audio for 360 videos means crafting a spatial mix while staying mindful of the final delivery platform. For an efficient workflow, incorporate live previews as often as possible: sound in VR often needs to be hyper-real and exaggerated in scale, and loudness, the DACs on different devices, and headphone playback can all affect how the intended mix translates. Spatial audio mixes are rendered in real time from the user's viewing angle or head rotation, so a dynamic mix is generated on every playback. Although this sounds like a small change, dynamic playback is the biggest adjustment linear sound designers face when moving from traditional, fixed-frame rectilinear content to 360 or VR videos, and it can affect the tonality, immersion, and scale of a mix. This is why the Spatial Workstation plugins ship with an Ambisonic loudness meter and a video player for live HMD/desktop previews during the mix process.
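
To see what "a dynamic mix generated on every playback" means in practice, here is the yaw-rotation math for a first-order ambiX frame, applied before binaural decoding (a sketch of the underlying operation, not any player's actual code):

```python
import math

def rotate_first_order_yaw(w, y, z, x, head_yaw_deg):
    """Counter-rotate a first-order ambiX frame (W, Y, Z, X) by the listener's
    head yaw so that sources stay fixed in the world as the head turns."""
    rad = math.radians(-head_yaw_deg)  # the field rotates opposite to the head
    y_new = y * math.cos(rad) + x * math.sin(rad)
    x_new = x * math.cos(rad) - y * math.sin(rad)
    return w, y_new, z, x_new          # W (omni) and Z (height) are unaffected by yaw
```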

Note that each platform has its own delivery and publication requirements, and the ecosystem can be quite fragmented when it comes to spatial audio. Some platforms support only stereo audio (not recommended) or only first-order Ambisonics (a 4-channel spatial mix, as on YouTube). The Facebook and Oculus platforms have the most advanced spatial audio support to date, with up to 2nd-order Ambisonics, head-locked stereo, and interactive features such as Mix focus. The following specs are supported on Oculus Video (Gear VR/Oculus Go) and Facebook (desktop Chrome or Firefox browsers, and the Android and iOS Facebook apps):

| Output platform | Input (audio) | Head-locked stereo | Focus | Container | Encoder option to use |
| --- | --- | --- | --- | --- | --- |
| Facebook News Feed + Facebook 360 app | 9-channel 2nd-order ambiX | YES | YES | .mp4 | Facebook 360 |
| Facebook News Feed + Facebook 360 app | 8-channel TBE | YES | YES | .mp4 | Facebook 360 |
| Facebook News Feed + Facebook 360 app | 4-channel 1st-order ambiX | YES | YES | .mp4 | Facebook 360 |
| Oculus Video (mobile), sideloaded videos | 9-channel 2nd-order ambiX | YES | YES | .mkv | FB360 Matroska |
| Oculus Video (mobile), sideloaded videos | 8-channel TBE | YES | YES | .mkv | FB360 Matroska |
| Oculus Video (mobile), sideloaded videos | 4-channel 1st-order ambiX | YES | YES | .mkv | FB360 Matroska |
| Oculus Video (mobile), online streaming | 9-channel 2nd-order ambiX | YES | YES | .mp4 | Facebook 360 |
| Oculus Video (mobile), online streaming | 8-channel TBE | YES | YES | .mp4 | Facebook 360 |
| Oculus Video (mobile), online streaming | 4-channel 1st-order ambiX | YES | YES | .mp4 | Facebook 360 |
| Oculus Video (Rift), sideloaded video ONLY | 9-channel 2nd-order ambiX | YES | YES | .mkv | Rift: Oculus Video |
| Oculus Video (Rift), sideloaded video ONLY | 8-channel TBE | YES | YES | .mkv | Rift: Oculus Video |
| Oculus Video (Rift), sideloaded video ONLY | 4-channel 1st-order ambiX | YES | YES | .mkv | Rift: Oculus Video |
| YouTube | 4-channel 1st-order ambiX | NO | NO | .mp4 | YouTube video |

In addition, the Encoder offers other audio-only conversion options, for example a quad-binaural mixdown for the Samsung VR app.

Further Reading

Here are some key resources to get started with Facebook 360 Spatial Workstation.