
Immersive Audio Fundamentals

Updated: May 14


When entering the mesmerising, exciting, yet elaborate world of immersive audio, it is crucial to establish the fundamentals that will later let you work with confidence and progress further. The first key step is to learn the difference between three vital concepts: audio representations, formats, and speaker configurations.


Audio representations

Also referred to as: Audio types, Formats.

  

There are three main audio representations: channel-based, object-based, and scene-based.



Channel-based audio (CBA)

Channel-based audio is a representation in which each audio signal has a designated output channel. This means that it can only be played optimally on a predetermined speaker setup (it is loudspeaker-centric). So if a mix was created using a stereo configuration, it should be played back using the same configuration. However, many devices, particularly AV receivers, are capable of up-mixing and down-mixing audio (for example, from stereo to 5.1) using technologies such as DTS Neural:X or Dolby Pro Logic II.


  • Crucial innovation: 1878 - mono, 1933 - stereo

  • Typically a 2D format, but it doesn't need to be: it can also be used for speaker configurations with height channels, such as 5.1.2 (covered later)

  • Usual speaker configurations: mono, stereo, quad, 5.1, 7.1, 9.1, etc.
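
To make the idea of down-mixing concrete, here is a minimal sketch of a passive 5.1-to-stereo fold-down for a single sample frame. It assumes the commonly cited coefficients of about -3 dB (0.7071) for the centre and surround channels and simply drops the LFE channel. The function and parameter names are hypothetical, and the proprietary technologies mentioned above use their own, more sophisticated processing.

```python
import math

# -3 dB expressed as a linear gain (about 0.7071). Assumed coefficient.
MINUS_3_DB = 1 / math.sqrt(2)


def downmix_51_to_stereo(front_left, front_right, centre, lfe, surround_left, surround_right):
    """Fold one 5.1 sample frame down to a (left, right) stereo pair.

    The lfe argument is accepted but deliberately ignored in this sketch.
    """
    left = front_left + MINUS_3_DB * centre + MINUS_3_DB * surround_left
    right = front_right + MINUS_3_DB * centre + MINUS_3_DB * surround_right
    return left, right


# A centre-only signal ends up equally in both stereo channels.
print(downmix_51_to_stereo(0.0, 0.0, 1.0, 0.0, 0.0, 0.0))
```

In practice the result would also need level management to avoid clipping, which is left out here to keep the sketch short.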


 

Object-based audio (OBA)

This audio representation utilises objects, each consisting of an audio signal along with metadata that describes its parameters, such as its position in the sound space and its volume. This means that the audio signal is no longer assigned to a specific speaker (it is speaker-agnostic), but rather to a point in space chosen by the engineer. This removes the constraints of a single speaker configuration, so a mix can be created and played back on many different setups.

 

  • Crucial innovation: 2012 - Dolby Atmos

  • Typically a 3D format

  • Usual speaker configurations: 5.1.2, 5.1.4, 7.1.2, 7.1.4, 9.1.4, 9.1.6, 11.1.8, etc.
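
As a rough illustration of "audio signal plus metadata", an object could be modelled as the data structure below. This is only a sketch: the class name, the field names, and the angle convention are assumptions made for this example and do not reflect the metadata schema of any real format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioObject:
    """Hypothetical audio object: a mono signal plus positional metadata."""

    name: str
    samples: List[float]          # the audio signal itself
    azimuth_deg: float = 0.0      # horizontal angle (assumed convention)
    elevation_deg: float = 0.0    # vertical angle (assumed convention)
    gain_db: float = 0.0          # volume as a gain in decibels

    def linear_gain(self) -> float:
        # Convert the decibel gain to a linear amplitude factor.
        return 10 ** (self.gain_db / 20)

    def scaled_samples(self) -> List[float]:
        # Apply the gain; a renderer would then place the result in space.
        gain = self.linear_gain()
        return [sample * gain for sample in self.samples]


vocal = AudioObject("vocal", [1.0, -0.5], azimuth_deg=30.0, gain_db=-6.0)
print(vocal.scaled_samples())
```

The point of the structure is that nothing in it names a speaker: deciding which speakers reproduce the object is left to the renderer on the playback system.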



Scene-based audio (SBA)

The third audio representation focuses on generating a mix as a single scene/sound field. This means that just like in the case of object-based audio, there are no speaker configuration constraints, and a song mixed on one type of setup can be played on a different one.


  • Crucial innovation: 1975 - Ambisonics

  • Typically 3D

  • Usual speaker layout: 9 speaker cube, 16 speaker hemisphere, 25 speaker sphere, etc.



Binaural?

This is where things get interesting. Binaural is not an audio representation. The two are often grouped together for convenience, but binaural is slightly different from the rest. Without delving too deeply into the science, binaural audio is a two-channel audio format for headphones that achieves a spatialisation effect using the Head-Related Transfer Function (HRTF). It can either be recorded using highly specialised microphones, such as the Neumann KU100, or applied by a renderer after the song has already been mixed. This means that binaural either precedes or succeeds the audio representations, so it is more of a delivery format. The same applies to transaural audio, which is essentially binaural for speakers, although it requires adjusted calculations to account for speaker crosstalk.


  • Crucial innovations: 1927 - binaural, 1966 - transaural

  • Can be played on any headphones or speakers

  • Most often not as spatially precise as object or scene-based audio through a professional speaker configuration with height channels



Formats

Also referred to as: Spatial audio technologies.


Before delving into different formats, it is essential to distinguish between format, container, codec, and tool, as confusing them is a common mistake.



A format is a framework that defines how audio is created, encoded, and played back across devices. It is not just a single piece of software, codec, or tool, but rather an ecosystem. This ecosystem can be integrated end to end, like Dolby Atmos, which has its own tools, codecs, and even certified playback systems. Alternatively, it can be like MPEG-H, which some consider controversial because it is a container and codec without an audio engine. Spatial audio technologies usually manage these key processes:


  1. Authoring - the creative stage where audio elements are positioned in 3D space

  2. Rendering - simulating how the mix translates across different playback systems

  3. Mastering - final quality control, including loudness and metadata validation

  4. Encoding - exporting the project to the desired deliverable

  5. Distribution - delivering the encoded content to the distributors

  6. Playback - real-time decoding and rendering for end-user consumption



As the list shows, the container, codec, and tools are only a small part of a format. The container is the 'packaging' for the encoded audio, and the codec is the algorithm that encodes the audio data, usually compressing it. In Dolby Atmos, an example of a container is ADM BWF, and an example of a codec is PCM. 'Tool' is a much broader term, but it usually refers to renderers, panners, and anything else provided by the format's creators for working with the format. Sometimes it also refers to plug-ins used within DAWs to assist in the mixing process, but I would keep those two terms separate.



There are many spatial audio formats, with new ones emerging every year, each built around different audio representations. Because of this variety, only a selection will be covered here. Some formats are hybrids: they may primarily use one type of audio representation (like object-based audio) but also support others (such as scene-based or channel-based audio). To keep things simple, I’ll categorise each format by its primary representation.

Keep in mind that these technologies are evolving rapidly. While I’ll do my best to keep this list current, this section is intended to be a rough guide rather than a definitive manual. For the most up-to-date information, please refer to each company's website.



Channel-based audio

Auro 3D

Introduced in 2006, Auro 3D is a channel-based format available on home theatre systems, soundbars, headphones, smart speakers, and car systems. What makes it unique is that it's the only major fully channel-based 3D format, and it's the only one that, for some configurations, utilises the "Voice of God" (VoG) channel and speaker, projecting sound directly above the listener.

Official website: https://www.auro-3d.com/



AuroMax

Introduced in 2015, AuroMax builds on the Auro 3D design, incorporating additional object-based audio functionality to make it more flexible and, therefore, more immersive. It is designed exclusively for cinema, meaning it supports the required industry-standard formats such as SMPTE IAB.

Official website: https://www.auro-3d.com/wp-content/uploads/2023/08/AuroMax-Cinema-System-Requirements-Specification-v1-20160427.pdf



Mach 1 Spatial

Introduced in 2015, Mach 1 Spatial is an open ecosystem designed to work without proprietary codecs or metadata. It allows for transcoding to or from any spatial format and up-mixing and down-mixing any project. It utilises Virtual Vector-Based Panning (VVBP), a transparent and artefact-free method of creating a soundfield without renderer-dependent interpretation.

Official website: https://www.mach1.tech/


Object-based audio

Dolby Atmos

Introduced in 2012, Dolby Atmos is currently the industry standard format widely integrated into cinema, home theatre, streaming, music, gaming, mobile, and car systems. It is primarily object-based but also supports channel beds. It also works closely with Apple Spatial Audio Format, which uses Dolby's object-based audio architecture to enable immersive playback across Apple devices.

Official website: https://www.dolby.com/en-gb/technologies/dolby-atmos/



DTS:X

Introduced in 2015, DTS:X is mostly a cinema and home theatre format, with optional IMAX Enhanced integration for the best viewing experience. However, its technology is also featured in cars and gaming headphones. What's unique about this format is that, thanks to its adaptable technologies, it is less strict about speaker configurations than other formats.

Official website: https://dts.com/



Apple Spatial Audio Format (ASAF)

Introduced in 2025, ASAF is an Apple-exclusive Dolby Atmos overlay that enables even greater positional accuracy and enhanced head tracking. It is paired with the new Apple Positional Audio Codec (APAC), which improves efficiency, scalability, and real-time responsiveness.

Official website: https://developer.apple.com/videos/play/meet-with-apple/223/



Sony 360 Reality Audio

Introduced in 2019, Sony 360 Reality Audio (360RA) was designed from the ground up with music in mind. This is evident in the ways it is consumed, which are mostly headphones, soundbars, smartphones, some home theatre systems, and cars. It also has great streaming support, working with services such as Tidal, Amazon, and Deezer.

Official website: https://www.sony.co.uk/electronics/360-reality-audio



Atmoky trueSpatial

Introduced in 2024, Atmoky trueSpatial is primarily object-based but does support Ambisonics. Its focus is on game integration through the provision of plugins for different game engines. It also supports 6 Degrees of Freedom (6DoF), enabling hyper-realistic spatial cues and accurate localisation as the user moves through the virtual environment, creating a lifelike experience. However, it isn't limited only to headphones as it also supports multichannel speaker configurations.

Official website: https://atmoky.com/



Eclipsa Audio

Introduced in 2025, Eclipsa Audio is a product of Google and Samsung, created to democratise immersive audio by removing licensing. It is free to use and expected to be widely adopted on different devices and streaming services in the future, but as of yet, it is only available on Samsung TVs and soundbars.

Official website: https://www.eclipsaapp.com/


Scene-based audio

Ambisonics

Introduced in 1978, Ambisonics is the oldest format on this list. It is also an open standard, meaning it is not owned by any company. Its strength lies in enabling real-time rotation of the entire sound field, which is critical for head-tracked audio in VR, AR, 360 videos, and research. Furthermore, Higher Order Ambisonics (HOA) provides very fine spatial detail that is often unachievable in other formats. Overall, it is a foundational encoding method which different formats and software borrow from.

Oxford's website: https://intothesoundfield.music.ox.ac.uk/what-is-ambisonics
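
The sound field rotation mentioned above can be sketched for the simplest case: a single first-order (B-format) sample frame rotated around the vertical axis. This is a minimal illustration, assuming the W, X, Y, Z component order with X pointing forward, Y to the left, and Z up, and a positive angle meaning a counter-clockwise turn when seen from above. The function name is hypothetical, and higher orders need larger rotation matrices that are not shown here.

```python
import math


def rotate_yaw(frame, angle_deg):
    """Rotate one first-order Ambisonics frame (w, x, y, z) about the vertical axis.

    Assumes X = front, Y = left, Z = up; positive angles turn the field
    counter-clockwise when viewed from above.
    """
    w, x, y, z = frame
    theta = math.radians(angle_deg)
    # Only the horizontal components mix; W (omni) and Z (height) are unchanged.
    x_rot = x * math.cos(theta) - y * math.sin(theta)
    y_rot = x * math.sin(theta) + y * math.cos(theta)
    return (w, x_rot, y_rot, z)


# A source encoded straight ahead moves to the left after a 90 degree rotation.
print(rotate_yaw((0.7, 1.0, 0.0, 0.0), 90.0))
```

For head tracking, a renderer would typically rotate the field by the opposite of the listener's head movement so that sources appear to stay fixed in the room.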



Worth noting: currently, for immersive music making, object-based audio appears to be the preferred approach.



Speaker configurations

Also referred to as: Audio setups, Sound systems.


Speaker configurations describe the number of speakers in a setup, their positions, and their type. They can be written in different ways depending on the standard and the naming convention in use. Dolby, for instance, describes a configuration as follows: 7.1.4 (ear-level.subwoofers.height). Auro 3D either uses the same naming convention as Dolby or sums all full-range speakers (not subwoofers) into one number and uses the second number for the subwoofers. This is the case for its 13.1 setup, which breaks down into 7 surround, 5 height, 1 Voice of God, and 1 subwoofer. On the other hand, the International Telecommunication Union (ITU), which develops worldwide technical standards for telecom, broadcasting, and internet technologies, uses a re-ordered convention, e.g., 9+10+3 (height+ear-level+bottom). For that reason, it is crucial to learn quickly which convention is most relevant to your work, as each industry, and sometimes even each workplace, uses a different one.
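
The dot-separated labels described above are easy to break down programmatically. The sketch below is a hypothetical helper, not part of any vendor's tooling: it reads a three-number label as ear-level.subwoofers.height and a two-number label as main speakers followed by subwoofers, which covers both classic labels such as 5.1 and summed labels such as 13.1. Other conventions are deliberately not handled.

```python
from typing import NamedTuple


class SpeakerLayout(NamedTuple):
    main: int        # ear-level speakers, or all full-range speakers in summed labels
    subwoofers: int
    height: int

    @property
    def total(self) -> int:
        # Total number of speakers, subwoofers included.
        return self.main + self.subwoofers + self.height


def parse_layout(label: str) -> SpeakerLayout:
    """Parse a dot-separated label such as '7.1.4' or '13.1'."""
    parts = [int(part) for part in label.strip().split(".")]
    if len(parts) == 2:
        return SpeakerLayout(parts[0], parts[1], 0)
    if len(parts) == 3:
        return SpeakerLayout(parts[0], parts[1], parts[2])
    raise ValueError(f"Unsupported layout label: {label!r}")


for label in ("5.1", "7.1.4", "13.1"):
    layout = parse_layout(label)
    print(label, layout, "total speakers:", layout.total)
```

Note that a two-number label alone cannot reveal whether height speakers are folded into the first number, which is exactly why knowing the convention in use matters.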



Feel free to check out other guides for more interesting information or the ever-growing glossary to learn some useful terms.

