NEXT GENERATION SOUND MEETS AI

By Dan Daley

June 28, 2023

Reading Time: 7 Minutes

As the COVID pandemic’s grip began to loosen slightly in mid-2020, and live sports slowly returned to stadiums, arenas, and television, all the “nat sounds” of the games, as broadcast audio engineers call them, were suddenly, clearly audible: the sharp crack of a baseball bat connecting with a fastball, the metallic clang of a basketball off a backboard and hoop. But one key sonic element was missing, and it fell into that famous category of things you don’t realize you miss until they’re gone: crowd noise. More than any other signature aural component of sports, it was the sound of the fans in the stands that came to be missed most, as it became clear that collective clatter serves as a proxy for all the fans who aren’t actually present at the game.

Rectifying that missing piece became its own Operation Overlord, as a slew of start-ups and leagues looked for ways to replace that critical audio element. They drew on recordings of previous broadcasts, stripping out the crowd sounds and developing ways to play those stems back in real time, isolating the distinct cheers of a big score or goal and differentiating them from the groan of a just-missed shot or the collective carping over a perceived bad call by an official, so that just the right amount of communal enthusiasm could be applied digitally on a play-by-play basis. Amazingly, it worked.

Image courtesy Salsa Sound.

One of the companies closest to the leading edge of this new paradigm was Salsa Sound, an academic start-up based at the University of Salford in Greater Manchester, where partners Rob Oldfield and Ben Shirley, Ph.D.s in spatial audio research and broadcast engineering, respectively, came together in 2017 to figure out how the burgeoning disciplines of AI and immersive audio would apply to a world that increasingly wants its sound to act as though humans had more than two ears. Their vCROWD solution was used by Manchester City FC and by CBS Sports for the NWSL Championship during the depths of the pandemic to make televised sports feel a bit more normal.

Since then, they’ve been pioneering applications such as tethering the field sounds of sports to the visuals — for instance, placing the sound of a soccer ball being kicked in the on-air soundfield relative to where it appears on screen at global sports events — as broadcast sports moves further into immersive territory. More recently, they’ve been developing the use of AI to do the actual real-time mixes of those games without human hands on the audio-console faders.

AI Enters The Picture

Oldfield muses about what those faux crowds might have sounded like just a few years later, as AI swarmed onto the media shore. “Generative neural-network algorithms were to some extent already around in those days, and we did think about it during the COVID times,” he recalls. “But I think we probably would do things a bit differently now, in that you could train a model to understand what sounds you wanted, and then it could generate it on the fly, and that would have removed the need for having previously recorded samples, which for a lot of companies was a big issue. For instance, when we worked with the Big 10 League, we created a different [crowd-sound] sample bank for each of their different teams.”

Instead, Oldfield continues, they might have applied some of the same algorithms they’ve since developed for synchronizing sports sound and picture, such as Salsa’s MixAir solution, which automatically manages crowd, commentary, and sound-effects audio for televised sports. The most recent version also allows for different crowd variants (i.e., home vs. away) and additional commentary feeds in different languages. (Or, if the user prefers, blissfully, no talking heads at all.) Its next iteration will employ what he calls “semantic analysis of crowd sound,” using AI to evaluate a crowd’s emotional sentiment minute by minute and trigger larger sound samples that amplify those feelings.
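
Salsa Sound hasn’t published how that semantic analysis works, but a minimal sketch of the general idea might look like the following Python fragment, in which every name is hypothetical and a toy heuristic stands in for a trained sentiment model: each short window of crowd audio is given a label, and the label selects a pre-recorded sample to layer into the mix.

```python
import numpy as np

def classify_window(window: np.ndarray, sample_rate: int) -> str:
    """Toy stand-in for a trained sentiment model: loudness and spectral
    centroid serve as crude proxies for excitement and tone."""
    rms = np.sqrt(np.mean(window ** 2))
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(window.size, 1.0 / sample_rate)
    centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)
    if rms > 0.2:
        return "cheer" if centroid > 1500 else "jeer"
    if rms > 0.05:
        return "groan"
    return "neutral"

def augment_crowd(crowd: np.ndarray, sample_rate: int,
                  bank: dict, window_s: float = 1.0) -> np.ndarray:
    """Layer a matching pre-recorded sample from the bank onto each
    window of the live crowd feed, blended at reduced gain."""
    hop = int(window_s * sample_rate)
    out = crowd.astype(np.float32).copy()
    for start in range(0, out.size - hop, hop):
        label = classify_window(out[start:start + hop], sample_rate)
        sample = bank.get(label)
        if sample is not None:
            n = min(sample.size, out.size - start)
            out[start:start + n] += 0.5 * sample[:n]
    return out
```

In a production system the classifier would be a model trained on labeled crowd audio and the sample bank would be venue-specific, but the control flow — analyze, label, trigger — would be broadly the same.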

Salsa Sound MixAir Solution. Image courtesy Salsa Sound.

It could push things toward an audio version of some of the disruption AI is now causing in video. On the other hand, it can also come to the rescue of broadcast matches experiencing offensive crowd behavior, such as racially tinged chants, which have become more of a problem from Madrid to Topeka. “You can use the semantics of that to then drive the crowd, triggering or looping samples of more desirable sound,” he says.

Anything that pairs a sensory experience with AI is likely to lead to unexpected outcomes, though Oldfield says there are plenty of positive uses for this kind of audio manipulation, including cinematic Foley effects that can be precisely matched with picture, such as footfalls as a character walks on screen (which can also be matched with appropriate ambiences), or creating more authentic audience reactions for sitcoms and other productions.

Next-Gen Audio Meets 3D Live

Immersive audio is already paired with three-dimensional visuals — it happens every day in IMAX theaters. But bringing those visuals to a live stage, in some cases in the form of holographic presentations, will require some version of what Salsa Sound aims to bring to the multimedia table. “That's exactly the kind of space that we're moving into, and spatial rendering is a huge part of that,” says Oldfield. “The ability to pan sounds around as well as have the sources separate fits really well into an NGA [Next-Generation Audio] world, which is exciting for me, personally, because of my background in object-based audio. I just think that that’s crying out for AI.”

It will ultimately bring about a new kind of sound mixer, one who is more of an audio manager, able not only to wrangle an increasingly large number of soundfield elements into a coherent experience but to do so as the soundfield itself gets larger, incorporating overhead sources and infrasonic effects that are more felt than actually heard. AI is going to be critical to successfully managing all of that.

“Being able to separate out the different elements spatially, I think is the main challenge,” he suggests. “Separating out stems in a live context can be really, really tricky. That’s where perhaps AI is useful, for cleaning up some of those sounds or creating new versions of them. As the number of audio elements increases, and the amount of desired movement of them grows, the mixes will have to become more automated.” Ultimately, templates can be assembled with the mix moves baked in for any type of environment or venue, and for any expected pattern of crowd reactions.
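
What such a baked-in template might look like is an open design question; the sketch below is purely illustrative (the game states, stem names, and gain values are invented, not MixAir’s actual behavior). A template maps detected game states to per-stem target gains, and an automated mixer ramps toward those targets block by block so the level changes stay smooth.

```python
import numpy as np

# Hypothetical template: target gain in dB per stem for each game state.
TEMPLATE = {
    "goal":     {"crowd": +3.0, "commentary": 0.0,  "effects": +2.0},
    "buildup":  {"crowd": -2.0, "commentary": 0.0,  "effects": 0.0},
    "stoppage": {"crowd": -6.0, "commentary": +1.0, "effects": -3.0},
}

def db_to_lin(db: float) -> float:
    return 10.0 ** (db / 20.0)

def mix_block(stems: dict, state: str, gains: dict,
              smoothing: float = 0.1) -> np.ndarray:
    """Mix one audio block, easing each stem's gain toward the template's
    target for the current game state (a simple one-pole ramp)."""
    out = np.zeros_like(next(iter(stems.values())), dtype=np.float32)
    for name, audio in stems.items():
        target = db_to_lin(TEMPLATE[state][name])
        gains[name] += smoothing * (target - gains[name])
        out += gains[name] * audio
    return out

# Usage: the gains dict persists across blocks so fades carry over boundaries.
gains = {"crowd": 1.0, "commentary": 1.0, "effects": 1.0}
block = {name: np.random.randn(4800).astype(np.float32) for name in gains}
mixed = mix_block(block, "goal", gains)
```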

Rob Oldfield, co-founder Salsa Sound. Image courtesy Salsa Sound.

Getting Personal

Personalization of audio is emerging as one of the pathways to truly dimensional and immersive sound, particularly via the head-related transfer function (HRTF), a mathematical model of how a listener’s head, torso, and outer ear filter a sound arriving from a point in space. The concept is already available via bespoke headphones precisely measured to an individual’s ear structure, or via add-ons to conventional headphones. Personalizing immersive audio would allow spectators to move about a holographic environment without losing the directional perspective of its associated audio.
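
In practice, the HRTF is usually applied as its time-domain counterpart, the head-related impulse response (HRIR): each sound source is convolved with the left- and right-ear filters measured for its direction. A minimal Python sketch follows, with dummy filters standing in for measured, per-listener data (which real systems typically load from formats such as SOFA).

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono: np.ndarray, hrir_left: np.ndarray,
                    hrir_right: np.ndarray) -> np.ndarray:
    """Convolve a mono source with the left/right HRIRs for one source
    direction, producing a two-channel signal for headphone playback."""
    left = fftconvolve(mono, hrir_left, mode="full")
    right = fftconvolve(mono, hrir_right, mode="full")
    return np.stack([left, right])

# Usage with synthetic stand-ins: a noise burst and dummy HRIRs in which
# the far ear simply receives a delayed, quieter copy of the sound.
rate = 48_000
source = np.random.randn(rate // 2).astype(np.float32)
hrir_l = np.zeros(256); hrir_l[0] = 1.0    # near ear: direct arrival
hrir_r = np.zeros(256); hrir_r[30] = 0.6   # far ear: ~0.6 ms later, quieter
stereo = render_binaural(source, hrir_l, hrir_r)   # shape (2, N)
```

A personalized system swaps in HRIRs measured (or predicted) for the individual listener and updates the chosen filter pair as head tracking reports a new relative direction.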

“If you’re looking at a hologram you want to make sure that all of the audio sources perfectly line up with the visual sources. Binaural rendering, like personalized HRTF and head tracking, can go a long way in sorting out some of the parallax errors that you can get within a hologram space,” Oldfield observes. “Typically, you want to have a soundfield synthesis type of approach if you want to have holograms, which is a multichannel, multi-loudspeaker approach of creating soundfields rather than just having one loudspeaker a little bit louder than the other. And that way as you walk around the space it keeps the sound objects in the correct location.”

The integration of immersive audio with three-dimensional imaging, with boosts from automation and AI, could produce a once-in-a-lifetime inflection point in presentation technology — once the pathways toward it become clearer. Oldfield contends that, even if the aural and visual sides of that technology continue to develop on independent tracks, they need to acknowledge each other.

“It’s been known for ages how much audio affects people’s impression of any [visual] technologies,” he says. “If you have inferior audio then even if you've got really great visuals, people will think the entire production is less than wonderful. So I think that’s one thing that it would be good to keep front and center: to remember, recognize, and appreciate the importance of audio, because it punches above its weight in terms of the value that it can add to those experiences.”