The future of voice recognition: meet your AI-controlled 'digital twin'
Speech is a much more natural way of interacting with devices than poking at buttons and screens, and its popularity has exploded in recent years, with voice-enabled digital assistants now integrated into virtually every household device imaginable.
That growth has been made possible by the work of companies like XMOS. The name might not be immediately familiar, but if you’ve ever used an Amazon Echo speaker then you’ve benefited from its technology.
XMOS is a fabless semiconductor company specializing in voice processing. Its algorithms are capable of detecting softly-spoken voice commands from across a room – even in challenging conditions (like rooms with a lot of hard surfaces). So why has voice taken off so rapidly?
“I think it makes life easier,” says Alex Craciun, algorithm engineer at XMOS. “You don’t have so many cables and complicated instructions that you have to take care of. You can just give commands and the device tunes itself, or tells you something that you want it to. That’s a lot easier.”
“I play IT support to my parents, and we think voice is going to end that, because your technology will tell you how it works,” adds director of corporate marketing Esther Connock. “It won’t need to come with a remote; it won’t need to come with an instruction booklet – you just talk to it in a very natural, conversational way, and that for us democratizes technology because you don’t need to learn how to use it. You don’t need to come at it with knowledge.
“So if you think about people with low literacy or low levels of education, suddenly it’s a much more open playing field. Vulnerable sectors of society can use technology and become less isolated. So for us, voice is the most natural thing in the world.”
It's good to talk
XMOS is part of the blossoming tech industry in Bristol emerging from the city’s two universities, which also includes Ultrahaptics (which uses ultrasound to create a sensation of touch in mid-air), Reach Robotics (creator of the Mekamon augmented reality robot) and Graphcore (a spin-off from XMOS).
Its speech detection and isolation tech includes beamforming (which tracks a person’s voice as they move around a room and electronically steers the microphone array’s focus to follow them, with no physical movement involved), acoustic echo cancelation (separating the user’s voice from sound being played by the device itself), dereverberation (compensating for echoes), noise suppression, barge-in (which stops audio playback when the device’s wake-word is detected), and fixed or automatic gain control (ensuring all voices in conference calls are heard at the same volume, regardless of how loudly the person is speaking).
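To give a rough sense of what automatic gain control involves, here is a minimal Python sketch. It is not XMOS's implementation: the function names, the target level of 0.1 and the gain cap of 20 are all assumptions made for illustration, and a real system would adjust gain smoothly over time rather than one block at a time.

```python
import math

def rms(samples):
    """Root-mean-square level of a block of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def apply_gain_control(samples, target_rms=0.1, max_gain=20.0):
    """Scale a block so its RMS level approaches target_rms.

    target_rms and max_gain are illustrative values, not taken from
    any real product. The cap stops near-silent blocks (mostly
    background noise) from being amplified without limit.
    """
    level = rms(samples)
    if level == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = min(target_rms / level, max_gain)
    return [s * gain for s in samples]

def tone(amplitude, n=480, freq=440.0, rate=16000.0):
    """Hypothetical test signal standing in for a talker's voice."""
    return [amplitude * math.sin(2 * math.pi * freq * i / rate) for i in range(n)]

# A quiet talker and a loud talker end up at the same level.
quiet = apply_gain_control(tone(0.02))
loud = apply_gain_control(tone(0.8))
```

Both blocks come out at the same RMS level even though the inputs differ in amplitude by a factor of 40, which is the "same volume regardless of how loudly the person is speaking" behavior described above.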
The company was founded in 2005, built on research from the University of Bristol. “They developed a micro-controller that could do a lot of processing, had a lot of power and capability, and could perform a lot of tasks concurrently,” explains Connock, “so that was hugely exciting.”
Apple’s decision to kill off the FireWire port in 2008 opened up the market for USB audio, where XMOS found its niche. The company diversified, working for big players like Harman Kardon and Yamaha, but also for DJs with their mixing decks, before turning to multi-channel audio.
“With a board with a lot of processing power, we could produce something with up to 32 channels of output, so we could get fantastic multi-channel audio,” explains Connock. “And that specialism in sound and audio led us into voice as it started to emerge. One of our clients said, ‘With all your expertise, you should be thinking about microphones and capturing voice.’ And that’s exactly what we did.”
In 2017, XMOS gained Amazon certification for its far-field voice interface. “We’re still their only qualified partner with a stereo solution, so for anyone developing TVs and soundbars and set-top boxes and doing work in true stereo, we’re the only provider that can do acoustic cancelation in stereo,” says Connock. “That’s really important to us, and something that we’re focusing heavily on this year at CES. But we’ve also just qualified with Baidu, so that’s very exciting, and we’re doing some work with NTT Docomo as well. We’re expanding across the regions.”
Outside the home
XMOS currently specializes in edge-of-room voice applications, but it’s investigating other areas too, including in-car interfaces.
“The technology that we’ve been developing over in Boston – sound source separation, which extracts multiple voices in a conversation – works really well for automotive,” says Connock. “So if you can imagine that I can be on the phone to you and I’m driving, it strips out everything that you can hear except for my voice. The kids can be shouting in the back, they can have a film that’s playing, and all you’ll get is my voice.”
The company also has an interesting prediction for the future of voice: as a personal assistant (in a flexible, wearable smartphone) that will sit between us and the big companies that currently provide voice recognition services.
“If I look at Amazon and Google (and to a degree Apple, with Apple Music), they have a bias because they’re trying to sell us stuff. And I love Amazon for selling me stuff, but what I don’t want is voice spam, and the minute that starts to happen, people will switch away from voice,” explains Connock.
The solution would be a kind of mid-layer that filters out any spam, and points you to the service that has the most relevant content for you (which it will learn based on your preferences).
Your digital twin
It’s not just a theory – XMOS is already having conversations to make it happen. “It will happen quickly,” Connock says, “so we are looking at partnering, building, buying to create that ecosystem. So there’s a lot within that – there are lots of people we know operating in that space today. It’s open and it’s ready and we want to be taking advantage of it.”
According to Connock, this will result in the creation of a ‘digital twin’ – a term that she admits sounds a bit twee, but is useful. It will learn and adapt to the way you use it. For example, it could learn that you don’t want it to speak to you unless you’ve spoken first.
“It will learn not just my music preferences, but my everything preferences. When I want to be disturbed, my friends that I will prioritize talking to – everything.”
However, even with a truly personal assistant to filter out any spam, voice recognition still faces some resistance.
“When you look at this,” Connock says, picking up her smartphone, “this is always on, it has a camera, it can always hear you, it’s got sensors, it gathers a lot of data, you type everything into it, and because we’re so used to it and so reliant on it, and it’s so close to us, people don’t see this as a privacy issue at all.
“And yet when you put a speaker in the middle of the room, everyone says ‘Oh, it’s listening!’ Well it is, but not as much as [the phone] is!”
Connock believes that relevant, trusted content will be the key to voice becoming widely accepted. The moment the industry puts sales ahead of the user’s experience, it will have a problem, so XMOS is making sure it’s on the front foot, and prepared to react in case that happens.
There’s also the question of natural speech, as opposed to commands. Alexa Skills are very handy, but they’re not the same as talking to another human. XMOS’s algorithm engineers are working on making the interaction much more organic.
“You need to feel like the machine understands your emotions – like it’s frictionless – then it will take off,” says Connock.
It might sound like science fiction, but Craciun says it’s closer than we realize. “I think it’s already happening,” she says. “We’re seeing lots of developments from Amazon; every single month there’s something new coming up that you can read about. So the field is advancing really, really fast. It could even be tomorrow that something more natural comes up there.”