A new audio system confuses smart devices that try to eavesdrop

You may know them as Siri or Alexa. These smart devices, called personal assistants, are attentive listeners. Say just a few words and they will play a favorite song or direct you to the nearest gas station. But all that listening poses a risk to privacy. To help people protect themselves from eavesdropping devices, a new system plays soft, calculated sounds. These mask conversations to confuse the devices.

Mia Chiquier is a graduate student at Columbia University. She works in a computer science research lab headed by Carl Vondrick.

Smart devices use automatic speech recognition, or ASR, to translate sound waves into text, explains Mia Chiquier. She studies computer science at Columbia University in New York. The new program deceives ASR by producing sound waves that vary based on your speech. These added waves jumble the audio, making it harder for the ASR to recognize the sounds of your speech. This “completely confuses this transcription system,” says Chiquier.
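For readers curious about the mechanics, the core idea can be sketched in a few lines of Python: a quiet extra waveform simply adds, sample by sample, to the speech a nearby microphone picks up. (The signals and numbers below are made up for illustration; the real system computes its attack waveform far more carefully.)

```python
import numpy as np

SAMPLE_RATE = 16_000  # samples per second, a common rate for ASR audio

# One second of stand-in "speech" (a real system would capture microphone audio).
t = np.linspace(0, 1, SAMPLE_RATE, endpoint=False)
speech = 0.5 * np.sin(2 * np.pi * 220 * t)

# A quiet "attack" waveform. Its precise shape is what the algorithm
# computes; here it is just low-level noise, far softer than the speech.
attack = 0.05 * np.random.randn(SAMPLE_RATE)

# What a nearby smart device hears: the two waveforms simply add together.
mixture = speech + attack
```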

She and her colleagues describe their new system as “voice camouflage”.

The masking sounds don’t need to be powerful. In fact, they are quiet. Chiquier likens them to the hum of a small air conditioner in the background. The trick to making them effective, she says, is tailoring these so-called “attack” sound waves to what someone is saying. To do that, the system predicts the sounds someone will utter a short time in the future. It then quietly plays sounds chosen to garble a smart speaker’s interpretation of those words.

Chiquier described the system on April 25 at the International Conference on Learning Representations, which was held virtually.

Getting to know you

The first step in creating good voice camouflage: Get to know the speaker.

If you text a lot, your smartphone starts predicting the next letters or words in your message. It also gets used to the types of messages you send and the words you use. The new algorithm works much the same way.

“Our system listens to the last two seconds of your speech,” Chiquier explains. “Based on that speech, it predicts the sounds you may make in the future.” And not just sometime in the future, but exactly half a second later. That prediction draws on the traits of your voice and your language patterns. These data help the algorithm learn to compute what the team calls a predictive attack.

That attack is the sound the system plays over the speaker’s words. And it keeps changing with every sound someone makes. When the attack plays alongside the words that prompted it, the combined sound waves become an acoustic jumble that confuses any ASR system within earshot.
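For those who code, here is a rough Python sketch of that predict-then-play loop. Only the two-second listening window and half-second prediction horizon come from the researchers’ description; predict_attack is a hypothetical stand-in for their trained neural network, and play stands in for speaker output.

```python
import collections

import numpy as np

SAMPLE_RATE = 16_000
CONTEXT_SECONDS = 2.0   # the system listens to the last two seconds...
HORIZON_SECONDS = 0.5   # ...and targets speech half a second ahead

CONTEXT_LEN = int(CONTEXT_SECONDS * SAMPLE_RATE)
HORIZON_LEN = int(HORIZON_SECONDS * SAMPLE_RATE)

# Rolling buffer holding the most recent two seconds of microphone audio.
context = collections.deque(maxlen=CONTEXT_LEN)

def predict_attack(recent_audio: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the trained model: given the last two
    seconds of speech, return a quiet waveform meant to garble the next
    half second. Here it is just low-level noise, not a real prediction."""
    return 0.05 * np.random.randn(HORIZON_LEN)

def camouflage_step(new_chunk: np.ndarray, play) -> None:
    """Fold one incoming chunk of microphone audio into the buffer, then
    emit a fresh attack through `play` (a speaker-output callback) so it
    overlaps the words expected half a second from now."""
    context.extend(new_chunk)
    if len(context) == CONTEXT_LEN:
        play(predict_attack(np.array(context)))
```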

Predictive attacks are also hard for an ASR system to outsmart, Chiquier says. If someone tries to disrupt an ASR by playing a sound in the background, for instance, the device can learn to filter that noise out of the speech. That’s true even if the masking sound changes periodically over time.

The new system, in contrast, generates sound waves based on what the speaker has just said. So its attack sounds are constantly changing, and in an unpredictable way. That, Chiquier says, makes it “very difficult [for an ASR device] to defend against.”

Attacks in action

To test their algorithm, the researchers simulated a real-life situation. They played a recording of someone speaking English in a room with a medium level of background noise. An ASR device listened and transcribed what it heard. The team then repeated the test with white noise added to the background. Finally, they ran it with the voice-camouflage system switched on.

The voice-camouflage algorithm kept the ASR from hearing words correctly 80 percent of the time. The hardest words to disguise were short, common ones. But such words carry little information, the researchers note. Their system was far more effective than white noise. It even performed well against ASR systems designed to filter out background noise.
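One simple way to score such a test, sketched below: compare each transcript with the true text and count the share of words the ASR got right. The transcribe call shown in the comments is a hypothetical stand-in for any ASR system, and this crude in-order word match is only a rough proxy for the metrics the researchers actually used.

```python
def word_accuracy(reference: str, transcript: str) -> float:
    """Fraction of reference words that appear, in order, in the
    transcript. A crude stand-in for a proper word-error-rate metric."""
    ref_words = reference.lower().split()
    hyp_words = transcript.lower().split()
    matched = 0
    i = 0
    for word in ref_words:
        while i < len(hyp_words) and hyp_words[i] != word:
            i += 1
        if i < len(hyp_words):
            matched += 1
            i += 1
    return matched / len(ref_words) if ref_words else 0.0

# Hypothetical usage, one run per test condition:
# for condition in ("no masking", "white noise", "voice camouflage"):
#     transcript = transcribe(recording, background=condition)
#     print(condition, word_accuracy(reference_text, transcript))
```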

The algorithm might one day be built into a real-world app, Chiquier says. To keep an ASR system from listening in reliably, “you would just open the app,” she says. “That’s it.” The system could be added to any device that plays sound.

But that’s getting a bit ahead of things. More testing must come first.

This is “good work,” says Bhiksha Raj. He is an electrical and computer engineer at Carnegie Mellon University in Pittsburgh, Pennsylvania. He did not take part in the study. But he, too, studies how people can use technology to protect the privacy of their speech and voice.

Right now, smart devices control how well a user’s voice and conversations are protected, Raj says. He believes that control should instead rest with the speaker.

“There are so many aspects to sound,” Raj explains. Words are one aspect. But a voice can also reveal other personal information, such as someone’s accent, gender, health, emotional state or physical size. Companies could potentially exploit these traits, targeting users with different content, ads or prices. They could even sell voice data to others, he says.

When it comes to voice, “the challenge is to figure out exactly how we can hide it,” says Raj. “But we need to have some control over at least parts of it.”
