Google Duo adopts WaveNetEQ

2022-10-24

Google Duo uses WaveNetEQ to fill gaps in voice calls

Author: Google AI Blog | Compiled by: LiveVideoStack | Source: LiveVideoStack

Voice calls have become part of daily life, but packets often arrive at the other end out of order or late, and individual packets are sometimes lost entirely. This not only degrades call quality but is also a common problem in audio and video transmission in general

Google Duo (a video-calling service for mobile devices) found that 99% of its calls must cope with problems such as packet loss, excessive jitter, or network delay. Of those calls, 20% lose more than 3% of their audio duration to network problems, and 10% lose at least 8%

Figure: a simplified diagram of the network problems that lead to packet loss, which the receiver must compensate for to achieve reliable real-time communication

To ensure reliable real-time communication, lost packets must be dealt with, a task known as packet loss concealment (PLC). The receiver's PLC component is responsible for creating audio (or video) to fill the gaps caused by packet loss, excessive jitter, or temporary network failures, all of which result in missing data

To solve these audio problems, Google Duo began using a new PLC system called WaveNetEQ

WaveNetEQ is a generative model based on DeepMind's WaveRNN technology. It is trained on a large speech dataset to continue short speech segments realistically, allowing it to fully synthesize the raw waveform of the missing audio

Because Duo uses end-to-end encryption, all processing must happen on the mobile device. Google claims the WaveNetEQ model is fast enough to run there while still delivering state-of-the-art audio quality and concealment that sounds more natural than other systems currently in use

A new PLC system for Duo

Like many other web-based communication systems, Duo is built on the WebRTC open-source project. To counteract packet loss, WebRTC's NetEQ component uses signal-processing methods that analyze the speech and produce a smooth continuation

This works well for small losses (20 ms or less), but when too many packets are lost, leaving a gap of 60 ms or more, the results are unsatisfying. In the latter case the speech becomes robotic and repetitive, an artifact familiar to many users of online voice calls
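NetEQ's actual signal-processing algorithm is considerably more elaborate, but the failure mode described above can be illustrated with a minimal toy sketch (names and parameters here are illustrative, not NetEQ's): concealing a gap by tiling the last pitch period of the received audio. Over 60 ms, the same period repeats many times, which is exactly what makes long concealments sound robotic.

```python
import numpy as np

def repeat_plc(history: np.ndarray, gap_len: int, period: int) -> np.ndarray:
    """Toy packet-loss concealment: tile the last pitch period of the
    received audio across the gap. Long gaps expose the repetition,
    which is why purely signal-based PLC sounds robotic beyond ~60 ms."""
    last_period = history[-period:]
    reps = int(np.ceil(gap_len / period))
    return np.tile(last_period, reps)[:gap_len]

# 48 kHz audio with a 100 Hz pitch (480-sample period); conceal a 60 ms gap
sr = 48000
t = np.arange(sr) / sr
history = np.sin(2 * np.pi * 100 * t)
fill = repeat_plc(history, gap_len=int(0.060 * sr), period=480)
```

A 60 ms gap at 48 kHz is 2,880 samples, so the single 480-sample period above repeats six times in the concealed region.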

To handle packet loss better, Google Duo replaced NetEQ's PLC component with a modified version of WaveRNN, a recurrent neural network model for speech synthesis. It consists of two parts: an autoregressive network and a conditioning network

The autoregressive network is responsible for the continuity of the signal: it provides the short- and medium-term structure of the speech by making each generated sample depend on the network's previous outputs. The conditioning network influences the autoregressive network, steering it toward audio consistent with more slowly varying input features

However, WaveRNN, like its predecessor WaveNet, was created with text-to-speech (TTS) applications in mind. As a TTS model, WaveRNN is given information about what it should say and how to say it

The conditioning network receives this information directly as input, in the form of the phonemes that make up the words plus additional prosodic features (i.e., all the non-textual information, such as intonation or pitch). In a sense, the conditioning network can see into the future and steer the autoregressive network toward the matching waveform. That information is not available to a PLC system in real-time communication

A functional PLC system must instead extract contextual information from the current speech (i.e., the past) while simultaneously generating plausible-sounding audio

Google Duo's WaveNetEQ solution follows this approach, using an autoregressive network to provide audio continuity while the conditioning network models long-term features such as voice characteristics. The spectrogram of the past audio signal is used as input to the conditioning network, which extracts limited information about prosody and textual content. This compressed information is fed to the autoregressive network, which combines it with the recent audio to predict the next sample in the waveform domain
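The exact conditioning features are not specified in this article; as a stand-in, the sketch below computes a plain magnitude spectrogram over the most recent audio, the kind of slowly varying spectral summary such a conditioning network could consume (frame and FFT sizes here are illustrative assumptions).

```python
import numpy as np

def spectrogram_features(audio: np.ndarray, frame_len: int = 480,
                         hop: int = 240, n_fft: int = 512) -> np.ndarray:
    """Magnitude spectrogram of recent audio: windowed frames, FFT,
    magnitudes. A stand-in for the spectral conditioning input."""
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.abs(np.fft.rfft(frame, n_fft)))
    return np.stack(frames)  # shape: (num_frames, n_fft // 2 + 1)

feats = spectrogram_features(np.random.randn(4800))  # 100 ms at 48 kHz
```

Each row summarizes one 10 ms frame; stacking rows over a longer window gives the "compressed information" about the recent past described above.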

This differs slightly from the procedure followed while training the WaveNetEQ model, during which the autoregressive network receives the actual sample present in the training data as input for the next step, rather than the last sample it generated

This procedure, called teacher forcing, ensures that the model learns valuable information even in the early stages of training, when its predictions are still poor. Once the model is fully trained and used in audio or video calls, teacher forcing is applied only to "warm up" the model on the first samples; after that, the model's own output is passed back as input for the next step
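The warm-up-then-free-run loop can be sketched generically. The `ar_step` below is a hypothetical stand-in for the trained network (here a fixed two-tap predictor that happens to continue a sinusoid exactly); the structure of the loop, real samples first, then the model's own outputs, is the point.

```python
import numpy as np

def generate(ar_step, warmup: np.ndarray, n_out: int, order: int) -> np.ndarray:
    """Warm up an autoregressive model on real past samples (as in
    teacher forcing), then feed its own outputs back to synthesize
    the concealment audio."""
    buf = list(warmup[-order:])          # warm-up: real received samples
    out = []
    for _ in range(n_out):
        nxt = ar_step(np.array(buf[-order:]))
        out.append(nxt)                  # free-running: model output becomes
        buf.append(nxt)                  # the input for the next step
    return np.array(out)

# Hypothetical ar_step: x[n] = 2*cos(w)*x[n-1] - x[n-2], which exactly
# continues a sinusoid of angular frequency w
w = 2 * np.pi * 100 / 48000
step = lambda ctx: 2 * np.cos(w) * ctx[-1] - ctx[-2]
t = np.arange(480) / 48000
synth = generate(step, np.sin(2 * np.pi * 100 * t), n_out=240, order=2)
```

Because the free-running loop feeds predictions back into itself, errors compound; the warm-up on real audio anchors the first predictions to the actual call.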

Figure: WaveNetEQ architecture. During inference, the autoregressive network is warmed up with the most recent audio via teacher forcing; after that, the model's own output is provided as input for the next step. A mel spectrogram from a longer audio segment is used as input to the conditioning network

The model is applied to the audio data in Duo's jitter buffer. Once real audio resumes after a packet-loss event, Duo seamlessly merges the synthetic and real audio streams. To find the best alignment between the two signals, the model generates slightly more output than is actually needed and cross-fades from one to the other. This smooths the transition and avoids audible artifacts
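The merge step can be sketched as a simple overlap-add cross-fade (the alignment search mentioned above is omitted here; a linear ramp is an assumption, real systems may use other fade shapes):

```python
import numpy as np

def crossfade(synthetic: np.ndarray, real: np.ndarray, overlap: int) -> np.ndarray:
    """Overlap-add the tail of the synthesized audio with the head of
    the resumed real stream using a linear fade, smoothing the seam."""
    ramp = np.linspace(0.0, 1.0, overlap)
    mixed = synthetic[-overlap:] * (1 - ramp) + real[:overlap] * ramp
    return np.concatenate([synthetic[:-overlap], mixed, real[overlap:]])

# 20 ms of synthetic audio cross-faded into real audio over 5 ms (48 kHz)
out = crossfade(np.ones(960), np.zeros(960), overlap=240)
```

In the overlap region the synthetic signal's weight ramps from 1 to 0 while the real signal's weight ramps from 0 to 1, so there is no discontinuity at either boundary.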

Figure: simulating PLC events on audio within a 60 ms moving window. The blue line represents the actual audio signal, including the past and the future of the PLC event. At each time step, the orange line represents the synthetic audio WaveNetEQ would predict if the audio were cut off at the gray line

60 ms packet loss

Audio clips: the audio is taken from LibriTTS; 10% of it is removed in 60 ms chunks and then filled in by WebRTC's default PLC system, NetEQ, and by Google's PLC system, WaveNetEQ. (Since the platform allows at most three audio uploads, not all of the original clips can be reproduced here, including the examples with 120 ms losses.)

Ensuring robustness

An important factor for PLC is the network's ability to adapt to varied input signals, including different speakers and changes in background noise

To ensure the model's robustness across many users, Google trained WaveNetEQ on a speech dataset containing more than 100 speakers in 48 different languages

This allows the model to learn general characteristics of human speech rather than properties of a specific language. To ensure WaveNetEQ can cope with noisy environments, such as taking a call in a train station or a cafeteria, Google augmented the data by mixing it with a variety of background noises
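A common way to implement this kind of noise augmentation (the article does not specify Google's exact recipe, so the SNR-based mixing below is an assumption) is to scale a noise clip to a target signal-to-noise ratio before adding it to the clean speech:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale the noise so the mixture has the requested signal-to-noise
    ratio (in dB), then add it to the clean speech."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
noisy = mix_at_snr(rng.standard_normal(48000), rng.standard_normal(48000),
                   snr_db=10.0)
```

Sampling `snr_db` randomly per training example exposes the model to everything from nearly clean speech to heavily corrupted audio.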

Although Google's model learns how to plausibly continue speech, this holds only in the short term: it can finish a syllable, but it cannot predict the words themselves. For longer packet losses, Google instead fades the output out gradually, so that the model falls silent after 120 milliseconds
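The fade-to-silence can be sketched as a gain ramp applied to the synthetic output. The article only states that silence is reached by 120 ms; the point at which the ramp starts (60 ms here) is an illustrative assumption.

```python
import numpy as np

def fade_out(synth: np.ndarray, sr: int, start_ms: float = 60.0,
             end_ms: float = 120.0) -> np.ndarray:
    """Ramp the synthetic concealment audio down to silence between
    start_ms and end_ms after the loss began."""
    t_ms = np.arange(len(synth)) / sr * 1000.0
    gain = np.clip((end_ms - t_ms) / (end_ms - start_ms), 0.0, 1.0)
    return synth * gain

faded = fade_out(np.ones(9600), sr=48000)  # 200 ms of synthetic audio
```

Everything before 60 ms passes through unchanged, and everything after 120 ms is exactly zero, matching the behavior described above.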

To further ensure that the model does not produce incorrect syllables, Google evaluated samples from WaveNetEQ and NetEQ with the Google Cloud Speech-to-Text API and found no significant difference in word error rate (i.e., the number of words transcribed incorrectly relative to the spoken speech)
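Word error rate is the standard edit-distance metric on word sequences; a minimal self-contained implementation for comparing two transcripts looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with the classic Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

Running both systems' concealed audio through the same speech-to-text service and comparing WER against the clean transcript is exactly the comparison described above.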

Google has been testing WaveNetEQ in Duo, and the results show a positive impact on call quality and user experience. WaveNetEQ is already used in all Duo calls on the Pixel 4 and is now being rolled out to other models and devices

Original link:

Copyright © 2011 JIN SHI