web synth docs

amazon-chime

Amazon Chime is an online videoconferencing and meeting application that runs in the web browser. It includes a feature called "voice focus" which enhances spoken voice by removing noise from sources such as shuffling papers and cars driving by.

Here's an Amazon Science blog post written by [jmvalin] which goes into the technical implementation of that noise suppression: https://www.amazon.science/blog/how-amazon-chimes-challenge-winning-noise-cancellation-works

^ This is quite a good read; it goes into a lot of detail for the implementation itself but doesn't go off the deep end with math and ML theory stuff. One of the goals for their implementation was apparently being simple and boring, and it seems like they achieved that while still obtaining a very good result.

The noise suppression itself shares many similarities with the open-source [rnnoise] library, which makes sense since [jmvalin] has worked on both of them. For example, both of them store their weights as 8-bit integers rather than 32-bit floats to make the model lighter-weight and faster during inference. The blog post also goes into detail about some topics that have a lot of crossover with [vocal-synthesis].
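To make the 8-bit weights idea concrete, here's a minimal sketch of symmetric per-tensor quantization. This is a generic textbook scheme, not necessarily what either Chime or [rnnoise] actually does; the function names (`quantize_int8`, `dequantize`) and the example weights are made up for illustration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Assumption: a single scale for the whole tensor; avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the 8-bit values."""
    return [q * scale for q in quantized]


# Hypothetical weights, purely for illustration.
weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each weight now fits in one byte instead of four, and the rounding error per weight is at most about half of `scale`.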

In particular, it talks about how the [spectral-envelope] and [aperiodicity] (they don't use that exact word but I'm pretty sure that's what they're referring to) are used along with [comb-filter]s for post-processing of their model's output. The comb filters are tuned to the frequency of the spoken voice (which I assume is determined via [fundamental-frequency-estimation]), which preserves the voice while filtering out noise.
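Here's a rough sketch of why a comb filter tuned to the pitch keeps the voice and drops what sits between its harmonics. It uses the simplest feedforward comb I know of, which is almost certainly simpler than whatever Chime really does; the sample rate, the 200 Hz pitch, and all names here are assumptions for the demo.

```python
import math

SAMPLE_RATE = 16000               # assumed sample rate in Hz
F0 = 200.0                        # assumed pitch of the voice in Hz
PERIOD = round(SAMPLE_RATE / F0)  # pitch period in samples (80 here)


def comb_filter(x, period, gain=0.5):
    """Feedforward comb: y[n] = (1 - gain) * x[n] + gain * x[n - period]."""
    y = []
    for n, sample in enumerate(x):
        delayed = x[n - period] if n >= period else 0.0
        y.append((1.0 - gain) * sample + gain * delayed)
    return y


def rms(x):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def tone(freq, n_samples):
    """Unit-amplitude sine wave at the given frequency."""
    return [math.sin(2 * math.pi * freq * n / SAMPLE_RATE) for n in range(n_samples)]


voiced = tone(F0, 1600)      # sits exactly on the pitch
between = tone(300.0, 1600)  # sits halfway between the 200 Hz and 400 Hz harmonics

# Skip the first period, where the delay line is still empty.
voiced_out = comb_filter(voiced, PERIOD)[PERIOD:]
between_out = comb_filter(between, PERIOD)[PERIOD:]
```

The 200 Hz tone repeats every `PERIOD` samples, so the delayed copy lines up with the original and the tone passes through unchanged. The 300 Hz tone arrives half a cycle out of phase with its delayed copy and cancels out.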

The ML model they use is a DNN which contains some convolutional layers as well as some [GRU]s. The [rnnoise] library also makes use of GRUs. The post also mentions that LSTMs would work as well, since they serve a similar purpose to GRUs (adding memory to the neural network).
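To show what "adding memory" means, here's a toy GRU step with a scalar input and a scalar hidden state. It's only a sketch of the standard GRU equations (using the convention where the update gate `z` keeps the old state), with made-up weights; it says nothing about the layer sizes or parameters of the actual Chime or [rnnoise] models.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h_prev, p):
    """One GRU update for a scalar input x and scalar hidden state h_prev."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # Candidate state: the reset gate decides how much old state feeds in.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Blend old state and candidate; z close to 1 keeps the old state.
    return z * h_prev + (1.0 - z) * h_cand


def run(inputs, p, h0=0.0):
    """Feed a sequence through the cell and return every hidden state."""
    states, h = [], h0
    for x in inputs:
        h = gru_step(x, h, p)
        states.append(h)
    return states


# Hypothetical parameters: all weights 1, all biases 0.
params = {"wz": 1.0, "uz": 1.0, "bz": 0.0,
          "wr": 1.0, "ur": 1.0, "br": 0.0,
          "wh": 1.0, "uh": 1.0, "bh": 0.0}

pulse_states = run([1.0, 0.0, 0.0, 0.0], params)   # one input, then silence
silent_states = run([0.0, 0.0, 0.0, 0.0], params)  # silence throughout
```

After the single pulse, the hidden state stays above zero even though every later input is zero, while the all-silent run never leaves zero. That carried-over state is the memory that both GRUs and LSTMs provide.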
