Recent advances in language models have demonstrated impressive zero-shot voice conversion (VC) capabilities. However, prevailing LM-based VC models typically perform offline conversion from source semantics to acoustic features, requiring the entirety of the source speech and limiting their application to real-time scenarios.
In this research, a team of researchers from Northwestern Polytechnical University, China, and ByteDance introduces StreamVoice, a novel streaming language model (LM)-based method for zero-shot voice conversion that enables real-time conversion given any speaker prompt and source speech. StreamVoice achieves streaming capability by employing a fully causal context-aware LM with a temporal-independent acoustic predictor.
This model alternately processes semantic and acoustic features at each autoregressive time step, eliminating the need for the complete source speech. To mitigate potential performance degradation caused by incomplete context in streaming processing, two strategies are employed:
1) Teacher-guided context foresight, in which a teacher model summarizes present and future semantic context during training to guide the model's forecasting of the missing context.
2) A semantic masking strategy, which promotes acoustic prediction from preceding corrupted semantic and acoustic input, enhancing the model's context-learning ability. Notably, StreamVoice stands out as the first LM-based streaming zero-shot VC model without any future look-ahead. Experimental results demonstrate StreamVoice's streaming conversion capability while maintaining zero-shot performance comparable to non-streaming VC systems.
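The two ideas above can be illustrated with a small sketch: interleaving semantic and acoustic tokens so a causal LM only ever needs the speech received so far, and randomly masking past semantic tokens at training time. The token values, the `MASK` id, and the masking probability are all illustrative assumptions, not details from the paper:

```python
import random

MASK = -1  # hypothetical placeholder id for a corrupted semantic token


def interleave(semantic, acoustic):
    """Build the alternating LM input sequence [s1, a1, s2, a2, ...].

    At each autoregressive step the causal LM consumes the semantic
    token for the current frame and predicts its acoustic token, so
    conversion can start without the full source utterance.
    """
    assert len(semantic) == len(acoustic)
    seq = []
    for s, a in zip(semantic, acoustic):
        seq.append(("sem", s))
        seq.append(("ac", a))
    return seq


def semantic_mask(semantic, p=0.3, rng=None):
    """Training-time semantic masking: randomly corrupt past semantic
    tokens so the model learns to predict acoustics from incomplete
    context (p=0.3 is an illustrative choice)."""
    rng = rng or random.Random(0)
    return [MASK if rng.random() < p else s for s in semantic]


# Toy usage: three frames of semantic/acoustic token ids.
seq = interleave([7, 8, 9], [101, 102, 103])
print(seq[:2])                       # [('sem', 7), ('ac', 101)]
print(semantic_mask([7, 8, 9], p=1.0))  # [-1, -1, -1]
```

The interleaved layout is what makes the model "fully causal": nothing to the right of the current frame is ever attended to.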
The figure above illustrates the concept of streaming zero-shot VC using the widely adopted recognition-synthesis framework, on which StreamVoice is built. The experiments show that StreamVoice can convert speech in a streaming fashion, achieving high speaker similarity for both seen and unseen speakers while maintaining performance comparable to non-streaming VC systems. As the first LM-based zero-shot VC model without any future look-ahead, StreamVoice's entire pipeline incurs only 124 ms of latency for the conversion process, which is notably 2.4 times faster than real-time on a single A100 GPU, even without engineering optimizations. The team's future work involves using more training data to improve StreamVoice's modeling ability. They also plan to optimize the streaming pipeline by incorporating a high-fidelity, low-bitrate codec and a unified streaming model.
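The speed claim maps onto the standard real-time factor (RTF), where RTF below 1.0 means faster than real time; "2.4 times faster than real-time" corresponds to roughly 1/2.4 seconds of compute per second of audio. A minimal sketch of that arithmetic (only the 2.4x figure comes from the article):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = compute time / audio duration; < 1.0 means faster than real time."""
    return processing_seconds / audio_seconds


# "2.4x faster than real time" means 1 second of audio takes 1/2.4 s to process.
rtf = real_time_factor(1 / 2.4, 1.0)
print(round(rtf, 3))  # 0.417
```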
Check out the Paper. All credit for this research goes to the researchers of this project.
Janhavi Lande is an Engineering Physics graduate from IIT Guwahati, class of 2023. She is an aspiring data scientist and has been working in the world of ML/AI research for the past two years. She is most fascinated by this ever-changing world and its constant demand for humans to keep up with it. In her free time she enjoys traveling, reading, and writing poems.