Toward General-Purpose Text-Instruction-Guided Voice Conversion

Chun-Yi Kuan1, Chen An Li1, Tsu-Yuan Hsu1, Tse-Yang Lin1, Ho-Lam Chung1
Kai-Wei Chang1, Shuo-yiin Chang2, Hung-yi Lee1
1National Taiwan University, Taiwan
2Google, USA
[Paper on ArXiv]

Dataset

Instruction-guided voice conversion requires three kinds of data: source speech samples, instructions, and target speech samples. Source speech is the original, unmodified utterance, whereas target speech is the output produced after the stylistic change specified by the instruction has been applied. Because few existing datasets include text instructions, we draw inspiration from the InstructSpeech dataset proposed by PromptTTS [1] and assemble a custom dataset that meets our requirements.

[1] PromptTTS: controllable text-to-speech with text descriptions [ArXiv]
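
The three data elements described above can be pictured as a simple record. The following is only a minimal sketch: the class name `ConversionExample` and its field names are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversionExample:
    """One example: (source speech, instruction, target speech).

    Hypothetical container for illustration; not the authors' data format.
    """
    source_path: str   # original, unmodified speech
    instruction: str   # text describing the desired change
    target_path: str   # speech after the instructed change

    def __post_init__(self):
        # An example without an instruction cannot guide the conversion.
        if not self.instruction.strip():
            raise ValueError("instruction must be non-empty")


example = ConversionExample(
    source_path="src.wav",
    instruction="Speed up the audio to a significant degree.",
    target_path="tgt.wav",
)
print(example.instruction)
```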


Demo for Our Results: Here

Signal Processing Effect Dataset

Note:

(1) Refer to the Sox documentation page for a more detailed introduction to and explanation of the commands.

(2) In Sox commands, "src.wav" represents the source speech, while "tgt.wav" stands for the target speech.

| Sox Command | Instruction | Source Speech | Target Speech |
| --- | --- | --- | --- |
| sox -G src.wav tgt.wav chorus 0.7 0.9 55 0.4 0.25 2 -t | Add a conspicuous chorus effect to the audio. | (audio) | (audio) |
| sox -G src.wav tgt.wav delay 1 | Hold off on playing the audio for 1 second. | (audio) | (audio) |
| sox -G src.wav tgt.wav echo 0.9 0.9 10 0.8 | Add an audio effect that generates a low-delay echo and high-volume sound. | (audio) | (audio) |
| sox -G src.wav tgt.wav fade t 5 | Give the audio a gradual increase in volume for 5 seconds from the onset. | (audio) | (audio) |
| sox -G src.wav tgt.wav loudness 10 | Increase the volume of the audio and enhance its impact. | (audio) | (audio) |
| sox -G src.wav tgt.wav repeat 1 | Play the audio twice. | (audio) | (audio) |
| sox -G src.wav tgt.wav reverb | Enlarge the scope and widen the reach of the sound quality. | (audio) | (audio) |
| sox -G src.wav tgt.wav tempo 1.75 | Speed up the audio to a significant degree. | (audio) | (audio) |
| sox -G src.wav tgt.wav vol 1.5 | Boost the volume of the audio for maximum audibility. | (audio) | (audio) |
| sox -G src.wav tgt.wav pitch -250 | Drop the pitch of the audio down to an extreme degree. | (audio) | (audio) |
| sox -G src.wav tgt.wav contrast 100 | Amplifying the sound to deliver a clearer and brighter rendition. | (audio) | (audio) |
| sox -G src.wav tgt.wav reverse | Backtrack the sound. | (audio) | (audio) |
| sox -G src.wav tgt.wav bass -20 | Considerably abate the bass frequencies. | (audio) | (audio) |
| sox -G src.wav tgt.wav bass 20 | Enlarge the depth of the lower frequencies significantly. | (audio) | (audio) |
| sox -G src.wav tgt.wav treble -6 | Mildly decrease the emphasis on the higher frequencies. | (audio) | (audio) |
| sox -G src.wav tgt.wav treble 20 | Intensify the sound of the higher frequencies. | (audio) | (audio) |
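
Each row of the table above pairs a Sox effect with a text instruction, so a (source, instruction, target) record can be produced by running the command on a source file. The sketch below illustrates one way to do this. The function names, the record layout, and the output file names are assumptions made for illustration and do not describe the authors' actual pipeline. By default it only builds the command; pass `run=True` to execute it, which requires the `sox` executable on the PATH.

```python
import shlex
import shutil
import subprocess

# A few (effect arguments, instruction) pairs copied from the table above.
EFFECTS = [
    ("tempo 1.75", "Speed up the audio to a significant degree."),
    ("pitch -250", "Drop the pitch of the audio down to an extreme degree."),
    ("reverse", "Backtrack the sound."),
]


def build_sox_command(src, tgt, effect):
    """Return the argument list for `sox -G <src> <tgt> <effect ...>`."""
    return ["sox", "-G", src, tgt, *shlex.split(effect)]


def make_pair(src, tgt, effect, instruction, run=False):
    """Build one (source, instruction, target) record; optionally run Sox.

    Hypothetical helper: the dictionary keys are illustrative only.
    """
    cmd = build_sox_command(src, tgt, effect)
    if run:
        if shutil.which("sox") is None:
            raise RuntimeError("sox executable not found on PATH")
        subprocess.run(cmd, check=True)  # raises if Sox reports an error
    return {"source": src, "instruction": instruction,
            "target": tgt, "command": cmd}


if __name__ == "__main__":
    for i, (effect, instruction) in enumerate(EFFECTS):
        record = make_pair("src.wav", f"tgt_{i}.wav", effect, instruction)
        print(shlex.join(record["command"]), "|", record["instruction"])
```

Passing the command as an argument list rather than a shell string avoids quoting problems when file names contain spaces.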

InstructSpeech Dataset

| Instruction | Source Speech | Target Speech |
| --- | --- | --- |
| Women bass and quiet, she grunts that. | (audio) | (audio) |
| A boy with a low tone and volume and a high rate, he is murmuring. | (audio) | (audio) |
| Please generate a slow and gaily female sound. | (audio) | (audio) |
| The womanlike voice is raspy but small use this kind of voice to say with happy-go-lucky. | (audio) | (audio) |
| The amused feminine raised his tone of voice and volume at the same time. | (audio) | (audio) |
| A heart-broken woman who speaks slowly, with normal volume and low pitch. | (audio) | (audio) |
| A quiet and sad low pitched female voice reminder. | (audio) | (audio) |
| Give me a bass boy with fast speech speed and small volume to bark. | (audio) | (audio) |
| Please give a baritone for me to outshout and turn its intensity down. | (audio) | (audio) |