BEdit-TTS: Advancing Text-Based Speech Editing and Data Augmentation for ASR
Online Supplement


The code is available in the GitHub repository.

This project is also used in our paper "Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation" (Interspeech 2023).


Abstract

In recent years, text-to-speech (TTS) models have made great progress and found widespread application. Despite the high-fidelity speech produced by TTS models, recorded audio still plays a critical and irreplaceable role in daily communication. In many scenarios, such as audio post-production, it is of practical value to manipulate recorded audio flexibly. However, it remains challenging to "edit" a recording with current TTS models, given the diversity and variability inherent in realistic speech. In this paper, we extend a neural TTS model by incorporating contextual spectrum and prosody features to construct a text-based speech editing system, named BEdit-TTS. The proposed model supports deletion, insertion and replacement operations on a recording. Objective and subjective evaluations on English and Chinese demonstrate the effectiveness of the proposed model and show superior performance over several competitive baseline systems.

Baseline models

All baseline systems are implemented on the same FastSpeech 2 model.
Baseline 1: This system generates the complete speech audio from the edited text.
Baseline 2: This system generates only the short speech segment for the words in the modified region, then inserts that segment into the corresponding position of the original speech.
Baseline 3: This system also generates the complete speech, but only the short segment corresponding to the words in the modified region is cut from the generated speech and inserted into the corresponding position of the original speech.
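
The cut-and-insert step shared by Baselines 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: the page does not say whether the baselines splice waveform samples or Mel-spectrogram frames, so the sketch works on a generic sequence of frames, and the names `Frame`, `cut_segment` and `splice_segment`, as well as the toy frame indices, are hypothetical and not taken from the released code.

```python
from typing import List, Sequence

# Hypothetical representation: one frame is a list of floats (e.g. Mel bins).
Frame = List[float]


def cut_segment(full: Sequence[Frame], start: int, end: int) -> List[Frame]:
    """Return a copy of frames [start, end) of a fully synthesized utterance."""
    if not 0 <= start <= end <= len(full):
        raise ValueError("segment must lie inside the synthesized utterance")
    return [list(f) for f in full[start:end]]


def splice_segment(
    original: Sequence[Frame],
    segment: Sequence[Frame],
    start: int,
    end: int,
) -> List[Frame]:
    """Replace frames [start, end) of `original` with `segment`."""
    if not 0 <= start <= end <= len(original):
        raise ValueError("edit region must lie inside the original utterance")
    return (
        [list(f) for f in original[:start]]
        + [list(f) for f in segment]
        + [list(f) for f in original[end:]]
    )


# Toy data: 6 original frames; the modified region covers frames 2 and 3.
original = [[float(i)] for i in range(6)]

# Baseline 2 style: only the modified region is synthesized, then inserted.
generated_region = [[9.0], [9.0], [9.0]]
edited_b2 = splice_segment(original, generated_region, 2, 4)

# Baseline 3 style: the whole edited sentence is synthesized, the modified
# region is cut out of that synthesis, then inserted into the original.
full_synthesis = [[10.0 + i] for i in range(7)]
region_from_full = cut_segment(full_synthesis, 2, 5)
edited_b3 = splice_segment(original, region_from_full, 2, 4)
```

In both cases the frames outside the edit region come unchanged from the original recording, which is why any mismatch at the two splice boundaries is audible in these baselines.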

Replacement

Each sample has 5 audio clips: the original recording and 4 clips synthesized by the baseline and proposed systems.

Original text: Or whatever was the present task.
Edited text: Or whatever was the current task.
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: Many others have already gone broke.
Edited text: Some others have already gone broke.
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: And upon them were set sails patterned after the wonderful new invention of master fletcher of rye.
Edited text: And upon them were set sails patterned after the wonderful new invention of inventor fletcher of rye.
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: But I stood over him until he had done his work thoroughly.
Edited text: But I stood over him until he had done his book thoroughly.
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: My old remembrances went from me wholly and all the ways of men so vain and melancholy.
Edited text: My old remembrances went from me partly and all the ways of men so vain and melancholy.
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: 星期六对吧。
Edited text: 星期一和星期二对吧。
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: 珍珠湖。
Edited text: 涵泽湖。
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: 安徽六岁女童被伤案告破女童母亲及同居男子被拘。
Edited text: 安徽六岁女童被伤案告破犯罪嫌疑人被拘。
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: 这标志着网络安全已经上升为国家的重要战略。
Edited text: 这标志着环境保护已经上升为国家的重要战略。
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: 陈晓东也因为女有了女儿变得更加温暖。
Edited text: 陈晓东也因为女儿结婚变得更加温暖。
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS

Insertion

Each sample has 5 audio clips: the original recording and 4 clips synthesized by the baseline and proposed systems.

Short

Original text: Irreverently tearing open her mother's telegram and reading it as she came.
Edited text: Irreverently tearing open her young mother's telegram and reading it as she came.
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: Or whatever was the present task.
Edited text: Or whatever was the new present task.
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: Many others have already gone broke.
Edited text: There're many others have already gone broke.
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: Made open ports below of no necessity.
Edited text: Made open large ports below of no necessity.
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: You are no longer in the midst of broken desolate wastes.
Edited text: You are no longer in the middle midst of broken desolate wastes.
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: 梁家仁出演的电视剧有什么。
Edited text: 梁家仁曾经出演的电视剧有什么。
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: 来一首故乡。
Edited text: 来一首关于故乡。
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: 又被电梯外门夹住头和身子。
Edited text: 又被突然关闭的电梯外门夹住头和身子。
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: 河南平顶山化工厂氨气泄漏小孩儿咳血动物瘫倒。
Edited text: 河南平顶山化工厂发生氨气泄漏小孩儿咳血动物瘫倒。
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS
Original text: 之后将学校告上了法庭。
Edited text: 之后将自己的学校告上了法庭。
Original audio Baseline 1 Baseline 2 Baseline 3 BEdit-TTS

Deletion

BEdit-TTS can perform the deletion operation in two ways.
delete-1: The first simply deletes the corresponding Mel-spectrogram features and feeds the remaining features to the vocoder for audio generation.
delete-2: The second recasts the deletion as a replacement: the selected words and one additional neighboring word are removed, and the region is then replaced with that neighboring word, which achieves the deletion.
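
The two deletion modes can be sketched in a few lines. This is a minimal illustration, not the released implementation: Mel-spectrogram frames are represented as plain lists, the function names are hypothetical, and the page does not say which neighboring word delete-2 uses, so the sketch assumes the right neighbor and falls back to the left one at the end of the sentence. The example sentence with an extra word "new" is also made up for illustration.

```python
from typing import List, Sequence, Tuple


def delete_frames(
    mel: Sequence[List[float]], start: int, end: int
) -> List[List[float]]:
    """delete-1: drop Mel frames [start, end); the result goes to the vocoder."""
    if not 0 <= start <= end <= len(mel):
        raise ValueError("deleted region must lie inside the utterance")
    return [list(f) for f in mel[:start]] + [list(f) for f in mel[end:]]


def deletion_as_replacement(
    words: Sequence[str], del_start: int, del_end: int
) -> Tuple[Tuple[int, int], List[str], List[str]]:
    """delete-2: express deleting words[del_start:del_end] as a replacement.

    Returns (region, replacement, edited_words), where `region` is the word
    span to regenerate and `replacement` is the text to synthesize for it.
    """
    if not 0 <= del_start < del_end <= len(words):
        raise ValueError("deleted words must lie inside the sentence")
    if del_end < len(words):
        # Assumption: extend the region by the right neighboring word.
        region = (del_start, del_end + 1)
        replacement = [words[del_end]]
    elif del_start > 0:
        # At the end of the sentence, fall back to the left neighbor.
        region = (del_start - 1, del_end)
        replacement = [words[del_start - 1]]
    else:
        raise ValueError("cannot delete the whole sentence")
    edited = list(words[: region[0]]) + replacement + list(words[region[1]:])
    return region, replacement, edited


# Hypothetical original sentence; deleting "new" yields the edited text.
words = ["or", "whatever", "was", "the", "new", "present", "task"]
region, replacement, edited = deletion_as_replacement(words, 4, 5)
```

With delete-2 the model regenerates the neighboring word in context, so the boundary left by the deletion is synthesized rather than hard-cut as in delete-1.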

Edited text: And still spoke of her nephew dick with bated breath and a sigh.
Original audio BEdit-TTS delete-1 BEdit-TTS delete-2
Edited text: Or whatever was the present task.
Original audio BEdit-TTS delete-1 BEdit-TTS delete-2
Edited text: But I stood over him until he had done his work thoroughly.
Original audio BEdit-TTS delete-1 BEdit-TTS delete-2
Edited text: I should simply have become a lunatic.
Original audio BEdit-TTS delete-1 BEdit-TTS delete-2
Edited text: And most people had begun to use pink and blue wool on their needles.
Original audio BEdit-TTS delete-1 BEdit-TTS delete-2

Insertion Length Robustness

In this section, using insertion as an example, we demonstrate BEdit-TTS's ability to perform text-based speech editing when the edited region varies in length.

Short insertion samples are shown in the Insertion section above.

Original text: The pleasant season did my heart employ.
Edited text (middle): The pleasant spring summer autumn and winter season did my heart employ.
Edited text (long): The pleasant spring summer autumn winter spring summer autumn and winter season did my heart employ.
Original audio Middle Long Full Sentence
Original text: And nothing but the truth.
Edited text (middle): And tell me and tell him nothing but the truth.
Edited text (long): And tell me, tell him, tell her, tell you nothing but the truth.
Original audio Middle Long Full Sentence
Original text: What if my nephew Dick should be needing one.
Edited text (middle): What if my nephew Dick comes back to his room should be needing one.
Edited text (long): What if my nephew Dick coming back to his room and doing homework should be needing one.
Original audio Middle Long Full Sentence
Original text: And most people had begun to use pink and blue wool on their needles.
Edited text (middle): And most sensible and beautiful people had begun to use pink and blue wool on their needles.
Edited text (long): And most sensible and beautiful and sensible and beautiful people had begun to use pink and blue wool on their needles.
Original audio Middle Long Full Sentence

Reconstruction

In this section, we demonstrate BEdit-TTS's reconstruction ability on unseen speakers.

Text: They were old friends.
Original audio Vocoder BEdit-TTS
Text: We are prostrated and worn out with fatigue.
Original audio Vocoder BEdit-TTS
Text: The milk is very good.
Original audio Vocoder BEdit-TTS
Text: Approaching the dining table he carefully placed the article in the centre and removed the cloth.
Original audio Vocoder BEdit-TTS
Text: There was in that city a young cavalier about two and twenty years of age whom wealth high birth a wayward disposition inordinate indulgence and profligate companions impelled to do things which disgraced his rank.
Original audio Vocoder BEdit-TTS
Text: 我放弃这个想法。
Original audio Vocoder BEdit-TTS
Text: 汽车厂商几乎承包了一半的面积。
Original audio Vocoder BEdit-TTS
Text: 给我切换到东南卫视。
Original audio Vocoder BEdit-TTS
Text: 整体定位和发展战略不会发生变化。
Original audio Vocoder BEdit-TTS
Text: 南京雨花台警方捣毁一诈骗团伙。
Original audio Vocoder BEdit-TTS