E3TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications
Online Supplement


The code is available in the GitHub repository.


Abstract

Text-based speech editing aims to manipulate part of a real audio recording by modifying the corresponding transcribed text, such that the edit is imperceptible to the human auditory system. With the enhanced capability of neural text-to-speech (TTS), researchers have tried to tackle speech editing problems with TTS methods. In this paper, we propose E3TTS, an end-to-end text-based speech editing TTS system, which combines a text encoder, a speech encoder, and a joint net for speech synthesis and speech editing. E3TTS can insert, replace, and delete speech content at will by manipulating the given text. Experiments show that our speech editing outperforms strong baselines on the HiFiTTS and LibriTTS datasets, whose speakers are seen and unseen, respectively. Further, we apply E3TTS to data augmentation for automatic speech recognition (ASR) to mitigate the data insufficiency problem in code-switching and named entity recognition scenarios. Compared to past data augmentation methods, E3TTS retains the coherence and realism of the recorded audio. The experimental results show significant performance improvements over baseline systems with traditional TTS-based data augmentation.

Baseline models

All baseline systems are implemented on the same FastSpeech 2 model.
Baseline 1: This system generates the complete speech audio from the edited text.
Baseline 2: This system generates only the short speech segment for the words in the modified region, then inserts that segment into the corresponding position of the original speech.
Baseline 3: This system also generates the complete speech, but the short speech segment corresponding to the words in the modified region is cut from the generated speech and inserted into the corresponding position of the original speech.
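
The splicing step shared by Baseline 2 and Baseline 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes speech is represented as a list of frames (it could equally be waveform samples), and that the frame boundaries of the modified region are already known, for example from a forced aligner. The names `splice_segment` and `cut_segment` are hypothetical.

```python
from typing import List, Sequence

# Hypothetical representation: one frame is a list of floats
# (e.g. one Mel-spectrogram frame). Waveform samples would work the same way.
Frame = List[float]


def splice_segment(original: Sequence[Frame],
                   generated: Sequence[Frame],
                   start: int,
                   end: int) -> List[Frame]:
    """Replace original[start:end] with the generated segment.

    start/end are the (assumed known) frame boundaries of the modified
    region in the original speech.
    """
    if not 0 <= start <= end <= len(original):
        raise ValueError("edit region must lie inside the original speech")
    return list(original[:start]) + list(generated) + list(original[end:])


def cut_segment(full_generation: Sequence[Frame],
                gen_start: int,
                gen_end: int) -> List[Frame]:
    """Baseline 3 only: cut the modified region out of a complete generation."""
    if not 0 <= gen_start <= gen_end <= len(full_generation):
        raise ValueError("cut region must lie inside the generated speech")
    return list(full_generation[gen_start:gen_end])


# Toy data standing in for real features.
original = [[float(i)] for i in range(6)]

# Baseline 2: only the short segment is synthesized, then spliced in.
short_segment = [[10.0], [11.0], [12.0]]
baseline2 = splice_segment(original, short_segment, 2, 4)

# Baseline 3: a complete utterance is synthesized, the edited region is
# cut out of it, and that cut is spliced into the original.
full_generation = [[100.0 + i] for i in range(8)]
baseline3 = splice_segment(original, cut_segment(full_generation, 3, 5), 2, 4)
```

In this sketch the two baselines differ only in where the inserted segment comes from: Baseline 2 synthesizes it without the surrounding words, while Baseline 3 cuts it from a generation of the whole sentence.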

Replacement

Each sample has five audio clips: the original recording and four clips synthesized by the three baseline systems and the proposed system.

Original text: Or whatever was the present task.
Edited text: Or whatever was the current task.
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: Many others have already gone broke.
Edited text: Some others have already gone broke.
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: And upon them were set sails patterned after the wonderful new invention of master fletcher of rye.
Edited text: And upon them were set sails patterned after the wonderful new invention of inventor fletcher of rye.
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: But I stood over him until he had done his work thoroughly.
Edited text: But I stood over him until he had done his book thoroughly.
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: My old remembrances went from me wholly and all the ways of men so vain and melancholy.
Edited text: My old remembrances went from me partly and all the ways of men so vain and melancholy.
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: 星期六对吧。
Edited text: 星期一和星期二对吧。
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: 珍珠湖。
Edited text: 涵泽湖。
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: 安徽六岁女童被伤案告破女童母亲及同居男子被拘。
Edited text: 安徽六岁女童被伤案告破犯罪嫌疑人被拘。
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: 这标志着网络安全已经上升为国家的重要战略。
Edited text: 这标志着环境保护已经上升为国家的重要战略。
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: 陈晓东也因为女有了女儿变得更加温暖。
Edited text: 陈晓东也因为女儿结婚变得更加温暖。
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS

Insertion

Each sample has five audio clips: the original recording and four clips synthesized by the three baseline systems and the proposed system.

Short

Original text: Irreverently tearing open her mother's telegram and reading it as she came.
Edited text: Irreverently tearing open her young mother's telegram and reading it as she came.
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: Or whatever was the present task.
Edited text: Or whatever was the new present task.
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: Many others have already gone broke.
Edited text: There're many others have already gone broke.
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: Made open ports below of no necessity.
Edited text: Made open large ports below of no necessity.
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: You are no longer in the midst of broken desolate wastes.
Edited text: You are no longer in the middle midst of broken desolate wastes.
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: 梁家仁出演的电视剧有什么。
Edited text: 梁家仁曾经出演的电视剧有什么。
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: 来一首故乡。
Edited text: 来一首关于故乡。
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: 又被电梯外门夹住头和身子。
Edited text: 又被突然关闭的电梯外门夹住头和身子。
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: 河南平顶山化工厂氨气泄漏小孩儿咳血动物瘫倒。
Edited text: 河南平顶山化工厂发生氨气泄漏小孩儿咳血动物瘫倒。
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS
Original text: 之后将学校告上了法庭。
Edited text: 之后将自己的学校告上了法庭。
Original audio Baseline 1 Baseline 2 Baseline 3 E3TTS

Deletion

E3TTS can perform deletion in two ways.
delete-1: The first simply deletes the corresponding Mel-spectrogram features and feeds the remaining features to the vocoder for audio generation.
delete-2: The second recasts deletion as replacement: the selected words and one neighboring word are removed, and the neighboring word is then synthesized back in their place.
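
The two deletion modes can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the Mel-spectrogram is a list of frames, frame and word boundaries are already known, and the function names are hypothetical. The source does not say which neighboring word delete-2 uses, so this sketch assumes the following word and falls back to the preceding word when the deletion reaches the end of the sentence.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # hypothetical: one Mel-spectrogram frame


def delete_frames(mel: Sequence[Frame], start: int, end: int) -> List[Frame]:
    """delete-1: drop frames [start, end); the result goes to the vocoder."""
    if not 0 <= start <= end <= len(mel):
        raise ValueError("deleted region must lie inside the spectrogram")
    return list(mel[:start]) + list(mel[end:])


def deletion_as_replacement(words: Sequence[str],
                            del_start: int,
                            del_end: int) -> Tuple[int, int, List[str]]:
    """delete-2: widen the deleted span words[del_start:del_end] by one
    neighboring word and return (region_start, region_end, replacement),
    so the edit becomes "replace words[region_start:region_end] with
    replacement".

    Assumption: prefer the following word as the neighbor, otherwise
    use the preceding word.
    """
    if not 0 <= del_start < del_end <= len(words):
        raise ValueError("deleted span must be non-empty and inside the text")
    if del_end < len(words):
        return del_start, del_end + 1, [words[del_end]]
    if del_start > 0:
        return del_start - 1, del_end, [words[del_start - 1]]
    raise ValueError("cannot delete the entire sentence")


# Hypothetical example: remove "new" from a seven-word sentence.
words = ["Or", "whatever", "was", "the", "new", "present", "task"]
region_start, region_end, replacement = deletion_as_replacement(words, 4, 5)
edited_words = words[:region_start] + replacement + words[region_end:]
```

Here `edited_words` no longer contains the deleted word, and the region `words[region_start:region_end]` is what a replacement-style edit would regenerate, which lets the model re-synthesize the boundary around the deleted words rather than leaving a hard cut.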

Edited text: And still spoke of her nephew dick with bated breath and a sigh.
Original audio E3TTS delete-1 E3TTS delete-2
Edited text: Or whatever was the present task.
Original audio E3TTS delete-1 E3TTS delete-2
Edited text: But I stood over him until he had done his work thoroughly.
Original audio E3TTS delete-1 E3TTS delete-2
Edited text: I should simply have become a lunatic.
Original audio E3TTS delete-1 E3TTS delete-2
Edited text: And most people had begun to use pink and blue wool on their needles.
Original audio E3TTS delete-1 E3TTS delete-2

Insertion Length Robustness

In this section, taking insertion as an example, we demonstrate E3TTS's ability to perform text-based speech editing with edited regions of different lengths.

Short insertion samples are shown in the Insertion section above.

Original text: The pleasant season did my heart employ.
Edited text (middle): The pleasant spring summer autumn and winter season did my heart employ.
Edited text (long): The pleasant spring summer autumn winter spring summer autumn and winter season did my heart employ.
Original audio Middle Long Full Sentence
Original text: And nothing but the truth.
Edited text (middle): And tell me and tell him nothing but the truth.
Edited text (long): And tell me, tell him, tell her, tell you nothing but the truth.
Original audio Middle Long Full Sentence
Original text: What if my nephew Dick should be needing one.
Edited text (middle): What if my nephew Dick comes back to his room should be needing one.
Edited text (long): What if my nephew Dick coming back to his room and doing homework should be needing one.
Original audio Middle Long Full Sentence
Original text: And most people had begun to use pink and blue wool on their needles.
Edited text (middle): And most sensible and beautiful people had begun to use pink and blue wool on their needles.
Edited text (long): And most sensible and beautiful and sensible and beautiful people had begun to use pink and blue wool on their needles.
Original audio Middle Long Full Sentence

Reconstruction

In this section, we demonstrate E3TTS's reconstruction ability in the unseen-speaker setting.

Text: They were old friends.
Original audio Vocoder E3TTS
Text: We are prostrated and worn out with fatigue.
Original audio Vocoder E3TTS
Text: The milk is very good.
Original audio Vocoder E3TTS
Text: Approaching the dining table he carefully placed the article in the centre and removed the cloth.
Original audio Vocoder E3TTS
Text: There was in that city a young cavalier about two and twenty years of age whom wealth high birth a wayward disposition inordinate indulgence and profligate companions impelled to do things which disgraced his rank.
Original audio Vocoder E3TTS
Text: 我放弃这个想法。
Original audio Vocoder E3TTS
Text: 汽车厂商几乎承包了一半的面积。
Original audio Vocoder E3TTS
Text: 给我切换到东南卫视。
Original audio Vocoder E3TTS
Text: 整体定位和发展战略不会发生变化。
Original audio Vocoder E3TTS
Text: 南京雨花台警方捣毁一诈骗团伙。
Original audio Vocoder E3TTS