Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Fusion
ABSTRACT
Conversational Text-to-Speech (CTTS) aims to express an utterance accurately and with an appropriate style within a conversational setting, and it has attracted increasing attention. Although the CTTS task is recognized as significant, prior studies have not thoroughly investigated speech emphasis expression, which is essential for conveying the underlying intention and attitude in human-machine interaction scenarios. This gap is due to the scarcity of conversational emphasis datasets and the difficulty of context understanding. In this paper, we propose a novel Emphasis Rendering scheme for CTTS, termed ER-CTTS, that includes two main components: 1) we take textual and acoustic contexts into account simultaneously, with both global and local semantic modeling, to comprehensively understand the conversation context; 2) we deeply integrate multi-modal and multi-scale context to learn the influence of context on the emphasis expression of the current utterance. Finally, the inferred emphasis feature is fed to the neural speech synthesizer to generate the conversational speech. To address the data scarcity, we create emphasis intensity annotations on an existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in emphasis rendering within a conversational setting. The code and audio samples can be found at https://github.com/CodeStoreTTS/ER-CTTS.
EXPERIMENTS
Comparative Experiment:
1) FastSpeech2 is a TTS model without emphasis and contextual modeling, representing state-of-the-art non-dialogue TTS systems.
2) FastSpeech2 w/ Emphasis synthesizes emphasized speech for individual sentences. It uses FastSpeech2 as the backbone and changes the degree of emphasis by adjusting specific acoustic features, such as pitch and energy.
3) DailyTalk is an advanced CTTS baseline. It models coarse-grained textual context from the dialogue history to enhance speech expressiveness.
4) FCTalker further employs coarse-grained and fine-grained context modeling.
5) M2-CTTS further employs coarse-grained and fine-grained context modeling of both text and audio.
6) GCN adopts a homogeneous graph to model the multi-modal context in the conversation.
7) ECSS is a powerful expressive conversational TTS model. It uses heterogeneous graph-based context modeling to achieve expressive rendering for CTTS.
8) ER-CTTS is our proposed emphasis rendering model for conversational speech synthesis.
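Item 2 above describes changing the degree of emphasis by adjusting acoustic features such as pitch and energy. The following is a minimal sketch of that general idea only, not the implementation of any system listed here. The function name `scale_for_emphasis`, the gain values, the assumption that intensity lies in [0, 1], and the toy numbers are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def scale_for_emphasis(
    pitch: List[float],
    energy: List[float],
    intensity: List[float],
    pitch_gain: float = 0.2,   # hypothetical gain, not taken from the paper
    energy_gain: float = 0.5,  # hypothetical gain, not taken from the paper
) -> Tuple[List[float], List[float]]:
    """Raise per-token pitch and energy in proportion to emphasis intensity.

    Intensity is assumed to lie in [0, 1]; 0 leaves a token unchanged.
    """
    if not (len(pitch) == len(energy) == len(intensity)):
        raise ValueError("pitch, energy and intensity must have equal length")
    new_pitch = [p * (1.0 + pitch_gain * w) for p, w in zip(pitch, intensity)]
    new_energy = [e * (1.0 + energy_gain * w) for e, w in zip(energy, intensity)]
    return new_pitch, new_energy


# Toy example: emphasize "Thursday" in "How about Thursday?"
tokens = ["How", "about", "Thursday"]
pitch = [200.0, 200.0, 200.0]   # Hz, made-up values
energy = [1.0, 1.0, 1.0]        # arbitrary units, made-up values
intensity = [0.0, 0.0, 1.0]     # hypothetical per-token emphasis intensity

new_pitch, new_energy = scale_for_emphasis(pitch, energy, intensity)
```

Only the token with non-zero intensity is modified; the other tokens keep their original pitch and energy values.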
Ablation Experiment:
9) ER-CTTS w/o Coarse-grained Encoders
10) ER-CTTS w/o Fine-grained Encoders
11) ER-CTTS w/o Hybrid-grained Fusion
12) ER-CTTS w/o Cross-modality Fusion
13) ER-CTTS w/o Bidirectional Context Modeling
14) ER-CTTS w/o Memory Enhancement
15) ER-CTTS w/o Emphasis Intensity
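Each ablation variant above removes exactly one component from the full model. One way to organize such a study, sketched here under assumptions, is a configuration object with one switch per component. The class `AblationConfig`, its field names, and the helper `ablation_variants` are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass, fields, replace
from typing import Dict


@dataclass(frozen=True)
class AblationConfig:
    # One flag per component named in the ablation list (names are hypothetical).
    coarse_grained_encoders: bool = True
    fine_grained_encoders: bool = True
    hybrid_grained_fusion: bool = True
    cross_modality_fusion: bool = True
    bidirectional_context_modeling: bool = True
    memory_enhancement: bool = True
    emphasis_intensity: bool = True


def ablation_variants(full: AblationConfig) -> Dict[str, AblationConfig]:
    """Return one variant per component, each with exactly one flag disabled."""
    return {f.name: replace(full, **{f.name: False}) for f in fields(full)}


variants = ablation_variants(AblationConfig())
for name, cfg in variants.items():
    print(f"w/o {name}: {cfg}")
```

Keeping the full configuration as the default and deriving each variant from it makes it harder to disable two components by accident in a single run.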
That'd be great. What kind of camera do you have?
Conversation history
text
speech
1st
So... what kind of things do you do in your free time?
2nd
Umm I'm really into watching foreign films. What about you?
3rd
I like to do just about anything outdoors. Do you enjoy camping?
4th
Camping for an evening is OK, but I couldn't do it for much longer than one night!
5th
Have you ever been camping in the Boundary Waters?
6th
No, but I've always wanted to do that. I've heard it's a beautiful place to go.
7th
It's fantastic. My family and I are very fond of the place.
8th
Do you have any photos of any of your camping trips there?
9th
Sure, would you like to see them?
current
text
Ground Truth
FastSpeech2
FastSpeech2 w/ Emphasis
DailyTalk
FCTalker
M2-CTTS
GCN
ECSS
ER-CTTS
10th
That'd be great. What kind of camera do you have?
ER-CTTS w/o Coarse-grained Encoders
ER-CTTS w/o Fine-grained Encoders
ER-CTTS w/o Hybrid-grained Fusion
ER-CTTS w/o Cross-modality Fusion
ER-CTTS w/o Bidirectional Context Modeling
ER-CTTS w/o Memory Enhancement
ER-CTTS w/o Emphasis Intensity
All you hear is the crickets and the breeze.
Conversation history
text
speech
1st
Cheers! To our first night in our new apartment.
2nd
It's so quiet. I'm not really used to it.
current
text
Ground Truth
FastSpeech2
FastSpeech2 w/ Emphasis
DailyTalk
FCTalker
M2-CTTS
GCN
ECSS
ER-CTTS
3rd
All you hear is the crickets and the breeze.
ER-CTTS w/o Coarse-grained Encoders
ER-CTTS w/o Fine-grained Encoders
ER-CTTS w/o Hybrid-grained Fusion
ER-CTTS w/o Cross-modality Fusion
ER-CTTS w/o Bidirectional Context Modeling
ER-CTTS w/o Memory Enhancement
ER-CTTS w/o Emphasis Intensity
No problem. Step in, please.
Conversation history
text
speech
1st
Is this taxi taken?
2nd
No, madam. May I help you?
3rd
Yes. Will you take me to the station?
current
text
Ground Truth
FastSpeech2
FastSpeech2 w/ Emphasis
DailyTalk
FCTalker
M2-CTTS
GCN
ECSS
ER-CTTS
4th
No problem. Step in, please.
ER-CTTS w/o Coarse-grained Encoders
ER-CTTS w/o Fine-grained Encoders
ER-CTTS w/o Hybrid-grained Fusion
ER-CTTS w/o Cross-modality Fusion
ER-CTTS w/o Bidirectional Context Modeling
ER-CTTS w/o Memory Enhancement
ER-CTTS w/o Emphasis Intensity
How about Thursday?
Conversation history
text
speech
1st
Why don't you come round for a meal one evening next week?
2nd
I'd love to.
3rd
Which day would suit you?
4th
Any day except Tuesday.
current
text
Ground Truth
FastSpeech2
FastSpeech2 w/ Emphasis
DailyTalk
FCTalker
M2-CTTS
GCN
ECSS
ER-CTTS
5th
How about Thursday?
ER-CTTS w/o Coarse-grained Encoders
ER-CTTS w/o Fine-grained Encoders
ER-CTTS w/o Hybrid-grained Fusion
ER-CTTS w/o Cross-modality Fusion
ER-CTTS w/o Bidirectional Context Modeling
ER-CTTS w/o Memory Enhancement
ER-CTTS w/o Emphasis Intensity
It could be the battery. Let me check it.
Conversation history
text
speech
1st
Excuse me, could you please take a picture for us?
2nd
Sure. Umm where would you like to stand?
3rd
Over here with the waterfall in the background, please.
4th
OK.
5th
Then just press the black button all the way down.
6th
Are you ready? Here we go. Say Cheese!
current
text
Ground Truth
FastSpeech2
FastSpeech2 w/ Emphasis
DailyTalk
FCTalker
M2-CTTS
GCN
ECSS
ER-CTTS
7th
It could be the battery. Let me check it.
ER-CTTS w/o Coarse-grained Encoders
ER-CTTS w/o Fine-grained Encoders
ER-CTTS w/o Hybrid-grained Fusion
ER-CTTS w/o Cross-modality Fusion
ER-CTTS w/o Bidirectional Context Modeling
ER-CTTS w/o Memory Enhancement
ER-CTTS w/o Emphasis Intensity
What’s your rule about pets?
Conversation history
text
speech
1st
Hi, I’m the superintendent of this building. What can I do for you?
2nd
Hi, I’m Paul. Could you show me the apartment on the first floor?