Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Fusion
ABSTRACT
Conversational Text-to-Speech (CTTS) aims to express an utterance with the appropriate style within a conversational setting, and has attracted increasing attention. Although the importance of the CTTS task is recognized, prior studies have not thoroughly investigated speech emphasis expression, which is essential for conveying the underlying intention and attitude in human-machine interaction, because conversational emphasis datasets are scarce and context understanding is difficult. In this paper, we propose a novel Emphasis Rendering scheme for CTTS models, termed ER-CTTS, that includes two main components: 1) we take into account textual and acoustic contexts simultaneously, with both global and local semantic modeling, to comprehensively understand the conversation context; 2) we deeply integrate multi-modal and multi-scale context to learn the influence of context on the emphasis expression of the current utterance. Finally, the inferred emphasis feature is fed to the neural speech synthesizer to generate the conversational speech. To address the data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in emphasis rendering within a conversational setting. The code and audio samples can be found at https://github.com/CodeStoreTTS/ER-CTTS.
EXPERIMENTS
Comparative Experiment:
1) FastSpeech2 is a TTS model without emphasis and contextual modeling, representing state-of-the-art non-dialogue TTS systems.
2) FastSpeech2 w/ Emphasis synthesizes emphasized speech for individual sentences. It uses FastSpeech2 as the backbone and changes the degree of emphasis by adjusting specific acoustic features such as pitch and energy.
3) DailyTalk is an advanced CTTS baseline. It proposes coarse-grained text context modeling over the dialogue history to enhance speech expressiveness.
4) FCTalker further employs both coarse-grained and fine-grained context modeling.
5) M2-CTTS further employs coarse-grained and fine-grained context modeling of both text and audio.
6) GCN adopts a homogeneous graph to model the multi-modal context in conversation.
7) ECSS is a powerful expressive conversational TTS model. It uses heterogeneous graph-based context modeling to achieve expressive rendering for CTTS.
8) ER-CTTS is our proposed emphasis rendering model for conversational speech synthesis.
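To make the idea behind baseline 2 concrete, the following is a minimal sketch of changing the degree of emphasis by scaling pitch and energy. It is not the baseline's actual implementation: the function name apply_emphasis, the gain parameters, and the assumption that emphasis intensity is a per-token value in [0, 1] are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def apply_emphasis(
    pitch: Sequence[float],
    energy: Sequence[float],
    intensity: Sequence[float],
    pitch_gain: float = 0.2,   # hypothetical default, not from the paper
    energy_gain: float = 0.3,  # hypothetical default, not from the paper
) -> Tuple[List[float], List[float]]:
    """Scale per-token pitch and energy by an emphasis intensity.

    Assumption: intensity is a per-token value; it is clamped to [0, 1],
    where 0 leaves the token unchanged and 1 applies the full gain.
    """
    if not (len(pitch) == len(energy) == len(intensity)):
        raise ValueError("pitch, energy and intensity must have the same length")

    new_pitch: List[float] = []
    new_energy: List[float] = []
    for p, e, w in zip(pitch, energy, intensity):
        w = min(max(w, 0.0), 1.0)  # clamp to the assumed [0, 1] range
        new_pitch.append(p * (1.0 + pitch_gain * w))
        new_energy.append(e * (1.0 + energy_gain * w))
    return new_pitch, new_energy
```

In a FastSpeech2-style pipeline, scaling of this kind would plausibly be applied to the predicted pitch and energy values before decoding, so that tokens with higher intensity are rendered higher and louder.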
Ablation Experiment:
9) ER-CTTS w/o Coarse-grained Encoders
10) ER-CTTS w/o Fine-grained Encoders
11) ER-CTTS w/o Hybrid-grained Fusion
12) ER-CTTS w/o Cross-modality Fusion
13) ER-CTTS w/o Bidirectional Context Modeling
14) ER-CTTS w/o Memory Enhancement
15) ER-CTTS w/o Emphasis Intensity
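The ablation list above removes one component at a time from the full model. A minimal sketch of how such variants could be enumerated is given below. The class ERCTTSConfig, its field names, and the function ablation_variants are hypothetical and are not taken from the released code; only the component names come from the list above.

```python
from dataclasses import dataclass, fields, replace
from typing import Dict


@dataclass(frozen=True)
class ERCTTSConfig:
    """Hypothetical switches, one per ablated component in the list above."""
    coarse_grained_encoders: bool = True
    fine_grained_encoders: bool = True
    hybrid_grained_fusion: bool = True
    cross_modality_fusion: bool = True
    bidirectional_context_modeling: bool = True
    memory_enhancement: bool = True
    emphasis_intensity: bool = True


def ablation_variants(full: ERCTTSConfig) -> Dict[str, ERCTTSConfig]:
    """Return one configuration per component, with only that component disabled."""
    variants: Dict[str, ERCTTSConfig] = {}
    for f in fields(full):
        variants[f"ER-CTTS w/o {f.name}"] = replace(full, **{f.name: False})
    return variants
```

Each returned configuration differs from the full model in exactly one switch, which matches the one-component-at-a-time design of systems 9) to 15).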

That'd be great. What kind of camera do you have?
Conversation history text speech
1st So... what kind of things do you do in your free time?
2nd Umm I'm really into watching foreign films. What about you?
3rd I like to do just about anything outdoors. Do you enjoy camping?
4th Camping for an evening is OK, but I couldn't do it for much longer than one night!
5th Have you ever been camping in the Boundary Waters?
6th No, but I've always wanted to do that. I've heard it's a beautiful place to go.
7th It's fantastic. My family and I are very fond of the place.
8th Do you have any photos of any of your camping trips there?
9th Sure, would you like to see them?
current text Ground Truth FastSpeech2 FastSpeech2 w/ Emphasis DailyTalk FCTalker M2-CTTS GCN ECSS ER-CTTS
10th That'd be great. What kind of camera do you have?
ER-CTTS w/o Coarse-grained Encoders ER-CTTS w/o Fine-grained Encoders ER-CTTS w/o Hybrid-grained Fusion ER-CTTS w/o Cross-modality Fusion ER-CTTS w/o Bidirectional Context Modeling ER-CTTS w/o Memory Enhancement ER-CTTS w/o Emphasis Intensity
All you hear is the crickets and the breeze.
Conversation history text speech
1st Cheers! To our first night in our new apartment.
2nd It's so quiet. I'm not really used to it.
current text Ground Truth FastSpeech2 FastSpeech2 w/ Emphasis DailyTalk FCTalker M2-CTTS GCN ECSS ER-CTTS
3rd All you hear is the crickets and the breeze.
ER-CTTS w/o Coarse-grained Encoders ER-CTTS w/o Fine-grained Encoders ER-CTTS w/o Hybrid-grained Fusion ER-CTTS w/o Cross-modality Fusion ER-CTTS w/o Bidirectional Context Modeling ER-CTTS w/o Memory Enhancement ER-CTTS w/o Emphasis Intensity
No problem. Step in, please.
Conversation history text speech
1st Is this taxi taken?
2nd No, madam. May I help you?
3rd Yes. Will you take me to the station?
current text Ground Truth FastSpeech2 FastSpeech2 w/ Emphasis DailyTalk FCTalker M2-CTTS GCN ECSS ER-CTTS
4th No problem. Step in, please.
ER-CTTS w/o Coarse-grained Encoders ER-CTTS w/o Fine-grained Encoders ER-CTTS w/o Hybrid-grained Fusion ER-CTTS w/o Cross-modality Fusion ER-CTTS w/o Bidirectional Context Modeling ER-CTTS w/o Memory Enhancement ER-CTTS w/o Emphasis Intensity
How about Thursday?
Conversation history text speech
1st Why don't you come round for a meal one evening next week?
2nd I'd love to.
3rd Which day would suit you?
4th Any day except Tuesday.
current text Ground Truth FastSpeech2 FastSpeech2 w/ Emphasis DailyTalk FCTalker M2-CTTS GCN ECSS ER-CTTS
5th How about Thursday?
ER-CTTS w/o Coarse-grained Encoders ER-CTTS w/o Fine-grained Encoders ER-CTTS w/o Hybrid-grained Fusion ER-CTTS w/o Cross-modality Fusion ER-CTTS w/o Bidirectional Context Modeling ER-CTTS w/o Memory Enhancement ER-CTTS w/o Emphasis Intensity
It could be the battery. Let me check it.
Conversation history text speech
1st Excuse me, could you please take a picture for us?
2nd Sure. Umm where would you like to stand?
3rd Over here with the waterfall in the background, please.
4th OK.
5th Then just press the black button all the way down.
6th Are you ready? Here we go. Say Cheese!
current text Ground Truth FastSpeech2 FastSpeech2 w/ Emphasis DailyTalk FCTalker M2-CTTS GCN ECSS ER-CTTS
7th It could be the battery. Let me check it.
ER-CTTS w/o Coarse-grained Encoders ER-CTTS w/o Fine-grained Encoders ER-CTTS w/o Hybrid-grained Fusion ER-CTTS w/o Cross-modality Fusion ER-CTTS w/o Bidirectional Context Modeling ER-CTTS w/o Memory Enhancement ER-CTTS w/o Emphasis Intensity
What’s your rule about pets?
Conversation history text speech
1st Hi, I’m the superintendent of this building. What can I do for you?
2nd Hi, I’m Paul. Could you show me the apartment on the first floor?
3rd Sure. Let’s go.
4th Umm I like this one. How much is the rent?
5th Eight fifty dollars a month.
6th Does the rent include utilities?
7th No. Utilities are extra.
8th Where’s the laundry room?
9th It’s on the other side of this floor.
current text Ground Truth FastSpeech2 FastSpeech2 w/ Emphasis DailyTalk FCTalker M2-CTTS GCN ECSS ER-CTTS
10th What’s your rule about pets?
ER-CTTS w/o Coarse-grained Encoders ER-CTTS w/o Fine-grained Encoders ER-CTTS w/o Hybrid-grained Fusion ER-CTTS w/o Cross-modality Fusion ER-CTTS w/o Bidirectional Context Modeling ER-CTTS w/o Memory Enhancement ER-CTTS w/o Emphasis Intensity