Specialized Semantic Enrichment of Speech Representations

Abstract

Self-supervised learning (SSL) is now commonly used to learn multilingual speech representations from large amounts of speech audio in several languages. In parallel, large text-based neural models trained on huge multilingual text corpora have been introduced to capture the general semantics of a sentence, independently of its language, and to represent it as a sentence embedding. Very recently, an approach was introduced that uses such sentence embeddings to continue the training of an SSL speech model, thereby injecting multilingual semantic information into it. In previous work, we carried out a layer-wise analysis to better understand how this semantic information is integrated into a wav2vec 2.0 model. In this new study, we show how this semantic information can be specialized to a targeted downstream task, namely task-oriented spoken language understanding, by exploiting a small amount of transcribed data. We also show that in-domain data from a closely related language can be very beneficial in making the semantic representation captured by this enriched SSL model more accurate.
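
The idea of continuing the training of a speech model against text sentence embeddings can be illustrated with a minimal sketch. The abstract does not specify the pooling or the objective, so the mean pooling and cosine-distance loss below are assumptions made for illustration only, and the function names (`mean_pool`, `cosine_distance`, `semantic_alignment_loss`) are hypothetical. The sketch only computes the loss value for one utterance; in practice this quantity would be minimized with respect to the speech model's parameters while the text encoder stays fixed.

```python
import math


def mean_pool(frames):
    """Average frame-level vectors into a single utterance-level vector.

    Assumption: mean pooling is used here for simplicity; the abstract
    does not state how frame-level features are aggregated.
    """
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def semantic_alignment_loss(speech_frames, text_embedding):
    """Distance between the pooled speech vector and the sentence embedding.

    Assumption: a cosine-distance objective; `text_embedding` stands for
    the output of a multilingual text sentence encoder applied to the
    transcription of the same utterance.
    """
    return cosine_distance(mean_pool(speech_frames), text_embedding)
```

Under these assumptions, specializing the model to a downstream task would amount to computing the same loss on a small set of transcribed in-domain utterances, optionally complemented by in-domain utterances from a closely related language.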

Publication
2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)