A study of continuous space word and sentence representations applied to ASR error detection

Abstract

This paper presents a study of continuous word representations applied to the automatic detection of speech recognition errors. We propose a neural network architecture that is well suited to handling continuous word representations such as word embeddings. We explore several types of word representations: simple and combined linguistic embeddings, and acoustic embeddings associated with prosodic features, extracted from the audio signal. To compensate for certain phenomena highlighted by an analysis of the average error span, we propose to model errors at the sentence level through the use of sentence embeddings. We also propose an approach for building continuous sentence representations dedicated to ASR error detection and compare it to the Doc2vec approach. Experiments are performed on automatic transcriptions generated by the LIUM ASR system applied to the French ETAPE corpus. They show that combining linguistic embeddings, acoustic embeddings, prosodic features, and sentence embeddings with more classical features yields very competitive results. In particular, these results show the complementarity of acoustic embeddings and prosodic information, and show that the proposed sentence embeddings dedicated to ASR error detection achieve better results than generic sentence embeddings.

Publication
Speech Communication Journal