Sahar Ghannay has been an associate professor at Université Paris-Saclay, CNRS, LISN research center, since September 2018.
She received a PhD in Computer Science from Le Mans University in September 2017. Her thesis work was part of the ANR VERA (AdVanced ERror Analysis for speech recognition) project. During her PhD, she spent a few months as a visiting researcher at Apple within the Siri Speech team.
As a postdoctoral researcher at LIUM, she worked on neural end-to-end systems for named entity detection and speech understanding, as part of the Chist-Era M2CR (Multimodal Multilingual Continuous Representation for Human Language Understanding) project.
Her main research interests are continuous representation learning and its application to natural language processing and speech recognition tasks, semantic information extraction from spoken and written language, and dialog systems.
PhD in Computer Science, 2017
Le Mans University
MS in Computer Science, 2013
Le Mans University
BS in Computer Science, 2011
Le Mans University
My research interests are continuous representation learning and its application to natural language processing and speech recognition tasks such as ASR error detection, natural/spoken language understanding, and named entity recognition. I am also interested in end-to-end approaches for speech understanding, as well as semantic textual similarity and its application to dialog systems.
Over the past few years, self-supervised learned speech representations have emerged as fruitful replacements for conventional surface representations when solving Spoken Language Understanding (SLU) tasks. Simultaneously, multilingual models trained on massive textual data were introduced to encode language-agnostic semantics. Recently, the SAMU-XLSR approach introduced a way to benefit from such textual models to enrich multilingual speech representations with language-agnostic semantics. Aiming for better semantic extraction on a challenging Spoken Language Understanding task while keeping computation costs in mind, this study investigates a specific in-domain semantic enrichment of the SAMU-XLSR model by specializing it on a small amount of transcribed data from the downstream task. In addition, we show the benefits of using same-domain French and Italian benchmarks for low-resource language portability and explore the cross-domain capacities of the enriched SAMU-XLSR.
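A rough sketch of the underlying idea is given below (not the exact SAMU-XLSR training recipe): the speech encoder is fine-tuned so that its pooled utterance embedding moves closer to a frozen multilingual sentence embedding of the corresponding transcription. The checkpoints, the linear projection, and the cosine objective are illustrative assumptions.

```python
# Minimal sketch: pull utterance-level speech embeddings towards frozen multilingual
# sentence embeddings of the transcriptions. Checkpoints, projection and cosine loss
# are assumptions for illustration, not the published SAMU-XLSR recipe.
import torch
from torch import nn
from transformers import Wav2Vec2Model
from sentence_transformers import SentenceTransformer

speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
text_encoder = SentenceTransformer("sentence-transformers/LaBSE")  # stays frozen

proj = nn.Linear(speech_encoder.config.hidden_size,
                 text_encoder.get_sentence_embedding_dimension())
optimizer = torch.optim.AdamW(
    list(speech_encoder.parameters()) + list(proj.parameters()), lr=1e-5)

def enrichment_step(waveforms: torch.Tensor, transcriptions: list[str]) -> float:
    """One step on a small batch of (audio, transcription) pairs from the task."""
    frames = speech_encoder(waveforms).last_hidden_state   # (B, T, H) frame features
    speech_emb = proj(frames.mean(dim=1))                  # (B, D) utterance embedding
    with torch.no_grad():                                  # frozen text side
        text_emb = text_encoder.encode(transcriptions, convert_to_tensor=True)
    # Pull the speech embedding towards the language-agnostic sentence embedding.
    loss = 1.0 - nn.functional.cosine_similarity(speech_emb, text_emb).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```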
Recent years have shown unprecedented growth of interest in Vision-Language tasks, with the need to address the inherent challenges of integrating linguistic and visual information to solve real-world applications. A typical such task is Visual Question Answering (VQA), which aims to answer questions about visual content. The limitations of the VQA task in terms of question redundancy and poor linguistic variability encouraged researchers to propose Knowledge-aware Visual Question Answering tasks as a natural extension of VQA. In this paper, we tackle the KVQAE (Knowledge-based Visual Question Answering about named Entities) task, which proposes to answer questions about named entities defined in a knowledge base and grounded in visual content. In particular, besides the textual and visual information, we propose to leverage structural information extracted from syntactic dependency trees and external knowledge graphs to help answer questions about a large spectrum of entities of various types. By combining contextual and graph-based representations using Graph Convolutional Networks (GCNs), we are able to learn meaningful embeddings for Information Retrieval tasks. Experiments on the public ViQuAE dataset show how our approach improves on state-of-the-art baselines while demonstrating the benefit of injecting external knowledge to enhance multimodal information retrieval.
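The graph side of such a model can be illustrated with a minimal GCN layer in plain PyTorch, applied to contextual node embeddings with an adjacency matrix derived, for instance, from a dependency tree or a knowledge-graph neighbourhood. Shapes, dimensions, and the final concatenation below are illustrative, not the exact architecture used in this work.

```python
# Minimal sketch of the graph component: one GCN layer (Kipf & Welling propagation
# rule) over contextual node embeddings; the adjacency could come from a dependency
# tree or a knowledge-graph neighbourhood. Dimensions are illustrative.
import torch
from torch import nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Add self-loops and symmetrically normalise: D^{-1/2} (A + I) D^{-1/2}
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(norm_adj @ node_feats))

# Toy usage: 5 nodes with 768-d contextual embeddings and a single graph edge.
contextual = torch.randn(5, 768)                 # e.g. encoder outputs, one per node
adj = torch.zeros(5, 5)
adj[0, 1] = adj[1, 0] = 1.0                      # one dependency / KG edge
graph_emb = GCNLayer(768, 256)(contextual, adj)  # graph-aware node embeddings
fused = torch.cat([contextual, graph_emb], dim=-1)  # combined representation for retrieval
```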
Self-supervised learning (SSL) is now commonly used to capture multilingual speech representations by exploiting huge amounts of audio data in several languages. In parallel, text-based large neural models trained on huge multilingual textual corpora have been introduced in order to capture the general semantics of a sentence, independently of the language, and to represent it in the form of a sentence embedding. Very recently, an approach was introduced that takes advantage of such sentence embeddings to continue the training of an SSL speech model in order to inject multilingual semantic information. In a previous work, we conducted a layer-wise analysis to better understand how this semantic information is integrated into a wav2vec2.0 model. In this new study, we show how this semantic information can be specialized for a downstream task dedicated to task-oriented spoken language understanding by exploiting a small amount of transcribed data. We also show that using in-domain data from a closely related language can be very beneficial for making the semantic representation captured by this enriched SSL model more accurate.
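One simple way to carry out such a layer-wise inspection is sketched below, under assumed checkpoints and a representational-similarity criterion chosen for the example (not necessarily the analysis used in this work): for each wav2vec2.0 layer, the utterance-to-utterance similarity structure of its pooled outputs is compared with that of the frozen multilingual sentence embeddings.

```python
# Layer-wise probe sketch: correlate, per layer, the cosine-similarity structure of
# mean-pooled speech representations with that of frozen sentence embeddings.
# Checkpoints and the RSA-style criterion are assumptions for illustration.
import torch
from transformers import Wav2Vec2Model
from sentence_transformers import SentenceTransformer

speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-xls-r-300m")
text_encoder = SentenceTransformer("sentence-transformers/LaBSE")

@torch.no_grad()
def layerwise_semantic_score(waveforms: torch.Tensor, transcriptions: list[str]):
    """waveforms: (N, samples) batch of equal-length utterances."""
    text_emb = text_encoder.encode(transcriptions, convert_to_tensor=True)
    text_sim = torch.nn.functional.cosine_similarity(
        text_emb.unsqueeze(1), text_emb.unsqueeze(0), dim=-1).flatten()
    outputs = speech_encoder(waveforms, output_hidden_states=True)
    scores = []
    for hidden in outputs.hidden_states:                    # one entry per layer
        pooled = hidden.mean(dim=1)                         # (N, H) utterance embeddings
        speech_sim = torch.nn.functional.cosine_similarity(
            pooled.unsqueeze(1), pooled.unsqueeze(0), dim=-1).flatten()
        # Pearson correlation between the two similarity structures.
        scores.append(torch.corrcoef(torch.stack([speech_sim, text_sim]))[0, 1].item())
    return scores  # higher = that layer's geometry better matches the semantic space
```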
In conventional domain adaptation for speaker diarization, a large collection of annotated conversations from the target domain is required. In this work, we propose a novel continual training scheme for domain adaptation of an end-to-end speaker diarization system, which processes one conversation at a time and benefits from full self-supervision thanks to pseudo-labels. The properties of our method allow for autonomous adaptation (e.g., of a voice assistant to a new household) while avoiding permanent storage of possibly sensitive user conversations. We experiment extensively on the 11 domains of the DIHARD III corpus and show the effectiveness of our approach with respect to a pre-trained baseline, achieving a 17% relative performance improvement. We also find that data augmentation and a well-defined target domain are key factors to avoid divergence and to benefit from transfer.
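The continual self-training loop can be sketched as follows; the tiny frame-level model and the thresholding-based pseudo-labels stand in for a real end-to-end diarization system and its pseudo-labelling procedure, so this is an illustration of the training scheme rather than the actual system.

```python
# Continual adaptation sketch: process one conversation at a time, let the current
# model produce pseudo-labels, adapt on them, then discard the audio.
# The tiny frame-level model and synthetic features are placeholders.
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 4))  # 4 speaker slots
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def adapt_on_conversation(features: torch.Tensor, steps: int = 3, threshold: float = 0.5):
    """Self-supervised adaptation on a single conversation (frames x feature_dim)."""
    with torch.no_grad():
        # Pseudo-labels: the current model's own speaker-activity decisions.
        pseudo_labels = (torch.sigmoid(model(features)) > threshold).float()
    for _ in range(steps):
        loss = bce(model(features), pseudo_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()

# Stream of conversations (synthetic features here); nothing is stored afterwards.
for conversation in [torch.randn(500, 40) for _ in range(5)]:
    adapt_on_conversation(conversation)
```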
In the last five years, the rise of self-attentional Transformer-based architectures has led to state-of-the-art performance on many natural language tasks. Although these approaches are increasingly popular, they require large amounts of data and computational resources. There is still a substantial need for benchmarking such methodologies on under-resourced languages and in data-scarce application conditions. Most pre-trained language models have been massively studied on English, and only a few have been evaluated on French. In this paper, we propose a unified benchmark focused on evaluating model quality and ecological impact on two well-known French spoken language understanding tasks. In particular, we benchmark thirteen well-established Transformer-based models on the two available spoken language understanding tasks for French: MEDIA and ATIS-FR. Within this framework, we show that compact models can reach results comparable to bigger ones while their ecological impact is considerably lower. However, this finding is nuanced and depends on the compression method considered.
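One benchmark point of this kind can be set up along the following lines, here with a CamemBERT checkpoint treated as a slot-filling token classifier and parameter count as a rough footprint proxy. The checkpoint, label count, and example utterance are placeholders, and MEDIA / ATIS-FR data loading and carbon measurement (e.g., with a tool such as codecarbon) are not shown.

```python
# Minimal sketch of one benchmark point: a French Transformer as a token classifier
# for slot filling, with parameter count as a rough cost proxy. Checkpoint, label
# count and the utterance are placeholders; the fine-tuning loop is omitted.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "camembert-base"   # swap in a compact variant to compare footprints
num_slot_labels = 64            # placeholder; depends on the benchmark's annotation scheme

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=num_slot_labels)

print(f"{checkpoint}: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")

# One forward pass on a toy MEDIA-style utterance.
inputs = tokenizer("je voudrais une chambre double à Paris", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # (1, num_tokens, num_slot_labels)
predicted_slots = logits.argmax(dim=-1)  # per-token slot label ids
```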