Article | CochleaNet: A Robust Language-independent Audio-Visual Model for Speech Enhancement

Details

Citation

Gogate M, Dashtipour K, Adeel A & Hussain A (2020) CochleaNet: A Robust Language-independent Audio-Visual Model for Speech Enhancement. Information Fusion, 63, pp. 273-285. https://doi.org/10.1016/j.inffus.2020.04.001

Abstract
Noisy situations cause huge problems for sufferers of hearing loss, as hearing aids often make speech more audible but do not always restore intelligibility. In noisy settings, humans routinely exploit the audio-visual (AV) nature of speech to selectively suppress background noise and focus on the target speaker. In this paper, we present a language-, noise- and speaker-independent AV deep neural network (DNN) architecture for causal, or real-time, speech enhancement (SE). The model jointly exploits noisy acoustic cues and noise-robust visual cues to focus on the desired speaker and improve speech intelligibility. The proposed SE framework is evaluated using a first-of-its-kind AV binaural speech corpus, called ASPIRE, recorded in real noisy environments including a cafeteria and a restaurant. We demonstrate superior performance of our approach, in terms of objective measures and subjective listening tests, over state-of-the-art SE approaches as well as recent DNN-based SE models. In addition, our work challenges the popular belief that the scarcity of multi-language, large-vocabulary AV corpora and of a wide variety of noises is a major bottleneck to building robust language-, speaker- and noise-independent SE systems. We show that a model trained on synthetic mixtures of the Grid corpus (with 33 speakers and a small English vocabulary) and CHiME 3 noises (consisting of bus, pedestrian, cafeteria, and street noises) generalises well not only to large-vocabulary corpora and a wide variety of speakers and noises, but also to a completely unrelated language (such as Mandarin).

Keywords
Audio-Visual; Speech Enhancement; Speech Separation; Deep Learning; Real Noisy Audio-Visual Corpus; Speaker Independent; Causal

Journal
Information Fusion: Volume 63

Status	Published
Publication date	30/11/2020
Publication date online	21/04/2020
Date accepted by journal	11/04/2020
Publisher	Elsevier BV
ISSN	1566-2535

People (1)

Dr Ahsan Adeel

Assoc. Prof. in Artificial Intelligence, Computing Science and Mathematics - Division

Files (1)

Accepted manuscript
