Feature Extraction using Multimodal Convolutional Neural Networks for Visual Speech Recognition.

Proceedings of ICASSP'2017, March 2017, New Orleans, USA

Abstract: This article addresses continuous speech recognition from visual information only, without using any audio signal. Our approach combines a video camera and an ultrasound imaging system to monitor the speaker’s lips and tongue movements simultaneously. We investigate the use of convolutional neural networks (CNNs) to extract visual features directly from the raw ultrasound and video images. We propose several architectures, including a multimodal CNN that processes the two visual modalities jointly. Combined with an HMM-GMM decoder, the CNN-based approach outperforms our previous baseline based on Principal Component Analysis. Importantly, the recognition accuracy is only 4% lower than that obtained when decoding the audio signal, which makes the approach a good candidate for a practical visual speech recognition system.
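To make the idea of a multimodal CNN feature extractor concrete, here is a minimal sketch in plain Python. It is not the architecture from the paper: the image sizes, the number of kernels, the random weights, and the fusion by concatenation followed by one linear layer are all assumptions chosen for illustration, and every function name is hypothetical. It only shows the general pattern of one convolutional stream per modality whose outputs are merged into a single feature vector per frame, which a decoder such as an HMM-GMM could then consume.

```python
import random

def conv2d_valid(image, kernel):
    """Valid 2-D cross-correlation of a 2-D list with a 2-D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(fmap):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2(fmap):
    """Non-overlapping 2x2 max pooling (an odd trailing row/column is dropped)."""
    h, w = len(fmap) // 2, len(fmap[0]) // 2
    return [
        [max(fmap[2 * i][2 * j], fmap[2 * i][2 * j + 1],
             fmap[2 * i + 1][2 * j], fmap[2 * i + 1][2 * j + 1])
         for j in range(w)]
        for i in range(h)
    ]

def branch(image, kernels):
    """One modality-specific stream: conv -> ReLU -> pool, then flatten."""
    feats = []
    for kernel in kernels:
        pooled = max_pool2(relu(conv2d_valid(image, kernel)))
        feats.extend(v for row in pooled for v in row)
    return feats

def multimodal_features(ultrasound, lips, us_kernels, lip_kernels, fusion_weights):
    """Concatenate both streams, then apply a shared linear fusion layer."""
    joint = branch(ultrasound, us_kernels) + branch(lips, lip_kernels)
    return [sum(w * x for w, x in zip(row, joint)) for row in fusion_weights]

def random_kernel(rng, size=3):
    """Untrained 3x3 kernel; a real system would learn these weights."""
    return [[rng.uniform(-1, 1) for _ in range(size)] for _ in range(size)]

rng = random.Random(0)
# Toy frames (assumed sizes): 8x8 ultrasound image, 6x6 lip image.
ultrasound = [[rng.random() for _ in range(8)] for _ in range(8)]
lips = [[rng.random() for _ in range(6)] for _ in range(6)]
us_kernels = [random_kernel(rng) for _ in range(2)]
lip_kernels = [random_kernel(rng) for _ in range(2)]

# Ultrasound: 8x8 -> conv 6x6 -> pool 3x3 = 9 values per kernel.
# Lips:       6x6 -> conv 4x4 -> pool 2x2 = 4 values per kernel.
joint_dim = 2 * 9 + 2 * 4
fusion_weights = [[rng.uniform(-1, 1) for _ in range(joint_dim)] for _ in range(5)]

features = multimodal_features(ultrasound, lips, us_kernels, lip_kernels, fusion_weights)
print(len(features))
```

In this sketch each video frame pair yields one 5-dimensional feature vector; in practice the weights would be trained and the resulting per-frame vectors passed to the decoder in place of PCA-based features.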

Documents: paper
