Interspeech 2016 Special Session: Auditory-visual expressive speech and gesture in humans and machines

Session title

Auditory-visual expressive speech and gesture in humans and machines 

Link to Interspeech

Link to Special Issue in Speech Communication

Names and affiliations of organizers

Jeesun Kim, Associate Professor, The MARCS Institute, Western Sydney University, Australia

Kim researches human speech communication with a particular focus on face-to-face interaction. She has authored over 100 peer-reviewed papers on speech and language processing. Her recent research topics include visual prosody, expressive speech perception by the elderly, emotion perception in tone and non-tone languages, and how vision influences auditory prediction and attention. She is a regular Interspeech attendee and reviewer. She organised a special session on 'Auditory Visual Speech Processing' at Interspeech 2004 and was an area chair for 'Human Speech Perception, Interaction Production-Perception and Face-to-Face communication' at Interspeech 2013.

Gérard Bailly, Professor, GIPSA-Lab/Speech & Cognition dpt., CNRS/Grenoble-Alpes University, France

Bailly is a senior CNRS Research Director appointed to GIPSA-Lab, Grenoble, France. He was a deputy director of the lab (2007-2012) and now leads the CRISSP (Cognitive Robotics, Interactive Systems & Speech Processing) team. He has been working in the field of speech communication for 30 years. He has supervised 28 PhD theses and authored more than 40 journal papers, 25 book chapters and more than 170 papers in major international conferences. He co-edited "Talking Machines: Theories, Models and Designs" (Elsevier, 1992), "Improvements in Speech Synthesis" (Wiley, 2002) and "Audiovisual speech processing" (CUP, 2012). He is an associate editor of two journals (JASMP & JMUI). He is a board member of the International Speech Communication Association (ISCA) and a founding member of the ISCA SynSIG and SproSIG special-interest groups. His current interest is in multimodal interaction with conversational agents (virtual avatars and humanoid robots) using speech, eye gaze, hand and head movements.


The importance of the topic: Human spoken communication is embedded in and supported by a rich orchestration of visible motion. That is, the meaning of speech is augmented and even changed by co-verbal behaviours and gestures, including the talker's facial expression, eye contact, gaze direction, arm movements, hand gestures, body motion and orientation, posture, proximity, physical contact, and so on. Understanding how and when various kinds of messages are conveyed by auditory and visual signals is crucial for a science ultimately interested in the correct interpretation of transmitted meaning.

The research domain: The topic 'Auditory-visual expressive speech and gesture in humans and machines' encompasses many research fields and will be of interest to researchers who (for example): investigate the role of the talker's face and head movements (visual speech) in human face-to-face communication; are interested in the relationship between speech and gesture; are working to develop platforms for human-machine communication (e.g., a key topic for sociable humanoid robots).

The objectives of the session: The session aims to bring together researchers from many disciplines to share techniques and investigative methods as well as research findings. It will provide a forum for researchers to explore the extent to which results concerning human communication are important for enabling social machines. Conversely, it will provide an opportunity for researchers working with machines (e.g., computer vision, machine learning, robot design) to showcase developments in their field. The feedback between the two communities will be stimulating and rewarding.

Why the topic cannot be covered appropriately in regular sessions: The research area is inherently interdisciplinary. Although research papers within this area will be sprinkled throughout the regular sessions of the conference, bringing human and machine researchers together in a single session will provide the focus and critical mass for effective interaction.

The format

The format will be a series of short oral presentations and possibly posters (each including a 3-minute oral introduction), followed by a final panel discussion.

Potential topic areas include (but are not limited to)

1. Taxonomies of emotions, attitudes and interactive behaviors
2. Patterns and functions of auditory and visual prosody (and co-speech gestures)
3. Production, analysis and perception of expressive auditory visual speech (and co-speech gestures)
4. Synthesis and recognition of expressive auditory visual speech (and co-speech gestures)
5. Data, models and evaluation of multimodal interactive behaviors
6. Virtual and robotic conversational agents
7. Theories of auditory visual speech (and co-speech gestures)

Important dates

· Submission opens: Monday, 1 February 2016
· Submission deadline: Wednesday, 23 March 2016
· Deadline for final PDF of paper submission to be uploaded: Wednesday, 30 March 2016
· Disposition notifications sent: Friday, 10 June 2016
· Camera-ready paper due: Friday, 24 June 2016