Special Sessions/Challenges

The Organizing Committee of INTERSPEECH 2023 can confirm the following Special Sessions, Challenges and Panel Session.

Special Sessions


Biosignals, such as those arising from articulatory or neurological activity, provide information about the human speech process and can thus serve as an alternative modality to the acoustic speech signal. As such, they can be the primary driver for speech-driven human-computer interfaces intended to support humans when acoustic speech is not available or perceivable. For instance, articulatory-related biosignals, such as Electromyography (EMG) or Electromagnetic Articulography (EMA), can be leveraged to synthesize the acoustic speech signal from silent articulation. By the same token, neuro-steered hearing aids process neural activity, reflected in signals such as Electroencephalography (EEG), to detect the listener’s selective auditory attention in order to single out and enhance the attended speech stream. Progress in the field of speech-related biosignal processing will lead to the design of novel biosignal-enabled speech communication devices and speech rehabilitation for everyday situations.


With the special session “Biosignal-enabled Spoken Communication”, we aim to bring together researchers working on biosignals and speech processing to exchange ideas on interdisciplinary topics. Topics include, but are not limited to:

  • Processing of biosignals related to spoken communication, such as brain activity captured by, e.g., EEG, Electrocorticography (ECoG), or functional magnetic resonance imaging (fMRI).
  • Processing of biosignals stemming from respiratory, laryngeal, or articulatory activity, represented by, e.g., EMA, EMG, videos, or similar.
  • Application of biosignals for speech processing, e.g., speech recognition, synthesis, enhancement, voice conversion, or auditory attention detection.
  • Utilization of biosignals to increase the explainability or performance of acoustic speech processing methods.
  • Development of novel machine learning algorithms, feature representations, model architectures, as well as training and evaluation strategies for improved performance or to address common challenges.
  • Applications such as speech restoration, training and therapy, speech-related brain-computer interfaces (BCIs), speech communication in noisy environments, or acoustic-free speech communication for preserving privacy.




Dr. Siqi Cai, Human Language Technology Laboratory, National University of Singapore, Singapore

Kevin Scheck, Cognitive Systems Lab, University of Bremen, Germany

Assoc. Prof. Hiroki Tanaka, Augmented Human Communication Labs, Nara Institute of Science and Technology, Japan

Prof. Dr.-Ing. Tanja Schultz, Cognitive Systems Lab, University of Bremen, Germany

Prof. Haizhou Li, The Chinese University of Hong Kong, Shenzhen, China; National University of Singapore, Singapore


Speech technology is increasingly embedded in everyday living, with applications ranging from critical domains like medicine, psychiatry, and education to more commercial settings. This rapid growth can be largely attributed to the successful use of deep learning in modelling large amounts of speech data. However, the performance of speech technology in these applications varies depending on the demographics of the population and on the data it has been trained on and is applied to. That is, inequity in speech technology appears across age and gender, and affects people with vocal disorders, people from atypical populations, and people with non-native accents.

A large group vulnerable to the inequities of speech technology and its performance is children. The goal of this interdisciplinary session is to address the limitations and advances of speech technology and speech science, focusing on child speech, while bringing together researchers working within these domains.


We invite papers on topics including, but not limited to:

  • Using speech science (knowledge from children’s speech acquisition, production, perception, and generally natural language understanding) to develop and improve speech technology applications.
  • Applying techniques developed for speech technology to learn more about child speech production, perception and processing.
  • Computational modelling of child speech.
  • Speech technology applications for children, including (but not limited to) speech recognition, voice conversion, language identification, segmentation, and diarization.
  • Use and/or modification of data creation techniques, feature extraction schemes, tools and training architectures developed for adult speech for developing child speech applications.
  • Speech technology for children from typical and non-typical groups (atypical, non-native speech, slow learners, etc.).




Line H. Clemmensen, Technical University of Denmark, Denmark

Nina R. Benway, Syracuse University, USA

Odette Scharenborg, Delft University of Technology, the Netherlands

Sneha Das, Technical University of Denmark, Denmark

Tanvina Patel, Delft University of Technology, the Netherlands

Zhengjun Yue, Delft University of Technology, the Netherlands


Speech and language technology (SLT) has the potential to help educate, facilitate medical treatment, provide access to services and information, empower, support independent living, and enable communication and cultural exchange between communities.

While speech synthesis and automatic speech recognition have been used to aid accessibility for several decades, a wider range of speech and language technologies are powerful tools in applications useful to society. Dialog technology has been used in domains including public education and cultural exhibits, independent learning applications, anti-bullying initiatives, health, digital resources for minority or lesser-spoken languages, and companion/assistive systems for the elderly. These applications have the potential to provide societal benefits or public good by giving access to highly interactive services in sectors or contexts where dialogue and language are a critical interaction component, and where other interface paradigms would be less effective or have higher infrastructure barriers. Other applications seek to improve access to information and provide spoken word versions of written texts for education and entertainment (Daisy Digital Books), while machine translation and a wide range of NLP tools also have the potential to aid communication and access to information.

The Dialog for Good special session (DiGo) aims to highlight the use of SLT for social good. It will promote novel use cases, cutting-edge research and technological developments in any domain that benefits society, building awareness of the opportunities that SLT offers. We hope the session will foster networking among researchers and service providers, leading to further initiatives to develop this highly interdisciplinary area of speech and language research and technology.


We welcome submissions on dialog and speech and language technology and applications in areas including, but not limited to:

  • Education
  • Access to social services / participation in society
  • Lesser Resourced Languages
  • Health
  • Social/Public services
  • Culture
  • Mobility/Migration
  • Political Freedom
  • Agriculture
  • Sustainability


Emer Gilmartin (Inria, Paris)

Neasa Ni Chiarain (Centre for Language and Communications Studies, Trinity College, Dublin)

Jens Edlund (KTH, Stockholm)

Brendan Spillane (University College Dublin/ADAPT)

David Traum (ICT)


Pre-trained acoustic models learned in an unsupervised fashion have seen explosive growth in the domain of speech. The representations discovered by CPC, wav2vec 2.0, HuBERT, WavLM, and others can be used to massively reduce the amount of labelled data needed to train speech recognizers; they also produce excellent speech resynthesis.

However, while pre-trained acoustic representations seem to be nicely isomorphic with phones or phone states under optimal listening conditions, very little work has addressed invariances. Do the representations remain consistent across instances of the same phoneme in different phonetic contexts (i.e., are they phonemic or merely allophone representations)? Do they hold up under noise and distortions? Are they invariant to different talkers and/or accents?

Progress on these issues could unlock new levels of performance on higher-level tasks such as word segmentation, named entity recognition, and language modelling, where systems using the discretized “units” discovered by pre-trained acoustic models still lag behind state-of-the-art text-based models. Importantly, progress on talker and accent robustness would help address the serious fairness problem that current ASR models have (including those using pre-trained acoustic models as features), whereby lower socioeconomic status is highly correlated with higher word error rate.

The 2023 Interspeech Special Session on Invariant and Robust Pretrained Acoustic Models (IRPAM) aims to address both the evaluation problem and the problem of invariance in pretrained acoustic models. The evaluation track will accept proposed systematic evaluation measures, test sets, or benchmarks for pre-trained acoustic models, covering properties including but not limited to context-invariance, talker-invariance, accent-invariance, and robustness to noise and distortions. The model track will accept papers that propose new models or techniques and demonstrate empirically that they improve the invariance or robustness properties of pre-trained speech representations, evaluating using existing approaches or variants on existing benchmarks/measures. This could also include techniques for disentanglement in pre-trained acoustic models.
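As a rough illustration of the kind of talker-group measure an evaluation-track submission might report, the sketch below computes corpus-level word error rate per group and the gap between groups. It is a minimal sketch only: the group names and transcripts are hypothetical, and the session does not prescribe this or any other specific measure.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(pairs):
    """Corpus-level WER: total edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


# Hypothetical (reference, hypothesis) transcripts per talker group.
groups = {
    "group_a": [("the cat sat on the mat", "the cat sat on the mat")],
    "group_b": [("the cat sat on the mat", "the cat sat in a mat")],
}
wer_by_group = {name: word_error_rate(pairs) for name, pairs in groups.items()}

# A simple disparity score: the spread between best and worst group WER.
gap = max(wer_by_group.values()) - min(wer_by_group.values())
```

A smaller gap between groups at comparable overall WER would suggest more talker-invariant behaviour, although a real benchmark would need far more data and careful control of content across groups.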




Ewan Dunbar, University of Toronto

Emmanuel Dupoux, École des Hautes Études en Sciences Sociales / École Normale Supérieure / Meta AI

Hung-yi Lee, National Taiwan University

Abdelrahman Mohamed


Developing methods that are able to handle multiple simultaneous speakers represents a major challenge for researchers in many fields of speech technology and speech science, for example, in speech enhancement, auditory modelling and machine listening or speaking. Significant research activity has occurred in many of these fields in recent years and great advances have been made, but often in a siloed manner. This cross-disciplinary special session will bring together researchers from across the whole field to present and discuss their latest research on multi-talker methods, encouraging a sharing of ideas and fertilising future collaboration.


We welcome submissions on many different topics, including, but not limited to:

  • Single channel speech separation;
  • Automatic speech recognition of overlapped speech;
  • Speech enhancement in the presence of competing speakers;
  • Diarization of overlapped speech;
  • Target speaker ASR and speech enhancement;
  • Understanding human speech perception in multi-talker environments;
  • Improving speech synthesis in competing-speaker scenarios;
  • Multi-modal approaches to multi-talker speech processing: for example audio-visual methods, location-aware approaches;
  • Clinical applications of multi-talker methods, e.g., for hearing impaired listeners;
  • Downstream technologies operating in multi-talker scenarios, e.g., meeting transcription, human-robot interaction;
  • Evaluation methods for multi-talker speech technologies.

Note, however, that we intend the focus of the session to be on applications in single-channel or binaural conditions, rather than on methods pertaining specifically to microphone arrays or other specialist hardware.




Peter Bell, University of Edinburgh, UK

Michael Akeroyd, University of Nottingham, UK

Marc Delcroix, NTT, Japan

Liang Lu, Otter.ai, USA

Jonathan Le Roux, MERL, USA

Jinyu Li, Microsoft, USA

Cassia Valentini, University of Edinburgh, UK

DeLiang Wang, Ohio State University, USA

Jon Barker, University of Sheffield, UK


This special session aims to serve as a central hub for researchers investigating how the human brain processes speech under various acoustic/linguistic conditions and in various populations. Understanding speech requires our brain to rapidly process a variety of acoustic and linguistic properties, with variability due to age, language proficiency, attention, and neurocognitive ability among other factors. Until recently, neurophysiology research was limited to studying the encoding of individual linguistic units in isolation (e.g., syllables) using tightly controlled and uniform experiments that were far from realistic scenarios. Recent advances in modelling techniques have made it possible to study the neural processing of speech with more ecologically valid stimuli involving natural, conversational speech, enabling researchers to examine the contribution of factors such as native language and language proficiency, speaker sex, and age to speech perception.

One of the approaches, known as forward modelling, involves modelling how the brain encodes speech information as a function of certain parameters (e.g., time, frequency, brain region), contributing to our understanding of what happens to the speech signal as it passes along the auditory pathway. This framework has been used to study both young and ageing populations, as well as neurocognitive deficits. Another approach, known as backward modelling, involves decoding speech features or other relevant parameters from the neural response recorded during natural listening tasks. A noteworthy contribution of this approach was the discovery that auditory attention can be reliably decoded from several seconds of non-invasive brain recordings (EEG/MEG) in multi-speaker environments, leading to a new subfield of auditory neuroscience focused on neuro-enabled hearing technology applications.
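To make the forward-modelling idea concrete, the sketch below fits a linear mapping from a time-lagged stimulus feature to a single simulated response channel using ridge-regularised least squares. It is a minimal sketch under stated assumptions: the signals are synthetic, the three-lag kernel and the ridge value are arbitrary illustrative choices, and the linear lagged model is one common formulation rather than the method of any particular study. A backward model would use the same machinery with the roles swapped, regressing a stimulus feature on lagged neural channels.

```python
import random


def lagged_design(stimulus, n_lags):
    """Row t holds stimulus[t], stimulus[t-1], ..., zero-padded at the start."""
    return [[stimulus[t - k] if t - k >= 0 else 0.0 for k in range(n_lags)]
            for t in range(len(stimulus))]


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_forward_model(stimulus, response, n_lags, ridge=1e-6):
    """Estimate lag weights w = (X^T X + ridge * I)^-1 X^T y."""
    x = lagged_design(stimulus, n_lags)
    xtx = [[sum(row[i] * row[j] for row in x) + (ridge if i == j else 0.0)
            for j in range(n_lags)] for i in range(n_lags)]
    xty = [sum(row[i] * y for row, y in zip(x, response)) for i in range(n_lags)]
    return solve_linear(xtx, xty)


def predict(stimulus, weights):
    """Predicted response: the stimulus convolved with the lag weights."""
    return [sum(w * v for w, v in zip(weights, row))
            for row in lagged_design(stimulus, len(weights))]


# Synthetic data: a noise-free response generated from a known (made-up) kernel.
rng = random.Random(0)
stimulus = [rng.gauss(0.0, 1.0) for _ in range(300)]
true_kernel = [0.5, 1.0, -0.3]
response = predict(stimulus, true_kernel)

estimated = fit_forward_model(stimulus, response, n_lags=3)
```

With real recordings the response is noisy and multi-channel, so the ridge parameter is typically chosen by cross-validation and the model is judged by how well it predicts held-out data.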


Giovanni Di Liberto, Trinity College Dublin (School of Computer Science and Statistics, ADAPT Centre, TCIN)

Alejandro Lopez Valdes, Trinity College Dublin (School of Engineering, Electronic and Electrical Engineering, Global Brain Health Institute, TCBE, TCIN)

Mick Crosse, SEGOTIA; Trinity College Dublin (School of Engineering)

Mounya Elhilali, Johns Hopkins University (Department of Electrical and Computer Engineering, Department of Psychological and Brain Sciences)


Now that spoken language translation (SLT) systems are becoming more mature thanks to new technologies and advances, there is an opportunity for the SLT community to focus on more challenging scenarios and problems beyond core quality. Current approaches to simultaneous and offline speech translation often blindly rely on large quantities of heterogeneous training data to learn high-quality models to support users at inference time. This “the larger, the better” mindset obscures the need to incorporate specific and targeted knowledge to address particular aspects of the translation process.

This special session aims to draw the SLT community’s attention to two different types of information that are fundamental to boosting performance in speech translation applications:

  • Paralinguistic information: Important facets of communication are non-verbal and non-linguistic aspects of speech. For instance, human beings naturally communicate their underlying emotional states without explicitly describing them. The capability to leverage paralinguistic information (e.g. tones, emotions) in the source language speech has been lost in most current SLT approaches.
  • Linguistic information: Current speech translation models do not take advantage of specific source/target language knowledge such as syntax parsers, morphological analyzers, monolingual and bilingual glossaries, ontologies, knowledge bases, etc. This is particularly evident in the models’ inability to correctly translate parts of the input that are rarely represented in the training data (e.g. named entities, terms) or are specific to some languages (e.g. idioms).

This special session will cover simultaneous and incremental ASR, MT, and TTS models, giving particular importance to their needs and uses in real-time application scenarios. By combining these themes, this session will create a positive environment that brings the wider speech and translation communities together to discuss innovative ideas, challenges, and opportunities for utilizing paralinguistic and linguistic knowledge within the scope of speech translation.




Satoshi Nakamura, NAIST

Marco Turchi, Zoom Video Communications

Juan Pino, Meta

Marcello Federico, AWS AI Labs

Colin Cherry, Google

Alex Waibel, CMU/KIT

Elizabeth Salesky, Johns Hopkins University


Technological advancements have been rapidly transforming healthcare in the last several years, with speech and language tools playing an integral role. However, this brings a multitude of unique challenges to consider when integrating speech and language tools in healthcare and health research settings. Many of these challenges are common to the two themes of this special session. The first theme, From Collection and Analysis to Clinical Translation, seeks to draw attention to all aspects of speech-health studies that affect the overall quality and reliability of any analysis undertaken on the data and thus affect user acceptance and clinical translation. These factors include increasing our understanding of how changes in health affect the neuroanatomical and neurophysiological mechanisms related to speech and language, and how best to go about capturing, analyzing and quantifying these changes. Alongside these efforts, the speech-health community also needs to consider practical issues of feasibility to help advance the translational potential of speech as a health signal. The second theme, Speech and Language Technology For Medical Conversations, covers a growing field of ambient intelligence in which automatic speech recognition and natural language processing tools are combined to automatically transcribe and interpret clinician-patient conversations and generate subsequent medical documentation. This multifaceted area includes many foci centered around language technologies, such as those for long-form conversations, for translation of conversations into accurate clinical documentation, for providing feedback to medical students, for diagnostic support from spontaneous conversations with physicians, or for novel applications for language technology.
By combining these themes, this session will bring the wider speech-health community together to discuss innovative ideas, challenges, and opportunities for utilizing speech technologies within the scope of healthcare applications.




Nicholas Cummins, King’s College London and Thymia

Thomas Schaaf, 3M

Heidi Christensen, University of Sheffield

Julien Epps, University of New South Wales

Matt Gormley, Carnegie Mellon University

Sandeep Konam, Abridge.ai

Emily Mower Provost, University of Michigan

Chaitanya Shivade, Amazon.com

Thomas Quatieri, MIT Lincoln Laboratory


Given the ubiquity of Machine Learning (ML) systems and their relevance in daily lives, it is important to ensure private and safe handling of data alongside equity in human experience. These considerations have gained considerable interest in recent times under the realm of Trustworthy ML. Speech processing in particular presents a unique set of challenges, given the rich information carried in linguistic and paralinguistic content including speaker trait, interaction and state characteristics. This special session on Trustworthy Speech Processing (TSP) was created to bring together new and experienced researchers working on trustworthy ML and speech processing. We invite novel and relevant submissions from both academic and industrial research groups showcasing theoretical and empirical advancements in TSP.


Topics of interest cover a variety of papers centered on speech processing, including (but not limited to):

  • Differential privacy
  • Bias and Fairness
  • Federated learning
  • Ethics in speech processing
  • Model interpretability
  • Quantifying & mitigating bias in speech processing
  • New datasets, frameworks and benchmarks for TSP
  • Discovery and defense against emerging privacy attacks
  • Trustworthy ML in applications of speech processing like ASR




Anil Ramakrishna, Amazon Inc.

Shrikanth Narayanan, University of Southern California

Rahul Gupta, Amazon Inc.

Isabel Trancoso, University of Lisbon

Bhiksha Raj, Carnegie Mellon University

Theodora Chaspari, Texas A&M University



The DIarization of SPeaker and LAnguage in Conversational Environments (DISPLACE) challenge entails a first-of-its-kind task to perform speaker and language diarization on the same set of recordings, as the data contains multi-speaker social conversations in multi-lingual code-mixed speech. In multi-lingual communities, social conversations frequently involve code-mixed and code-switched speech. In such cases, speech processing systems need to perform the speaker and language segmentation before any downstream task. Current speaker diarization systems are not equipped to handle multi-lingual conversations, while language recognition systems may not be able to handle the same talker speaking in multiple languages within a single recording.

With this motivation, the DISPLACE challenge attempts to benchmark and improve Speaker Diarization (SD) in multi-lingual settings and Language Diarization (LD) in multi-speaker settings, using the same underlying dataset. For this challenge, a natural multi-lingual, multi-speaker conversational dataset will be distributed for development and evaluation purposes. There will be no training data given and the participants are free to use any resource for training the models. The challenge reflects the theme of Interspeech 2023 – “Inclusive Spoken Language Science and Technology – Breaking Down Barriers” in its true sense.

Registration is open for this challenge, which comprises two tracks:

  1. Speaker diarization in multilingual scenarios
  2. Language diarization in multi-speaker cases

A baseline system and an open leaderboard will also be made available to the participants. For more details, dates, and to register, kindly visit the DISPLACE challenge website.




Dr. Shikha Baghel, LEAP lab, Indian Institute of Science, Bangalore

Prof. Deepu Vijayasenan, National Institute of Technology Karnataka, Surathkal

Prof. Sriram Ganapathy, LEAP lab, Indian Institute of Science, Bangalore


The inaugural MERLIon CCS Challenge focuses on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous code-switched, child-directed speech collected via Zoom.

Due to a bias towards standard speech varieties, non-standard, accented speech remains an ongoing challenge for automatic processing. Although existing works have explored automatic speech recognition and language diarization in code-switching speech corpora, those tasks are still challenging for natural in-the-wild speech containing more than one language, particularly when the code-switching occurs in short language spans.

Aligning closely with Interspeech 2023’s theme, ‘Inclusive Spoken Language Science and Technology – Breaking Down Barriers’, we present the challenge of developing robust language identification and language diarization systems that are reliable for non-standard accented, bilingual, child-directed speech collected via a video call platform.

As video calls become increasingly ubiquitous, we present a unique first-of-its-kind Zoom video call dataset. The MERLIon CCS Challenge will tackle automatic language identification and language diarization in a subset of audio recordings from the Talk Together Study, where parents narrated an onscreen wordless picture book to their child.

The main objectives of this inaugural challenge are:

  1. to benchmark the current and novel language identification and language diarization systems in a code-switching scenario, including extremely short utterances;
  2. to test the robustness of such systems under accented speech;
  3. to challenge the research community to propose novel solutions in terms of adaptation, training, and novel embedding extraction for this particular set of tasks.

Techniques developed in the challenge may benefit other related fields allowing a greater understanding of how code-switching occurs in real-life situations.

The challenge will feature language identification and language diarization. Two tracks, open and closed, are available. The tracks differ by the data used during system training.




Leibny Paola Garcia Perera, Johns Hopkins University

YH Victoria Chua, Nanyang Technological University

Hexin Liu, Nanyang Technological University

Fei Ting Woon, Nanyang Technological University

Andy Khong, Nanyang Technological University

Justin Dauwels, TU Delft

Sanjeev Khudanpur, Johns Hopkins University

Suzy J Styles, Nanyang Technological University


Thanks to shared datasets and benchmarks, impressive advancements have been made in the field of speech processing. Historically, these benchmarks have centered on automatic speech recognition (ASR), speaker identification, and other lower-level tasks.

There is an increasing demand for advanced spoken language understanding (SLU) tasks, including using end-to-end models, but there are not many labeled datasets available to tackle them. What’s more, the few existing datasets tend to be relatively limited in size and comprise synthetic data. Recent research reveals that it is possible to pre-train generic representations and then refine them for various tasks with only a small amount of labeled data.

For this special session, we will provide a Spoken Language Understanding Evaluation (SLUE) benchmark suite. SLUE (Phase 1) includes annotations for ASR, named entity recognition (NER), and sentiment analysis, along with a toolkit of scripts to pre-process the data and fine-tune baseline models.

While we invite general submissions on this topic, the 2nd special session of the low-resource SLU series incorporates a unique challenge, SLUE 2023, which will focus on named entity recognition using the SLUE-VoxPopuli dataset under resource constraints that are described on the challenge website.


We also invite contributions for any relevant work in low-resource SLU problems, which include, but are not limited to:

  • Training/fine-tuning approach using self/semi-supervised model for SLU tasks
  • Comparison between pipeline and end-to-end SLU systems
  • Self/semi-supervised learning approach focusing on SLU
  • Multi-task/transfer/student-teacher learning focusing on SLU tasks
  • Theoretical or empirical study on low-resource SLU problems

We will consider both challenge and non-challenge submissions.




Suwon Shon, ASAPP

Felix Wu, ASAPP

Ankita Pasad, Toyota Technological Institute at Chicago

Chyi-Jiunn Lin, National Taiwan University

Siddhant Arora, Carnegie Mellon University

Roshan Sharma, Carnegie Mellon University

Wei-Lun Wu, National Taiwan University

Hung-Yi Lee, National Taiwan University

Karen Livescu, Toyota Technological Institute at Chicago

Shinji Watanabe, Carnegie Mellon University

Panel Sessions

Note: there are no paper submissions for the panel session.


The capacity of speech processing systems to learn about human speech and enable speech-based human-computer interaction has opened up many possibilities. While the community has worked hard on the topic of speech processing system errors, we have not really grappled with the risks and negative impacts of speech applications – not because they don’t happen, but presumably because these topics are rarely in scope for the activity. The field of trustworthy and responsible AI seeks to explore limitations of technology and reduce its risks to individuals, communities, and society. There is mounting evidence of significant AI risks, such as AI bias causing harms to certain groups (e.g., in facial recognition technology), leading research communities to pay more attention to these concerns.

This special session will focus on this topic in the context of speech processing systems. It will consist of a moderated discussion of bias in speech processing from a broader, more holistic socio-technical perspective that (1) includes and goes beyond computational and statistical biases in the data and model pipelines, to include systemic bias, default culture, the role of domain expertise and contextual considerations, and human-cognitive biases across the AI lifecycle, (2) centers on impacts, how risks in AI lead to those impacts, and how design considerations and organizational practices can be developed and normalized to address risks, and (3) elicits thoughts and questions from session attendees and panels made up of the session organizers.

The session’s goals are to encourage the speech community to:

  • develop a research roadmap for evaluating and mitigating bias propagation beyond the model pipeline in speech applications
  • avoid pitfalls of other AI tasks/applications in bias
  • consider how we can:
      • explore limitations within speech applications?
      • evaluate speech application impacts in real-world settings?
      • improve our capacity for bringing socio-technical context into the design and development of speech applications?
      • know which variables within human speech are being learned by speech processing systems that have contributed to risk or unintended impacts?


Aylin Caliskan, University of Washington

Craig Greenberg, National Institute of Standards and Technology

John Hansen, University of Texas, Dallas

Abigail Jacobs, University of Michigan

Nina Markl, University of Edinburgh

Doug Reynolds, National Security Agency; MIT Lincoln Laboratory

Hilke Schellmann, New York University

Reva Schwartz, National Institute of Standards and Technology

Mona Sloane, NYU; University of Tübingen