The following Tutorials are available at INTERSPEECH 2023. These will take place on Sunday, 20th August in the Convention Centre Dublin.

Morning Tutorials

Fei Chen (Department of Electrical and Electronic Engineering, Southern University of Science and Technology, China) and Yu Tsao (The Research Center for Information Technology Innovation (CITI), Academia Sinica, Taiwan).

An important measure of the effectiveness of speech technology applications is the intelligibility and quality of the processed speech signals they produce. A number of speech evaluation metrics have been derived to quantitatively measure specific properties of speech signals. Objective speech assessment metrics have been developed as surrogates for human listening tests. Speech assessment metrics based on deep learning models have garnered significant attention.

Fei Chen received the B.Sc. and M.Phil. degrees from the Department of Electronic Science and Engineering, Nanjing University in 1998 and 2001, respectively, and the Ph.D. degree from the Department of Electronic Engineering, The Chinese University of Hong Kong in 2005. He continued his research as a postdoctoral researcher and senior research fellow at the University of Texas at Dallas (supervised by Prof. Philipos C. Loizou) and The University of Hong Kong. He is now a full professor at the Department of Electrical and Electronic Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China. Dr. Chen leads the speech and physiological signal processing (SPSP) research group at SUSTech, with a research focus on speech perception, speech intelligibility modeling, speech enhancement, and assistive hearing technology. He has published over 100 journal papers and over 100 conference papers in IEEE journals/conferences, Interspeech, Journal of Acoustical Society of America, etc. He was a tutorial speaker at Interspeech 2022, EUSIPCO 2022, APSIPA 2021, Interspeech 2020, and APSIPA 2019, and organized the special session “Signal processing for assistive hearing devices” at ICASSP 2015. He received the best presentation award at the 9th Asia Pacific Conference of Speech, Language and Hearing. Dr. Chen is an APSIPA distinguished lecturer (2022-2023), and is now serving as an associate editor of Biomedical Signal Processing and Control.

Yu Tsao received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1999 and 2001, respectively, and the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2008. From 2009 to 2011, he was a Researcher with the National Institute of Information and Communications Technology, Tokyo, Japan, where he engaged in research and product development in automatic speech recognition for multilingual speech-to-speech translation. He is currently a Research Fellow (Professor) and the Deputy Director with the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. He is also a Jointly Appointed Professor with the Department of Electrical Engineering, Chung Yuan Christian University, Taoyuan, Taiwan. His research interests include assistive oral communication technologies, audio coding, and bio-signal processing. He is currently an Associate Editor for the IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING and IEEE SIGNAL PROCESSING LETTERS. He was the recipient of the Academia Sinica Career Development Award in 2017, national innovation awards in 2018–2021, Future Tech Breakthrough Award 2019, and Outstanding Elite Award, Chung Hwa Rotary Educational Foundation 2019–2020. He is the corresponding author of a paper that received the 2021 IEEE Signal Processing Society (SPS) Young Author Best Paper Award.

Shi-Xiong Zhang (Tencent AI Lab, Bellevue, USA), Yong Xu (Tencent AI Lab, Bellevue, USA), Shinji Watanabe (Carnegie Mellon University, Pittsburgh, USA) and Dong Yu (Tencent AI Lab, Bellevue, USA).

A new trend in today’s speech fields is to develop systems for wilder and more challenging scenarios, such as multiple simultaneous speakers in meetings or cocktail party environments. Significant research activity has occurred in recent years in these fields and great advances have been made. This tutorial will bring together state-of-the-art research on solving “Who said What and When” in multi-talker scenarios, including: 1) front-end speech separation and beamforming, and back-end speaker diarization and speech recognition; 2) modeling techniques for single-channel, multi-channel or audio-visual inputs; 3) pipeline systems of multiple speech modules vs. end-to-end integrated neural networks. The goal is to give the audience a complete picture of this cross-disciplinary field and to shed light on future directions and collaborations.

Shi-Xiong (Austin) Zhang received the Ph.D. degree from Cambridge University in 2014. From 2014 to 2018, he was a senior speech scientist in the speech group at Microsoft. Currently he is a principal researcher at Tencent AI Lab, leading the multi-modal research on speech recognition, speaker diarization, and speech separation. He was granted the “IC Greatness award” at Microsoft in 2015 for his contribution to the “Personalized Hey Cortana” system in Windows 10. He was nominated for a 2011 Interspeech Best Student Paper Award for his paper “Structured Support Vector Machines for Noise Robust Continuous Speech Recognition”. Shi-Xiong has served as a Program Committee member of APSIPA and as an Area Chair of several international conferences, including ICASSP, Interspeech and ASRU in 2021 and 2022.

Yong Xu is a principal researcher at Tencent AI Lab, Bellevue, USA. His current research interests include speech enhancement, neural beamforming, etc. In recent years, he proposed and published several all-neural beamforming technologies. He received the 2018 IEEE Signal Processing Society Best Paper award for his work on DNN-based speech enhancement. He is a member of the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC).

Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at Georgia Institute of Technology, Atlanta, GA, in 2009, and a senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017. Before the move to Carnegie Mellon University, he was an associate research professor at Johns Hopkins University, Baltimore, MD, USA, from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published over 300 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from the IEEE ASRU in 2019. He is a Senior Area Editor of the IEEE Transactions on Audio Speech and Language Processing. He was/has been a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and Machine Learning for Signal Processing Technical Committee (MLSP). He is an IEEE Fellow.

Dong Yu (M’97 SM’06 F’18) is an IEEE Fellow, an ISCA Fellow, and an ACM distinguished scientist. He currently works as a distinguished scientist and vice general manager at Tencent AI Lab. Prior to joining Tencent in 2017, he was a principal researcher at Microsoft Research (Redmond), Microsoft, which he joined in 1998. He has focused his research on speech processing, especially neural speech enhancement and separation in recent years, and has published two monographs and 300+ papers. His works have been cited over 50,000 times per Google Scholar and have been recognized by the prestigious IEEE Signal Processing Society 2013, 2016, 2020, and 2022 best paper awards.

Pin-Yu Chen (IBM AI and MIT-IBM Watson AI Lab, NY, USA), C.-H. Huck Yang (Amazon Alexa Speech, WA, USA), Shalini Ghosh (Amazon Alexa Speech, WA, USA), Jia-Hong Huang (Universiteit van Amsterdam, the Netherlands) and Marcel Worring (Universiteit van Amsterdam, the Netherlands).

In this tutorial, the first session will introduce the theoretical advantages of large-scale pre-trained foundation models through universal approximation theory and show how to update large-scale speech and acoustic models effectively using parameter-efficient learning. Next, our second session will introduce effective cross-modal pre-training of representations across visual, speech, and language modalities, which can be learned without necessarily needing aligned data across modalities and can benefit tasks in individual modalities as well. Finally, our third session will explore different applications in multimedia processing that benefit from the pre-training of acoustic and language models, with benchmark performance.

Pin-Yu Chen is a principal research scientist at IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. He is also the chief scientist of RPI-IBM AI Research Collaboration and PI of ongoing MIT-IBM Watson AI Lab projects. Dr. Chen received his Ph.D. in electrical engineering and computer science from the University of Michigan, Ann Arbor, USA, in 2016. Dr. Chen’s recent research focuses on adversarial machine learning and robustness of neural networks. His long-term research vision is to build trustworthy machine learning systems. He received the IEEE GLOBECOM 2010 GOLD Best Paper Award and UAI 2022 Best Paper Runner-Up Award.

Huck Yang works in the Language Modeling team at Amazon Alexa Speech Recognition. He did his Ph.D. in the School of Electrical and Computer Engineering at the Georgia Institute of Technology in Atlanta, GA, USA, with a Wallace H. Coulter fellowship. He received a B.S. degree from National Taiwan University. His recent research interests focus on in-context parameter-efficient learning, such as residual adapter, model reprogramming, prompt-tuning, and differential privacy for speech processing. Previously, he was a research intern at Google, Amazon, Hitachi, and EPFL.

Dr. Shalini Ghosh is a Principal Research Scientist at Amazon Alexa AI. Previously, she served as a Director/Principal Scientist at Samsung Research America, a Principal Scientist at SRI International, and a Visiting Scientist at Google Research. She earned her Ph.D. from the University of Texas at Austin. With over 15 years of experience as an ML researcher, Dr. Ghosh specializes in deep learning with applications to various domains, including multimedia processing, video understanding, and multi-modal pre-training using image, text, and audio data.

Dr. Marcel Worring is a full professor in Multimedia Analytics in the Informatics Institute, where he leads the MultiX group. The research in MultiX focuses on developing AI techniques for getting the richest information possible from the data (image/video/text/graphs), interactions surpassing human and machine intelligence, and visualizations blending it all in effective interfaces for applications in health, forensics and law enforcement, cultural heritage, urban livability, and social media analysis.

Juliana Saba (Centre for Robust Speech Systems, The University of Texas, USA), Ram C.M.C Shekar (Centre for Robust Speech Systems, The University of Texas, USA), Oldooz Hazrati (Food & Drug Administration, USA) and John H.L. Hansen (Centre for Robust Speech Systems, The University of Texas, USA).

This tutorial will provide an overview of speech and sound perception specific for cochlear implant users and discuss how human subjects research is conducted through the use of research platforms. Signal processing aspects, such as sound coding strategies, the translation of acoustic parameters in the electric space, and performance of these clinical devices in various listening situations will be discussed. Two types of strategies will be provided: speech-specific and non-speech. A brief explanation of advancements in speech processing, research platforms, and cloud-based technology as well as future directions will be discussed.

Juliana N. Saba graduated from the University of Texas at Dallas, Richardson, TX with degrees in Biomedical Engineering (B.S., 2015; M.S., 2019) and Electrical Engineering (PhD, 2021). After graduating in the first accredited class of bioengineering, she began work on her doctoral dissertation focused on incorporating physiological and subject-specific features in signal processing strategies for cochlear implant users. She has a diverse publication profile from involvement in the maintenance of cochlear implant research platforms and collaborations in the dental field. Prior to joining the Cochlear Implant Processing Laboratory, she was involved in a bioengineering laboratory assisting in the design and development of novel implant-abutment systems and led investigations related to electrochemical and cytotoxicity effects of dental cements in vitro. Juli is currently a postdoctoral researcher in the CILab at UT-Dallas designing Lombard Effect perturbation strategies for cochlear implant users, funded by NIH-NIDCD.

Ram Charan Chandra Shekar graduated from the University of Texas at Dallas, Richardson, TX with a Ph.D. in Electrical Engineering (2022). After finishing his Bachelor of Engineering and Master’s in Technology in India, Ram worked at IBM as an Associate System Engineer. In 2016, Ram was admitted into the doctoral program in Electrical Engineering at UT Dallas. His doctoral thesis primarily focused on analysis and development of novel techniques for the improvement of perception of environmental sounds among cochlear implant users. He has also served as a teaching assistant for digital signal processing, linear algebra, digital systems and other courses. He has gained extensive experience working on diverse research topics, such as real-time dynamic range compression and speech enhancement for hearing aids, safety analysis for cochlear implants, and novel techniques for advancement of non-speech (non-linguistic) sound perception among cochlear implant users. During his PhD, Ram interned with Texas Instruments and Meta (formerly Facebook) and worked on developing deep learning techniques for obstacle recognition using ultrasonic sensors, and an advanced hear-through filter for efficient representation of spatial audio. Currently, Ram is engaged as a post-doctoral researcher focusing on improving speech intelligibility among non-native English speakers, funded by an NSF EAGER grant. Ram is also pursuing a post-graduate machine learning certificate.

Oldooz Hazrati received her Ph.D. in Electrical Engineering from The University of Texas at Dallas (UTD) in 2012, under the supervision of Dr. Philip Loizou with a research assistantship supported by grants from NIH and Cochlear Limited. She was a post-doctoral researcher in the Cochlear Implant and Speech Processing laboratories in the Erik Jonsson School of Engineering & Computer Science, The University of Texas at Dallas, from 2013 to 2015. Her research was supported by Cochlear Limited (PI: Oldooz Hazrati) and NIH. In September 2015, she became an adjunct faculty member in the department of Electrical Engineering at UTD. Since January 2016, she has been a senior staff fellow at the Food and Drug Administration (FDA), and currently serves as Lead Reviewer for FDA in the Communication Assistive Technologies area. She published 12 journal papers and 17 conference papers during her work at UTD on speech processing for Cochlear Implants.

John H.L. Hansen received his Ph.D. and M.S. degrees in Electrical Engineering from Georgia Institute of Technology, Atlanta, Georgia, in 1988 and 1983, and B.S.E.E. degree from Rutgers University, New Brunswick, N.J. in 1982. He joined the University of Texas at Dallas (UTD), Erik Jonsson School of Engineering & Computer Science in 2005, where he is Associate Dean for Research and Professor of Electrical Engineering, and Professor in Brain and Behavioral Sciences (Speech & Hearing). At UTD, he holds the Distinguished Chair in Telecommunications Engineering, and established the Center for Robust Speech Systems (CRSS). From 1999-2005, he was with Univ. of Colorado Boulder, as Dept. Chair and Professor in Speech, Language, Hearing Sciences, and Professor in Electrical Engineering, and co-founded the Center for Spoken Language Research. From 1988-1998, he was with Duke Univ., Departments of Electrical and Biomedical Engineering, and founded the Robust Speech Processing Laboratory. He has served as IEEE Distinguished Lecturer, member of IEEE Signal Processing Society: Speech Technical Committee (TC Chair 2012-14) and Educational Technical Committee, Technical Advisor to U.S. Delegate for NATO (IST/TG-01), Associate Editor for IEEE Trans. Speech & Audio Proc., Associate Editor for IEEE Signal Proc. Letters, Editorial Board Member for IEEE Signal Proc. Magazine, member of Speech Communications Technical Committee for Acoustical Society of America, and served as General Chair for Interspeech-2002 and Technical Chair for IEEE ICASSP-2010. He also served as Co-Chair for ISCA Interspeech-2022, and Tech. Chair for IEEE ICASSP-2024. He has supervised 99 thesis candidates, was the recipient of the 2005 Univ. of Colorado Teacher Recognition Award, and is author/co-author of 865 journal & conference papers in the field of speech processing and language technology, and co-author of Discrete-Time Processing of Speech Signals (IEEE Press, 2000). He served as ISCA President (2017-2021) and currently continues to serve on the ISCA Board as Treasurer.

Kazuyoshi Yoshii (Kyoto University, Japan/SSU Team with AIP, RIKEN, Tokyo, Japan), Aditya Arie Nugraha (SSU Team with AIP, RIKEN, Tokyo, Japan), Mathieu Fontaine (LTCI, Télécom Paris, Palaiseau, France/SSU Team with AIP, RIKEN, Tokyo, Japan) and Yoshiaki Bando (Artificial Intelligence Research Centre (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan/SSU Team with AIP, RIKEN, Tokyo, Japan).

This tutorial aims to show audio and speech researchers who are interested in source separation and speech enhancement how to formulate a physics-aware probabilistic model that explicitly represents the generative process of observed audio signals (direct problem) and how to derive its maximum likelihood estimator (inverse problem) in a principled manner. Under mismatched conditions and/or with less training data, the separation performance of supervised methods might be degraded drastically in the real world, as is often the case with deep learning-based methods that work well in controlled benchmarks. We show first that the state-of-the-art blind source separation (BSS) methods can work comparably or even better in the real world and play a vital role in drawing out the full potential of deep learning-based methods. Secondly, this tutorial introduces how to develop an augmented reality (AR) application for smart glasses with real-time speech enhancement and recognition of target speakers.

Kazuyoshi Yoshii received M.S. and PhD degrees in informatics from Kyoto University, Kyoto, Japan, in 2005 and 2008, respectively. He is currently an Associate Professor with the Graduate School of Informatics, Kyoto University, and concurrently the Leader of the Sound Scene Understanding Team with the Center for Advanced Intelligence Project (AIP), RIKEN, Tokyo, Japan. He is also a member of the Audio and Acoustic Signal Processing (AASP) Technical Committee, Signal Processing Society (SPS), IEEE and a Distinguished Lecturer of Asia-Pacific Signal and Information Processing Association (APSIPA). His research interests include music informatics, audio and speech signal processing, and statistical machine learning. He presented 19 papers in IEEE TASLP, 3 papers in IEEE SPL, 28 papers in IEEE ICASSP, 11 papers in EUSIPCO, and 37 papers in ISMIR.

Aditya Arie Nugraha received the B.S. and M.S. degrees in electrical engineering from Institut Teknologi Bandung, Bandung, Indonesia, in 2008 and 2011, respectively, the M.E. degree in computer science and engineering from Toyohashi University of Technology, Toyohashi, Japan, in 2013, and the Ph.D. degree in informatics from Université de Lorraine, Nancy, France, and INRIA Nancy Grand-Est, France, in 2017. He is currently a Research Scientist of the Sound Scene Understanding Team with the AIP, RIKEN, Tokyo, Japan. His research interests include audio-visual signal processing and machine learning.

Mathieu Fontaine received an M.S. degree in applied and fundamental mathematics from Université de Poitiers, Poitiers, France, in 2015, and a Ph.D. degree in informatics from Université de Lorraine and INRIA Nancy Grand-Est, France, in 2018. He was a Postdoctoral Researcher with the AIP, RIKEN, Tokyo, Japan. He is currently an Associate Professor with LTCI, Télécom Paris, Palaiseau, France. He is also a Visiting Researcher with the AIP, RIKEN, Tokyo, Japan. His research interests include machine listening topics, such as audio source separation, sound event detection, and speaker diarization using microphone arrays.

Yoshiaki Bando received M.S. and Ph.D. degrees in informatics from Kyoto University, Kyoto, Japan, in 2015 and 2018, respectively. He is currently a Senior Researcher with the Artificial Intelligence Research Center (AIRC), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan. He is also a Visiting Researcher with the AIP, RIKEN, Tokyo, Japan. His research interests include microphone array signal processing, deep Bayesian learning, and robot audition.

Jee-weon Jung (Naver Corporation, Korea / Carnegie Mellon University, USA), Hye-jin Shim (University of Eastern Finland, Finland), Hemlata Tak (EURECOM, France) and Xin Wang (National Institute of Informatics, Japan).

This tutorial will delve into the latest advances in audio anti-spoofing and audio deepfake detection, driven by the application of graph neural networks and self-supervised learning. We will provide a comprehensive overview of the latest state-of-the-art techniques, including in-depth analysis and hands-on coding demonstrations. By attending this tutorial, participants will gain a thorough understanding of state-of-the-art audio anti-spoofing models and will be knowledgeable enough to experiment with these models and leverage them as future baselines.

Jee-weon Jung is a postdoctoral researcher at Carnegie Mellon University, USA. Previously he was a research scientist at Clova, Naver Corporation, Republic of Korea. He received his Ph.D. degree from the University of Seoul, Republic of Korea. He has worked on speaker recognition, acoustic scene classification, audio spoofing detection, and related tasks. He was the main organizer of the Spoofing-Aware Speaker Verification Challenge, a special session at Interspeech 2022, and has been one of the organizers of VoxSRC since 2022. He has published several papers with state-of-the-art models, including AASIST, for audio anti-spoofing using graph neural networks.

Hye-jin Shim is a postdoctoral researcher at the University of Eastern Finland. Her research interests include audio anti-spoofing, speaker recognition, and acoustic scene classification. She received her Ph.D. and M.S. degrees in computer science from the University of Seoul in 2022 and 2019, respectively. She is one of the organizers of the Spoofing-Aware Speaker Verification (SASV) and ASVspoof Challenges.

Hemlata Tak is a PhD candidate at EURECOM, France. She received her Master’s degree in 2018 from DA-IICT, Gandhinagar, India. She co-organized the inaugural edition of the Spoofing-Aware Speaker Verification (SASV) Challenge 2022. She is also a co-organiser of the ASVspoof 5 Challenge. Her research interests include voice biometrics, audio deepfake detection and anti-spoofing.

Xin Wang is a project assistant professor at the National Institute of Informatics, Japan. He received his Ph.D. degree from the Department of Informatics, SOKENDAI located at the National Institute of Informatics in 2018. He is one of the organisers of the ASVspoof Challenge 2019, 2021, and its latest edition. He is also one of the organisers of the Voice Privacy Challenge 2020 and 2022. He is on the appointed team of ISCA SIG on Security and Privacy in Speech Communication (SPSC). His research interests include speech anti-spoofing, speech privacy protection, and speech synthesis.

Tulika Saha (University of Liverpool, United Kingdom), Abhisek Tiwari (Indian Institute of Technology Patna, India) and Sriparna Saha (Indian Institute of Technology Patna, India).

In the past few years, dozens of surveys have revealed a scarcity of healthcare professionals, particularly psychiatrists, limiting access to healthcare for severely ill individuals. With the motivation of efficiently utilizing doctors’ time and providing an accessible platform for early diagnosis, clinical assistance using artificial intelligence is gaining immense popularity and demand in both research and industry communities. As a result, telemedicine has grown substantially in recent years, particularly since the COVID outbreak. The tutorial aims to present a comprehensive overview of the use of conversational agents in healthcare, including recent advancements and future prospects. The tutorial will also provide a demonstration of our newly developed virtual disease diagnosis assistant. The tutorial has been crafted with fundamentals to advanced concepts in mind, which makes it beneficial for researchers who are beginners or experts.

Dr. Tulika Saha is a Lecturer of Computer Science at the University of Liverpool, United Kingdom (UK). Her current research interests include ML, DL, and NLP (typically Dialogue Systems), AI for Social Good, Social Media Analysis, etc. She was a postdoctoral research fellow at the National Centre for Text Mining, University of Manchester, UK. Previously she earned her Ph.D. from Indian Institute of Technology Patna, India. Her research articles have been published in top-tier conferences such as ACL and ACM SIGIR and in several peer-reviewed journals such as PLOS ONE and IEEE TCSS. She is currently serving as an area chair (AC) for several top-tier conferences such as ACL.

Abhisek Tiwari is a research scholar (Prime Minister Research Fellow) in Computer Science and Engineering, Indian Institute of Technology Patna. His research interests include AI for Social Good, NLP (typically Conversational AI), and RL. He is also serving as a guest lecturer at NSIT Bihta, India. His research works have been published in reputable conferences, such as CIKM and IJCNLP, and in peer-reviewed journals. Abhisek has delivered several tutorials, including the GIAN Course on DL Techniques for Conversational AI, and conducted birds-of-a-feather sessions at top-tier conferences such as ACL, ICLR, and NeurIPS.

Dr. Sriparna Saha is currently serving as an Associate Professor in the Department of Computer Science and Engineering, Indian Institute of Technology Patna, India. She has authored or co-authored more than 400 papers. Her current research interests include machine learning, deep learning, bioinformatics, natural language processing, multiobjective optimization, and biomedical information extraction. Her h-index is 33 and the total citation count of her papers is 6710 (according to Google Scholar). She is also a senior member of IEEE. She is the recipient of the Google India Women in Engineering Award, 2008, NASI YOUNG SCIENTIST PLATINUM JUBILEE AWARD 2016, BIRD Award 2016, IEI Young Engineers’ Award 2016, SERB WOMEN IN EXCELLENCE AWARD 2018, SERB Early Career Research Award 2018, Humboldt Research Fellowship, Indo-U.S. Fellowship for Women in STEMM (WISTEMM) Women Overseas Fellowship program 2018 and CNRS fellowship. She is currently also serving as an Associate Editor of IEEE/ACM Transactions on Computational Biology and Bioinformatics, IEEE Transactions on Computational Social Systems, ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Expert Systems with Applications, PLOS ONE, Machine Learning with Applications, IEEE Internet Computing, Engineering Applications of Artificial Intelligence, Elsevier journal (impact factor: 6.2, h5-index: 65). Her name is included in the list of the top 2% of scientists in their main subfield discipline (Artificial Intelligence and Image Processing), across those that have published at least five papers (a survey conducted by Stanford University).

Fangjun Kuang (Xiaomi, China), Matthew Wiesner (Johns Hopkins University, USA), Piotr Żelasko (Meaning, USA), Desh Raj (Johns Hopkins University, USA), Dan Povey (Xiaomi, China), Sanjeev Khudanpur (Johns Hopkins University, USA), Leibny Paola Garcia Perera (Johns Hopkins University, USA) and Jan “Yenda” Trmal (Johns Hopkins University, USA).

The focus of this tutorial is on the new features in Lhotse and Icefall, such as: efficient algorithms and architectures that enable fast and memory-efficient training of Transducers, even in academic environments using modest GPU resources; novel fast decoding algorithms; sequential data storage and I/O to enable easy storage and processing of large corpora (>30,000 hrs); new Lhotse workflows with Whisper and Wav2Vec2.0; and new ASR recipes focusing on corpora with 5000+ hrs of speech. We also demonstrate how Lhotse can be used to support full-stack speech processing with blind source separation in multi-talker multi-microphone recordings. Finally, we present a new ASR server framework in Python, called Sherpa, that supports both streaming and non-streaming recognition. We hope this tutorial will encourage the wider community, including industrial and academic researchers, to develop and deploy full-stack, Transducer-based ASR solutions trained on large corpora such as the Gigaspeech or SPGI Speech corpora.

Fangjun Kuang received his master’s degree from the University of Stuttgart, Germany, in 2017, and his bachelor’s degree from Central South University, China, in 2011. He is currently a speech researcher at Xiaomi and his main interest is speech recognition. He is a member of the next-gen Kaldi team and is interested in developing open-source frameworks for speech recognition, including training as well as deployment.

Matthew Wiesner received his PhD from Johns Hopkins University in 2021, where he is currently a research scientist at the Human Language Technology Center of Excellence. He has worked extensively on various multilingual aspects of speech processing ranging from zero-shot speech recognition to speech translation. He was an organizer for the 2021 IWSLT Multilingual speech translation task, and for the CHiME-7 DASR challenge.

Dr. Piotr Żelasko is an expert in ASR and spoken language understanding, with extensive experience in developing practical and scalable ASR solutions for industrial-strength use. He worked at Johns Hopkins University’s Center for Language and Speech Processing, as well as with successful speech processing start-ups – Techmo (Poland) and IntelligentWire (USA, acquired by Avaya). Currently he is the head of research at Meaning.Team Inc., a speech processing start-up.

Desh Raj is currently a Ph.D. student at Johns Hopkins University, where he is advised by Sanjeev Khudanpur and Dan Povey. His research involves problems such as multi-talker speech recognition and speaker diarization, and solving them through end-to-end methods. He has interned with the speech groups at Microsoft and Meta AI, where he built transducer-based systems for overlapped speech. He is a core contributor to Lhotse, and an organizer for the CHiME-7 DASR challenge. At JHU, his research is funded by an Amazon AI2AI fellowship.

Daniel Povey completed his PhD at Cambridge University in 2003. He spent about ten years working for industry research labs (IBM Research and then Microsoft Research), and 7 years as non-tenure-track faculty at Johns Hopkins University; he moved to Beijing, China in November 2019 to join Xiaomi Corporation as Chief Voice Scientist. He is known for many different contributions to the technology of speech recognition. He is an IEEE Fellow as of 2023.

Dr. Sanjeev Khudanpur is a professor at Johns Hopkins University with over 25 years of experience working on almost all aspects of human language technology, including ASR, machine translation, and information retrieval. He has led a number of research projects from NSF, DARPA, IARPA, and industry sponsors, and published extensively. He has trained more than 40 PhD and Masters students to use speech recognition tools for their dissertation work. His research interests are in the application of information theoretic and statistical methods to human language technologies.

Dr. Leibny Paola Garcia Perera (PhD 2014, University of Zaragoza, Spain) joined Johns Hopkins University after extensive research experience in academia and industry, including highly regarded laboratories at Agnitio and Nuance Communications. She led a team of 20+ researchers from four of the best laboratories worldwide in far-field speech diarization and speaker recognition under the auspices of the JHU summer workshop 2019 in Montreal, Canada. She was also a researcher at Tec de Monterrey, Campus Monterrey, Mexico, for ten years. She was a Marie Curie researcher for the Iris project in 2015, exploring assistive technology for children with autism in Zaragoza, Spain. She was a visiting scholar at Georgia Institute of Technology (2009) and Carnegie Mellon (2011). Recently, she has been working on children’s speech, including child speech recognition and diarization in day-long recordings. She collaborates with DARCLE.org and CCWD, which analyze child-centered speech. She is also part of the JHU CHiME5, CHiME6, SRE18 and SRE19, SRE20, SRE21, LRE22 teams. Her interests include diarization, speech recognition, speaker recognition, machine learning, and language processing.

Jan “Yenda” Trmal received his PhD in 2013 from the University of West Bohemia, Czech Republic. From 2013 to 2015, he worked as a Postdoctoral Fellow and later as an Assistant Research Scientist at the Center for Language and Speech Processing (CLSP). Since 2017, he has been an Associate Research Scientist with CLSP.