Audio-Visual Generation

Overview

CALL FOR POSTERS
We invite extended abstracts of work already accepted at recent CVPR/ICCV/ECCV/ICML/ICLR/NeurIPS conferences or in IJCV/TPAMI/TMLR. We will also invite papers accepted to the main conference (ICCV 2025) and to the IJCV special issue on audio-visual generation to present.

CALL FOR DEMOS
~~We will also call for demos from the industry.~~ We are no longer accepting industrial demos.

In this workshop, we aim to shine a spotlight on this exciting yet underinvestigated field by prioritizing new approaches to audio-visual generation while also covering a wide range of topics in audio-visual learning, where the convergence of auditory and visual signals opens up opportunities for advancing creativity, understanding, and machine perception. We hope the workshop will bring together researchers, practitioners, and enthusiasts from diverse disciplines in both academia and industry to delve into the latest developments, challenges, and breakthroughs in audio-visual generation and learning. The workshop will cover, but is not limited to, the following topics:

- Audio-visual generation, including joint audio-visual or cross-modal generation
  - Image/video-driven audio generation
  - Audio-driven visual media generation
  - Dancing video and talking head animation
- Audio-visual foundation models
- Audio-visual representation learning and transfer learning
- Audio-visual learning applications
  - Application on scene understanding
  - Application on localization
- Audio-visual benchmarks, such as datasets and evaluation metrics
- Ethical considerations in audio-visual research

Keynote Speakers

William T. Freeman

Professor, MIT

Andrew Owens

Assistant Professor, University of Michigan

Chenliang Xu

Associate Professor, University of Rochester

Tae-Hyun Oh

Associate Professor, KAIST

Organizer Talk

Organizers

Shiqi Yang

Research Scientist, SB Intuitions, SoftBank

Zhixiang Wang

Research Scientist, CyberAgent AI Lab

Rodrigo Mira

Research Scientist, Google DeepMind

Shoukang Hu

Senior Researcher, Microsoft Research Asia - Tokyo

Vicky Kalogeiton

Assistant Professor, Ecole Polytechnique

Stavros Petridis

Scientific Research Manager, Meta

Honorary Research Fellow, ICL

Tae-Hyun Oh

Associate Professor, KAIST

Ming-Hsuan Yang

Professor, UC Merced

Program

Schedule Oct 20, 2025

1:00 - 1:10 pm

Organizers

Opening remarks

1:10 - 1:50 pm

Keynote Speaker

William T. Freeman

1:50 - 2:30 pm

Keynote Speaker

Tae-Hyun Oh

2:30 - 2:45 pm

Sponsor

Abaka AI

2:45 - 3:15 pm

Poster and Break

Papers accepted to ICCV 2025 and the IJCV special issue (IJCV SI)

3:15 - 3:30 pm

Industrial Demo

Veo 3 by Mohammad Babaeizadeh from Google DeepMind

3:30 - 4:10 pm

Keynote Speaker

Chenliang Xu (online)

4:10 - 4:50 pm

Keynote Speaker

Andrew Owens

4:50 - 5:00 pm

Organizers

Closing remarks

Poster Session

How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes (ICCV 2025). Mahnoor Fatima Saad, Ziad Al-Halah.
Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation (ICCV 2025). Fa-Ting Hong, Zunnan Xu, Zixiang Zhou, Jun Zhou, Xiu Li, Qin Lin, Qinglin Lu, Dan Xu.
AV-Flow: Transforming Text to Audio-Visual Human-like Interactions (ICCV 2025). Aggelina Chatziagapi, Louis-Philippe Morency, Hongyu Gong, Michael Zollhoefer, Dimitris Samaras, Alexander Richard.
TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis (ICCV 2025). Tri Ton, Ji Woo Hong, Chang D. Yoo.
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations (ICCV 2025). Jeong Hun Yeo*, Minsu Kim*, Chae Won Kim, Stavros Petridis, Yong Man Ro.
What's Making That Sound Right Now? Video-centric Audio-Visual Localization (ICCV 2025). Hahyeon Choi, Junhoo Lee, Nojun Kwak.
GaussianSpeech: Audio-Driven Personalized 3D Gaussian Avatars (ICCV 2025). Shivangi Aneja, Artem Sevastopolsky, Tobias Kirschstein, Justus Thies, Angela Dai, Matthias Nießner.
AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs (ICCV 2025). Sanjoy Chowdhury, Hanan Gani, Nishit Anand, Sayan Nag, Ruohan Gao, Mohamed Elhoseiny, Salman Khan, Dinesh Manocha.
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait (ICCV 2025). Taekyung Ki, Dongchan Min, Gyeongsu Chae.
VGGSounder: Audio-Visual Evaluations for Foundation Models (ICCV 2025). Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke.
Taming Data and Transformers for Audio Generation (IJCV SI). Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Vicente Ordonez.
Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization (IJCV SI). Sooyoung Park, Arda Senocak, Joon Son Chung.
Audio-Guided Video Scene Editing (IJCV SI). Kaixin Shen, Ruijie Quan, Linchao Zhu, Dong Zheng, Jun Xiao, Yi Yang.

Demo Session

Showcase the latest products and innovations

Sponsors