Overview
CALL FOR REVIEWERS
We're looking for experienced reviewers to help shape the latest research in this exciting domain! If you're interested, please fill out this form.
CALL FOR PAPERS
We're accepting high-quality papers on the topic of audio-visual generation (and learning).
The flyer is available here.
The ability to simulate and reason about the physical world is central to human intelligence. We perceive our surroundings and construct mental models that let us internally simulate possible outcomes, enabling reasoning, planning, and action; in this sense, the mind acts as a "world simulator". Building such a world simulator is likewise crucial for human-like AI systems that must interact effectively with dynamic, complex environments. Recent research suggests that high-fidelity video generation models are a promising path toward comprehensive and efficient world simulators. The physical world, however, is inherently multimodal: human perception relies not only on visual stimuli but also on sound, which often conveys critical information that complements what we see, yielding a richer and more nuanced understanding of the environment. Creating world simulators capable of mimicking human-like perception and reasoning therefore requires coherent audio-visual generative models. Despite this, most modern approaches target vision-only or language-vision modalities, paying comparatively little attention to understanding and generating integrated audio-visual signals.
This special issue aims to spotlight the exciting yet underexplored field of audio-visual generation as a key stepping stone toward multimodal world simulators. Our goal is to prioritize innovative approaches to this multimodal integration, advancing both the generation and the analysis of audio-visual content, and to explore the broader impacts of this research. Moreover, in line with the classical concept of analysis-by-synthesis, advances in audio-visual generation can drive improvements in analysis and understanding methods, reinforcing the symbiotic relationship between the two areas. This research is not merely about content creation; it has the potential to become a fundamental building block of more advanced, human-like AI systems.
Scope
Guest Editors
The team is led by Tae-Hyun Oh.
Submission Guidelines
Please submit via the IJCV Editorial Manager: www.editorialmanager.com/visi. Choose "SI: Audio-Visual Generation" from the dropdown menu. Refer to the official site for details.
FAQ
Important Dates
Upon request, we have extended the submission deadline to May 2 '25. Papers submitted early will be processed on a rolling basis.
| Milestone | Deadline |
| --- | --- |
| Manuscript Submission Deadline | May 02 '25, 11:59 PM AoE |
| First Review Notification | June 25 '25, 11:59 PM AoE |
| Revised Manuscript Submission | July 10 '25, 11:59 PM AoE |
| Final Review Notification | August 10 '25, 11:59 PM AoE |
| Final Manuscript Submission | September 20 '25, 11:59 PM AoE |