Call for Papers: IJCV Special Issue
Audio-Visual Generation
Introduction
The ability to simulate and reason about the physical world is central to human intelligence. We perceive our surroundings and construct mental models that allow us to internally simulate possible outcomes, enabling reasoning, planning, and action — what we might call “world simulators”. Similarly, developing a world simulator is crucial for building human-like AI systems that can interact effectively with dynamic and complex environments. Recent research has shown that high-fidelity video generation models are a promising path toward building such comprehensive and efficient world simulators. However, the physical world is inherently multimodal. Human perception relies not only on visual stimuli but also on sound. Sound often conveys critical information that complements what we can see, providing a richer and more nuanced understanding of the environment. To create world simulators capable of mimicking human-like perception and reasoning, it is crucial to develop coherent audio-visual generative models. Despite this, most modern approaches focus on vision-only or language-visual modalities, with far less attention paid to understanding and generating integrated audio-visual signals.
This special issue aims to spotlight the exciting yet underexplored field of audio-visual generation as a key stepping stone towards achieving multi-modal world simulators. Our goal is to prioritize innovative approaches that explore this multimodal integration, advancing both the generation and analysis of audio-visual content, and to examine the broader impacts of this research. Moreover, in line with the classical concept of analysis-by-synthesis, advances in audio-visual generation can foster improvements in analysis and understanding methods, reinforcing the symbiotic relationship between these two areas. This research is not merely about content creation; it holds the potential to form a fundamental building block for more advanced, human-like AI systems.
Aims & Scope
Audio and image/video generation. Physically accurate and coherent audio-video generation models would be a promising way to build comprehensive and efficient world simulators. Audio-driven image/video generation or joint co-generation is crucial for enabling more immersive and interactive experiences by synchronizing visual outputs with auditory inputs. This subtopic includes, but is not limited to, joint audio-visual generation, audio-to-image/video translation, and audio-guided image/video editing, as long as the techniques serve as promising foundational components.
Audio-conditional X generation. In this subtopic, we consider audio-driven cross-modal generation of various modalities beyond image and video, including 3D/4D scenes, human motion (e.g., dance and gestures), and virtual avatars. Although successful cross-modal generation may seem straightforward, it is challenging because it requires accurately capturing the underlying associations between audio and the target modality. This line of research broadens our understanding of the shared and mutually exclusive information between different modalities. We are also interested in models that achieve any-to-any generation.
Speech and talking avatar generation. Speech is a unique type of audio signal compared to ambient environmental sounds, as it conveys both linguistic information and social cues. The speech and facial movements of a speaker are highly correlated. Generating talking faces, head movements, and gestures in 2D/3D not only enhances the realism and expressiveness of virtual characters in multimedia applications but also deepens our understanding of the relationship between linguistic information in speech and natural face/body movement, resulting in greater perceptual coherence. Such models will be key components of world simulators that drive authentic human-machine interaction.
Advanced audio-visual adaptor or interface. Audio-visual generation involves understanding and extracting complex underlying information embedded in audio signals, and re-compiling that information into a completely different visual modality. This is often implemented by combining pre-trained modality encoders and decoders with simple, learnable adaptor modules. However, this complex process may not be effectively modeled by a simple stack of fully connected layers. In this subtopic, we consider the design of adaptors or interfaces that can effectively bridge the feature spaces of heterogeneous audio-visual modalities. The importance of these interfaces is highlighted in multi-modal large language models and any-to-any generation models as well. We also encourage research that analyzes and improves our understanding of the modality gap between audio and visual representations.
Benchmarks and datasets. One obvious obstacle in audio-visual generation research is the lack of high-quality data, both for training and for benchmarking the performance of existing methods. In this subtopic, we welcome contributions that introduce new datasets and evaluation metrics that can advance the field of audio-visual generation.
Ethical considerations and social impact. Ethical considerations in audio-visual generation research are crucial to ensuring the responsible use of technology, protecting individual privacy, and preventing potential misuse or harmful societal impacts, such as hallucination effects. In this subtopic, we welcome discussions on ethical and privacy concerns, such as algorithmic bias, AI-generated voice mimicry, and other potential risks associated with audiovisual generation.
Generic topics and applications related to audio-visual generation. We also welcome submissions on broader audio-visual research areas that can advance audio-visual generation. These include innovative model architectures, novel learning algorithms, dedicated loss functions, and improved data processing techniques (such as curation, filtering, and augmentation). Furthermore, research on fusion methods, conditioning mechanisms, and other methods that facilitate the integration of audio and visual modalities is encouraged. Beyond the topics mentioned, we invite research on how audio-visual generation models and their components can facilitate downstream applications, such as robotic learning, game design, and advertisement/film creation, serving as proxies for world simulators.
Important Dates
Manuscript submission deadline: Mar. 15, 2025
First review notification: May 25, 2025
Revised manuscript submission: Jul. 10, 2025
Final review notification: Aug. 10, 2025
Final manuscript submission: Sep. 20, 2025
Publication year: 2025
Guest Editors
Tae-Hyun Oh: Pohang University of Science and Technology (POSTECH), South Korea, thoh.kaist.ac.kr@gmail.com
Shiqi Yang: SB Intuitions, SoftBank, Japan, shiqi.yang@sbintuitions.co.jp
Zhixiang Wang: CyberAgent AI Lab, Japan, wangzx1994@gmail.com
Sergey Tulyakov: Snap Inc, US, stulyakov@snap.com
Stavros Petridis: Meta and Imperial College London, UK, stavrosp@meta.com
Vicky Kalogeiton: Ecole Polytechnique, IP Paris, France, vicky.kalogeiton@polytechnique.edu
Ming-Hsuan Yang: University of California, Merced, US and Yonsei University, South Korea, mhyang@ucmerced.edu
Contact
If you have any questions related to this IJCV special issue, please contact Tae-Hyun Oh, Shiqi Yang, or Zhixiang Wang.