EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation

Libo Zhang1,§   Zekun Li2   Tianyu Li3   Zeyu Cao4   Rui Xu5   Xiao-Xiao Long6  
Wenjia Wang5   Jingbo Wang7   Yuan Liu8   Wenping Wang9  
Daquan Zhou10   Taku Komura5   Zhiyang Dou5,11,†,§  
1THU   2Brown   3Georgia Tech   4Cambridge   5HKU   6NJU  
7CUHK   8HKUST   9TAMU   10PKU   11MIT  
§ denotes work completed during Summer 2025; † denotes corresponding author.
arXiv 2025

  • Code (Coming Soon)

Abstract


Humans exhibit adaptive, context-sensitive responses to egocentric visual input. However, faithfully modeling such reactions from egocentric video remains challenging due to the dual requirements of strictly causal generation and precise 3D spatial alignment. To tackle this problem, we first construct the Human Reaction Dataset (HRD), a spatially aligned egocentric video-reaction dataset that addresses both data scarcity and spatial misalignment: existing datasets (e.g., ViMo) exhibit significant spatial inconsistency between the egocentric video and the reaction motion, for instance pairing dynamically moving motions with fixed-camera videos. Leveraging HRD, we present EgoReAct, the first autoregressive framework that generates 3D-aligned human reaction motions from egocentric video streams in real time. We compress the reaction motion into a compact yet expressive latent space with a Vector Quantised-Variational AutoEncoder (VQ-VAE) and then train a Generative Pre-trained Transformer (GPT) to generate reactions from the visual input. During generation, EgoReAct incorporates 3D dynamic features, namely metric depth and head dynamics, which effectively enhance spatial grounding. Extensive experiments demonstrate that EgoReAct achieves substantially higher realism, spatial consistency, and generation efficiency than prior methods while maintaining strict causality during generation. We will release code, models, and data upon acceptance.
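As a concrete illustration of the motion tokenization described above, the snippet below is a minimal PyTorch sketch of a VQ-VAE-style codebook lookup that maps continuous motion latents to discrete token indices. The class name, codebook size, and latent dimension are illustrative assumptions, not the released EgoReAct implementation.

import torch
import torch.nn as nn


class MotionQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes: int = 512, code_dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, frames, code_dim) continuous latents from a motion encoder.
        B, T, D = z_e.shape
        dist = torch.cdist(z_e.reshape(-1, D), self.codebook.weight)  # (B*T, num_codes)
        tokens = dist.argmin(dim=-1).reshape(B, T)                    # discrete motion tokens
        z_q = self.codebook(tokens)                                   # quantized latents
        # Straight-through estimator: gradients flow back to the encoder via z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, tokens


quantizer = MotionQuantizer()
latents = torch.randn(2, 60, 256)      # e.g. 60 frames of encoded motion per clip
z_q, tokens = quantizer(latents)
print(z_q.shape, tokens.shape)         # torch.Size([2, 60, 256]) torch.Size([2, 60])

Such discrete tokens form the compact motion representation that the autoregressive Transformer later predicts from the visual input.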


EgoReAct takes streaming egocentric video as input and synthesizes spatially grounded, realistic human reaction motions in real time, enabling responsive full-body behaviors that are tightly coupled with the ongoing egocentric observations.

Method


Pipeline of EgoReAct. We first learn a Motion VQ-VAE to discretize continuous 3D motions into compact token sequences. Building on this representation, EgoReAct takes streaming egocentric RGB frames as input, estimates their depth, and encodes the image, depth, and head dynamics cues to form ego-perception features, which guide an Autoregressive Transformer to sequentially generate spatially aligned and temporally causal reaction motions.
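The streaming generation step can be pictured with the minimal sketch below: per-frame ego-perception features (image, depth, and head dynamics) are added to the embeddings of previously generated motion tokens, and a causally masked Transformer predicts the next token. Module names, feature dimensions, and the greedy decoding choice are assumptions for illustration, not the paper's exact architecture.

import torch
import torch.nn as nn


class ReactionGPT(nn.Module):
    """Causally masked Transformer over motion tokens, conditioned per frame."""

    def __init__(self, num_motion_tokens=512, dim=256, ego_dim=582):
        super().__init__()
        # ego_dim = image + depth + head-dynamics feature sizes (illustrative).
        self.token_emb = nn.Embedding(num_motion_tokens + 1, dim)   # +1 for a BOS token
        self.ego_proj = nn.Linear(ego_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_motion_tokens)

    def forward(self, motion_tokens, ego_feats):
        # motion_tokens: (B, T) BOS-prefixed tokens generated so far.
        # ego_feats:     (B, T, ego_dim) per-frame ego-perception features.
        x = self.token_emb(motion_tokens) + self.ego_proj(ego_feats)
        T = x.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        h = self.backbone(x, mask=causal)            # strictly causal attention
        return self.head(h)                          # next-token logits at every step


@torch.no_grad()
def stream_generate(model, ego_stream, bos_id=512):
    """Generate one motion token per incoming frame, never looking ahead."""
    tokens, feats = [bos_id], []
    for ego_feat in ego_stream:                      # ego_feat: (ego_dim,) tensor per frame
        feats.append(ego_feat)
        tok = torch.tensor(tokens).unsqueeze(0)      # (1, T)
        ego = torch.stack(feats).unsqueeze(0)        # (1, T, ego_dim)
        logits = model(tok, ego)[:, -1]              # prediction for the newest frame
        tokens.append(int(logits.argmax(-1)))        # greedy decoding for simplicity
    return tokens[1:]                                # feed these to the VQ-VAE decoder

At inference time, ego_stream would hold the per-frame features produced by the image, depth, and head-dynamics encoders, and the generated tokens would be decoded back into 3D motion by the Motion VQ-VAE decoder.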

Spatially Aligned Human Reaction Dataset (HRD)


The automated pipeline for generating the Spatially Aligned Human Reaction Dataset (HRD). Given a scene caption, we first employ LLMs to produce video and motion prompts, then generate egocentric videos and reaction motions through text-driven generation, followed by spatial alignment via camera trajectory control.
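Schematically, one pass of this pipeline can be sketched as follows. Every callable here is a hypothetical placeholder for a stage of the pipeline (an LLM prompt writer, text-to-video and text-to-motion generators, and a camera-trajectory alignment step), not a real EgoReAct API.

from dataclasses import dataclass


@dataclass
class HRDSample:
    egocentric_video: object     # generated first-person clip
    reaction_motion: object      # generated 3D reaction motion
    category: str                # human-human, animal-human, or scene-human


def build_hrd_sample(scene_caption, category, llm, gen_video, gen_motion, align_camera):
    """One pass of the automated HRD pipeline; all callables are placeholders."""
    # 1. The LLM expands a scene caption into paired video/motion prompts.
    video_prompt, motion_prompt = llm(scene_caption)
    # 2. Text-driven generation of the egocentric video and the reaction motion.
    video = gen_video(video_prompt)
    motion = gen_motion(motion_prompt)
    # 3. Spatial alignment: control the video's camera trajectory so it is
    #    spatially consistent with the generated reaction motion.
    video = align_camera(video, motion)
    return HRDSample(video, motion, category)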


Comparison between ViMo and our Spatially Aligned Human Reaction Dataset (HRD). The left side shows the ground-truth reaction motion. On the right, the top row presents the egocentric video from ViMo, while the bottom row shows the video from our HRD. Our dataset provides significantly more accurate spatial alignment between the egocentric video and the reaction motion.


Distribution of the HRD dataset. The dataset consists of three main categories: human-human (blue), animal-human (green), and scene-human (red) interactions.

Experimental Results


Quantitative comparison with state-of-the-art methods. Our method achieves the best performance across all metrics while maintaining strict causality during generation.


User study results. Percentage of participant selections (higher is better) for three criteria—Spatial Alignment, Reaction Plausibility, and Motion Quality—across competing methods. Our approach receives a clear majority of votes on all axes, indicating superior spatial grounding, more plausible reactions, and overall higher motion quality.


Real-world System Deployment. We demonstrate our method on real-world first-person videos: the left shows the first-person video (selected from YouTube), and the right shows the human reactions generated by EgoReAct in response to these dynamic visual inputs.


Citation

@article{zhang2025egoreact,
  title={EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation},
  author={Zhang, Libo and Li, Zekun and Li, Tianyu and Cao, Zeyu and Xu, Rui and Long, Xiao-Xiao and Wang, Wenjia and Wang, Jingbo and Liu, Yuan and Wang, Wenping and Zhou, Daquan and Komura, Taku and Dou, Zhiyang},
  journal={arXiv preprint arXiv:2512.22808},
  year={2025}
}
