Frank Zhiyang Dou

I am an incoming Ph.D. student at MIT CSAIL, supervised by Prof. Wojciech Matusik. I will be affiliated with the Computational Design and Fabrication Group and the Computer Graphics Group.
I will obtain my MPhil degree in the Computer Graphics Group at The University of Hong Kong, supervised by Prof. Taku Komura. I received my B.Eng. degree with honors from Shandong University, advised by Prof. Shiqing Xin. I was a visiting scholar at the University of Pennsylvania, working with Prof. Lingjie Liu at the Graphics Lab and GRASP Lab. I also collaborate closely with Prof. Cynthia Sung in the Department of Mechanical Engineering and Applied Mechanics at UPenn.

Research Interests (Visualization): Computer Graphics, Character Animation, Geometric Modeling and Processing, Simulation, Human Behavior Analysis (Capture, Modeling and Simulation).

News

Jun. 2025: Three papers and one tutorial (Human Motion Generation) were accepted to ICCV 2025. One short paper was accepted to the Building Physically Plausible World Models Workshop at ICML 2025.
Mar. 2025: Two papers were accepted to SIGGRAPH 2025. My research statement was accepted to the EG 2025 Doctoral Consortium.
Feb. 2025: Four papers were accepted to CVPR 2025, including one Oral and one Highlight. I was selected as a Meshy Fellowship Finalist. Thanks, Meshy!
Jan. 2025: Three papers were accepted to ICLR 2025.
Sep. 2024: One paper was accepted to NeurIPS 2024.
Aug. 2024: My research statement was accepted to the ECCV 2024 Doctoral Consortium.
Jul. 2024: One paper was accepted to SIGGRAPH Asia 2024. Five papers were accepted to ECCV 2024. Coverage Axis (EG22) was recognized as a Top Cited Article in CGF for 2022–2023.
May 2024: One paper was accepted to SGP 2024.
Mar. 2024: One paper was accepted to SIGGRAPH 2024.
Feb. 2024: One paper was accepted to CVPR 2024.
Jul. 2023: One paper was accepted to SIGGRAPH Asia 2023. One paper was accepted to ICCV 2023.
Mar. 2023: GCNO was accepted to SIGGRAPH 2023. We won the SIGGRAPH 2023 Best Paper Award.

Mar. 2023: One paper was accepted to PNAS Nexus 2023. Press release by EurekAlert!
Aug. 2022: One paper was accepted to SIGGRAPH Asia 2022.
Feb. 2022: One paper was accepted to EUROGRAPHICS 2022.

Selected Research Works

* Equal Contributions; # Corresponding Authors; cs: coming soon.

Dynamic Realms: 4D Content Analysis, Recovery and Generation with Geometric, Topological and Physical Priors
Zhiyang Dou.
Doctoral Consortium — ECCV 2024 and EUROGRAPHICS 2025.

paper
poster

🎧 MOSPA: Spatial Audio-Driven Human Motion Generation
Shuyang Xu*, Zhiyang Dou*#, Mingyi Shi, Leo Ho, Jingbo Wang, Yuan Liu, Yuexin Ma, Wenping Wang#, Taku Komura#.
2025.

project page (cs)
paper (cs)

abstract

Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities such as language, audio, and music to human motion generation. To date, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. Furthermore, we develop a diffusion-based generative framework named MOSPA to capture the relationship between body motion and spatial audio with an effective fusion mechanism. After training, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We conducted extensive experiments to validate our method, which achieves state-of-the-art performance on this task. Our model and dataset will be open-sourced upon acceptance. Refer to our supplementary video for more details.

ModSkill: Physical Character Skill Modularization
Yiming Huang, Zhiyang Dou, Lingjie Liu.
ICCV 2025.

project page
paper

abstract

Human motion is highly diverse and dynamic, posing challenges for imitation learning algorithms that aim to generalize motor skills for controlling simulated characters. Previous methods typically rely on a universal full-body controller for tracking reference motion (tracking-based model) or a unified full-body skill embedding space (skill embedding). However, these approaches often struggle to generalize and scale to larger motion datasets. In this work, we introduce a novel skill learning framework, ModSkill, that decouples complex full-body skills into compositional, modular skills for independent body parts. Our framework features a skill modularization attention layer that processes policy observations into modular skill embeddings that guide low-level controllers for each body part. We also propose an Active Skill Learning approach with Generative Adaptive Sampling, using large motion generation models to adaptively enhance policy learning in challenging tracking scenarios. Our results show that this modularized skill learning framework, enhanced by generative sampling, outperforms existing methods in precise full-body motion tracking and enables reusable skill embeddings for diverse goal-driven tasks.

SIMS: Simulating Human-Scene Interactions with Real World Script Planning
Wenjia Wang, Liang Pan, Zhiyang Dou, Zhouyingcheng Liao, Yuke Lou, Lei Yang, Jingbo Wang, Taku Komura.
ICCV 2025.

project page
paper

abstract

Simulating long-term human-scene interaction is a challenging yet fascinating task. Previous works have not effectively addressed the generation of long-term human-scene interactions with detailed narratives for physics-based animation. This paper introduces a novel framework for the planning and control of long-horizon, physically plausible human-scene interaction. On the one hand, films and shows with stylish human locomotion or interactions with scenes are abundantly available on the internet, providing a rich source of data for script planning. On the other hand, Large Language Models (LLMs) can understand and generate logical storylines. This motivates us to marry the two by using an LLM-based pipeline to extract scripts from videos, and then employing LLMs to imitate and create new scripts, capturing complex, time-series human behaviors and interactions with environments. Building on this, we utilize a dual-aware policy that achieves both language comprehension and scene understanding to guide character motions within contextual and spatial constraints. To facilitate training and evaluation, we contribute a comprehensive planning dataset containing diverse motion sequences extracted from real-world videos and expanded with large language models. We also collect and re-annotate motion clips from existing kinematic datasets to enable our policy to learn diverse skills. Extensive experiments demonstrate the effectiveness of our framework in versatile task execution and its generalization ability to various scenarios, showing remarkably enhanced performance compared with existing methods. Our code and data will be publicly available soon.

CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects
Huaijin Pi, Zhi Cen, Zhiyang Dou, Taku Komura.
arXiv 2025.

project page
paper

abstract

Synthesizing whole-body manipulation of articulated objects, including body motion, hand motion, and object motion, is a critical yet challenging task with broad applications in virtual humans and robotics. The core challenges are twofold. First, achieving realistic whole-body motion requires tight coordination between the hands and the rest of the body, as their movements are interdependent during manipulation. Second, articulated object manipulation typically involves high degrees of freedom and demands higher precision, often requiring the fingers to be placed at specific regions to actuate movable parts. To address these challenges, we propose a novel coordinated diffusion noise optimization framework. Specifically, we perform noise-space optimization over three specialized diffusion models for the body, left hand, and right hand, each trained on its own motion dataset to improve generalization. Coordination naturally emerges through gradient flow along the human kinematic chain, allowing the global body posture to adapt in response to hand motion objectives with high fidelity. To further enhance precision in hand-object interaction, we adopt a unified representation based on basis point sets (BPS), where end-effector positions are encoded as distances to the same BPS used for object geometry. This unified representation captures fine-grained spatial relationships between the hand and articulated object parts, and the resulting trajectories serve as targets to guide the optimization of diffusion noise, producing highly accurate interaction motion. We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility, and enables various capabilities such as object pose control, simultaneous walking and manipulation, and whole-body generation from hand-only data.
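The basis point set (BPS) encoding mentioned in the abstract can be illustrated with a minimal sketch: a fixed set of basis points is chosen once, and any point set (object geometry or an end-effector position) is encoded as the distance from each basis point to its nearest point in that set. The function name `bps_encode`, the random basis, and the toy coordinates below are assumptions made for illustration; this is not CoDA's implementation.

```python
import math
import random


def bps_encode(points, basis):
    """For each basis point, return the distance to its nearest point in `points`."""
    return [min(math.dist(b, p) for p in points) for b in basis]


# Hypothetical fixed basis, shared by the object and the end effector.
rng = random.Random(0)
basis = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(8)]

# Toy object geometry and a single fingertip position (illustrative values).
object_points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 0.5, 0.0)]
fingertip = [(0.5, 0.5, 0.1)]

object_code = bps_encode(object_points, basis)
hand_code = bps_encode(fingertip, basis)
```

Because both codes are measured against the same basis, they have the same length and can be compared entry by entry, which is what makes the representation "unified" in this sketch.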

TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization
Liang Pan, Zeshi Yang, Zhiyang Dou, Wenjia Wang, Buzhen Huang, Bo Dai, Taku Komura, Jingbo Wang.
CVPR 2025 (Oral). Top 3.3% of accepted papers.
Also a Spotlight at the 1st Workshop on Humanoid Agents at CVPR 2025.

project page
paper
code

abstract

Synthesizing diverse and physically plausible Human-Scene Interactions (HSI) is pivotal for both computer animation and embodied AI. Despite encouraging progress, current methods mainly focus on developing separate controllers, each specialized for a specific interaction task. This significantly hinders the ability to tackle a wide variety of challenging HSI tasks that require the integration of multiple skills, e.g., sitting down while carrying an object. To address this issue, we present TokenHSI, a single, unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism. Such a unified policy enables effective knowledge sharing across skills, thereby facilitating the multi-task training. Moreover, our policy architecture supports variable length inputs, enabling flexible adaptation of learned skills to new scenarios. By training additional task tokenizers, we can not only modify the geometries of interaction targets but also coordinate multiple skills to address complex tasks. The experiments demonstrate that our approach can significantly improve versatility, adaptability, and extensibility in various HSI tasks.

Align3R: Aligned Monocular Depth Estimation for Dynamic Videos
Jiahao Lu*, Tianyu Huang*, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, Yuan Liu.
CVPR 2025 (Highlight). Top 13.5% of accepted papers.

project page
paper
code

abstract

Recent developments in monocular depth estimation methods enable high-quality depth estimation of single-view images but fail to estimate consistent video depth across different frames. Recent works address this problem by applying a video diffusion model to generate video depth conditioned on the input video, which is training-expensive and can only produce scale-invariant depth values without camera poses. In this paper, we propose a novel video-depth estimation method called Align3R to estimate temporally consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps. First, we fine-tune the DUSt3R model with additional estimated monocular depth as inputs for the dynamic scenes. Then, we apply optimization to reconstruct both depth maps and camera poses. Extensive experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video with performance superior to that of baseline methods.

Vid2Sim: Generalizable, Video-based Reconstruction of Appearance, Geometry and Physics for Mesh-free Simulation
Chuhao Chen, Zhiyang Dou, Chen Wang, Yiming Huang, Anjun Chen, Qiao Feng, Jiatao Gu, Lingjie Liu.
CVPR 2025.

project page
paper
code

abstract

Faithfully reconstructing textured shapes and physical properties from videos presents an intriguing yet challenging problem. Significant efforts have been dedicated to advancing this system identification problem. Previous methods often rely on heavy optimization pipelines with a differentiable simulator and renderer to estimate physical parameters. However, these approaches frequently necessitate extensive hyperparameter tuning for each scene and involve a costly optimization process, which limits both their practicality and generalizability. In this work, we propose a novel framework, Vid2Sim, a generalizable video-based approach for recovering geometry and physical properties through a mesh-free reduced simulation based on Linear Blend Skinning (LBS), offering high computational efficiency and versatile representation capability. Specifically, Vid2Sim first reconstructs the observed configuration of the physical system from video using a feed-forward neural network trained to capture physical world knowledge. A lightweight optimization pipeline then refines the estimated appearance, geometry, and physical properties to closely align with video observations within just a few minutes. Additionally, after the reconstruction, Vid2Sim enables high-quality, mesh-free simulation with high efficiency. Extensive experiments demonstrate that our method achieves superior accuracy and efficiency in reconstructing geometry and physical properties from video data.
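For readers unfamiliar with Linear Blend Skinning, the standard textbook form is given below. The symbols (rest position x_i, m handles with rotations R_j and translations t_j, and skinning weights w_ij) are generic notation assumed here, and Vid2Sim's exact reduced formulation may differ.

```latex
\mathbf{x}_i' \;=\; \sum_{j=1}^{m} w_{ij}\,\bigl(\mathbf{R}_j\,\mathbf{x}_i + \mathbf{t}_j\bigr),
\qquad \sum_{j=1}^{m} w_{ij} = 1, \quad w_{ij} \ge 0 .
```

Each deformed point is a weighted blend of the positions it would take under every handle's transform, so the motion of many points is driven by a small number of handle parameters.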

DICE: End-to-end Deformation Capture of Hand-Face Interactions from a Single Image
Qingxuan Wu, Zhiyang Dou#, Sirui Xu, Soshi Shimada, Chen Wang, Zhengming Yu, Yuan Liu, Cheng Lin, Zeyu Cao, Taku Komura, Vladislav Golyanik, Christian Theobalt, Wenping Wang, Lingjie Liu#.
ICLR 2025.

project page
paper
code

abstract

Reconstructing 3D hand-face interactions with deformations from a single image is a challenging yet crucial task with broad applications in AR, VR, and gaming. The challenges stem from self-occlusions during single-view hand-face interactions, diverse spatial relationships between hands and face, complex deformations, and the ambiguity of the single-view setting. The first and only method for hand-face interaction recovery, Decaf, introduces a global fitting optimization guided by contact and deformation estimation networks trained on studio-collected data with 3D annotations. However, Decaf suffers from a time-consuming optimization process and limited generalization capability due to its reliance on 3D annotations of hand-face interaction data. To address these issues, we present DICE, the first end-to-end method for Deformation-aware hand-face Interaction reCovEry from a single image. DICE estimates the poses of hands and faces, contacts, and deformations simultaneously using a Transformer-based architecture. It features disentangling the regression of local deformation fields and global mesh vertex locations into two network branches, enhancing deformation and contact estimation for precise and robust hand-face mesh recovery. To improve generalizability, we propose a weakly-supervised training approach that augments the training set using in-the-wild images without 3D ground-truth annotations, employing the depths of 2D keypoints estimated by off-the-shelf models and adversarial priors of poses for supervision. Our experiments demonstrate that DICE achieves state-of-the-art performance on a standard benchmark and in-the-wild data in terms of accuracy and physical plausibility. Additionally, our method operates at an interactive rate (20 fps) on an Nvidia 4090 GPU, whereas Decaf requires more than 15 seconds for a single image. Our code will be publicly available upon publication.

CBIL: Collective Behavior Imitation Learning for Fish from Real Videos
Yifan Wu*, Zhiyang Dou*, Yuko Ishiwaka, Shun Ogawa, Yuke Lou, Wenping Wang, Lingjie Liu, Taku Komura.
ACM Transactions on Graphics. SIGGRAPH Asia 2024.

project page
paper

abstract

Reproducing realistic collective behaviors presents a captivating yet formidable challenge. Traditional rule-based methods rely on hand-crafted principles, limiting motion diversity and realism in generated collective behaviors. Recent imitation learning methods learn from data but often require ground truth motion trajectories and struggle with authenticity, especially in high-density groups with erratic movements. In this paper, we present a scalable approach, Collective Behavior Imitation Learning (CBIL), for learning fish schooling behavior directly from videos, without relying on captured motion trajectories. Our method first leverages Video Representation Learning, where a Masked Video AutoEncoder (MVAE) extracts implicit states from video inputs in a self-supervised manner. The MVAE effectively maps 2D observations to implicit states that are compact and expressive for the subsequent imitation learning stage. Then, we propose a novel adversarial imitation learning method to effectively capture complex movements of the schools of fish, allowing for efficient imitation of the distribution of motion patterns measured in the latent space. It also incorporates bio-inspired rewards alongside priors to regularize and stabilize training. Once trained, CBIL can be used for various animation tasks with the learned collective motion priors. We further show its effectiveness across different species. Finally, we demonstrate the application of our system in detecting abnormal fish behavior from in-the-wild videos.

ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking
Tingyang Zhang, Chen Wang, Zhiyang Dou, Qingzhe Gao, Jiahui Lei, Baoquan Chen, Lingjie Liu.
arXiv 2025.

project page
paper
code

abstract

In this paper, we propose ProTracker, a novel framework for robust and accurate long-term dense tracking of arbitrary points in videos. The key idea of our method is incorporating probabilistic integration to refine multiple predictions from both optical flow and semantic features for robust short-term and long-term tracking. Specifically, we integrate optical flow estimations in a probabilistic manner, producing smooth and accurate trajectories by maximizing the likelihood of each prediction. To effectively re-localize challenging points that disappear and reappear due to occlusion, we further incorporate long-term feature correspondence into our flow predictions for continuous trajectory generation. Extensive experiments show that ProTracker achieves state-of-the-art performance among unsupervised and self-supervised approaches, and even outperforms supervised methods on several benchmarks. Our code and model will be publicly available upon publication.

MotionWavelet: Human Motion Prediction via Wavelet Manifold Learning
Yuming Feng*, Zhiyang Dou*#, Ling-Hao Chen, Yuan Liu, Tianyu Li, Jingbo Wang, Zeyu Cao, Wenping Wang, Taku Komura, Lingjie Liu#.
arXiv 2024.

project page
paper

abstract

Modeling temporal characteristics and the non-stationary dynamics of body movement plays a significant role in predicting future human motions. However, it is challenging to capture these features due to the subtle transitions involved in complex human motions. This paper introduces MotionWavelet, a human motion prediction framework that utilizes Wavelet Transformation and studies human motion patterns in the spatial-frequency domain. In MotionWavelet, a Wavelet Diffusion Model (WDM) learns a Wavelet Manifold by applying Wavelet Transformation to the motion data, thereby encoding the intricate spatial and temporal motion patterns. Once the Wavelet Manifold is built, WDM trains a diffusion model to generate human motions from Wavelet latent vectors. In addition to the WDM, MotionWavelet also presents a Wavelet Space Shaping Guidance mechanism to refine the denoising process to improve conformity with the manifold structure. WDM also develops Temporal Attention-Based Guidance to enhance prediction accuracy. Extensive experiments validate the effectiveness of MotionWavelet, demonstrating improved prediction accuracy and enhanced generalization across various benchmarks. Our code and models will be released upon acceptance.
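As a small illustration of what a wavelet transformation does to a motion signal, the sketch below applies one level of the orthonormal Haar transform to a single joint-angle trajectory and then inverts it. The choice of the Haar wavelet, the function names, and the toy trajectory are assumptions for illustration only; the abstract does not specify which wavelet MotionWavelet uses.

```python
import math


def haar_dwt(signal):
    """One level of the orthonormal Haar transform (expects even-length input)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt, interleaving the reconstructed sample pairs."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out


# Toy joint-angle trajectory over 8 frames (illustrative values, in radians).
trajectory = [0.00, 0.05, 0.15, 0.30, 0.50, 0.55, 0.50, 0.40]

approx, detail = haar_dwt(trajectory)        # low-frequency trend, high-frequency change
reconstructed = haar_idwt(approx, detail)    # matches the input up to rounding
```

The approximation coefficients capture the slow trend of the motion, while the detail coefficients are largest where the trajectory changes quickly between adjacent frames.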

Surf-D: High-Quality Surface Generation for Arbitrary Topologies using Diffusion Models
Zhengming Yu*, Zhiyang Dou*, Xiaoxiao Long, Cheng Lin, Zekun Li, Yuan Liu, Norman Müller, Taku Komura, Marc Habermann, Christian Theobalt, Xin Li, Wenping Wang.
ECCV 2024.

project page
paper
code

abstract

In this paper, we present Surf-D, a novel method for generating high-quality 3D shapes as Surfaces with arbitrary topologies using Diffusion models. Specifically, we adopt Unsigned Distance Field (UDF) as the surface representation, as it excels in handling arbitrary topologies, enabling the generation of complex shapes. While prior methods explored shape generation with different representations, they suffer from limited topologies and geometric details. Moreover, it's non-trivial to directly extend prior diffusion models to UDF because they lack spatial continuity due to the discrete volume structure. However, UDF requires accurate gradients for mesh extraction and learning. To tackle these issues, we first leverage a point-based auto-encoder to learn a compact latent space, which supports gradient querying for any input point through differentiation to effectively capture intricate geometry at a high resolution. Since the learning difficulty for various shapes can differ, a curriculum learning strategy is employed to efficiently embed various surfaces, enhancing the whole embedding process. With the pretrained shape latent space, we employ a latent diffusion model to acquire the distribution of various shapes. Our approach demonstrates superior performance in shape generation across multiple modalities, and we conduct extensive experiments in unconditional generation, category conditional generation, 3D reconstruction from images, and text-to-shape tasks. Our code will be publicly available upon paper publication.

EMDM: Efficient Motion Diffusion Model for Fast, High-Quality Human Motion Generation
Wenyang Zhou, Zhiyang Dou†, Zeyu Cao, Zhouyingcheng Liao, Jingbo Wang, Wenjia Wang, Yuan Liu, Taku Komura, Wenping Wang, Lingjie Liu.
ECCV 2024.
† Project Lead.

project page
paper
video
code

abstract

We introduce Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation. Although previous motion diffusion models have shown impressive results, they struggle to achieve fast generation while maintaining high-quality human motions. Motion latent diffusion has been proposed for efficient motion generation. However, effectively learning a latent space can be non-trivial in such a two-stage manner. Meanwhile, accelerating motion sampling by naively increasing the step size, e.g., DDIM, typically leads to a decline in motion quality because the complex data distributions can no longer be approximated well. In this paper, we propose EMDM, which allows for far fewer sampling steps for fast motion generation by modeling the complex denoising distribution during multiple sampling steps. Specifically, we develop a Conditional Denoising Diffusion GAN to capture multimodal data distributions conditioned on both control signals, i.e., textual description, and the denoising time step. By modeling the complex data distribution, a larger sampling step size and fewer steps are achieved during motion synthesis, significantly accelerating the generation process. To effectively capture the human dynamics and reduce undesired artifacts, we employ a motion geometric loss during network training, which improves the motion quality and training efficiency. As a result, EMDM achieves a remarkable speed-up at the generation stage while maintaining high-quality motion generation in terms of fidelity and diversity.

TLControl: Trajectory and Language Control for Human Motion Synthesis
Weilin Wan, Zhiyang Dou, Taku Komura, Wenping Wang, Dinesh Jayaraman, Lingjie Liu.
ECCV 2024.

project page
paper
video
code

abstract

Controllable human motion synthesis is essential for applications in AR/VR, gaming, movies, and embodied AI. Existing methods often focus solely on either language or full trajectory control, lacking precision in synthesizing motions aligned with user-specified trajectories, especially for multi-joint control. To address these issues, we present TLControl, a new method for realistic human motion synthesis, incorporating both low-level trajectory and high-level language semantics controls. Specifically, we first train a VQ-VAE to learn a compact latent motion space organized by body parts. We then propose a Masked Trajectories Transformer to make coarse initial predictions of full trajectories of joints based on the learned latent motion space, with user-specified partial trajectories and text descriptions as conditioning. Finally, we introduce an efficient test-time optimization to refine these coarse predictions for accurate trajectory control. Experiments demonstrate that TLControl outperforms the state-of-the-art in trajectory accuracy and time efficiency, making it practical for interactive and high-quality animation generation.

Disentangled Clothed Avatar Generation from Text Descriptions
Jionghao Wang*, Yuan Liu*, Zhiyang Dou, Zhengming Yu, Yongqing Liang, Xin Li, Wenping Wang, Rong Xie, Li Song.
ECCV 2024.

project page
paper
code

abstract

In this paper, we introduce a novel text-to-avatar generation method that separately generates the human body and the clothes and allows high-quality animation on the generated avatar. While recent advancements in text-to-avatar generation have yielded diverse human avatars from text prompts, these methods typically combine all elements (clothes, hair, and body) into a single 3D representation. Such an entangled approach poses challenges for downstream tasks like editing or animation. To overcome these limitations, we propose a novel disentangled 3D avatar representation named Sequentially Offset-SMPL (SO-SMPL), building upon the SMPL model. SO-SMPL represents the human body and clothes with two separate meshes, but associates them with offsets to ensure the physical alignment between the body and the clothes. Then, we design a Score Distillation Sampling (SDS)-based distillation framework to generate the proposed SO-SMPL representation from text prompts. In comparison with existing text-to-avatar methods, our approach not only achieves higher texture and geometry quality and better semantic alignment with text prompts, but also significantly improves the visual quality of character animation, virtual try-on, and avatar editing.

Coverage Axis++: Efficient Skeletal Points Selection for 3D Shape Skeletonization
Zimeng Wang*, Zhiyang Dou*, Rui Xu, Cheng Lin, Yuan Liu, Xiaoxiao Long, Shiqing Xin, Taku Komura, Xiaoming Yuan, Wenping Wang.
ACM SIGGRAPH/Eurographics Symposium on Geometry Processing 2024.
A follow-up of Coverage Axis.

project page
paper
code

abstract

We introduce Coverage Axis++, a novel and efficient approach to 3D shape skeletonization. The current state-of-the-art approaches for this task often rely on the watertightness of the input or suffer from substantial computational costs, thereby limiting their practicality. To address this challenge, Coverage Axis++ proposes a heuristic algorithm to select skeletal points, offering a high-accuracy approximation of the Medial Axis Transform (MAT) while significantly mitigating computational intensity for various shape representations. We introduce a simple yet effective strategy that considers both shape coverage and uniformity to derive skeletal points. The selection procedure enforces consistency with the shape structure while favoring the dominant medial balls, which thus introduces a compact underlying shape representation in terms of MAT. As a result, Coverage Axis++ allows for skeletonization for various shape representations (e.g., water-tight meshes, triangle soups, point clouds), specification of the number of skeletal points, few hyperparameters, and highly efficient computation with improved reconstruction accuracy. Extensive experiments across a wide range of 3D shapes validate the efficiency and effectiveness of Coverage Axis++. The code will be publicly available once the paper is published.

Wonder3D: Single Image to 3D using Cross-Domain Diffusion
Xiaoxiao Long*, Yuanchen Guo*, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, Wenping Wang.
CVPR 2024 (Highlight). Top 11.9% of accepted papers.

project page
paper
code
Hugging Face Demo

abstract

In this work, we introduce Wonder3D, a novel method for efficiently generating high-fidelity textured meshes from single-view images. Recent methods based on Score Distillation Sampling (SDS) have shown the potential to recover 3D geometry from 2D diffusion priors, but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast, certain works directly produce 3D information via fast network inferences, but their results are often of low quality and lack geometric details. To holistically improve the quality, consistency, and efficiency of image-to-3D tasks, we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure consistency, we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. Lastly, we introduce a geometry-aware normal fusion algorithm that extracts high-quality surfaces from the multi-view 2D representations. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results, robust generalization, and reasonably good efficiency compared to prior works.

C·ASE: Learning Conditional Adversarial Skill Embeddings for Physics-based Characters
Zhiyang Dou, Xuelin Chen, Qingnan Fan, Taku Komura, Wenping Wang.
SIGGRAPH Asia 2023.

project page
paper
video
code

abstract

We present C·ASE, an efficient and effective framework that learns Conditional Adversarial Skill Embeddings for physics-based characters. C·ASE enables the physically simulated character to learn a diverse repertoire of skills while providing controllability in the form of direct manipulation of the skills to be performed. This is achieved by dividing the heterogeneous skill motions into distinct subsets containing homogeneous samples for training a low-level conditional model to learn the conditional behavior distribution. The skill-conditioned imitation learning naturally offers explicit control over the character’s skills after training. The training process incorporates focal skill sampling, skeletal residual forces, and element-wise feature masking to, respectively, balance diverse skills of varying complexities, mitigate dynamics mismatch to master agile motions, and capture more general behavior characteristics. Once trained, the conditional model can produce highly diverse and realistic skills, outperforming state-of-the-art models, and can be repurposed in various downstream tasks. In particular, the explicit skill control handle allows a high-level policy or a user to direct the character with desired skill specifications, which we demonstrate is advantageous for interactive character animation.

TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer
Zhiyang Dou*, Qingxuan Wu*, Cheng Lin, Zeyu Cao, Qiangqiang Wu, Weilin Wan, Taku Komura, Wenping Wang.
ICCV 2023.

project page
paper
code

abstract

In this paper, we introduce a set of simple yet effective TOken REduction (TORE) strategies for Transformer-based Human Mesh Recovery from monocular images. Current SOTA performance is achieved by Transformer-based structures. However, they suffer from high model complexity and computation cost caused by redundant tokens. We propose token reduction strategies based on two important aspects, i.e., the 3D geometry structure and 2D image feature, where we hierarchically recover the mesh geometry with priors from body structure and conduct token clustering to pass fewer but more discriminative image feature tokens to the Transformer. Our method massively reduces the number of tokens involved in high-complexity interactions in the Transformer. This leads to a significantly reduced computational cost while still achieving competitive or even higher accuracy in shape recovery. Extensive experiments across a wide range of benchmarks validate the superior effectiveness of the proposed method. We further demonstrate the generalizability of our method on hand mesh recovery. Our code will be publicly available once the paper is published.

Globally Consistent Normal Orientation for Point Clouds by Regularizing the Winding-Number Field
Rui Xu, Zhiyang Dou, Ningna Wang, Shiqing Xin, Shuangmin Chen, Mingyan Jiang, Xiaohu Guo, Wenping Wang, Changhe Tu.
ACM Transactions on Graphics. SIGGRAPH 2023.

SIGGRAPH 2023 Best Paper Award; See more here.

project page
paper
video
code

abstract

Estimating normals with globally consistent orientations for a raw point cloud has many downstream geometry processing applications. Despite tremendous efforts in the past decades, it remains challenging to deal with an unoriented point cloud with various imperfections, particularly in the presence of data sparsity coupled with nearby gaps or thin-walled structures. In this paper, we propose a smooth objective function to characterize the requirements of an acceptable winding-number field, which allows one to find the globally consistent normal orientations starting from a set of completely random normals. By taking the vertices of the Voronoi diagram of the point cloud as examination points, we consider the following three requirements: (1) the winding number is either 0 or 1, (2) the occurrences of 1 and the occurrences of 0 are balanced around the point cloud, and (3) the normals align with the outside Voronoi poles as much as possible. Extensive experimental results show that our method outperforms the existing approaches, especially in handling sparse and noisy point clouds, as well as shapes with complex geometry/topology.

RFEPS: Reconstructing Feature-line Equipped Polygonal Surface
Rui Xu, Zixiong Wang, Zhiyang Dou, Chen Zong, Shiqing Xin, Mingyan Jiang, Tao Ju, Changhe Tu.
ACM Transactions on Graphics. SIGGRAPH Asia 2022.

project page
paper
video
code

abstract

Feature lines are important geometric cues in characterizing the structure of a CAD model. Despite great progress in both explicit reconstruction and implicit reconstruction, it remains a challenging task to reconstruct a polygonal surface equipped with feature lines, especially when the input point cloud is noisy and lacks faithful normal vectors. In this paper, we develop a multistage algorithm, named RFEPS, to address this challenge. The key steps include (1) denoising the point cloud based on the assumption of local planarity, (2) identifying the feature-line zone by optimization of discrete optimal transport, (3) augmenting the point set so that sufficiently many additional points are generated on potential geometry edges, and (4) generating a polygonal surface that interpolates the augmented point set based on restricted power diagram. We demonstrate through extensive experiments that RFEPS, benefiting from the edge-point augmentation and the feature-preserving explicit reconstruction, outperforms state-of-the-art methods in terms of the reconstruction quality, especially in terms of the ability to reconstruct missing feature lines.

Coverage Axis: Inner Point Selection for 3D Shape Skeletonization
Zhiyang Dou, Cheng Lin, Rui Xu, Lei Yang, Shiqing Xin, Taku Komura, Wenping Wang.
Computer Graphics Forum. EUROGRAPHICS 2022.

Top Cited Article in CGF 2022-2023. [Link]
Fast-Forward Attendees Award at EG22, 2nd Place.

project page
paper
code
suppl.

abstract

In this paper, we present a simple yet effective formulation called Coverage Axis for 3D shape skeletonization. Inspired by the set cover problem, our key idea is to cover all the surface points using as few inside medial balls as possible. This formulation inherently induces a compact and expressive approximation of the Medial Axis Transform (MAT) of a given shape. Different from previous methods that rely on local approximation error, our method allows a global consideration of the overall shape structure, leading to an efficient high-level abstraction and superior robustness to noise. Another appealing aspect of our method is its capability to handle more generalized input such as point clouds and poor-quality meshes. Extensive comparisons and evaluations demonstrate the remarkable effectiveness of our method for generating compact and expressive skeletal representation to approximate the MAT.
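The set-cover idea above can be illustrated with a minimal sketch. It assumes the candidate inner balls are already given as (center, radius) pairs and uses a standard greedy set-cover heuristic, swapped in here for illustration only; the abstract does not state how the paper solves the covering problem. The function name `greedy_ball_cover` and the toy data are hypothetical.

```python
import math


def greedy_ball_cover(surface_points, candidate_balls):
    """Greedy set cover (illustrative heuristic, not the paper's solver).

    surface_points: list of (x, y, z) tuples sampled on the shape surface.
    candidate_balls: list of ((cx, cy, cz), radius) tuples for inner balls.
    Returns the indices of the selected balls, in selection order.
    """
    # Precompute which surface points each candidate ball covers.
    covers = [
        {i for i, p in enumerate(surface_points) if math.dist(p, center) <= radius}
        for center, radius in candidate_balls
    ]
    # Only points reachable by at least one ball can ever be covered.
    uncovered = set().union(*covers)
    selected = []
    while uncovered:
        # Pick the ball that covers the most still-uncovered points.
        best = max(range(len(covers)), key=lambda j: len(covers[j] & uncovered))
        selected.append(best)
        uncovered -= covers[best]
    return selected


if __name__ == "__main__":
    # Toy data: four collinear "surface" points and three small candidate balls.
    points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    balls = [((0.5, 0.0, 0.0), 0.6), ((2.5, 0.0, 0.0), 0.6), ((1.5, 0.0, 0.0), 0.6)]
    print(greedy_ball_cover(points, balls))
```

The greedy rule keeps the selection compact because each chosen ball must cover at least one point that no earlier ball covers, so redundant candidates are never picked.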

Popularization of High-Speed Railway Reduces the Infection Risk via Close Contact Route during Journey
Nan Zhang, Xiyue Liu, Shuyi Gao, Boni Su, Zhiyang Dou#.
Sustainable Cities and Society (SCS) 2023.

paper

abstract

The risk of COVID-19 infection has increased with the popularization of railway transportation, owing to prolonged travel durations and frequent close interactions. This study utilized depth detection devices to analyze the close contact behaviors of passengers in high-speed trains (HST), traditional trains (TT), the waiting area of the waiting room (WWR), and the ticket check area of the waiting room (CWR). A multi-route COVID-19 transmission model was developed to assess the risk of virus exposure in these scenarios under various non-pharmaceutical interventions. A total of 163,740 seconds of data was collected. The close contact ratios in HST, TT, WWR, and CWR were 5.8%, 64.0%, 7.7%, and 49.0%, respectively. The average interpersonal distance between passengers was 0.85 m, 0.92 m, 1.25 m, and 0.88 m, respectively. The probability of face-to-face contact was 9.5%, 70.0%, 64.2%, and 5.8% across each environment, respectively. When all passengers wore N95 respirators or surgical masks, the personal virus exposure via close contact could be reduced by 94.1% and 51.9%, respectively. The virus exposure in TT is dozens of times that in HST. In China, if all current railway traffic were replaced by HST, the total virus exposure of passengers could be reduced by roughly 50%.

Student close contact behavior and COVID-19 transmission in China’s classrooms
Yong Guo*, Zhiyang Dou*, Nan Zhang, Xiyue Liu, Boni Su, Yuguo Li, Yinping Zhang.
PNAS Nexus 2023.

This research has been featured in a press release by EurekAlert!

project page
paper
press release

abstract

Classrooms are high-risk indoor environments, so analysis of SARS-CoV-2 transmission in classrooms is important for determining optimal interventions. Due to the absence of human behavior data, it is challenging to accurately determine virus exposure in classrooms. A wearable device for close contact behavior detection was developed, and we recorded more than 250,000 data points of close contact behaviors of students from Grades 1 through 12. Combined with a survey on students’ behaviors, we analyzed virus transmission in classrooms. Close contact rates for students were 37%±11% during classes and 48%±13% during breaks. Students in lower grades had higher close contact rates and virus transmission potential. The long-range airborne transmission route is dominant, accounting for 90%±3.6% and 75%±7.7% with and without mask wearing, respectively. During breaks, the short-range airborne route became more important, contributing 48%±3.1% in Grades 1 to 9 (without wearing masks). Ventilation alone cannot always meet the demands of COVID-19 control; 30 m³/h/person is suggested as the threshold outdoor air ventilation rate in classrooms. This study provides scientific support for COVID-19 prevention and control in classrooms, and our proposed human behavior detection and analysis methods offer a powerful tool to understand virus transmission characteristics and can be employed in various indoor environments.

Close Contact Behaviors of University and School Students in 10 Typical Indoor Environments
Nan Zhang, Li Liu, Zhiyang Dou, Xiyue Liu, Xueze Yang, Doudou Miao, Yong Guo, Silan Gu, Yuguo Li, Hua Qian, Jianjian Wei.
Journal of Hazardous Materials (JHM) 2023.

paper

abstract

Close contact, including both short-range airborne and large droplet, is recognized as the main route of SARS-CoV-2 transmission in indoor environments. However, exposure risk via this route is difficult to quantify due to a lack of data showing close contact behaviors of people in typical indoor environments. A digital wearable device was developed to capture human close contact behaviors automatically based on semi-supervised learning. We collected a total of 337,056 seconds of indoor close contacts from 194 and a half hours of depth video recordings in 10 typical indoor environments. The relationship between SARS-CoV-2 exposure and close contact behaviors was evaluated based on dispersion characteristics of virus-laden droplets. People in restaurants had the highest close contact ratio (63.8%) and probability of face-to-face pattern (77.6%) during close contacts, while people in shopping centers had the highest speak fraction (46.6%). University students had higher exposure potential in dormitories than school students in homes, but less exposure potential in classrooms and graduate student offices than school students in classrooms. Aerosol exposure in volume for both short-range inhalation and direct deposition on facial mucosa was highest in restaurants. The classroom is the main indoor environment for SARS-CoV-2 transmission for school students. The obtained results based on real human close contact behaviors can be used for infection risk assessment and to deploy effective interventions against close contact transmission of COVID-19 and other respiratory infections.

Close Contact Behavior-based COVID-19 Transmission and Interventions in a Subway System
Xiyue Liu*, Zhiyang Dou*, Lei Wang, Boni Su, Tianyi Jin, Yong Guo, Jianjian Wei, Nan Zhang.
Journal of Hazardous Materials (JHM) 2022.

project page
paper

abstract

During the COVID-19 pandemic, analysis of virus exposure and intervention efficiency in public transport based on real passengers’ close contact behaviors is critical to curb infectious disease transmission. A monitoring device based on semi-supervised learning was developed to gather a total of 145,821 close contact data points in subways. A virus transmission model considering both short- and long-range inhalation and deposition was established to calculate the virus exposure. During rush hour, short-range inhalation exposure is 3.2 times higher than deposition exposure and 7.5 times higher than long-range inhalation exposure of all passengers in the subway. The close contact rate was 56.1% and the average interpersonal distance was 0.8 m. Face-to-back was the main pattern during close contact. Compared with a random distribution, if all passengers stand facing the same direction, personal virus exposure through inhalation (deposition) can be reduced by 74.1% (98.5%). If the talk rate was decreased from 20% to 5%, the inhalation (deposition) exposure can be reduced by 69.3% (73.8%). In addition, we found that virus exposure could be reduced by 82.0% if all passengers wear surgical masks. This study provides scientific support for COVID-19 prevention and control in subways based on real human close contact behaviors.

Top-Down Shape Abstraction Based on Greedy Pole Selection
Zhiyang Dou, Shiqing Xin, Rui Xu, Jian Xu, Yuanfeng Zhou, Shuangmin Chen, Wenping Wang, Xiuyang Zhao, Changhe Tu.
IEEE Transactions on Visualization and Computer Graphics. TVCG 2020.

paper

abstract

Motivated by the fact that the medial axis transform is able to encode nearly the complete shape, we propose to use as few medial balls as possible to approximate the original volume enclosed by the boundary surface. We progressively select new medial balls, in a top-down style, to enlarge the region spanned by the existing medial balls. The key spirit of the selection strategy is to encourage large medial balls while imposing given geometric constraints. We further propose a speedup technique based on a provable observation that the intersection of medial balls implies the adjacency of power cells (in the sense of the power crust). We further elaborate the selection rules in combination with two closely related applications. One application is to develop an easy-to-use ball-stick modeling system that helps non-professional users to quickly build a shape with only balls and wires, but any penetration between two medial balls must be suppressed. The other application is to generate porous structures with convex, compact (with a high isoperimetric quotient) and shape-aware pores where two adjacent spherical pores may have penetration as long as the mechanical rigidity can be well preserved.

Services

Reviewer: SIGGRAPH; SIGGRAPH ASIA; ACM TOG; EUROGRAPHICS; TVCG; ICCV; CVPR; ECCV; ICLR; NeurIPS; PG; Pattern Recognition; Neural Networks; GM; CAD (CADJ); GMP; 3DV; AAAI; TMM; ACM Multimedia; CVM; CVPRW; ECCVW; NeurIPSW; TIP; TCSVT; CGI; SIGGRAPH Poster; Graphics Replicability Stamp; COMPUT J; ICONIP; FSDM; MLIS; Sustainable Cities and Society (SCS); Scientific (BrainSTEM@HKU).

ICCV25W: AI for Visual Arts Workshop and Challenges (AI4VA).
CVPR25W: 4D Vision Modeling the Dynamic World; Human Motion Generation.
CVPR24W: Generative Models for Computer Vision; Human Motion Generation.
ECCV24W: Wild3D, AI4VA, OOD-CV.

Tutorial Organizer: Workshop on Human Motion Generation, ICCV 2025.

Program Committee & Evaluation Committee: CGI 2025; Graphics Replicability Stamp.

Teaching Assistant:

2023:COMP3271 Computer Graphics. Worked with Prof. Taku Komura.
2022:COMP3362 Hands-on AI: Experimentation and Applications. Worked with Dr. Yi-King Choi.
2021:COMP3362 Hands-on AI: Experimentation and Applications. Worked with Dr. Yi-King Choi.
2020:COMP2120 Computer Organization. Worked with Prof. Kwok-Ping Chan.

Talks (Past and Upcoming):

Jul. 2025:On Efficient, Controllable, and Physically Plausible Motion Synthesis, University of Chinese Academy of Sciences.
Jul. 2025:On Efficient, Controllable, and Physically Plausible Motion Synthesis, Zhejiang University.
Jun. 2025:From Static 3D Geometry to Dynamic 4D Contents: Analysis, Recovery, and Generation, VALSE.
May. 2025:Toward Fully Automated 4D Content Creation: Challenges and Future Directions, Stealth Startup.
May. 2025:Principles and Practices for Efficient, Controllable, and Physically Plausible Motion Synthesis, MiHoYo.
May. 2025:On Efficient, Controllable, and Physically Plausible Motion Synthesis, The Hong Kong Polytechnic University.
May. 2025:On Efficient, Controllable, and Physically Plausible Motion Synthesis, Stealth Startup.
May. 2025:From Static 3D Geometry to Dynamic 4D Contents: Analysis, Recovery, and Generation, EUROGRAPHICS DC.
Apr. 2025:On Efficient, Controllable, and Physically Plausible Motion Synthesis, Shandong University.
Apr. 2025:From Static 3D Geometry to Dynamic 4D Contents: Analysis, Recovery, and Generation, BAAI.
Mar. 2025:Human-Centric Spatial AI for Close Contact Behavior Analysis, Beijing University of Technology.
Feb. 2025: World Models for Physical Agent Control, Honda Research (GRASP Lab Visit).
Feb. 2025:On Efficient, Controllable, and Physically Plausible Motion Synthesis, Technion.
Dec. 2024:On Efficient, Controllable, and Physically Plausible Motion Synthesis, Nvidia.
Dec. 2024:Towards a Universal Motion Foundation Model, Stealth Startup.
Oct. 2024:On Efficient, Controllable, and Physically Plausible Motion Synthesis, Meta.
Oct. 2024:Research Sharing, Shandong University.
Oct. 2024:Addressing the Challenge of Data Scarcity in Motion Synthesis, Shanghai AI Lab.
Oct. 2024:On Efficient, Controllable, and Physically Plausible Motion Synthesis, ShanghaiTech University.
Oct. 2024:On Efficient, Controllable, and Physically Plausible Motion Synthesis, ChinaGraph.
Aug. 2024:On Efficient, Controllable, and Physically Plausible Motion Synthesis, MiHoYo.
Apr. 2024:On the Readily Deployable System for Detecting Close Contact Behaviors, Boeing.
Dec. 2023:Shape Analysis, Recovery and Generation with Geometric and Topological Priors, Stealth Startup.
Nov. 2023:Geometric Computing - Medial Axis Transform and Normal Orientation for Point Clouds, ShanghaiTech University.
Oct. 2023:Scalable Skill Embeddings for Physics-based Characters, Tencent Games.
Jun. 2023:Robust and Efficient Vision Systems for Close Contact Behavior Analysis, Beijing University of Technology.
Feb. 2023:Scalable Skill Embeddings for Physics-based Characters, Shandong University.
Oct. 2022:On Efficient Hand-to-Surface Contact Estimation, Boeing.

Awards, Scholarships and Honors

Mar. 2025:Durlach Graduate Fellowship.
Feb. 2025:Meshy Fellowship Finalist. [Link]
Jul. 2024:Top Cited Article in CGF 2022-2023. [Link]
Jul. 2024:HKU Foundation First Year Excellent Ph.D. Award 2023/24. [Link]
Oct. 2023:The Best Paper Award, SIGGRAPH 2023. [Link]
Oct. 2020:Postgraduate Scholarship.
Oct. 2019:National Scholarship.
Dec. 2019:Presidential Scholarship.
Oct. 2018:National Scholarship.

Competitions

2019:National First Prize, National Mathematical Modeling Contest.
2019:Meritorious Winner, International Mathematical Modeling Contest: The Mathematical Contest in Modeling (MCM).
2018:National First Prize, The Best Paper Award (8/38573), National Mathematical Modeling Contest.
2018:Meritorious Winner, International Mathematical Modeling Contest: The Interdisciplinary Contest in Modeling (ICM).
2018:National Grand Prize (1st), Most Commercially Valuable Award, Most Popular Award; The 11th National University Student Software Innovation Contest.

Miscs.

I used to be:

a soccer player and a middle-distance runner (substitute for my city in youth sports events; once completed a 1000-meter run in 3 minutes).
an electric guitar player (rhythm, solo sometimes). Mainly played Cantonese music from my favorite band: Beyond.

I love Ghost in the Shell[1][2][3]. I doubt existence, somewhat and somehow.

"It's better to burn out than to fade away." — Hey Hey, My My (Into the Black) / My My, Hey Hey (Out of the Blue).

Visualization of Research Interests

Research Interest Overview.

Real2SimGen: Incorporate real-world principles, including physics, topology, and geometry, into the generation and simulation processes.

SimGen2Real: Leverage generative models and simulation techniques to address real-world challenges in domains like fabrication and robotics.

"... And now I see with eye serene
The very pulse of the machine;
A Being breathing thoughtful breath,
A Traveller between life and death;"
...
"The sacred geometry of chance;
The hidden law of a probable outcome;
The numbers lead a dance;..."
...
"The marble index of a mind forever
Voyaging through strange seas of Thought, alone."
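
"縱一葦之所如，凌萬頃之茫然。" (Letting our reed of a boat drift wherever it would, we crossed the boundless expanse of water.)

"今はまだ僕たちは旅の途中だと。" (That we are, for now, still in the middle of our journey.)
"一度は夢を見せてくれた君じゃないか。" (Aren't you the one who once showed me a dream?)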

Contact info

Email

frankzydou@csail.mit.edu
frankdou@mit.edu
zhiyang0@connect.hku.hk
zydou@seas.upenn.edu
frankzydou@gmail.com

For review invitations:
– CG, CV, ML: please use this email.
– Human Behavior Analysis and Public Health: please use this email.

Address

• SIGLAB Moore 103, Moore School Building, 200 South 33rd Street, Philadelphia, PA 19104, U.S.A.
• Rm 416, CYC Bldg, The University of Hong Kong, Pokfulam Road, Hong Kong SAR.
