RigidFormer: Learning Rigid Dynamics using Transformers

A mesh-free, object-centric Transformer for multi-object rigid-body contact dynamics from point clouds.

Zhiyang Dou¹, Minghao Guo¹, Haixu Wu¹, Doug Roble², Tuur Stuyck², Wojciech Matusik¹

¹ MIT   ² Meta

Abstract

Learning-based simulation of multi-object rigid-body dynamics remains difficult because contact is discontinuous and errors compound over long horizons. Most existing methods remain tied to mesh connectivity and vertex-level message passing, which limits their applicability to mesh-free inputs such as point clouds and leads to high computational cost. Efficiently modeling high-fidelity rigid-body dynamics from mesh-free representations therefore remains challenging. We introduce RigidFormer, an object-centric Transformer-based model that learns mesh-free rigid-body dynamics with controllable integration step sizes. RigidFormer reasons at the object level and advances each object through compact anchors; Anchor-Vertex Pooling enriches these anchors with local vertex features, retaining contact-relevant geometry without dense vertex-level interaction. We propose Anchor-based RoPE to inject anchor geometry into attention while respecting the unordered nature of objects and anchors: object-token processing is permutation-equivariant, and the mean-pooled anchor descriptor is invariant to anchor reindexing while preserving shape extent. RigidFormer further enforces rigidity by projecting updates onto the rigid-body manifold using differentiable Kabsch alignment. On standard benchmarks, RigidFormer outperforms or matches mesh-based baselines using point inputs, runs faster, generalizes to unseen point resolutions and across datasets, and scales to 200+ objects; we also show a preliminary extension to command-conditioned articulated bodies by treating body parts as interacting object-level components.
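The rigidity projection mentioned in the abstract can be sketched with a generic SVD-based Kabsch solve; the function name and interface below are our own illustration, not the paper's code. Given an object's reference points and a network's (possibly non-rigid) per-point prediction, it finds the best-fit rotation and translation and returns the rigidly moved points:

```python
import numpy as np

def kabsch_project(src, tgt):
    """Project a free-form per-point update onto the rigid-body manifold.

    src, tgt: (N, 3) arrays. Finds the rotation R and translation t
    minimizing ||R @ src_i + t - tgt_i|| and returns the rigidly
    transformed points along with (R, t). Every step is differentiable,
    so the same recipe works inside an autodiff framework.
    """
    c_src = src.mean(axis=0)
    c_tgt = tgt.mean(axis=0)
    H = (src - c_src).T @ (tgt - c_tgt)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_tgt - R @ c_src
    return src @ R.T + t, R, t
```

With noise-free rigid motion the solve recovers the exact transform; with a non-rigid network output it snaps the prediction back to the nearest rigid motion in the least-squares sense.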

Qualitative Results on MOVi Datasets (test/val set)

Meshes are shown only for visualization; our model operates on point inputs.

MOVi-Sphere

Spherical-object scenes for cross-geometry qualitative comparison.

MOVi-Sphere Sample 1 (test set)
MOVi-Sphere Sample 2 (test set)
MOVi-Sphere Sample 3 (test set)
MOVi-Sphere Sample 4 (test set)

MOVi-A

Held-out test clips selected for qualitative inspection.

MOVi-A Sample 1 (test set)
MOVi-A Sample 2 (test set)
MOVi-A Sample 3 (test set)
MOVi-A Sample 4 (test set)

MOVi-B

More cluttered scenes with diverse object interactions.

MOVi-B Sample 1 (test set)
MOVi-B Sample 2 (test set)
MOVi-B Sample 3 (test set)
MOVi-B Sample 4 (test set)
MOVi-B Sample 5 (test set)
MOVi-B Sample 6 (test set)
MOVi-B Sample 7 (test set)
MOVi-B Sample 8 (test set)
MOVi-B Sample 9 (test set)
MOVi-B Sample 10 (test set)

Partial Point Cloud Observation

Rollouts from partial point-cloud inputs under occluded observations.

Partial Point Cloud Sample 1 (partial observation)
Partial Point Cloud Sample 2 (partial observation)
Partial Point Cloud Sample 3 (partial observation)
Partial Point Cloud Sample 4 (partial observation)
Partial Point Cloud Sample 5 (partial observation)
Partial Point Cloud Sample 6 (partial observation)
Partial Point Cloud Sample 7 (partial observation)
Partial Point Cloud Sample 8 (partial observation)
Partial Point Cloud Sample 9 (partial observation)
Partial Point Cloud Sample 10 (partial observation)
Partial Point Cloud Sample 11 (partial observation)
Partial Point Cloud Sample 12 (partial observation)

Large Scale Simulation

Dense rigid-body rollouts with increasing scene complexity.

3 × 3 × 3 Scene (large scale)
5 × 5 × 5 Scene (large scale)
6 × 6 × 6 Scene (large scale)

Controllable Articulated Body Simulation

Controllable articulated-body rollouts under learned rigid dynamics.

ASE Humanoid Sample 1 (controlled rollout)
ASE Humanoid Sample 2 (controlled rollout)
ASE Humanoid Sample 3 (controlled rollout)
Unitree G1 Sample (controlled rollout)

Quantitative Results

Comparison against state-of-the-art neural rigid-body simulators.

Table 1. Method comparison. Mesh-Free: no triangle connectivity required. Var. Δt: handles multiple step sizes. Preproc.-Free: no offline geometry computation. Warmup Frames: number of input frames required for rollout.

Method | Mesh-Free | Var. Δt | Preproc.-Free | #Warmup Frames ↓ | Runtime (FPS) ↑
MGN | ✗ | ✗ | ✓ | 2 | 5.7
FIGNet | ✗ | ✗ | ✓ | 3 | 3.0
SDF-Sim | ✗ | ✗ | ✗† | 3 | N/A
HopNet | ✗ | ✗ | ✗‡ | 3 | 0.2
RigidFormer (Ours) | ✓ | ✓ | ✓ | 2 | 23.9

† SDF pre-learning: ~5 hours for MOVi-B.
‡ Simplicial-complex construction: ~15 days on MOVi-B.

Why these properties matter in practice.

Mesh-Free.
  • Real-world inputs are typically captured as point clouds (LiDAR, depth cameras, 3D scanners), so a mesh-free formulation removes the brittle "point cloud → mesh reconstruction" preprocessing stage required by mesh-based simulators.
  • Operating directly on points is robust to partial occlusion, sparse sampling, and irregular density; mesh-based methods are sensitive to broken or inconsistent topology under such conditions.
  • A point-only interface aligns naturally with on-line robotic perception and sim-to-real pipelines, allowing the dynamics loop to consume sensor streams without intermediate geometric reconstruction.
  • Point inputs further admit a rich family of training-time augmentations — downsampling, jitter, occlusion masking, and density variation — that can be tailored by the user to improve robustness against imperfect geometric observations at deployment.
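The augmentation family named in the last bullet can be sketched as a single helper; the function name, defaults, and occlusion heuristic below are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def augment_point_cloud(pts, rng, drop_frac=0.3, jitter_std=0.005,
                        occlude_frac=0.2):
    """Illustrative training-time point-cloud augmentations.

    - random downsampling: keep a random subset of points
    - jitter: Gaussian noise simulating sensor error
    - occlusion masking: drop the farthest points along a random viewing
      direction, mimicking a partial depth-camera observation
    """
    # Random downsampling: each point survives with probability 1 - drop_frac.
    keep = rng.random(len(pts)) > drop_frac
    pts = pts[keep]
    # Gaussian jitter on surviving points.
    pts = pts + rng.normal(scale=jitter_std, size=pts.shape)
    # Occlusion: remove the occlude_frac fraction of points that project
    # farthest along a random unit direction.
    d = rng.normal(size=3)
    d /= np.linalg.norm(d)
    proj = pts @ d
    pts = pts[proj <= np.quantile(proj, 1.0 - occlude_frac)]
    return pts
```

Because the dynamics model consumes raw points, any such corruption can be applied on the fly without re-meshing or other geometric preprocessing.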
Var. Δt (variable integration step).
  • The step size becomes a deployment-time knob: large Δt for fast interactive rollout, small Δt for high-fidelity simulation, all within a single trained model.
  • Larger Δt reduces the number of autoregressive steps required to cover a fixed horizon, lowering both cumulative error and inference cost in long-horizon rollouts.
  • For world-model use cases, planning rarely requires fine-grained dynamics; coarser steps yield more future information per unit of compute, which directly accelerates downstream planning and decision-making.
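The compute argument in these bullets reduces to simple arithmetic, sketched below; `rollout_cost` and its interface are our own toy helper, not part of RigidFormer:

```python
import math

def rollout_cost(horizon_frames, step_size, warmup_frames):
    """Model calls needed to cover a fixed horizon, plus cold-start cost.

    A model that advances `step_size` frames per call needs fewer
    autoregressive steps over the same horizon, so long-horizon error
    accumulates over fewer predictions and inference cost drops.
    """
    steps = math.ceil(horizon_frames / step_size)
    first_prediction_after = warmup_frames  # frames consumed before output
    return steps, first_prediction_after

# Covering a 100-frame horizon with 2 warmup frames:
# step size 1  -> 100 model calls; step size 10 -> 10 model calls.
```

This is why a ten-fold coarser step (Table 3) roughly divides both rollout length and per-horizon compute by ten while using one and the same trained model.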
Preproc.-Free (no offline geometry computation).
  • Methods such as SDF-Sim require several hours of per-dataset SDF pre-learning, while HopNet's simplicial-complex construction takes on the order of fifteen days; these costs must be paid again for every new scene or dataset, hindering practical scalability.
  • Eliminating this stage enables immediate deployment to previously unseen geometries and removes a major bottleneck in scaling training data.
#Warmup Frames ↓ (fewer is better).
  • Warmup frames are the historical observations the model must consume before producing its first prediction.
  • A smaller warmup horizon translates directly into lower cold-start latency: prediction can begin after only one or two observations.
  • For real-time control loops such as robot MPC and interactive simulation, this is essential — waiting three to five frames before issuing the first action is rarely acceptable.
  • RigidFormer requires only two warmup frames, compared to three for FIGNet, SDF-Sim, and HopNet, enabling faster startup in streaming scenarios.
Runtime (FPS) ↑ (higher is better).
  • Higher throughput is universally desirable: it expands the set of feasible deployment scenarios — interactive simulators, real-time game physics, high-throughput RL rollouts, and the inner loops of robot MPC — and allows more episodes, longer horizons, or more parallel scenes on identical hardware.

Table 2. Standard Performance on MOVi-A, MOVi-B, and MOVi-Sphere. Each cell reports position RMSE (m) / orientation RMSE (°) at prediction horizons of 50, 75, and 100 frames. Per-column top-two ranks are highlighted: 1st, 2nd.

Model | MOVi-A 50 | MOVi-A 75 | MOVi-A 100 | MOVi-B 50 | MOVi-B 75 | MOVi-B 100 | MOVi-Sphere 50 | MOVi-Sphere 75 | MOVi-Sphere 100
MGN+ | 0.705 / 31.21 | N/A | N/A | 0.538 / 26.91 | N/A | N/A | N/A | N/A | N/A
MGN-LargeRadius+ | 0.119 / 15.07 | N/A | N/A | 0.460 / 26.34 | N/A | N/A | N/A | N/A | N/A
FIGNet | 0.115 / 14.84 | N/A | N/A | 0.127 / 13.99 | N/A | N/A | N/A | N/A | N/A
FIGNet (reimpl.) | 0.132 / 7.10 | 0.285 / 14.62 | 0.492 / 23.30 | 0.141 / 7.39 | 0.300 / 15.16 | 0.516 / 24.96 | N/A | N/A | N/A
HCMT (reimpl.)* | 0.239 / 5.70 | 0.538 / 11.82 | 0.951 / 18.40 | 0.237 / 4.72 | 0.527 / 9.80 | 0.932 / 17.43 | 0.243 / 4.19 | 0.541 / 8.21 | 0.956 / 13.81
VPD (reimpl.) | 0.235 / 5.10 | 0.489 / 11.66 | 0.827 / 20.37 | 0.275 / 4.65 | 0.581 / 9.70 | 0.987 / 16.99 | 0.244 / 4.47 | 0.510 / 9.50 | 0.855 / 17.65
HopNet | 0.054 / 5.64 | 0.115 / 11.84 | 0.196 / 18.83 | 0.047 / 4.91 | 0.101 / 10.35 | 0.176 / 17.91 | 0.034 / 4.05 | 0.073 / 8.21 | 0.124 / 13.68
RigidFormer (Ours)*,† | 0.049 / 5.06 | 0.103 / 10.90 | 0.177 / 18.32 | 0.050 / 3.97 | 0.095 / 8.51 | 0.161 / 15.33 | 0.026 / 3.00 | 0.057 / 6.48 | 0.099 / 11.19

* Transformer-based model.   † Point inputs.

Table 3. Step-size-conditioned rollout performance. A single RigidFormer model is conditioned on the integration step size at inference time. Each cell reports position RMSE (m) / orientation RMSE (°).

Step Size | MOVi-A 50 | MOVi-A 75 | MOVi-A 100 | MOVi-B 50 | MOVi-B 75 | MOVi-B 100 | MOVi-Sphere 50 | MOVi-Sphere 75 | MOVi-Sphere 100
10 | 0.018 / 1.47 | 0.068 / 6.98 | 0.118 / 11.93 | 0.029 / 1.51 | 0.069 / 5.89 | 0.115 / 10.85 | 0.014 / 1.09 | 0.045 / 4.30 | 0.076 / 7.47
5 | 0.035 / 3.33 | 0.083 / 8.55 | 0.148 / 15.08 | 0.040 / 3.06 | 0.078 / 7.25 | 0.136 / 13.55 | 0.021 / 2.11 | 0.047 / 5.23 | 0.086 / 9.58
1 | 0.049 / 5.06 | 0.103 / 10.90 | 0.177 / 18.32 | 0.050 / 3.97 | 0.095 / 8.51 | 0.161 / 15.33 | 0.026 / 3.00 | 0.057 / 6.48 | 0.099 / 11.19

Table 4. Runtime comparison. Measured on MOVi-B with a 50-step autoregressive rollout averaged over 10 iterations on an NVIDIA GeForce RTX 5080. RigidFormer achieves 8× and 101× speedups over FIGNet and HopNet, respectively.

Method | ms / step ↓ | FPS ↑
HopNet | 4228.7 | 0.2
FIGNet | 336.0 | 3.0
RigidFormer (Ours) | 41.9 | 23.9