RigidFormer: Learning Rigid Dynamics using Transformers

A mesh-free, object-centric Transformer for multi-object rigid-body contact dynamics from point clouds.

Zhiyang Dou¹, Minghao Guo¹, Haixu Wu¹, Doug Roble², Tuur Stuyck², Wojciech Matusik¹

¹ MIT   ² Meta

Abstract

Learning-based simulation of multi-object rigid-body dynamics remains difficult because contact is discontinuous and errors compound over long horizons. Most existing methods remain tied to mesh connectivity and vertex-level message passing, which limits their applicability to mesh-free inputs such as point clouds and leads to high computational cost. Efficiently modeling high-fidelity rigid-body dynamics from mesh-free representations therefore remains challenging. We introduce RigidFormer, an object-centric Transformer-based model that learns mesh-free rigid-body dynamics with controllable integration step sizes. RigidFormer reasons at the object level and advances each object through compact anchors; Anchor-Vertex Pooling enriches these anchors with local vertex features, retaining contact-relevant geometry without dense vertex-level interaction. We propose Anchor-based RoPE to inject anchor geometry into attention while respecting the unordered nature of objects and anchors: object-token processing is permutation-equivariant, and the mean-pooled anchor descriptor is invariant to anchor reindexing while preserving shape extent. RigidFormer further enforces rigidity by projecting updates onto the rigid-body manifold using differentiable Kabsch alignment. On standard benchmarks, RigidFormer outperforms or matches mesh-based baselines using point inputs, runs faster, generalizes to unseen point resolutions and across datasets, and scales to 200+ objects; we also show a preliminary extension to command-conditioned articulated bodies by treating body parts as interacting object-level components.
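The rigidity projection mentioned in the abstract can be sketched with a generic SVD-based Kabsch solve; the function name and interface below are our own illustration, not the paper's code. Given an object's reference points and a network's (possibly non-rigid) per-point prediction, it finds the best-fit rotation and translation and returns the rigidly moved points:

```python
import numpy as np

def kabsch_project(src, tgt):
    """Project a free-form per-point update onto the rigid-body manifold.

    src, tgt: (N, 3) arrays. Finds the rotation R and translation t
    minimizing ||R @ src_i + t - tgt_i|| and returns the rigidly
    transformed points along with (R, t). Every step is differentiable,
    so the same recipe works inside an autodiff framework.
    """
    c_src = src.mean(axis=0)
    c_tgt = tgt.mean(axis=0)
    H = (src - c_src).T @ (tgt - c_tgt)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_tgt - R @ c_src
    return src @ R.T + t, R, t
```

With noise-free rigid motion the solve recovers the exact transform; with a non-rigid network output it snaps the prediction back to the nearest rigid motion in the least-squares sense.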

Qualitative Results on MOVi Datasets (test/val set)

Meshes are shown only for visualization; our model operates on point inputs.

MOVi-Sphere

Spherical-object scenes for cross-geometry qualitative comparison.

MOVi-Sphere Sample 1 (test set)
MOVi-Sphere Sample 2 (test set)
MOVi-Sphere Sample 3 (test set)
MOVi-Sphere Sample 4 (test set)

MOVi-A

Held-out test clips selected for qualitative inspection.

MOVi-A Sample 1 (test set)
MOVi-A Sample 2 (test set)
MOVi-A Sample 3 (test set)
MOVi-A Sample 4 (test set)

MOVi-B

More cluttered scenes with diverse object interactions.

MOVi-B Sample 1 (test set)
MOVi-B Sample 2 (test set)
MOVi-B Sample 3 (test set)
MOVi-B Sample 4 (test set)
MOVi-B Sample 5 (test set)
MOVi-B Sample 6 (test set)
MOVi-B Sample 7 (test set)
MOVi-B Sample 8 (test set)
MOVi-B Sample 9 (test set)
MOVi-B Sample 10 (test set)

Partial Point Cloud Observation

Rollouts from partial point-cloud inputs under occluded observations.

Partial Point Cloud Sample 1 (partial observation)
Partial Point Cloud Sample 2 (partial observation)
Partial Point Cloud Sample 3 (partial observation)
Partial Point Cloud Sample 4 (partial observation)
Partial Point Cloud Sample 5 (partial observation)
Partial Point Cloud Sample 6 (partial observation)
Partial Point Cloud Sample 7 (partial observation)
Partial Point Cloud Sample 8 (partial observation)
Partial Point Cloud Sample 9 (partial observation)
Partial Point Cloud Sample 10 (partial observation)
Partial Point Cloud Sample 11 (partial observation)
Partial Point Cloud Sample 12 (partial observation)

Large Scale Simulation

Dense rigid-body rollouts with increasing scene complexity.

3 × 3 × 3 Scene (large scale)
5 × 5 × 5 Scene (large scale)
6 × 6 × 6 Scene (large scale)

Controllable Articulated Body Simulation

Controllable articulated-body rollouts under learned rigid dynamics.

ASE Humanoid Sample 1 (controlled rollout)
ASE Humanoid Sample 2 (controlled rollout)
ASE Humanoid Sample 3 (controlled rollout)
Unitree G1 Sample (controlled rollout)

Quantitative Results

Comparison against state-of-the-art neural rigid-body simulators.

Table 1. Method comparison. Mesh-Free: no triangle connectivity required. Var. Δt: handles multiple step sizes. Preproc.-Free: no offline geometry computation. Warmup Frames: number of input frames required for rollout.

Method | Mesh-Free | Var. Δt | Preproc.-Free | #Warmup Frames ↓ | Runtime (FPS) ↑
MGN | ✗ | ✗ | ✓ | 2 | 5.7
FIGNet | ✗ | ✗ | ✓ | 3 | 3.0
SDF-Sim | ✗ | ✗ | ✗† | 3 | N/A
HopNet | ✗ | ✗ | ✗‡ | 3 | 0.2
RigidFormer (Ours) | ✓ | ✓ | ✓ | 2 | 23.9

† SDF pre-learning: ~5 hours for MOVi-B.
‡ Simplicial-complex construction: ~15 days on MOVi-B.

Why these properties matter in practice.

Mesh-Free.
  • Real-world inputs are typically captured as point clouds (LiDAR, depth cameras, 3D scanners), so a mesh-free formulation removes the brittle "point cloud → mesh reconstruction" preprocessing stage required by mesh-based simulators.
  • Operating directly on points is robust to partial occlusion, sparse sampling, and irregular density; mesh-based methods are sensitive to broken or inconsistent topology under such conditions.
  • A point-only interface aligns naturally with on-line robotic perception and sim-to-real pipelines, allowing the dynamics loop to consume sensor streams without intermediate geometric reconstruction.
  • Point inputs further admit a rich family of training-time augmentations — downsampling, jitter, occlusion masking, and density variation — that can be tailored by the user to improve robustness against imperfect geometric observations at deployment.
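The augmentation family named in the last bullet can be sketched as a single helper; the function name, defaults, and occlusion heuristic below are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def augment_point_cloud(pts, rng, drop_frac=0.3, jitter_std=0.005,
                        occlude_frac=0.2):
    """Illustrative training-time point-cloud augmentations.

    - random downsampling: keep a random subset of points
    - jitter: Gaussian noise simulating sensor error
    - occlusion masking: drop the farthest points along a random viewing
      direction, mimicking a partial depth-camera observation
    """
    # Random downsampling: each point survives with probability 1 - drop_frac.
    keep = rng.random(len(pts)) > drop_frac
    pts = pts[keep]
    # Gaussian jitter on surviving points.
    pts = pts + rng.normal(scale=jitter_std, size=pts.shape)
    # Occlusion: remove the occlude_frac fraction of points that project
    # farthest along a random unit direction.
    d = rng.normal(size=3)
    d /= np.linalg.norm(d)
    proj = pts @ d
    pts = pts[proj <= np.quantile(proj, 1.0 - occlude_frac)]
    return pts
```

Because the dynamics model consumes raw points, any such corruption can be applied on the fly without re-meshing or other geometric preprocessing.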
Var. Δt (variable integration step).
  • The step size becomes a deployment-time knob: large Δt for fast interactive rollout, small Δt for high-fidelity simulation, all within a single trained model.
  • Larger Δt reduces the number of autoregressive steps required to cover a fixed horizon, lowering both cumulative error and inference cost in long-horizon rollouts.
  • For world-model use cases, planning rarely requires fine-grained dynamics; coarser steps yield more future information per unit of compute, which directly accelerates downstream planning and decision-making.
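The compute argument in these bullets reduces to simple arithmetic, sketched below; `rollout_cost` and its interface are our own toy helper, not part of RigidFormer:

```python
import math

def rollout_cost(horizon_frames, step_size, warmup_frames):
    """Model calls needed to cover a fixed horizon, plus cold-start cost.

    A model that advances `step_size` frames per call needs fewer
    autoregressive steps over the same horizon, so long-horizon error
    accumulates over fewer predictions and inference cost drops.
    """
    steps = math.ceil(horizon_frames / step_size)
    first_prediction_after = warmup_frames  # frames consumed before output
    return steps, first_prediction_after

# Covering a 100-frame horizon with 2 warmup frames:
# step size 1  -> 100 model calls; step size 10 -> 10 model calls.
```

This is why a ten-fold coarser step (Table 3) roughly divides both rollout length and per-horizon compute by ten while using one and the same trained model.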
Preproc.-Free (no offline geometry computation).
  • Methods such as SDF-Sim require several hours of per-dataset SDF pre-learning, while HopNet's simplicial-complex construction takes on the order of fifteen days; these costs must be paid again for every new scene or dataset, hindering practical scalability.
  • Eliminating this stage enables immediate deployment to previously unseen geometries and removes a major bottleneck in scaling training data.
#Warmup Frames ↓ (fewer is better).
  • Warmup frames are the historical observations the model must consume before producing its first prediction.
  • A smaller warmup horizon translates directly into lower cold-start latency: prediction can begin after only one or two observations.
  • For real-time control loops such as robot MPC and interactive simulation, this is essential — waiting three to five frames before issuing the first action is rarely acceptable.
  • RigidFormer requires only two warmup frames, compared to three for FIGNet, SDF-Sim, and HopNet, enabling faster startup in streaming scenarios.
Runtime (FPS) ↑ (higher is better).
  • Higher throughput is universally desirable: it expands the set of feasible deployment scenarios — interactive simulators, real-time game physics, high-throughput RL rollouts, and the inner loops of robot MPC — and allows more episodes, longer horizons, or more parallel scenes on identical hardware.

Table 2. Standard Performance on MOVi-A, MOVi-B, and MOVi-Sphere. Each cell reports position RMSE (m) / orientation RMSE (°) at prediction horizons of 50, 75, and 100 frames. Per-column top-two ranks are highlighted: 1st, 2nd.

Model | MOVi-A 50 | MOVi-A 75 | MOVi-A 100 | MOVi-B 50 | MOVi-B 75 | MOVi-B 100 | MOVi-Sphere 50 | MOVi-Sphere 75 | MOVi-Sphere 100
MGN+ | 0.705 / 31.21 | N/A | N/A | 0.538 / 26.91 | N/A | N/A | N/A | N/A | N/A
MGN-LargeRadius+ | 0.119 / 15.07 | N/A | N/A | 0.460 / 26.34 | N/A | N/A | N/A | N/A | N/A
FIGNet | 0.115 / 14.84 | N/A | N/A | 0.127 / 13.99 | N/A | N/A | N/A | N/A | N/A
FIGNet (reimpl.) | 0.132 / 7.10 | 0.285 / 14.62 | 0.492 / 23.30 | 0.141 / 7.39 | 0.300 / 15.16 | 0.516 / 24.96 | N/A | N/A | N/A
HCMT (reimpl.)* | 0.239 / 5.70 | 0.538 / 11.82 | 0.951 / 18.40 | 0.237 / 4.72 | 0.527 / 9.80 | 0.932 / 17.43 | 0.243 / 4.19 | 0.541 / 8.21 | 0.956 / 13.81
VPD (reimpl.) | 0.235 / 5.10 | 0.489 / 11.66 | 0.827 / 20.37 | 0.275 / 4.65 | 0.581 / 9.70 | 0.987 / 16.99 | 0.244 / 4.47 | 0.510 / 9.50 | 0.855 / 17.65
HopNet | 0.054 / 5.64 | 0.115 / 11.84 | 0.196 / 18.83 | 0.047 / 4.91 | 0.101 / 10.35 | 0.176 / 17.91 | 0.034 / 4.05 | 0.073 / 8.21 | 0.124 / 13.68
RigidFormer (Ours)*,† | 0.049 / 5.06 | 0.103 / 10.90 | 0.177 / 18.32 | 0.050 / 3.97 | 0.095 / 8.51 | 0.161 / 15.33 | 0.026 / 3.00 | 0.057 / 6.48 | 0.099 / 11.19

* Transformer-based model.   † Point inputs.

Table 3. Step-size-conditioned rollout performance. A single RigidFormer model is conditioned on the integration step size at inference time. Each cell reports position RMSE (m) / orientation RMSE (°).

Step Size | MOVi-A 50 | MOVi-A 75 | MOVi-A 100 | MOVi-B 50 | MOVi-B 75 | MOVi-B 100 | MOVi-Sphere 50 | MOVi-Sphere 75 | MOVi-Sphere 100
10 | 0.018 / 1.47 | 0.068 / 6.98 | 0.118 / 11.93 | 0.029 / 1.51 | 0.069 / 5.89 | 0.115 / 10.85 | 0.014 / 1.09 | 0.045 / 4.30 | 0.076 / 7.47
5 | 0.035 / 3.33 | 0.083 / 8.55 | 0.148 / 15.08 | 0.040 / 3.06 | 0.078 / 7.25 | 0.136 / 13.55 | 0.021 / 2.11 | 0.047 / 5.23 | 0.086 / 9.58
1 | 0.049 / 5.06 | 0.103 / 10.90 | 0.177 / 18.32 | 0.050 / 3.97 | 0.095 / 8.51 | 0.161 / 15.33 | 0.026 / 3.00 | 0.057 / 6.48 | 0.099 / 11.19

Table 4. Runtime comparison. Measured on MOVi-B with a 50-step autoregressive rollout averaged over 10 iterations on an NVIDIA GeForce RTX 5080. RigidFormer achieves 8× and 101× speedups over FIGNet and HopNet, respectively.

Method | ms / step ↓ | FPS ↑
HopNet | 4228.7 | 0.2
FIGNet | 336.0 | 3.0
RigidFormer (Ours) | 41.9 | 23.9