RigidFormer: Learning Rigid Dynamics using Transformers
A mesh-free, object-centric Transformer for multi-object rigid-body contact dynamics from point clouds.
¹MIT ²Meta
Abstract
Learning-based simulation of multi-object rigid-body dynamics remains difficult because contact is discontinuous and errors compound over long horizons. Most existing methods remain tied to mesh connectivity and vertex-level message passing, which limits their applicability to mesh-free inputs such as point clouds and leads to high computational cost. Efficiently modeling high-fidelity rigid-body dynamics from mesh-free representations therefore remains challenging. We introduce RigidFormer, an object-centric Transformer-based model that learns mesh-free rigid-body dynamics with controllable integration step sizes. RigidFormer reasons at the object level and advances each object through compact anchors; Anchor-Vertex Pooling enriches these anchors with local vertex features, retaining contact-relevant geometry without dense vertex-level interaction. We propose Anchor-based RoPE to inject anchor geometry into attention while respecting the unordered nature of objects and anchors: object-token processing is permutation-equivariant, and the mean-pooled anchor descriptor is invariant to anchor reindexing while preserving shape extent. RigidFormer further enforces rigidity by projecting updates onto the rigid-body manifold using differentiable Kabsch alignment. On standard benchmarks, RigidFormer outperforms or matches mesh-based baselines using point inputs, runs faster, generalizes to unseen point resolutions and across datasets, and scales to 200+ objects; we also show a preliminary extension to command-conditioned articulated bodies by treating body parts as interacting object-level components.
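The rigidity projection mentioned in the abstract can be made concrete. Below is a minimal NumPy sketch of the standard SVD-based Kabsch alignment that projects an unconstrained per-point prediction onto the closest rigid transform; it illustrates the technique, not the paper's implementation, and the function and argument names are our own.

```python
import numpy as np

def kabsch_project(src, pred):
    """Project a predicted per-point update onto the rigid-body manifold.

    src:  (N, 3) reference points (e.g. an object's anchors at time t)
    pred: (N, 3) unconstrained predicted positions for the same points
    Returns (R, t) such that src @ R.T + t is the closest rigid fit to pred.
    """
    src_c = src - src.mean(axis=0)
    pred_c = pred - pred.mean(axis=0)
    # Cross-covariance between the centered point sets.
    H = src_c.T @ pred_c
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1 so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    t = pred.mean(axis=0) - R @ src.mean(axis=0)
    return R, t
```

Because every step (SVD included) is differentiable almost everywhere, gradients can flow through the projection during training, which is what makes this a usable output layer rather than a post-hoc correction.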
Qualitative Results on MOVi Datasets (test/val set)
Meshes are shown only for visualization; our model operates on point inputs.
MOVi-Sphere
Spherical-object scenes for cross-geometry qualitative comparison.
MOVi-A
Held-out test clips selected for qualitative inspection.
MOVi-B
More cluttered scenes with diverse object interactions.
Partial Point Cloud Observation
Rollouts from partial point-cloud inputs under occluded observations.
Large Scale Simulation
Dense rigid-body rollouts with increasing scene complexity.
Controllable Articulated Body Simulation
Controllable articulated-body rollouts under learned rigid dynamics.
Quantitative Results
Comparison against state-of-the-art neural rigid-body simulators.
Table 1. Method comparison. Mesh-Free: no triangle connectivity required. Var. Δt: handles multiple step sizes. Preproc.-Free: no offline geometry computation. Warmup Frames: number of input frames required for rollout.
| Method | Mesh-Free | Var. Δt | Preproc.-Free | #Warmup Frames ↓ | Runtime (FPS) ↑ |
|---|---|---|---|---|---|
| MGN | ✗ | ✗ | ✓ | 2 | 5.7 |
| FIGNet | ✗ | ✗ | ✓ | 3 | 3.0 |
| SDF-Sim† | ✗ | ✗ | ✗ | 3 | — |
| HopNet‡ | ✗ | ✗ | ✗ | 3 | 0.2 |
| RigidFormer (Ours) | ✓ | ✓ | ✓ | 2 | 23.9 |
† SDF pre-learning: ~5 hours for MOVi-B.
‡ Simplicial-complex construction: ~15 days on MOVi-B.
Why these properties matter in practice.
- Mesh-Free.
- Real-world inputs are typically captured as point clouds (LiDAR, depth cameras, 3D scanners), so a mesh-free formulation removes the brittle "point cloud → mesh reconstruction" preprocessing stage required by mesh-based simulators.
- Operating directly on points is robust to partial occlusion, sparse sampling, and irregular density; mesh-based methods are sensitive to broken or inconsistent topology under such conditions.
- A point-only interface aligns naturally with on-line robotic perception and sim-to-real pipelines, allowing the dynamics loop to consume sensor streams without intermediate geometric reconstruction.
- Point inputs further admit a rich family of training-time augmentations — downsampling, jitter, occlusion masking, and density variation — that can be tailored by the user to improve robustness against imperfect geometric observations at deployment.
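As a concrete illustration of such augmentations, here is a minimal NumPy sketch combining random downsampling, Gaussian jitter, and half-space occlusion masking. The function name and default parameters are our own, not from the paper:

```python
import numpy as np

def augment_point_cloud(points, rng, keep_ratio=0.8, jitter_std=0.005,
                        occlusion_frac=0.2):
    """Illustrative train-time augmentations for point inputs."""
    # 1) Random downsampling: keep a random subset of points.
    n_keep = max(1, int(len(points) * keep_ratio))
    idx = rng.choice(len(points), size=n_keep, replace=False)
    pts = points[idx]
    # 2) Jitter: small Gaussian noise simulating sensor error.
    pts = pts + rng.normal(scale=jitter_std, size=pts.shape)
    # 3) Occlusion masking: drop the farthest fraction of points along a
    #    random viewing direction, mimicking self-occlusion.
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    depth = pts @ direction
    cutoff = np.quantile(depth, 1.0 - occlusion_frac)
    return pts[depth <= cutoff]
```

A mesh-based simulator cannot use augmentations of this kind without re-meshing, since dropping vertices breaks triangle connectivity; on raw points they are a few lines each.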
- Var. Δt (variable integration step).
- The step size becomes a deployment-time knob: large Δt for fast interactive rollout, small Δt for high-fidelity simulation, all within a single trained model.
- Larger Δt reduces the number of autoregressive steps required to cover a fixed horizon, lowering both cumulative error and inference cost in long-horizon rollouts.
- For world-model use cases, planning rarely requires fine-grained dynamics; coarser steps yield more future information per unit of compute, which directly accelerates downstream planning and decision-making.
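The compute trade-off above can be made concrete. Assuming one step-size-conditioned call advances the state by Δt frames at a constant per-call cost (we use the MOVi-B per-step time from Table 4 as a stand-in; the helper is our own, not from the paper):

```python
def rollout_cost(horizon_frames, step_size, ms_per_call=41.9):
    """Model calls and wall-clock time to cover a fixed rollout horizon.

    Assumes one step-size-conditioned call advances the state by
    `step_size` frames at a constant per-call cost (Table 4's MOVi-B
    per-step time is used as a stand-in).
    """
    n_calls = -(-horizon_frames // step_size)  # ceiling division
    return n_calls, n_calls * ms_per_call

# Covering a 100-frame horizon at the step sizes of Table 3:
for dt in (1, 5, 10):
    calls, ms = rollout_cost(100, dt)
    print(f"step size {dt:2d}: {calls:3d} calls, {ms:7.1f} ms")
```

Going from Δt = 1 to Δt = 10 cuts both the number of autoregressive calls and the opportunities for error to compound by an order of magnitude, which is consistent with the lower RMSE at larger step sizes in Table 3.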
- Preproc.-Free (no offline geometry computation).
- Methods such as SDF-Sim require several hours of per-dataset SDF pre-learning, while HopNet's simplicial-complex construction takes on the order of fifteen days; these costs must be paid again for every new scene or dataset, hindering practical scalability.
- Eliminating this stage enables immediate deployment to previously unseen geometries and removes a major bottleneck in scaling training data.
- #Warmup Frames ↓ (fewer is better).
- Warmup frames are the historical observations the model must consume before producing its first prediction.
- A smaller warmup horizon translates directly into lower cold-start latency: prediction can begin after only one or two observations.
- For real-time control loops such as robot MPC and interactive simulation, this matters: waiting three or more frames of observation before issuing the first action is rarely acceptable.
- RigidFormer requires only two warmup frames, compared to three for FIGNet, SDF-Sim, and HopNet, enabling faster startup in streaming scenarios.
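One way to read the warmup requirement (our interpretation, not stated in the table): two frames are the minimum needed to estimate velocity by finite differences, while a third additionally yields acceleration. A small sketch, with names of our own choosing:

```python
def warmup_state(positions, dt=1.0):
    """Estimate a minimal dynamic state from warmup observations.

    Two frames give position + velocity (first difference); a third
    frame would additionally give acceleration (second difference).
    `positions` is a list of [x, y, z] observations, oldest first.
    """
    if len(positions) < 2:
        raise ValueError("need at least two warmup frames")
    p_prev, p_curr = positions[-2], positions[-1]
    velocity = [(c - p) / dt for p, c in zip(p_prev, p_curr)]
    state = {"position": p_curr, "velocity": velocity}
    if len(positions) >= 3:
        p_pp = positions[-3]
        state["acceleration"] = [(c - 2 * m + o) / dt ** 2
                                 for o, m, c in zip(p_pp, p_prev, p_curr)]
    return state
```

Under this reading, a two-frame model commits to first-order state only, trading a richer history for a one-frame-faster cold start.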
- Runtime (FPS) ↑ (higher is better).
- Higher throughput is universally desirable: it expands the set of feasible deployment scenarios — interactive simulators, real-time game physics, high-throughput RL rollouts, and the inner loops of robot MPC — and allows more episodes, longer horizons, or more parallel scenes on identical hardware.
Table 2. Standard performance on MOVi-A, MOVi-B, and MOVi-Sphere. Each cell reports position RMSE (m) / orientation RMSE (°) at prediction horizons of 50, 75, and 100 frames.
| Model | MOVi-A 50 | MOVi-A 75 | MOVi-A 100 | MOVi-B 50 | MOVi-B 75 | MOVi-B 100 | MOVi-Sphere 50 | MOVi-Sphere 75 | MOVi-Sphere 100 |
|---|---|---|---|---|---|---|---|---|---|
| MGN+ | 0.705 / 31.21 | N/A | N/A | 0.538 / 26.91 | N/A | N/A | N/A | N/A | N/A |
| MGN-LargeRadius+ | 0.119 / 15.07 | N/A | N/A | 0.460 / 26.34 | N/A | N/A | N/A | N/A | N/A |
| FIGNet | 0.115 / 14.84 | N/A | N/A | 0.127 / 13.99 | N/A | N/A | N/A | N/A | N/A |
| FIGNet (reimpl.) | 0.132 / 7.10 | 0.285 / 14.62 | 0.492 / 23.30 | 0.141 / 7.39 | 0.300 / 15.16 | 0.516 / 24.96 | N/A | N/A | N/A |
| HCMT (reimpl.)* | 0.239 / 5.70 | 0.538 / 11.82 | 0.951 / 18.40 | 0.237 / 4.72 | 0.527 / 9.80 | 0.932 / 17.43 | 0.243 / 4.19 | 0.541 / 8.21 | 0.956 / 13.81 |
| VPD (reimpl.)† | 0.235 / 5.10 | 0.489 / 11.66 | 0.827 / 20.37 | 0.275 / 4.65 | 0.581 / 9.70 | 0.987 / 16.99 | 0.244 / 4.47 | 0.510 / 9.50 | 0.855 / 17.65 |
| HopNet | 0.054 / 5.64 | 0.115 / 11.84 | 0.196 / 18.83 | 0.047 / 4.91 | 0.101 / 10.35 | 0.176 / 17.91 | 0.034 / 4.05 | 0.073 / 8.21 | 0.124 / 13.68 |
| RigidFormer†,* (Ours) | 0.049 / 5.06 | 0.103 / 10.90 | 0.177 / 18.32 | 0.050 / 3.97 | 0.095 / 8.51 | 0.161 / 15.33 | 0.026 / 3.00 | 0.057 / 6.48 | 0.099 / 11.19 |
* Transformer-based model. † Point inputs.
Table 3. Step-size-conditioned rollout performance. A single RigidFormer model is conditioned on the integration step size at inference time. Each cell reports position RMSE (m) / orientation RMSE (°).
| Step Size | MOVi-A 50 | MOVi-A 75 | MOVi-A 100 | MOVi-B 50 | MOVi-B 75 | MOVi-B 100 | MOVi-Sphere 50 | MOVi-Sphere 75 | MOVi-Sphere 100 |
|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.018 / 1.47 | 0.068 / 6.98 | 0.118 / 11.93 | 0.029 / 1.51 | 0.069 / 5.89 | 0.115 / 10.85 | 0.014 / 1.09 | 0.045 / 4.30 | 0.076 / 7.47 |
| 5 | 0.035 / 3.33 | 0.083 / 8.55 | 0.148 / 15.08 | 0.040 / 3.06 | 0.078 / 7.25 | 0.136 / 13.55 | 0.021 / 2.11 | 0.047 / 5.23 | 0.086 / 9.58 |
| 1 | 0.049 / 5.06 | 0.103 / 10.90 | 0.177 / 18.32 | 0.050 / 3.97 | 0.095 / 8.51 | 0.161 / 15.33 | 0.026 / 3.00 | 0.057 / 6.48 | 0.099 / 11.19 |
Table 4. Runtime comparison. Measured on MOVi-B with a 50-step autoregressive rollout averaged over 10 iterations on an NVIDIA GeForce RTX 5080. RigidFormer achieves 8× and 101× speedups over FIGNet and HopNet, respectively.
| Method | ms / step ↓ | FPS ↑ |
|---|---|---|
| HopNet | 4228.7 | 0.2 |
| FIGNet | 336.0 | 3.0 |
| RigidFormer (Ours) | 41.9 | 23.9 |
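The speedup factors quoted in the caption follow directly from the per-step times in the table; a quick arithmetic check:

```python
def speedup(baseline_ms, ours_ms):
    """Per-step speedup factor of one simulator over another."""
    return baseline_ms / ours_ms

# Figures from Table 4 (MOVi-B, 50-step rollout, RTX 5080):
assert round(speedup(336.0, 41.9)) == 8      # FIGNet  -> ~8x
assert round(speedup(4228.7, 41.9)) == 101   # HopNet  -> ~101x
assert round(1000.0 / 41.9, 1) == 23.9       # ms/step -> FPS
```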