CVPR 2026 · Self-supervised camera pose estimation

LA-Pose

Latent Action Pretraining Meets Pose Estimation

A feed-forward pose estimator that converts self-supervised latent actions, learned from large-scale driving video, into accurate and generalizable camera motion estimates.

1Wayve 2Simon Fraser University * Equal contribution

Project Teaser

Learning camera motion from unlabelled driving video

LA-Pose learns a latent action space without pose labels, then reuses those motion-centric features for efficient camera pose estimation.

TL;DR — Self-supervised learning of a latent action space from unlabelled video, repurposed for accurate and generalizable camera pose estimation.

Abstract

A scalable route to accurate camera pose

This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models, similar to Genie, to learn latent action representations from large-scale driving videos. Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as "action" conditioning for world models or as proxies for robot "action" parameters in policy networks. Our method, dubbed LA-Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high-quality 3D annotations. This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks show that LA-Pose matches and sometimes surpasses state-of-the-art methods while using orders of magnitude less labeled data. Concretely, on the Waymo and PandaSet benchmarks, LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods. To our knowledge, this work is the first to demonstrate the power of inverse-dynamics self-supervised learning for pose estimation.

Method

Latent action pretraining, then pose post-training

The same motion representation learned from consecutive frames becomes the signal for metric camera pose prediction.

LA-Pose method overview
Our framework consists of two stages: latent action pretraining and camera pose post-training. In the pretraining stage (top), an inverse–forward dynamics model learns latent actions from consecutive video frames: the inverse model encodes each frame pair into a compact latent action, and the forward model predicts future tokens conditioned on it, trained with a self-supervised objective. These latent actions form compact, motion-centric representations of frame-to-frame dynamics. In the post-training stage (bottom left), we attach a lightweight pose estimation head to the pretrained inverse-dynamics encoder. The head predicts relative camera translation, rotation (as a quaternion), field of view, and metric scale from the latent actions.
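The two stages above can be sketched in a few lines. This is a minimal toy illustration, not the paper's actual architecture: the learned networks are replaced by random linear maps, and all dimensions (feature size, latent-action size, and the 9-D pose output of translation + quaternion + FoV + scale) are assumptions for the sake of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, ACT = 64, 8  # assumed frame-feature and latent-action dimensions

# Toy linear stand-ins for the learned networks (assumptions, not the real model).
W_inv = rng.normal(size=(2 * FEAT, ACT)) * 0.1     # inverse-dynamics encoder
W_fwd = rng.normal(size=(FEAT + ACT, FEAT)) * 0.1  # forward-dynamics model
W_pose = rng.normal(size=(ACT, 9)) * 0.1           # pose head: 3 + 4 + 1 + 1 outputs

def inverse_dynamics(f_t, f_t1):
    """Encode a consecutive frame pair into a compact latent action."""
    return np.concatenate([f_t, f_t1]) @ W_inv

def forward_dynamics(f_t, z):
    """Predict next-frame features from current features plus the latent action."""
    return np.concatenate([f_t, z]) @ W_fwd

def pose_head(z):
    """Decode relative pose parameters from a latent action."""
    out = z @ W_pose
    t = out[:3]                               # relative translation
    q = out[3:7] / np.linalg.norm(out[3:7])   # rotation as a unit quaternion
    fov, scale = out[7], out[8]               # field of view and metric scale
    return t, q, fov, scale

# Pretraining consumes frame pairs; post-training reuses the same latent action.
f_t, f_t1 = rng.normal(size=FEAT), rng.normal(size=FEAT)
z = inverse_dynamics(f_t, f_t1)
f_pred = forward_dynamics(f_t, z)
t, q, fov, scale = pose_head(z)
```

In the real pipeline, `W_inv` and `W_fwd` are trained jointly on unlabelled video (the forward model's reconstruction loss forces `z` to carry motion information), and only the lightweight pose head is then finetuned on labeled 3D annotations.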

Interactive Latent Space

Motion structure emerges without pose labels

Toggle the coloring, click points, or select a cluster to inspect the driving sequences behind the representation.

Each point below represents a latent action — a learned representation of the camera motion between two consecutive driving frames, projected to 2D via t-SNE. Points that are close together encode similar motions. Toggle the coloring to reveal how translation speed and yaw (turning direction) naturally separate in the learned space. Click any point to view its driving sequence, or drag-select a region to browse all frame pairs within a cluster.
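The projection idea behind the figure can be sketched as follows. This is a minimal stand-in, not the page's actual pipeline: the latent actions are synthetic (two assumed motion modes, "forward" and "turning"), and a numpy PCA projection substitutes for t-SNE, which requires an external library. The point is the same: latent actions that encode similar motions land near each other in 2D.

```python
import numpy as np

rng = np.random.default_rng(1)
ACT = 8  # assumed latent-action dimension

# Synthetic latent actions for two motion modes (assumptions): each mode is a
# tight cluster around a different basis direction of the latent space.
forward = rng.normal(scale=0.1, size=(50, ACT)) + np.eye(ACT)[0]
turning = rng.normal(scale=0.1, size=(50, ACT)) + np.eye(ACT)[1]
Z = np.vstack([forward, turning])

# PCA via SVD stands in for the t-SNE projection used on the page.
Zc = Z - Z.mean(axis=0)
U, S, Vt = np.linalg.svd(Zc, full_matrices=False)
xy = Zc @ Vt[:2].T  # 2D coordinates, one point per latent action

# Similar motions should cluster: the gap between the two modes' centroids
# should dwarf the spread within a single mode.
d_within = np.linalg.norm(xy[:50].mean(axis=0) - xy[:25].mean(axis=0))
d_between = np.linalg.norm(xy[:50].mean(axis=0) - xy[50:].mean(axis=0))
```

Swapping the SVD projection for `sklearn.manifold.TSNE` on real latent actions reproduces the kind of structure shown in the interactive plot, where translation speed and yaw separate without any pose labels.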

Benchmark Results

Qualitative camera trajectory predictions

Side-by-side driving videos and predicted trajectories on two standard autonomous driving benchmarks.

Generalization

In-the-wild YouTube driving videos

Predicted camera trajectories remain stable across night, traffic, mountain roads, rain, and tunnels.

LA-Pose generalizes to uncurated YouTube driving footage across diverse conditions. Below we show predicted camera trajectories (right) alongside the original video (left) on scenes spanning nighttime, crowded traffic, mountain roads, rain, and tunnels. Videos are sourced from the OpenDV-YouTube dataset.

Citation

BibTeX

@inproceedings{Wang2026LAPose,
  author    = {Wang, Zhengqing and Nair, Saurabh and Chidananda, Prajwal and Kachana, Pujith and Li, Samuel and Brown, Matthew and Furukawa, Yasutaka},
  title     = {LA-Pose: Latent Action Pretraining Meets Pose Estimation},
  booktitle = {CVPR},
  year      = {2026},
}

Acknowledgments

Contributors and support

We thank Jaskaran Singh Sodhi and Saloni Puran Parekh for their help with data construction, and Anner De Jong for engineering support. We are grateful to Thomas Kollar and Gianluca Corrado for reviewing the paper and providing valuable feedback. We also thank Shreyas Rajesh and Soham Phade for their early contributions to the idea of Latent-Actions.