Cameras as Relative Positional Encoding

Ruilong Li¹,²*

Brent Yi¹*

Junchen Liu¹*

Hang Gao¹

Yi Ma¹,³

Angjoo Kanazawa¹

(*Equal contribution)

¹UC Berkeley

²NVIDIA

³HKU

Paper
Code

Problem: Camera Conditioning in Vision Transformers

As transformers become more prevalent in multiview 3D perception, a central challenge is how to condition models on camera parameters.

Idea: Camera as Image Token's "Positional" Identifier

In multiview settings, camera intrinsic and extrinsic parameters are important for understanding the meaning of images. Camera parameters act as a "positional" identifier that we aim to bind to visual tokens.


Figure panels: 2D observations; Camera: {intrinsic, extrinsic}

In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose, Projective Positional Encoding (PRoPE), which captures complete camera frustums (both intrinsics and extrinsics) as a relative positional encoding.
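To make the token-level option concrete, here is a minimal sketch of a raymap: for every pixel, the origin and direction of its viewing ray in world coordinates, which can then be attached to the image features before tokenization. It assumes a pinhole camera and a 4x4 camera-to-world pose. The function name `raymap` and the 6-channel origin-plus-direction layout are illustrative choices, not taken from the PRoPE code, and other ray parameterizations exist.

```python
import numpy as np

def raymap(K, cam_to_world, height, width):
    """Per-pixel ray origins and unit directions in world coordinates.

    K: (3, 3) pinhole intrinsics; cam_to_world: (4, 4) rigid transform.
    Returns a (height, width, 6) array laid out as [origin_xyz, direction_xyz].
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project through the intrinsics, then rotate into the world frame.
    dirs_cam = pixels @ np.linalg.inv(K).T
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Every ray starts at the camera center.
    origins = np.broadcast_to(cam_to_world[:3, 3], dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)

# Example camera (made-up values): focal length 100 px, 64x48 image.
K = np.array([[100.0, 0.0, 32.5],
              [0.0, 100.0, 24.5],
              [0.0, 0.0, 1.0]])
rays = raymap(K, np.eye(4), height=48, width=64)  # shape (48, 64, 6)
```

Note that the raymap is expressed in a particular world frame, so it is an absolute encoding: changing the choice of world frame changes every value in it.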

Positional Encoding for Multiview Geometry

Just as language tokens require a 1D positional encoding, image tokens in multiview tasks require a geometric positional encoding. This can be done in absolute or relative terms.
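The practical difference between the two can be checked numerically. The sketch below assumes 4x4 world-to-camera extrinsics and uses a hypothetical helper `random_rigid` to generate poses. It shows that re-expressing the scene in a different world frame changes each absolute pose, while the relative pose between two cameras stays the same.

```python
import numpy as np

def random_rigid(rng):
    """Random 4x4 rigid transform (rotation from QR, random translation)."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1  # flip one axis to get a proper rotation
    T = np.eye(4)
    T[:3, :3] = q
    T[:3, 3] = rng.normal(size=3)
    return T

rng = np.random.default_rng(0)
T_i, T_j = random_rigid(rng), random_rigid(rng)  # world-to-camera poses
G = random_rigid(rng)                            # arbitrary change of world frame

# Re-expressing the scene in a new world frame changes each absolute pose...
T_i_new, T_j_new = T_i @ G, T_j @ G
absolute_changed = not np.allclose(T_i, T_i_new)

# ...but the relative pose between the two cameras is unchanged,
# because (T_i G)(T_j G)^-1 = T_i T_j^-1.
rel_old = T_i @ np.linalg.inv(T_j)
rel_new = T_i_new @ np.linalg.inv(T_j_new)
relative_unchanged = np.allclose(rel_old, rel_new)
```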

Method: Relative Projective Positional Encoding (PRoPE)

PRoPE encodes camera geometry as a relative positional encoding. It accomplishes this via the relative projective transformation between cameras.
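One way to write such a transformation down is sketched below. It rests on an assumption: each camera's projective matrix is built by lifting its 3x3 intrinsics to 4x4 and composing them with its 4x4 world-to-camera extrinsics, and the relative transformation between cameras i and j is then `P_i @ inv(P_j)`. The helper names and the example camera values are hypothetical; the paper and the GitHub code hold the exact definition.

```python
import numpy as np

def projective_matrix(K, world_to_cam):
    """Lift 3x3 intrinsics to 4x4 and compose with 4x4 extrinsics."""
    K4 = np.eye(4)
    K4[:3, :3] = K
    return K4 @ world_to_cam

def relative_projective(P_i, P_j):
    """Transformation from camera j's projective frame to camera i's."""
    return P_i @ np.linalg.inv(P_j)

def intrinsics(focal, cx=0.5, cy=0.5):
    """Simple pinhole intrinsics with square pixels (illustrative values)."""
    return np.array([[focal, 0.0, cx], [0.0, focal, cy], [0.0, 0.0, 1.0]])

# Two cameras: i at the world origin, j shifted sideways.
T_i = np.eye(4)
T_j = np.eye(4)
T_j[:3, 3] = [0.2, 0.0, 0.0]

P_i = projective_matrix(intrinsics(1.0), T_i)
P_j = projective_matrix(intrinsics(1.0), T_j)
P_j_zoom = projective_matrix(intrinsics(2.0), T_j)  # same pose, zoomed in

rel = relative_projective(P_i, P_j)
rel_zoom = relative_projective(P_i, P_j_zoom)

# A rigid change of world frame (rotation about z plus a translation).
c, s = np.cos(0.3), np.sin(0.3)
G = np.array([[c, -s, 0.0, 1.0],
              [s, c, 0.0, -2.0],
              [0.0, 0.0, 1.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])
rel_reframed = relative_projective(P_i @ G, P_j @ G)
```

Under these assumptions, `rel_reframed` equals `rel`, so the quantity does not depend on the choice of world frame. `rel_zoom` differs from `rel` even though the two poses are identical, which is what separates a projective relative encoding from a pose-only one.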

Like prior methods, it is easy to implement, compatible with FlashAttention, and can be plugged into any existing attention-based framework with minimal changes! See GitHub for the complete implementation.
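The following is a minimal NumPy sketch of how a relative matrix encoding of this kind can wrap an unmodified attention routine. It follows the general recipe of GTA-style attention: transform queries, keys, and values per token before attention, then transform the output back. It is not the released implementation. The per-token matrix layout `P`, the function names, and the choice to apply one 4x4 block to every group of four channels are assumptions made for illustration, and anything else the full method includes is omitted.

```python
import numpy as np

def apply_blockwise(mats, feats):
    """Apply a per-token 4x4 matrix to each group of 4 feature channels.

    mats: (N, 4, 4); feats: (N, d) with d divisible by 4.
    """
    n, d = feats.shape
    grouped = feats.reshape(n, d // 4, 4)
    return np.einsum("nij,ngj->ngi", mats, grouped).reshape(n, d)

def softmax_attention(q, k, v):
    """Plain scaled dot-product attention; a fused kernel could stand in here."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def relative_attention(q, k, v, P):
    """Attention whose logits and values depend only on P_i @ inv(P_j).

    P: (N, 4, 4), one matrix per token (hypothetical layout).
    """
    P_inv = np.linalg.inv(P)
    P_T = np.swapaxes(P, -1, -2)
    q_t = apply_blockwise(P_T, q)    # q_i -> P_i^T q_i
    k_t = apply_blockwise(P_inv, k)  # k_j -> P_j^-1 k_j
    v_t = apply_blockwise(P_inv, v)  # v_j -> P_j^-1 v_j
    out = softmax_attention(q_t, k_t, v_t)
    return apply_blockwise(P, out)   # o_i -> P_i o_i

# Toy inputs: 6 tokens, 8 channels, one well-conditioned matrix per token.
rng = np.random.default_rng(0)
N, d = 6, 8
q, k, v = (rng.normal(size=(N, d)) for _ in range(3))
P = np.eye(4) + 0.1 * rng.normal(size=(N, 4, 4))
out = relative_attention(q, k, v, P)  # shape (N, d)
```

Because only the inputs and the output of the attention call are transformed, the attention routine itself is untouched. In this sketch, that is why an optimized kernel could be swapped in for `softmax_attention` without other changes.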

Plug-and-Play: Improved Novel View Synthesis

We inject PRoPE into the existing LVSM framework and see immediate improvement! Here are the curves:


Figure panels: LVSM; LVSM + PRoPE

Plug-and-Play: Improved Stereo Depth Estimation

We inject PRoPE into the official UniMatch codebase and evaluate stereo depth estimation performance:


Figure panels: UniMatch; UniMatch + PRoPE

Better Robustness than Alternative Camera Conditioning Methods

Incorporating PRoPE also improves robustness to out-of-distribution inputs, both in terms of 1) number of views (context length), and 2) unseen intrinsic parameters (zoom-in/out) at test time.

Try it out today with our code and let us know your findings!

Acknowledgement

Our work is deeply inspired by several outstanding prior works. In particular, EscherNet and GTA explored relative token-level encoding for SE(3), while CAT3D and Cameras as Rays investigated pixel-level encoding approaches. We would also like to thank Qianqian Wang and Alex Yu for valuable early discussions on this direction, and Aleksander Hołyński for assisting with the CAT3D comparison.

Website inspired by DyCheck