Cameras as Relative Positional Encoding

Ruilong Li1,2*
Brent Yi1*
Junchen Liu1*
Hang Gao1
Yi Ma1,3
Angjoo Kanazawa1
(*Equal contribution)
1UC Berkeley
2NVIDIA
3HKU


Problem: Camera Conditioning in Vision Transformers

As transformers become more prevalent in multiview 3D perception, a central challenge lies in how we condition models on camera parameters.

Idea: Camera as Image Token's "Positional" Identifier

In multiview settings, camera intrinsic and extrinsic parameters are important for understanding the meaning of images. Camera parameters act as a "positional" identifier that we aim to bind to visual tokens.

2D Observations + Camera: {intrinsic, extrinsic}

In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose, Projective Positional Encoding (PRoPE), which captures complete camera frustums (both intrinsics and extrinsics) as a relative positional encoding.

Positional Encoding for Multiview Geometry

Just as language tokens require 1D positional encoding, image tokens in multiview tasks require geometric positional encoding. This encoding can be absolute or relative.
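To make the absolute case concrete, here is a minimal NumPy sketch of a token-level raymap: every pixel is tagged with the origin and direction of its viewing ray in world coordinates. The function name `raymap`, the pinhole intrinsics layout, the half-pixel center convention, and the 4x4 camera-to-world matrix are all assumptions made for illustration, not details taken from our codebase.

```python
import numpy as np

def raymap(K, cam_to_world, height, width):
    """Per-pixel ray origins and unit directions in world coordinates, shape (H, W, 6).

    K:            (3, 3) pinhole intrinsics (assumed layout: fx, fy, cx, cy).
    cam_to_world: (4, 4) rigid camera-to-world transform (assumed convention).
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project through the intrinsics: d_cam = K^{-1} p (row-vector form).
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate into the world frame and normalize to unit length.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Every ray of one camera shares the same origin: the camera center.
    origins = np.broadcast_to(t, dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)
```

Because the origins and directions are expressed in a world frame, this encoding changes whenever that frame is re-chosen, even though the scene and the cameras' relative layout stay the same. A relative encoding avoids that dependence.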


Method: Relative Projective Positional Encoding (PRoPE)

PRoPE encodes camera geometry as a relative positional encoding. It does so via the relative projective transformation between cameras.
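As a rough illustration of what a relative projective transformation between two cameras looks like, the sketch below lifts the 3x3 intrinsics to 4x4, composes them with the world-to-camera extrinsics, and relates camera j to camera i. The helper names `projective_matrix` and `relative_projective` and the world-to-camera convention are assumptions for this example; see the GitHub implementation for the exact form used in PRoPE.

```python
import numpy as np

def projective_matrix(K, world_to_cam):
    """Compose lifted intrinsics with extrinsics into one 4x4 matrix.

    K:            (3, 3) intrinsics.
    world_to_cam: (4, 4) rigid world-to-camera transform (assumed convention).
    """
    K4 = np.eye(4)
    K4[:3, :3] = K  # lift K to 4x4 so it composes with the extrinsics
    return K4 @ world_to_cam

def relative_projective(K_i, T_i, K_j, T_j):
    """Map from camera j's projective frame to camera i's: P_i @ inv(P_j)."""
    P_i = projective_matrix(K_i, T_i)
    P_j = projective_matrix(K_j, T_j)
    return P_i @ np.linalg.inv(P_j)
```

Two properties follow directly from the algebra. First, re-choosing the world frame by any invertible 4x4 transform G (replacing each T with T @ G) leaves the result unchanged, since G cancels against its inverse. Second, with identity intrinsics the expression reduces to the familiar relative pose T_i @ inv(T_j), so the intrinsics are what extend a relative SE(3) encoding to the full camera frustum.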


Like prior methods, it is easy to implement, compatible with FlashAttention, and can be plugged into any existing attention-based framework with minimal changes! See GitHub for complete implementation.
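One reason relative encodings of this kind can stay compatible with fused attention kernels is that the pairwise term factors into per-token transforms applied before and after an unmodified attention call. The sketch below shows that general pattern in NumPy, in the spirit of GTA-style formulations. It is a hedged illustration, not our implementation: `relative_attention`, the single-head layout, and the per-token matrices `D` are hypothetical, and `attention` merely stands in for whatever fused kernel a framework provides.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Plain scaled dot-product attention; a stand-in for any fused kernel."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def relative_attention(q, k, v, D):
    """Attention whose scores and values see only D_i @ inv(D_j).

    q, k, v: (N, d) single-head queries, keys, values.
    D:       (N, d, d) invertible per-token matrices (hypothetical; in a
             camera-aware model they could be derived from each token's camera).
    """
    D_inv = np.linalg.inv(D)
    q_t = np.einsum("nji,nj->ni", D, q)      # D_i^T q_i
    k_t = np.einsum("nij,nj->ni", D_inv, k)  # D_j^{-1} k_j
    v_t = np.einsum("nij,nj->ni", D_inv, v)  # D_j^{-1} v_j
    out = attention(q_t, k_t, v_t)           # unmodified attention call
    return np.einsum("nij,nj->ni", D, out)   # D_i o_i
```

With this factoring, the score between tokens i and j is q_i^T (D_i D_j^{-1}) k_j and the aggregated value is (D_i D_j^{-1}) v_j, so only the relative transform enters the computation while the attention kernel itself is untouched.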

Plug-and-Play: Improved Novel View Synthesis

We inject PRoPE into the existing LVSM framework and see immediate improvement!

Plug-and-Play: Improved Stereo Depth Estimation

We inject PRoPE into the official UniMatch codebase and evaluate stereo depth estimation performance.

Better Robustness than Alternative Camera Conditioning Methods

Incorporating PRoPE also improves robustness to out-of-distribution inputs at test time, in terms of both 1) the number of views (context length) and 2) unseen intrinsic parameters (zoom-in/out).


Try it out today with our code and let us know your findings!

Acknowledgement

Our work is deeply inspired by several outstanding prior works. In particular, EscherNet and GTA explored relative token-level encoding for SE(3), while CAT3D and Cameras as Rays investigated pixel-level encoding approaches. We would also like to thank Qianqian Wang and Alex Yu for valuable early discussions on this direction, and Aleksander Hołyński for assisting with the CAT3D comparison.


Website inspired by DyCheck