As transformers become more prevalent in multiview 3D perception, a central challenge lies in how we condition models on camera parameters:
In multiview settings, camera intrinsic and extrinsic parameters are important for understanding the meaning of images. Camera parameters act as a "positional" identifier that we aim to bind to visual tokens.
Figure: 2D observations paired with camera parameters {intrinsic, extrinsic}.
In this work, we compare techniques for conditioning transformers on cameras: token-level raymap encodings, attention-level relative pose encodings, and a new relative encoding we propose, Projective Positional Encoding (PRoPE), which captures complete camera frustums (both intrinsics and extrinsics) as a relative positional encoding.
Just as language tokens require 1D positional encoding, image tokens in multiview tasks require geometric positional encoding. This can be done in absolute or relative terms:
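For the absolute, token-level case, a raymap is one common choice. The NumPy sketch below is only an illustration, not code from our repository: the function name `raymap`, the pinhole model, and the camera-to-world pose convention are assumptions made for the example. It computes a per-pixel ray origin and unit direction in world coordinates, so the encoding depends on which world frame is chosen.

```python
import numpy as np

def raymap(K, cam_to_world, height, width):
    """Per-pixel ray origins and unit directions in world coordinates.

    K: (3, 3) pinhole intrinsics (assumed); cam_to_world: (4, 4) rigid pose.
    Returns an array of shape (height, width, 6): origin (3) + direction (3).
    """
    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)
    # Back-project through the intrinsics, then rotate into the world frame.
    dirs_cam = pixels @ np.linalg.inv(K).T
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(t, dirs_world.shape)
    return np.concatenate([origins, dirs_world], axis=-1)

# Hypothetical camera: 64x48 image, focal length 100 px, shifted along x.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 0.0, 0.0]
rays = raymap(K, pose, height=48, width=64)
```

Each pixel (or patch) would then carry these six numbers alongside its visual features. Moving the world origin changes every ray origin, which is the sense in which this encoding is absolute.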
PRoPE encodes camera geometry as a relative positional encoding. It accomplishes this via the relative projective transformation between cameras:
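As a rough sketch of what a relative projective transformation between two cameras can look like, the NumPy example below lifts each camera's 3x3 intrinsics to a 4x4 matrix, composes it with a 4x4 world-to-camera pose, and forms the product of one camera's matrix with the inverse of the other's. The helper names (`rigid`, `projective_matrix`, `relative_projective`) and the world-to-camera convention are assumptions for illustration; the exact form used by PRoPE should be checked against the paper and the GitHub implementation.

```python
import numpy as np

def rigid(yaw, t):
    """Hypothetical helper: 4x4 rigid transform (rotation about z, then translation)."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T

def projective_matrix(K, world_to_cam):
    """Lift 3x3 intrinsics to 4x4 and compose with the 4x4 world-to-camera pose."""
    K_hat = np.eye(4)
    K_hat[:3, :3] = K
    return K_hat @ world_to_cam

def relative_projective(K_i, T_i, K_j, T_j):
    """Transformation taking camera j's projective frame to camera i's."""
    return projective_matrix(K_i, T_i) @ np.linalg.inv(projective_matrix(K_j, T_j))

# Two hypothetical cameras with different focal lengths and poses.
K_i = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
K_j = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
T_i = rigid(0.3, [1.0, 0.0, 2.0])
T_j = rigid(-0.5, [0.0, 1.0, 3.0])
rel = relative_projective(K_i, T_i, K_j, T_j)

# Re-expressing both poses in a different world frame G leaves the result unchanged.
G = rigid(1.2, [4.0, -2.0, 0.5])
rel_moved = relative_projective(K_i, T_i @ G, K_j, T_j @ G)
print(np.allclose(rel, rel_moved))
```

Because the world-frame change cancels in the product, the quantity depends only on how the two cameras relate to each other, including their intrinsics.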
Like prior methods, it is easy to implement, compatible with FlashAttention, and can be plugged into any existing attention-based framework with minimal changes! See GitHub for the complete implementation.
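To illustrate why a relative encoding of this kind can leave the attention kernel itself untouched, here is a minimal NumPy sketch in the style of GTA-like relative encodings. It is not our implementation: it assumes one 4x4 matrix per token and a 4-dimensional feature, whereas a real model would apply such matrices to blocks of a larger feature dimension. The names `plain_attention` and `relative_attention` are hypothetical. Queries, keys, and values are transformed per token before a standard attention call, and the output is transformed afterwards, so the attention logits only ever see products of the form P_i P_j^{-1}.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def plain_attention(q, k, v):
    """Standard scaled dot-product attention; stands in for any fused kernel."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(logits) @ v

def relative_attention(q, k, v, P):
    """q, k, v: (N, 4) tokens; P: (N, 4, 4) per-token camera matrices (assumed)."""
    P_inv = np.linalg.inv(P)
    q_t = np.einsum("nji,nj->ni", P, q)      # P_n^T q_n
    k_t = np.einsum("nij,nj->ni", P_inv, k)  # P_n^{-1} k_n
    v_t = np.einsum("nij,nj->ni", P_inv, v)  # P_n^{-1} v_n
    out = plain_attention(q_t, k_t, v_t)     # unchanged attention kernel
    return np.einsum("nij,nj->ni", P, out)   # P_n out_n

# Random stand-in tokens and well-conditioned stand-in camera matrices.
rng = np.random.default_rng(0)
N = 5
q, k, v = rng.normal(size=(3, N, 4))
P = np.eye(4) + 0.1 * rng.normal(size=(N, 4, 4))
out = relative_attention(q, k, v, P)

# A shared change of reference frame G cancels out of every P_i P_j^{-1}.
G = np.eye(4) + 0.1 * rng.normal(size=(4, 4))
out_moved = relative_attention(q, k, v, P @ G)
print(np.allclose(out, out_moved))
```

In this sketch `plain_attention` could be swapped for any fused attention routine, since the camera-dependent work happens entirely in the per-token transforms before and after it.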
We inject PRoPE into the existing LVSM framework and see immediate improvement! Here are the curves:
We inject PRoPE into the official UniMatch codebase and evaluate stereo depth estimation performance:
Incorporating PRoPE also improves robustness to out-of-distribution inputs, both in terms of 1) number of views (context length), and 2) unseen intrinsic parameters (zoom-in/out) at test time.
Try it out today with our code and let us know your findings!
Our work is deeply inspired by several outstanding prior works. In particular, EscherNet and GTA explored relative token-level encoding for SE(3), while CAT3D and Cameras as Rays investigated pixel-level encoding approaches. We would also like to thank Qianqian Wang and Alex Yu for valuable early discussions on this direction, and Aleksander Hołyński for assisting with the CAT3D comparison.