QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer

*Corresponding author.

Visualization Results

Side-by-side comparison panels: QVGGT and VGGT.

🔔 News

  • [2026.02.21] 🎉 QVGGT has been accepted to CVPR 2026!

Abstract

Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT.

Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camera information compensation, which removes these outliers from activation calibration and restores their geometric cues using a PCA-derived global compensation token. Finally, we develop a task-aware scale search mechanism that evaluates candidate quantization scales not only through layer reconstruction but also through multi-head supervision and cross-head geometric consistency among camera poses, depth maps, and point maps.

Extensive experiments on multiple geometry perception benchmarks demonstrate that QVGGT achieves near-lossless W4A16 quantization, preserving the accuracy of all 3D prediction heads while delivering a $3\times$ to $4.9\times$ memory reduction and up to a $2.8\times$ speedup on real hardware over FP32.

Our approach makes high-fidelity 3D perception feasible on edge devices, enabling practical deployment of feed-forward 3D reconstruction models in resource-constrained real-world environments.

Method

Method overview

Overview of QVGGT. Our framework consists of three components to preserve VGGT performance under low-bit quantization. Step 1: Sensitivity analysis. We estimate per-block quantization sensitivity across frame-wise and global transformer blocks, enabling selective mixed-precision assignment for critical layers. Step 2: Token filtering with camera information compensation. To mitigate outlier-dominated scale estimation, we exclude camera and register tokens during activation statistics collection. A camera information compensation (CIC) token is then constructed via top-K PCA over camera-token activations and injected into the camera head. Step 3: Task-aware scale search. We select quantization scales using a task-aware objective that combines multi-head losses, geometric consistency, and reconstruction error, jointly optimizing accuracy and robustness across all heads.
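
The three steps can be illustrated with a toy NumPy sketch. It is not the QVGGT implementation: the blocks are stand-in linear layers with a tanh nonlinearity, and all function names (`fake_quant`, `block_sensitivity`, `assign_bits`, `filtered_channel_stats`, `search_clip`), the 4/8-bit split, the assumption that camera and register tokens occupy the first rows of the token matrix, and the clipping-ratio grid are assumptions made for illustration. The CIC token, the multi-head losses, and the geometric consistency term are not modeled; `extra_loss` is a hypothetical hook where such task terms could be plugged in.

```python
import numpy as np


def fake_quant(w, bits=4, clip=1.0):
    """Symmetric per-row uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = clip * np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-12)  # guard against all-zero rows
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def block_sensitivity(weights, x, bits=4):
    """Step 1 (toy): output MSE when only block i is quantized."""
    def forward(ws):
        h = x
        for w in ws:
            h = np.tanh(h @ w.T)  # stand-in for a transformer block
        return h

    ref = forward(weights)
    errs = []
    for i in range(len(weights)):
        ws = list(weights)
        ws[i] = fake_quant(weights[i], bits)
        errs.append(float(np.mean((forward(ws) - ref) ** 2)))
    return errs


def assign_bits(errs, num_protected, low=4, high=8):
    """Give the `num_protected` most sensitive blocks the higher bit-width."""
    order = np.argsort(errs)[::-1]  # most sensitive first
    bits = [low] * len(errs)
    for i in order[:num_protected]:
        bits[i] = high
    return bits


def filtered_channel_stats(tokens, num_special):
    """Step 2 (toy): per-channel mean |activation| over patch tokens only.

    tokens has shape (num_tokens, dim); the first `num_special` rows are
    assumed to be the camera and register tokens and are excluded.
    """
    return np.abs(tokens[num_special:]).mean(axis=0)


def search_clip(w, x, candidates, bits=4, extra_loss=None):
    """Step 3 (toy): choose the clipping ratio with the lowest loss.

    The loss is the layer reconstruction error plus an optional task term
    supplied by the caller through the hypothetical `extra_loss` hook.
    """
    ref = x @ w.T
    best, best_loss = None, np.inf
    for c in candidates:
        wq = fake_quant(w, bits, clip=c)
        loss = np.mean((x @ wq.T - ref) ** 2)
        if extra_loss is not None:
            loss += extra_loss(wq)
        if loss < best_loss:
            best, best_loss = c, loss
    return best, float(best_loss)
```

In this sketch, `block_sensitivity` followed by `assign_bits` yields a per-block bit-width plan, `filtered_channel_stats` shows how high-variance special tokens can be kept out of the calibration statistics, and `search_clip` picks one scale per layer from a small candidate grid.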

Experiments & Comparison

BibTeX