ECoDepth
Effective Conditioning of Diffusion Models for Monocular Depth Estimation

CVPR 2024

Suraj Patni* Aradhye Agarwal* Chetan Arora

Indian Institute of Technology Delhi

Abstract

In the absence of parallax cues, a learning-based single image depth estimation (SIDE) model relies heavily on shading and contextual cues in the image. While this simplicity is attractive, such models must be trained on large and varied datasets, which are difficult to capture. It has been shown that using embeddings from pretrained foundational models, such as CLIP, improves zero-shot transfer in several applications. Taking inspiration from this, in our paper we explore the use of global image priors generated from a pretrained ViT model to provide more detailed contextual information. We argue that the embedding vector from a ViT model, pretrained on a large dataset, captures more relevant information for SIDE than the usual route of generating pseudo image captions followed by CLIP-based text embeddings. Based on this idea, we propose a new SIDE model that uses a diffusion backbone conditioned on ViT embeddings. Our proposed design establishes a new state-of-the-art (SOTA) for SIDE on the NYU Depth v2 dataset, achieving an Abs Rel error of 0.059 (14% improvement) compared to 0.069 by the current SOTA (VPD), and on the KITTI dataset a Sq Rel error of 0.139 (2% improvement) compared to 0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model trained on NYU Depth v2, we report mean relative improvements of (20%, 23%, 81%, 25%) over NeWCRFs on the (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%, 18%, 45%, 9%) by ZoeDepth.
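As a concrete illustration of this conditioning idea, the snippet below extracts a global image prior from a frozen, pretrained ViT, in place of the usual pseudo-caption plus CLIP text-embedding route. This is a minimal sketch, not our released code; the timm model name and pooling behavior are assumptions.

import torch
import timm

# Frozen, pretrained ViT; num_classes=0 makes timm return pooled features.
vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
vit.eval()
for p in vit.parameters():
    p.requires_grad_(False)  # the ViT stays frozen throughout training

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed RGB image
with torch.no_grad():
    prior = vit(image)  # global image embedding, e.g. shape (1, 768)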

Architecture Diagram: The latent representation of the input image undergoes a diffusion process, which is conditioned by our proposed CIDE module. Within the CIDE module, the input image is fed through a frozen ViT model; from the ViT's output, a linear combination of learnable embeddings is computed and then transformed into a 768-dimensional semantic context embedding, which conditions the diffusion backbone. Hierarchical feature maps are then extracted from the U-Net decoder, upsampled, and passed through a depth regressor to produce the depth map.
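The sketch below captures the CIDE mechanism described above, assuming the frozen ViT yields class scores whose softmax-normalized mixture weights a learnable embedding table; all module names and sizes here are illustrative, not our exact implementation.

import torch
import torch.nn as nn

class CIDE(nn.Module):
    def __init__(self, vit: nn.Module, num_embeddings: int = 100,
                 embed_dim: int = 512, context_dim: int = 768):
        super().__init__()
        self.vit = vit
        for p in self.vit.parameters():
            p.requires_grad_(False)          # the ViT is kept frozen
        # Bank of learnable embeddings, mixed according to the ViT's output.
        self.embeddings = nn.Parameter(torch.randn(num_embeddings, embed_dim))
        self.to_weights = nn.Sequential(nn.LazyLinear(num_embeddings),
                                        nn.Softmax(dim=-1))
        # Transform the mixture into the 768-d semantic context embedding.
        self.proj = nn.Sequential(nn.Linear(embed_dim, context_dim),
                                  nn.GELU(),
                                  nn.Linear(context_dim, context_dim))

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            scores = self.vit(image)         # (B, C) scores from the frozen ViT
        w = self.to_weights(scores)          # (B, num_embeddings) mixing weights
        ctx = w @ self.embeddings            # linear combination of embeddings
        return self.proj(ctx)                # (B, 768) conditioning vector

In a Stable-Diffusion-style backbone, this 768-dimensional vector takes the place that CLIP text embeddings normally occupy in the U-Net's cross-attention layers.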

Demo: Zero-Shot Performance on Video

Demo Video: Our depth estimation model in action, capturing intricate details of the driving scene with precision.

State-of-the-art results on Monocular Depth Estimation

Our model achieves state-of-the-art results on the indoor NYUv2 dataset and the outdoor KITTI dataset for metric depth estimation from monocular images.
Table 1: NYUv2 Dataset
Table 2: KITTI Dataset
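For reference, the Abs Rel and Sq Rel errors quoted in these tables follow the standard definitions, sketched here over valid ground-truth pixels (the masking convention is an assumption):

import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    # Mean absolute relative error: mean(|pred - gt| / gt).
    mask = gt > 0                 # ignore pixels without ground truth
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

def sq_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    # Mean squared relative error: mean((pred - gt)**2 / gt).
    mask = gt > 0
    return float(np.mean((pred[mask] - gt[mask]) ** 2 / gt[mask]))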

Generalization and Zero Shot Transfer

Our model generalizes well to other datasets and achieves state-of-the-art results for zero-shot transfer. Trained only on the NYUv2 dataset, it is evaluated on Sun-RGBD, iBims1, DIODE, and HyperSim, and achieves SOTA results compared to all previous methods that claim to generalize to unseen data (e.g., ZoeDepth).

Zero-Shot Qualitative Results

Qualitative results across four different datasets, demonstrating the zero-shot performance of our model trained only on the NYUv2 dataset. Corresponding quantitative results are presented in the table above. The first column shows RGB images, the second the ground-truth depth, and the third our model's predicted depth. Additional images for each dataset are available in the Supplementary Material.

BibTeX (Citation)


@InProceedings{Patni_2024_CVPR,
    author    = {Patni, Suraj and Agarwal, Aradhye and Arora, Chetan},
    title     = {ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {28285--28295}
}