SSR

Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning

Yang Liu1,†, Ming Ma3,†, Xiaomin Yu4,†, Pengxiang Ding1,2,§, Han Zhao1,2,
Mingyang Sun1,2,5, Siteng Huang2, Donglin Wang1,*

1Westlake University, 2Zhejiang University, 3Harbin Institute of Technology, 4The Hong Kong University of Science and Technology (Guangzhou), 5Shanghai Innovation Institute
†Equal contribution. §Project lead. *Corresponding author.

Abstract

Despite impressive advancements in Vision-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either require specialized sensors or fail to effectively exploit depth information for higher-order reasoning. To this end, we propose SSR (Spatial Sense and Reasoning), a novel framework that transforms raw depth data into structured, interpretable textual rationales. These textual rationales serve as meaningful intermediate representations that significantly enhance spatial reasoning capabilities. Additionally, we leverage knowledge distillation to compress the generated rationales into compact latent embeddings, which enable resource-efficient and plug-and-play integration into existing VLMs without retraining. To enable comprehensive evaluation, we introduce SSR-CoT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBench, a comprehensive multi-task benchmark. Extensive experiments on multiple benchmarks demonstrate that SSR substantially improves depth utilization and enhances spatial reasoning, thereby advancing VLMs toward more human-like multi-modal understanding.

  • We propose SSR, an efficient VLM capable of simultaneously performing depth perception and spatial reasoning, and of generating answers grounded in implicit reasoning rationales.
  • We introduce SSR-CoT, a million-scale visual-language reasoning dataset enriched with intermediate spatial reasoning annotations, and present SSRBench, a comprehensive multi-task benchmark.
  • Extensive experiments and thorough analysis across various benchmarks demonstrate that SSR efficiently and substantially enhances the spatial understanding of existing VLMs.

SSR

Schematic of the SSR framework. (a) Overall pipeline. (b) Full architecture of SSR, comprising the MIDI module followed by the VLM. (c) The two training stages of SSR. In stage 1, the LLM provides alignment supervision for the MIDI module; stage 2 is optional.
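To make the diagram concrete, below is a minimal PyTorch sketch of how a MIDI-style module could compress fused RGB and depth features into a few latent rationale tokens for the VLM. The dimensions, token count, and cross-attention design are illustrative assumptions on our part, not the released implementation.

```python
import torch
import torch.nn as nn

class MIDIModule(nn.Module):
    """Illustrative MIDI-style module (dimensions and design are assumptions):
    fuses RGB and depth features and compresses them into a small set of
    latent rationale tokens that can be prepended to a VLM's input embeddings."""

    def __init__(self, vision_dim=1024, depth_dim=1024, llm_dim=4096,
                 num_rationale_tokens=16, num_heads=8):
        super().__init__()
        # Learnable queries, one per latent rationale token.
        self.queries = nn.Parameter(torch.randn(num_rationale_tokens, llm_dim))
        # Project concatenated RGB + depth patch features into the LLM width.
        self.proj = nn.Linear(vision_dim + depth_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=num_heads, batch_first=True)

    def forward(self, rgb_feats, depth_feats):
        # rgb_feats: (B, N, vision_dim), depth_feats: (B, N, depth_dim)
        fused = self.proj(torch.cat([rgb_feats, depth_feats], dim=-1))      # (B, N, llm_dim)
        queries = self.queries.unsqueeze(0).expand(fused.size(0), -1, -1)   # (B, T, llm_dim)
        latent_rationale, _ = self.attn(queries, fused, fused)              # (B, T, llm_dim)
        # Stage 1 would align these tokens with the LLM's embedding of the textual
        # rationale (distillation); stage 2 optionally fine-tunes jointly with the VLM.
        return latent_rationale
```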

Experiments

SSR with 3 billion parameters achieves results comparable to or better than large-scale baseline models, including closed-source and backbone models. Our larger 7-billion-parameter variant yields the best performance on most tasks across the two benchmarks.

Performance gains of SSR over the backbone model across the four benchmarks at varying model scales.

We also evaluate SSR without the second training stage. These experiments show how the MIDI module performs when integrated in a plug-and-play manner, which still yields improved spatial understanding; a sketch of this integration follows below.
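As a rough illustration of this plug-and-play setting, the snippet below (continuing the sketch above) shows how stage-1 latent rationale tokens might be prepended to a frozen VLM's question embeddings at inference time. The checkpoint name and the Hugging Face-style `get_input_embeddings`/`generate` interface are assumptions, not the project's actual API.

```python
import torch

# Hypothetical plug-and-play use after stage 1 only: the backbone VLM stays frozen
# and the MIDI latent rationale tokens are simply prepended to the question embeddings.
midi = MIDIModule()                                  # defined in the sketch above
midi.load_state_dict(torch.load("midi_stage1.pt"))   # illustrative checkpoint name
midi.eval()

@torch.no_grad()
def answer(vlm, tokenizer, rgb_feats, depth_feats, question_ids):
    rationale_tokens = midi(rgb_feats, depth_feats)            # (B, T, llm_dim)
    text_embeds = vlm.get_input_embeddings()(question_ids)     # (B, L, llm_dim)
    inputs_embeds = torch.cat([rationale_tokens, text_embeds], dim=1)
    output_ids = vlm.generate(inputs_embeds=inputs_embeds, max_new_tokens=64)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```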

In the left example, the images depict only people and bananas, so the model must set aside conventional assumptions and reason carefully about the spatial relations explicitly present in the image to answer accurately. In the right example, complex relationships among numerous objects are depicted, and the features relevant to the posed question are not immediately obvious; the model must therefore grasp the correspondence between each object and the question, as well as the intricate spatial relations among these objects, to produce a correct answer. These examples demonstrate that SSR effectively enhances the spatial awareness and reasoning capabilities of vision-language models, significantly improving their ability to understand complex spatial relationships.