RT-GS2: Real-Time Generalizable Semantic Segmentation for 3D Gaussian Representations of Radiance Fields

BMVC 2024
Vrije Universiteit Brussel
Universitatea Tehnică din Cluj-Napoca

*Indicates Equal Contribution
Motivation figure

Visualization of the enhanced view-consistency across subsequent frames. (Top) Our view-independent 3D features visualized using PCA, (middle) semantic segmentation without our view-independent 3D features, and (bottom) the proposed semantic segmentation using our view-independent 3D features.

Abstract

Gaussian Splatting has revolutionized the world of novel view synthesis by achieving high rendering performance in real time. Recently, studies have focused on enriching these 3D representations with semantic information for downstream tasks. In this paper, we introduce RT-GS2, the first generalizable semantic segmentation method employing Gaussian Splatting. While existing Gaussian Splatting-based approaches rely on scene-specific training, RT-GS2 demonstrates the ability to generalize to unseen scenes. Our method adopts a new approach by first extracting view-independent 3D Gaussian features in a self-supervised manner, followed by a novel View-Dependent / View-Independent (VDVI) feature fusion to enhance semantic consistency over different views. Extensive experimentation on three different datasets showcases RT-GS2’s superiority over state-of-the-art methods in semantic segmentation quality, exemplified by an 8.01% increase in mIoU on the Replica dataset. Moreover, our method achieves real-time performance of 27.03 FPS, a 901× speedup over existing approaches. This work represents a significant advancement in the field by introducing, to the best of our knowledge, the first real-time generalizable semantic segmentation method for 3D Gaussian representations of radiance fields.

Method

The architecture of the proposed method consists of three main stages: (i) a new self-supervised view-independent 3D Gaussian feature extractor, (ii) the rendering of the 3D information encapsulated in the enhanced 3D Gaussians to a specific view, and (iii) a novel View-Dependent / View-Independent (VDVI) feature fusion. The first stage transforms a 3D Gaussian representation of a scene into a set of features. Importantly, since the proposed method operates on the entire 3D Gaussian representation as input, these features are view-independent. In the second stage, the features are rendered using alpha blending to a specific view defined by a given pose. In parallel, the novel view is rendered. In the last stage, the novel VDVI feature fusion module extracts view-dependent features and fuses them with the rendered view-independent features at different scales. A joint decoder is employed to obtain the semantic predictions. By training the model on different scenes, RT-GS2 is able to generalize semantic segmentation to unseen scenes. A minimal sketch of this pipeline is given below.
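To make the data flow concrete, the following is a minimal PyTorch sketch of the three stages. All module and function names (ViewIndependentFeatureExtractor, alpha_blend_features, VDVIFusionSegmenter) and layer sizes are hypothetical stand-ins, the fusion is shown at a single scale rather than multiple scales, and the blending weights follow the standard Gaussian Splatting formulation; this illustrates the described architecture, not the authors' implementation.

import torch
import torch.nn as nn

class ViewIndependentFeatureExtractor(nn.Module):
    # Stage (i): map per-Gaussian attributes (position, scale, rotation,
    # opacity, SH coefficients, ...) to view-independent features. A simple
    # point-wise MLP stands in for the paper's self-supervised extractor.
    def __init__(self, in_dim=59, feat_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, gaussians):          # (N, in_dim) per-Gaussian attributes
        return self.mlp(gaussians)         # (N, feat_dim) view-independent features

def alpha_blend_features(feats, alphas):
    # Stage (ii): composite per-Gaussian features along each ray, front to back,
    # with the usual splatting weights w_k = alpha_k * prod_{j<k}(1 - alpha_j).
    # feats:  (P, K, C) features of the K depth-sorted Gaussians per pixel
    # alphas: (P, K)    per-pixel opacity contribution of each Gaussian
    trans = torch.cumprod(1.0 - alphas, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = alphas * trans                       # (P, K)
    return (weights.unsqueeze(-1) * feats).sum(1)  # (P, C) rendered feature map

class VDVIFusionSegmenter(nn.Module):
    # Stage (iii): extract view-dependent features from the rendered RGB image,
    # fuse them with the rendered view-independent feature map (single scale
    # here), and decode per-pixel semantic logits with a joint decoder.
    def __init__(self, feat_dim=32, num_classes=21):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(64 + feat_dim, 64, 1)
        self.decoder = nn.Conv2d(64, num_classes, 1)

    def forward(self, rgb, vi_feats):      # (B, 3, H, W), (B, feat_dim, H, W)
        vd = self.img_enc(rgb)             # view-dependent 2D features
        x = torch.relu(self.fuse(torch.cat([vd, vi_feats], dim=1)))
        return self.decoder(x)             # (B, num_classes, H, W) logits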

Method overview

Overview of the proposed method.

Experiments

Quantitative

Experiments were conducted on three datasets: Replica, ScanNet, and ScanNet++, representing synthetic and real-world indoor scenes. Experimental settings meticulously followed those of GSNeRF for Replica and ScanNet. Details and splits are provided in the supplementary material. As evaluation metrics, we employed mean Intersection over Union (mIoU), mean Accuracy (mAcc), and overall Accuracy (oAcc) for segmentation, and Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) for rendering quality. Inference speed was measured in Frames Per Second (FPS).
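For reference, the snippet below computes the three segmentation metrics from a confusion matrix. These are the standard definitions (rows index ground truth, columns index predictions), not the authors' evaluation code, and the function name is our own.

import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    # pred, gt: integer label arrays of identical shape, values in [0, num_classes).
    conf = np.bincount(num_classes * gt.ravel() + pred.ravel(),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(float)
    iou = tp / (conf.sum(0) + conf.sum(1) - tp + 1e-10)   # per-class IoU
    acc = tp / (conf.sum(1) + 1e-10)                      # per-class accuracy
    miou, macc = iou.mean(), acc.mean()
    oacc = tp.sum() / conf.sum()                          # overall pixel accuracy
    return miou, macc, oacc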
Table 1

Comparison on Replica of the proposed method against the state of the art for generalizable semantic segmentation and after finetuning on a specific scene. * denotes the addition of a semantic head.

Table 2

Comparison on ScanNet of the proposed method against the state of the art for generalizable semantic segmentation and after finetuning on a specific scene. * denotes the addition of a semantic head.

Table 3

Quantitative results on ScanNet++ of the proposed method for generalizable semantic segmentation and after finetuning on a specific scene.

Qualitative

We present qualitative results on two scenes from the Replica and ScanNet++ datasets. Our evaluation includes both semantic segmentation performance and depth prediction, demonstrating the robustness of our learned features.

BibTeX

@article{jurca2024rt,
  title={RT-GS2: Real-Time Generalizable Semantic Segmentation for 3D Gaussian Representations of Radiance Fields},
  author={Jurca, Mihnea-Bogdan and Royen, Remco and Giosan, Ion and Munteanu, Adrian},
  journal={arXiv preprint arXiv:2405.18033},
  year={2024}
}