Visualization of the enhanced view consistency across consecutive frames. (Top) PCA visualization of our view-independent 3D features, (middle) semantic segmentation without view-independent 3D features, and (bottom) the proposed semantic segmentation using our view-independent 3D features.
Gaussian Splatting has revolutionized the world of novel view synthesis by achieving high rendering performance in real time. Recently, studies have focused on enriching these 3D representations with semantic information for downstream tasks. In this paper, we introduce RT-GS2, the first generalizable semantic segmentation method employing Gaussian Splatting. While existing Gaussian Splatting-based approaches rely on scene-specific training, RT-GS2 demonstrates the ability to generalize to unseen scenes. Our method adopts a new approach by first extracting view-independent 3D Gaussian features in a self-supervised manner, followed by a novel View-Dependent / View-Independent (VDVI) feature fusion to enhance semantic consistency over different views. Extensive experimentation on three different datasets showcases RT-GS2's superiority over state-of-the-art methods in semantic segmentation quality, exemplified by an 8.01% increase in mIoU on the Replica dataset. Moreover, our method achieves real-time performance of 27.03 FPS, a 901× speedup compared to existing approaches. This work represents a significant advancement in the field by introducing, to the best of our knowledge, the first real-time generalizable semantic segmentation method for 3D Gaussian representations of radiance fields.
The architecture of the proposed method consists of three main stages: (i) a new self-supervised view-independent 3D Gaussian feature extractor, (ii) rendering of the 3D information encapsulated in the enhanced 3D Gaussians to a specific view, and (iii) a novel View-Dependent / View-Independent (VDVI) feature fusion. The first stage transforms the 3D Gaussian representation of a scene into a set of features. Note that, since this stage operates on the entire 3D Gaussian representation, the resulting features are view-independent. In the second stage, these features are rendered via alpha blending to a specific view defined by a given pose; in parallel, the RGB image of the novel view is rendered. In the last stage, the novel VDVI feature fusion module extracts view-dependent features from the rendered image and fuses them with the rendered view-independent features at different scales, and a joint decoder produces the semantic predictions. By training on multiple scenes, RT-GS2 is able to generalize semantic segmentation to unseen scenes.
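For concreteness, the feature rendering in the second stage can be written with the standard 3D Gaussian Splatting alpha-blending rule, applied to per-Gaussian features instead of colors (this is our transcription of the standard formulation, not necessarily the paper's exact notation). For a pixel p covered by the depth-ordered set of Gaussians \mathcal{N},

F(p) = \sum_{i \in \mathcal{N}} f_i \, \alpha_i \prod_{j=1}^{i-1} \left(1 - \alpha_j\right),

where f_i is the view-independent feature of Gaussian i and \alpha_i its opacity contribution after projection to the image plane; the rendered RGB image follows the same rule with colors c_i in place of f_i.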
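To make the third stage concrete, below is a minimal, hypothetical PyTorch sketch of a VDVI-style fusion module: a small convolutional encoder extracts view-dependent features from the rendered RGB image, the rendered view-independent feature map is resized and concatenated at each scale, and a joint decoder outputs per-pixel semantic logits. All module names, layer sizes, and tensor shapes here are our own illustrative assumptions, not the authors' implementation.

# Minimal, hypothetical sketch of a VDVI-style fusion module.
# Module names, layer sizes, and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VDVIFusion(nn.Module):
    """Fuses view-dependent (VD) features from the rendered RGB image with
    the rendered view-independent (VI) feature map at multiple scales."""

    def __init__(self, vi_dim: int = 32, num_classes: int = 21):
        super().__init__()
        # View-dependent encoder: a small conv pyramid over the rendered RGB.
        self.vd_encoder = nn.ModuleList([
            nn.Conv2d(3, 32, 3, stride=1, padding=1),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.Conv2d(64, 128, 3, stride=2, padding=1),
        ])
        # One fusion conv per scale, over concatenated VD and resized VI features.
        self.fusers = nn.ModuleList([
            nn.Conv2d(32 + vi_dim, 64, 3, padding=1),
            nn.Conv2d(64 + vi_dim, 64, 3, padding=1),
            nn.Conv2d(128 + vi_dim, 64, 3, padding=1),
        ])
        # Joint decoder producing per-pixel semantic logits.
        self.decoder = nn.Conv2d(64 * 3, num_classes, 1)

    def forward(self, rgb: torch.Tensor, vi_feats: torch.Tensor) -> torch.Tensor:
        # rgb:      (B, 3, H, W) rendered novel view
        # vi_feats: (B, C, H, W) alpha-blended view-independent feature map
        fused, x = [], rgb
        for enc, fuse in zip(self.vd_encoder, self.fusers):
            x = F.relu(enc(x))  # view-dependent features at this scale
            vi = F.interpolate(vi_feats, size=x.shape[-2:], mode="bilinear",
                               align_corners=False)
            fused.append(F.relu(fuse(torch.cat([x, vi], dim=1))))
        # Upsample all scales to full resolution and decode jointly.
        h, w = rgb.shape[-2:]
        fused = [F.interpolate(f, size=(h, w), mode="bilinear",
                               align_corners=False) for f in fused]
        return self.decoder(torch.cat(fused, dim=1))  # (B, num_classes, H, W)

if __name__ == "__main__":
    fusion = VDVIFusion(vi_dim=32, num_classes=21)
    rgb = torch.rand(1, 3, 240, 320)      # rendered novel view
    vi = torch.rand(1, 32, 240, 320)      # rendered VI feature map
    print(fusion(rgb, vi).shape)          # torch.Size([1, 21, 240, 320])

Multi-scale fusion of this kind lets the view-independent features regularize the per-view predictions, which is the mechanism behind the improved cross-view consistency illustrated in the figure above.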
@article{jurca2024rt,
title={RT-GS2: Real-Time Generalizable Semantic Segmentation for 3D Gaussian Representations of Radiance Fields},
author={Jurca, Mihnea-Bogdan and Royen, Remco and Giosan, Ion and Munteanu, Adrian},
journal={arXiv preprint arXiv:2405.18033},
year={2024}
}