
Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads if available).

Daily Overview

Location: 716B
175 theatre

Date: Wednesday, 08-July-2026

8:30am - 10:00am

ThS5: Large Language Models for Intelligent LiDAR Point Cloud Processing
Location: 716B

8:30am - 8:45am

GeoOpen3D: Geometry-guided training-free open-vocabulary 3D segmentation via visual foundation models

Shuai Zhang¹, Zhuoxiao Li¹, Jing Ou¹, Tengxi Wang¹, Zhecheng Shi¹,², Wufan Zhao¹

¹The Hong Kong University of Science and Technology (Guangzhou), China; ²School of Computer and Communication Engineering, Northeastern University, China

Open-vocabulary 3D segmentation offers an attractive alternative to closed-set scene parsing, yet directly transferring 2D vision-language models to outdoor point clouds remains difficult because projection disrupts geometric continuity and sparse sampling weakens mask quality. This paper presents GeoOpen3D, a geometry-guided and training-free framework for open-vocabulary 3D point cloud segmentation. GeoOpen3D constructs a geometry-preserving RGB-D representation through projection, super-sampling, and depth enhancement to improve alignment between 3D structure and 2D foundation models. It then combines GroundingDINO for language-driven proposal generation with SAM for mask extraction, while introducing depth-aware regularisation to favour structurally coherent regions and clearer boundaries. The selected masks are back-projected to the original point cloud through pixel-to-point correspondence, yielding point-wise semantic labels without any 3D model training. Experiments on the SensatUrban dataset show that GeoOpen3D achieves 42.1% mIoU, including 98.5% IoU for buildings and 97.3% IoU for vegetation, outperforming existing training-free open-vocabulary baselines. Additional experiments on a custom island dataset further demonstrate promising transferability to unseen categories. These results indicate that geometry-guided 2D-to-3D transfer provides an effective and scalable path towards open-vocabulary understanding of large-scale outdoor scenes.
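The back-projection step described in the abstract can be illustrated with a minimal sketch. It assumes a top-down orthographic projection onto a square grid, which the abstract does not specify; the helper names `project_top_down` and `back_project`, the cell size, and the toy label image are all hypothetical and are not taken from GeoOpen3D itself.

```python
import numpy as np

def project_top_down(points, cell=1.0):
    """Map each 3D point to a (row, col) pixel of a top-down grid.

    Hypothetical helper: orthographic projection along z with square cells.
    """
    xy = points[:, :2]
    origin = xy.min(axis=0)
    idx = np.floor((xy - origin) / cell).astype(int)
    return idx[:, 1], idx[:, 0]  # rows come from y, cols come from x

def back_project(mask_labels, rows, cols):
    """Give every point the label of the pixel it projects to."""
    return mask_labels[rows, cols]

# Four toy points (x, y, z); values are illustrative only.
points = np.array([
    [0.2, 0.3, 5.0],
    [0.8, 0.1, 6.0],
    [1.5, 0.4, 0.5],
    [1.4, 1.7, 0.2],
])
rows, cols = project_top_down(points, cell=1.0)

# A 2x2 label image standing in for the selected 2D masks
# (0 = building, 1 = vegetation; an assumed encoding).
mask_labels = np.array([[0, 1],
                        [0, 1]])
point_labels = back_project(mask_labels, rows, cols)
```

Because the pixel index of every point is stored at projection time, labelling the cloud is a single array lookup and needs no 3D model.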

8:45am - 9:00am

SPARC: Scalable 3D Panoptic Segmentation with Reinforcement-driven Clustering

Haoling Huang, Ting Han, Jie Ouyang, Yiping Chen

Sun Yat-sen University, People's Republic of China

Large-scale 3D panoptic segmentation is critical for digital twins and geospatial analysis, demanding models that process massive point clouds while distinguishing instances across highly diverse spatial scales. However, prevailing graph-based approaches rely on one-shot optimization, suffering from short-sighted decisions where irreversible local errors propagate globally, leading to severe under-segmentation at boundaries between objects of disparate scales. To overcome this short-sightedness, we present SPARC, a scalable framework that reframes graph clustering as a sequential, self-correcting decision process driven by hierarchical reinforcement learning. Specifically, SPARC employs a dual-level agent where a meta-controller adaptively determines instance completeness while a low-level policy iteratively refines edge affinities, enabling the model to revise early mistakes based on long-horizon rewards rather than greedy local cues. Complementing this, we introduce Semantic Voxel Partitioning (SVP) to generate semantically coherent superpoints, ensuring robust primitives that mitigate noise before clustering begins. Extensive experiments demonstrate that SPARC achieves state-of-the-art performance on the DALES dataset with a Panoptic Quality of 62.4%, surpassing existing methods by 9.8% and effectively resolving multi-scale segmentation ambiguities.
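For readers unfamiliar with the metric, the standard definition of Panoptic Quality can be sketched as follows. This is the commonly used formula, in which a predicted and a ground-truth segment count as a true positive when their IoU exceeds 0.5. The abstract does not describe the exact evaluation protocol used on DALES, so the function name and the toy numbers below are illustrative assumptions.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = (sum of IoUs over true positives) / (TP + 0.5 * FP + 0.5 * FN).

    matched_ious: IoU of each matched (true-positive) segment pair.
    num_fp: predicted segments with no match.
    num_fn: ground-truth segments with no match.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom else 0.0

# Toy example: three matched segments, one spurious and one missed segment.
pq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
```

The metric rewards both accurate masks (high IoU on matches) and correct instance counts (few unmatched segments), which is why under-segmentation at object boundaries lowers it.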

9:00am - 9:15am

LaSA-Net: A Language-Guided Network for Large-Scale 3D Referring Expression Segmentation on the UrbanRefer Benchmark

Bin Liu¹,²,³, Lingfei Ma¹,²,³,⁴, Haiyan Guan⁵, Jonathan Li¹,²,³,⁴

¹Key Laboratory of Geographic Information Science (Ministry of Education), East China Normal University, Shanghai 200241, China; ²Key Laboratory of Spatial-temporal Big Data Analysis and Application of Natural Resources in Megacities, Ministry of Natural Resources, East China Normal University, Shanghai 200241, China; ³School of Geospatial Artificial Intelligence, East China Normal University, Shanghai 200241, China; ⁴Hinton STAI Institute, East China Normal University, Shanghai 200241, China; ⁵School of Remote Sensing and Geomatics Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China

3D Referring Expression Segmentation (3DRES) aims to segment point cloud scenes based on a given expression. However, existing 3DRES methods face three main challenges: (1) significant progress has been made in indoor scenes, yet large-scale and complex outdoor scenes, captured by airborne or mobile LiDAR, remain fully unexplored; (2) traditional methods often suffer from inefficiency and mis-segmentation due to insufficient attention to the spatial information of instances during query generation; and (3) existing models treat all queries equally in the decoder and predict the final mask in one step, which is inefficient in outdoor road scenes dominated by background point clouds, where objects are sparse and small. To address these challenges, a new outdoor 3DRES benchmark, named UrbanRefer, is introduced. The dataset consists of 100 large-scale outdoor scenes and 1,100 specially designed long textual descriptions, emphasizing geospatial relationships and multi-object contexts unique to outdoor environments. Additionally, the Language-guided Spatial Anchoring Network (LaSA-Net) is proposed for the directional segmentation task in outdoor scenes. Specifically, the Local-Global Aggregation (LGA) module is incorporated into the backbone to enhance local and global context awareness, effectively optimizing point features. Furthermore, a Text-driven Localization (TL) module is introduced, which directly predicts the 3D positions of all entities mentioned in the text, providing robust spatial priors for the decoder. Finally, a Hierarchical Prompt-aware Decoder (HPAD) is designed to locate rough regions by extracting task-driven signals from the interaction between expressions and visual features. Extensive experiments demonstrate that the mIoU metric of LaSA-Net outperforms state-of-the-art methods by 0.9%.

9:15am - 9:30am

SceneReasoner: Decoupled Spatial Tokenization for Large-Scene Understanding with LLMs

Bohang Shi, Shengjun Tang, Xiaoming Li, Weixi Wang, Linfu Xie, Baoding Zhou

Shenzhen University, Shenzhen, Guangdong, People's Republic of China

Most existing 3D vision-language models focus on object-level or single-room understanding and perform poorly in large-scale, multi-room indoor environments where task-relevant objects constitute only a small fraction of the total point cloud. When multi-room point clouds are fed directly into an LLM, critical semantic signals are diluted by the vast amount of redundant background, making it difficult for the model to focus on truly relevant regions. We propose SceneReasoner, a decoupled spatial tokenisation framework that addresses this challenge through three core designs: (1) pre-tokenisation text-guided feature weighting that leverages the shared CLIP embedding space between OpenScene point features and text queries to amplify question-relevant point features before any spatial compression occurs; (2) 2D–3D feature fusion that integrates top-down 2D CLIP features with 3D sparse tokens, supplying the model with appearance semantics (such as texture, material, and room layout) absent from raw point clouds; and (3) layer-wise dense feature injection that inserts local dense features into the LLM attention mechanism layer by layer for fine-grained perception of key regions. We evaluate on the XR-Scene benchmark, which covers cross-room question answering and scene captioning over HM3D indoor environments with an average area of 132 m². SceneReasoner achieves the best CIDEr on XR-SceneCaption (+0.33 over LSceneLLM), the highest METEOR on XR-QA, and competitive ROUGE-L across all three tasks, demonstrating the effectiveness of task-guided spatial tokenisation for large-scene understanding.
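The first design, text-guided feature weighting in a shared embedding space, can be illustrated with a minimal sketch. It assumes cosine similarity followed by a temperature-scaled softmax, which is one plausible weighting scheme and not necessarily the one used in SceneReasoner. The function name, the temperature value, and the two-dimensional toy features are hypothetical.

```python
import numpy as np

def text_guided_weights(point_feats, text_feat, temperature=0.1):
    """Weight points by cosine similarity to a text embedding (softmax-normalised)."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    logits = (p @ t) / temperature
    w = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return w / w.sum()

# Toy features standing in for per-point and text embeddings in a shared space.
point_feats = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [1.0, 1.0]])
text_feat = np.array([1.0, 0.0])

weights = text_guided_weights(point_feats, text_feat)
# Scale each point feature by its relevance before any spatial compression.
weighted_feats = point_feats * weights[:, None]
```

Points whose features align with the query receive larger weights, so their signal is less likely to be diluted by background points when the scene is later compressed into tokens.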

9:30am - 9:45am

LLM-Supervised Point Cloud Processing: From Unsupervised 3D Scene-Graph Generation to Interactive Scene Manipulation

Florent Poux¹,², Alex Key³

¹3D Geodata Academy, France; ²Geoscity Lab, University of Liège, Belgium; ³Panoriq AI, Germany

Understanding and manipulating 3D spatial environments remains a fundamental challenge in geospatial sciences, with applications spanning digital twins, facility management, urban planning, and autonomous systems. While point cloud acquisition technologies have matured significantly, the semantic interpretation and interactive manipulation of captured 3D scenes continue to require extensive manual intervention and domain expertise. This paper presents a novel LLM framework that bridges unsupervised graph-based 3D scene understanding with natural language-driven interactive manipulation, enabling context-aware spatial intelligence at scale.

9:45am - 10:00am

Multimodal Large Language Models to road inventory with non-photorealistic Point Cloud visualization

Horia Ameen, Mario Soilán, Henrique Lorenzo Cimadevila, Jesús Balado-Frías

CINTECX, Universidade de Vigo, GeoTECH, 36310, Vigo, Spain

Accurate road inventories are crucial for maintenance, safety, and resource allocation, with automation improving efficiency but often lacking user-friendly human-machine interaction. This paper evaluates how non-photorealistic rendering of 3D point clouds impacts Multimodal Large Language Models (MLLMs) interpretation for road inventory, testing three methods on real road data in Santarém, Portugal. From 3D point clouds coloured with RGB information, non-photorealistic techniques are implemented and compared: Ambient Occlusion (AO), Eye-Dome Lighting (EDL) and Multi Feature-Rich Synthetic Color (MFRSC). Several state-of-the-art MLLMs are also tested: GPT5, Gemini2.5-Pro, Gemini2.5-Flash, CogVLM2, MiniCPM-V, Llama4-scout-17b, Mistral-Small3.2, Qwen2.5vl and Gemma3. The results indicate that non-photorealistic techniques do not hinder the identification of road elements by MLLMs, indicating their potential for 3D point cloud classification tasks even when true RGB colour is not available. Furthermore, the overall performance metrics, with F-scores over 80% for proprietary, state-of-the-art models (GPT5, Sonnet 4.5 and Gemini) show that 2D captures of 3D point clouds can be a suitable data source for zero-shot object classification.

1:30pm - 3:00pm

Forum3A: Legacy Project: How to Secure Funding to Support Geospatial Activities
Location: 716B

3:30pm - 5:15pm

Forum3B: Legacy Project: How to Secure Funding to Support Geospatial Activities
Location: 716B