Conference Agenda
Overview and details of the sessions of this conference.
Location: 713A 125 theatre |
| Date: Sunday, 05-July-2026 | |
| 8:30am - 12:00pm | TuT10: Open-source Scientific Software py4dgeo for Change Analysis in 3D/4D Point Clouds Location: 713A |
| 12:00pm - 1:15pm | ICWG I/IV: Robotics for Mapping and Machine Intelligence Location: 713A |
12:00pm - 12:15pm
A category-specific prompt strategy for semantic 3D indoor mapping using RGB-D camera
1Remote Sensing and Image Analysis, Department of Civil and Environmental Engineering, Technical University of Darmstadt, Germany; 2Geodetic Measurement Systems and Sensor Technology, Department of Civil and Environmental Engineering, Technical University of Darmstadt, Germany
Semantic 3D indoor mapping often depends on supervised learning and large annotated datasets, limiting scalability across diverse environments. This work introduces a category-specific prompt strategy for semantic 3D mapping using RGB-D cameras, integrating RGB-D SLAM with the Segment Anything Model 2 (SAM2) to enable annotation-efficient reconstruction. Keyframes and trajectories extracted from SLAM provide spatial references, while SAM2 performs zero-shot segmentation guided by a Category-Specific Prompt Strategy (CPSS), which segments structural and functional elements (e.g., floors, doors, staircases) by category to reduce prompt interference and manual effort. The segmented keyframes are then fused with depth and pose data to produce instance-level semantic point clouds. Experiments on custom RGB-D sequences and selected ScanNet scenes demonstrate centimeter-level geometric accuracy and strong semantic consistency, with mIoU values up to 0.89 on the custom dataset and 0.98 on ScanNet. The resulting semantic point clouds are clean, structured, and require minimal post-processing, showing that the proposed strategy provides an efficient and scalable solution for semantic 3D indoor mapping without retraining or environment-specific supervision.
12:15pm - 12:30pm
3L-Planner: Lightweight LiDAR mapping and real-time local planning for ground robot autonomous navigation
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, China
Mobile robots are widely used in unmanned surveying, warehouse logistics, and emergency response. However, achieving safe, reliable, and efficient autonomous navigation in unknown environments remains challenging, where accurate environment representation and feasible trajectory planning are crucial. This paper presents an autonomous navigation method integrating lightweight LiDAR mapping with real-time local planning for ground robots. At the perception level, an incremental single-frame point cloud update is used to accumulate and project locally traversable space, producing a lightweight obstacle map that preserves geometric accuracy while reducing planning computation. At the planning level, A* is employed to generate reference control points, and uniform B-spline curves are used to optimize the trajectory while enforcing kinematic feasibility and smoothness. At the control level, nonlinear model predictive control (NMPC) ensures accurate trajectory tracking by producing control commands that satisfy velocity and acceleration constraints. The framework also supports low-cost evaluation in simulation. Experiments in simulated forests, simulated indoor corridors, and real-world gardens and hallways show average navigation speeds of 2.24 m/s, 0.76 m/s, 0.43 m/s, and 0.38 m/s, respectively. Results demonstrate that the proposed method generates smooth, feasible, and safe trajectories and completes autonomous navigation and mapping tasks across diverse environments.
12:30pm - 12:45pm
CMCL-PR: Cross-Modal Camera-to-LiDAR Place Recognition with Cross-Attention Contrastive Learning
Wuhan University, People's Republic of China
Place recognition is a crucial task for both robots and autonomous vehicles, facilitating positioning and loop closure within pre-built global maps. Although single-modal sensor-based methods have shown satisfactory performance, cross-modal place recognition, in which low-cost camera images are matched against global point cloud databases, remains a significant challenge. In this paper, we propose a lightweight cross-modal place recognition framework based on contrastive learning (CMCL-PR) that matches a single image against a global offline point cloud map. First, we introduce a perspective-projection-based field-of-view (FoV) transformation module that converts point clouds into a modality analogous to images. Second, we design a dual-branch intra-modal encoder built on a shared Transformer, which extracts and aligns image and point cloud features separately, effectively unifying the feature distributions of the two modalities. Third, we construct a cross-attention module guided by inter-modal consistency, which uses the scene context within each modality to generate discriminative cross-modal descriptors. Finally, during contrastive learning, the cross-modal features are enhanced with a multi-term loss comprising a cross-modal contrastive learning loss, an intra-modal consistency loss, and a matching supervision loss. We assess the effectiveness and generalizability of our method on three publicly available datasets: KITTI, KITTI-360, and Oxford RobotCar. The project page and code will be released at https://github.com/qp-li/CMCL-PR. |
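The CMCL-PR abstract above mentions a perspective-projection step that turns point clouds into an image-like modality but gives no details. As a rough illustration only, the sketch below projects 3D points through a plain pinhole camera model and keeps the nearest depth per pixel. The function name `project_to_depth_image`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the choice of a depth image as the output are assumptions made for this example, not the authors' module.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical point type: (x, y, z) in the camera frame, with z pointing forward.
Point = Tuple[float, float, float]


def project_to_depth_image(
    points: List[Point],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> List[List[Optional[float]]]:
    """Project points with a pinhole model; keep the nearest depth per pixel.

    Pixels that receive no point stay None. This is an assumed, simplified
    stand-in for a field-of-view transformation, not the published method.
    """
    depth: List[List[Optional[float]]] = [[None] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0.0:
            continue  # point lies behind the camera, outside the field of view
        u = math.floor(fx * x / z + cx)  # column index
        v = math.floor(fy * y / z + cy)  # row index
        if not (0 <= u < width and 0 <= v < height):
            continue  # projects outside the image bounds
        current = depth[v][u]
        if current is None or z < current:
            depth[v][u] = z  # the closer surface occludes the farther one
    return depth


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 2.0), (0.0, 0.0, 5.0), (1.0, 0.0, 2.0)]
    image = project_to_depth_image(cloud, 2.0, 2.0, 2.0, 2.0, 4, 4)
    for row in image:
        print(row)
```

Keeping only the nearest depth per pixel is a simple way to approximate occlusion; a real pipeline would also need the LiDAR-to-camera extrinsic transform, which is omitted here.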
| 1:30pm - 2:45pm | WG II/4A: AI/ML for Geospatial Data Location: 713A |
1:30pm - 1:45pm
Target Vessel Identification in Aerial Search Imagery via MLLM-Based Attribute Extraction and Geolocation Fusion
Dept. of Geoinformatics, University of Seoul, Seoul, Republic of Korea
Identifying a distressed vessel among many ships detected in wide-area aerial imagery is a critical challenge in maritime Search and Rescue (SAR) operations. Conventional methods cannot determine which vessel matches the incident description, especially when Automatic Identification System (AIS) reports are uncertain. This study proposes an integrated framework that combines semantic attribute extraction based on a Multi-modal Large Language Model (MLLM) with geolocation fusion to prioritize candidate vessels according to their consistency with Situation Report (SITREP) based scenarios. The method detects vessels using YOLOv8, tracks them with Deep Simple Online and Real-time Tracking (DeepSORT), and performs image-based georeferencing using onboard metadata. The MLLM extracts appearance and status attributes from representative vessel images, while scenario descriptions are also converted to attributes. Both sets are encoded using MiniLM embeddings. Finally, semantic similarity is fused with geolocation proximity within a Support Vector Machine (SVM) classifier to produce a probability-ranked list of candidates. Experiments using real aerial search footage demonstrate robust identification performance across a range of scenario quality levels. The correct vessel appears within the top three candidates in more than 73% of cases and within the top five in more than 91%, even when attribute extraction is affected by low resolution, illumination effects, or missing scenario information. These results show that coarse semantic cues, when combined with approximate geolocation, provide a resilient basis for identifying target vessels under high uncertainty. The proposed framework offers a practical foundation for automated SAR decision support, enabling faster and more reliable prioritization during wide-area maritime search operations.
1:45pm - 2:00pm
Zero-shot Vision-Language Reranking for Cross-View Geolocalization
1Dept. of Electrical and Computer Engineering, The Ohio State University, United States of America; 2US Army Engineer Research and Development Center, Geospatial Research Laboratory, United States of America; 3Dept. of Civil Engineering, The Ohio State University, United States of America
Cross-view geolocalization (CVGL) systems, while effective at retrieving a list of relevant candidates (high Recall@k), often fail to identify the single best match (low Top-1 accuracy). This work investigates the use of zero-shot Vision-Language Models (VLMs) as rerankers to address this gap. We propose a two-stage framework: state-of-the-art (SOTA) retrieval followed by VLM reranking. We systematically compare two strategies: (1) Pointwise (scoring candidates individually) and (2) Pairwise (comparing candidates relatively). Experiments on the VIGOR dataset show a clear divergence: all pointwise methods cause a catastrophic drop in performance. In contrast, a pairwise comparison strategy using LLaVA improves Top-1 accuracy over the strong retrieval baseline. Our analysis concludes that VLMs are poorly calibrated for absolute relevance scoring but are effective at fine-grained relative visual judgment, making pairwise reranking a promising direction for enhancing CVGL precision.
2:00pm - 2:15pm
From Pixels to Polygons: Evaluating Vision Foundation Models for High-Resolution Orthophoto Segmentation and Vectorization
University of the Bundeswehr Munich, Germany
Accurate and topologically consistent vector data extraction from aerial imagery is essential for geospatial applications such as urban mapping and the enrichment of volunteered geographic information (VGI) platforms like OpenStreetMap (OSM). While deep learning has advanced automated feature extraction, traditional supervised networks remain constrained by large annotation requirements and limited generalization. Recent vision foundation models offer a promising alternative through promptable, zero-shot segmentation capabilities. This study presents a modular “pixels-to-polygons” workflow for transforming high-resolution orthophotos (20 cm DOP20) into GIS-ready vector data, comprising four stages: data preparation, segmentation, vectorization, and validation. The framework is model-agnostic and designed to integrate different vision foundation models without modification to downstream processing. As a representative instantiation, we evaluate the Segment Anything Model 2 (SAM2) in automatic mask generation mode over the University of the Bundeswehr Munich campus. The resulting segmentation masks are polygonized and compared against OSM reference layers. Preliminary results indicate that vision foundation models can effectively delineate major built-up areas, vegetation, and transport infrastructure without task-specific training, producing geometrically smooth and topologically valid polygons. However, smaller or shadowed objects remain challenging, often leading to partial merging or fragmentation. Future work will include quantitative evaluation using raster-based Intersection-over-Union and vector-domain metrics such as completeness, correctness, and geometric fidelity. The study provides an initial assessment of vision foundation models for high-resolution remote sensing and highlights their potential to bridge the gap between general-purpose segmentation models and GIS-compatible vector data generation.
2:15pm - 2:30pm
Polarization-Aware Segmentation for Camouflaged Threat Detection from UAVs
Department of Earth and Space Science and Engineering, York University
Surface-laid unexploded ordnance (UXO) and landmines constitute a critical humanitarian crisis. While unmanned aerial vehicles (UAVs) provide a scalable remote sensing solution, detecting modern, non-metallic explosive devices in cluttered environments remains a profound Camouflaged Object Detection (COD) challenge. Traditional optical sensors frequently suffer from foreground-background confusion when a target's texture mimics its surroundings. To overcome these physical bottlenecks, we introduce XPol-Net, a novel multimodal architecture synergizing the semantic reasoning of Vision Transformers with the deterministic physics of polarimetric imaging. Built on a hierarchical PVTv2 backbone, XPol-Net utilizes a progressive Dual Cross-Attention Strategy for effective modality fusion. In early stages, Channel Cross-Attention (CCA) filters material-specific Degree of Linear Polarization (DoLP) cues to suppress background clutter. In deeper stages, Spatial Cross-Attention (SCA) dynamically aligns high-level RGB semantics with strict structural boundaries. To enhance robustness and prevent modality collapse, we deploy a multi-task auxiliary learning framework that reconstructs the continuous Angle of Linear Polarization (AoLP) map. On the PCOD benchmark, XPol-Net achieves state-of-the-art results in global structural alignment (E_phi of 0.980 and 0.984 at 352 x 352 and 704 x 704, respectively). While minor trade-offs are observed in localized metrics such as S_alpha or F_beta, XPol-Net remains highly competitive, consistently delivering superior results in E_phi and MAE. By prioritizing structural recall over localized strictness, XPol-Net ensures the complete discovery of concealed targets, establishing a reliable, physics-aware foundation for humanitarian demining operations. |
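The XPol-Net abstract above relies on two polarimetric quantities, the Degree of Linear Polarization (DoLP) and the Angle of Linear Polarization (AoLP), without stating how they are obtained. The sketch below shows the standard textbook computation from the linear Stokes parameters. It assumes intensities measured behind polarizers at 0°, 45°, 90°, and 135°, which may not match the sensor used in the paper. The function names are hypothetical and the code is not part of XPol-Net.

```python
import math
from typing import Tuple


def stokes_from_intensities(
    i0: float, i45: float, i90: float, i135: float
) -> Tuple[float, float, float]:
    """Linear Stokes parameters from four polarizer-angle intensities (assumed input)."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical preference
    s2 = i45 - i135                     # +45 deg vs. -45 deg preference
    return s0, s1, s2


def dolp_aolp(i0: float, i45: float, i90: float, i135: float) -> Tuple[float, float]:
    """Return (DoLP in [0, 1], AoLP in radians) for one pixel."""
    s0, s1, s2 = stokes_from_intensities(i0, i45, i90, i135)
    if s0 <= 0.0:
        return 0.0, 0.0  # no light: polarization is undefined, report zeros
    dolp = math.hypot(s1, s2) / s0
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp


if __name__ == "__main__":
    # Fully polarized light at 45 degrees, unit total intensity.
    print(dolp_aolp(0.5, 1.0, 0.5, 0.0))
```

Because AoLP is an angle that wraps around, a network that regresses it (as the auxiliary task in the abstract does) typically needs a representation or loss that respects that periodicity; how XPol-Net handles this is not stated in the abstract.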

