If you have any problems, please contact the organizers at isprs2026@icsevents.com.

Conference Agenda


Agenda Overview

Location: 713A
125 theatre

Date: Saturday, 04-July-2026

8:30am - 5:00pm

TuT1: 3D Reconstruction from Multi-View Satellite Imagery: From Classic to Modern Methods
Location: 713A

Date: Sunday, 05-July-2026

8:30am - 12:00pm

TuT10: Open-source Scientific Software py4dgeo for Change Analysis in 3D/4D Point Clouds
Location: 713A

12:00pm - 1:15pm

ICWG I/IV: Robotics for Mapping and Machine Intelligence
Location: 713A

12:00pm - 12:15pm

A category-specific prompt strategy for semantic 3D indoor mapping using RGB-D camera

Jiwei Hou¹, Vivien Volland², Samer Karam¹, Dorota Iwaszczuk¹

¹Remote Sensing and Image Analysis, Department of Civil and Environmental Engineering, Technical University of Darmstadt, Germany; ²Geodetic Measurement Systems and Sensor Technology, Department of Civil and Environmental Engineering, Technical University of Darmstadt, Germany

Semantic 3D indoor mapping often depends on supervised learning and large annotated datasets, limiting scalability across diverse environments. This work introduces a category-specific prompt strategy for semantic 3D mapping using RGB-D cameras, integrating RGB-D SLAM with the Segment Anything Model 2 (SAM2) to enable annotation-efficient reconstruction. Keyframes and trajectories extracted from SLAM provide spatial references, while SAM2 performs zero-shot segmentation guided by a Category-Specific Prompt Strategy (CPSS), which segments structural and functional elements (e.g., floors, doors, staircases) by category to reduce prompt interference and manual effort. The segmented keyframes are then fused with depth and pose data to produce instance-level semantic point clouds. Experiments on custom RGB-D sequences and selected ScanNet scenes demonstrate centimeter-level geometric accuracy and strong semantic consistency, with mIoU values up to 0.89 on the custom dataset and 0.98 on ScanNet. The resulting semantic point clouds are clean, structured, and require minimal post-processing, showing that the proposed strategy provides an efficient and scalable solution for semantic 3D indoor mapping without retraining or environment-specific supervision.
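The fusion step described above (segmented keyframes combined with depth and pose) can be illustrated with a minimal sketch. It assumes a pinhole camera model and a 4x4 camera-to-world pose; the function name `fuse_keyframe` and its parameters are hypothetical and not taken from the paper.

```python
def fuse_keyframe(depth, labels, fx, fy, cx, cy, pose):
    """Back-project one labeled RGB-D keyframe into world coordinates.

    depth  : rows of depth values in metres (0 or negative means invalid)
    labels : rows of integer ids taken from the segmentation mask
    pose   : 4x4 camera-to-world matrix as nested lists (row-major)
    Returns a list of (x, y, z, label) tuples.
    """
    points = []
    for v, (depth_row, label_row) in enumerate(zip(depth, labels)):
        for u, (z, label) in enumerate(zip(depth_row, label_row)):
            if z <= 0:
                continue  # skip pixels without a valid depth reading
            # Pinhole back-projection into the camera frame (assumed model).
            xc = (u - cx) * z / fx
            yc = (v - cy) * z / fy
            cam = (xc, yc, z, 1.0)
            # Apply the first three rows of the camera-to-world transform.
            xw, yw, zw = (
                sum(pose[r][k] * cam[k] for k in range(4)) for r in range(3)
            )
            points.append((xw, yw, zw, label))
    return points
```

Calling this once per keyframe and concatenating the results would give a labeled point cloud; any filtering or instance merging the authors apply is not shown.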

12:15pm - 12:30pm

3L-Planner: Lightweight LiDAR mapping and real-time local planning for ground robot autonomous navigation

Wenlei Fan, Zhenqi Zheng, Mingyang Zhou, You Li

State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, China

Mobile robots are widely used in unmanned surveying, warehouse logistics, and emergency response. However, achieving safe, reliable, and efficient autonomous navigation in unknown environments remains challenging, where accurate environment representation and feasible trajectory planning are crucial. This paper presents an autonomous navigation method integrating lightweight LiDAR mapping with real-time local planning for ground robots. At the perception level, an incremental single-frame point cloud update is used to accumulate and project locally traversable space, producing a lightweight obstacle map that preserves geometric accuracy while reducing planning computation. At the planning level, A* is employed to generate reference control points, and uniform B-spline curves are used to optimize the trajectory while enforcing kinematic feasibility and smoothness. At the control level, nonlinear model predictive control (NMPC) ensures accurate trajectory tracking by producing control commands that satisfy velocity and acceleration constraints. The framework also supports low-cost evaluation in simulation. Experiments in simulated forests, simulated indoor corridors, and real-world gardens and hallways show average navigation speeds of 2.24 m/s, 0.76 m/s, 0.43 m/s, and 0.38 m/s, respectively. Results demonstrate that the proposed method generates smooth, feasible, and safe trajectories and completes autonomous navigation and mapping tasks across diverse environments.
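As an illustration of the planning-level idea (reference control points smoothed by a uniform B-spline), the following sketch evaluates a uniform cubic B-spline over a list of control points. The cubic degree, the function names, and the sampling scheme are assumptions; the abstract does not state them, and the optimization of control points is not shown.

```python
def cubic_bspline_point(p0, p1, p2, p3, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1]."""
    b0 = (1 - t) ** 3 / 6.0
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6.0
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6.0
    b3 = t**3 / 6.0
    return tuple(
        b0 * a + b1 * b + b2 * c + b3 * d for a, b, c, d in zip(p0, p1, p2, p3)
    )


def sample_bspline(control_points, samples_per_segment=10):
    """Sample a smooth path from control points (e.g. waypoints from A*)."""
    path = []
    # Each consecutive group of four control points defines one segment.
    for i in range(len(control_points) - 3):
        segment = control_points[i:i + 4]
        for s in range(samples_per_segment):
            path.append(cubic_bspline_point(*segment, s / samples_per_segment))
    return path
```

The curve does not pass through the control points in general; it stays inside their convex hull, which is one reason B-splines are a common choice for smoothing grid-search paths.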

12:30pm - 12:45pm

CMCL-PR: Cross-Modal Camera-to-LiDAR Place Recognition with Cross-Attention Contrastive Learning

Qipeng Li, Yuan Zhuang

Wuhan University, People's Republic of China

Place recognition is a crucial task for both robots and autonomous vehicles, facilitating positioning and loop closure within pre-built global maps. Although single-modal sensor-based methods have shown satisfactory performance, cross-modal place recognition, i.e., retrieving low-cost camera images from global point cloud databases, remains a significant challenge. In this paper, we propose a lightweight cross-modal place recognition framework based on contrastive learning (CMCL-PR) to retrieve a single image from a global offline point cloud map. First, we introduce a perspective-projection-based field-of-view (FoV) transformation module that converts point clouds into a modality analogous to images. Then, we design a dual-branch intra-modal encoder based on a shared Transformer, which extracts and aligns image and point cloud features separately, effectively unifying the feature distributions of the two modalities. In addition, we construct a cross-attention module guided by inter-modal consistency, which uses the scene context information within each modality to generate discriminative cross-modal descriptors. Finally, during contrastive learning, the cross-modal features are enhanced, and a multi-term loss function is constructed, comprising a cross-modal contrastive learning loss, an intra-modal consistency loss, and a matching supervision loss. We assess the effectiveness and generalizability of our method using three publicly available datasets: KITTI, KITTI-360, and Oxford RobotCar. The project page and code will be released at https://github.com/qp-li/CMCL-PR.

1:30pm - 2:45pm

WG II/4A: AI/ML for Geospatial Data
Location: 713A

1:30pm - 1:45pm

Target Vessel Identification in Aerial Search Imagery via MLLM-Based Attribute Extraction and Geolocation Fusion

Jeonghyo Oh, Youngon Oh, Impyeong Lee

Dept. of Geoinformatics, University of Seoul, Seoul, Republic of Korea

Identifying a distressed vessel among many ships detected in wide-area aerial imagery is a critical challenge in maritime Search and Rescue (SAR) operations. Conventional methods cannot determine which vessel matches the incident description, especially when Automatic Identification System (AIS) reports are uncertain. This study proposes an integrated framework that combines MLLM-based semantic attribute extraction with geolocation fusion to prioritize candidate vessels according to their consistency with Situation Report (SITREP) based scenarios. The method detects vessels using YOLOv8, tracks them with Deep Simple Online and Real-time Tracking (DeepSORT), and performs image-based georeferencing using onboard metadata. A Multi-modal Large Language Model (MLLM) extracts appearance/status attributes from representative vessel images, while scenario descriptions are also converted to attributes. Both sets are encoded using MiniLM embeddings. Finally, semantic similarity is fused with geolocation proximity within a Support Vector Machine (SVM) classifier to produce a probability-ranked list of candidates. Experiments using real aerial search footage demonstrate robust identification performance across a range of scenario quality levels. The correct vessel appears within the top three candidates in more than 73% of cases and within the top five in more than 91%, even when attribute extraction is affected by low resolution, illumination effects, or missing scenario information. These results show that coarse semantic cues, when combined with approximate geolocation, provide a resilient basis for identifying target vessels under high uncertainty. The proposed framework offers a practical foundation for automated SAR decision support, enabling faster and more reliable prioritization during wide-area maritime search operations.
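The two cues that are fused (semantic similarity and geolocation proximity) can be sketched as a per-candidate feature pair. This is only an illustration: cosine similarity and great-circle distance are assumed as the concrete measures, all function names are hypothetical, and the SVM classifier that consumes such features is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def candidate_features(scenario_vec, vessel_vec, reported_pos, vessel_pos):
    """Feature pair for one candidate: (semantic similarity, distance in km)."""
    return (
        cosine_similarity(scenario_vec, vessel_vec),
        haversine_km(*reported_pos, *vessel_pos),
    )
```

A classifier trained on such pairs could then output a match probability per vessel, which is what allows the candidates to be ranked.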

1:45pm - 2:00pm

Zero-shot Vision-Language Reranking for Cross-View Geolocalization

Yunus Talha Erzurumlu¹, John E. Anderson², William J. Shuart², Charles Toth³, Alper Yilmaz³

¹Dept. of Electrical and Computer Engineering, The Ohio State University, United States of America; ²US Army Engineer Research and Development Center, Geospatial Research Laboratory, United States of America; ³Dept. of Civil Engineering, The Ohio State University, United States of America

Cross-view geolocalization (CVGL) systems, while effective at retrieving a list of relevant candidates (high Recall@k), often fail to identify the single best match (low Top-1 accuracy). This work investigates the use of zero-shot Vision-Language Models (VLMs) as rerankers to address this gap. We propose a two-stage framework: state-of-the-art (SOTA) retrieval followed by VLM reranking. We systematically compare two strategies: (1) Pointwise (scoring candidates individually) and (2) Pairwise (comparing candidates relatively). Experiments on the VIGOR dataset show a clear divergence: all pointwise methods cause a catastrophic drop in performance. In contrast, a pairwise comparison strategy using LLaVA improves Top-1 accuracy over the strong retrieval baseline. Our analysis concludes that VLMs are poorly calibrated for absolute relevance scoring but are effective at fine-grained relative visual judgment, making pairwise reranking a promising direction for enhancing CVGL precision.
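A minimal sketch of pairwise reranking, assuming a round-robin comparison of the retrieved candidates. The callable `prefer` is a hypothetical stand-in for a VLM judgment ("is candidate a a better match than candidate b?"); the actual prompting and comparison schedule used in the paper are not specified in the abstract.

```python
from itertools import combinations


def pairwise_rerank(candidates, prefer):
    """Rerank retrieved candidates by counting pairwise wins.

    candidates : list of unique, hashable candidate ids in retrieval order
    prefer     : callable (a, b) -> True if a is judged the better match
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        winner = a if prefer(a, b) else b
        wins[winner] += 1
    # sorted() is stable, so ties keep the original retrieval order.
    return sorted(candidates, key=lambda c: -wins[c])
```

Only relative judgments are needed here, so a poorly calibrated absolute score never enters the ranking. The cost is a quadratic number of comparisons in the list length, which is why reranking is usually restricted to a short top-k list.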

2:00pm - 2:15pm

From Pixels to Polygons: Evaluating Vision Foundation Models for High-Resolution Orthophoto Segmentation and Vectorization

Islam Mansour, Michael Schmitt

University of the Bundeswehr Munich, Germany

Accurate and topologically consistent vector data extraction from aerial imagery is essential for geospatial applications such as urban mapping and the enrichment of volunteered geographic information (VGI) platforms like OpenStreetMap (OSM). While deep learning has advanced automated feature extraction, traditional supervised networks remain constrained by large annotation requirements and limited generalization. Recent vision foundation models offer a promising alternative through promptable, zero-shot segmentation capabilities.

This study presents a modular “pixels-to-polygons” workflow for transforming high-resolution orthophotos (20 cm DOP20) into GIS-ready vector data, comprising four stages: data preparation, segmentation, vectorization, and validation. The framework is model-agnostic and designed to integrate different vision foundation models without modification to downstream processing. As a representative instantiation, we evaluate the Segment Anything Model 2 (SAM2) in automatic mask generation mode over the University of the Bundeswehr Munich campus.

The resulting segmentation masks are polygonized and compared against OSM reference layers. Preliminary results indicate that vision foundation models can effectively delineate major built-up areas, vegetation, and transport infrastructure without task-specific training, producing geometrically smooth and topologically valid polygons. However, smaller or shadowed objects remain challenging, often leading to partial merging or fragmentation.

Future work will include quantitative evaluation using raster-based Intersection-over-Union and vector-domain metrics such as completeness, correctness, and geometric fidelity. The study provides an initial assessment of vision foundation models for high-resolution remote sensing and highlights their potential to bridge the gap between general-purpose segmentation models and GIS-compatible vector data generation.

2:15pm - 2:30pm

Polarization-Aware Segmentation for Camouflaged Threat Detection from UAVs

Youssef Korny, Sunghwan Yoo, Gunho Sohn

Department of Earth and Space Science and Engineering, York University

Surface-laid unexploded ordnance (UXO) and landmines constitute a critical humanitarian crisis. While unmanned aerial vehicles (UAVs) provide a scalable remote sensing solution, detecting modern, non-metallic explosive devices in cluttered environments remains a profound Camouflaged Object Detection (COD) challenge. Traditional optical sensors frequently suffer from foreground-background confusion when a target's texture mimics its surroundings. To overcome these physical bottlenecks, we introduce XPol-Net, a novel multimodal architecture synergizing the semantic reasoning of Vision Transformers with the deterministic physics of polarimetric imaging. Built on a hierarchical PVTv2 backbone, XPol-Net utilizes a progressive Dual Cross-Attention Strategy for effective modality fusion. In early stages, Channel Cross-Attention (CCA) filters material-specific Degree of Linear Polarization (DoLP) cues to suppress background clutter. In deeper stages, Spatial Cross-Attention (SCA) dynamically aligns high-level RGB semantics with strict structural boundaries. To enhance robustness and prevent modality collapse, we deploy a multi-task auxiliary learning framework that reconstructs the continuous Angle of Linear Polarization (AoLP) map. On the PCOD benchmark, XPol-Net achieves state-of-the-art results in global structural alignment (E_phi of 0.980 and 0.984 at 352 x 352 and 704 x 704, respectively). While minor trade-offs are observed in localized metrics such as S_alpha or F_beta, XPol-Net remains highly competitive, consistently delivering superior results in E_phi and MAE. By prioritizing structural recall over localized strictness, XPol-Net ensures the complete discovery of concealed targets, establishing a reliable, physics-aware foundation for humanitarian demining operations.

Date: Monday, 06-July-2026

8:30am - 10:00am

WG III/1A: Remote Sensing Data Processing and Understanding
Location: 713A

8:30am - 8:45am

Cube Kernel: A Novel Approach to Enable Local Gradient Flow Across Channels in CNNs

Zhimeng HE, Yuwei Cai, Meiliu Wu, Xinyan Xian, Brian Barrett

University of Glasgow, United Kingdom

Understanding inter-band and cross-channel relationships is essential for human color perception and object recognition. Yet, local gradients in standard convolutions are tied to fixed input–output channel pairs, and thus channels are fused by a dense, fully-coupled weight tensor: each output channel aggregates all input channels in a uniform way at every spatial location. This leads to heavy computation and does not exploit structured sparsity or selective local channel mixing. To overcome this limitation, we introduce Cube Kernel, a novel convolutional operator that introduces structured cross-channel groups into the local gradient. This design strengthens cross-channel feature fusion, improves optimization efficiency, and reduces computational overhead. Extensive building extraction experiments validate its effectiveness: Cube Kernel consistently outperforms standard convolutions and Involution when integrated into UNet, and replacing a single layer in DeepLabV3+, Swin-UNet, or UNet leads to consistent performance gains. Beyond serving as a lightweight plug-in module, Cube Kernel also scales effectively as a fundamental building block. A Cube-enhanced ConvNeXt variant, ConvNeXt-Cube, achieves state-of-the-art performance across all models (0.9095 IoU / 0.9535 F1 on WBD and 0.9133 IoU / 0.9547 F1 on WHU), demonstrating strong stackability and architectural potential. These results highlight a largely overlooked space in CNN design: enhancing cross-channel interaction at the gradient level. Cube Kernel offers a scalable and efficient alternative to deepen networks for channel mixing, laying a foundation for future advancements in convolutional architecture design.

8:45am - 9:00am

Land Surface Dynamics Modeling and Prediction with dual Latent-Space Representations

Keli Shi¹,², Zheng Zhang¹, Liang Tang³, Wenhe Xu⁴, Xiaojun Shan¹, Ping Tang¹

¹Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China; ²School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, China; ³Yazhou Bay Innovation Institute, Hainan Tropical Ocean University, Sanya, China; ⁴The University of Hong Kong, Hong Kong, China

Modeling land surface dynamics from satellite observations is crucial for revealing change patterns and predicting future states, although effective modeling methods remain limited.

For complex systems such as reaction-diffusion, two approaches have proven particularly effective:

(i) Direct modeling in the high-dimensional observation space with deep networks (e.g., Wang et al., 2022). These methods are often autoregressive, so errors accumulate during rolling extrapolation.

(ii) Modeling in a reduced-dimensional latent space (e.g., Chen et al., 2022). One reduces dimension and then learns the evolution. Some works estimate the intrinsic dimension (ID) and model in the ID latent space.

This improves long-term stability, but reliance on latent representations may reduce accuracy.

This route is promising if two issues are addressed:

(1) effectively modeling multi-scale spatiotemporal data with long sampling intervals;

(2) combining ID-space modeling with other latent dimensions to balance accuracy and stability.

This paper proposes a Dual Latent-Space Representation-based Land Surface Dynamic Model (DLS-LSDM). The core contributions are:

(1) a stacked-convolution and multi-scale linear-attention autoencoder to obtain a base latent, together with ID estimation to derive an ID latent;

(2) a long-horizon scheme that combines ID and base latents to achieve both stability and high accuracy;

(3) comprehensive evaluation on ten-year MODIS NDVI across multiple climate zones, demonstrating superiority.

9:00am - 9:15am

Revealing Feature Contribution Mechanisms for Interpretable CNN-Transformer Remote Sensing Classification

He Chen¹, Xianwei Zheng¹, Wei He¹, Jiansi Yang¹, Linwei Yue², Ting Hu³, Jianya Gong¹

¹Wuhan University; ²China University of Geosciences; ³Nanjing University of Information Science and Technology

Deep learning models have become the backbone of intelligent remote sensing image classification, enabling high-precision recognition of land cover, geospatial objects, and scene categories. However, their inherent "black-box" nature, in which decision logic is embedded in complex parameter spaces, poses critical barriers to deployment in high-stakes domains such as military reconnaissance, disaster monitoring, and environmental governance. These fields demand transparent reasoning to validate model reliability, yet traditional interpretability methods suffer from two key limitations when applied to remote sensing data. First, they are primarily designed for natural images and fail to account for remote sensing-specific characteristics. Second, they focus on local feature attribution or saliency mapping but lack quantitative analysis of how core image features (shape, texture, spectrum) contribute to global classification decisions, especially across different network architectures. To address these problems, this study proposes a comprehensive feature contribution analysis framework tailored to remote sensing images, with three core objectives: (1) decoupling and extracting shape, texture, and spectrum features from remote sensing images in a physically meaningful manner; (2) quantifying the contribution of each feature type to classification decisions; and (3) revealing differences in feature processing mechanisms between CNN and Transformer architectures.

9:15am - 9:30am

EfficientViM-CD: An Efficient Remote Sensing Change Detection Network Based on Hidden State-Mixer

Haiming Zhang¹, Hongyang Fan²

¹State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, People's Republic of China; ²School of Information Science and Engineering, Wuchang Shouyi University

High-resolution optical remote sensing change detection (CD) is of great significance in urban evolution monitoring, disaster assessment, and land management. Traditional deep models often face computational, memory, and inference latency bottlenecks when processing large high-resolution imagery. To address this, we propose EfficientViM-CD: a Hidden-State Mixer based efficient remote sensing change detection network. The approach builds upon the EfficientViM backbone, migrating global interaction operations into a compact hidden state space and leveraging Hidden State Mixer based on state space duality (HSM-SSD) to fuse global context while reducing computational complexity. We employ a Siamese encoding architecture to extract multi-scale features and hidden states from paired temporal images, and utilize a Cross-Hidden Fusion module to integrate hidden semantic interactions between time points. At each scale, local difference features are computed and enhanced in hidden state space, and a multi-scale decoder reconstructs a pixel-level change probability map. We conducted experiments on four public datasets (LEVIR-CD+, WHU-CD, S2Looking, SVCD) and compared against nine state-of-the-art methods. Results demonstrate that EfficientViM-CD achieves competitive accuracy while delivering significant advantages in inference speed and memory efficiency. This method offers a lightweight, efficient, and scalable solution for high-resolution remote sensing change detection, with potential for real-time monitoring and emergency response systems.

9:30am - 9:45am

Local NMS: Enhancing Object Detection in Large-Scale Remote Sensing Images via iterative pipelined Postprocessing

Bettina Felten, Wolfgang Gross, Andreas Michel

Fraunhofer IOSB, Germany

Object detection in large, dense remote sensing imagery is difficult because targets are often small and arbitrarily oriented, and state-of-the-art detectors cannot process very large images directly without a reduction in accuracy. Tiling-based inference workflows mitigate the latter issue by running inference iteratively on overlapping tiles, but introduce pre- and postprocessing overhead for image tiling and Non-Maximum Suppression (NMS). We introduce local NMS, an asynchronous tile-wise postprocessing scheme. Local NMS runs in a separate subprocess in parallel to tile-wise inference and collects intermediate results enqueued by the inference process, immediately applying postprocessing. Intelligent reordering of tiles in a preprocessing step ensures optimal usage of computing resources. We assess our method using three state-of-the-art object detection models for horizontal and oriented bounding box detection on two benchmark datasets containing large dense aerial and satellite images, DOTA-v2.0 and Izembek Lagoon Birds, stratifying by image size and average object density. Local NMS consistently reduces end-to-end runtime across models and datasets without significant impact on mAP. A maximum runtime reduction of 60.77% on large dense DOTA-v2.0 scenes was achieved without modifying model architectures or retraining.
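For context, the suppression step that tiling-based workflows apply to duplicate detections from overlapping tiles can be sketched as standard greedy NMS on axis-aligned boxes. This is the textbook algorithm, not the asynchronous local NMS scheme proposed in the paper; the function names and the default threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept after greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

Because each decision only depends on boxes that overlap spatially, suppression can in principle be applied to groups of neighbouring tiles as soon as their detections are available, rather than once over the whole image.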

9:45am - 10:00am

ERD: Extended RAW-Diffusion Framework for De-rendering sRGB Images

Jiaqi Shang¹, Yifan Qu¹, Jianbo Qi²

¹Department of Computer Science, University of Toronto, Canada; ²Faculty of Geographical Science, Beijing Normal University, China

Recovering RAW sensor measurements from rendered sRGB images is important for radiometric calibration, low-level vision, and computational photography. However, reversing a camera’s proprietary Image Signal Pipeline (ISP) is highly challenging, especially when the ISP is unknown. Existing inverse-ISP and diffusion-based approaches have several issues: they depend on known ISPs from the sensor, require one model per sensor, or generalize poorly across camera brands.

This work presents ERD (Extended RAW-Diffusion), a unified diffusion-model framework that de-renders any given sRGB image into RAW format without requiring a known ISP or camera information from the image. ERD extends the RAW-Diffusion architecture by incorporating camera metadata only during training, allowing the model to learn a shared representation across heterogeneous sensors. To capture global sensor characteristics, ERD introduces a conditioning mechanism, Feature-wise Linear Modulation (FiLM), for global features such as CFA patterns and color gains. To enhance structural consistency, ERD integrates a ControlNet branch that injects edge and gradient priors derived from the sRGB input, stabilizing RAW reconstruction under diverse tone-mapping operations. For practical adaptation to new sensors, ERD supports efficient few-shot tuning via LoRA.
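Feature-wise Linear Modulation applies a per-channel scale and shift to a feature map. The sketch below shows only that affine step on nested lists; in a FiLM layer the scale and shift are predicted by a conditioning network, which is omitted here, and the function name `film` is hypothetical.

```python
def film(features, gamma, beta):
    """Apply FiLM: scale and shift every value of channel c by gamma[c], beta[c].

    features : list of channels, each a flat list of activations
    gamma    : per-channel scale factors
    beta     : per-channel offsets
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("one gamma and one beta are required per channel")
    return [
        [g * value + b for value in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]
```

Since the modulation is constant across spatial positions, it suits global conditioning signals, while spatially varying cues need a different injection path.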

Evaluations on Adobe FiveK (Nikon and Canon) and RAW-NOD (Nikon and Sony) show that ERD outperforms state-of-the-art baselines in PSNR and SSIM, offering improved robustness to unseen camera models. ERD enables a practical, general-purpose inverse ISP process across heterogeneous imaging devices.

1:30pm - 3:00pm

WG II/2A: Point Cloud Generation and Processing
Location: 713A

1:30pm - 1:45pm

LGSSM: Local-to-global state space model for serialized point cloud semantic segmentation

Hao Wu, Li Yan, Huchen Li, Qimeng Li, Longze Zhu, Junjie Yuan, Hong Xie

School of Geodesy and Geomatics, Hubei Luojia Laboratory, Wuhan University

Point clouds have become essential data for describing real-world objects. Accurate and efficient 3D semantic segmentation plays a crucial role in environment understanding and scene reconstruction. However, current segmentation methods still face challenges from unordered data, high computational complexity, limited scene perception, and insufficient generalization. To address these issues, we propose a local-to-global semantic segmentation method based on a state-space model (LGSSM). Specifically, the proposed method uses three-dimensional serialization encoding to serialize point clouds along the x, y, and z directions, effectively addressing the inherent disorder of point clouds and enhancing spatial representation. Then, the local state space model extracts fine-grained local geometric structural information and the global state space model captures the overall scene representation, improving the modeling ability for both short and long distances. Finally, the serialized context aggregation module is utilized to fuse contextual features to promote spatial semantic consistency. Extensive experiments conducted on ScanNet, ScanNet200, and S3DIS demonstrate that our model achieves state-of-the-art segmentation accuracy compared with other existing methods.
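One simple way to turn an unordered point cloud into sequences along different axes is to sort voxelized coordinates with a different axis priority per sequence. The sketch below assumes this lexicographic ordering; the paper's actual serialization encoding may differ, and the function name and `voxel` parameter are hypothetical.

```python
import math


def serialize(points, axis_order, voxel=1.0):
    """Return point indices ordered along the given axis priority.

    points     : list of (x, y, z) tuples
    axis_order : e.g. (0, 1, 2) sorts by x first, then y, then z
    voxel      : grid size used to quantize coordinates before sorting
    """
    def key(i):
        p = points[i]
        return tuple(math.floor(p[a] / voxel) for a in axis_order)

    return sorted(range(len(points)), key=key)
```

Running it with `(0, 1, 2)`, `(1, 2, 0)`, and `(2, 0, 1)` gives three orderings of the same points, each keeping neighbours along a different primary direction close together in the sequence.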

1:45pm - 2:00pm

Hierarchical Gaussian Partitioning for Semantic Segmentation of Airborne LiDAR Scenes

Moussa Bendjilali¹,², Nicola Luminari¹, Pierre Alliez²

¹Alteia, France; ²Inria Sophia-Antipolis, France

In this paper, we present a novel approach to semantic segmentation of airborne LiDAR point clouds that integrates a hierarchical Gaussian Mixture Model (hGMM) within the Superpoint Transformer (SPT) framework. The hGMM constructs a coarse-to-fine representation of the scene by recursively fitting Gaussian components to spatially coherent subsets of the point cloud, resulting in a hierarchical and structured decomposition that serves as a structured token set for the segmentation objective. While Gaussian Mixture Models (GMMs) can virtually fit any distribution, we constrain their use to structured suburban scenes, where their parametric form is naturally suited to represent planar and ellipsoidal geometries, hence allowing parsimonious mixtures. Experimental results on the DALES benchmark demonstrate that our method achieves competitive performance with respect to state-of-the-art approaches, with notable improvements on classes such as ground and buildings. Results on indoor S3DIS confirm the method's intended specificity to outdoor environments. These findings validate hGMM as a principled and effective alternative to heuristic partitioning techniques, integrating stochastic modelling with transformer-based semantic reasoning in large-scale 3D environments.

2:00pm - 2:15pm

MCPF-Net: Multi-stage LiDAR-Image Collaborative Perception Fusion Network for Point Cloud Semantic Segmentation in Urban Scenes

Huchen Li¹, Wubiao Huang¹, Haibing Liu¹, Shihan Chen¹, Bin Liu², Fei Deng¹,³

¹School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China; ²Hinton STAI Institute, East China Normal University, Minhang, Shanghai 200241, China; ³Hubei Luojia Laboratory, Wuhan 430079, China

Point cloud semantic segmentation through multi-modal fusion provides a fundamental basis for surface observation and visual perception tasks. LiDAR provides precise geometric structural information, while optical images offer rich semantic and textural details. However, existing fusion methods still suffer from limited cross-modal perception and insufficient information complementarity. To address these challenges, we propose a multi-stage LiDAR-image collaborative perception fusion network (MCPFNet) for point cloud semantic segmentation in urban scenes. In the mid-stage, the network introduces a geometric-aware fusion module (GAFM) and a semantic-aware fusion module (SAFM) to achieve bi-directional injection of structural and semantic features between LiDAR and image modalities. In the later stage, an adaptive feature fusion module (AFFM) is designed to refine semantic representations through gated weighting and bi-directional attention mechanisms. Extensive experiments demonstrate that MCPFNet achieves the best mIoU scores of 74.51%, 72.10%, and 95.15% on the ISPRS Vaihingen, FRACTAL, and N3C datasets, respectively, validating its superior performance in multi-modal semantic segmentation.

2:15pm - 2:30pm

Cross-Sensor Robustness and Spatial Generalization for 3D Railway Point Cloud Semantic Segmentation

Arshia Ghasemlou, Mario Soilán, Jesús Balado, Belén Riveiro

CINTECX, GeoTECH group, Universidade de Vigo

This contribution investigates the cross-sensor and spatial generalization of deep learning methods for 3D semantic segmentation in railway environments. Although current models achieve high accuracy on large benchmark datasets, their robustness under real-world acquisition variability remains insufficiently understood. To address this gap, three state-of-the-art architectures—Point Transformer v3, Swin3D, and MinkUNet—were trained on the SemanticRail3D dataset and evaluated on a newly acquired 120-m railway section captured with three heterogeneous LiDAR systems: a Faro Focus S150+ terrestrial laser scanner, a CHCNAV RS10 handheld device, and a GeoSLAM ZEB Go SLAM-based scanner.

The case-study point clouds were carefully registered, normalized, voxelized, and manually annotated to provide consistent ground truth across sensors. A standardized preprocessing and test-time augmentation pipeline was applied to ensure compatibility with the training domain. Generalization performance was analysed through per-class IoU, cross-model agreement, and sensor-dependent degradation patterns. Results show significant variability across acquisition platforms, with denser, low-noise scans enabling better transferability, while sparser SLAM-based point clouds remain challenging for thin or small components such as overhead wires.

To mitigate cross-sensor variability, an IoU-weighted ensemble strategy was introduced, leveraging complementary model strengths without requiring retraining. This ensemble consistently improved or matched the performance of individual models on the case-study datasets.
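An IoU-weighted ensemble can be sketched as per-point weighted voting, where each model's vote counts in proportion to that model's IoU for the class it predicts. The exact weighting used in the study is not given in the abstract, so this scheme, the function name, and the data layout are assumptions.

```python
def iou_weighted_vote(predictions, class_iou):
    """Fuse per-point labels from several models by IoU-weighted voting.

    predictions : one list of per-point labels per model
    class_iou   : one dict per model mapping label -> IoU of that model
                  for that class (e.g. measured on a validation set)
    """
    n_points = len(predictions[0])
    fused = []
    for i in range(n_points):
        scores = {}
        for model_preds, model_iou in zip(predictions, class_iou):
            label = model_preds[i]
            # A vote is worth the model's IoU for the class it predicts.
            scores[label] = scores.get(label, 0.0) + model_iou.get(label, 0.0)
        fused.append(max(scores, key=scores.get))
    return fused
```

Unlike a plain majority vote, a single model that is reliable on a difficult class, such as thin overhead wires, can override two models that are weak on the class they predict.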

Overall, the study demonstrates the importance of evaluating semantic segmentation models under realistic multi-sensor conditions and provides a practical benchmark and methodology for assessing domain-shift effects in railway point clouds.

2:30pm - 2:45pm

Revisiting NeRF for Street Scene Point Cloud Semantic Segmentation in the Era of 3DGS

Yuzhou Zhou

University of Oxford, United Kingdom

Accurate semantic segmentation of urban point clouds is fundamental for autonomous driving and city mapping. Recent advances in neural scene representations, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have significantly improved photorealistic reconstruction quality. However, 3DGS is primarily designed for small-scale, object-centric scenes with dense viewpoints, and its optimization becomes sub-optimal in large-scale street scenes with trajectory-constrained observations, leading to semantic errors and distorted geometry.

In this work, we revisit NeRF-based scene representation in the era of 3DGS to address these challenges. Our method integrates the explicit and efficient modeling strategy of 3DGS with the surface-constrained sampling nature of NeRF. Specifically, we employ Deformable Neural Mesh Primitives (DNMPs) to jointly encode geometry and semantics, enabling efficient ray–mesh intersection sampling and neural field interpolation. This formulation achieves 3D-annotation-free point cloud semantic segmentation by leveraging rendered image supervision.

Experiments on the KITTI-360 dataset demonstrate that our approach surpasses the Street Gaussians baseline in overall mIoU and across most semantic categories. The improvement mainly stems from reducing semantic errors caused by limited viewpoints during 3D Gaussian optimization, providing a robust and scalable solution for street scene semantic understanding.

2:45pm - 3:00pm

Extraction of Pole-like Road Objects from MMS Point Clouds Using Deep Learning and Geometric-Topological Feature Fusion

Shu Su, Masataka Shirai, Hiroyuki Yokota

AERO TOYOTA CORPORATION, Japan

This paper presents a fusion framework for the automatic extraction of pole-like road objects—such as traffic lights, road signs, streetlights, and utility poles—from Mobile Mapping System (MMS) point clouds. The proposed method integrates KPConv-based semantic segmentation with geometric–topological reasoning to achieve structural completion and false-positive suppression without retraining or additional annotated data. The framework was trained on 8 km of manually labeled MMS data from the Kinki region, Japan, and evaluated on large-scale unseen data from Hokkaido (≈ 26 km, 2.53 billion points) and the Paris–Lille-3D benchmark (France) acquired with a different LiDAR sensor. The proposed approach significantly outperformed the KPConv baseline. On the Hokkaido dataset, the F₁-score improved from 0.8263 to 0.8689 (+0.0426), successfully reconstructing lamp tops, signal arms, and previously unseen snow delineator posts (snow poles). On the Paris–Lille-3D benchmark, recall increased by 15.5 points, yielding an overall F₁-score gain of +0.0802. The 26 km Hokkaido dataset was processed in less than 13 hours on a single NVIDIA Quadro RTX 8000. These results demonstrate that the proposed deep learning–geometry–topology fusion achieves robust, scalable, and efficient performance across diverse geographic and sensor domains, supporting nationwide road-asset mapping and digital-twin generation.

3:30pm - 5:15pm

ThS4A: Toward Smart Forests: Emerging Tools in Remote Sensing, Artificial Intelligence, and Field Robotics
Location: 713A

3:30pm - 3:45pm

AI-Enabled Forest Inventory in TerraScan: integrating Georeferencing, Species Identification, and Volume Computation

Anna Shcherbacheva, Helen Rost, Ana Puttonen, Arttu Soininen

Terrasolid LTD, Hatsinanpuisto 8, 02600, Espoo, Finland

The Terrasolid software suite provides an automated and scalable framework for large-scale LiDAR data processing, widely adopted in both national and private forest inventories. Its unified processing pipeline covers all essential steps—from point cloud import and georeferencing to ground classification, object detection, tree segmentation, and computation of individual-tree attributes such as diameter at breast height (DBH), height, volume, and tree species. Georeferencing is initially performed in TerraScan using signal markers or automatically detected tree trunks, with optional refinement in TerraMatch, which corrects angular misalignments between flight lines.

Following object classification, individual trees are extracted from points labeled as trees. The semi-manual Group Inspection tools support efficient correction of segmentation errors, such as merged or over-segmented trees, after which stem-wise metrics are automatically updated. These conventional modules rely on optimized algorithms capable of processing hundreds of millions of points within minutes.

A recent innovation, the Tree Species tool, introduces one of the first AI-based extensions within Terrasolid software. It employs a machine learning approach that integrates 2D raster-based features with 3D point cloud descriptors to achieve accurate tree species identification. Validation was conducted using the FOR-species20K dataset, comprising 33 species collected worldwide. Among several tested classifiers, the Histogram Gradient Boosting Classifier (HGBC) achieved the highest accuracy. To mitigate class imbalance, multiple side-view rasterizations and SVM-SMOTE oversampling were applied, significantly improving the separability of underrepresented species and overall classification robustness.

3:45pm - 4:00pm

Spatiotemporal Foundation Model for Aboveground Biomass Estimation: A case study in Mixedwood Plains Ecozone, Ontario, Canada

Zilong Zhong¹, José Bermúdez¹, Dominic Cyr², Alemu Gonsamo¹

¹McMaster University; ²Environment and Climate Change Canada

Traditional aboveground biomass estimation for forested areas relies on allometric equations (Návar, 2009), which use input variables such as diameter at breast height (DBH), tree height, and tree species or broader taxonomic group. Although allometric equations can estimate the biomass of individual trees, and stand-level equations exist for larger scales, they often require extensive field data, making them less suitable for densely clustered or remote forests. However, satellite images provide increasingly detailed global observations of forested areas, and spaceborne lidar data like GEDI (Duncanson et al., 2022) provide accurate measurements for canopy height across different ecozones worldwide. In recent years, foundation models (FMs) inspired by large language models (Vaswani et al., 2017) have become the new paradigm to leverage large amounts of unlabelled data through self-supervised pre-training and have shown capacity to benefit multiple downstream tasks. In this work, we adopt the Granite foundation model (Muszynski et al., 2024) as a baseline to improve aboveground biomass estimation on different satellite data, using the Mixedwood Plains Ecozone (MPE) as a case study. We also explore adding temporal, geospatial, and spatiotemporal features and validate the proposed spatiotemporal foundation model with field sampling plots.

4:00pm - 4:15pm

Improving Tree Species Detection for Operational Forestry: The Role of Dataset Design

Mirela Beloiu, Khatereh Meshkini, Zhongyu Xia, Severin Broch, Verena C. Griess

Department of Environmental Systems Science, Institute of Terrestrial Ecosystems, ETH Zurich, 8092 Zurich, Switzerland

Accurate detection and mapping of individual trees and their species are vital for sustainable forest management. Traditional field-based inventories remain the gold standard in forest monitoring, but are increasingly limited by temporal, spatial and accessibility constraints. Remote sensing offers faster, repeatable, and high-resolution data that complement and scale beyond field inventories. However, species-level detection remains difficult due to overlapping crowns and spatial mismatches between crowns and trunks. Deep learning (DL) methods, particularly convolutional neural networks, have advanced crown delineation by automatically learning spatial and spectral patterns from imagery. Yet, their success depends heavily on dataset quality, class balance, and diversity. To address this, we apply a DL object detection framework for tree crown and species detection in Swiss forests and evaluate how dataset composition and training strategies influence accuracy and generalization. We test three dataset configurations: (1) an unbalanced masked dataset, (2) a class-balanced masked dataset, and (3) a mixed dataset combining masked and unmasked imagery. Results show that class balancing improves accuracy for both dominant and minority species, while mixed data enhances generalization.

4:15pm - 4:30pm

Self-Supervised Leaf-Off Segmentation of Tree Functional Types and Buildings from Airborne NIRGB and LiDAR Data in Southern Ontario

Jose David Bermudez Castro¹, Zilong Zhong¹, Dominic Cyr², Alemu Gonsamo¹

¹McMaster University, School of Earth Environment Society, Canada; ²Environment and Climate Change Canada

High-resolution airborne sensing enables joint mapping of urban infrastructure and forest composition at ecological scales. This study presents a self-supervised segmentation framework that fuses 0.5 m Near-Infrared + RGB (NIRGB) orthophotography from the Ontario Imagery Program (2013–2026) with canopy height models (CHMs) derived from the Ontario Elevation Mapping Program (8–10 pulses m⁻², 5–10 cm vertical accuracy). Imagery was collected during the leaf-off season, providing strong spectral–structural contrast between evergreen and deciduous crowns. The framework produces high-fidelity land-cover segmentations that differentiate vegetation functional types and built structures as a prerequisite for tree-level biomass and carbon-stock estimation.

4:30pm - 4:45pm

Updating Forestry Road networks in Ontario Using Single Photon LiDAR and Deep Learning-enhanced algorithms

Hamid Ghanbari, Emmanuelle Baby-Bouchard, Catherine Chagnon, Alexandre Morin-Bernard, Alexis Achim

Department of Wood and Forest Sciences, Université Laval, Québec, Canada

Spatially accurate forestry road networks are essential for effective forestry operations, sustainable resource management, and conservation. Current forestry road databases in Ontario have significant location errors due to limitations and human errors associated with conventional road delineation approaches such as GPS-based field surveys and photointerpretation. A previously developed algorithm, which used airborne laser scanning (ALS) data, successfully corrected road locations in Quebec. However, its design limited its applicability to other landscapes, ALS instruments, and road construction and maintenance practices. This study advances that algorithm by integrating a deep learning component to improve its robustness and scalability for diverse forest conditions. A hybrid workflow combines the original friction-based conductivity surface with a road probability surface generated by an Attention Residual U-Net model trained on 11 LiDAR-derived features using road segments from five forest sites in Quebec. The enhanced workflow was applied to two forest management units in Ontario: Nipissing and Dryden. The results showed significant improvement in road alignment when compared to the existing provincial data and the outputs from the earlier automated approach. The deep learning-enhanced algorithm lowered mean positional error by 78% (from 9.36 m to 2.07 m) and increased the proportion of road centerline points within 3 m of the reference from 66.7% to 87.2%. These improved centerline accuracies will further support a scalable tool for rapid and accurate forestry road network mapping, which in turn will aid sustainable forest management and conservation planning at both provincial and national scales.

4:45pm - 5:00pm

Attention-guided Multi-Scale Deep Learning Approach for Tree Health Detection Using Very High-Resolution Aerial Imagery

Khatereh Meshkini, Mirela Beloiu, Zhongyu Xia, Verena C. Griess

Department of Environmental Systems Science, Institute of Terrestrial Ecosystems, ETH Zurich, Zurich, Switzerland

Monitoring tree health is essential for detecting early signs of stress, defoliation, and potential mortality, supporting effective forest management, ecosystem conservation, and early warning systems. Advances in deep learning have enabled automated analysis of trees in remote sensing imagery through object detection methods that leverage both spectral and spatial information. However, assessing tree defoliation remains challenging, as subtle differences between defoliation levels make accurate classification difficult. To address this, we propose the hybrid ResNet-Swin Transformer, an object detection architecture built on a Faster R-CNN framework, incorporating a fused ResNet and Swin Transformer backbone with attention-based feature fusion. This design captures rich, multiscale representations by combining convolutional and transformer-based features and progressively refines them through channel-wise attention blocks for robust detection and classification. The architecture was evaluated on a very high-resolution aerial dataset from Switzerland, partially annotated with five classes: Conifer (healthy), Conifer (defoliated), Broadleaf (healthy), Broadleaf (defoliated) and Dead. Comparative experiments with state-of-the-art object detection and classification methods demonstrate that the proposed approach achieves higher accuracy and robustness, highlighting its potential for precise and reliable automated tree health monitoring.

5:00pm - 5:15pm

Fine-grained vegetation segmentation in complex urban park environments using a deeply supervised parallel SegFormer

Haixin Zhang, Qinying Zhang

Department of Landscape Architecture, Tianjin University, 300072 Tianjin, China

Accurate vegetation mapping in complex urban environments is essential for ecological monitoring, biodiversity assessment, and sustainable park management. However, fine-grained vegetation segmentation remains challenging because of the high diversity of plant species, overlapping canopies, and the interference of artificial objects. To address these challenges, a deeply supervised parallel architecture based on the SegFormer backbone was proposed in this paper. The model incorporated a SegFormer-ASPP-low-level (SAL) head, which fused high-level semantic representations, multi-scale contextual information, and low-level spatial details through a parallel decoding mechanism. Two auxiliary heads, a pyramid pooling module (PSP) and a fully convolutional network (FCN), were added to provide deep supervision and improve the recognition of blurred boundaries and rare categories. High-resolution UAV imagery was used to perform fine-grained semantic segmentation of 17 vegetation categories. The dataset included multiple tree species as well as non-tree classes such as Nelumbo sp. (lotus) and dead trees. Experimental results showed that our model achieved a mean intersection over union (mIoU) of 73.57%, outperforming architectures such as SegFormer-b1, DeepLab v3+, ConvNeXt and SCTNet. Visual analysis further demonstrated the model's robustness in complex urban park scenes, showing superior boundary delineation, improved recognition of small and spectrally similar species, and resilience to interference from artificial objects like plastic lawns and landscape lighting. The proposed approach offers valuable insights for precision forestry, ecological monitoring, and intelligent UAV-based remote sensing applications.

Date: Tuesday, 07-July-2026

8:30am - 10:00am

WG II/2B: Point Cloud Generation and Processing
Location: 713A

8:30am - 8:45am

Multi-Source Fusion of Roof Skeletons, LiDAR and Street-View Imagery for Semi-Automated LoD-2 Building Modelling

Vaibhav Rajan¹, Sander Münster¹, Jonas Bruschke¹, Ferdinand Maiwald²

¹Digital Humanities, Friedrich-Schiller-Universität Jena, Germany; ²Chair of Optical 3D-Metrology, TUD Dresden University of Technology, Germany

LoD-2 building models are more informative and practically more useful than LoD-1 representations because they capture the roof structure that defines the essential three-dimensional form of a building. They are important for applications such as urban planning, environmental simulation, and digital heritage. Although recent roof shape extraction methods can derive vectorised 2D roof structures from very-high-resolution imagery, transforming these image-based representations into fully textured 3D buildings remains challenging. In this paper, we present a semi-automated LoD-2 reconstruction pipeline that integrates HEAT-derived roof geometry with airborne LiDAR, satellite and Google Street View imagery. The 2D outputs are reprojected into map coordinates, fused with LiDAR through a two-stage roof reconstruction strategy to derive roof shapes and combined with an adaptive, LiDAR-based ground base initialisation to create a complete 3D wireframe. Roofs are textured using VHR orthophotos while the walls are textured via a process of Street View panorama selection, geometric filtering, Mask2Former segmentation, and homography rectification. Across a large-scale evaluation on 1000 buildings, the proposed two-stage reconstruction strategy improves geometric agreement with the LiDAR reference data, achieving a roof-surface RMSE of 0.445 m. The wall texturing process produces convincing facades when suitable panoramas are available. While minor challenges such as sensitivities to LiDAR outliers, incomplete roof geometry, and facade occlusions persist, this pipeline effectively bridges 2D roof parsing and textured LoD-2 model generation, providing a robust and scalable foundation for advancing toward fully automated workflows.

8:45am - 9:00am

BIM-to-Labelled Point Cloud: Automated Point Cloud Annotation from BIM Models using Bounding Boxes and Solid Geometry

Saad Boudarbala¹,², Tania Landes², Hélène Macher², Thibault Bavoux¹

¹Futurmap Lyon, France; ²INSA-Strasbourg, France

This paper presents an automated framework for generating semantically labelled building point clouds from their corresponding BIM models. The proposed methodology aims to facilitate the creation of training datasets for deep learning–based indoor semantic segmentation. Two complementary labelling strategies are introduced. The first relies on bounding boxes (BBX) extracted from BIM elements to efficiently assign labels to points based on volumetric inclusion. The second approach uses solid geometry and a nearest-neighbour principle (SG-NN) to compute distances between BIM object meshes and the point cloud, enabling a more precise spatial correspondence. In addition, a room-based geometric grouping strategy is proposed to structure the annotated point clouds into spatial units compatible with common indoor segmentation datasets. The methods are evaluated through a qualitative analysis on several real building datasets of different typologies and acquisition conditions, as well as through a quantitative evaluation based on a manually segmented reference point cloud. Results show that the SG-NN approach achieves higher performance, with an average Recall of 92% and IoU of 88%, compared to a Recall of 87% and an IoU of 78% for the BBX approach. While the BBX approach provides faster processing, the SG-NN strategy achieves higher labelling accuracy, particularly for geometrically complex elements. The proposed workflow enables scalable dataset generation from Scan-to-BIM projects while significantly reducing manual annotation effort.
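
For readers unfamiliar with labelling by volumetric inclusion, the bounding-box idea can be sketched roughly as below. This is not the authors' implementation: it assumes axis-aligned boxes given as minimum and maximum corners, resolves overlaps by taking the first matching box, and uses hypothetical names and toy coordinates throughout.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[str, Point, Point]  # (label, min corner, max corner), axis-aligned (assumption)


def label_points_by_bbox(
    points: Sequence[Point], boxes: Sequence[Box]
) -> List[Optional[str]]:
    """Give each point the label of the first box that contains it, else None."""
    labels: List[Optional[str]] = []
    for p in points:
        label = None
        for name, lo, hi in boxes:
            # Volumetric inclusion test on all three axes.
            if all(lo[k] <= p[k] <= hi[k] for k in range(3)):
                label = name
                break
        labels.append(label)
    return labels


# Toy scene (made-up geometry): one wall and one column.
boxes = [
    ("wall", (0.0, 0.0, 0.0), (5.0, 0.2, 3.0)),
    ("column", (6.0, 0.0, 0.0), (6.4, 0.4, 3.0)),
]
points = [(2.5, 0.1, 1.0), (6.2, 0.2, 1.5), (8.0, 8.0, 1.0)]
labels = label_points_by_bbox(points, boxes)
```

A test of this kind is cheap, which is in line with the faster processing reported for BBX, but a box also captures any point that falls inside it without belonging to the element, which is one plausible reason a mesh-distance approach such as SG-NN can label complex elements more accurately.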

9:00am - 9:15am

Enhanced SegNet-based Building Extraction Framework via Image Segmentation and Point Cloud Fusion

Chi Tien Nguyen, Dinh Minh Bui, Somin Han, Changjae Kim

Department of Civil Engineering and Environment, College of Engineering, Myongji University

This paper presents an enhanced building extraction framework that combines deep learning-based image segmentation with photogrammetric point cloud refinement for urban roof detection. The method first applies a modified SegNet model to orthophotos from the ISPRS Vaihingen dataset to generate initial building masks. These results are then refined using geometric information from point clouds through ground filtering, clustering, and normal-guided region growing. By integrating spectral information from imagery with structural cues from 3D data, the proposed framework improves roof boundary delineation and reduces spurious detections. Experimental results on Areas 35 and 37 show that the method achieves strong overall performance, with a precision of 0.96, recall of 0.81, IoU of 0.78, and F1-score of 0.88. The findings indicate that point cloud refinement helps produce cleaner and more reliable building objects than image-based segmentation alone, especially in complex urban scenes. However, the approach remains sensitive to the density and quality of the point cloud. Overall, the study demonstrates that fusing orthophoto segmentation with point cloud processing is an effective strategy for more accurate and geometrically consistent building extraction.
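
The reported metrics can be cross-checked with a short calculation. The sketch below assumes that all four values are computed from the same true-positive, false-positive, and false-negative counts for the building class, in which case F1 = 2PR / (P + R) and IoU = F1 / (2 - F1). The function names are illustrative.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def iou_from_f1(f1: float) -> float:
    """IoU implied by F1 when both come from the same TP/FP/FN counts.

    F1 = 2TP / (2TP + FP + FN) and IoU = TP / (TP + FP + FN),
    which gives IoU = F1 / (2 - F1).
    """
    return f1 / (2 - f1)


# Values reported in the abstract.
f1 = f1_from_precision_recall(0.96, 0.81)
iou = iou_from_f1(f1)
print(round(f1, 2), round(iou, 2))  # prints: 0.88 0.78
```

Both derived values agree with the abstract after rounding to two decimals, so the four reported metrics are mutually consistent under the stated assumption.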

9:15am - 9:30am

Application Of Multi-Source Photogrammetric Data For Fast Building Inventory

Anna Fryskowska-Skibniewska, Patryk Wróblewski, Klaudia Pasternak, Julia Gotowiec

Military University of Technology, Poland

The rapid expansion of urban areas and the continuous demand for their monitoring make remote sensing data a highly valuable tool for collecting large volumes of geospatial information in a relatively short time and with high repeatability. The main objective of this paper is to examine the potential offered by different types of geospatial data, as well as the relationships based on their scope, in comparison with measured reference data.

Architectural inventory tasks are useful not only for engineering projects but also for broader applications, such as environmental impact assessments, spatial planning, and related fields. This article introduces a rapid and cost-effective mixed-mode data collection framework for building inventory development, integrating terrestrial laser scanning, UAV imagery, and traditional ground measurements.

The paper will discuss the latest measurement technologies and their practical applications in building surveying, illustrated with a selected case study. The criteria for selecting appropriate measurement methods will also be analyzed, depending on the investor’s requirements and the intended use of the documentation.

This paper presents a set of techniques for updating the geometric information of buildings using laser scanning and imagery. It begins with an introduction to the fundamental concepts, terminology, and principles of 3D information. Subsequently, various measurement techniques are described, along with a discussion of potential sources of error and data incompleteness. The extracted geometric values are validated against independent survey data.

9:30am - 9:45am

Conjugate Feature-Guided Dense Stereo Matching for High-Precision Attribute-Enriched Urban Point Clouds

Yung-Ching Yang, Jen-Jer Jaw

National Taiwan University, Taiwan

Accurate 3D reconstruction of urban scenes from multi-view images is essential for city planning, digital twins, and autonomous navigation. Traditional dense image matching relies on low-level cues such as intensity or gradients, which often produce noisy or incomplete point clouds in complex urban environments. This study introduces an attribute-enriched dense matching framework that embeds both geometric features and semantic attributes from multi-view images to guide dense image matching.

The framework first extracts semantic labels and geometric feature correspondences to generate intermediate products: conjugate features, feature seeds, an attribute map, and an initialized disparity map. These elements provide reliable priors that constrain dense matching, reduce search ranges, and prevent mismatches across structural boundaries. Dense image matching then propagates these constraints, producing an attribute-enriched disparity map and point cloud in which each 3D point carries both geometric and semantic information.

Evaluated on urban datasets, the proposed approach improves corner and edge localization, enhances edge continuity, reduces outliers in low-texture areas, and preserves semantic and structural attributes throughout 3D scene reconstruction. By integrating feature-based initialization with attribute-enriched dense image matching, the method delivers more accurate, interpretable, and robust 3D urban reconstructions, supporting downstream tasks such as precise measurement, object recognition, and scene analysis.

9:45am - 10:00am

Efficient Extraction and Specification-Compliant Optimization of Railway Alignment Parameters from UAV LiDAR Point Clouds

Zhaochen Han, Xuming Ge, Min Chen, Han Hu

Faculty of Geosciences and Engineering, Southwest Jiaotong University

The rapid acquisition of high-precision parametric railway alignment is a fundamental prerequisite for intelligent railway construction and maintenance. Traditional measurement techniques and alignment fitting methods heavily rely on manual operations, often resulting in inefficiency, high costs, and insufficient accuracy control. To address these challenges, this study proposes an automated method for extracting and optimizing railway alignment from UAV-based LiDAR point clouds. Initially, track centerlines are extracted by leveraging the geometric smoothness of the railway and the structural characteristics of the track. Next, a multi-constraint energy model integrating distance, orientation, and curvature is constructed to fit the geometric parameters of alignment elements, thereby providing high-quality initial values for subsequent alignment engineering parameter optimization. Finally, a global optimization strategy based on the simulated annealing algorithm is applied to jointly refine the engineering parameters of the standardized alignment composition, ensuring strict compliance with railway design specifications. Experimental results demonstrate that the proposed method can efficiently and robustly extract high-precision alignment parameters with well-defined engineering semantics from complex railway point clouds, thereby providing reliable technical support for intelligent construction and full lifecycle management of railway systems.

1:30pm - 3:00pm

WG III/1B: Remote Sensing Data Processing and Understanding
Location: 713A

1:30pm - 1:45pm

Multi-modal semantic segmentation for open vocabulary interactions with remote sensing images

Jinkun Dai, Tao Peng, Yuhang Xue, Xianping Ma, Yuanxin Ye

Southwest Jiaotong University, Chengdu 611756, China

Semantic segmentation of multi-modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi-modal approaches mainly focus on integrating complementary visual modalities (e.g., optical and synthetic aperture radar (SAR) imagery), yet neglect the incorporation of non-visual textual data, a rich source of knowledge that can bridge semantic gaps between visual patterns and real-world concepts. To address this limitation, we propose TSMNet, a text-supervised multi-modal open-vocabulary semantic segmentation network that integrates textual supervision with visual representation. Unlike conventional multi-modal segmentation frameworks, TSMNet introduces a dual-branch text encoder to extract both scene-level semantic and object-level label information from various textual data, enabling dynamic cross-modal fusion. These text-derived features interact with visual embeddings through the proposed text-guided visual semantic fusion module, enabling domain-aware feature refinement and human-interpretable decision-making. Moreover, integrating text opens pathways for open-vocabulary semantic segmentation, enabling systems to recognize and classify unseen categories through natural language descriptions, thereby overcoming the rigid constraints of predefined class taxonomies. To verify our method, we construct two new multi-modal datasets and carry out extensive experiments to compare the proposed method comprehensively with other state-of-the-art (SOTA) semantic segmentation models. Results demonstrate that TSMNet achieves superior segmentation accuracy while exhibiting robust generalization capabilities across diverse geographical and sensor-specific scenarios.
This work establishes a new paradigm for explainable remote sensing analysis, demonstrating that textual knowledge integration significantly enhances model generalizability.

1:45pm - 2:00pm

Meta-Prompting with Open-Source Language Models for Zero-Shot Scene Classification in Remote Sensing

Antonis Promponas¹, Eirini Baltzi¹, Valsamis Ntouskos², Konstantinos Karantzalos¹

¹Remote Sensing Lab, National Technical University of Athens, Greece; ²Department of Engineering and Sciences, Universitas Mercatorum, Rome, Italy

Zero-shot visual recognition with vision-language models (VLMs) has shown strong generalization to unseen categories in natural-image benchmarks, yet its effectiveness in remote-sensing (RS) imagery remains less explored. In this paper, we investigate whether meta-prompting with large language models (LLMs) can improve zero-shot scene classification in RS by automatically generating semantically rich class descriptions. Building on the Meta-Prompting for Visual Recognition (MPVR) framework, we evaluate three open-source LLMs, Mixtral-8x7B, Qwen 2.5 7B, and LLaMA 3.1 8B, as prompt generators across five RS benchmark datasets. The resulting descriptions are encoded with several VLMs, including CLIP, MetaCLIP, RemoteCLIP, and CLIP-LAION-RS, and compared against generic single-template and handcrafted domain-specific prompting baselines. Our results show that LLM-generated prompts are competitive with, and in several cases improve upon, manually designed templates, while revealing that the gains depend on both the dataset and the visual backbone. Overall, the study highlights the potential of open-source LLMs as scalable prompt generators for zero-shot remote-sensing recognition and provides insight into the transferability of meta-prompting beyond natural-image domains.

2:00pm - 2:15pm

Knowledge graph enhanced for zero-shot semantic segmentation in remote sensing imagery

Wubiao Huang¹, Huchen Li¹, Shuai Zhang¹, Haibing Liu¹, Zizhen Chen¹, Shihan Chen¹, Fei Deng¹,²

¹School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China; ²Hubei Luojia Laboratory, Wuhan 430079, China

Zero-shot semantic segmentation (ZSSS) is a crucial task in remote sensing image understanding, yet existing methods still suffer from limited generalization to unseen classes. To address this issue, we propose a Knowledge Graph (KG) enhanced ZSSS framework, which introduces explicit hierarchical and relational information into class embeddings to achieve more structured and semantically consistent representations. Specifically, a KG class encoder is designed, consisting of the class enhanced query (CEQ) and class enhanced embedding (CEE) modules, which extract class-relevant subgraphs from a self-constructed Remote Sensing Semantic Class Knowledge Graph (RSSCKG) and generate knowledge-enriched embeddings through a text encoder. Experiments on three public remote sensing datasets demonstrate that the proposed method consistently improves performance across seven state-of-the-art ZSSS frameworks. The integration of KG-based embeddings yields significant gains in the evaluation metrics, with particularly strong improvements on unseen classes, while maintaining accuracy on seen classes. Compared with enhancement strategies based on large language model (LLM) generated descriptions, the proposed KG class encoder exhibits superior semantic separability and stability. These results validate the effectiveness, generalization, and scalability of the proposed framework for ZSSS in remote sensing imagery.

2:15pm - 2:30pm

Segmentation-driven statistics-aware workflow for detailed scene description of UAV images using Mistral and LORA fused model

Bhargav Parulekar, Anandakumar M Ramiya

Indian Institute of Space Science and Technology, Thiruvananthapuram, Kerala, India

In the era of explainable AI, rapid data processing, analysis, and generation have become essential. Over the past few years, many approaches have been developed to process such heavy data and present it in an explainable manner, including in the field of remote sensing. One such application is remote sensing scene description. Many established workflows and models exist, but these models either fail to incorporate essential geospatial information or suffer from hallucination. We present a hybrid multimodal captioning methodology that tightly couples semantic segmentation outputs (via a LoRA-adapted Segment Anything Model) with a small, high-quality LLM, Mistral, to produce descriptive, interpretable, and data-grounded scene captions. Rather than relying on direct image-to-text pipelines, our approach first extracts structured scene statistics (class proportions), spatial context (quadrant dominance and object localization), and color fingerprints (dominant colors per semantic class). These structured signals are converted into compact, factual prompts that the LLM consumes to generate coherent, informative, and verifiable captions. A comparison with the established Florence-2 model in terms of quantitative description demonstrates a significant improvement, with the Precision Vocabulary Index increasing from 0.077 to 0.232 due to the proposed workflow.
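The step from a segmentation mask to a factual prompt can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' workflow. The mask is a small 2D list of class ids, the class names and prompt wording are hypothetical, and the color fingerprints mentioned in the abstract are left out.

```python
from collections import Counter

def scene_statistics(mask, class_names):
    """Compute class proportions and the dominant class per image quadrant."""
    h, w = len(mask), len(mask[0])
    total = Counter()
    quadrants = {
        "top-left": Counter(), "top-right": Counter(),
        "bottom-left": Counter(), "bottom-right": Counter(),
    }
    for r, row in enumerate(mask):
        for c, cid in enumerate(row):
            total[cid] += 1
            vert = "top" if r < h / 2 else "bottom"
            horiz = "left" if c < w / 2 else "right"
            quadrants[f"{vert}-{horiz}"][cid] += 1
    proportions = {class_names[cid]: n / (h * w) for cid, n in total.items()}
    dominance = {
        q: class_names[cnt.most_common(1)[0][0]]
        for q, cnt in quadrants.items() if cnt
    }
    return proportions, dominance

def build_prompt(proportions, dominance):
    """Turn the statistics into a compact, fact-only prompt for an LLM."""
    ranked = sorted(proportions.items(), key=lambda kv: -kv[1])
    cover = ", ".join(f"{name}: {100 * p:.1f}%" for name, p in ranked)
    layout = "; ".join(f"{q} is mostly {name}" for q, name in dominance.items())
    return ("Describe the scene using only these facts. "
            f"Class coverage: {cover}. Layout: {layout}.")

# Toy 4x4 mask (assumption): 0 = vegetation, 1 = road, 2 = building.
names = {0: "vegetation", 1: "road", 2: "building"}
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 1, 1],
    [2, 2, 1, 0],
]
proportions, dominance = scene_statistics(mask, names)
prompt = build_prompt(proportions, dominance)
```

Feeding the LLM only measured quantities is what makes the resulting caption checkable against the mask.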

2:30pm - 2:45pm

Evaluating the Adaptation Potential of SAM2 for Glacier Segmentation in Severe Weather

Bindusara Nagathihalli Lokesh, Laura Camila Duran Vergara, Hans-Gerd Maas, Anette Eltner

Dresden University of Technology, Germany

Ground-based time-lapse cameras provide continuous, high-frequency observations of glacier dynamics; however, automated analysis of these image streams remains challenging due to fog, snowfall, lens contamination, and variable illumination. This study investigates the potential of adapting the foundation segmentation model Segment Anything Model 2 (SAM2) for glacier segmentation from ground-based monitoring. To enable integration into automated pipelines, SAM2 is configured in image mode with a learned prompt generation strategy, while fine-tuning is restricted to the prompt encoder and mask decoder. In addition, the internal Intersection over Union (IoU) prediction head is utilized as a confidence estimator to assess segmentation reliability. Experimental results demonstrate that the adapted model achieves stable segmentation under moderate environmental variability, while degrading under severe visibility loss. This stability is consistent across model scales and input resolutions. The confidence estimation further provides a meaningful signal for identifying uncertain predictions, supporting reliability-aware processing in downstream workflows.

2:45pm - 3:00pm

Reasoning-guided ego-path segmentation for autonomous trains using vision–language models

Mohammadjavad Ghorbanalivakili, Ashley Varghese, Gunho Sohn

York University, Canada

Autonomous train perception must identify the train’s valid path under complex railway geometry, particularly at merging and diverging switches where multiple candidate tracks coexist. Existing approaches are primarily trained as purely visual predictors and typically do not provide justification for route selection, despite the fact that valid paths depend on structured cues such as blade–stock contact, rail gaps, and track continuity. In this work, we adapt the Large Language Instructed Segmentation Assistant (LISA) to railway ego-path perception and formulate the task as reasoning-guided segmentation: given a forward-facing railway image and a natural-language query, the model predicts the valid ego-path mask and, when prompted, generates a textual explanation grounded in visible switch geometry. Our approach integrates railway-specific prompting, a tailored annotation scheme, and efficient finetuning, along with semantic segmentation supervision to support general scene understanding. Experiments on a RailSem19-based evaluation set show improved ego-path segmentation performance over the original LISA checkpoint and increased robustness to prompt variation, while qualitative results indicate that the model can produce plausible, though not always consistent, reasoning. Notably, these capabilities emerge despite the reasoning-specific dataset consisting of only 54 samples, highlighting the data efficiency of the approach. These results highlight the potential of vision-language models for more interpretable railway perception, while also underscoring the need for stronger supervision and evaluation in safety-critical settings. Code and reasoning segmentation data are available at https://github.com/mvakili96/Railway_Perception_FoundationModel.

3:30pm - 5:15pm

ThS12: TLS-based Deformation Analysis
Location: 713A

3:30pm - 3:45pm

Complementing and validating uncertainty of terrestrial laser scanning via interval analysis

Reza Naeimaei, Steffen Schön

Institut für Erdmessung (IfE), Leibniz University Hannover, Hannover, Germany

Terrestrial laser scanning (TLS) enables dense spatial sampling; however, millimeter-level deformation analysis is limited by uncertainty rather than resolution, as inter-epoch differences can arise from actual change or residual systematic effects. Classical methods capture random variability under distributional assumptions but do not guarantee bounds for persistent systematic effects.

This paper presents a complementary interval-based framework that provides reliable, distribution-free bounds for TLS uncertainty and integrates seamlessly with least-squares workflows.

Starting from a measurement and instrumental correction model for high-end panoramic scanners, deviations of effective parameters are propagated to TLS observations and represented as interval radii at the observation level. We then extend the least-squares adjustment so that it linearly maps observation-level interval bounds to residuals and parameter estimates, providing conservative first-order enclosures alongside stochastic covariances.

Validation without a trusted nominal is addressed via a residual-based strategy that exploits two-face (Face 1/Face 2) acquisitions. We discuss the challenges of validating intervals when no nominal values exist, give guidance on addressing them to ensure a fair validation, and test the proposed method on real TLS data. Overall, the proposed framework provides guaranteed bounds for remaining effects, improves discrimination between actual deformation and systematic effects, and offers actionable diagnostics for TLS-based monitoring.
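The linear mapping of observation-level interval radii through a least-squares adjustment can be illustrated with a small sketch. This shows one common first-order formulation and is not the authors' implementation. It assumes unit weights and a two-parameter straight-line model, and all function names are hypothetical. With the estimator x = K l, where K = (AᵀA)⁻¹Aᵀ, the parameter radii are |K| r and the residual radii are |AK − I| r.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def interval_least_squares(A, l, radii):
    """Unit-weight least squares for two parameters with interval radii."""
    At = transpose(A)
    K = matmul(inverse_2x2(matmul(At, A)), At)        # x_hat = K l
    x_hat = [sum(k * li for k, li in zip(row, l)) for row in K]
    # Interval radius of each parameter: row of |K| times the observation radii.
    x_radius = [sum(abs(k) * r for k, r in zip(row, radii)) for row in K]
    # Residuals v = (A K - I) l, so their radii follow from |A K - I|.
    AK = matmul(A, K)
    n = len(l)
    R = [[AK[i][j] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    v_radius = [sum(abs(rij) * r for rij, r in zip(row, radii)) for row in R]
    return x_hat, x_radius, v_radius

# Toy example (assumption): line y = a + b t observed at t = 0, 1, 2,
# each observation carrying a 1 mm interval radius.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
l = [1.0, 2.0, 3.0]
x_hat, x_radius, v_radius = interval_least_squares(A, l, [0.001, 0.001, 0.001])
```

Because absolute values are summed, the radii do not average out the way random errors do, which is why such bounds are conservative for persistent systematic effects.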

3:45pm - 4:00pm

Point-based, profile-based and 3D point cloud-based vibration monitoring of structures: comparisons based on a lab experiment

Oliver Geißendörfer¹, Victoria Rosa², Hans-Berndt Neuner², Christoph Holst¹

¹Technical University of Munich, Germany; ²Technical University of Vienna, Austria

The safety and longevity of civil infrastructure rely on robust structural health monitoring (SHM), yet conventional methods are often constrained by the high cost and impracticality of contact-based sensors. On the other hand, existing non-contact technologies typically specialize in either static geometric mapping or spatially limited dynamic vibration analysis, leading to fragmented data and complex post-processing. This research presents a unified non-contact methodology that addresses this challenge by simultaneously acquiring high-resolution 3D geometry and time-series vibrational data using a single Light Detection and Ranging (LiDAR) device. For this purpose, we compare point-based measurements using a total station, an iPhone along with a profile-based LiDAR and 3D LiDAR point clouds for an experimental analysis. Sensor observations are recorded and analyzed at the same location on the experimental surface, showing flexibility in input dimensionality as well as robustness in resulting scalograms. The core of the analysis is our developed method, a directional wavelet transform, a signal processing technique uniquely suited to handling non-stationary signals as multidimensional unstructured data. This method enables the characterization of oscillations across the unstructured 3D surface, a capability beyond traditional modal analysis with one-dimensional time-frequency localization, but using LiDAR point cloud time series. The result is a richer and more integrated understanding of structural behavior, capable of revealing vibration behavior in high spatial detail. The study demonstrates that spatio-temporal LiDAR data contains embedded dynamic information, offering a more comprehensive and efficient way to assess the health and integrity of a structure in the future.

4:00pm - 4:15pm

From tensor-product to truncated hierarchical B-splines: Enhancing spatial Resolution in space-continuous Deformation Analysis based on 3D point clouds

Elisabeth Ötsch, Hans Neuner

TU Wien, Department of Geodesy and Geoinformation, Austria

The quasi-continuous capturing of our environment by terrestrial laser scanning (TLS) in the form of 3D point clouds provides the basis for numerous spatial analyses, including space-continuous deformation analysis. In times of aging infrastructure and climate change-induced, cumulative mass movements, statistically-sound methods for determining areal deformations are becoming increasingly important. However, the lack of reproducibility of absolute point positions between consecutive scans and the presence of measurement noise demand approaches that retrieve credible comparison statements. The representation of point clouds by geometric surfaces supports noise reduction and serves as basis for successive analysis. Tensor-product B-spline surfaces have proven to be particularly versatile geometric representations to derive spatially consistent deformation estimates. This paper extends this concept by investigating the use of truncated hierarchical B-splines for statistically sound deformation analysis. We show that deformation is detectable when partition of unity is preserved through truncation. In a simulated environment, significant deformations between two point clouds were successfully detected. Results indicate that coarse surface representations lead to type-1 errors and underestimated deformation magnitudes, whereas more refined surface representations yield consistent deformation estimates, providing a potential termination criterion for adaptive model refinement.

4:15pm - 4:30pm

Towards a Framework for Benchmarking Dense 3D Displacement Estimation Approaches for Geomonitoring Using Long-Range TLS Data

Nicholas Meyer, Tomislav Medic, Andreas Wieser

Institute of Geodesy and Photogrammetry, ETH Zurich, Switzerland

Accurate and spatially dense 3D displacement estimation can contribute to a better understanding of geomorphological processes, while long-range terrestrial laser scanning (LR-TLS) has emerged as a promising technique for generating such observations. However, selecting the most effective algorithms for dense 3D displacement estimation remains challenging due to the lack of benchmarking. This study introduces an open and extensible benchmarking framework for 3D displacement estimation and provides an initial validation through a systematic comparison of representative 2D projection-based and 3D point cloud-based methods for estimating 3D displacements from LR-TLS scans. The evaluation includes 252 combinations of algorithmic and hyperparameter configurations, covering cross-correlation, optical flow, and salient feature tracking approaches, as well as the 3D displacement estimation method F2S3. All methods were benchmarked on a single common LR-TLS dataset, using sparse GNSS and manually derived displacements as ground truth. Results show that F2S3 achieves the highest agreement with the ground truth, while the top-performing configurations of the 2D approaches reach comparable accuracy, albeit slightly lower than that of F2S3. Our findings further highlight key sensitivities of current methods to parameter choices and data characteristics. The presented open and extensible evaluation framework enables reproducible performance assessment and could provide a foundation for future large-scale benchmarking and further development of 3D displacement estimation techniques for LR-TLS data.

4:30pm - 4:45pm

Joint Stone Segmentation and Feature Driven Deformation Analysis at Water Dams

Annika Tobies, Judith Foth, André Cornelißen, Eike Koller, Lasse Klingbeil, Heiner Kuhlmann

Institute of Geodesy and Geoinformation, University of Bonn, Germany

Structural health monitoring of water dams is crucial to ensure their long-term safety and operational reliability. Traditional geodetic techniques, although precise, are limited to sparse observation points and cannot capture spatially heterogeneous deformations. Laser scanning enables comprehensive, area-wide acquisition, overcoming this limitation. Subsequent deformation analysis often relies on comparisons along the local surface normal, which are limited in detecting in-plane movements. To address this, this study presents an approach that combines image-based stone segmentation with point-cloud-based deformation analysis to estimate both in-plane and out-of-plane displacements across masonry dam surfaces. Individual stones are detected in unmanned aerial vehicle (UAV) imagery using a deep learning segmentation model (Mask R-CNN) and subsequently projected into corresponding point clouds acquired by terrestrial laser scanning (TLS) and UAV laser scanning. By establishing consistent stone correspondences across multi-epoch point clouds via centroid-based matching and local iterative closest point (ICP) alignment, the proposed method enables deformation analysis on a stone-by-stone level. Simulated deformations were applied to TLS- and UAV-based point clouds of a dam to evaluate the method. Results demonstrate that the approach achieves sub-centimeter accuracy for the TLS and low-centimeter accuracy for the UAV point cloud, as measured by the RMSE between the estimated and true deformation. Our approach outperforms conventional model-to-model comparison methods, such as Multiscale Model to Model Cloud Comparison (M3C2), for in-plane displacements. The integration of image segmentation and geometric analysis provides a powerful framework for full-field deformation monitoring of masonry structures, supporting the detection of instabilities and improving dam safety.
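The centroid-based matching of stones across epochs can be sketched as below. This is a simplified illustration and not the authors' pipeline. It assumes each stone is already available as a list of 3D points, uses greedy nearest-centroid matching with a hypothetical distance gate, and omits the local ICP refinement described in the abstract.

```python
import math

def centroid(points):
    """Mean position of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def match_stones(epoch_a, epoch_b, max_dist):
    """Greedily pair each stone in epoch A with the nearest unused stone in B.

    Returns {id_a: (id_b, displacement_vector)} for pairs whose centroid
    distance is at most max_dist (an assumed gating threshold).
    """
    cents_b = {sid: centroid(pts) for sid, pts in epoch_b.items()}
    matches, used = {}, set()
    for sid_a, pts in epoch_a.items():
        ca = centroid(pts)
        best, best_d = None, max_dist
        for sid_b, cb in cents_b.items():
            if sid_b in used:
                continue
            d = math.dist(ca, cb)
            if d <= best_d:
                best, best_d = sid_b, d
        if best is not None:
            used.add(best)
            shift = tuple(b - a for a, b in zip(ca, cents_b[best]))
            matches[sid_a] = (best, shift)
    return matches

def square(x0, y0, dx=0.0, dy=0.0):
    """Toy stone (assumption): four corner points of a 0.2 m square, optionally shifted."""
    return [(x0 + dx + i, y0 + dy + j, 0.0) for i in (0.0, 0.2) for j in (0.0, 0.2)]

epoch_a = {"a1": square(0.0, 0.0), "a2": square(1.0, 0.0)}
epoch_b = {"b7": square(1.0, 0.0, dy=0.01), "b3": square(0.0, 0.0, dx=0.01)}
matches = match_stones(epoch_a, epoch_b, max_dist=0.05)
```

Because the displacement is a full 3D vector per stone, it carries in-plane components that a comparison along the surface normal would miss.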

4:45pm - 5:00pm

Reducing Non-rigidity in TLS Point Clouds Induced by Inhomogeneous Systematic Errors Using Free-form Surface Modeling

Yihui Yang¹, Corinna Harmening², Daniel Czerwonka-Schröder³, Christoph Holst¹

¹Chair of Engineering Geodesy, TUM School of Engineering and Design, Technical University of Munich, Germany; ²Geodetic Institute, Karlsruhe Institute of Technology, Germany; ³Department of Geodesy, Bochum University of Applied Sciences, Germany

In geodetic monitoring, terrestrial laser scanning (TLS) point clouds are typically assumed to be accurate and true-to-scale, implying that data acquired from different epochs or stations differ only by rigid transformations. Consequently, systematic errors related to scanner or platform variations can be mitigated through rigid point cloud registration. However, variations in the propagation speed and path of laser beams due to atmospheric refraction, as well as ranging biases induced by surface properties, can introduce non-rigid distortions in the generated point clouds. These effects are particularly pronounced under complex meteorological and topographic conditions, such as in mountainous areas. As a result, the acquired point clouds exhibit inhomogeneous and non-linear deviations that cannot be effectively compensated by simple distance corrections or rigid transformations. In this study, robust rigid registration is first performed to minimize the effects of platform offsets. A data-driven approach is then employed to generate sparse stable points, providing distance deviations that incorporate spatially varying systematic errors. Finally, a free-form surface is fitted to these sparse point-wise distance deviations, thereby establishing a 3D correction field for the entire point cloud. For a dataset collected by a permanent TLS monitoring system in the Vals Valley (Tyrol, Austria), the proposed method effectively reduces the registration residuals in TLS point clouds caused by inhomogeneous systematic errors.

5:00pm - 5:15pm

Calibration of Panoramic Terrestrial Laser Scanners using Planar Patches

Eike Koller¹, Lasse Klingbeil², Heiner Kuhlmann³

¹University of Bonn, Germany; ²University of Bonn, Germany; ³University of Bonn, Germany

Using point clouds captured by Terrestrial Laser Scanners for measurement tasks with high-quality requirements is well established in engineering geodesy. However, geometric imperfections within the scanners introduce systematic deviations into the captured point clouds. These deviations often reach several millimeters in magnitude, exceeding the impact of random measurement noise. Calibrating the scanners by estimating these internal imperfections allows these systematic errors to be corrected, thereby preventing misinterpretations of the measurement results. In this work, we develop a methodology that allows users of Terrestrial Laser Scanners to independently determine calibration parameters for panorama scanners and to correct the resulting point clouds using planar patches extracted directly from the captured data. This approach requires no additional hardware or specialized measurement equipment. We evaluate the methodology using an independent point cloud of a water dam and demonstrate that it achieves a substantial reduction in systematic deviations. Furthermore, by estimating calibration parameters in a dedicated state-of-the-art calibration field, we show that our method delivers results comparable to these established calibration procedures—yet without the need for such specialized calibration environments.

5:15pm - 5:30pm

Methodological framework for determining vertical angular variances of terrestrial laser scanners

Jakob Hummelsberger¹, Omar AbdelGafar¹, Derek Lichti², Christoph Holst¹

¹Chair of Engineering Geodesy, TUM School of Engineering and Design, Technical University of Munich, Munich, Germany; ²Department of Geomatics Engineering, Schulich School of Engineering, University of Calgary, Calgary, Canada

Information on the precision of TLS observables is limited. While the range measurement precision can be modeled with respect to the intensity measurement nowadays, the precision of the angular observations still relies on the claims of the manufacturer. This contribution proposes a method to determine the vertical angular variance of a TLS using profile measurements. Supported by a simulation, which serves as proof of concept, the methodology is laid out. In the end, measurements with a Z+F IMAGER® 5016A are evaluated. A dependency of the angular standard deviation on the rotational speed of the beam deflection unit is observed. The estimation precision of the angular standard deviation is high with consistent values for differing ranges. The estimated angular standard deviations are much lower than the claims of the manufacturer, starting with roughly 2" for the slowest rotating settings, up to 4" for the fastest. All this can be achieved by scanning a reflectivity target with at least two adjacent fields of different homogeneous reflectivity. This needs to be aligned to the scanner to reduce and eliminate as many contributing error sources as possible. The target itself provides the fields and the transitions needed to perform the in-situ estimation of the angular precision.

Date: Wednesday, 08-July-2026

8:30am - 10:00am

WG II/2C: Point Cloud Generation and Processing
Location: 713A

8:30am - 8:45am

Differentiable deep consistency for point cloud registration

Tian Zhang, Sagi Filin

Technion - Israel Institute of Technology, Israel

Point cloud registration is a key facilitator for scan alignment in mapping, autonomous driving, and robotic applications. Current pipelines increasingly adopt neural-based paradigms, where most research focuses on learning view-consistent descriptors for correspondence matching. Due to outliers, matching is typically followed by a geometric verification phase that assesses correspondences by enforcing distance or angular consistency to support transformation estimation. Although effective, this verification stage scales quadratically, creating a computational bottleneck that hampers efficient registration. More importantly, since matching and verification are usually optimized separately, the verification stage cannot guide the learned descriptors or foster their geometric awareness. To address both limitations, we introduce a novel end-to-end neural registration framework that unifies correspondence learning and verification within a single differentiable formulation. Specifically, we propose a new consistency-driven cross-attention module that dynamically correlates cross-scan neighborhoods to suppress inconsistent matches and reinforce inter-scan feature coherence. In doing so, it produces robust and discriminative descriptors without incurring the quadratic cost of explicit pairwise verification. Our formulation is readily applicable, and we demonstrate its seamless integration into the GeoTransformer and RoITr state-of-the-art architectures without additional supervision or post-processing. Results show that our method excels in challenging low-overlap scenarios, where competing methods often yield few correct correspondences or fail entirely. It consistently achieves superior inlier ratios and the lowest registration errors on 3DMatch, 3DLoMatch, and KITTI, improving registration recall by up to 2.6%. Beyond accuracy, it converges faster during training and achieves the quickest inference among state-of-the-art methods.
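The classical verification stage that the abstract describes as quadratic can be sketched as a pairwise distance-consistency check. This illustrates the conventional baseline only, not the proposed cross-attention module. The tolerance and the toy correspondences are assumptions.

```python
import math
from itertools import combinations

def consistency_votes(src, dst, tol):
    """Count, for each putative match src[i] <-> dst[i], how many other
    matches preserve the pairwise distance under a rigid motion."""
    n = len(src)
    votes = [0] * n
    for i, j in combinations(range(n), 2):   # n(n-1)/2 pairs: quadratic cost
        d_src = math.dist(src[i], src[j])
        d_dst = math.dist(dst[i], dst[j])
        if abs(d_src - d_dst) <= tol:
            votes[i] += 1
            votes[j] += 1
    return votes

# Toy correspondences (assumption): the first three follow a pure
# translation by (1, 0, 0); the fourth is an outlier.
src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
dst = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
votes = consistency_votes(src, dst, tol=0.01)
```

Inliers support each other while the outlier receives no votes, but every pair must be visited, which is the cost an end-to-end formulation tries to avoid.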

8:45am - 9:00am

Cross-source Point Cloud Registration in the Bird’s-eye Domain: Aligning Street-level LiDAR with High-resolution Aerial Orthoimagery

Seunghwan Hong¹, Junghun Suh¹, Seokhyun Moon², Mohammad Gholami Farkoushi³, Hong-gyoo Sohn³

¹Kakao Mobility, Republic of Korea; ²University of Seoul, Republic of Korea; ³Yonsei University, Republic of Korea

Combining terrestrial Mobile Mapping System (MMS) point clouds with aerial photogrammetric data offers a practical route to comprehensive 3D urban models that integrate street-level geometric detail with wide-area coverage. However, direct 3D-to-3D registration between these data sources often fails because of large differences in viewpoint, point density, scale, and scene composition. This study presents an orthoimage-based registration framework that reformulates cross-source alignment in the Bird's-Eye-View (BEV) domain. After removing transient objects and extracting ground-level points from the MMS cloud, the data are rasterised into a synthetic orthoimage aligned in resolution and projection with a geo-referenced Unmanned Aerial Vehicle (UAV) orthoimage. A learned dense matcher establishes image correspondences, which are geometrically verified and lifted to 3D for coarse alignment, followed by tile-wise point-to-plane Iterative Closest Point (ICP) refinement and global trajectory regularisation via robust factor-graph optimisation. The aligned MMS and UAV point clouds are then integrated through reliability-driven voxel-level fusion. Experiments on a 3.7 km urban corridor in Seoul demonstrate that the proposed framework achieves a 3D root-mean-square error of 6.19 cm, indicating that BEV-domain orthoimage matching combined with local 3D refinement and trajectory regularisation provides a viable approach for large-scale MMS-UAV registration in dense urban environments.
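Rasterising ground points into a synthetic orthoimage can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the points are already filtered to ground level, that LiDAR intensity is used as the pixel value, and that the grid origin, cell size, and function name are hypothetical.

```python
def rasterize_bev(points, cell_size, x_min, y_min, width, height):
    """Project (x, y, z, intensity) points onto a north-up grid and
    store the mean intensity per cell (None where no point falls)."""
    sums = [[0.0] * width for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for x, y, _z, intensity in points:
        col = int((x - x_min) // cell_size)
        # Image rows grow downwards, so flip the y axis to keep north up.
        row = height - 1 - int((y - y_min) // cell_size)
        if 0 <= col < width and 0 <= row < height:
            sums[row][col] += intensity
            counts[row][col] += 1
    return [
        [sums[r][c] / counts[r][c] if counts[r][c] else None for c in range(width)]
        for r in range(height)
    ]

# Toy cloud (assumption): 1 m cells on a 2 x 2 grid; the last point lies
# outside the grid and is ignored.
cloud = [
    (0.2, 0.3, 0.0, 10.0),
    (0.8, 0.1, 0.0, 20.0),
    (1.5, 1.5, 0.0, 30.0),
    (5.0, 5.0, 0.0, 99.0),
]
grid = rasterize_bev(cloud, cell_size=1.0, x_min=0.0, y_min=0.0, width=2, height=2)
```

Choosing the cell size to equal the ground sampling distance of the UAV orthoimage is what lets a 2D matcher compare the two rasters directly.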

9:00am - 9:15am

Automated Alignment Enhancement of Backpack Image-LiDAR Data in a Forest Environment

Jidong Liu, Sangyoon Park, Songling Fei, Ayman Habib

Purdue University, United States of America

In recent years, backpack mobile mapping systems (MMS) have shown great promise for under-canopy forest mapping. These systems integrate cameras, LiDAR sensors, and Global Navigation Satellite System/Inertial Navigation System (GNSS/INS) units to provide multi-modal geospatial data essential for modern forest applications that require both geometric and spectral information. However, transportation logistics and improper handling can degrade the system calibration. Moreover, canopy-induced GNSS signal outages will cause trajectory errors. The resulting misalignments between the image-LiDAR data necessitate the application of image–LiDAR registration. Such algorithms can be broadly classified as 2D-3D, 3D-3D, or 2D-2D, depending on the domain in which image-LiDAR features are identified. Due to the inherent modality differences, 2D–3D methods often struggle with feature matching. These methods typically require manual feature selection (Habib et al., 2005) or the availability of prominent features in urban environments (Liao et al., 2023). In contrast, 3D-3D methods rely on generating 3D image point clouds, which imposes strict requirements on image overlap (Yang et al., 2015). Although 2D–2D approaches are less demanding on image data (Hu et al., 2023), none have been applied in under-canopy forests, where establishing multi-modal correspondences remains challenging. To overcome these limitations, this study introduces a post-processing framework for automated image–LiDAR alignment enhancement for backpack MMS in forest environments. This method utilizes a 2D–2D image–LiDAR registration approach based on semantic tree-trunk features.

9:15am - 9:30am

A Marker-based Method for precise 3D Registration between CT-Data and photogrammetric Datasets

Sebastian Preussel¹, Cyrill Milkau², Hans-Gerd Maas¹

¹TU Dresden, Germany; ²HTW Dresden, Germany

In order to enable photogrammetric tracking of objects from a computed tomography (CT) dataset with a multi-camera system, a transformation between the CT data space and a photogrammetric reference frame is required, typically based on control points. To achieve a robust and precise registration between CT and photogrammetric datasets, this work proposes a marker-based approach. The main goal is to use a marker model that allows straightforward segmentation and control point estimation in CT voxel space, while also supporting reliable and precise control point estimation in the photogrammetric images. As a proof-of-concept, spherical markers were investigated, since they allow centre estimation in both domains. In the CT data, marker centres were determined by intensity-based thresholding followed by sphere fitting, while in the photogrammetric data they were estimated by intensity-based thresholding, edge detection, circle fitting, and multi-image spatial intersection. Two different marker models were tested. The results show that the proposed method is feasible and yields sub-millimetre standard deviations of unit weight for both marker types. However, since a sufficient stochastic model is not yet available, the reported accuracy measures may be optimistic and should therefore be interpreted with caution. Future work will address these limitations, in particular uncertainty modelling as well as remaining lighting and contrast issues.
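Sphere fitting for marker centre estimation can be sketched with an algebraic least-squares formulation. This is one standard variant and not necessarily the one used by the authors. It assumes surface points of the marker have already been extracted, for example by thresholding, and the helper names are hypothetical. The sphere equation is rewritten as x² + y² + z² = 2ax + 2by + 2cz + d, which is linear in the centre (a, b, c) and in d = r² − a² − b² − c².

```python
import math

def solve_linear(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_sphere(points):
    """Algebraic least-squares sphere fit; returns (centre, radius)."""
    rows = [[2 * x, 2 * y, 2 * z, 1.0] for x, y, z in points]
    rhs = [x * x + y * y + z * z for x, y, z in points]
    # Normal equations of the linear system rows * [a, b, c, d] = rhs.
    N = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    t = [sum(r[i] * v for r, v in zip(rows, rhs)) for i in range(4)]
    a, b, c, d = solve_linear(N, t)
    radius = math.sqrt(d + a * a + b * b + c * c)
    return (a, b, c), radius

# Toy surface points (assumption): six points on a sphere of radius 2
# centred at (1, 2, 3).
surface = [(3, 2, 3), (-1, 2, 3), (1, 4, 3), (1, 0, 3), (1, 2, 5), (1, 2, 1)]
centre, radius = fit_sphere(surface)
```

The algebraic fit needs no starting values, which makes it a convenient initial estimate before any geometric refinement.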

9:30am - 9:45am

Advances in Historical Aerial Image Analysis: Boosting SfM Pipelines with Learned Models

Lucas Kugler¹, Jan Dirk Wegner¹, Camilo Rada²,³, Livia Piermattei¹

¹University of Zurich, Switzerland; ²University of Magallanes; ³University of British Columbia

Scanned aerial images acquired with film cameras (hereafter referred to as historical images) over the past century are a unique source for deriving Digital Elevation Models (DEMs) and orthoimages to reconstruct the Earth's past surface and quantify long-term changes, from glaciers to landscape and urban development.

The Historical Structure-from-Motion (HSfM) pipeline (Knuth et al., 2023) currently represents the state of the art for fully automatic generation of these historical DEMs. However, it struggles with inconsistent image quality, distortions, and distinct geometries, and above all it is based on the commercial software Metashape. Therefore, we aim to: (1) develop a fully open-source solution in the COLMAP environment, (2) integrate learned models in different SfM steps to better handle the complex properties that come with historical imagery, and (3) compare our output against HSfM.

Our work is based on 180 historical aerial images acquired above the challenging terrain of Gran Campo Nevado Glacier. The results show that our photogrammetric workflow leads to a 0.26 px smaller mean reprojection error as well as roughly nine times more tie points for the sparse point cloud compared to HSfM. The mean DEM difference with a reference DEM on stable terrain and the 95%-quantile DEM difference are also smaller in our experiments (0.71 m vs. 10.10 m and 73.62 m vs. 99.03 m).

Further tests of our workflow include employing alternative models for feature extraction, matching, and dense reconstruction as well as evaluating multitemporal approaches (as adopted in Knuth et al., 2023) to enable a more representative comparison.

9:45am - 10:00am

Trinocular Multi-Object 3D Reconstruction in Camera-Simulating virtual Environments for Knee Arthroplasty

Arne Schierbaum¹, Tobias Neiss-Theuerkauff², Thomas Luhmann¹, Frank Wallhoff², Till Sieberth¹

¹Jade University of Applied Sciences, Institute for Applied Photogrammetry and Geoinformatics, Oldenburg, Germany; ²Jade University of Applied Sciences, Institute for Technical Assistive Systems, Oldenburg, Germany

In knee arthroplasty, computer-assisted navigation enhances the accuracy of prosthesis placement. However, current methods rely on invasively drilled locators to track the knee position during surgery, prolonging the healing process. For this reason, research is focused on markerless approaches capable of determining knee orientation and transferring preoperative planning into the surgical environment. This work presents a trinocular multi-object 3D reconstruction system designed for intraoperative acquisition of the knee surface, providing a foundation for markerless navigation. Due to the scarcity of real surgical data with ground truth, a synthetic dataset was created using Blender to simulate optical image acquisition of a virtual knee model under controlled camera and lighting conditions. The dataset enables a systematic evaluation of how camera motion and viewpoint affect pose estimation and 3D reconstruction accuracy. The results demonstrate that moderate camera deflections between 15° and 25° achieve the best balance between accurate camera pose estimation and surface reconstruction quality. The work confirms the potential of trinocular SLAM for robust bone surface tracking while also identifying the limitations of synthetic data, such as the absence of real-world visual variability. These results form the basis for future work on 3D reconstruction during dynamic knee movements and their tracking, as well as on the integration of markerless optical navigation systems into surgery.

1:30pm - 3:00pm

WG III/1C: Remote Sensing Data Processing and Understanding
Location: 713A

1:30pm - 1:45pm

Remote sensing image semantic segmentation sample generation using a decoupled latent diffusion framework

Yue XU¹, Honghao Liu², Ruixia Yang³, Zhengchao Chen¹

¹Aerospace Information Research Institute, Chinese Academy of Sciences, China, People's Republic of; ²University of Chinese Academy of Sciences, China, People's Republic of; ³International Research Center of Big Data for Sustainable Development Goals, China, People's Republic of

Deep learning algorithms for semantic segmentation still depend on large quantities of high-quality annotated samples. Because remote sensing imagery spans vast areas and highly variable land surface environments, annotation demands substantial expertise and is both time-consuming and labour-intensive, leaving the field with an acute shortage of first-rate training samples. Moreover, object categories in land cover data are inherently imbalanced. Models trained under imbalance often underperform on small-sample categories.

This study proposes a decoupled latent diffusion framework for RS semantic segmentation sample generation, together with a proportion-aware loss to optimize the balance of sample classes. We tested the proposed method on the ISPRS Potsdam dataset and compared it with two classic image generation baselines. The results show that our approach outperforms the baselines, producing synthetic samples with superior visual quality and semantic consistency. To verify downstream utility, we trained DeepLabV3+, PSPNet, and SegFormer segmentation models with the synthesized data. Across all three networks, overall segmentation accuracy and class balance metrics improved markedly; gains were especially pronounced for the rare “Clutter” and “Car” categories, underscoring the proposed method’s generality and robustness. We further analysed how the proportion of synthetic samples affects performance. As the ratio of synthetic to real samples increased, mIoU and mF1 first rose and then declined; the best results were obtained when the proportion of synthesized samples approached 40%. This indicates that a moderate amount of synthetic samples can significantly boost segmentation performance, whereas excessive synthetic data risks over-fitting or misclassification.

1:45pm - 2:00pm

Bright-CC: A Novel Change Captioning Benchmark for Cross-Modal Remote Sensing Images

Jing Yao¹, Dongwei Sun², Yun Lin³, Zhangqi Zhou³, Haizhen Chen³, Xiangyong Cao², Zhengchao Chen¹

¹State Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences; ²School of Computer Science and Technology, Xi’an Jiaotong University; ³Ningbo Institute of Surveying, Mapping and Remote Sensing

Existing remote sensing change captioning methods are limited to optical-only data, precluding all-weather, all-day monitoring. To address this, we introduce Bright-CC, the first large-scale benchmark for cross-modal (Optical-to-SAR) change captioning. Curated from the newly proposed BRIGHT dataset, Bright-CC comprises 9,953 paired images focused on building damage assessment. It features dense four-class semantic labels (intact, damaged, destroyed) and a rich corpus of 49,765 GPT-4o-generated sentences (5 per pair), moving beyond simple binary change labels. Furthermore, we propose the Hybrid Feature Alignment Network (HFA-Net) as a robust baseline for this new task. HFA-Net is specifically designed to tackle the significant domain shift between heterogeneous sensors. Its architecture features: (1) a pseudo-siamese alignment module (HFEA) to project features into a common space; (2) a multi-scale atrous convolution module (CSTDF) to refine change context; and (3) a novel Lightweight Caption Generator (LCG), which is a parameter-efficient Transformer trained from scratch to avoid overfitting. Experiments show HFA-Net substantially outperforms adapted optical-only baselines (RSICCFormer, Chg2Cap) on all standard metrics. This work provides the community with a critical dataset and a strong baseline for future cross-modal spatio-temporal intelligence.

2:00pm - 2:15pm

Remote Sensing Change/Damage Image Generator Based on Prior Foundation Model and Multimodal Reference Information

Peng Chen, Guorui Ma, Haiming Zhang, Di Wang, Lunjun Fan

Wuhan University, China, People's Republic of

The scarcity and high cost of acquiring high-quality post-event remote sensing images (due to cloud cover, satellite limitations, and security risks) severely constrain the development and accuracy of change/damage detection models. This data gap is especially critical in disaster or military conflict scenarios. Existing cross-temporal image generation methods often lack precise spatial and semantic control, leading to inconsistent or unrealistic synthetic results. To address this core challenge, this paper introduces the Remote Sensing Change/Damage Generator (RSCDG), a novel method based on the Latent Diffusion Model for high-fidelity simulation of post-event satellite imagery. The RSCDG’s core innovation lies in its multimodal condition embedding framework, which integrates three specialized control pathways: The Pre-event Visual Prompt Adapter (built on PrithviModel) ensures high structural consistency between the pre-event and generated post-event images. The Spatial Location Control Pathway (using a ControlNet architecture and change/damage masks) precisely dictates the geometric location of the simulated change. The Generation Content Controller (using a CLIP Text Encoder) enhances semantic realism by guiding the model with natural language descriptions of the change/damage. Furthermore, we introduce a Mask Alignment Loss to enforce spatial and semantic adherence to detection rules. Results demonstrate that RSCDG accurately simulates complex scenarios like new urban construction and catastrophic building collapse. RSCDG is a powerful new tool designed to augment training data and significantly accelerate high-precision disaster response and urban monitoring.

2:15pm - 2:30pm

Edge Knowledge Distillation Guided Lightweight Change Detection Network

Tingyu Ji¹,², Yixin Chen⁵,⁶, Ruiqian Zhang¹, Xiaogang Ning¹, Xiao Huang³, Hanchao Zhang¹,⁷, Weibin Ma¹, Chunquan Cheng¹, Jiaming Wang⁴

¹State Key Laboratory of Spatial Datum, Chinese Academy of Surveying and Mapping, Beijing 100036, China; ²the College of Geodesy and Geomatics, Shandong University of Science and Technology, Qingdao 266590, China; ³the Department of Environmental Sciences, Emory University, Atlanta, GA 30322, USA; ⁴the Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430205, China; ⁵Sichuan Institute of Land Science and Technology (Sichuan Center of Satellite Application Technology), Chengdu 610045, China; ⁶Key Laboratory of Investigation, Monitoring, Protection and Utilization for Cultivated Land Resources, MNR, Chengdu 610045, China; ⁷Joint Laboratory of Spatial Intelligent Perception and Large Model Application

Deep-learning methods dominate remote-sensing change detection (CD), yet state-of-the-art models remain parameter-heavy and struggle with crisp boundaries, limiting their use on edge devices. We present LEDGNet, a Lightweight, Edge-knowledge-Distillation-Guided CD Network, that reconciles accuracy, boundary fidelity, and efficiency. LEDGNet integrates three purpose-built components: 1) an Edge Distillation Module that mines multi-scale boundary cues from a high-capacity teacher and transfers them to a compact student through an edge-aware loss; 2) StarLite, a depth-wise separable encoder that preserves fine spatial detail while minimizing floating-point operations; and 3) LiteDecoder, an inexpensive feature-fusion head that restores full resolution without bulky up-sampling. This design halves the parameters and inference time of mainstream fine-grained CD networks while enhancing edge sharpness. On the CDD and LEVIR-CD benchmarks, LEDGNet achieves competitive F1 performance while maintaining a compact footprint of 20.58 M parameters and 35.18 G FLOPs. With an inference time of 255 ms, it strikes a balance between resource consumption and detection efficiency, making it well-suited for high-efficiency remote sensing monitoring.
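
The parameter saving behind a depth-wise separable encoder can be illustrated with a small count. This is a generic sketch and not the StarLite design itself: the function names and the layer sizes in the usage note are hypothetical, and biases are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard 2-D convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise conv followed by a 1x1 point-wise conv (bias ignored)."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1x1 conv that mixes channels
    return depthwise + pointwise
```

For a hypothetical 3x3 layer mapping 64 to 128 channels, the standard form needs 73,728 weights and the separable form 8,768.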

2:30pm - 2:45pm

Leveraging Pretrained Priors for Weakly Supervised Semantic Segmentation of Remote Sensing Images

Xin Li, Nicola Genzano, Marco Gianinetto, Marco Scaioni

Politecnico di Milano, Italy

Semantic segmentation of remote sensing imagery (RSI) is essential for urban mapping, land-use monitoring, and many other applications. However, pixel-level annotation is expensive, making weakly supervised semantic segmentation (WSSS) that relies on image-level labels an attractive alternative. Pre-trained models offer strong priors from large-scale learned representations that can help WSSS, yet frozen models often yield sparse and misaligned class activation maps (CAMs) due to domain gaps and static inference. We propose a lightweight and efficient framework that integrates CLIP and DINO to address three challenges: semantic misalignment between generic text prompts and RSI-specific visuals, static CAM quality, and incomplete object coverage. Our design includes: (1) a Textual Prototype-Aware Enrichment (TPE) module that builds an RS-specific knowledge base using LLM-generated descriptions to enrich text prompts; (2) a Unified Semantic Relation Mining (USR) module that fuses learnable adapter features with CLIP attention and DINO affinity for online CAM refinement; and (3) a Visual Prototype-Aware Enrichment (VPE) module that maintains momentum visual prototypes to complete regions and sharpen boundaries. Using frozen priors while only training a lightweight decoder ensures efficiency and consistently improves segmentation accuracy across diverse remote sensing scenes. Experimental results on the iSAID and ISPRS Potsdam datasets demonstrate the effectiveness of the proposed framework, achieving 38.01% mIoU on the iSAID dataset and 47.01% mIoU with 66.89% overall accuracy on the Potsdam dataset.

2:45pm - 3:00pm

DeSEO: Physics-Aware Dataset Creation for High-Resolution Satellite Image Shadow Removal

Lorenzo Beltrame¹,², Jules Salzinger², Filip Svoboda³, Phillipp Fanta-Jende², Jasmin Lampert², Radu Timofte⁴, Marco Körner¹,⁵,⁶

¹Technical University of Munich; ²Austrian Institute of Technology; ³University of Cambridge; ⁴University of Würzburg; ⁵Munich Data Science Institute; ⁶ELLIS Unit Jena

Shadows cast by terrain and tall structures remain a major obstacle for high-resolution satellite image analysis. Public resources offering geometry-consistent paired shadow/shadow-free satellite imagery are essentially missing, even though there is a growing body of work on shadow removal in remote sensing, and most large-scale Earth-observation datasets are designed for shadow detection or 3D modelling rather than shadow removal. Existing deep shadow-removal datasets either target ground-level or aerial scenes or rely on unpaired and weakly supervised formulations rather than explicit satellite pairs. We address this gap with deSEO, a geometry-aware and physics-informed methodology that, to the best of our knowledge, is the first to derive paired supervision for satellite shadow removal from the S-EO shadow detection dataset through a fully replicable pipeline. For each tile, deSEO selects a minimally shadowed acquisition as a weak reference and pairs it with shadowed counterparts using temporal and geometric filtering, Jacobian-based orientation normalisation, and LoFTR–RANSAC registration. A per-pixel validity mask restricts learning to reliably aligned regions, enabling supervision despite residual off-nadir parallax. In addition to this paired dataset, we develop a DSM-aware deshadowing model that combines residual translation, perceptual objectives, and mask-constrained adversarial learning. In contrast, a direct adaptation of a UAV-based SRNet/pix2pix architecture fails to converge under satellite viewpoint variability. Our model consistently reduces the visual impact of cast shadows across diverse illumination and viewing conditions, achieving improved structural and perceptual fidelity on held-out scenes. deSEO therefore provides the first reproducible, geometry-aware paired dataset and baseline for shadow removal in satellite Earth observation.

3:30pm - 5:15pm

WG III/1D: Remote Sensing Data Processing and Understanding
Location: 713A

3:30pm - 3:45pm

Spatio-temporal Modeling of Bridge Deformations from Sentinel-1 SAR Images Validated with Multiple In-situ Surveys

Roberto Monti, Federica Gaspari, Rohollah Naeijian, Rebecca Fascia, Mirko Reguzzoni, Lorenzo Rossi, Livio Pinto

Politecnico di Milano, Department of Civil and Environmental Engineering (DICA), 20133 Milan, Italy

Aging bridge infrastructure requires efficient, network-scale monitoring, especially in remote areas where traditional in-situ sensors are costly and logistically challenging. This paper presents a remote sensing framework for structural health monitoring based on spaceborne Synthetic Aperture Radar (SAR). The approach combines Persistent Scatterer Interferometry (PSI) and Least Squares Collocation (LSC), implemented through the PHASE open-source MATLAB software, to derive a millimeter-level spatio-temporal displacement model.

The methodology is applied to a reinforced-concrete viaduct in the Alpine foothills of Lombardy, Italy, using five years of Copernicus Sentinel-1 data. A custom elevation-based spatial filtering strategy enables the isolation of structural displacements from the surrounding topography. The resulting spatio-temporal displacement model captures the expected seasonal thermal behavior of the structure and highlights localized deviations from the dominant cyclic response.

Finally, the SAR-derived model is integrated with UAV photogrammetry and official inspection reports within the P.O.N.T.I. 3D viewer. This multi-source, Digital Twin-like environment facilitates the joint interpretation of remote sensing observations and in-situ evidence, providing a scalable framework to support infrastructure monitoring and management.

3:45pm - 4:00pm

Large-Scale InSAR Deformation Monitoring Using Realistic Simulation-Based Training of a Temporal Convolutional Network: Application to the Phlegraean Fields, Italy

Kourosh Shahryarinia, Mohammad Omidalizarandi, Ingo Neumann

Geodetic Institute Hannover, Leibniz University Hannover, Germany

Large-scale land surface deformation monitoring using Interferometric Synthetic Aperture Radar (InSAR) requires robust detection of changes in long-term deformation trends. However, accurate change point (CP) detection remains challenging due to complex time series characteristics, including seasonal and quasi-periodic components and noise. Classical methods and many existing deep learning approaches rely on restrictive assumptions and training data that do not fully represent real-world InSAR time series, limiting their generalization and scalability in large-scale, real-world applications. In this study, we propose an integrated, fully supervised framework for CP detection in InSAR displacement time series based on Temporal Convolutional Networks (TCNs). The proposed TCN model employs dilated convolutions with multi-scale receptive fields to capture long-term temporal dependencies and complex deformation patterns, enabling robust identification of significant trend changes under noisy conditions. To effectively train this model, we introduce a deep learning-based InSAR time series simulation framework trained on real time series. This simulation framework produces physically consistent InSAR time series that retain essential temporal characteristics while introducing predefined, credible trend changes. Finally, we integrate the trained model into a large-scale anomalous change-detection pipeline that aggregates detected CPs from individual time series into spatially coherent deformation heatmaps suitable for operational monitoring. The proposed framework is evaluated using simulated data and real InSAR time series from the Phlegraean Fields caldera (Campi Flegrei), Italy. The results show clusters of anomalous behavior in the central Campi Flegrei–Pozzuoli area and in parts of Ischia and Procida, consistent with known unrest zones, associated periods, and independent measurements.
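
How dilated convolutions widen the temporal context can be sketched in a few lines. This is an illustrative sketch only: the abstract does not give kernel sizes or dilation rates, and the causal (left-padded) form used here is an assumption, as are the function names.

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Time steps seen by one output of stacked stride-1 dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def dilated_causal_conv(series: list[float], weights: list[float], dilation: int) -> list[float]:
    """out[t] = sum_j weights[j] * series[t - j * dilation], with zeros before t = 0."""
    out = []
    for t in range(len(series)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:  # samples before the start of the series count as zero
                acc += w * series[idx]
        out.append(acc)
    return out
```

With a hypothetical kernel size of 3 and dilations 1, 2, 4, 8, a single output already covers 31 time steps, which is why a few layers can span long displacement histories.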

4:00pm - 4:15pm

Geometry-conditioned Pix2Pix: leveraging explicit Conditioning on SAR projected local Incidence Angle for SAR-to-EO Translation Quality Improvement

Jinmin Lee, Minkyung Chung, Aisha Javed, Youkyung Han

Seoul National University of Science and Technology, Korea, Republic of (South Korea)

Electro-optical (EO) imagery is intuitive but highly dependent on weather and illumination, whereas synthetic aperture radar (SAR) imagery provides reliable all-weather observations yet offers limited spectral information. To complement these modalities, recent studies have applied cGAN-based image-to-image translation for SAR-to-EO translation. However, side-looking SAR introduces spatial distortions such as foreshortening and layover that cause relative misalignment with EO imagery, undermining pixelwise supervision and yielding structural discrepancies between translated outputs and reference EO imagery. In this study, we propose Geometry-Conditioned Pix2Pix (GC-Pix2Pix), which explicitly conditions on projected local incidence angle (PLIA) information derived from SAR imagery to better preserve structure and alignment in translated EO imagery. The method is based on Pix2Pix and comprises a 2-branch generator and a PatchGAN discriminator. The generator consists of a main network that processes SAR polarimetric channels (VV, VH) and a conditioning subnetwork that extracts PLIA features. The subnetwork uses multi-layer convolutional blocks to capture local PLIA patterns, and the extracted features are then fused with features from the main branch and emphasized via a spatial attention module. For training and evaluation, we assembled a dataset over South Korea that combines Sentinel-1A GRD VV/VH with PLIA and Sentinel-2B Level-2A RGB imagery. We compared GC-Pix2Pix against representative baselines. Across multiple image quality assessment metrics and complementary qualitative analyses, the proposed approach consistently improved SAR-to-EO translation performance.

4:15pm - 4:30pm

Temporal-Spatial Tubelet Embedding for Cloud-Robust MSI Reconstruction using MSI-SAR Fusion: A Multi-Head Self-Attention Video Vision Transformer Approach

Yiqun Wang, Lujun Li, Meiru Yue, Radu State

SEDAN, SnT, the University of Luxembourg, Luxembourg

Cloud cover in multispectral imagery (MSI) significantly hinders early-season crop mapping by corrupting spectral information. Existing Vision Transformer (ViT)-based time-series reconstruction methods, like SMTS-ViT, often employ coarse temporal embeddings that aggregate entire sequences, causing substantial information loss and reducing reconstruction accuracy. To address these limitations, a Video Vision Transformer (ViViT)-based framework with temporal-spatial fusion embedding for MSI reconstruction in cloud-covered regions is proposed in this study. Non-overlapping tubelets are extracted via 3D convolution with constrained temporal span t=2, ensuring local temporal coherence while reducing cross-day information degradation. Both MSI-only and SAR-MSI fusion scenarios are considered during the experiments. Comprehensive experiments on 2020 Traill County data demonstrate notable performance improvements: MTS-ViViT achieves a 2.23% reduction in MSE compared to the MTS-ViT baseline, while SMTS-ViViT achieves a 10.33% improvement with SAR integration over the SMTS-ViT baseline. The proposed framework effectively enhances spectral reconstruction quality for robust agricultural monitoring.
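
The partition into non-overlapping tubelets with temporal span t=2 can be sketched as plain index arithmetic. This is a minimal illustration, not the authors' code: the spatial patch size and the function names are assumptions, and a 3D convolution whose stride equals its kernel size would additionally apply a learned linear projection to each tubelet.

```python
def tubelet_grid(n_frames: int, height: int, width: int, t: int = 2, patch: int = 2):
    """Number of tubelets along (time, rows, cols) for a non-overlapping partition."""
    if n_frames % t or height % patch or width % patch:
        raise ValueError("dimensions must be divisible by the tubelet size")
    return n_frames // t, height // patch, width // patch


def extract_tubelets(video, t: int = 2, patch: int = 2):
    """Split video[frame][row][col] into flattened t x patch x patch tubelets."""
    nt, nh, nw = tubelet_grid(len(video), len(video[0]), len(video[0][0]), t, patch)
    tubelets = []
    for ti in range(nt):
        for hi in range(nh):
            for wi in range(nw):
                tubelets.append([
                    video[f][r][c]
                    for f in range(ti * t, (ti + 1) * t)
                    for r in range(hi * patch, (hi + 1) * patch)
                    for c in range(wi * patch, (wi + 1) * patch)
                ])
    return tubelets
```

Each tubelet mixes only t adjacent acquisitions, so a token never aggregates the whole sequence.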

4:30pm - 4:45pm

Evaluating Deep Matching Models for SAR-Optical Image Pairs using the SpaceNet9 Dataset

Constantin Günzel, Michael Schmitt

Department of Aerospace Engineering, University of the Bundeswehr Munich, Germany

This paper focuses on cross-modal image matching between Synthetic Aperture Radar (SAR) and optical imagery, a long-standing challenge due to disparate sensing physics, radiometric behaviour and geometric distortions. Beyond application needs in satellite data fusion and downstream mapping, the study is additionally motivated by the rapid advances of feature matching in the field of Computer Vision. Under a unified, lightweight pipeline, the authors evaluate a classical handcrafted baseline (SIFT) against modern deep matchers, including a modality-invariant approach (MINIMA), as well as a SuperPoint+LightGlue pipeline, using the SpaceNet9 dataset with provided ground truth. The aim is to assess each model's ability to establish reliable correspondences without retraining or modality-specific adaptation, in order to provide practical guidance for other researchers working with SAR-optical fusion. The paper highlights where pretrained multimodal models already yield consistent correspondences and where they still struggle, and outlines possible next steps.

4:45pm - 5:00pm

Detecting Marine Pollutants Using Sentinel-1 SAR and Sentinel-2 Optical Imagery

Jason Manesis¹, Paraskevi Mikeli¹, Katerina Kikaki², Ioannis Kakogeorgiou³, Athanasios Karmas², Konstantinos Karantzalos¹

¹National Technical University of Athens; ²Hellenic Space Center; ³IIT, NCSR "Demokritos"

Marine pollution, including Marine Debris and Oil Spills, poses a serious environmental threat that requires systematic monitoring. While satellite observations and machine learning models have been widely applied in this domain, the use of advanced deep learning techniques remains limited. To support progress in this area, we construct a new annotated Sentinel-1 SAR dataset derived from the MADOS Sentinel-2 marine pollution dataset, including labels for oil spills, sea surface, look-alikes, ships, and offshore platforms. We evaluate several deep learning architectures on this dataset, including traditional models such as U-Net, state-of-the-art segmentation models such as SegNeXt and domain-specific frameworks such as MariNeXt. Our results show that MariNeXt achieves the best performance with an F₁-macro score of 92.7%, significantly outperforming U-Net and SegNeXt. Qualitative analysis using paired Sentinel-2 imagery further validates these findings. The study also highlights the persistent difficulty of detecting marine debris in SAR imagery, particularly when complementary optical data are unavailable.
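
For readers unfamiliar with the reported metric, macro-averaged F₁ is the unweighted mean of the per-class F₁ scores. The sketch below is a generic illustration with hypothetical labels, not the authors' evaluation code; a class with no true or predicted samples is scored as 0 here, which is one common convention.

```python
def macro_f1(y_true: list[int], y_pred: list[int]) -> float:
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class counts equally, rare classes such as oil spills weigh as much as the dominant sea-surface class.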

5:00pm - 5:15pm

A coarse-to-fine cross-view localization framework with BEV-guided retrieval and fine-grained pose alignment

Qiong Wu¹, Panwang Xia¹, Yi Wan¹,², Xinyi Liu¹,², Yongxiang Yao¹, Dong Wei¹, Yongjun Zhang¹,²

¹Wuhan University, China, People's Republic of; ²Technology Innovation Center for Collaborative Applications of Natural Resources Data in GBA, Ministry of Natural Resources, Guangzhou, 510075, Guangdong, China

This paper introduces a coarse-to-fine cross-view localization framework that unifies image-level retrieval and geometry-level alignment within a single pipeline. The proposed approach first employs a Bird’s-Eye-View (BEV)-guided retrieval module to establish a perspective-consistent intermediary space, enhancing cross-view consistency and retrieval precision. In the fine stage, a geometry-aware alignment module estimates the 3-DoF pose through interpretable point-plane matching based on BEV correspondences. This hierarchical design bridges global retrieval and local geometric reasoning, achieving both scalability and high localization accuracy. Extensive experiments on the VIGOR benchmark demonstrate that the proposed framework achieves state-of-the-art performance in both retrieval and alignment, significantly improving end-to-end localization precision while maintaining computational efficiency.

Date: Thursday, 09-July-2026

8:30am - 10:00am

WG III/1E: Remote Sensing Data Processing and Understanding
Location: 713A

8:30am - 8:45am

Directional Total Least Square for Full-Waveform Aerial LiDAR Smoothing

Yining Feng, Debra Laefer, Ivan Selesnick

Tandon School of Engineering, New York University, United States of America

Smoothing aerial LiDAR point clouds is challenging because they are often noisy, irregularly sampled, and sparse, and because they inherently have many degrees of freedom. Classic methods struggle on such datasets, as they were designed for regularly sampled, dense datasets with moderate noise. To address the challenge, this paper proposes a constrained point cloud model with one degree of freedom. The point cloud model incorporates the sensing directions stored in full-waveform LiDAR datasets and has theoretical advantages in terms of the statistical error bound for normal estimation. Based on the point cloud model, the directional total least square is formulated as a regularized convex optimization problem for point estimation on a tangent plane. Moreover, a non-convex regularizer, along with the non-convex regularized directional total least square, is proposed to improve the estimation quality. To solve the proposed optimization problems, an accelerated Douglas-Rachford splitting algorithm is introduced. The proposed methods perform better on simulated two-dimensional point clouds in terms of root-mean-square error. For three-dimensional aerial LiDAR point clouds, implemented under the Savitzky-Golay filter framework with a local smoothness prior, the proposed methods demonstrate more smoothing power and robustness than the classic method.

8:45am - 9:00am

Improving Urban Point Cloud Classification Using Dynamic Local Context-Based Point Confidence

Jayati Vijaywargiya, Anandakumar M Ramiya

Indian Institute of Space Science and Technology

Urban mapping for planning and monitoring requires high-resolution spatial data, especially in areas with high landcover diversity. Airborne LiDAR Scanning (ALS) provides accurate 3D point cloud data, but its classification remains challenging due to computational complexity, irregular point distribution, noise, mislabeling and outliers in the dataset. These challenges are amplified in dense urban environments with mixed vegetation and infrastructure. Existing local context-based classification methods treat all points equally, overlooking the impact of each point's spatial position in the dataset. To address this, we propose a dynamic, local context-based point-confidence optimization that improves classification accuracy by leveraging the spatial context of each point. This approach selects points based on confidence levels, which are derived from position indices in the training data and predicted by binary classifiers in the test data, to enhance the robustness of the classifier. We evaluated the proposed approach using boosting-based machine learning classifiers on two datasets: the Thiruvananthapuram Aerial LiDAR Dataset (TALD) from India and the ISPRS 3D semantic labeling dataset from Vaihingen, Germany. The results showed 90.3% accuracy on TALD and 90.0% on Vaihingen, a 2-4% improvement over conventional local context-based classification.

9:00am - 9:15am

RefineNet: a confidence-aware deep online learning framework to refine real-world point cloud semantic segmentation

Sharath Chandra Madanu¹, Shenglan Du¹, Jantien Stoter¹, Daan van der Heide¹,²

¹3D Geoinformation group, Delft University of Technology, Delft, NL; ²Rijkswaterstaat, Delft, NL

Accurate interpretation and segmentation of 3D point clouds in real-world urban environments is a critical challenge in geospatial analysis, particularly due to the complexity of real-world scenes, inevitable data uncertainties, and potential annotation errors. This paper proposes a confidence-aware deep learning framework to refine the segmentation accuracy of real-world point cloud data. By incorporating multi-source information, such as aerial imagery, and embedding geospatial prior knowledge, this framework models data uncertainty through point-wise confidence scores. In addition, we design an iterative online learning strategy that allows the network to improve both its predictions and the quality of training labels. Extensive experiments on large-scale airborne laser-scanned data demonstrate that our framework effectively enhances training data by reducing label noise and improving annotation quality, which leads to more robust, generalizable model performance. Our source code is publicly available at https://github.com/AutumnMoon00/RefineNet.

9:15am - 9:30am

A Structured Query Language Approach for processing Smartphone-based LiDAR of Understory Vegetation

Rory Pittman, Baoxin Hu

York University, Canada

LiDAR sensors incorporated within modern smartphone and tablet devices enable relatively quick and inexpensive collection of ground-based LiDAR data applicable for ground truth mapping as needed for modelling understory vegetation. However, this LiDAR data often requires conversion and processing prior to research use. This study presents a workflow with algorithms utilizing structured query language (SQL) to efficiently process detailed rasterized features from LiDAR data collected by an iPhone Pro Max via the ForestScanner app. After transformation of the LiDAR data, SQL has been employed to voxelize the LiDAR data from which rasterized features have been derived. Various cell sizes for voxels and subsequent pixels have been investigated, leading to a recommended spatial resolution of 0.05 m for cell size dimension. SQL provides precise control for advanced querying to process ground-based LiDAR data for vegetational modelling applications.
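
The voxelization and rasterization step can be pictured with a short Python stand-in for the SQL workflow. This is not the authors' queries: only the 0.05 m cell size comes from the abstract, while the function names and the choice of a maximum-height raster are assumptions made for illustration.

```python
import math
from collections import Counter

CELL = 0.05  # recommended cell size in metres, taken from the abstract


def voxel_index(point, cell: float = CELL):
    """Integer (ix, iy, iz) index of the voxel containing an (x, y, z) point."""
    return tuple(math.floor(coord / cell) for coord in point)


def voxel_counts(points, cell: float = CELL) -> Counter:
    """Number of LiDAR returns per occupied voxel (a GROUP BY in SQL terms)."""
    return Counter(voxel_index(p, cell) for p in points)


def max_height_raster(points, cell: float = CELL) -> dict:
    """Highest z value per (ix, iy) pixel, one example of a rasterized feature."""
    raster = {}
    for x, y, z in points:
        key = (math.floor(x / cell), math.floor(y / cell))
        raster[key] = max(z, raster.get(key, z))
    return raster
```

In a database the same grouping would typically be a floor-division expression in a GROUP BY clause; the dictionary keys above play that role.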

9:30am - 9:45am

AI Indexing of Aerial LiDAR Point Cloud for Efficient Query

Mohana M, Anandakumar M Ramiya

Indian Institute of Space Science and Technology, Trivandrum, India

In the era of the information revolution, with data being the fuel of AI and analytics, efficient information extraction from LiDAR point clouds becomes indispensable for solving real-world problems and aiding decision-making in the geospatial domain. Despite their geometric richness, massive LiDAR point clouds are not only computationally demanding but also lack inherent semantics, which constrains effective data analysis. This paper presents a novel workflow that incorporates Deep Learning-derived embeddings as attributes in the geospatial database for spatio-semantic querying on Aerial LiDAR point clouds. This work leverages AI-based indexing, such as IVFFlat (Inverted File Index with Flat Quantization), on LiDAR point clouds for fast retrieval of queries. The pgPointCloud and pgVector extensions of PostgreSQL aid in importing point clouds into the database and performing similarity-based query retrieval on the embedding space of the point clouds. The methodology developed in this paper explores how semantic embeddings can handle inadequate semantics of point clouds by enabling direct and complex 3D intelligent queries within the database environment, thereby overcoming the limitations of traditional LiDAR representations. A few queries presented in this paper highlight the applications of the proposed framework in individual tree detection, tree species identification, utility management, urban planning and anomaly detection.
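
The idea behind an inverted file index with flat storage can be sketched in pure Python. This is a toy illustration, not the pgVector implementation: the class and method names are hypothetical, and the centroids are passed in directly rather than learned with k-means.

```python
import heapq


def sq_dist(a, b) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class IVFFlatIndex:
    """Toy inverted file index: one list of raw ("flat") vectors per centroid."""

    def __init__(self, centroids):
        self.centroids = [tuple(c) for c in centroids]
        self.lists = [[] for _ in self.centroids]

    def _nearest_lists(self, vector, n: int):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: sq_dist(vector, self.centroids[i]))
        return order[:n]

    def add(self, item_id, vector) -> None:
        (cell,) = self._nearest_lists(vector, 1)  # assign to the closest centroid
        self.lists[cell].append((item_id, tuple(vector)))

    def search(self, query, k: int = 1, probes: int = 1):
        """Exact search restricted to the `probes` lists closest to the query."""
        candidates = [(sq_dist(query, vec), item_id)
                      for cell in self._nearest_lists(query, probes)
                      for item_id, vec in self.lists[cell]]
        return [item_id for _, item_id in heapq.nsmallest(k, candidates)]
```

Probing fewer lists makes a query faster but may miss true neighbours stored in lists that were not visited, which is the usual speed versus recall trade-off of this index type.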

9:45am - 10:00am

Intelligent Extraction Method for Geographic Information Feature Based on Human-Machine Collaboration

Hongwei Zhang¹, Jiage Chen², Chenchen Wu²

¹Chinese Academy of Surveying and Mapping, China, People's Republic of; ²National Geomatics Center of China, China, People's Republic of

The development of global geographic information resource products involves massive processing of PB-level multimodal spatiotemporal data and faces technical challenges brought by the global scale. In response to these challenges, we have made technological innovations to break through the key technologies for developing global geographic information data products. Around the main themes of "intelligent interpretation of typical elements, multi-source geographic data mining, and intelligent hybrid compilation", we have completed the overall technical research on the construction of global geographic information resources and formed an autonomous construction capability. First, crowd-sourced data mining and fusion technology is used to achieve content information extraction and knowledge fusion. Second, using features from multiple data sources, elements are rapidly and automatically extracted and integrated with deep learning models, and digital line graph data are produced through intelligent hybrid compilation. Based on the deep learning-based automatic feature extraction technology, the production of digital line graph data products has been updated, and the evaluated accuracy exceeds 85%.

1:30pm - 3:00pm

WG II/2D: Point Cloud Generation and Processing
Location: 713A

1:30pm - 1:45pm

An Approach for deriving Branch Kinematics of Deciduous Trees from hyper-temporal terrestrial Laser Scanner Data

Anne Bienert, Katja Richter, Sophia Böhme, Hans-Gerd Maas

Dresden University of Technology, Institute of Photogrammetry and Remote Sensing, Germany

Understanding vegetation dynamics in three-dimensional, high-temporal resolution is essential for advancing ecological research and sustainable forest management. This study introduces a novel methodology for tracking branch kinematics in trees using hyper-temporal terrestrial laser scanning (TLS) data. Focusing on a solitary pedunculate oak (Quercus robur) over a one-year period, we employed a geometric feature detection algorithm combined with quantitative structure modeling (QSM) to identify and track distinctive point cloud sections on first- and second-order branches. By leveraging an iterative closest point (ICP) alignment process, branch kinematics were analyzed across multiple epochs, yielding detailed three-dimensional movement trajectories. The results demonstrate that branch movements exhibit screw-shaped patterns. Temporal resolution analysis revealed that a one-week recording interval is sufficient for our study subject to reliably capture kinematic dynamics, whereas longer intervals (e.g., three weeks) result in significant deviations from actual trajectories. The proposed method proved robust against partial occlusions from leaf growth but struggled under extensive occlusions. This research highlights the potential of hyper-temporal TLS for non-contact, high-resolution monitoring of tree canopy dynamics and provides a foundational approach for future studies aimed at modeling vegetation movement and structural changes over time.

1:45pm - 2:00pm

In-Field 3D Wheat Head Instance Segmentation From TLS Point Clouds Using Deep Learning Without Manual Labels

Tomislav Medic¹, Liangliang Nan²

¹ETH Zurich, Switzerland; ²TU Delft, Netherlands

3D instance segmentation for laser scanning (LiDAR) point clouds remains a challenge in many remote sensing-related domains. Successful solutions typically rely on supervised deep learning and manual annotations, and consequently focus on objects that can be well delineated through visual inspection and manual labeling of point clouds. However, for tasks with more complex and cluttered scenes, like in-field plant phenotyping in agriculture, such approaches are often infeasible. In this study, we tackle such a task - in-field wheat head instance segmentation using terrestrial laser scanning (TLS) point clouds. To address the problem and circumvent the need for manual annotations, we propose a novel two-stage pipeline. To obtain the initial 3D instance proposals, the first stage uses 3D-to-2D multi-view projections, the Grounded SAM pipeline for zero-shot 2D object-centric segmentation, and multi-view label fusion. The second stage uses these initial proposals as noisy pseudo-labels to train a supervised 3D panoptic-style segmentation neural network. Our results demonstrate the feasibility of the proposed approach and show significant performance improvements (up to +50% in F1-score) relative to Wheat3DGS, a recent alternative solution for in-field wheat head instance segmentation without manual 3D annotations based on multi-view RGB images and Gaussian Splatting, showcasing TLS as a competitive sensing alternative. Moreover, the results show that both stages of the proposed pipeline can deliver usable 3D instance segmentation without manual annotations, indicating promising, low-effort transferability to other comparable TLS-based point cloud segmentation tasks.

2:00pm - 2:15pm

Optimal Path Planning for Kinematic Laser Scanning

Julius Knechtel¹, Mohammad Kordgholiabad², Jan-Henrik Haunert¹

¹University of Bonn, Germany; ²Politecnico di Milano, Italy

Prompted by the rapid advancements in software and hardware, 3D building data for numerous different applications is nowadays often captured via mobile or kinematic laser scanning. However, in contrast to other laser scanning methods, there exist only a few approaches tailored for the planning of a kinematic laser scan survey, and none of them provides an optimality guarantee. Therefore, we propose a novel approach based on Mixed Integer Linear Programming (MILP) to find the optimal trajectory for such a survey. To obtain a high-quality point cloud, we account for scanner-related constraints that influence the quality of the resulting point cloud. Moreover, we enable the introduction of tie points to mitigate the effects of uncertainties in the position estimation that are propagated in the acquired data. In our problem formulation, we aim to find the best tour in a properly weighted graph. For this, we propose two different weight settings to either enable a purely length-based optimization or to increase the redundancy in the measurements by incorporating a Visibility Ratio Factor (VRF) into the objective function.

To prove the applicability of our approach for offline planning, we apply our formulation to three different scenarios. In this context, the VRF-based weighting enables a significant speed-up of the solving process while resulting in only slightly prolonged routes. This approach paves the way for applying exact algorithms with an optimality guarantee in the planning process for efficient kinematic laser scanning surveys.

2:15pm - 2:30pm

Non-Contact Modal Analysis of Wind Turbine Blades using Terrestrial Laser Scanner

Martina Göring

Jade Hochschule, Germany

This contribution introduces a novel method for non-contact, marker-free modal analysis of wind turbine blades using terrestrial laser scanning (TLS). As part of a research initiative, TLS's potential for assessing modal properties like natural frequencies and mode shapes—key for extending blade service life—is explored. Traditionally, this analysis relies on numerous accelerometers, incurring high costs and effort. TLS is evaluated as a viable alternative.

In laboratory tests, TLS and photogrammetry were used on a 4-meter test object in vibration. Photogrammetric data, serving as a reference, used 3D coordinates from retroreflective markers for frequency analysis via Fast Fourier Transform (FFT). TLS data were similarly segmented, with frequencies derived using FFT, and both methods showed consistent results, validating TLS's feasibility.

Building on lab results, the method was applied to an 88-meter rotor blade in a field experiment. The laser scanner collected profile data along the blade's longitudinal axis, converted to the object coordinate system. By segmenting the blade, eigenfrequencies were determined. The calculation process was validated with simulations, achieving precise results even with manual blade excitation and amplitudes up to 20 cm.

TLS measurements reveal valuable insights into eigenfrequencies and modal shapes along the blade. This approach offers a cost-effective, efficient alternative to traditional sensor-based analysis, proving its practicality for the wind energy industry.
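
The FFT step mentioned in this abstract can be sketched in a few lines. This is an illustrative NumPy example under stated assumptions, not the author's processing chain: it assumes a uniformly sampled displacement series for one blade segment, and the function name, the Hann window, and the sample rate argument are hypothetical choices.

```python
import numpy as np

def dominant_frequency(displacement, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC spectral peak."""
    x = np.asarray(displacement, dtype=float)
    x = x - x.mean()                          # remove the static offset
    # Hann window reduces leakage before the real-input FFT.
    spectrum = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate_hz)
    return freqs[spectrum[1:].argmax() + 1]   # skip the DC bin
```

The frequency resolution is the sample rate divided by the number of samples, so longer recordings separate closely spaced eigenfrequencies better.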

2:30pm - 2:45pm

Pixel-Accurate Registration of Photogrammetric Images and LiDAR in a Hybrid Airborne Oblique Imaging System

Deyan Deng¹, Rongjun Qin¹,², Elisa Mariarosaria Farella³, Fabio Remondino³

¹Department of Civil, Environmental and Geodetic Engineering, The Ohio State University, Columbus, USA; ²Department of Electrical and Computer Engineering, The Ohio State University, Columbus, USA; ³3D Optical Metrology (3DOM) unit, Bruno Kessler Foundation (FBK), Trento, Italy

Hybrid airborne imaging systems combining oblique cameras and LiDAR sensors offer significant advantages for applications requiring both geometric precision and rich texture information, including infrastructure monitoring, facility surveying, and detailed urban modeling. Despite capturing temporally consistent multi-modal data, achieving pixel-level registration between imagery and LiDAR remains fundamentally challenging due to insufficient calibration infrastructure and the technical complexity of deeply integrating heterogeneous sensors.

A critical bottleneck is that standard photogrammetric workflows exhibit non-linear cumulative drift, particularly across extended flight strips. This spatially varying deformation causes systematic misalignments when photogrammetric reconstructions are overlaid with LiDAR geometry. Conventional approaches applying global rigid transformations fail to address this issue because photogrammetric drift is inherently non-uniform—a single global registration cannot correct localized geometric deviations throughout the scene.

This work introduces a novel view-dependent registration framework that synergizes LiDAR's global geometric fidelity with photogrammetry's local density. Rather than attempting to warp entire models through global transformations, we decompose the registration problem by treating the geometry within each camera frustum as an independent rigid body. Building upon initial georeferencing, we perform fine-grained local SE(3) rigid registration to anchor each Multi-View Stereo (MVS) depth map directly to sparse LiDAR geometry within its corresponding viewing frustum.

This localized approach enables pixel-accurate alignment within individual frames while effectively compensating for accumulated photogrammetric drift and interpolation errors. By addressing registration at the frustum level rather than globally, our method achieves practical pixel-level fusion of hybrid airborne datasets, unlocking the full potential of integrated camera-LiDAR systems for high-precision geospatial applications.
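
A local SE(3) rigid registration of the kind described here ultimately solves for one rotation and one translation per frustum. The sketch below shows only the closed-form least-squares fit (the Kabsch/SVD solution) for points whose correspondences are already known. How the authors establish correspondences between MVS depth and LiDAR is not stated in the abstract, so that part is assumed away, and the function name is hypothetical.

```python
import numpy as np

def rigid_fit(source, target):
    """Least-squares R, t such that R @ source_i + t approximates target_i."""
    src_c, tgt_c = source.mean(axis=0), target.mean(axis=0)
    H = (source - src_c).T @ (target - tgt_c)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = tgt_c - R @ src_c
    return R, t
```

In an iterative scheme such as ICP, this fit would be repeated after each re-estimation of correspondences.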

2:45pm - 3:00pm

Integrating Airborne LiDAR and OpenStreetMap Features for Automated Hydrological Conditioning of Urban Digital Elevation Models

Tommaso Destefanis¹,²,³, Elena Durando³, Matilde Oliveti³, Emanuele Artù Cassin³, Martina Di Rita²

¹Sapienza Università di Roma, DICEA, Rome, Italy; ²Politecnico di Torino, SDG11Lab, Interuniversity Department of Regional and Urban Studies and Planning (DIST), Turin, Italy; ³Ithaca S.r.l., Turin, Italy

High-resolution Digital Elevation Models (DEMs) are essential for urban flood modelling, where small elevation differences govern surface drainage and inundation extent. DEMs frequently contain hydrological inconsistencies: elevated infrastructure such as bridges, tunnels and culverts may appear as artificial barriers disrupting flow continuity, while linear structures such as retaining walls may be underrepresented depending on spatial resolution or point density. These inconsistencies propagate errors through downstream hydraulic simulations.

This paper presents an automated, open-source Python pipeline for generating hydrologically conditioned DEMs by integrating classified airborne LiDAR data with OpenStreetMap (OSM) infrastructure features. The workflow is tested on a 16 km² area of central Copenhagen using a 2023 national LiDAR acquisition at 13.5 pts/m². A 0.5 m resolution DSM is generated from LiDAR ground and building classes via Inverse Distance Weighting interpolation, with Nearest Neighbour gap-filling for hydraulic model continuity. Hydrological conditioning is performed through four sequential operations: bridge burning, tunnel enforcing, culvert enforcing, and barrier rasterization. Barrier top-of-wall elevations are estimated directly from the LiDAR point cloud.

Vertical accuracy is assessed by pixel-wise comparison against the Danish national terrain model DHM/Terraen (NMAD = 0.066 m, LE90 = 0.265 m) and by independent checkpoint validation against the HojdefikspunktDanmark geodetic network. The inclusion of shallow tunnel underpasses proved a significant addition: tunnel features alone contributed approximately half of the total depression volume reduction. The conditioned DSM is designed as input for an urban flood simulation chain; full hydraulic validation will be performed by the Danish Meteorological Institute within the CLEAR-EO project.
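
The two vertical accuracy measures quoted above can be computed from per-pixel elevation differences as follows. This is a sketch under common definitions, not the authors' code: NMAD is taken as 1.4826 times the median absolute deviation from the median, and LE90 is assumed here to be the empirical 90th percentile of absolute differences (some studies instead derive it from the RMSE under a normality assumption).

```python
import numpy as np

def nmad(dh):
    """Normalized median absolute deviation of elevation differences."""
    dh = np.asarray(dh, dtype=float)
    return 1.4826 * np.median(np.abs(dh - np.median(dh)))

def le90(dh):
    """Empirical 90th percentile of absolute elevation differences."""
    return np.percentile(np.abs(np.asarray(dh, dtype=float)), 90)
```

NMAD is robust to outliers such as residual vegetation or buildings, which is why it is often reported alongside percentile-based measures.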

3:30pm - 5:15pm

SpS4A: Remote Sensing of Atmospheric Components for Climate Change and Air Quality: Bridging ISPRS and AERSS
Location: 713A

3:30pm - 3:45pm

Satellite Remote Sensing and Numerical Simulation of the Impact of Biomass Burning on Black Carbon in East Asia

Huimin Chen¹, Jane Liu²,³, Bingliang Zhuang⁴, Tijian Wang⁴

¹Suzhou Meteorological Bureau, China; ²Fujian Normal University, China; ³University of Toronto, Canada; ⁴Nanjing University, China

As an essential component in the atmosphere, black carbon (BC) can affect regional and global climate, air quality, and human health. Biomass burning is an important source of BC aerosols, and biomass burning in East Asia is rather active. In this study, we analyze biomass burning over East Asia in 2010 using MODIS satellite fire data. A global chemical transport model, GEOS-Chem, is used to simulate the temporal and spatial variations of BC aerosols and the impact of biomass burning on these variations through two numerical experiments: one with all BC emissions and the other without biomass burning emissions.

The results show that biomass burning over East Asia in 2010 occurred frequently in northeast China, north China, northern India and the Indochina Peninsula. In China, biomass burning mostly happened in summer and fall, while in Southeast Asia it happened in spring and winter. GEOS-Chem can reasonably reproduce the temporal and spatial variations of BC. The surface concentrations of BC in China are high in the North China and Southwest basins. This spatial pattern is similar in all four seasons, with BC concentrations highest in winter, followed by autumn, spring and summer. Sensitivity analysis shows that biomass burning in East Asia contributed 8.6% of the BC concentrations in East Asia. Based on EOF decomposition and correlation analysis, the BC concentrations due to biomass burning in some parts of East Asia were significantly increased through transport of BC in the first mode at 850 hPa in spring and winter.

3:45pm - 4:00pm

Validation of global land-ocean aerosol products retrieved from the DPC-2/GF-5(02) on-orbit measurements

Zihan Zhang¹, Zhengqiang Li¹,²,³

¹Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China; ²State Key Laboratory of Spatial Datum, College of Remote Sensing and Geoinformatics Engineering, Faculty of Geographical Science and Engineering, Henan University, Zhengzhou 450046, China; ³University of Chinese Academy of Sciences, Beijing 100049, China

The Chinese second-generation Directional Polarization Camera (DPC-2) onboard the GF-5(02) satellite provides global multi-angle, multispectral polarization observations, effectively bridging the gap between POLDER/PARASOL and SPEXone/PACE. Using one year of DPC-2/GF-5(02) measurements, land-ocean aerosol products are generated by fully exploiting polarization and angular information to enhance sensitivity to aerosol properties. Ground-based observations from the AErosol RObotic NETwork (AERONET) are used to evaluate the retrieval accuracy of Aerosol Optical Depth at 550 nm (AOD550), Ångström Exponent between 440 nm and 670 nm (AE440-670), and Single Scattering Albedo at 440 nm (SSA440), demonstrating the stability and reliability of the retrievals. For AOD550, the Root Mean Square Error (RMSE) and bias are 0.109 and -0.006 over land, and 0.071 and -0.001 over ocean. For AE440-670, the RMSE and bias are 0.488 and -0.151 over land, and 0.275 and -0.047 over ocean. For SSA440, the RMSE and bias are 0.044 and 0.003 over land, and 0.039 and 0.002 over ocean. Comparisons with mainstream satellite aerosol products indicate comparable and consistent accuracy. Overall, these results provide a coherent global characterization of aerosol distribution and properties, highlighting the strong potential of DPC-2/GF-5(02) for long-term aerosol monitoring and climate research.
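
The RMSE and bias figures reported above are standard validation statistics. A minimal sketch of how such numbers are obtained from matched satellite and ground values follows; the sign convention (retrieved minus reference) and the function name are assumptions, and no matchup or filtering logic from the study is reproduced.

```python
import numpy as np

def rmse_and_bias(retrieved, reference):
    """Return (RMSE, mean bias) with bias defined as retrieved minus reference."""
    diff = np.asarray(retrieved, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2))), float(np.mean(diff))
```

Under this convention, a negative bias means the satellite product underestimates the ground-based reference on average.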

4:00pm - 4:15pm

Intra-urban aerosol heterogeneity in Hong Kong based on Lidar observations

Xinyu YU¹, Man Sing WONG¹,²,³,⁴, Jiaqi Jin¹

¹Department of Land Surveying and Geo-Informatics, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China; ²State Key Laboratory of Climate Resilience for Coastal Cities, The Hong Kong Polytechnic University, Hong Kong, China; ³Research Institute for Sustainable Urban Development, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China; ⁴Research Institute of Land and Space, The Hong Kong Polytechnic University, Kowloon, Hong Kong, China

This study involves remote sensing and Lidar-based data analysis to quantify the aerosol extinction profile under different urban patterns and seasons.

4:15pm - 4:30pm

Contrasting Meteorological Impacts of Dust Storms from the Gobi Desert versus the Taklimakan Desert over China

Jianhong Li, Nana Luo

Beijing University of Civil Engineering and Architecture, China, People's Republic of

Direct and indirect climate forcing from Asian dust storms has been well documented, such as lifted dust aerosols becoming cloud-forming particles and changing the radiation flux from the surface to the top of the atmosphere. However, whether such forcing differs with dust origin remains unclear. Here we present a comparative analysis of historical dust storms in China originating in Mongolia and Xinjiang from 2016 to 2023, and determine their respective dominant factors by examining the individual and combined influence of these factors on dust storms. Most dust storms originated in Mongolia and, compared with those originating in Xinjiang, showed long-range transport and global-scale development. During dust storms, cloud properties such as cloud droplet radius and cloud retrieval fraction liquid responded nonlinearly and played a dominant role in 60.2% of the study area. Climate conditions such as surface thermal radiation and dewpoint temperature became dominant at the periphery of dust storms. Xinjiang-originated dust storms, in contrast, were dominated by local aridity (65.2%). As the aridity decreased, dust storms were dominated by total precipitation, with an increase from 0.5 up to 3.6, and the influence of surface heat flux decreased. Heat-flux-dominated regions encountered increased aridity, and the dominance of total precipitation was neutralized. These findings have important implications for the global management and mitigation of Asian dust emissions.

4:30pm - 4:45pm

The Arctic Observing Mission (AOM): A high priority candidate mission for the Government of Canada

Ray Nassar¹, Joseph Mendonca¹, Chris Sioris¹, Alec Casey², Matt Arkett², Josep Aparicio³, Kyle Ziolkowski⁴, Dan-Tam Nguyen⁵, Shen-En Qian⁵, Mary-Anne Fobert⁶, Alexander Trishchenko⁶

¹Environment and Climate Change Canada, Science and Technology Branch, Toronto, Canada; ²Environment and Climate Change Canada, Meteorological Service of Canada, Gatineau, Canada; ³Environment and Climate Change Canada, Science and Technology Branch, Dorval, Canada; ⁴Environment and Climate Change Canada, Science and Technology Branch, Winnipeg, Canada; ⁵Canadian Space Agency, St.-Hubert, Canada; ⁶Natural Resources Canada, Ottawa, Canada

The Arctic Observing Mission (AOM) is a satellite mission concept under study by the Canadian Space Agency (CSA), in partnership with Environment and Climate Change Canada (ECCC) and Natural Resources Canada (NRCan). AOM would use two satellites in a highly elliptical orbit (HEO) to enable frequent observations of meteorological variables, greenhouse gases (GHGs), space weather and air quality (AQ) over northern regions, reaching beyond the usable viewing range of geostationary satellites. These observations are important for operational activities, environmental monitoring and scientific research aligned with the Government of Canada priority of enhancing Arctic and northern situational awareness and security.

4:45pm - 5:00pm

Global Point Source CO2 Emissions Monitoring Based on Hyperspectral Remote Sensing Imagery

Huayi Wang¹,², Ge Han¹,², Zhipeng Pei¹,²

¹Hubei Key Laboratory of Quantitative Remote Sensing of Land and Atmosphere, School of Remote Sensing and Information Engineering, Wuhan University; ²Perception and Effectiveness Assessment for Carbon-neutrality Efforts, Engineering Research Center of Ministry of Education, Institute for Carbon Neutrality, Wuhan University

This study presents a hyperspectral remote sensing approach for monitoring global CO₂ point source emissions using China’s GF5 and ZY1 satellites. By applying the matched filter method in the 1.6 μm and 2.0 μm absorption bands and the Integrated Mass Enhancement (IME) technique, this study successfully detects and quantifies emissions from multiple facilities within a single scene, as demonstrated in a high-density industrial cluster in Xinjiang. Results show current systems can detect power plants with annual emissions above 2.90 MtCO₂, covering 6.74 GtCO₂/year globally across eight sectors. While power and chemical sectors are well captured, cement and petrochemical emissions remain poorly detected, highlighting the need for improved sensitivity to low-intensity sources.

5:00pm - 5:15pm

Remote Sensing of CO, ozone and Their Correlation in Tropical Fire Regions

Jane Liu¹, Jufen Lai², Joey Zhou Zang¹

¹University of Toronto, Canada; ²Jiangsu Ocean University

Biomass burning releases a large amount of pollutants, including carbon monoxide (CO), and generates secondary pollutants, e.g., ozone (O3). Both CO and O3 are major pollutants and can also significantly affect tropospheric chemistry. Understanding the O3-CO relationship is important for evaluating the transport and evolution of pollutants in fire plumes. Here, we analyse satellite fire count data from MODIS, satellite retrievals of CO and O3 from AIRS, and simulations from the global atmospheric chemistry model GEOS-Chem in the middle and lower troposphere during June and August of 2010. AIRS can capture fire-induced CO and O3 enhancements (ΔCO and ΔO3) well in fire-affected and fire-plume outflow regions. Two areas with high ΔCO and ΔO3 are central Africa and northwestern South America in the tropics, where the numbers of hotspots in the MODIS fire data are large. AIRS CO and O3 in fire plumes are highly correlated at 850 hPa and 500 hPa. The GEOS-Chem simulation shows CO and O3 enhancements in northwestern South America, but with lower ΔO3/ΔCO values. These findings highlight the importance of integrating satellite observations with atmospheric chemistry modelling for refining assessments of fire-affected air quality and tropospheric chemistry.

Date: Friday, 10-July-2026

8:30am - 10:00am

WG II/2E: Point Cloud Generation and Processing
Location: 713A

8:30am - 8:45am

Appearance-aware Scaling Diffusion Model for 3D Point Cloud Upsampling

Sunghwan Yoo, Gunho Sohn

York University, Canada

This paper introduces the Appearance-guided Scaling Diffusion Model (AGDM), a novel diffusion-based framework designed to densify sparse airborne laser scanning (ALS) point clouds while preserving fine geometric detail. Traditional diffusion models for 3D upsampling, such as LiDiff and PUDM, operate solely on intrinsic 3D information and struggle to reconstruct sharp edges and continuous surfaces when input data are extremely sparse. AGDM addresses these limitations by integrating two complementary conditional priors: multi-view appearance cues and geometry-aware 3D features.

Sparse point clouds are first rendered into ten synthetic viewpoints, and a Vision Transformer extracts high-level visual embeddings that encode surface appearance and boundary structures. In parallel, a Minkowski-based encoder processes the input geometry to capture spatial continuity and local shape characteristics. A cross-attention fusion module aligns and combines these modalities, producing a unified conditioning signal that guides a scaling diffusion network during iterative denoising.

AGDM is trained and evaluated on the YUTO dataset, where dense ground-truth scenes are reconstructed from multi-mission ALS data. Experiments demonstrate that AGDM achieves superior performance across Chamfer Distance, Jensen–Shannon Divergence, F1 score, and multi-scale IoU metrics. Qualitative results further show that the model produces more uniform, edge-preserving, and structurally coherent point clouds than existing diffusion approaches.

By leveraging appearance guidance alongside geometric priors, AGDM significantly improves the fidelity and practicality of LiDAR point-cloud upsampling, offering an effective pathway for scalable and cost-efficient 3D digital-twin generation.
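
Chamfer Distance, one of the metrics listed in this abstract, measures how well two point sets cover each other. Definitions vary between papers (squared or unsquared distances, sums or means), and the abstract does not say which variant is used, so the brute-force sketch below assumes the unsquared, mean-based symmetric form. The function name is hypothetical.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)  # all pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

The pairwise matrix needs memory proportional to the product of the two set sizes, so real evaluations on dense scenes would normally use a spatial index instead.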

8:45am - 9:00am

Scan Outlier Ratio (ScOR): LiDAR Scanning and Survey-Aware Filtering of Detached Points in Terrestrial and Permanent Laser Scanning Point Clouds

Ronald Tabernig¹,², Bernhard Höfle¹,²

¹3DGeo Research Group, Institute of Geography, Heidelberg University, Heidelberg, Germany; ²Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Heidelberg, Germany

Accurate 3D surface reconstruction and change analysis rely on point clouds representing persistent solid surfaces and should neglect very small (< laser footprint size) and temporary objects that create outliers. Terrestrial and Permanent Laser Scanning (TLS/PLS) data often contain transient or detached points, which violate assumptions of common cloud-, mesh-, and surface-based 3D change analysis methods. These points cause wrong correspondences and change values in multi-temporal point cloud comparison. We address this with the Scan Outlier Ratio (ScOR) filter, a LiDAR scanning and survey-aware descriptor designed to identify points unsuitable for most point cloud-based change analysis methods. ScOR compares the measured point spacing with the expected spacing, assuming the surface is locally planar and orthogonal to the incoming laser beam. ScOR works with a single scan or multiple scans acquired from the same position, enabling multi-temporal neighborhoods for filtering. Using data from natural and urban environments, we analyze ScOR across different surfaces, neighborhood sizes, temporal neighborhoods, and compare it with the Statistical Outlier Removal (SOR) algorithm. Results show that ScOR successfully removes non-surface points, while preserving surface information. In our experiments, the true positive rate exceeds 95% in all but one case, while the false positive rate remains below 10% throughout. With neighborhoods from subsequent and aggregated epochs, the method automatically detects and removes large temporary objects (e.g., a person). Due to its interpretability, efficiency, and range-aware design, ScOR provides an effective pre-processing method for automated and near real-time 3D surface change analysis with TLS/PLS.
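
The comparison of measured and expected point spacing can be illustrated with a deliberately simplified descriptor. This is not the published ScOR definition, which the abstract does not spell out. It assumes the expected spacing on a planar, beam-orthogonal surface is roughly range times angular step (small-angle approximation) and takes the nearest-neighbour distance as the measured spacing. All names and the brute-force neighbour search are hypothetical.

```python
import numpy as np

def spacing_ratio(points, scanner_pos, angular_step_rad):
    """Ratio of nearest-neighbour spacing to range-dependent expected spacing."""
    pts = np.asarray(points, dtype=float)
    ranges = np.linalg.norm(pts - scanner_pos, axis=1)
    expected = ranges * angular_step_rad          # small-angle approximation
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                   # ignore self-distances
    measured = d.min(axis=1)
    return measured / expected
```

Points on a well-sampled surface give ratios near 1, while isolated points floating between scanner and surface give much larger values and could be flagged by a threshold.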

9:00am - 9:15am

LiDAR-Enhanced 3D Gaussian Splatting SLAM for Planetary Rover Exploration

Lingxiao Zhang¹, Rong Huang¹,², Yusheng Xu¹,², Zhen Ye¹,², Xiong Xu¹,², Changjiang Xiao¹,², Xiaohua Tong¹,²

¹College of Surveying and Geo-Informatics, Tongji University, Shanghai 200092, China; ²Shanghai Key Laboratory for Planetary Mapping and Remote Sensing for Deep Space Exploration, Shanghai 200092, China

Autonomous positioning and scene reconstruction are crucial to the exploration and scientific research tasks of planetary rovers. 3D Gaussian splatting (3DGS) provides a new paradigm for dense reconstruction. However, reconstruction methods that rely only on monocular images suffer from scale ambiguity and insufficient geometric consistency. These problems are more prominent in planetary scenes, which lack geometric constraints and have weak textures. To overcome these limitations, we propose a LiDAR-enhanced 3DGS-SLAM pipeline. Sparse LiDAR measurements are introduced as prior information to improve depth prediction and to ensure that Gaussians are initialized consistently at physical scale. The camera poses and Gaussian parameters are then optimized through differentiable rendering to achieve robust localization and photometric-geometric consistency. Experiments on Erfoud, a planetary analog dataset, show that our method outperforms advanced 3DGS-based SLAM systems: the ATE is reduced by more than 50%, and PSNR, SSIM, and LPIPS all improve significantly.

9:15am - 9:30am

Sensor Domain Adaptation for 3D Object Detection via LiDAR Super-Resolution

June Moh Goo, Zichao Zeng, Jan Boehm

University College London, United Kingdom

LiDAR-based perception models’ performance can degrade sharply when applied to data from sensors different to those they were trained on. LiDAR super-resolution aims to enhance sparse point clouds from low-cost sensors. This can help to bridge the sensor domain gap to higher resolution LiDAR. Prior work has primarily focused on reconstruction quality metrics for super-resolution with limited evaluation of downstream perception tasks. We address this gap by conducting a systematic analysis of how super-resolution quality impacts 3D object detection performance. We evaluate detection capability through zero-shot transfer experiments on the KITTI object dataset. Four representative detectors (SECOND, PointPillars, PV-RCNN, PointRCNN) trained on high-resolution data are directly applied to super-resolved low-resolution data without fine-tuning. Results reveal a critical insight: reconstruction improvements yield vastly different detection gains across architectures. PointPillars shows minimal improvement until reaching high reconstruction quality, then performance improves significantly. In contrast, PV-RCNN exhibits steady gains throughout. The highest-quality reconstruction closes up to 86% of the performance gap and enables detection in safety-critical scenarios, including distant vehicles and small pedestrians, where lower-quality methods fail entirely. This work establishes that LiDAR super-resolution effectiveness depends on both reconstruction quality and detector architecture.

9:30am - 9:45am

Ray Queries On Raw Point Clouds

Balthasar Teuscher, Paul Walther, Kwasi Nyarko Poku-Agyemang, Martin Werner

Technical University of Munich, Germany; TUM School of Engineering and Design, Department of Aerospace and Geodesy, Professorship of Big Geospatial Data Management

Retrieving information from point clouds for analysis and visualization has gained ever-increasing interest. A growing niche in this regard is ray queries, commonly used for image synthesis. Ray tracing is widely used in computer graphics, with a multitude of solutions based on bounding volume hierarchies. However, these solutions are rarely straightforward to integrate with raw point cloud data and geospatial analytical workflows. To overcome this, we present a novel approach to ray tracing in raw point clouds that builds upon and extends existing geospatial indices. The solution is exemplified by a fast octree implementation that supports versatile query semantics, such as neighborhood queries with constraints on k and radius for both points and rays, while offering configurable data organization schemes, including layered, fixed, and adaptive depth. The evaluation demonstrates satisfactory speed and capabilities for many scientific use cases, while simultaneously exhibiting low implementation costs, high flexibility, and simplicity in integrating ray tracing into analytical point cloud workflows.
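
As a reference for what a ray query on a raw point cloud returns, the brute-force sketch below selects all points within a given radius of a ray and orders them along it. It is an assumed baseline for illustration only: the octree index, the k constraint, and the data organization schemes described in the abstract are not reproduced, and the function name is hypothetical.

```python
import numpy as np

def points_near_ray(points, origin, direction, radius):
    """Indices of points within `radius` of the ray, ordered along the ray."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    rel = np.asarray(points, dtype=float) - origin
    along = rel @ d                               # signed distance along the ray
    perp = np.linalg.norm(rel - np.outer(along, d), axis=1)
    hit = np.flatnonzero((along >= 0) & (perp <= radius))
    return hit[np.argsort(along[hit])]
```

A spatial index replaces the full scan with a traversal of only those cells the ray passes through, which is where the speed-up of an indexed implementation would come from.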

9:45am - 10:00am

Analysis of free large Area covering Elevation Models and improvement by ICESat-2

Karsten Jacobsen

Leibniz University Hannover, Germany

Accuracy analysis of free elevation models TDX-EDEM, AW3D30, SRTM and ASTER GDEM-3. Determination of systematic elevation model errors by Z-shift, model tilt and systematic errors as function of X and Y. Comparison with ICESat-2 data, determination of the systematic elevation model errors by ICESat-2 ATL08 data and correcting the free elevation models. Accuracy analysis of the corrected elevation models by airborne LiDAR data.

The corrections based on the ICESat-2 data significantly improved the free elevation models.
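A Z-shift and a model tilt can be estimated together by fitting a plane to the height differences between the elevation model and reference points. The sketch below assumes the model dz = shift + tilt_x * X + tilt_y * Y and uses synthetic data; it does not cover the further systematic errors as a function of X and Y mentioned in the abstract, and all names are hypothetical.

```python
import numpy as np


def fit_shift_and_tilt(x, y, dz):
    """Least-squares fit of dz = shift + tilt_x * x + tilt_y * y."""
    design = np.column_stack([np.ones_like(x), x, y])
    coeffs, *_ = np.linalg.lstsq(design, dz, rcond=None)
    return coeffs  # shift, tilt_x, tilt_y


# Synthetic height differences (metres) at five reference locations.
x = np.array([0.0, 1000.0, 0.0, 1000.0, 500.0])
y = np.array([0.0, 0.0, 1000.0, 1000.0, 500.0])
dz = 1.5 + 0.002 * x - 0.001 * y

shift, tilt_x, tilt_y = fit_shift_and_tilt(x, y, dz)
# Subtracting the fitted plane removes the modelled systematic part.
corrected = dz - (shift + tilt_x * x + tilt_y * y)
```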

1:30pm - 3:00pm

WG III/1F: Remote Sensing Data Processing and Understanding
Location: 713A

1:30pm - 1:45pm

From Image to Perception: Scene-Graph-Driven Modeling of Human-Scale Urban Experience with Street-view Images

Haipeng Yang, Xian Guo

Beijing University of Civil Engineering and Architecture, China, People's Republic of

This study examines how street-view scenes relate to urban perception using a scene-graph-driven modeling method. Each image is parsed into subject–predicate–object triplets; entity appearance from a CNN backbone and relation semantics from a Transformer detector are fused at node level via a learnable gate. A relation-aware graph neural network performs message passing and attentive readout to predict six perception dimensions (beautiful, boring, depressing, lively, safe, wealthy). Taking the Place Pulse 2.0 dataset as a benchmark, we convert pairwise votes to binary labels per dimension with standard train/validation/test splits. Experiments compare the graph approach against CNN+SVM and Transformer+SVM baselines under identical protocols. Results show consistently higher accuracy across all six dimensions, with notable gains for beautiful and wealthy. Gradient and integrated-gradient analyses offer node- and edge-level attributions, highlighting elements such as trees, facades, and overhead wires. The method balances accuracy with clarity, and the results point to practical cues that can support human-centered urban design.
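The abstract does not give the form of the learnable gate. A common choice, assumed here purely for illustration, is a sigmoid gate computed from the concatenated features that blends the two inputs element-wise. The function name, shapes and parameters below are hypothetical.

```python
import numpy as np


def gated_fusion(appearance, relation, weight, bias):
    """Blend two node feature matrices of shape (n, d) with a sigmoid gate.

    weight has shape (2 * d, d) and bias has shape (d,); both would be learned.
    """
    gate_input = np.concatenate([appearance, relation], axis=-1) @ weight + bias
    gate = 1.0 / (1.0 + np.exp(-gate_input))
    # gate close to 1 favours appearance features, close to 0 favours relation features.
    return gate * appearance + (1.0 - gate) * relation


# With zero weights and bias, the gate is 0.5 everywhere, so the output is the plain average.
fused = gated_fusion(np.ones((2, 3)), np.zeros((2, 3)), np.zeros((6, 3)), np.zeros(3))
```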

1:45pm - 2:00pm

Real-Time Road Condition Detection and Mapping Using YOLOv11 and Built-In Car Dashcam

Harjot Josan¹, Frank Zhang², Baoxin Hu³

¹University of the Fraser Valley (UFV), Canada; ²University of the Fraser Valley (UFV), Canada; ³Dept. of Earth and Space Science and Engineering, York University, Toronto, Canada

Road surface conditions decline due to heavy traffic volumes, severe weather, and recurring utility works, yet many road agencies still rely on manual windshield surveys and semi-automated inspections. These methods are time-consuming, labour-intensive, and difficult to scale. Recent advances in deep learning and the widespread availability of built-in vehicle dashcams offer new opportunities for low-cost, automated pavement assessment. This contribution presents a mobile, dashcam-based framework for detecting road-surface defects using YOLOv11, combined with geolocation tagging for spatial visualization.

To test our YOLOv11 model, we collected an initial dataset at the University of the Fraser Valley campus and manually annotated it to identify crack fillings, crosswalk markings, speed bumps, lane markings, and other surface conditions. This is a prototype; the model would later be trained to detect further road conditions, such as gravel, potholes, and uneven roads. To address variations in lighting and motion, augmentation techniques were applied. YOLOv11 achieved a mean average precision above 90% across all tested categories.

This prototype demonstrates a practical, low-cost approach for real-time pavement monitoring. Future work includes expanding data collection, developing an operational dashboard for road authorities, having exact GPS coordinates pinned on maps with damaged road images, and evaluating model performance across different data sources, including models trained through Google Images. By producing actionable geospatial information, this system supports more efficient maintenance workflows and offers a scalable pathway for municipalities seeking to modernize road-condition assessment.

2:00pm - 2:15pm

Towards Global Interpretability: Evaluating XAI Metrics in Building Footprint Extraction

Elif Ozlem Yilmaz, Taskin Kavzoglu

Gebze Technical University, Turkiye

Global population is projected to increase by about 70% by 2050, with a growing proportion of people living in urban areas. This trend highlights the importance of accurately assessing urban expansion. Automatic building detection from remotely sensed imagery using deep learning (DL) has demonstrated considerable potential for applications, including sustainable urban planning and infrastructure monitoring. However, the inherent black-box nature of DL models limits their transparency and reduces trust in model-driven decisions. Although various Explainable Artificial Intelligence (XAI) approaches have been proposed to highlight image regions influencing model predictions, qualitative visual inspection alone is insufficient for reliably evaluating the credibility of these explanations. This study evaluates several XAI techniques for building footprint extraction using a U-Net model trained on a refined Massachusetts Buildings Dataset. The segmentation model achieved precision, recall, F1-score, IoU, and overall accuracy values of 89.68%, 85.69%, 87.53%, 79.03%, and 94.35%, respectively. To investigate the model’s decision-making process, three explanation methods, namely Saliency, GradientSHAP, and GuidedGradCAM, were applied. The quality of the generated explanations was then quantitatively assessed using 16 evaluation metrics. Beyond single-image analysis, a dataset-level evaluation was conducted using 547 image patches containing building coverage greater than 20%. The results indicate that GuidedGradCAM produces more consistent and reliable explanations. Furthermore, dataset-level analysis using dense-building samples provides a statistically more robust representation of overall model behaviour compared to evaluations based on individual images. These findings highlight the importance of quantitative assessment in validating the interpretability of DL models for building footprint extraction.
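For reference, the five segmentation scores named in this abstract follow from pixel-level confusion counts for the building class. The sketch below uses the standard binary definitions and made-up counts; how the authors aggregate over images is not stated, so this is an assumption, and the function name is hypothetical.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Standard binary segmentation metrics from pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "iou": tp / (tp + fp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Made-up counts for illustration only.
scores = segmentation_metrics(tp=80, fp=20, fn=20, tn=880)
```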

2:15pm - 2:30pm

MaskRoof: A deep Learning Framework and Benchmark Dataset for fine-grained urban Rooftop Utilization and potential Analysis

Jinfeng Xie¹, Haojie Yang², Lingshuang Dong³, Anthony Yeh¹, Yi Zhang²

¹The University of Hong Kong, Hong Kong S.A.R. (China); ²Institute of Future Human Habitats, Tsinghua Shenzhen International Graduate School; ³Huawei Technologies Co., Ltd., Dongguan, Guangdong Province, China

Urban rooftops represent a critical vertical resource for sustainable development, yet comprehensive assessment of their utilization patterns and available capacity remains constrained by inadequate datasets and limited algorithmic capabilities. This study introduces the Urban Rooftop Utilization Dataset (URUD), the first multi-city, pixel-level semantic segmentation dataset encompassing 1,560 high-resolution satellite images from four Chinese cities. URUD establishes eight semantic categories including a novel "available area" class to address ambiguous regions that existing classification schemes fail to capture. The study further proposes MaskRoof, a transformer-based deep learning framework specifically designed for fine-grained rooftop analysis. The model integrates two task-specific modules, Hierarchical Zoom-in Attention (HZA) and Prior-Guided Cross-Attention (PGCA), to address challenges of small-scale target detection and class imbalance. Experimental results demonstrate that MaskRoof achieves superior performance with 94.46% accuracy and 47.29% mIoU, outperforming existing segmentation architectures. Application to Shanghai's outer ring area reveals that 60.74% of rooftop space remains available for utilization, with significant spatial heterogeneity across building types. Industrial and warehouse structures retain substantially greater unutilized areas compared to office and residential buildings. These findings provide quantitative evidence for differentiated urban planning strategies and demonstrate the framework's capability for large-scale rooftop potential assessment in complex urban environments.

2:30pm - 2:45pm

A comparison of CNN, Transformer, and open-vocabulary architectures for real-time photovoltaic defect detection using UAV thermal imagery.

Aissam Salah¹, Mouad Jabrane², Imane Sebari¹

¹Department of Photogrammetry and Cartography, School of Geomatics and Surveying Engineering, IAV Hassan II, Rabat, Morocco; ²Research Unit of Geospatial Technologies for a Smart Decision, IAV Hassan II, Rabat 10101, Morocco

Real-time defect detection in solar farms is critical for profitability and safety. This paper compares state-of-the-art (SOTA) object detection architectures for deployment on edge computing platforms for the purpose of thermal PV defect detection using UAV imagery. We systematically evaluated Closed-Set (YOLOv10, YOLOv12, RT-DETR, RF-DETR) and Open-Vocabulary (YOLO-World, OWL-ViT) models on a public thermal dataset. Our results highlight two leading candidates. The transformer-based RF-DETR sets a new accuracy record at 82.6% mAP@0.50, driven by its self-supervised backbone, but its inference speed is low (12.6 FPS). Conversely, the CNN-based YOLO-World integrates language semantics to reach a competitive 78.1% mAP@0.50 while operating at a real-time speed of 31.3 FPS. We conclude that both RF-DETR and YOLO-World are promising for embedded thermal fault detection. The final selection will depend on on-platform inference performance.

3:30pm - 5:15pm

ThS18: Advances in Reality Capture, AI, and Digital Twin Technologies for Construction Engineering
Location: 713A

3:30pm - 3:45pm

Image sequence based prediction of the temporal evolution of fresh concrete properties under realistic conditions

Max Meyer¹, Amadeus Langer¹, Max Mehltretter¹, Dries Beyer², Max Coenen³, Bastian Strybny³, Tobias Schack⁴, Michael Haist⁴, Christian Heipke¹

¹Institute of Photogrammetry and GeoInformation, Leibniz University Hannover, Germany; ²Feist Construct GmbH, Bad Pyrmont, Germany; ³Institute of Building Materials Science, Leibniz University Hannover, Germany; ⁴Institute of Construction Materials, University of Stuttgart, Germany

Advancing the level of digitalization and automation in concrete manufacturing can substantially contribute to lowering CO2 emissions associated with the concrete production. This work introduces a new methodology for predicting the time-dependent properties of fresh concrete during mixing. For the prediction, a deep learning network is created which uses stereoscopic image sequences of the flowing material together with tabular data as input. Besides mix design parameters and process state data, like energy consumption, moisture and fresh concrete temperature, temporal information is included in the tabular data. The temporal information represents the time interval between image acquisition and the time for which the properties should be predicted. During training, this interval corresponds to the difference between the image acquisition and the time at which reference measurements are taken, allowing the network to implicitly learn the temporal evolution of the material properties, namely the slump flow diameter, yield stress, and plastic viscosity. Incorporating time-dependent prediction enables the forecasting of property changes throughout the mixing process, offering a valuable tool for real-time process control. This capability allows timely adjustments whenever deviations from the desired material behavior are detected. The experimental investigations presented in this paper demonstrate the feasibility of this method under realistic conditions.

3:45pm - 4:00pm

Single-image to model registration for semantic enrichment of indoor BIM

Dorota Włodarczyk, Małgorzata Jarząbek-Rychard

Institute of Geodesy and Geoinformatics, Wrocław University of Environmental and Life Sciences, Poland

Effective integration of geometric and semantic data within Building Information Models (BIM) is essential for the efficient life cycle management of modern facilities. However, maintaining accurate as-is BIM models for existing buildings remains a significant challenge, as manual updates are labour-intensive and full 3D reconstruction is often impractical for incremental changes. In such cases, image-based approaches offer a fast and flexible alternative, but require reliable alignment of 2D imagery with existing BIM geometry. To address this challenge, this study introduces a streamlined pipeline for semantic enrichment that uses a single-image visual localisation approach to directly align 2D imagery with existing BIM geometry. The proposed method integrates transformer-based panoptic segmentation (Mask2Former) with a closed-form Perspective-n-Line solver to estimate 6-degrees-of-freedom (6-DoF) camera poses. The novelty of the proposed approach lies in the explicit use of semantic information as a geometric constraint to guide the selection of 2D–3D correspondences for pose estimation. Semantic labels are utilised to filter line correspondences, ensuring that only stable architectural boundaries (e.g., walls, floors, and ceilings) are used in the registration process. Such semantic filtering stabilises correspondence selection, effectively mitigating pose ambiguity in repetitive indoor layouts or scenes where structural elements are partially obscured by furniture and clutter. Experimental results confirm high accuracy, achieving a median position error of 9.84 cm and an orientation error of 1.05° in complex indoor environments. This robust registration framework provides a reliable foundation for the downstream semantic enrichment and digital twin updates.

4:00pm - 4:15pm

LSTNet: Local Shape Transformer Network for Road Marking Extraction

Jiafeng Wu¹,²,³, Chaorui Liu¹,²,³, Jiajun Shi¹,²,³, Jonathan Li¹,²,³,⁴, Lingfei Ma¹,²,³,⁴

¹Key Laboratory of Geographic Information Science (Ministry of Education), East China Normal University, Shanghai 200241, China; ²Key Laboratory of Spatial-temporal Big Data Analysis and Application of Natural Resources in Megacities, Ministry of Natural Resources, East China Normal University, Shanghai 200241, China; ³School of Geospatial Artificial Intelligence, East China Normal University, Shanghai 200241, China; ⁴Hinton STAI Institute, East China Normal University, Shanghai 200241, China

Road markings are vital for HD maps and autonomous driving, yet LiDAR-based extraction is difficult due to missing RGB information, severe class imbalance, and thin, elongated geometry under sparse and noisy returns (Ma et al., 2020). We propose LSTNet, which performs local-shape tokenization by grouping points on tangent planes and encoding tokens from relative coordinates, normals, curvature, and intensity contrast. A geometry-aware transformer aggregates these tokens across multiple scales with attention biased by relative position and normal similarity, capturing long and thin structures while preserving edges. Our contributions can be summarized as follows: (1) We present LSTNet, which directly segments road marking from 3D point clouds, avoiding image conversion and preserving geometric fidelity. (2) We introduce a dedicated point-cloud dataset for road marking extraction to enable training and fair evaluation. (3) We design a task-specific and boundary-aware training objective that improves thin road marking recall and robustness under class imbalance.

4:15pm - 4:30pm

Automatic 3D Building Model Generation for Energy Digital Twins

Oscar Roman¹,², Giorgio Agugiaro³, Ken Arroyo Ohori³, Maarten Bassier⁴, Elisa Mariarosaria Farella¹, Fabio Remondino¹

¹3D Optical Metrology, Bruno Kessler Foundation, via Sommarive 18, Trento, Italy; ²University of Trento, EICS and DII Department, Trento, Italy; ³3D Geoinformation group, Department of Urbanism, Faculty of Architecture and Built Environment, Delft University of Technology, Delft, The Netherlands; ⁴Department of Civil Engineering, TC Construction - Geomatics, KU Leuven - Faculty of Engineering Technology, Ghent, Belgium

The concept of Digital Twins (DTs) in Architecture, Engineering and Construction (AEC) domain encompasses a wide range of applications and scales, from single buildings to entire cities, spanning monitoring, simulation, energy management and operational control. Regardless of the specific application, a valid Digital Twin (DT) is a dynamic, data-driven model that stays continuously synchronized with its physical counterpart in both time and state via sensors and the Internet of Things (IoT). It must receive real-world input and provide feedback for analysis or control, ultimately progressing toward a self-operational DT. In the energy domain, an Energy Digital Twin (EDT) must be designed to (i) include sufficient geometric information (ii) support continuous monitoring, (iii) assist scenario-based simulation and (iv) enable operational maintenance and decision support. To achieve these objectives, the EDT’s geometry should be managed through two complementary representations: (i) a watertight solid volumetric model for physics-based simulation and (ii) a boundary representation (B-Rep) model for precise topology, semantics and data exchange. A mapping layer keeps the two representations consistent, preserving identity and topology across states and linking to the graph. Consequently, the EDT should adopt a multi-level architecture defining both geometric and data structures. This work introduces a robust Scan-to-Energy Digital Twins (Scan-to-EDTs) framework that generates multi-level building EDTs by integrating geometric, semantic and simulation layers to enable interoperable energy analyses.

4:30pm - 4:45pm

From propagation to prediction: point-level uncertainty evaluation of MLS point clouds under limited ground truth

Ziyang Xu¹, Olaf Wysocki³, Christoph Holst¹,²

¹Chair of Engineering Geodesy, TUM School of Engineering and Design, Technical University of Munich; ²TUM Leonhard Obermeyer Center, Technical University of Munich; ³CV4DT, University of Cambridge

Evaluating uncertainty is critical for reliable use of Mobile Laser Scanning (MLS) point clouds in many high-precision applications such as Scan-to-BIM, deformation analysis, and 3D modeling. However, obtaining the ground truth (GT) for evaluation is often costly and infeasible in many real-world applications. To reduce this long-standing reliance on GT in uncertainty evaluation research, this study presents a learning-based framework for MLS point clouds that integrates optimal neighborhood estimation with geometric feature extraction. Experiments on a real-world dataset show that the proposed framework is feasible and the XGBoost model delivers fully comparable accuracy to Random Forest while achieving substantially higher efficiency (about 3 times faster), providing initial evidence that geometric features can be used to predict point-level uncertainty quantified by the C2C distance. In summary, this study shows that MLS point clouds' uncertainty is learnable, offering a novel learning-based viewpoint towards uncertainty evaluation research.
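The C2C (cloud-to-cloud) distance is commonly taken as the distance from each point to its nearest neighbour in a reference cloud. That definition is assumed in the sketch below, since the abstract does not specify the variant used. The brute-force search is only practical for small clouds; a spatial index such as a k-d tree would normally replace it. The function name is hypothetical.

```python
import numpy as np


def c2c_distances(source, reference):
    """Nearest-neighbour distance from each source point to the reference cloud.

    source has shape (n, 3), reference has shape (m, 3); memory use is O(n * m).
    """
    diff = source[:, None, :] - reference[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)


source = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
reference = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
distances = c2c_distances(source, reference)
```

Per-point values of this kind would then serve as the regression target for a model trained on geometric features.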

4:45pm - 5:00pm

Automatic Scan-to-BIM: The Impact of Semantic Segmentation Accuracy on Opening Detection

Jidnyasa Patil, Arcot Sowmya, Mohsen Kalantari

University of New South Wales, Sydney, Australia

The automation of Scan-to-BIM remains a major challenge within the Architecture, Engineering, and Construction industry, particularly in the detection and geometric characterisation of architectural openings such as doors and windows. Although recent advances in 3D semantic segmentation have improved the classification of architectural elements, the effect of segmentation accuracy on downstream geometric detection and reconstruction is still under study. This work compares five state-of-the-art deep learning models, PointNeXt, PointMetaBase, Point Transformer V1, Point Transformer V3, and Swin3D, on opening detection in Scan-to-BIM. A unified evaluation framework integrating DBSCAN clustering with axis-aligned bounding box fitting is introduced to generate per-instance geometric representations. The models are assessed using semantic metrics and geometric reliability indicators, including centroid error, dimensional deviation and 3D IoU. Experiments on the S3DIS Area 5 dataset reveal notable performance differences across models. Swin3D achieved the highest door detection rate of 96.9%, followed by PointMetaBase at 92.9%, PointNeXt at 87.4%, PTV3 at 85.0%, and PTV1 at 81.9%. Window detection proved more challenging, with Swin3D and PTV3 both achieving 75.0%, PTV1 at 71.2%, and PointNeXt and PointMetaBase at 67.3%. Notably, PointMetaBase produced strong geometric accuracy for doors despite lower semantic scores. These results suggest that high segmentation accuracy does not always lead to precise geometric reconstruction. To assess generalisation, the trained models were applied to 11 Matterport3D rooms, confirming that the observed patterns extend across different scanning environments. This study concludes that in Scan-to-BIM workflows, greater emphasis should be placed on geometric reconstruction algorithms than segmentation performance alone.
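The 3D IoU between a fitted and a reference axis-aligned bounding box can be computed as below. This is a minimal sketch of the standard definition rather than the authors' evaluation code. The box layout (xmin, ymin, zmin, xmax, ymax, zmax) and the function name are assumptions.

```python
def aabb_iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes.

    Each box is (xmin, ymin, zmin, xmax, ymax, zmax).
    """
    intersection = 1.0
    for axis in range(3):
        overlap = min(box_a[axis + 3], box_b[axis + 3]) - max(box_a[axis], box_b[axis])
        if overlap <= 0:
            return 0.0  # Boxes are separated (or only touch) along this axis.
        intersection *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return intersection / (volume(box_a) + volume(box_b) - intersection)
```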

5:00pm - 5:15pm

Fast and accurate point surveying using the PIX4Dcatch mobile app

Giulia Rovelli¹, Marta Coelho Lopes², Gaia Amaranta Taberna¹, Adrian Fernandez¹, Paloma Pomares¹, Jean-Baptiste Magnin¹, Andrei Mitache¹, Davide Antonio Cucci¹, Christoph Strecha¹, Pierangelo Rothenbühler¹

¹PIX4D SA, Switzerland; ²École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

The digitalization of the architecture, construction and subsurface utility engineering sectors demands efficient, accurate and flexible 3D point surveying methods. Established ones based on Global Navigation Satellite System (GNSS) rovers or total stations suffer from significant limitations, such as requiring open-sky visibility, high costs and complex setups. This paper introduces a novel method for georeferencing 3D points using the PIX4Dcatch mobile application coupled with an external Real-Time Kinematic (RTK) GNSS receiver. The method makes it possible to survey a point of interest by simply aiming the smartphone and tapping on the screen during a capture. A lightweight, modified Bundle Adjustment algorithm runs on the device, delivering accurate 3D coordinates in seconds without any post-processing. We evaluated the method by surveying several known cadastral points hundreds of times across diverse field conditions, achieving a mean planimetric error norm of approximately 3 cm, with 97% of errors below 10 cm. Similar statistics are achieved with single-point measurements using an RTK rover. Although not intended to replace millimeter-precision instruments, the accuracy profile of our method is well suited for many applications, such as subsurface utility mapping, which often have decimeter-level regulatory requirements. Given its high efficiency, low cost and ease of use, we believe that our method has the potential to transform as-built documentation workflows in diverse engineering sectors.
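The two accuracy statistics quoted in this abstract, a mean planimetric error and the share of errors below 10 cm, can be derived from repeated measurements as sketched below. The coordinates are invented for illustration, the planimetric error is assumed to be the horizontal (easting, northing) distance to the known point, and the function name is hypothetical.

```python
import math


def planimetric_stats(measured, reference, threshold_m=0.10):
    """Mean horizontal error and share of errors below a threshold (metres)."""
    errors = [
        math.hypot(e - e_ref, n - n_ref)
        for (e, n), (e_ref, n_ref) in zip(measured, reference)
    ]
    mean_error = sum(errors) / len(errors)
    share_below = sum(err < threshold_m for err in errors) / len(errors)
    return mean_error, share_below


# Invented offsets (metres) of four measurements from one known point at the origin.
measured = [(0.03, 0.04), (0.0, 0.0), (0.3, 0.4), (0.0, 0.02)]
reference = [(0.0, 0.0)] * 4
mean_error, share_below = planimetric_stats(measured, reference)
```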

Date: Saturday, 11-July-2026

8:30am - 10:00am

ThS21: The Global-local Exchange Loop: Coupling Earth Observation and Citizen Sciences for LCLU Mapping
Location: 713A

8:30am - 8:45am

OntoLULC-SOTA: An ontology based approach to make systematic reviews for LULC data

Martin Cubaud¹, Ana-Maria Olteanu-Raimond¹, Cidalia C. Fonte²,³, Diogo Duarte²,⁴, Jacinto Estima⁵, Linda See⁶, Nicolas Gonthier¹, Laurence Jolivet¹, Clément Mallet¹, Arnaud Le Bris¹, Vyron Antoniou⁷

¹Univ Gustave Eiffel, Géodata Paris, IGN, LASTIG, F-77454 Marne-la-Vallée, France; ²Institute for Systems Engineering and Computers at Coimbra (INESC Coimbra), 3030-290 Coimbra, Portugal; ³University of Coimbra, Department of Mathematics, Apartado 3008, EC Santa Cruz, 3001-501 Coimbra, Portugal; ⁴Department of Electrical and Computer Engineering, Polo 2, 3030-290 Coimbra, Portugal; ⁵University of Coimbra, CISUC, Department of Informatics Engineering, Rua Sílvio Lima, 3030-290 Coimbra, Portugal; ⁶International Institute for Applied Systems Analysis (IIASA), Laxenburg, Austria; ⁷Hellenic Army Geographical Directorate, 15561 Cholargos, Greece

Land Use (LU) and Land Cover (LC) data allow us to understand the physical and human activities associated with a given land. Thus, LULC is a dynamic and highly researched field. LULC review papers are numerous and provide high-level insights about the proposed approaches, the data used, the study cases, the strengths and limitations, and the identification of new research gaps. Nevertheless, these reviews are not systematic and reproducible. The goal of this work is to propose an ontology to help the research community conduct systematic and shareable literature reviews and comparable analytical analyses of scientific papers. To achieve this, we formalize their metadata, content, strengths, and weaknesses. In particular, we consider the scientific paper as the central element of our ontology and we define formal semantics for all relevant items (data process, LULC life cycle and scientific paper). We hope to open the path to more efficient synthesis, discovery, and reuse of research outcomes from the literature. To facilitate the instantiation process and make it accessible to a broader range of researchers, we designed a tabular-based template. We used our template to simulate the process of conducting a literature review on three use cases: building function, global land cover mapping, and multi-class change detection.

8:45am - 9:00am

Manual Annotations meet Fine-Tuned Foundation Models: a Comparison on Tree Crown Segmentation Task

Rewanth Ravindran, Janik Steier, Samer Karam, Dorota Iwaszczuk

Technical University of Darmstadt, Germany

Accurate segmentation of individual tree crowns (ITCs) from remote-sensing imagery is essential for forest monitoring and ecological analysis, yet remains challenging due to overlapping canopies and structural variability. The Segment Anything Model (SAM) shows strong generalization capabilities but requires effective prompting and domain adaptation for remote sensing applications. In this study, we investigate a lightweight fine-tuning strategy using Low-Rank Adaptation (LoRA) to adapt SAM for ITC segmentation on the BAMFORESTS dataset. The impact of different prompting strategies is evaluated, including manually annotated point and bounding box prompts, as well as automatically generated bounding boxes derived from a pre-trained tree detector. SAM is fine-tuned with instance-level ITC masks, enabling prompt-aware segmentation of multiple tree crowns per image. Performance is assessed before and after fine-tuning using standard instance segmentation metrics, including IoU and F1-score. Results show that LoRA-based adaptation improves mask delineation and robustness to prompt variability, with bounding box prompts consistently outperforming point-based inputs. Automatically generated prompts enable a fully automated workflow, although their effectiveness depends on detection quality. Evaluation on an independent validation site with manually annotated ITC labels shows that the fine-tuned LoRA-SAM model achieves performance comparable to manual annotations, while significantly reducing annotation effort. These findings highlight the importance of prompt design in adapting foundation models for remote sensing tasks and demonstrate that parameter-efficient fine-tuning provides a practical pathway toward scalable ITC segmentation.

9:00am - 9:15am

Evaluation of the IGN FLAIR-HUB Model Transferability Performance for Land Cover Mapping in Iasi, Romania

Ana-Maria Loghin¹, Loredana-Mariana Crenganis¹, Constantin Stoian², Ana-Maria Olteanu-Raimond², Anatol Garioud², Valeria-Ersilia Oniga¹, Bogdan Rusu¹

¹"Gheorghe Asachi" Technical University of Iasi, Romania; ²Univ. Gustave Eiffel, IGN-ENSG, LaSTIG – Saint-Mande, France

This research rigorously evaluates the transferability of the pre-trained FLAIR-HUB deep learning model, developed by the French National Institute of Geographical and Forest Information (IGN), in terms of spatial generalizability and multi-resolution robustness, when transferred from its native French domains to the complex urban-agricultural landscape of Iasi, Romania.

The core objective of this investigation is to test the model's performance stability across severe multi-resolution domain shifts and temporal scenarios. The model architecture is applied to orthophotos acquired over Iasi in 2019 (at 0.5 m resolution) and 2024 (at 0.2 m and at a very high resolution of 0.084 m), enabling a comprehensive assessment of cross-resolution and temporal robustness.

A novel validation framework is introduced, combining conventional 2D raster-based evaluation with a 3D point-wise assessment using semantically labeled UAV-derived point clouds. The results demonstrate strong performance for dominant classes such as buildings and herbaceous vegetation, with improved accuracy at higher spatial resolution, while stable classes such as buildings and impervious surfaces show a comparatively robust performance, confirming the model’s capability to consistently represent invariant land cover types. However, performance decreases for heterogeneous and vegetation-related classes due to seasonal variability and class complexity. The 3D validation reveals slightly lower but consistent results, highlighting its role as a more rigorous evaluation approach. Overall, the study confirms the potential of transferring pre-trained semantic segmentation models to new geographic contexts, while emphasizing the importance of spatial resolution, temporal consistency, and validation strategy.

9:15am - 9:30am

Towards efficient Giant Tree Inventories: Deep Learning with crowdsourced Training Data

Yu-Hui Wang¹, Chi-Kuei Wang¹, Chung-Cheng Lee¹, Rebecca Chia-Chun Hsu²

¹Dept. of Geomatics, National Cheng Kung University, Chinese Taipei; ²Forest Ecology Division, Taiwan Forestry Research Institute, Chinese Taipei

Airborne Laser Scanning (ALS) data have been used to identify giant trees in Taiwan, yet the current workflow relies on volunteers to visually inspect ALS profile images. This study proposes to replace the volunteer-based verification step by applying deep learning to ALS profile images. Candidate treetop locations were first extracted from a Canopy Height Model (CHM) using a 65 m threshold and local maxima filtering. For each candidate, a representative ALS profile image was generated following an automated angle-selection method based on terrain fitting.

An EfficientNetV2-S model was trained using volunteer-labelled profile images from previous nationwide surveys. After label cleaning, a refined dataset was constructed, and a hybrid resampling strategy was applied to address class imbalance. The final model achieved 99.0% overall accuracy, 98.1% precision, and 100% recall on the independent test set, successfully detecting every true giant tree.

To evaluate generalization, the model was applied to 97,487 candidates from the latest national ALS survey. Predictions exhibited a strongly bimodal confidence distribution, demonstrating stable separation between true and false positives and effectively reducing the manual inspection workload.

This study shows that deep learning can reliably replace crowdsourced verification, enabling scalable and efficient updates of large-scale forest inventories.
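The candidate extraction step in this abstract (a height threshold on the CHM followed by local maxima filtering) can be sketched as follows. The window size, the use of "greater than or equal" at the 65 m threshold, and the function name are assumptions; the abstract does not state these details.

```python
import numpy as np


def treetop_candidates(chm, min_height=65.0, window=3):
    """Return (row, col) cells that reach min_height and are the maximum of their window."""
    chm = np.asarray(chm, dtype=float)
    half = window // 2
    # Pad with -inf so that border cells are compared only against real heights.
    padded = np.pad(chm, half, mode="constant", constant_values=-np.inf)
    candidates = []
    for r in range(chm.shape[0]):
        for c in range(chm.shape[1]):
            height = chm[r, c]
            if height < min_height:
                continue
            if height >= padded[r:r + window, c:c + window].max():
                candidates.append((r, c))
    return candidates


# Toy 5 x 5 CHM (metres): two cells exceed the threshold, one tall cell stays below it.
chm = np.full((5, 5), 30.0)
chm[1, 1], chm[3, 3], chm[3, 1] = 70.0, 66.0, 60.0
tops = treetop_candidates(chm)
```

Cells of equal height within one window are all reported by this sketch, so a production version would need a tie-breaking rule.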

9:30am - 9:45am

The Global-Local loop: what is missing in bridging the gap between geospatial data from numerous communities ?

Clément Mallet, Ana-Maria Olteanu-Raimond

Univ Gustave Eiffel, IGN, Géodata Paris, LASTIG, France

We face an unprecedented amount of geospatial data, describing directly or indirectly the Earth's surface at multiple spatial, temporal, and semantic scales, and stemming from numerous contributors, from satellites to citizens. The main challenge in all the geospatial-related communities lies in suitably leveraging a combination of some of the sources for either a generic or a thematic application. Certain data fusion schemes are predominantly exploited: they correspond to popular tasks with mainstream data sources, e.g., free archives of Sentinel images coupled with OpenStreetMap data under an open and widespread deep-learning backbone for land-cover mapping purposes. Most of these approaches unfortunately operate under a "master-slave" paradigm, where one source is basically integrated to help processing the "main" source, without mutual advantages (e.g., large-scale estimation of a given biophysical variable using in-situ observations) and under a specific community bias.

We argue that numerous key data fusion configurations, and in particular the effort in symmetrizing the exploitation of multiple data sources, are insufficiently addressed while being highly beneficial for generic or thematic applications. Bridges and retroactions between scales, communities and their respective sources are lacking, neglecting the utmost potential of such a "global-local loop". In this paper, we propose to establish the most relevant interaction schemes through illustrative use cases. We subsequently discuss under-explored research directions that could take advantage of leveraging available data through multiple scales and communities.

10:30am - 12:00pm

WG III/1G: Remote Sensing Data Processing and Understanding
Location: 713A

10:30am - 10:45am

YOLOv8m-CCFM-GSConv: Research on Lightweight Marine Oil Spill Target Detection Based on Improved YOLOv8m Model

Junjie Lu¹, Qingyang Wang¹,²,³, Bo Song¹, Jianwu Jiang¹,²,³, Bin Yang¹, Chen Jiao¹

¹College of Geomatics and Geoinformation, Guilin University of Technology, Guilin 541004, China; ²Guangxi Ecological Spatiotemporal Big Data Perception Service Laboratory, Guilin 541004, China; ³Guangxi Key Laboratory of Spatial Information and Geomatics, Guilin 541004, China

In marine oil spill target detection, deep learning methods are gradually replacing traditional remote sensing image recognition approaches. However, complex models designed for higher accuracy may compromise recognition speed and often fail to meet the rapid response requirements of terminal device applications (Chai et al., 2025). Therefore, developing a lightweight detection model that balances high accuracy and real-time performance is crucial for enhancing marine oil spill emergency response capabilities (Liang et al., 2024). Based on the YOLOv8m model, this study introduces the GSConv lightweight convolution (Li et al., 2024) and the CCFM cross-scale feature fusion module (Guo et al., 2025). These significantly improve multi-scale target detection adaptability and recognition accuracy in complex backgrounds while keeping the model lightweight, thereby offering a novel and effective solution for marine oil spill target detection.

10:45am - 11:00am

Detecting moving vehicles on Sentinel-2 imagery using semi-automatic labeling from S2A/S2C tandem phase

Guillaume Buthmann¹, Florentin Poucin¹, Jérémy Anger¹,²

¹Kayrros SAS; ²Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, 91190, Gif-sur-Yvette, France

During the commissioning phase of ESA's Sentinel-2C, tandem images with Sentinel-2A were acquired with a delay of 30 seconds. We present a novel, automated method for labeling moving vehicles in Sentinel-2 images, leveraging the temporal offset between these tandem acquisitions. We propose a filtering process that isolates pixels corresponding to vehicles that moved between the two acquisitions. We generate a training dataset based on this process, removing the need for a large manual labeling phase. The dataset is used to train a standard deep-learning-based vehicle detection model. Experimental results, as well as a validation study using ground-truth data from California, highlight the quality of the proposed labeling method, and show that a vehicle detection model can be successfully trained from quasi-simultaneous acquisitions.
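To make the idea of exploiting the temporal offset concrete, here is a minimal sketch, not the authors' filtering process. It flags pixels whose value differs strongly between two co-registered acquisitions and estimates how far a vehicle travels during the 30-second delay. The threshold, the toy images, and the function names are hypothetical. The 10 m pixel size is the nominal resolution of the Sentinel-2 visible bands.

```python
def moving_pixel_mask(img_a, img_b, threshold):
    """Flag pixels whose absolute difference between two co-registered
    images exceeds `threshold` (a hypothetical, hand-picked value)."""
    return [
        [abs(a - b) > threshold for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


def displacement_pixels(speed_kmh, delay_s=30.0, pixel_size_m=10.0):
    """Distance travelled during the acquisition delay, in pixels."""
    metres = speed_kmh * 1000.0 / 3600.0 * delay_s
    return metres / pixel_size_m


# Toy 3x3 images: a bright vehicle sits at (1, 1) in the first acquisition
# and at (2, 2) in the second one; the background is static.
first = [[100, 100, 100], [100, 180, 100], [100, 100, 100]]
second = [[100, 100, 100], [100, 100, 100], [100, 100, 175]]

mask = moving_pixel_mask(first, second, threshold=50)
changed = [(r, c) for r, row in enumerate(mask) for c, hit in enumerate(row) if hit]
shift = displacement_pixels(90.0)  # a vehicle driving at 90 km/h
```

In this toy case, both the departure and the arrival position are flagged, while static background pixels are not. A vehicle at 90 km/h covers 750 m in 30 seconds, which is 75 pixels at 10 m resolution, so the two positions are well separated.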

11:00am - 11:15am

LAD-Enhancer: A Lightweight All in One Aerial Detection Enhancer Under Adverse Weather

Yu Wan¹, Jie Li¹, Liupeng Lin², Zaiyan Zhang¹, Qiangqiang Yuan¹, Huanfeng Shen²

¹School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China; ²School of Resource and Environmental Sciences, Wuhan University, Wuhan 430079, China

With the rapid development of aerial imaging technology, aerial target detection has become a research hotspot with broad applications in intelligent transportation, agricultural monitoring, and military surveillance. However, the performance of aerial detection models is often degraded under adverse weather conditions such as fog, sandstorms, and low illumination. In such environments, aerial images typically suffer from reduced contrast and color distortion, which significantly affects the model’s ability to accurately identify targets. To this end, a Lightweight All-in-One Aerial Detection Enhancer Under Adverse Weather (LAD-Enhancer) has been proposed. The designed enhancer processes and restores degraded aerial images, thereby enhancing the detection model’s ability to perceive potential targets. Unlike conventional image restoration models, LAD-Enhancer integrates detection labels as additional supervision during training to ensure that enhancement is detection-oriented rather than purely visual. Furthermore, a three-stage training strategy and a Mixture of Experts (MoE) framework are employed to adaptively classify and process images captured under different degradation conditions. Experimental results demonstrate that, with an increase of fewer than 3K parameters, the proposed LAD-Enhancer significantly improves detection performance under adverse weather conditions while maintaining almost unchanged performance on clear-weather images.
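The Mixture of Experts idea mentioned above can be sketched as follows, under strong simplifying assumptions: a gate turns per-image logits into softmax weights, and the output is the weighted sum of the expert outputs. The two toy experts, the gate logits, and the pixel values are hypothetical and do not reproduce the LAD-Enhancer architecture.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(pixels, gate_logits, experts):
    """Blend the outputs of all experts using softmax gate weights."""
    weights = softmax(gate_logits)
    outputs = [[expert(p) for p in pixels] for expert in experts]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(pixels))
    ]


# Hypothetical experts operating on intensities in [0, 1].
experts = [
    lambda p: min(1.0, 2.0 * p),  # "low-illumination" expert: brighten
    lambda p: p,                  # "clear-weather" expert: identity
]

pixels = [0.2, 0.4]
blended = mixture_of_experts(pixels, [0.0, 0.0], experts)    # equal weights
clear = mixture_of_experts(pixels, [-20.0, 20.0], experts)   # gate picks identity
```

When the gate strongly favours the identity expert, the image passes through almost unchanged, which is one way an enhancer can leave clear-weather images nearly untouched.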

11:15am - 11:30am

A Collaborative Detection Method of Small Unmanned Aerial Vehicle Target via Multi-modal Feature Fusion in Complex Background

Wen Jiang, Keyi Zhang, Yanping Wang, Yun Lin, Fukun Bi

North China University of Technology, Beijing, People's Republic of China

State-of-the-art methods for detecting small unmanned aerial vehicles (UAVs) continue to struggle in complex urban settings due to several persistent challenges: frequent target occlusion, high similarity in thermal radiation signatures between UAVs and their surroundings, and the inherently low visual saliency of small UAV targets, all of which degrade detection performance. To tackle these issues, this paper introduces a novel multi-modal feature fusion collaborative detection (MFFCD) framework grounded in learnable spatial mapping. The architecture consists of three key components: first, a multi-branch parallel feature extraction module (MBPFE) that simultaneously processes infrared, visible, and radar range-azimuth images, complemented by a feature fusion module (FFM) designed to enhance both intra-modal and inter-modal feature interactions; second, an adaptive spatially-aware dynamic detection head module (DDH) that dynamically recalibrates feature weights to strengthen target representation and boost detection accuracy; and third, a feature collaborative enhancement module (FCE) that employs a learnable affine transformation to align and fuse multi-modal features, thereby producing more robust and reliable detection outcomes. Extensive experiments show that the proposed MFFCD framework substantially outperforms existing methods under challenging urban conditions, achieving a 56.89% gain in Mean Average Precision (mAP) for small UAV detection.

11:30am - 11:45am

Infrared-Visible Image Fusion Method Based on Differential Feature Enhancement and Cross-Modal Attention

Huang Zhang¹, Lina Xu¹, Qing Zhou¹, Tiyou Zhou², Siyu Liu¹, Xincai Chang¹, Hao LI¹

¹Hubei Subsurface Multi-scale Imaging Key Laboratory, School of Geophysics and Geomatics, China University of Geosciences, Wuhan, 430074, China; ²State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, 430079, China

Infrared and visible remote sensing image fusion is crucial for improving scene perception in complex environments, but existing autoencoder-based methods suffer from insufficient information interaction between modalities, inadequate deep feature fusion, and ineffective loss functions in extreme scenarios. To address these issues, this study proposes a Differential Feature Enhancement and Cross-modal Fusion (DFECF) method. The DFECF adopts an end-to-end architecture consisting of dual-stream encoders, cross-modal fusion modules, Transformer global perception modules, and decoders. Specifically, the Differential Enhancement (DE) module extracts differential information between infrared and visible features, combined with spatial and channel attention to enhance feature representation. The cross-modal fusion module adaptively integrates deep features based on channel attention, adjusting feature weights according to scene characteristics. The Transformer module supplements the global receptive field to capture long-range feature dependencies, and a joint loss function is designed to optimize fusion performance. Experimental results on public datasets show that the proposed method outperforms existing state-of-the-art methods in both subjective visual effects and objective evaluation metrics, especially in extreme environments such as strong light and thick smoke. It effectively improves the integrity of scene perception and provides high-quality data support for practical applications such as forest fire prevention, mining area monitoring, and autonomous driving.
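A generic differential-attention pattern, given here only as an assumption-laden sketch and not as the DE module of DFECF, illustrates how the difference between infrared and visible features can be re-injected with a channel attention weight. Each stream receives the features of the other stream minus its own, scaled by a sigmoid of the globally averaged difference. Spatial attention is omitted, and all names and values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def add_weighted_difference(own, other):
    """Add to `own` the channel-weighted difference (other - own).

    Both arguments are lists of channels; each channel is a flat list of
    feature values. The weight is a sigmoid of the channel's mean difference
    (global average pooling), a simple form of channel attention.
    """
    enhanced = []
    for own_channel, other_channel in zip(own, other):
        diff = [o - s for s, o in zip(own_channel, other_channel)]
        weight = sigmoid(sum(diff) / len(diff))
        enhanced.append([s + weight * d for s, d in zip(own_channel, diff)])
    return enhanced


def differential_enhance(infrared, visible):
    """Return (infrared_enhanced, visible_enhanced)."""
    return (
        add_weighted_difference(infrared, visible),
        add_weighted_difference(visible, infrared),
    )


# One channel with two positions: infrared responds strongly, visible does not.
infrared = [[2.0, 2.0]]
visible = [[0.0, 0.0]]
ir_out, vis_out = differential_enhance(infrared, visible)
same_ir, same_vis = differential_enhance([[1.0, 3.0]], [[1.0, 3.0]])
```

When the two modalities agree, the difference is zero and the features pass through unchanged. When they disagree, the weaker stream receives most of the complementary information.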