If you have any problems, please contact the organizers at isprs2026@icsevents.com.

Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only sessions on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads if available).

Daily Overview

Location: 713A
125 theatre

Date: Wednesday, 08-July-2026

8:30am - 10:00am

WG II/2C: Point Cloud Generation and Processing
Location: 713A

8:30am - 8:45am

Differentiable deep consistency for point cloud registration

Tian Zhang, Sagi Filin

Technion - Israel Institute of Technology, Israel

Point cloud registration is a key facilitator for scan alignment in mapping, autonomous driving, and robotic applications. Current pipelines increasingly adopt neural-based paradigms, where most research focuses on learning view-consistent descriptors for correspondence matching. Due to outliers, matching is typically followed by a geometric verification phase that assesses correspondences by enforcing distance or angular consistency to support transformation estimation. Although effective, this verification stage scales quadratically, creating a computational bottleneck that hampers efficient registration. More importantly, since matching and verification are usually optimized separately, the verification stage cannot guide the learned descriptors or foster their geometric awareness. To address both limitations, we introduce a novel end-to-end neural registration framework that unifies correspondence learning and verification within a single differentiable formulation. Specifically, we propose a new consistency-driven cross-attention module that dynamically correlates cross-scan neighborhoods to suppress inconsistent matches and reinforce inter-scan feature coherence. In doing so, it produces robust and discriminative descriptors without incurring the quadratic cost of explicit pairwise verification. Our formulation is readily applicable, and we demonstrate its seamless integration into the GeoTransformer and RoITr state-of-the-art architectures without additional supervision or post-processing. Results show that our method excels in challenging low-overlap scenarios, where competing methods often yield few correct correspondences or fail entirely. It consistently achieves superior inlier ratios and the lowest registration errors on 3DMatch, 3DLoMatch, and KITTI, improving registration recall by up to 2.6%. Beyond accuracy, it converges faster during training and achieves the quickest inference among state-of-the-art methods.
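The quadratic cost of explicit geometric verification mentioned in the abstract can be illustrated with a minimal sketch of a pairwise distance-consistency check. This is a generic illustration, not the authors' method; the function name `distance_consistency_counts` and the tolerance `tol` are assumptions made for this example.

```python
import math
from itertools import combinations

def distance_consistency_counts(src, dst, tol=0.1):
    """For each correspondence i (src[i] <-> dst[i]), count how many other
    correspondences j preserve the pairwise distance, as a rigid motion would:
    | ||src[i] - src[j]|| - ||dst[i] - dst[j]|| | <= tol.
    Every pair (i, j) is visited once, so the work grows as O(n^2)."""
    counts = [0] * len(src)
    for i, j in combinations(range(len(src)), 2):
        d_src = math.dist(src[i], src[j])
        d_dst = math.dist(dst[i], dst[j])
        if abs(d_src - d_dst) <= tol:
            counts[i] += 1
            counts[j] += 1
    return counts

# Three correct matches (translation by +1 in x) and one outlier (last entry).
src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
dst = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
counts = distance_consistency_counts(src, dst)
print(counts)  # the outlier is supported by no other correspondence
```

Correspondences with a low support count are candidates for rejection before transformation estimation.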

8:45am - 9:00am

Cross-source Point Cloud Registration in the Bird’s-eye Domain: Aligning Street-level LiDAR with High-resolution Aerial Orthoimagery

Seunghwan Hong¹, Junghun Suh¹, Seokhyun Moon², Mohammad Gholami Farkoushi³, Hong-gyoo Sohn³

¹Kakao Mobility, Republic of Korea; ²University of Seoul, Republic of Korea; ³Yonsei University, Republic of Korea

Combining terrestrial Mobile Mapping System (MMS) point clouds with aerial photogrammetric data offers a practical route to comprehensive 3D urban models that integrate street-level geometric detail with wide-area coverage. However, direct 3D-to-3D registration between these data sources often fails because of large differences in viewpoint, point density, scale, and scene composition. This study presents an orthoimage-based registration framework that reformulates cross-source alignment in the Bird's-Eye-View (BEV) domain. After removing transient objects and extracting ground-level points from the MMS cloud, the data are rasterised into a synthetic orthoimage aligned in resolution and projection with a geo-referenced Unmanned Aerial Vehicle (UAV) orthoimage. A learned dense matcher establishes image correspondences, which are geometrically verified and lifted to 3D for coarse alignment, followed by tile-wise point-to-plane Iterative Closest Point (ICP) refinement and global trajectory regularisation via robust factor-graph optimisation. The aligned MMS and UAV point clouds are then integrated through reliability-driven voxel-level fusion. Experiments on a 3.7 km urban corridor in Seoul demonstrate that the proposed framework achieves a 3D root-mean-square error of 6.19 cm, indicating that BEV-domain orthoimage matching combined with local 3D refinement and trajectory regularisation provides a viable approach for large-scale MMS-UAV registration in dense urban environments.
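The rasterisation step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the function `rasterise_bev`, its parameters, and the choice of mean intensity per cell are hypothetical and not taken from the paper.

```python
def rasterise_bev(points, x_min, y_min, cell, n_cols, n_rows):
    """Rasterise (x, y, intensity) points into a top-down grid that stores the
    mean intensity per cell. Cells without points are None. Row 0 lies at
    y_min here; a real orthoimage would usually flip the row order."""
    sums = [[0.0] * n_cols for _ in range(n_rows)]
    hits = [[0] * n_cols for _ in range(n_rows)]
    for x, y, value in points:
        col = int((x - x_min) // cell)
        row = int((y - y_min) // cell)
        if 0 <= col < n_cols and 0 <= row < n_rows:  # drop points off the grid
            sums[row][col] += value
            hits[row][col] += 1
    return [
        [sums[r][c] / hits[r][c] if hits[r][c] else None for c in range(n_cols)]
        for r in range(n_rows)
    ]

# Four ground points; the last one falls outside the 2 x 2 grid and is ignored.
points = [(0.2, 0.2, 10.0), (0.4, 0.1, 20.0), (1.5, 0.5, 30.0), (9.0, 9.0, 99.0)]
grid = rasterise_bev(points, x_min=0.0, y_min=0.0, cell=1.0, n_cols=2, n_rows=2)
```

Choosing `cell` equal to the ground sampling distance of the aerial orthoimage would give both images the same resolution, which is the precondition for 2D matching stated in the abstract.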

9:00am - 9:15am

Automated Alignment Enhancement of Backpack Image-LiDAR Data in a Forest Environment

Jidong Liu, Sangyoon Park, Songling Fei, Ayman Habib

Purdue University, United States of America

In recent years, backpack mobile mapping systems (MMS) have shown great promise for under-canopy forest mapping. These systems integrate cameras, LiDAR sensors, and Global Navigation Satellite System/Inertial Navigation System (GNSS/INS) units to provide multi-modal geospatial data essential for modern forest applications that require both geometric and spectral information. However, transportation logistics and improper handling can degrade the system calibration. Moreover, canopy-induced GNSS signal outages will cause trajectory errors. The resulting misalignments between the image and LiDAR data necessitate image-LiDAR registration. Such algorithms can be broadly classified as 2D-3D, 3D-3D, or 2D-2D, depending on the domain in which image-LiDAR features are identified. Due to the inherent modality differences, 2D-3D methods often struggle with feature matching. These methods typically require manual feature selection (Habib et al., 2005) or the availability of prominent features in urban environments (Liao et al., 2023). In contrast, 3D-3D methods rely on generating 3D image point clouds, which imposes strict requirements on image overlap (Yang et al., 2015). Although 2D-2D approaches are less demanding on image data (Hu et al., 2023), none have been applied in under-canopy forests, where establishing multi-modal correspondences remains challenging. To overcome these limitations, this study introduces a post-processing framework for automated image-LiDAR alignment enhancement for backpack MMS in forest environments. This method utilizes a 2D-2D image-LiDAR registration approach based on semantic tree-trunk features.

9:15am - 9:30am

A Marker-based Method for precise 3D Registration between CT-Data and photogrammetric Datasets

Sebastian Preussel¹, Cyrill Milkau², Hans-Gerd Maas¹

¹TU Dresden, Germany; ²HTW Dresden, Germany

In order to enable photogrammetric tracking of objects from a computed tomography (CT) dataset with a multi-camera system, a transformation between the CT data space and a photogrammetric reference frame is required, typically based on control points. To achieve a robust and precise registration between CT and photogrammetric datasets, this work proposes a marker-based approach. The main goal is to use a marker model that allows straightforward segmentation and control point estimation in CT voxel space, while also supporting reliable and precise control point estimation in the photogrammetric images. As a proof-of-concept, spherical markers were investigated, since they allow centre estimation in both domains. In the CT data, marker centres were determined by intensity-based thresholding followed by sphere fitting, while in the photogrammetric data they were estimated by intensity-based thresholding, edge detection, circle fitting, and multi-image spatial intersection. Two different marker models were tested. The results show that the proposed method is feasible and yields sub-millimetre standard deviations of unit weight for both marker types. However, since a sufficient stochastic model is not yet available, the reported accuracy measures may be optimistic and should therefore be interpreted with caution. Future work will address these limitations, in particular uncertainty modelling as well as remaining lighting and contrast issues.
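The sphere-fitting step for the CT marker centres can be illustrated with an algebraic least-squares fit. The abstract does not say which fitting method was used, so the formulation below and the names `fit_sphere` and `solve_linear` are assumptions made for this sketch.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_sphere(points):
    """Algebraic least-squares sphere fit using the linear model
    x^2 + y^2 + z^2 = 2a x + 2b y + 2c z + d, with d = r^2 - a^2 - b^2 - c^2.
    Needs at least four points that do not lie in one plane."""
    rows = [(2 * x, 2 * y, 2 * z, 1.0) for x, y, z in points]
    rhs = [x * x + y * y + z * z for x, y, z in points]
    # Normal equations (A^T A) p = A^T y for the unknowns p = (a, b, c, d).
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    aty = [sum(r[i] * v for r, v in zip(rows, rhs)) for i in range(4)]
    a, b, c, d = solve_linear(ata, aty)
    return (a, b, c), math.sqrt(d + a * a + b * b + c * c)

# Six surface points of a sphere centred at (1, 2, 3) with radius 2.
surface = [(3, 2, 3), (-1, 2, 3), (1, 4, 3), (1, 0, 3), (1, 2, 5), (1, 2, 1)]
centre, radius = fit_sphere(surface)
```

In practice the input would be the boundary voxels that remain after thresholding, and a geometric (orthogonal-distance) fit could refine this algebraic estimate.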

9:30am - 9:45am

Advances in Historical Aerial Image Analysis: Boosting SfM Pipelines with Learned Models

Lucas Kugler¹, Jan Dirk Wegner¹, Camilo Rada²,³, Livia Piermattei¹

¹University of Zurich, Switzerland; ²University of Magallanes; ³University of British Columbia

Scanned aerial images acquired with film cameras (hereafter referred to as historical images) over the past century are a unique source for deriving Digital Elevation Models (DEMs) and orthoimages to reconstruct the Earth’s past surface and quantify long-term changes, from glaciers to landscape and urban development.

The Historical Structure-from-Motion (HSfM) pipeline (Knuth et al., 2023) currently represents the state of the art for fully automatic generation of these historical DEMs. However, it struggles with inconsistent image quality, distortions, and distinct geometries and, above all, is based on the commercial software Metashape. Therefore, we aim to: (1) develop a fully open-source solution in the COLMAP environment, (2) integrate learned models into different SfM steps to better handle the complex properties of historical imagery, and (3) compare our output against HSfM.

Our work is based on 180 historical aerial images acquired above the challenging terrain of Gran Campo Nevado Glacier. The results show that our photogrammetric workflow yields a 0.26 px smaller mean reprojection error and roughly nine times more tie points in the sparse point cloud than HSfM. The mean DEM difference to a reference DEM on stable terrain and the 95% quantile of the DEM difference are also smaller in our experiments (0.71 m vs. 10.10 m and 73.62 m vs. 99.03 m).

Further tests of our workflow include employing alternative models for feature extraction, matching, and dense reconstruction as well as evaluating multitemporal approaches (as adopted in Knuth et al., 2023) to enable a more representative comparison.

9:45am - 10:00am

Trinocular Multi-Object 3D Reconstruction in Camera-Simulating virtual Environments for Knee Arthroplasty

Arne Schierbaum¹, Tobias Neiss-Theuerkauff², Thomas Luhmann¹, Frank Wallhoff², Till Sieberth¹

¹Jade University of Applied Sciences, Institute for Applied Photogrammetry and Geoinformatics, Oldenburg, Germany; ²Jade University of Applied Sciences, Institute for Technical Assistive Systems, Oldenburg, Germany

In knee arthroplasty, computer-assisted navigation enhances the accuracy of prosthesis placement. However, current methods rely on invasively drilled locators to track the knee position during surgery, prolonging the healing process. For this reason, research is focused on markerless approaches capable of determining knee orientation and transferring preoperative planning into the surgical environment. This work presents a trinocular multi-object 3D reconstruction system designed for intraoperative acquisition of the knee surface, providing a foundation for markerless navigation. Due to the scarcity of real surgical data with ground truth, a synthetic dataset was created using Blender to simulate optical image acquisition of a virtual knee model under controlled camera and lighting conditions. The dataset enables a systematic evaluation of how camera motion and viewpoint affect pose estimation and 3D reconstruction accuracy. The results demonstrate that moderate camera deflections between 15° and 25° achieve the best balance between accurate camera pose estimation and surface reconstruction quality. The work confirms the potential of trinocular SLAM for robust bone surface tracking while also identifying the limitations of synthetic data, such as the absence of real-world visual variability. These results form the basis for future work on 3D reconstruction during dynamic knee movements and their tracking, as well as on the integration of markerless optical navigation systems into surgery.

1:30pm - 3:00pm

WG III/1C: Remote Sensing Data Processing and Understanding
Location: 713A

1:30pm - 1:45pm

Remote sensing image semantic segmentation sample generation using a decoupled latent diffusion framework

Yue XU¹, Honghao Liu², Ruixia Yang³, Zhengchao Chen¹

¹Aerospace Information Research Institute, Chinese Academy of Sciences, China, People's Republic of; ²University of Chinese Academy of Sciences, China, People's Republic of; ³International Research Center of Big Data for Sustainable Development Goals, China, People's Republic of

Semantic segmentation deep learning algorithms still depend on large quantities of high-quality annotated samples. Because remote sensing imagery spans vast areas and highly variable land surface environments, annotation demands substantial expertise and is both time-consuming and labour-intensive, leaving the field with an acute shortage of first-rate training samples. Moreover, object categories in land cover data are inherently imbalanced. Models trained under imbalance often underperform on small-sample categories.

This study proposed a decoupled latent diffusion framework for RS semantic segmentation sample generation, and a proportion-aware loss to improve the balance of sample classes. We tested the proposed method on the ISPRS Potsdam dataset and compared it with two classic image generation baselines. The results show that our approach outperforms the baselines, producing synthetic samples with superior visual quality and semantic consistency. To verify downstream utility, we trained DeeplabV3+, PSPNet, and SegFormer segmentation models with the synthesized data. Across all three networks, overall segmentation accuracy and class balance metrics improved markedly; gains were especially pronounced for the rare “Clutter” and “Car” categories, underscoring the proposed method’s generality and robustness. We further analysed how the proportion of synthetic samples affects performance. As the ratio of synthetic to real samples increased, mIoU and mF1 first rose and then declined; the best results were obtained when the proportion of synthesized samples approached 40%. This indicates that a moderate number of synthetic samples can significantly boost segmentation performance, whereas excessive synthetic data risks over-fitting or misclassification.
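For readers unfamiliar with the two metrics reported above, mIoU and mF1 can be computed from a confusion matrix as in the following sketch. The function name `miou_and_mf1` and the toy two-class matrix are assumptions for illustration; they are not data from the study.

```python
def miou_and_mf1(conf):
    """conf[i][j] is the number of pixels of true class i predicted as class j.
    Returns (mean IoU, mean F1) averaged over classes with equal weight.
    Assumes every class occurs at least once, so no denominator is zero."""
    n = len(conf)
    ious, f1s = [], []
    for k in range(n):
        tp = conf[k][k]
        fn = sum(conf[k]) - tp                       # class k pixels missed
        fp = sum(conf[r][k] for r in range(n)) - tp  # other pixels labelled k
        ious.append(tp / (tp + fp + fn))
        f1s.append(2 * tp / (2 * tp + fp + fn))
    return sum(ious) / n, sum(f1s) / n

# Toy confusion matrix for two classes.
miou, mf1 = miou_and_mf1([[8, 2], [1, 9]])
```

Because every class contributes equally to the mean, gains on rare classes such as “Clutter” and “Car” move these metrics more than they move overall pixel accuracy.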

1:45pm - 2:00pm

Bright-CC: A Novel Change Captioning Benchmark for Cross-Modal Remote Sensing Images

Jing Yao¹, Dongwei Sun², Yun Lin³, Zhangqi Zhou³, Haizhen Chen³, Xiangyong Cao², Zhengchao Chen¹

¹State Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences; ²School of Computer Science and Technology, Xi’an Jiaotong University; ³Ningbo Institute of Surveying, Mapping and Remote Sensing

Existing remote sensing change captioning methods are limited to optical-only data, precluding all-weather, all-day monitoring. To address this, we introduce Bright-CC, the first large-scale benchmark for cross-modal (Optical-to-SAR) change captioning. Curated from the newly proposed BRIGHT dataset, Bright-CC comprises 9,953 paired images focused on building damage assessment. It features dense four-class semantic labels (intact, damaged, destroyed) and a rich corpus of 49,765 GPT-4o-generated sentences (5 per pair), moving beyond simple binary change labels. Furthermore, we propose the Hybrid Feature Alignment Network (HFA-Net) as a robust baseline for this new task. HFA-Net is specifically designed to tackle the significant domain shift between heterogeneous sensors. Its architecture features: (1) a pseudo-siamese alignment module (HFEA) to project features into a common space; (2) a multi-scale atrous convolution module (CSTDF) to refine change context; and (3) a novel Lightweight Caption Generator (LCG), which is a parameter-efficient Transformer trained from scratch to avoid overfitting. Experiments show HFA-Net substantially outperforms adapted optical-only baselines (RSICCFormer, Chg2Cap) on all standard metrics. This work provides the community with a critical dataset and a strong baseline for future cross-modal spatio-temporal intelligence.

2:00pm - 2:15pm

Remote Sensing Change/Damage Image Generator Based on Prior Foundation Model and Multimodal Reference Information

Peng Chen, Guorui Ma, Haiming Zhang, Di Wang, Lunjun Fan

Wuhan University, China, People's Republic of

The scarcity and high cost of acquiring high-quality post-event remote sensing images (due to cloud cover, satellite limitations, and security risks) severely constrain the development and accuracy of change/damage detection models. This data gap is especially critical in disaster or military conflict scenarios. Existing cross-temporal image generation methods often lack precise spatial and semantic control, leading to inconsistent or unrealistic synthetic results. To address this core challenge, this paper introduces the Remote Sensing Change/Damage Generator (RSCDG), a novel method based on the Latent Diffusion Model for high-fidelity simulation of post-event satellite imagery. The RSCDG’s core innovation lies in its multimodal condition embedding framework, which integrates three specialized control pathways: The Pre-event Visual Prompt Adapter (built on PrithviModel) ensures high structural consistency between the pre-event and generated post-event images. The Spatial Location Control Pathway (using a ControlNet architecture and change/damage masks) precisely dictates the geometric location of the simulated change. The Generation Content Controller (using a CLIP Text Encoder) enhances semantic realism by guiding the model with natural language descriptions of the change/damage. Furthermore, we introduce a Mask Alignment Loss to enforce spatial and semantic adherence to detection rules. Results demonstrate that RSCDG accurately simulates complex scenarios like new urban construction and catastrophic building collapse. RSCDG is a powerful new tool designed to augment training data and significantly accelerate high-precision disaster response and urban monitoring.

2:15pm - 2:30pm

Edge Knowledge Distillation Guided Lightweight Change Detection Network

Tingyu Ji¹,², Yixin Chen⁵,⁶, Ruiqian Zhang¹, Xiaogang Ning¹, Xiao Huang³, Hanchao Zhang¹,⁷, Weibin Ma¹, Chunquan Cheng¹, Jiaming Wang⁴

¹State Key Laboratory of Spatial Datum, Chinese Academy of Surveying and Mapping, Beijing 100036, China; ²the College of Geodesy and Geomatics, Shandong University of Science and Technology, Qingdao 266590, China; ³the Department of Environmental Sciences, Emory University, Atlanta, GA 30322, USA; ⁴the Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430205, China; ⁵Sichuan Institute of Land Science and Technology (Sichuan Center of Satellite Application Technology), Chengdu 610045, China; ⁶Key Laboratory of Investigation, Monitoring, Protection and Utilization for Cultivated Land Resources, MNR, Chengdu 610045, China; ⁷Joint Laboratory of Spatial Intelligent Perception and Large Model Application

Deep-learning methods dominate remote-sensing change detection (CD), yet state-of-the-art models remain parameter-heavy and struggle with crisp boundaries, limiting their use on edge devices. We present LEDGNet, a Lightweight, Edge-knowledge-Distillation-Guided CD Network, that reconciles accuracy, boundary fidelity, and efficiency. LEDGNet integrates three purpose-built components: 1) an Edge Distillation Module that mines multi-scale boundary cues from a high-capacity teacher and transfers them to a compact student through an edge-aware loss; 2) StarLite, a depth-wise separable encoder that preserves fine spatial detail while minimizing floating-point operations; and 3) LiteDecoder, an inexpensive feature-fusion head that restores full resolution without bulky up-sampling. This design halves the parameters and inference time of mainstream fine-grained CD networks while enhancing edge sharpness. On the CDD and LEVIR-CD benchmarks, LEDGNet achieves competitive F1 performance while maintaining a compact footprint of 20.58 M parameters and 35.18 G FLOPs. With an inference time of 255 ms, it strikes a balance between resource consumption and detection efficiency, making it well-suited for high-efficiency remote sensing monitoring.
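Why a depth-wise separable encoder reduces model size can be seen from a simple parameter count. The sketch below uses the standard formulas for a single convolution layer with biases ignored; the layer sizes are arbitrary examples and do not describe the actual StarLite encoder.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights of a k x k depth-wise convolution (one filter per input
    channel) followed by a 1 x 1 point-wise convolution (bias ignored)."""
    return k * k * c_in + c_in * c_out

# Example layer: 3 x 3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
print(standard, separable, round(standard / separable, 1))
```

For this example layer the separable variant needs roughly one eighth of the weights, and the multiply-accumulate count per output pixel shrinks by the same factor.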

2:30pm - 2:45pm

Leveraging Pretrained Priors for Weakly Supervised Semantic Segmentation of Remote Sensing Images

Xin Li, Nicola Genzano, Marco Gianinetto, Marco Scaioni

Politecnico di Milano, Italy

Semantic segmentation of remote sensing imagery (RSI) is essential for urban mapping, land-use monitoring, and many other applications. However, pixel-level annotation is expensive, making weakly supervised semantic segmentation (WSSS) that relies on image-level labels an attractive alternative. Pre-trained models offer strong priors from large-scale learned representations that can help WSSS, yet frozen models often yield sparse and misaligned class activation maps (CAMs) due to domain gaps and static inference. We propose a lightweight and efficient framework that integrates CLIP and DINO to address three challenges: semantic misalignment between generic text prompts and RSI-specific visuals, static CAM quality, and incomplete object coverage. Our design includes: (1) a Textual Prototype-Aware Enrichment (TPE) module that builds an RS-specific knowledge base using LLM-generated descriptions to enrich text prompts; (2) a Unified Semantic Relation Mining (USR) module that fuses learnable adapter features with CLIP attention and DINO affinity for online CAM refinement; and (3) a Visual Prototype-Aware Enrichment (VPE) module that maintains momentum visual prototypes to complete regions and sharpen boundaries. Using frozen priors while only training a lightweight decoder ensures efficiency and consistently improves segmentation accuracy across diverse remote sensing scenes. Experimental results on the iSAID and ISPRS Potsdam datasets demonstrate the effectiveness of the proposed framework, achieving 38.01% mIoU on the iSAID dataset and 47.01% mIoU with 66.89% overall accuracy on the Potsdam dataset.

2:45pm - 3:00pm

DeSEO: Physics-Aware Dataset Creation for High-Resolution Satellite Image Shadow Removal

Lorenzo Beltrame¹,², Jules Salzinger², Filip Svoboda³, Phillipp Fanta-Jende², Jasmin Lampert², Radu Timofte⁴, Marco Körner¹,⁵,⁶

¹Technical University of Munich; ²Austrian Institute of Technology; ³University of Cambridge; ⁴University of Würzburg; ⁵Munich Data Science Institute; ⁶ELLIS Unit Jena

Shadows cast by terrain and tall structures remain a major obstacle for high-resolution satellite image analysis. Public resources offering geometry-consistent paired shadow/shadow-free satellite imagery are essentially missing, even though there is a growing body of work on shadow removal in remote sensing, and most large-scale Earth-observation datasets are designed for shadow detection or 3D modelling rather than shadow removal. Existing deep shadow-removal datasets either target ground-level or aerial scenes or rely on unpaired and weakly supervised formulations rather than explicit satellite pairs. We address this gap with deSEO, a geometry-aware and physics-informed methodology that, to the best of our knowledge, is the first to derive paired supervision for satellite shadow removal from the S-EO shadow detection dataset through a fully replicable pipeline. For each tile, deSEO selects a minimally shadowed acquisition as a weak reference and pairs it with shadowed counterparts using temporal and geometric filtering, Jacobian-based orientation normalisation, and LoFTR–RANSAC registration. A per-pixel validity mask restricts learning to reliably aligned regions, enabling supervision despite residual off-nadir parallax. In addition to this paired dataset, we develop a DSM-aware deshadowing model that combines residual translation, perceptual objectives, and mask-constrained adversarial learning. In contrast, a direct adaptation of a UAV-based SRNet/pix2pix architecture fails to converge under satellite viewpoint variability. Our model consistently reduces the visual impact of cast shadows across diverse illumination and viewing conditions, achieving improved structural and perceptual fidelity on held-out scenes. deSEO therefore provides the first reproducible, geometry-aware paired dataset and baseline for shadow removal in satellite Earth observation.

3:30pm - 5:15pm

WG III/1D: Remote Sensing Data Processing and Understanding
Location: 713A

3:30pm - 3:45pm

Spatio-temporal Modeling of Bridge Deformations from Sentinel-1 SAR Images Validated with Multiple In-situ Surveys

Roberto Monti, Federica Gaspari, Rohollah Naeijian, Rebecca Fascia, Mirko Reguzzoni, Lorenzo Rossi, Livio Pinto

Politecnico di Milano, Department of Civil and Environmental Engineering (DICA), 20133 Milan, Italy

Aging bridge infrastructure requires efficient, network-scale monitoring, especially in remote areas where traditional in-situ sensors are costly and logistically challenging. This paper presents a remote sensing framework for structural health monitoring based on spaceborne Synthetic Aperture Radar (SAR). The approach combines Persistent Scatterer Interferometry (PSI) and Least Squares Collocation (LSC), implemented through the PHASE open-source MATLAB software, to derive a millimeter-level spatio-temporal displacement model.

The methodology is applied to a reinforced-concrete viaduct in the Alpine foothills of Lombardy, Italy, using five years of Copernicus Sentinel-1 data. A custom elevation-based spatial filtering strategy enables the isolation of structural displacements from the surrounding topography. The resulting spatio-temporal displacement model captures the expected seasonal thermal behavior of the structure and highlights localized deviations from the dominant cyclic response.

Finally, the SAR-derived model is integrated with UAV photogrammetry and official inspection reports within the P.O.N.T.I. 3D viewer. This multi-source, Digital Twin-like environment facilitates the joint interpretation of remote sensing observations and in-situ evidence, providing a scalable framework to support infrastructure monitoring and management.

3:45pm - 4:00pm

Large-Scale InSAR Deformation Monitoring Using Realistic Simulation-Based Training of a Temporal Convolutional Network: Application to the Phlegraean Fields, Italy

Kourosh Shahryarinia, Mohammad Omidalizarandi, Ingo Neumann

Geodetic Institute Hannover, Leibniz University Hannover, Germany

Large-scale land surface deformation monitoring using Interferometric Synthetic Aperture Radar (InSAR) requires robust detection of changes in long-term deformation trends. However, accurate change point (CP) detection remains challenging due to complex time series characteristics, including seasonal and quasi-periodic components and noise. Classical methods and many existing deep learning approaches rely on restrictive assumptions and training data that do not fully represent real-world InSAR time series, limiting their generalization and scalability in large-scale, real-world applications. In this study, we propose an integrated, fully supervised framework for CP detection in InSAR displacement time series based on Temporal Convolutional Networks (TCNs). The proposed TCN model employs dilated convolutions with multi-scale receptive fields to capture long-term temporal dependencies and complex deformation patterns, enabling robust identification of significant trend changes under noisy conditions. To effectively train this model, we introduce a deep learning-based InSAR time series simulation framework trained on real time series. This simulation framework produces physically consistent InSAR time series that retain essential temporal characteristics while introducing predefined, credible trend changes. Finally, we integrate the trained model into a large-scale anomalous change-detection pipeline that aggregates detected CPs from individual time series into spatially coherent deformation heatmaps suitable for operational monitoring. The proposed framework is evaluated using simulated data and real InSAR time series from the Phlegraean Fields caldera (Campi Flegrei), Italy. The results show clusters of anomalous behavior in the central Campi Flegrei–Pozzuoli area and in parts of Ischia and Procida, consistent with known unrest zones, associated periods, and independent measurements.
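How dilated convolutions let a TCN cover long time spans can be shown with a short receptive-field calculation. The sketch assumes one convolution per dilation level with stride 1; the kernel size and dilation schedule are illustrative and not taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of a stack of stride-1 dilated 1D
    convolutions: each layer adds (kernel_size - 1) * dilation steps."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Four layers with kernel size 3 and exponentially growing dilation.
rf = receptive_field(3, [1, 2, 4, 8])
print(rf)  # time steps seen by one output sample
```

Doubling the dilation per layer makes the receptive field grow exponentially with depth, whereas undilated layers would grow it only linearly.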

4:00pm - 4:15pm

Geometry-conditioned Pix2Pix: leveraging explicit Conditioning on SAR projected local Incidence Angle for SAR-to-EO Translation Quality Improvement

Jinmin Lee, Minkyung Chung, Aisha Javed, Youkyung Han

Seoul National University of Science and Technology, Korea, Republic of (South Korea)

Electro-optical (EO) imagery is intuitive but highly dependent on weather and illumination, whereas synthetic aperture radar (SAR) imagery provides reliable all-weather observations yet offers limited spectral information. To complement these modalities, recent studies have applied cGAN-based image-to-image translation for SAR-to-EO translation. However, side-looking SAR introduces spatial distortions such as foreshortening and layover that cause relative misalignment with EO imagery, undermining pixelwise supervision and yielding structural discrepancies between translated outputs and reference EO imagery. In this study, we propose Geometry-Conditioned Pix2Pix (GC-Pix2Pix), which explicitly conditions on projected local incidence angle (PLIA) information derived from SAR imagery to better preserve structure and alignment in translated EO imagery. The method is based on Pix2Pix and comprises a 2-branch generator and a PatchGAN discriminator. The generator consists of a main network that processes SAR polarimetric channels (VV, VH) and a conditioning subnetwork that extracts PLIA features. The subnetwork uses multi-layer convolutional blocks to capture local PLIA patterns, and the extracted features are then fused with features from the main branch and emphasized via a spatial attention module. For training and evaluation, we assembled a dataset over South Korea that combines Sentinel-1A GRD VV/VH with PLIA and Sentinel-2B Level-2A RGB imagery. We compared GC-Pix2Pix against representative baselines. Across multiple image quality assessment metrics and complementary qualitative analyses, the proposed approach consistently improved SAR-to-EO translation performance.

4:15pm - 4:30pm

Temporal-Spatial Tubelet Embedding for Cloud-Robust MSI Reconstruction using MSI-SAR Fusion: A Multi-Head Self-Attention Video Vision Transformer Approach

Yiqun Wang, Lujun Li, Meiru Yue, Radu State

SEDAN, SnT, the University of Luxembourg, Luxembourg

Cloud cover in multispectral imagery (MSI) significantly hinders early-season crop mapping by corrupting spectral information. Existing Vision Transformer (ViT)-based time-series reconstruction methods, like SMTS-ViT, often employ coarse temporal embeddings that aggregate entire sequences, causing substantial information loss and reducing reconstruction accuracy. To address these limitations, a Video Vision Transformer (ViViT)-based framework with temporal-spatial fusion embedding for MSI reconstruction in cloud-covered regions is proposed in this study. Non-overlapping tubelets are extracted via 3D convolution with constrained temporal span t=2, ensuring local temporal coherence while reducing cross-day information degradation. Both MSI-only and SAR-MSI fusion scenarios are considered during the experiments. Comprehensive experiments on 2020 Traill County data demonstrate notable performance improvements: MTS-ViViT achieves a 2.23% reduction in MSE compared to the MTS-ViT baseline, while SMTS-ViViT achieves a 10.33% improvement with SAR integration over the SMTS-ViT baseline. The proposed framework effectively enhances spectral reconstruction quality for robust agricultural monitoring.
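The non-overlapping tubelet layout can be sketched by enumerating tubelet origins. Only the temporal span t=2 comes from the abstract; the function name `tubelet_origins`, the spatial patch size, and the sequence dimensions are assumptions for this example.

```python
def tubelet_origins(n_frames, height, width, t=2, patch=8):
    """Return the (frame, row, col) origin of every non-overlapping tubelet
    of size t x patch x patch. A 3D convolution whose kernel and stride both
    equal (t, patch, patch) visits exactly these positions. Trailing frames
    or pixels that do not fill a whole tubelet are dropped."""
    return [
        (f, r, c)
        for f in range(0, n_frames - t + 1, t)
        for r in range(0, height - patch + 1, patch)
        for c in range(0, width - patch + 1, patch)
    ]

# Six acquisition dates of 32 x 32 pixels give 3 * 4 * 4 tubelets.
origins = tubelet_origins(n_frames=6, height=32, width=32)
print(len(origins))
```

Each tubelet spans only two consecutive dates, so a token never mixes information from acquisitions that are far apart in time.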

4:30pm - 4:45pm

Evaluating Deep Matching Models for SAR-Optical Image Pairs using the SpaceNet9 Dataset

Constantin Günzel, Michael Schmitt

Department of Aerospace Engineering, University of the Bundeswehr Munich, Germany

This paper focuses on cross-modal image matching between Synthetic Aperture Radar (SAR) and optical imagery, a long-standing challenge due to disparate sensing physics, radiometric behaviour and geometric distortions. Beyond application needs in satellite data fusion and downstream mapping, the study is additionally motivated by the rapid advances of feature matching in the field of Computer Vision. Under a unified, lightweight pipeline, the authors evaluate a classical handcrafted baseline (SIFT) against modern deep matchers, including a modality-invariant approach (MINIMA) and a SuperPoint+LightGlue pipeline, using the SpaceNet9 dataset with provided ground truth. The aim is to assess each model's ability to establish reliable correspondences without retraining or modality-specific adaptation, and thereby to provide practical guidance for other researchers working with SAR-optical fusion. The paper highlights where pretrained multimodal models already yield consistent correspondences and where they still struggle, and outlines possible next steps.

4:45pm - 5:00pm

Detecting Marine Pollutants Using Sentinel-1 SAR and Sentinel-2 Optical Imagery

Jason Manesis¹, Paraskevi Mikeli¹, Katerina Kikaki², Ioannis Kakogeorgiou³, Athanasios Karmas², Konstantinos Karantzalos¹

¹National Technical University of Athens; ²Hellenic Space Center; ³IIT, NCSR "Demokritos"

Marine pollution, including Marine Debris and Oil Spills, poses a serious environmental threat that requires systematic monitoring. While satellite observations and machine learning models have been widely applied in this domain, the use of advanced deep learning techniques remains limited. To support progress in this area, we construct a new annotated Sentinel-1 SAR dataset derived from the MADOS Sentinel-2 marine pollution dataset, including labels for oil spills, sea surface, look-alikes, ships, and offshore platforms. We evaluate several deep learning architectures on this dataset, including traditional models such as U-Net, state-of-the-art segmentation models such as SegNeXt and domain-specific frameworks such as MariNeXt. Our results show that MariNeXt achieves the best performance with an F₁-macro score of 92.7%, significantly outperforming U-Net and SegNeXt. Qualitative analysis using paired Sentinel-2 imagery further validates these findings. The study also highlights the persistent difficulty of detecting marine debris in SAR imagery, particularly when complementary optical data are unavailable.
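For readers unfamiliar with the reported metric, the macro-averaged F₁ score is the unweighted mean of per-class F₁ scores, so rare classes count as much as frequent ones. The sketch below is a generic illustration and not the authors' evaluation code; the function name `macro_f1` and the toy labels are assumptions.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy per-pixel labels (assumed data, three of the abstract's classes).
y_true = ["oil", "oil", "sea", "sea", "ship", "ship"]
y_pred = ["oil", "sea", "sea", "sea", "ship", "oil"]
score = macro_f1(y_true, y_pred, ["oil", "sea", "ship"])
print(round(score, 4))
```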

5:00pm - 5:15pm

A coarse-to-fine cross-view localization framework with BEV-guided retrieval and fine-grained pose alignment

Qiong Wu¹, Panwang Xia¹, Yi Wan¹,², Xinyi Liu¹,², Yongxiang Yao¹, Dong Wei¹, Yongjun Zhang¹,²

¹Wuhan University, People's Republic of China; ²Technology Innovation Center for Collaborative Applications of Natural Resources Data in GBA, Ministry of Natural Resources, Guangzhou, 510075, Guangdong, China

This paper introduces a coarse-to-fine cross-view localization framework that unifies image-level retrieval and geometry-level alignment within a single pipeline. The proposed approach first employs a Bird’s-Eye-View (BEV)-guided retrieval module to establish a perspective-consistent intermediary space, enhancing cross-view consistency and retrieval precision. In the fine stage, a geometry-aware alignment module estimates the 3-DoF pose through interpretable point-plane matching based on BEV correspondences. This hierarchical design bridges global retrieval and local geometric reasoning, achieving both scalability and high localization accuracy. Extensive experiments on the VIGOR benchmark demonstrate that the proposed framework achieves state-of-the-art performance in both retrieval and alignment, significantly improving end-to-end localization precision while maintaining computational efficiency.
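To make the notion of a 3-DoF pose (planar translation plus yaw) concrete, the sketch below recovers such a pose from 2D point correspondences by closed-form least-squares rigid alignment. This is a generic stand-in and not the point-plane matching module described in the abstract; the function name `estimate_pose_2d` and the toy points are assumptions.

```python
import math


def estimate_pose_2d(src, dst):
    """Least-squares (tx, ty, yaw) such that dst ~ R(yaw) @ src + t.

    src and dst are equal-length lists of (x, y) correspondences.
    Hypothetical helper for illustration only.
    """
    n = len(src)
    mpx = sum(p[0] for p in src) / n
    mpy = sum(p[1] for p in src) / n
    mqx = sum(q[0] for q in dst) / n
    mqy = sum(q[1] for q in dst) / n
    cross = dot = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - mpx, py - mpy  # centered source point
        bx, by = qx - mqx, qy - mqy  # centered target point
        cross += ax * by - ay * bx
        dot += ax * bx + ay * by
    yaw = math.atan2(cross, dot)
    c, s = math.cos(yaw), math.sin(yaw)
    # Translation maps the rotated source centroid onto the target centroid.
    tx = mqx - (c * mpx - s * mpy)
    ty = mqy - (s * mpx + c * mpy)
    return tx, ty, yaw


# Toy correspondences (assumed data): rotate by 30 degrees, shift by (2, -1).
true_yaw = math.radians(30.0)
src_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
dst_pts = [
    (math.cos(true_yaw) * x - math.sin(true_yaw) * y + 2.0,
     math.sin(true_yaw) * x + math.cos(true_yaw) * y - 1.0)
    for x, y in src_pts
]
tx, ty, yaw = estimate_pose_2d(src_pts, dst_pts)
print(tx, ty, math.degrees(yaw))
```

In a coarse-to-fine setting, a retrieval stage would first select the candidate reference tile, and an alignment step of this kind would then refine the pose within it.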