Conference Agenda

Session

WG III/1C: Remote Sensing Data Processing and Understanding

Time:

Wednesday, 08-July-2026:

1:30pm - 3:00pm

Location: 713A

125 theatre

Session Topics:

Remote Sensing Data Processing and Understanding (WG III/1)

External Resource: http://www.commission3.isprs.org/wg1

Presentations

1:30pm - 1:45pm

Remote sensing image semantic segmentation sample generation using a decoupled latent diffusion framework

Yue XU¹, Honghao Liu², Ruixia Yang³, Zhengchao Chen¹

¹Aerospace Information Research Institute, Chinese Academy of Sciences, China, People's Republic of; ²University of Chinese Academy of Sciences, China, People's Republic of; ³International Research Center of Big Data for Sustainable Development Goals, China, People's Republic of

Deep learning algorithms for semantic segmentation still depend on large quantities of high-quality annotated samples. Because remote sensing imagery spans vast areas and highly variable land surface environments, annotation demands substantial expertise and is both time-consuming and labour-intensive, leaving the field with an acute shortage of first-rate training samples. Moreover, object categories in land cover data are inherently imbalanced, and models trained under such imbalance often underperform on categories with few samples.

This study proposes a decoupled latent diffusion framework for remote sensing (RS) semantic segmentation sample generation, together with a proportion-aware loss that improves the balance of sample classes. We tested the proposed method on the ISPRS Potsdam dataset and compared it with two classic image generation baselines. The results show that our approach outperforms the baselines, producing synthetic samples with superior visual quality and semantic consistency. To verify downstream utility, we trained DeepLabV3+, PSPNet, and SegFormer segmentation models with the synthesized data. Across all three networks, overall segmentation accuracy and class balance metrics improved markedly; gains were especially pronounced for the rare “Clutter” and “Car” categories, underscoring the proposed method’s generality and robustness. We further analysed how the proportion of synthetic samples affects performance. As the ratio of synthetic to real samples increased, mIoU and mF1 first rose and then declined; the best results were obtained when the proportion of synthesized samples approached 40%. This indicates that a moderate amount of synthetic data can significantly boost segmentation performance, whereas excessive synthetic data risks over-fitting or misclassification.
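For readers unfamiliar with the metrics named above, the following is a minimal sketch of how mIoU and mF1 are commonly computed from a confusion matrix, plus a helper for sizing a synthetic set. The function names are hypothetical, and the helper assumes that "proportion of synthesized samples" means the synthetic share of the combined training set; the abstract does not state the exact definition.

```python
def per_class_iou_f1(conf):
    """conf[i][j] = number of pixels of true class i predicted as class j."""
    n = len(conf)
    ious, f1s = [], []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                          # true c, predicted otherwise
        fp = sum(conf[r][c] for r in range(n)) - tp     # predicted c, true otherwise
        union = tp + fp + fn
        ious.append(tp / union if union else 0.0)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return ious, f1s


def mean(values):
    return sum(values) / len(values)


def synthetic_count(n_real, synthetic_fraction):
    """Synthetic samples needed so they form the given fraction of the
    combined set (assumed definition): s / (s + n_real) = fraction."""
    return round(synthetic_fraction * n_real / (1.0 - synthetic_fraction))


if __name__ == "__main__":
    confusion = [[8, 2], [1, 9]]        # toy two-class example
    ious, f1s = per_class_iou_f1(confusion)
    print("mIoU:", mean(ious), "mF1:", mean(f1s))
    print("synthetic samples for 600 real at 40%:", synthetic_count(600, 0.4))
```

Because both metrics average over classes rather than pixels, a rare class such as "Car" weighs as much as a dominant one, which is why gains on rare categories show up clearly in mIoU and mF1.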

1:45pm - 2:00pm

Bright-CC: A Novel Change Captioning Benchmark for Cross-Modal Remote Sensing Images

Jing Yao¹, Dongwei Sun², Yun Lin³, Zhangqi Zhou³, Haizhen Chen³, Xiangyong Cao², Zhengchao Chen¹

¹State Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences; ²School of Computer Science and Technology, Xi’an Jiaotong University; ³Ningbo Institute of Surveying, Mapping and Remote Sensing

Existing remote sensing change captioning methods are limited to optical-only data, precluding all-weather, all-day monitoring. To address this, we introduce Bright-CC, the first large-scale benchmark for cross-modal (Optical-to-SAR) change captioning. Curated from the newly proposed BRIGHT dataset, Bright-CC comprises 9,953 paired images focused on building damage assessment. It features dense four-class semantic labels (including intact, damaged, and destroyed) and a rich corpus of 49,765 GPT-4o-generated sentences (5 per pair), moving beyond simple binary change labels. Furthermore, we propose the Hybrid Feature Alignment Network (HFA-Net) as a robust baseline for this new task. HFA-Net is specifically designed to tackle the significant domain shift between heterogeneous sensors. Its architecture features: (1) a pseudo-siamese alignment module (HFEA) to project features into a common space; (2) a multi-scale atrous convolution module (CSTDF) to refine change context; and (3) a novel Lightweight Caption Generator (LCG), which is a parameter-efficient Transformer trained from scratch to avoid overfitting. Experiments show HFA-Net substantially outperforms adapted optical-only baselines (RSICCFormer, Chg2Cap) on all standard metrics. This work provides the community with a critical dataset and a strong baseline for future cross-modal spatio-temporal intelligence.
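As background on the atrous (dilated) convolution mentioned above, here is a small one-dimensional sketch showing how dilation widens the receptive field without adding weights. It is not the CSTDF module itself; the function names and the dilation rates used are illustrative assumptions.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """Valid (no padding) 1-D cross-correlation with a dilated kernel."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


if __name__ == "__main__":
    # Hypothetical multi-scale setting: the same 3-tap kernel at several rates.
    for rate in (1, 2, 4):
        print("rate", rate, "covers", effective_kernel_size(3, rate), "samples")
    print(dilated_conv1d([1, 2, 3, 4, 5], [1, 0, -1], 2))
```

Running several rates in parallel and fusing the outputs is the usual way such a module gathers context at multiple scales.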

2:00pm - 2:15pm

Remote Sensing Change/Damage Image Generator Based on Prior Foundation Model and Multimodal Reference Information

Peng Chen, Guorui Ma, Haiming Zhang, Di Wang, Lunjun Fan

Wuhan University, China, People's Republic of

The scarcity and high cost of acquiring high-quality post-event remote sensing images (due to cloud cover, satellite limitations, and security risks) severely constrain the development and accuracy of change/damage detection models. This data gap is especially critical in disaster or military conflict scenarios. Existing cross-temporal image generation methods often lack precise spatial and semantic control, leading to inconsistent or unrealistic synthetic results. To address this core challenge, this paper introduces the Remote Sensing Change/Damage Generator (RSCDG), a novel method based on the Latent Diffusion Model for high-fidelity simulation of post-event satellite imagery. The RSCDG’s core innovation lies in its multimodal condition embedding framework, which integrates three specialized control pathways: (1) the Pre-event Visual Prompt Adapter (built on the Prithvi model), which ensures high structural consistency between the pre-event and generated post-event images; (2) the Spatial Location Control Pathway (using a ControlNet architecture and change/damage masks), which precisely dictates the geometric location of the simulated change; and (3) the Generation Content Controller (using a CLIP Text Encoder), which enhances semantic realism by guiding the model with natural language descriptions of the change/damage. Furthermore, we introduce a Mask Alignment Loss to enforce spatial and semantic adherence to detection rules. Results demonstrate that RSCDG accurately simulates complex scenarios like new urban construction and catastrophic building collapse. RSCDG is a powerful new tool designed to augment training data and significantly accelerate high-precision disaster response and urban monitoring.

2:15pm - 2:30pm

Edge Knowledge Distillation Guided Lightweight Change Detection Network

Tingyu Ji¹,², Yixin Chen⁵,⁶, Ruiqian Zhang¹, Xiaogang Ning¹, Xiao Huang³, Hanchao Zhang¹,⁷, Weibin Ma¹, Chunquan Cheng¹, Jiaming Wang⁴

¹State Key Laboratory of Spatial Datum, Chinese Academy of Surveying and Mapping, Beijing 100036, China; ²the College of Geodesy and Geomatics, Shandong University of Science and Technology, Qingdao 266590, China; ³the Department of Environmental Sciences, Emory University, Atlanta, GA 30322, USA; ⁴the Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan 430205, China; ⁵Sichuan Institute of Land Science and Technology (Sichuan Center of Satellite Application Technology), Chengdu 610045, China; ⁶Key Laboratory of Investigation, Monitoring, Protection and Utilization for Cultivated Land Resources, MNR, Chengdu 610045, China; ⁷Joint Laboratory of Spatial Intelligent Perception and Large Model Application

Deep-learning methods dominate remote-sensing change detection (CD), yet state-of-the-art models remain parameter-heavy and struggle to produce crisp boundaries, limiting their use on edge devices. We present LEDGNet, a Lightweight Edge-knowledge-Distillation-Guided CD Network that reconciles accuracy, boundary fidelity, and efficiency. LEDGNet integrates three purpose-built components: 1) an Edge Distillation Module that mines multi-scale boundary cues from a high-capacity teacher and transfers them to a compact student through an edge-aware loss; 2) StarLite, a depth-wise separable encoder that preserves fine spatial detail while minimizing floating-point operations; and 3) LiteDecoder, an inexpensive feature-fusion head that restores full resolution without bulky up-sampling. This design halves the parameters and inference time of mainstream fine-grained CD networks while enhancing edge sharpness. On the CDD and LEVIR-CD benchmarks, LEDGNet achieves competitive F1 performance while maintaining a compact footprint of 20.58 M parameters and 35.18 G FLOPs. With an inference time of 255 ms, it strikes a balance between resource consumption and detection efficiency, making it well-suited for high-efficiency remote sensing monitoring.
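To illustrate why a depth-wise separable encoder reduces model size, the sketch below compares weight counts for a standard convolution and its depth-wise separable counterpart. This is the general textbook calculation, not StarLite's actual layer configuration; the channel sizes are hypothetical and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depth-wise k x k conv followed by a 1 x 1 point-wise conv."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    c_in, c_out, k = 64, 128, 3   # hypothetical layer shape
    full = standard_conv_params(c_in, c_out, k)
    separable = depthwise_separable_params(c_in, c_out, k)
    print(full, separable, separable / full)
```

The ratio of the two counts equals 1/c_out + 1/k², so for 3 x 3 kernels and wide layers the separable form needs roughly one ninth of the weights.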

2:30pm - 2:45pm

Leveraging Pretrained Priors for Weakly Supervised Semantic Segmentation of Remote Sensing Images

Xin Li, Nicola Genzano, Marco Gianinetto, Marco Scaioni

Politecnico di Milano, Italy

Semantic segmentation of remote sensing imagery (RSI) is essential for urban mapping, land-use monitoring, and many other applications. However, pixel-level annotation is expensive, making weakly supervised semantic segmentation (WSSS) that relies on image-level labels an attractive alternative. Pre-trained models offer strong priors from large-scale learned representations that can help WSSS, yet frozen models often yield sparse and misaligned class activation maps (CAMs) due to domain gaps and static inference. We propose a lightweight and efficient framework that integrates CLIP and DINO to address three challenges: semantic misalignment between generic text prompts and RSI-specific visuals, static CAM quality, and incomplete object coverage. Our design includes: (1) a Textual Prototype-Aware Enrichment (TPE) module that builds an RS-specific knowledge base using LLM-generated descriptions to enrich text prompts; (2) a Unified Semantic Relation Mining (USR) module that fuses learnable adapter features with CLIP attention and DINO affinity for online CAM refinement; and (3) a Visual Prototype-Aware Enrichment (VPE) module that maintains momentum visual prototypes to complete regions and sharpen boundaries. Using frozen priors while training only a lightweight decoder ensures efficiency and consistently improves segmentation accuracy across diverse remote sensing scenes. Experimental results on the iSAID and ISPRS Potsdam datasets demonstrate the effectiveness of the proposed framework, achieving 38.01% mIoU on the iSAID dataset and 47.01% mIoU with 66.89% overall accuracy on the Potsdam dataset.
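One common way to maintain momentum prototypes is an exponential moving average over per-class feature vectors. The sketch below shows that idea only; the update rule, the momentum value, and the function name are assumptions, since the abstract does not specify how the VPE module updates its prototypes.

```python
def update_prototype(prototype, features, momentum=0.9):
    """Exponential moving average update of one class prototype.

    prototype: list of floats (current prototype vector)
    features:  list of feature vectors assigned to this class in the batch
    momentum:  weight kept on the old prototype (assumed value)
    """
    if not features:
        return list(prototype)    # nothing observed for this class: keep as is
    dim = len(prototype)
    batch_mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    return [momentum * p + (1.0 - momentum) * b
            for p, b in zip(prototype, batch_mean)]


if __name__ == "__main__":
    proto = [0.0, 1.0]
    batch = [[1.0, 1.0], [3.0, 1.0]]
    print(update_prototype(proto, batch, momentum=0.5))
```

A high momentum keeps the prototype stable against noisy batches, while a lower one lets it track the features more quickly.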

2:45pm - 3:00pm

DeSEO: Physics-Aware Dataset Creation for High-Resolution Satellite Image Shadow Removal

Lorenzo Beltrame¹,², Jules Salzinger², Filip Svoboda³, Phillipp Fanta-Jende², Jasmin Lampert², Radu Timofte⁴, Marco Körner¹,⁵,⁶

¹Technical University of Munich; ²Austrian Institute of Technology; ³University of Cambridge; ⁴University of Würzburg; ⁵Munich Data Science Institute; ⁶ELLIS Unit Jena

Shadows cast by terrain and tall structures remain a major obstacle for high-resolution satellite image analysis. Public resources offering geometry-consistent paired shadow/shadow-free satellite imagery are essentially missing, even though there is a growing body of work on shadow removal in remote sensing, and most large-scale Earth-observation datasets are designed for shadow detection or 3D modelling rather than shadow removal. Existing deep shadow-removal datasets either target ground-level or aerial scenes or rely on unpaired and weakly supervised formulations rather than explicit satellite pairs. We address this gap with deSEO, a geometry-aware and physics-informed methodology that, to the best of our knowledge, is the first to derive paired supervision for satellite shadow removal from the S-EO shadow detection dataset through a fully replicable pipeline. For each tile, deSEO selects a minimally shadowed acquisition as a weak reference and pairs it with shadowed counterparts using temporal and geometric filtering, Jacobian-based orientation normalisation, and LoFTR–RANSAC registration. A per-pixel validity mask restricts learning to reliably aligned regions, enabling supervision despite residual off-nadir parallax. In addition to this paired dataset, we develop a DSM-aware deshadowing model that combines residual translation, perceptual objectives, and mask-constrained adversarial learning. In contrast, a direct adaptation of a UAV-based SRNet/pix2pix architecture fails to converge under satellite viewpoint variability. Our model consistently reduces the visual impact of cast shadows across diverse illumination and viewing conditions, achieving improved structural and perceptual fidelity on held-out scenes. deSEO therefore provides the first reproducible, geometry-aware paired dataset and baseline for shadow removal in satellite Earth observation.
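The idea of restricting learning to reliably aligned regions can be sketched as a reconstruction loss averaged only over valid pixels. The L1 form and the function name below are illustrative assumptions; the abstract does not state which per-pixel loss deSEO uses.

```python
def masked_l1(pred, target, valid):
    """Mean absolute error over pixels where the validity mask is set.

    pred, target: 2-D lists of floats with the same shape
    valid:        2-D list of 0/1 flags (1 = reliably aligned pixel)
    """
    total, count = 0.0, 0
    for p_row, t_row, v_row in zip(pred, target, valid):
        for p, t, v in zip(p_row, t_row, v_row):
            if v:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    pred = [[1.0, 5.0], [2.0, 2.0]]
    target = [[0.0, 0.0], [2.0, 4.0]]
    valid = [[1, 0], [1, 1]]      # the badly aligned pixel is excluded
    print(masked_l1(pred, target, valid))
```

Dividing by the number of valid pixels rather than the image size keeps the loss scale comparable across tiles with different amounts of masked-out parallax.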