Conference Agenda
WG III/1B: Remote Sensing Data Processing and Understanding
Session Topics: Remote Sensing Data Processing and Understanding (WG III/1)
External Resource: http://www.commission3.isprs.org/wg1
Presentations
1:30pm - 1:45pm
Multi-modal semantic segmentation for open vocabulary interactions with remote sensing images
Southwest Jiaotong University, Chengdu 611756, China
Semantic segmentation of multi-modal remote sensing imagery plays a pivotal role in land use/land cover (LULC) mapping, environmental monitoring, and precision earth observation. Current multi-modal approaches mainly focus on integrating complementary visual modalities (e.g., optical and synthetic aperture radar (SAR) imagery), yet neglect non-visual textual data, a rich source of knowledge that can bridge semantic gaps between visual patterns and real-world concepts. To address this limitation, we propose TSMNet, a text-supervised multi-modal network that integrates textual supervision with visual representations for open-vocabulary semantic segmentation. Unlike conventional multi-modal segmentation frameworks, TSMNet introduces a dual-branch text encoder to extract both scene-level semantic and object-level label information from various textual data, enabling dynamic cross-modal fusion. These text-derived features interact with visual embeddings through the proposed text-guided visual semantic fusion module, enabling domain-aware feature refinement and human-interpretable decision-making. Moreover, integrating text opens pathways for open-vocabulary semantic segmentation, enabling systems to recognize and classify unseen categories through natural language descriptions, thereby overcoming the rigid constraints of predefined class taxonomies. To verify our method, we construct two new multi-modal datasets and carry out extensive experiments comparing the proposed method with other state-of-the-art (SOTA) semantic segmentation models.
Results demonstrate that TSMNet achieves superior segmentation accuracy while exhibiting robust generalization capabilities across diverse geographical and sensor-specific scenarios. This work establishes a new paradigm for explainable remote sensing analysis, demonstrating that textual knowledge integration significantly enhances model generalizability.

1:45pm - 2:00pm
Meta-Prompting with Open-Source Language Models for Zero-Shot Scene Classification in Remote Sensing
Remote Sensing Lab, National Technical University of Athens, Greece; Department of Engineering and Sciences, Universitas Mercatorum, Rome, Italy
Zero-shot visual recognition with vision-language models (VLMs) has shown strong generalization to unseen categories in natural-image benchmarks, yet its effectiveness in remote-sensing (RS) imagery remains less explored. In this paper, we investigate whether meta-prompting with large language models (LLMs) can improve zero-shot scene classification in RS by automatically generating semantically rich class descriptions. Building on the Meta-Prompting for Visual Recognition (MPVR) framework, we evaluate three open-source LLMs, Mixtral-8x7B, Qwen 2.5 7B, and LLaMA 3.1 8B, as prompt generators across five RS benchmark datasets. The resulting descriptions are encoded with several VLMs, including CLIP, MetaCLIP, RemoteCLIP, and CLIP-LAION-RS, and compared against generic single-template and handcrafted domain-specific prompting baselines. Our results show that LLM-generated prompts are competitive with, and in several cases improve upon, manually designed templates, while revealing that the gains depend on both the dataset and the visual backbone. Overall, the study highlights the potential of open-source LLMs as scalable prompt generators for zero-shot remote-sensing recognition and provides insight into the transferability of meta-prompting beyond natural-image domains.

2:00pm - 2:15pm
Knowledge graph enhanced for zero-shot semantic segmentation in remote sensing imagery
School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China; Hubei Luojia Laboratory, Wuhan 430079, China
Zero-shot semantic segmentation (ZSSS) is a crucial task in remote sensing image understanding, yet existing methods still suffer from limited generalization to unseen classes. To address this issue, we propose a Knowledge Graph (KG) enhanced ZSSS framework, which introduces explicit hierarchical and relational information into class embeddings to achieve more structured and semantically consistent representations. Specifically, a KG class encoder is designed, consisting of the class enhanced query (CEQ) and class enhanced embedding (CEE) modules, which extract class-relevant subgraphs from a self-constructed Remote Sensing Semantic Class Knowledge Graph (RSSCKG) and generate knowledge-enriched embeddings through a text encoder. Experiments on three public remote sensing datasets demonstrate that the proposed method consistently improves performance across seven state-of-the-art ZSSS frameworks. The integration of KG-based embeddings yields significant gains in the evaluation metrics, with particularly strong improvements on unseen classes, while maintaining accuracy on seen classes. Compared with enhancement strategies based on large language model (LLM) generated descriptions, the proposed KG class encoder exhibits superior semantic separability and stability. These results validate the effectiveness, generalization, and scalability of the proposed framework for ZSSS in remote sensing imagery.

2:15pm - 2:30pm
Segmentation-driven statistics-aware workflow for detailed scene description of UAV images using Mistral and LORA fused model
Indian Institute of Space Science and Technology, Thiruvananthapuram, Kerala, India
In the era of explainable AI, rapid data processing, analysis, and generation have become essential. Over the past few years, many approaches have been developed to process such heavy data and present it in an explainable manner, including in the field of remote sensing. One such application is remote sensing scene description. Many established workflows and models exist, but these models either fail to incorporate essential geospatial information or suffer from hallucination. We present a hybrid multimodal captioning methodology that tightly couples semantic segmentation outputs (via a LoRA-adapted Segment Anything Model) with a small, high-quality LLM, Mistral, to produce descriptive, interpretable, and data-grounded scene captions. Rather than relying on direct image-to-text pipelines, our approach first extracts structured scene statistics (class proportions), spatial context (quadrant dominance and object localization), and color fingerprints (dominant colors per semantic class). These structured signals are converted into compact, factual prompts that the LLM consumes to generate coherent, informative, and verifiable captions. A comparison with the established Florence-2 model in terms of quantitative description demonstrates a significant improvement, with the Precision Vocabulary Index increasing from 0.077 to 0.232 due to the proposed workflow.

2:30pm - 2:45pm
Evaluating the Adaptation Potential of SAM2 for Glacier Segmentation in Severe Weather
Dresden University of Technology, Germany
Ground-based time-lapse cameras provide continuous, high-frequency observations of glacier dynamics; however, automated analysis of these image streams remains challenging due to fog, snowfall, lens contamination, and variable illumination. This study investigates the potential of adapting the foundation segmentation model Segment Anything Model 2 (SAM2) for glacier segmentation from ground-based monitoring. To enable integration into automated pipelines, SAM2 is configured in image mode with a learned prompt generation strategy, while fine-tuning is restricted to the prompt encoder and mask decoder. In addition, the internal Intersection over Union (IoU) prediction head is utilized as a confidence estimator to assess segmentation reliability. Experimental results demonstrate that the adapted model achieves stable segmentation under moderate environmental variability, while degrading under severe visibility loss. This stability is consistent across model scales and input resolutions. The confidence estimation further provides a meaningful signal for identifying uncertain predictions, supporting reliability-aware processing in downstream workflows.

2:45pm - 3:00pm
Reasoning-guided ego-path segmentation for autonomous trains using vision–language models
York University, Canada
Autonomous train perception must identify the train’s valid path under complex railway geometry, particularly at merging and diverging switches where multiple candidate tracks coexist. Existing approaches are primarily trained as purely visual predictors and typically do not provide justification for route selection, despite the fact that valid paths depend on structured cues such as blade–stock contact, rail gaps, and track continuity. In this work, we adapt the Large Language Instructed Segmentation Assistant (LISA) to railway ego-path perception and formulate the task as reasoning-guided segmentation: given a forward-facing railway image and a natural-language query, the model predicts the valid ego-path mask and, when prompted, generates a textual explanation grounded in visible switch geometry. Our approach integrates railway-specific prompting, a tailored annotation scheme, and efficient finetuning, along with semantic segmentation supervision to support general scene understanding. Experiments on a RailSem19-based evaluation set show improved ego-path segmentation performance over the original LISA checkpoint and increased robustness to prompt variation, while qualitative results indicate that the model can produce plausible, though not always consistent, reasoning. Notably, these capabilities emerge despite the reasoning-specific dataset consisting of only 54 samples, highlighting the data efficiency of the approach. These results highlight the potential of vision-language models for more interpretable railway perception, while also underscoring the need for stronger supervision and evaluation in safety-critical settings. Code and reasoning segmentation data are available at https://github.com/mvakili96/Railway_Perception_FoundationModel.

