Conference Agenda
Overview and details of the sessions of this conference.
WG II/4A: AI/ML for Geospatial Data
Session Topics: AI/ML for Geospatial Data (WG II/4)
External Resource: http://www.commission2.isprs.org/wg4
Presentations
1:30pm - 1:45pm
Target Vessel Identification in Aerial Search Imagery via MLLM-Based Attribute Extraction and Geolocation Fusion
Dept. of Geoinformatics, University of Seoul, Seoul, Republic of Korea
Identifying a distressed vessel among many ships detected in wide-area aerial imagery is a critical challenge in maritime Search and Rescue (SAR) operations. Conventional methods cannot determine which vessel matches the incident description, especially when Automatic Identification System (AIS) reports are uncertain. This study proposes an integrated framework that combines MLLM-based semantic attribute extraction with geolocation fusion to prioritize candidate vessels according to their consistency with Situation Report (SITREP)-based scenarios. The method detects vessels using YOLOv8, tracks them with Deep Simple Online and Real-time Tracking (DeepSORT), and performs image-based georeferencing using onboard metadata. A Multi-modal Large Language Model (MLLM) extracts appearance and status attributes from representative vessel images, while scenario descriptions are also converted to attributes. Both sets are encoded using MiniLM embeddings. Finally, semantic similarity is fused with geolocation proximity within a Support Vector Machine (SVM) classifier to produce a probability-ranked list of candidates. Experiments using real aerial search footage demonstrate robust identification performance across a range of scenario quality levels. The correct vessel appears within the top three candidates in more than 73% of cases and within the top five in more than 91%, even when attribute extraction is affected by low resolution, illumination effects, or missing scenario information. These results show that coarse semantic cues, when combined with approximate geolocation, provide a resilient basis for identifying target vessels under high uncertainty.
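The fusion of semantic similarity with geolocation proximity described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract fuses the two cues inside an SVM classifier, whereas the sketch below swaps in a plain weighted sum so that it runs without any trained model. The function names, the toy three-dimensional vectors standing in for MiniLM embeddings, the `weight` and `scale_km` parameters, and the exponential distance decay are all assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a spherical Earth."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))

def rank_candidates(scenario_vec, reported_pos, candidates, weight=0.5, scale_km=5.0):
    """Rank candidate vessels by a weighted sum of two cues.

    Assumption: a weighted sum replaces the SVM named in the abstract.
    `weight` balances semantic similarity against proximity, and
    `scale_km` controls how quickly proximity decays with distance.
    """
    scored = []
    for cand in candidates:
        similarity = cosine_similarity(scenario_vec, cand["vec"])
        distance = haversine_km(*reported_pos, *cand["pos"])
        proximity = math.exp(-distance / scale_km)  # 1.0 at the reported position
        scored.append((weight * similarity + (1 - weight) * proximity, cand["id"]))
    return [cand_id for _, cand_id in sorted(scored, reverse=True)]

# Hypothetical data: toy vectors stand in for attribute embeddings.
candidates = [
    {"id": "A", "vec": [0.9, 0.1, 0.4], "pos": (35.01, 129.01)},  # similar and close
    {"id": "B", "vec": [0.0, 1.0, 0.0], "pos": (35.00, 129.00)},  # close, dissimilar
    {"id": "C", "vec": [1.0, 0.0, 0.5], "pos": (36.00, 130.00)},  # similar, far away
]
ranking = rank_candidates([1.0, 0.0, 0.5], (35.0, 129.0), candidates)
print(ranking[0])
```

In this toy setup candidate A ranks first because it is the only one that scores well on both cues, which is the behaviour the fusion step is meant to reward.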
The proposed framework offers a practical foundation for automated SAR decision support, enabling faster and more reliable prioritization during wide-area maritime search operations.
1:45pm - 2:00pm
Zero-shot Vision-Language Reranking for Cross-View Geolocalization
1Dept. of Electrical and Computer Engineering, The Ohio State University, United States of America; 2US Army Engineer Research and Development Center, Geospatial Research Laboratory, United States of America; 3Dept. of Civil Engineering, The Ohio State University, United States of America
Cross-view geolocalization (CVGL) systems, while effective at retrieving a list of relevant candidates (high Recall@k), often fail to identify the single best match (low Top-1 accuracy). This work investigates the use of zero-shot Vision-Language Models (VLMs) as rerankers to address this gap. We propose a two-stage framework: state-of-the-art (SOTA) retrieval followed by VLM reranking. We systematically compare two strategies: (1) Pointwise (scoring candidates individually) and (2) Pairwise (comparing candidates relatively). Experiments on the VIGOR dataset show a clear divergence: all pointwise methods cause a catastrophic drop in performance. In contrast, a pairwise comparison strategy using LLaVA improves Top-1 accuracy over the strong retrieval baseline. Our analysis concludes that VLMs are poorly calibrated for absolute relevance scoring but are effective at fine-grained relative visual judgment, making pairwise reranking a promising direction for enhancing CVGL precision.
2:00pm - 2:15pm
From Pixels to Polygons: Evaluating Vision Foundation Models for High-Resolution Orthophoto Segmentation and Vectorization
University of the Bundeswehr Munich, Germany
Accurate and topologically consistent vector data extraction from aerial imagery is essential for geospatial applications such as urban mapping and the enrichment of volunteered geographic information (VGI) platforms like OpenStreetMap (OSM). While deep learning has advanced automated feature extraction, traditional supervised networks remain constrained by large annotation requirements and limited generalization. Recent vision foundation models offer a promising alternative through promptable, zero-shot segmentation capabilities. This study presents a modular “pixels-to-polygons” workflow for transforming high-resolution orthophotos (20 cm DOP20) into GIS-ready vector data, comprising four stages: data preparation, segmentation, vectorization, and validation. The framework is model-agnostic and designed to integrate different vision foundation models without modification to downstream processing. As a representative instantiation, we evaluate the Segment Anything Model 2 (SAM2) in automatic mask generation mode over the University of the Bundeswehr Munich campus. The resulting segmentation masks are polygonized and compared against OSM reference layers. Preliminary results indicate that vision foundation models can effectively delineate major built-up areas, vegetation, and transport infrastructure without task-specific training, producing geometrically smooth and topologically valid polygons. However, smaller or shadowed objects remain challenging, often leading to partial merging or fragmentation. Future work will include quantitative evaluation using raster-based Intersection-over-Union and vector-domain metrics such as completeness, correctness, and geometric fidelity.
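The raster-based Intersection-over-Union named as future work can be sketched as follows. This is a minimal illustration rather than the study's evaluation code: it assumes the predicted mask and the OSM reference have already been rasterized onto the same grid as binary arrays, and the function name `raster_iou` and the convention of returning 1.0 for two empty masks are assumptions.

```python
def raster_iou(pred, ref):
    """Intersection-over-Union of two binary masks on the same grid.

    `pred` and `ref` are equally sized 2D lists of 0/1 values.
    Assumption: two empty masks count as a perfect match (IoU = 1.0).
    """
    intersection = 0
    union = 0
    for row_pred, row_ref in zip(pred, ref):
        for p, r in zip(row_pred, row_ref):
            if p and r:
                intersection += 1
            if p or r:
                union += 1
    return intersection / union if union else 1.0

# Hypothetical 2 x 3 masks: two cells overlap, four cells are covered in total.
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
print(raster_iou(predicted, reference))  # 0.5
```

IoU penalizes both over-segmentation and under-segmentation with a single number, which is why it is usually paired with the vector-domain metrics listed in the abstract when boundary quality matters.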
The study provides an initial assessment of vision foundation models for high-resolution remote sensing and highlights their potential to bridge the gap between general-purpose segmentation models and GIS-compatible vector data generation.
2:15pm - 2:30pm
Polarization-Aware Segmentation for Camouflaged Threat Detection from UAVs
Department of Earth and Space Science and Engineering, York University
Surface-laid unexploded ordnance (UXO) and landmines constitute a critical humanitarian crisis. While unmanned aerial vehicles (UAVs) provide a scalable remote sensing solution, detecting modern, non-metallic explosive devices in cluttered environments remains a profound Camouflaged Object Detection (COD) challenge. Traditional optical sensors frequently suffer from foreground-background confusion when a target's texture mimics its surroundings. To overcome these physical bottlenecks, we introduce XPol-Net, a novel multimodal architecture synergizing the semantic reasoning of Vision Transformers with the deterministic physics of polarimetric imaging. Built on a hierarchical PVTv2 backbone, XPol-Net utilizes a progressive Dual Cross-Attention Strategy for effective modality fusion. In early stages, Channel Cross-Attention (CCA) filters material-specific Degree of Linear Polarization (DoLP) cues to suppress background clutter. In deeper stages, Spatial Cross-Attention (SCA) dynamically aligns high-level RGB semantics with strict structural boundaries. To enhance robustness and prevent modality collapse, we deploy a multi-task auxiliary learning framework that reconstructs the continuous Angle of Linear Polarization (AoLP) map. On the PCOD benchmark, XPol-Net achieves state-of-the-art results in global structural alignment (E_phi of 0.980 and 0.984 at 352 x 352 and 704 x 704, respectively). While minor trade-offs are observed in localized metrics such as S_alpha or F_beta, XPol-Net remains highly competitive, consistently delivering superior results in E_phi and MAE. By prioritizing structural recall over localized strictness, XPol-Net ensures the complete discovery of concealed targets, establishing a reliable, physics-aware foundation for humanitarian demining operations.
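For readers unfamiliar with the two polarization quantities named in this abstract, the sketch below shows how DoLP and AoLP are commonly derived from the linear Stokes parameters. The abstract does not say how its inputs were measured, so the four-angle intensity input (0, 45, 90, and 135 degrees, as produced by a division-of-focal-plane polarization camera) and the function name `linear_polarization` are assumptions for illustration only.

```python
import math

def linear_polarization(i0, i45, i90, i135):
    """Return (DoLP, AoLP in radians) for one pixel.

    Inputs are intensities behind linear polarizers at 0, 45, 90 and
    135 degrees (assumed acquisition scheme, not stated in the abstract).
    """
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90                       # horizontal vs. vertical preference
    s2 = i45 - i135                     # diagonal preference
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    aolp = 0.5 * math.atan2(s2, s1)     # in the range (-pi/2, pi/2]
    return dolp, aolp

# Light fully polarized at 45 degrees: DoLP is 1 and AoLP is pi/4.
dolp, aolp = linear_polarization(0.5, 1.0, 0.5, 0.0)
print(dolp, math.degrees(aolp))
```

DoLP lies between 0 (unpolarized) and 1 (fully linearly polarized) and tends to differ between man-made surfaces and natural clutter, which is the kind of material-specific cue the abstract's channel cross-attention stage is described as filtering.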

