Conference Agenda
Overview and details of the sessions of this conference.
|
Daily Overview | |
|
Location: 713A 125 theatre |
| Date: Monday, 06-July-2026 | |
| 8:30am - 10:00am | WG III/1A: Remote Sensing Data Processing and Understanding Location: 713A |
|
|
8:30am - 8:45am
Cube Kernel: A Novel Approach to Enable Local Gradient Flow Across Channels in CNNs University of Glasgow, United Kingdom Understanding inter-band and cross-channel relationships is essential for human color perception and object recognition. Yet, local gradients in standard convolutions are tied to fixed input–output channel pairs, and thus channels are fused by a dense, fully-coupled weight tensor: each output channel aggregates all input channels in a uniform way at every spatial location. This leads to heavy computation and does not exploit structured sparsity or selective local channel mixing. To overcome this limitation, we introduce Cube Kernel, a novel convolutional operator that introduces structured cross-channel groups into the local gradient. This design strengthens cross-channel feature fusion, improves optimization efficiency, and reduces computational overhead. Extensive building extraction experiments validate its effectiveness: Cube Kernel consistently outperforms standard convolutions and Involution when integrated into UNet, and replacing a single layer in DeepLabV3+, Swin-UNet, or UNet leads to consistent performance gains. Beyond serving as a lightweight plug-in module, Cube Kernel also scales effectively as a fundamental building block. A Cube-enhanced ConvNeXt variant, ConvNeXt-Cube, achieves state-of-the-art performance across all models (0.9095 IoU / 0.9535 F1 on WBD and 0.9133 IoU / 0.9547 F1 on WHU), demonstrating strong stackability and architectural potential. These results highlight a largely overlooked space in CNN design: enhancing cross-channel interaction at the gradient level. Cube Kernel offers a scalable and efficient alternative to deepen networks for channel mixing, laying a foundation for future advancements in convolutional architecture design. 8:45am - 9:00am
Land Surface Dynamics Modeling and Prediction with dual Latent-Space Representations 1Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China; 2School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, China; 3Yazhou Bay Innovation Institute, Hainan Tropical Ocean University, Sanya, China; 4The University of Hong Kong, Hong Kong, China Modeling land surface dynamics from satellite observations is crucial for revealing change patterns and predicting future states, although effective modeling methods remain limited. For complex systems such as reaction-diffusion, two approaches have proven particularly effective: (i) Direct modeling in the high-dimensional observation space with deep networks (e.g., Wang et al., 2022). These methods are often autoregressive, so errors accumulate during rolling extrapolation. (ii) Modeling in a reduced-dimensional latent space (e.g., Chen et al., 2022), where the dimension is first reduced and the evolution is then learned. Some works estimate the intrinsic dimension (ID) and model in the ID latent space. This improves long-term stability, but reliance on latent representations may reduce accuracy. This route is promising if two issues are addressed: (1) effectively modeling multi-scale spatiotemporal data with long sampling intervals; (2) combining ID-space modeling with other latent dimensions to balance accuracy and stability. This paper proposes a Dual Latent-Space Representation-based Land Surface Dynamic Model (DLS-LSDM). The core contributions are: (1) a stacked-convolution and multi-scale linear-attention autoencoder to obtain a base latent, together with ID estimation to derive an ID latent; (2) a long-horizon scheme that combines ID and base latents to achieve both stability and high accuracy; (3) comprehensive evaluation on ten-year MODIS NDVI across multiple climate zones, demonstrating superiority.
9:00am - 9:15am
Revealing Feature Contribution Mechanisms for Interpretable CNN-Transformer Remote Sensing Classification 1Wuhan University; 2China University of Geosciences; 3Nanjing University of Information Science and Technology Deep learning models have become the backbone of intelligent remote sensing image classification, enabling high-precision recognition of land cover, geospatial objects, and scene categories. However, their inherent "black-box" nature, where decision logic is embedded in complex parameter spaces, poses critical barriers to deployment in high-stakes domains such as military reconnaissance, disaster monitoring, and environmental governance. These fields demand transparent reasoning to validate model reliability, yet traditional interpretability methods suffer from two key limitations when applied to remote sensing data. First, they are primarily designed for natural images and fail to account for remote sensing-specific characteristics. Second, they focus on local feature attribution or saliency mapping but lack quantitative analysis of how core image features (shape, texture, spectrum) contribute to global classification decisions, especially across different network architectures. To address these problems, this study proposes a comprehensive feature contribution analysis framework tailored to remote sensing images, with the core objectives of: (1) Decoupling and extracting shape, texture, and spectrum features from remote sensing images in a physically meaningful manner; (2) Quantifying the contribution of each feature type to classification decisions; (3) Revealing differences in feature processing mechanisms between CNN and Transformer architectures.
9:15am - 9:30am
EfficientViM-CD: An Efficient Remote Sensing Change Detection Network Based on Hidden State-Mixer 1State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, People's Republic of China; 2School of Information Science and Engineering, Wuchang Shouyi University High-resolution optical remote sensing change detection (CD) is of great significance in urban evolution monitoring, disaster assessment, and land management. Traditional deep models often face computational, memory, and inference latency bottlenecks when processing large high-resolution imagery. To address this, we propose EfficientViM-CD, an efficient remote sensing change detection network based on a Hidden-State Mixer. The approach builds upon the EfficientViM backbone, migrating global interaction operations into a compact hidden state space and leveraging a Hidden State Mixer based on state space duality (HSM-SSD) to fuse global context while reducing computational complexity. We employ a Siamese encoding architecture to extract multi-scale features and hidden states from paired temporal images, and utilize a Cross-Hidden Fusion module to integrate hidden semantic interactions between time points. At each scale, local difference features are computed and enhanced in hidden state space, and a multi-scale decoder reconstructs a pixel-level change probability map. We conducted experiments on four public datasets (LEVIR-CD+, WHU-CD, S2Looking, SVCD) and compared against nine state-of-the-art methods. Results demonstrate that EfficientViM-CD achieves competitive accuracy while delivering significant advantages in inference speed and memory efficiency. This method offers a lightweight, efficient, and scalable solution for high-resolution remote sensing change detection, with potential for real-time monitoring and emergency response systems.
9:30am - 9:45am
Local NMS: Enhancing Object Detection in Large-Scale Remote Sensing Images via iterative pipelined Postprocessing Fraunhofer IOSB, Germany Object detection in large, dense remote sensing imagery is difficult because targets are often small and arbitrarily oriented, and state-of-the-art detectors cannot process very large images directly without a reduction in accuracy. Tiling-based inference workflows mitigate the latter issue by running inference iteratively on overlapping tiles, but introduce pre- and postprocessing overhead for image tiling and Non-Maximum Suppression (NMS). We introduce local NMS, an asynchronous tile-wise postprocessing scheme. Local NMS runs in a separate subprocess in parallel to tile-wise inference and collects intermediate results enqueued by the inference process, immediately applying postprocessing. Intelligent reordering of tiles in a preprocessing step ensures optimal usage of computing resources. We assess our method using three state-of-the-art object detection models for horizontal and oriented bounding box detection on two benchmark datasets containing large dense aerial and satellite images, DOTA-v2.0 and Izembek Lagoon Birds, stratifying by image size and average object density. Local NMS consistently reduces end-to-end runtime across models and datasets without significant impact on mAP. A maximum runtime reduction of 60.77% is achieved on large dense DOTA-v2.0 scenes without modifying model architectures or retraining.
9:45am - 10:00am
ERD: Extended RAW-Diffusion Framework for De-rendering sRGB Images 1Department of Computer Science, University of Toronto, Canada; 2Faculty of Geographical Science, Beijing Normal University, China Recovering RAW sensor measurements from rendered sRGB images is important for radiometric calibration, low-level vision, and computational photography. However, reversing a camera’s proprietary Image Signal Pipeline (ISP) is highly challenging, especially when the ISP is unknown. Existing inverse-ISP and diffusion-based approaches have several issues: they depend on known ISPs from the sensor, require one model per sensor, or generalize poorly across camera brands. This work presents ERD (Extended RAW-Diffusion), a unified diffusion-model framework that de-renders any given sRGB image into RAW format without requiring the ISP to be known or camera information from the image. ERD extends the RAW-Diffusion architecture by incorporating camera metadata only during training, allowing the model to learn a shared representation across heterogeneous sensors. To capture global sensor characteristics, ERD introduces a conditioning mechanism, Feature-wise Linear Modulation (FiLM), for global features such as CFA patterns and color gains. To enhance structural consistency, ERD integrates a ControlNet branch that injects edge and gradient priors derived from the sRGB input, stabilizing RAW reconstruction under diverse tone-mapping operations. For practical adaptation to new sensors, ERD supports efficient few-shot tuning via LoRA. Evaluations on Adobe FiveK (Nikon and Canon) and RAW-NOD (Nikon and Sony) show that ERD outperforms state-of-the-art baselines in PSNR and SSIM, offering improved robustness to unseen camera models. ERD enables a practical, general-purpose inverse ISP process across heterogeneous imaging devices. |
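As an illustration of the FiLM conditioning named in the ERD abstract, the sketch below applies a per-channel scale and shift predicted from a conditioning vector. The abstract does not say how ERD wires this into its diffusion network, so the function names (`linear`, `film`), the plain linear projections, and the toy shapes are assumptions for illustration only.

```python
from typing import List, Sequence


def linear(weights: Sequence[Sequence[float]],
           bias: Sequence[float],
           x: Sequence[float]) -> List[float]:
    """Plain affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features: Sequence[Sequence[float]],
         cond: Sequence[float],
         w_gamma: Sequence[Sequence[float]],
         b_gamma: Sequence[float],
         w_beta: Sequence[Sequence[float]],
         b_beta: Sequence[float]) -> List[List[float]]:
    """Feature-wise Linear Modulation on a (channels x positions) feature map.

    The conditioning vector (hypothetically, encoded sensor metadata) is
    projected to one scale (gamma) and one shift (beta) per channel, and
    each channel is transformed as gamma * x + beta.
    """
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [[gamma[c] * v + beta[c] for v in channel]
            for c, channel in enumerate(features)]


# Toy example: 2 channels, 3 positions, 2-dimensional conditioning vector.
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
cond = [1.0, 0.0]
modulated = film(
    features, cond,
    w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 1.0],
    w_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.5, 0.0],
)
```

Here the conditioning vector yields gamma = [2, 1] and beta = [0.5, 1], so the first channel is doubled and shifted while the second is only shifted.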
| 1:30pm - 3:00pm | WG II/2A: Point Cloud Generation and Processing Location: 713A |
|
|
1:30pm - 1:45pm
LGSSM: Local-to-global state space model for serialized point cloud semantic segmentation School of Geodesy and Geomatics, Hubei Luojia Laboratory, Wuhan University Point clouds have become essential data for describing real-world objects. Accurate and efficient 3D semantic segmentation plays a crucial role in environment understanding and scene reconstruction. However, current segmentation methods still face challenges from unordered data, high computational complexity, limited scene perception, and insufficient generalization. To address these issues, we propose a local-to-global semantic segmentation method based on a state-space model (LGSSM). Specifically, the proposed method uses three-dimensional serialization encoding to serialize point clouds along the x, y, and z directions, effectively addressing the inherent disorder of point clouds and enhancing spatial representation. Then, the local state space model extracts fine-grained local geometric structural information and the global state space model captures the overall scene representation, improving the modeling ability for both short and long distances. Finally, the serialized context aggregation module is utilized to fuse contextual features to promote spatial semantic consistency. Extensive experiments conducted on ScanNet, ScanNet200, and S3DIS demonstrate that our model achieves state-of-the-art segmentation accuracy compared with other existing methods. 1:45pm - 2:00pm
Hierarchical Gaussian Partitioning for Semantic Segmentation of Airborne LiDAR Scenes 1Alteia, France; 2Inria Sophia-Antipolis, France In this paper, we present a novel approach to semantic segmentation of airborne LiDAR point clouds that integrates a hierarchical Gaussian Mixture Model (hGMM) within the Superpoint Transformer (SPT) framework. The hGMM constructs a coarse-to-fine representation of the scene by recursively fitting Gaussian components to spatially coherent subsets of the point cloud, resulting in a hierarchical and structured decomposition that serves as a structured token set for the segmentation objective. While Gaussian Mixture Models (GMMs) can virtually fit any distribution, we constrain their use to structured suburban scenes, where their parametric form is naturally suited to represent planar and ellipsoidal geometries, hence allowing parsimonious mixtures. Experimental results on the DALES benchmark demonstrate that our method achieves competitive performance with respect to state-of-the-art approaches, with notable improvements on classes such as ground and buildings. Results on indoor S3DIS confirm the method's intended specificity to outdoor environments. These findings validate hGMM as a principled and effective alternative to heuristic partitioning techniques, integrating stochastic modelling with transformer-based semantic reasoning in large-scale 3D environments. 2:00pm - 2:15pm
MCPF-Net: Multi-stage LiDAR-Image Collaborative Perception Fusion Network for Point Cloud Semantic Segmentation in Urban Scenes 1School of Geodesy and Geomatics, Wuhan University, Wuhan 430079, China; 2Hinton STAI Institute, East China Normal University, Minhang, Shanghai 200241, China; 3Hubei Luojia Laboratory, Wuhan 430079, China Point cloud semantic segmentation through multi-modal fusion provides a fundamental basis for surface observation and visual perception tasks. LiDAR provides precise geometric structural information, while optical images offer rich semantic and textural details. However, existing fusion methods still suffer from limited cross-modal perception and insufficient information complementarity. To address these challenges, we propose a multi-stage LiDAR-image collaborative perception fusion network (MCPFNet) for point cloud semantic segmentation in urban scenes. In the mid-stage, the network introduces a geometric-aware fusion module (GAFM) and a semantic-aware fusion module (SAFM) to achieve bi-directional injection of structural and semantic features between LiDAR and image modalities. In the later stage, an adaptive feature fusion module (AFFM) is designed to refine semantic representations through gated weighting and bi-directional attention mechanisms. Extensive experiments demonstrated that MCPFNet achieved the best mIoU scores of 74.51%, 72.10%, and 95.15% on the ISPRS Vaihingen, FRACTAL, and N3C datasets, respectively, validating its superior performance in multi-modal semantic segmentation.
2:15pm - 2:30pm
Cross-Sensor Robustness and Spatial Generalization for 3D Railway Point Cloud Semantic Segmentation CINTECX, GeoTECH group, Universidade de Vigo This contribution investigates the cross-sensor and spatial generalization of deep learning methods for 3D semantic segmentation in railway environments. Although current models achieve high accuracy on large benchmark datasets, their robustness under real-world acquisition variability remains insufficiently understood. To address this gap, three state-of-the-art architectures—Point Transformer v3, Swin3D, and MinkUNet—were trained on the SemanticRail3D dataset and evaluated on a newly acquired 120-m railway section captured with three heterogeneous LiDAR systems: a Faro Focus S150+ terrestrial laser scanner, a CHCNAV RS10 handheld device, and a GeoSLAM ZEB Go SLAM-based scanner. The case-study point clouds were carefully registered, normalized, voxelized, and manually annotated to provide consistent ground truth across sensors. A standardized preprocessing and test-time augmentation pipeline was applied to ensure compatibility with the training domain. Generalization performance was analysed through per-class IoU, cross-model agreement, and sensor-dependent degradation patterns. Results show significant variability across acquisition platforms, with denser, low-noise scans enabling better transferability, while sparser SLAM-based point clouds remain challenging for thin or small components such as overhead wires. To mitigate cross-sensor variability, an IoU-weighted ensemble strategy was introduced, leveraging complementary model strengths without requiring retraining. This ensemble consistently improved or matched the performance of individual models on the case-study datasets. Overall, the study demonstrates the importance of evaluating semantic segmentation models under realistic multi-sensor conditions and provides a practical benchmark and methodology for assessing domain-shift effects in railway point clouds. 
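The IoU-weighted ensemble mentioned in the abstract above could be sketched as follows. The abstract does not give the weighting rule, so this is one plausible reading: each model votes for its predicted class, and the vote counts as much as that model's per-class IoU on held-out data. The function name, model names, class indices, and IoU values are all hypothetical.

```python
from typing import Dict, List, Sequence


def iou_weighted_vote(predictions: Dict[str, Sequence[int]],
                      class_iou: Dict[str, Sequence[float]],
                      num_classes: int) -> List[int]:
    """Fuse per-point labels from several already-trained models.

    predictions[name][i] is the class predicted by model `name` for point i.
    class_iou[name][c] is that model's IoU for class c (assumed to come
    from a validation set); it is used as the weight of the vote.
    """
    model_names = list(predictions)
    num_points = len(predictions[model_names[0]])
    fused: List[int] = []
    for i in range(num_points):
        scores = [0.0] * num_classes
        for name in model_names:
            label = predictions[name][i]
            scores[label] += class_iou[name][label]
        # Ties resolve to the lowest class index.
        fused.append(max(range(num_classes), key=lambda c: scores[c]))
    return fused


# Toy example with classes 0 = ground, 1 = rail, 2 = overhead wire.
class_iou = {
    "model_a": [0.90, 0.80, 0.20],
    "model_b": [0.85, 0.78, 0.60],
    "model_c": [0.80, 0.75, 0.50],
}
predictions = {
    "model_a": [0, 1, 1, 2],
    "model_b": [0, 1, 2, 1],
    "model_c": [0, 2, 2, 2],
}
fused_labels = iou_weighted_vote(predictions, class_iou, num_classes=3)
```

For the last point, two models vote for class 2 but with low class-2 IoU (0.20 + 0.50), while one model votes for class 1 with IoU 0.78, so the weighted vote picks class 1 where a plain majority vote would pick class 2. No retraining is involved; only stored predictions and IoU tables are combined.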
2:30pm - 2:45pm
Revisiting NeRF for Street Scene Point Cloud Semantic Segmentation in the Era of 3DGS University of Oxford, United Kingdom Accurate semantic segmentation of urban point clouds is fundamental for autonomous driving and city mapping. Recent advances in neural scene representations, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), have significantly improved photorealistic reconstruction quality. However, 3DGS is primarily designed for small-scale, object-centric scenes with dense viewpoints, and its optimization becomes sub-optimal in large-scale street scenes with trajectory-constrained observations, leading to semantic errors and distorted geometry. In this work, we revisit NeRF-based scene representation in the era of 3DGS to address these challenges. Our method integrates the explicit and efficient modeling strategy of 3DGS with the surface-constrained sampling nature of NeRF. Specifically, we employ Deformable Neural Mesh Primitives (DNMPs) to jointly encode geometry and semantics, enabling efficient ray–mesh intersection sampling and neural field interpolation. This formulation achieves 3D-annotation-free point cloud semantic segmentation by leveraging rendered image supervision. Experiments on the KITTI-360 dataset demonstrate that our approach surpasses the Street Gaussians baseline in overall mIoU and across most semantic categories. The improvement mainly stems from reducing semantic errors caused by limited viewpoints during 3D Gaussian optimization, providing a robust and scalable solution for street scene semantic understanding. 2:45pm - 3:00pm
Extraction of Pole-like Road Objects from MMS Point Clouds Using Deep Learning and Geometric-Topological Feature Fusion AERO TOYOTA CORPORATION, Japan This paper presents a fusion framework for the automatic extraction of pole-like road objects—such as traffic lights, road signs, streetlights, and utility poles—from Mobile Mapping System (MMS) point clouds. The proposed method integrates KPConv-based semantic segmentation with geometric–topological reasoning to achieve structural completion and false-positive suppression without retraining or additional annotated data. The framework was trained on 8 km of manually labeled MMS data from the Kinki region, Japan, and evaluated on large-scale unseen data from Hokkaido (≈ 26 km, 2.53 billion points) and the Paris–Lille-3D benchmark (France) acquired with a different LiDAR sensor. The proposed approach significantly outperformed the KPConv baseline. On the Hokkaido dataset, the F₁-score improved from 0.8263 to 0.8689 (+0.0426), successfully reconstructing lamp tops, signal arms, and previously unseen snow delineator posts (snow poles). On the Paris–Lille-3D benchmark, recall increased by 15.5 points, yielding an overall F₁-score gain of +0.0802. The 26 km Hokkaido dataset was processed in less than 13 hours on a single NVIDIA Quadro RTX 8000. These results demonstrate that the proposed deep learning–geometry–topology fusion achieves robust, scalable, and efficient performance across diverse geographic and sensor domains, supporting nationwide road-asset mapping and digital-twin generation. |
| 3:30pm - 5:15pm | ThS4A: Toward Smart Forests: Emerging Tools in Remote Sensing, Artificial Intelligence, and Field Robotics Location: 713A |
|
|
3:30pm - 3:45pm
AI-Enabled Forest Inventory in TerraScan: integrating Georeferencing, Species Identification, and Volume Computation Terrasolid LTD, Hatsinanpuisto 8, 02600, Espoo, Finland The Terrasolid software suite provides an automated and scalable framework for large-scale LiDAR data processing, widely adopted in both national and private forest inventories. Its unified processing pipeline covers all essential steps—from point cloud import and georeferencing to ground classification, object detection, tree segmentation, and computation of individual-tree attributes such as diameter at breast height (DBH), height, volume, and tree species. Georeferencing is initially performed in TerraScan using signal markers or automatically detected tree trunks, with optional refinement in TerraMatch, which corrects angular misalignments between flight lines. Following object classification, individual trees are extracted from points labeled as trees. The semi-manual Group Inspection tools support efficient correction of segmentation errors, such as merged or over-segmented trees, after which stem-wise metrics are automatically updated. These conventional modules rely on optimized algorithms capable of processing hundreds of millions of points within minutes. A recent innovation, the Tree Species tool, introduces one of the first AI-based extensions within Terrasolid software. It employs a machine learning approach that integrates 2D raster-based features with 3D point cloud descriptors to achieve accurate tree species identification. Validation was conducted using the FOR-species20K dataset, comprising 33 species collected worldwide. Among several tested classifiers, the Histogram Gradient Boosting Classifier (HGBC) achieved the highest accuracy. To mitigate class imbalance, multiple side-view rasterizations and SVM-SMOTE oversampling were applied, significantly improving the separability of underrepresented species and overall classification robustness. 3:45pm - 4:00pm
Spatiotemporal Foundation Model for Aboveground Biomass Estimation: A case study in Mixedwood Plains Ecozone, Ontario, Canada 1McMaster University; 2Environment and Climate Change Canada Traditional aboveground biomass estimation for forested areas relies on allometric equations (Návar, 2009), which use input variables such as diameter at breast height (DBH), tree height, and tree species or broader taxonomic group. Although allometric equations can estimate the biomass of individual trees, and stand-level equations exist for larger scales, they often require extensive field data, making them less suitable for densely clustered or remote forests. However, satellite images provide increasingly detailed global observations of forested areas, and spaceborne lidar data like GEDI (Duncanson et al., 2022) provide accurate measurements for canopy height across different ecozones worldwide. In recent years, foundation models (FMs) inspired by large language models (Vaswani et al., 2017) have become the new paradigm to leverage large amounts of unlabelled data through self-supervised pre-training and have shown capacity to benefit multiple downstream tasks. In this work, we adopt the Granite foundation model (Muszynski et al., 2024) as a baseline to improve aboveground biomass estimation on different satellite data, using the Mixedwood Plains Ecozone (MPE) as a case study. We also explore adding temporal, geospatial, and spatiotemporal features and validate the proposed spatiotemporal foundation model with field sampling plots. 4:00pm - 4:15pm
Improving Tree Species Detection for Operational Forestry: The Role of Dataset Design Department of Environmental Systems Science, Institute of Terrestrial Ecosystems, ETH Zurich, 8092 Zurich, Switzerland Accurate detection and mapping of individual trees and their species are vital for sustainable forest management. Traditional field-based inventories remain the gold standard in forest monitoring, but are increasingly limited by temporal, spatial and accessibility constraints. Remote sensing offers faster, repeatable, and high-resolution data that complement and scale beyond field inventories. However, species-level detection remains difficult due to overlapping crowns and spatial mismatches between crowns and trunks. Deep learning (DL) methods, particularly convolutional neural networks, have advanced crown delineation by automatically learning spatial and spectral patterns from imagery. Yet, their success depends heavily on dataset quality, class balance, and diversity. To address this, we apply a DL object detection framework for tree crown and species detection in Swiss forests and evaluate how dataset composition and training strategies influence accuracy and generalization. We test three dataset configurations: (1) an unbalanced masked dataset, (2) a class-balanced masked dataset, and (3) a mixed dataset combining masked and unmasked imagery. Results show that class balancing improves accuracy for both dominant and minority species, while mixed data enhances generalization.
4:15pm - 4:30pm
Self-Supervised Leaf-Off Segmentation of Tree Functional Types and Buildings from Airborne NIRGB and LiDAR Data in Southern Ontario 1McMaster University, School of Earth Environment Society, Canada; 2Environment and Climate Change Canada High-resolution airborne sensing enables joint mapping of urban infrastructure and forest composition at ecological scales. This study presents a self-supervised segmentation framework that fuses 0.5 m Near-Infrared + RGB (NIRGB) orthophotography from the Ontario Imagery Program (2013–2026) with canopy height models (CHM) derived from the Ontario Elevation Mapping Program (8–10 pulses m⁻², 5–10 cm vertical accuracy). Imagery was collected during the leaf-off season, which provides strong spectral–structural contrast between evergreen and deciduous crowns. The framework produces high-fidelity land-cover segmentations that differentiate vegetation functional types and built structures as a prerequisite for tree-level biomass and carbon-stock estimation.
4:30pm - 4:45pm
Updating Forestry Road networks in Ontario Using Single Photon LiDAR and Deep Learning-enhanced algorithms Department of Wood and Forest Sciences, Université Laval, Québec, Canada Spatially accurate forestry road networks are essential for effective forestry operations, sustainable resource management, and conservation. Current forestry road databases in Ontario have significant location errors due to limitations and human errors associated with conventional road delineation approaches such as GPS-based field surveys and photointerpretation. A previously developed algorithm, which used airborne laser scanning (ALS) data, successfully corrected road locations in Quebec. However, its design limited its application in other landscapes, ALS instruments, and road construction and maintenance practices. This study advances that algorithm by integrating a deep learning component to improve its robustness and scalability for diverse forest conditions. A hybrid workflow combines the original friction-based conductivity surface with a road probability surface generated by an Attention Residual U-Net model trained on 11 LiDAR-derived features using road segments from five forest sites in Quebec. The enhanced workflow was applied to two forest management units in Ontario: Nipissing and Dryden. The results showed significant improvement in road alignment when compared to the existing provincial data and the outputs from the earlier automated approach. The deep learning-enhanced algorithm lowered mean positional error by 78% (from 9.36 m to 2.07 m) and increased the proportion of road centerline points within 3 m of the reference from 66.7% to 87.2%. These improved centerline accuracies will further support a scalable tool for rapid and accurate forestry road network mapping, which in turn will aid sustainable forest management and conservation planning at both provincial and national scales. 4:45pm - 5:00pm
Attention-guided Multi-Scale Deep Learning Approach for Tree Health Detection Using Very High-Resolution Aerial Imagery Department of Environmental Systems Science, Institute of Terrestrial Ecosystems, ETH Zurich, Zurich, Switzerland Monitoring tree health is essential for detecting early signs of stress, defoliation, and potential mortality, supporting effective forest management, ecosystem conservation, and early warning systems. Advances in deep learning have enabled automated analysis of trees in remote sensing imagery through object detection methods that leverage both spectral and spatial information. However, assessing tree defoliation remains challenging, as subtle differences between defoliation levels make accurate classification difficult. To address this, we propose the hybrid ResNet-Swin Transformer, an object detection architecture built on a Faster R-CNN framework, incorporating a fused ResNet and Swin Transformer backbone with attention-based feature fusion. This design captures rich, multiscale representations by combining convolutional and transformer-based features and progressively refines them through channel-wise attention blocks for robust detection and classification. The architecture was evaluated on a very high-resolution aerial dataset from Switzerland, partially annotated with five classes: Conifer (healthy), Conifer (defoliated), Broadleaf (healthy), Broadleaf (defoliated) and Dead. Comparative experiments with state-of-the-art object detection and classification methods demonstrate that the proposed approach achieves higher accuracy and robustness, highlighting its potential for precise and reliable automated tree health monitoring. 5:00pm - 5:15pm
Fine-grained vegetation segmentation in complex urban park environments using a deeply supervised parallel SegFormer Department of Landscape Architecture, Tianjin University, 300072 Tianjin, China Accurate vegetation mapping in complex urban environments is essential for ecological monitoring, biodiversity assessment, and sustainable park management. However, fine-grained vegetation segmentation remains challenging because of the high diversity of plant species, overlapping canopies, and the interference of artificial objects. To address these challenges, this paper proposes a deeply supervised parallel architecture based on the SegFormer backbone. The model incorporates a SegFormer-ASPP-low-level (SAL) head, which fuses high-level semantic representations, multi-scale contextual information, and low-level spatial details through a parallel decoding mechanism. Two auxiliary heads, a pyramid pooling module (PSP) and a fully convolutional network (FCN), are added to provide deep supervision and improve the recognition of blurred boundaries and rare categories. High-resolution UAV imagery was used to perform fine-grained semantic segmentation of 17 vegetation categories. The dataset included multiple tree species as well as non-tree classes such as Nelumbo sp. (lotus) and dead trees. Experimental results showed that the model achieved a mean intersection over union (mIoU) of 73.57%, outperforming architectures such as SegFormer-b1, DeepLab v3+, ConvNeXt and SCTNet. Visual analysis further demonstrated the model's robustness in complex urban park scenes, showing superior boundary delineation, improved recognition of small and spectrally similar species, and resilience to interference from artificial objects like plastic lawns and landscape lighting. The proposed approach offers valuable insights for precision forestry, ecological monitoring, and intelligent UAV-based remote sensing applications. |
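Several abstracts in this agenda report mean intersection over union (mIoU). A minimal sketch of how that metric is commonly computed from a confusion matrix follows. The function name and the toy matrix are illustrative and not taken from any of the papers, and skipping classes that appear in neither the reference nor the prediction is an assumed convention.

```python
from typing import List, Sequence


def mean_iou(confusion: Sequence[Sequence[int]]) -> float:
    """Mean IoU from a square confusion matrix.

    Rows are reference classes, columns are predicted classes.
    Per class: IoU = TP / (TP + FP + FN).
    """
    n = len(confusion)
    ious: List[float] = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # reference c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, reference otherwise
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both reference and prediction
            ious.append(tp / union)
    if not ious:
        raise ValueError("confusion matrix contains no samples")
    return sum(ious) / len(ious)


# Toy two-class example: per-class IoU is 3/5 and 5/7.
toy_miou = mean_iou([[3, 1], [1, 5]])
```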

