Main track accepted papers (Guangzhou)

56: Multi-View Learning with Context-Guided Receptance for Image Denoising
Authors: Binghong Chen, Tingting Chai, Wei Jiang, Yuanrong Xu, Guanglu Zhou, Xiangqian Wu
Location: Guangzhou | Day: TBD
Image denoising is essential in low-level vision applications such as photography and automated driving. Existing methods struggle with distinguishing complex noise patterns in real-world scenes and consume significant computational resources due to reliance on Transformer-based models. In this work, the Context-guided Receptance Weighted Key-Value (CRWKV) model is proposed, combining enhanced multi-view feature integration with efficient sequence modeling. The Context-guided Token Shift (CTS) mechanism is introduced to effectively capture local spatial dependencies and enhance the model’s ability to represent real-world noise distributions. In addition, the Frequency Mix (FMix) module is designed to extract frequency-domain features and isolate noise in high-frequency spectra, and is integrated with spatial representations through a multi-view learning process. To improve computational efficiency, the Bidirectional WKV (BiWKV) mechanism is adopted, enabling full pixel-sequence interaction with linear complexity while overcoming the causal selection constraints. The model is validated on multiple real-world image denoising datasets, outperforming the state-of-the-art methods quantitatively and reducing inference time by up to 40%. Qualitative results further demonstrate the ability of our model to restore fine details in various scenes. The code is publicly available at https://github.com/Seeker98/CRWKV.
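To make the linear-complexity claim concrete, here is a minimal NumPy sketch of a bidirectional WKV-style recurrence: two decayed scans (causal and anti-causal) are merged so every position attends to the whole sequence in O(T). The decay rate w, the merge rule, and all names are illustrative assumptions, not the authors' exact BiWKV.

```python
# Hedged sketch of a bidirectional WKV-style token mixer (not the paper's
# exact BiWKV): each position attends to all others with exponentially
# decaying weights, computed in O(T) via two directional recurrences.
import numpy as np

def bi_wkv(k, v, w=0.5):
    """k, v: (T, C) key/value sequences; w: assumed positive decay rate."""
    T, C = k.shape
    ek = np.exp(k - k.max())                      # stabilized exponentiated keys
    num_f = np.zeros((T, C)); den_f = np.zeros((T, C))
    num_b = np.zeros((T, C)); den_b = np.zeros((T, C))
    acc_n = np.zeros(C); acc_d = np.zeros(C)
    for t in range(T):                            # forward (causal) scan
        acc_n = np.exp(-w) * acc_n + ek[t] * v[t]
        acc_d = np.exp(-w) * acc_d + ek[t]
        num_f[t], den_f[t] = acc_n, acc_d
    acc_n = np.zeros(C); acc_d = np.zeros(C)
    for t in reversed(range(T)):                  # backward (anti-causal) scan
        acc_n = np.exp(-w) * acc_n + ek[t] * v[t]
        acc_d = np.exp(-w) * acc_d + ek[t]
        num_b[t], den_b[t] = acc_n, acc_d
    # merge both directions; subtract the doubly counted current token
    num = num_f + num_b - ek * v
    den = den_f + den_b - ek
    return num / (den + 1e-8)

out = bi_wkv(np.random.randn(16, 8), np.random.randn(16, 8))
```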
57: Time-Frequency Disentanglement Boosted Pre-Training: A Universal Spatio-Temporal Modeling Framework
Authors: Yudong Zhang, Zhaoyang Sun, Xu Wang, Xuan Yu, Kai Wang, Yang Wang
Location: Guangzhou | Day: TBD
Current spatio-temporal modeling techniques largely rely on abundant data and the design of task-specific models. However, many cities lack well-established digital infrastructures, making data scarcity and the high cost of model development significant barriers to application deployment. Therefore, this work aims to enable spatio-temporal learning to cope with few-shot data modeling and to improve model generalizability. To this end, we propose a Universal Spatio-Temporal Correlationship pre-training framework (USTC) for spatio-temporal modeling across different cities and tasks. To enhance the spatio-temporal representations during pre-training, we propose to decouple the time-frequency patterns within the data and leverage contrastive learning to maintain time-frequency consistency. To further improve adaptability to downstream tasks, we design a prompt generation module to mine personalized spatio-temporal patterns in the target city, which can be integrated with the learned common spatio-temporal representations to collaboratively serve downstream tasks. Extensive experiments conducted on real-world datasets demonstrate that USTC significantly outperforms advanced baselines in forecasting, imputation, and extrapolation across cities.
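As a rough illustration of time-frequency disentanglement, the sketch below splits a series into low- and high-frequency components with an FFT mask; in a framework like USTC, such components could feed separate encoder branches tied together by a contrastive consistency loss. The cutoff fraction and all names are assumptions for illustration only.

```python
# Hedged sketch of time-frequency decoupling (assumed mechanism, not the
# paper's exact module): an FFT mask separates the slow periodic/trend
# component from the high-frequency residual.
import numpy as np

def time_freq_split(x, cutoff=0.1):
    """x: (T,) signal; cutoff: assumed fraction of spectrum kept as 'trend'."""
    spec = np.fft.rfft(x)
    k = max(1, int(len(spec) * cutoff))
    low = spec.copy(); low[k:] = 0                       # low-frequency part
    high = spec - low                                    # high-frequency part
    return np.fft.irfft(low, n=len(x)), np.fft.irfft(high, n=len(x))

t = np.arange(256)
x = np.sin(2 * np.pi * t / 24) + 0.3 * np.random.randn(256)
trend, resid = time_freq_split(x)
assert np.allclose(trend + resid, x, atol=1e-8)          # lossless decomposition
```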
62: Robustness to Spurious Correlations via Dynamic Knowledge Transfer
Authors: Xiaoling Zhou, Wei Ye, Zhemg Lee, Shikun Zhang
Location: Guangzhou | Day: TBD
Spurious correlations pose a significant challenge to the robustness of statistical models, often resulting in unsatisfactory performance when distributional shifts occur between training and testing data. To address this, we propose to transfer knowledge across spuriously correlated categories within the deep feature space. Specifically, samples’ deep features are enriched using semantic vectors extracted from both their respective category distributions and those of their spuriously correlated counterparts, enabling the generation of diverse class-specific factual and counterfactual augmented deep features. We then demonstrate the feasibility of optimizing a surrogate robust loss instead of conducting explicit augmentations by considering an infinite number of augmentations. As spurious correlations between samples and classes evolve during training, we develop a reinforcement learning-based training framework called Dynamic Knowledge Transfer (DKT) to facilitate dynamic adjustments in the direction and intensity of knowledge transfer. Within this framework, a target network is trained using the derived robust loss to enhance robustness, while a strategy network generates sample-wise augmentation strategies in a dynamic and automatic way. Extensive experiments validate the effectiveness of the DKT framework in mitigating spurious correlations, achieving state-of-the-art performance across three typical learning scenarios susceptible to such correlations.
63: MVP-CBM: Multi-layer Visual Preference-enhanced Concept Bottleneck Model for Explainable Medical Image Classification
Authors: Chunjiang Wang, Kun Zhang, Yandong Liu, Zhiyang He, Xiaodong Tao, S. Kevin Zhou
Location: Guangzhou | Day: TBD
The concept bottleneck model (CBM), as a technique improving interpretability via linking predictions to human-understandable concepts, makes high-risk and life-critical medical image classification credible. Typically, existing CBM methods associate the final layer of visual encoders with concepts to explain the model’s predictions. However, we empirically discover the phenomenon of concept preference variation, that is, different concepts are better associated with features at layers other than the final one; a blind last-layer-based association neglects this preference variation and thus weakens the accurate correspondences between features and concepts, impairing model interpretability. To address this issue, we propose a novel Multi-layer Visual Preference-enhanced Concept Bottleneck Model (MVP-CBM), which comprises two key novel modules: (1) intra-layer concept preference modeling, which captures the preferred association of different concepts with features at various visual layers, and (2) multi-layer concept sparse activation fusion, which sparsely aggregates concept activations from multiple layers to enhance performance. Thus, by explicitly modeling concept preferences, MVP-CBM can comprehensively leverage multi-layer visual information to provide a more nuanced and accurate explanation of model decisions. Extensive experiments on several public medical classification benchmarks demonstrate that MVP-CBM achieves state-of-the-art accuracy and interpretability, verifying its superiority. Code is available at https://github.com/wcj6/MVP-CBM.
66: Variational Graph Auto-Encoder Driven Graph Enhancement for Sequential Recommendation
Authors: Yuwen Liu, Lianyong Qi, Xingyuan Mao, Weiming Liu, Shichao Pei, Fan Wang, Xuyun Zhang, Amin Beheshti, Xiaokang Zhou
Location: Guangzhou | Day: TBD
Recommender systems play a critical role in many applications by providing personalized recommendations based on user interactions. However, it remains a major challenge to capture complex sequential patterns and address noise in user interaction data. While advanced neural networks have enhanced sequential recommendation by modeling high-order item dependencies, they typically treat noisy interaction data as the user’s true preferences. This assumption can lead to suboptimal recommendation results. We propose a Variational Graph Auto-Encoder driven Graph Enhancement (VGAE-GE) method for robust augmentation in sequential recommendation. Specifically, our method first constructs an item transition graph to capture higher-order interactions and employs a Variational Graph Auto-Encoder (VGAE) to generate latent variable distributions. By utilizing these latent variable distributions for graph reconstruction, we can improve the item representation. Next, we use a Graph Convolutional Network (GCN) to transform these latent variables into embeddings and infer more robust user representations from the updated item embeddings. Finally, we obtain the reconstructed user check-in data, and then use a Mamba-based recommender to make the recommendation process more efficient and the recommendation results more accurate. Extensive experiments on five public datasets demonstrate that our VGAE-GE model improves recommendation performance and robustness.
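For reference, here is a minimal sketch of the standard VGAE building block (the Kipf & Welling formulation, which the graph-enhancement step presumably builds on): a GCN encoder yields per-node mu/logvar, and an inner-product decoder reconstructs the adjacency. Layer sizes and names are illustrative, not the paper's architecture.

```python
# Minimal VGAE sketch: GCN encoder + reparameterization + inner-product decoder.
import torch
import torch.nn as nn

class VGAE(nn.Module):
    def __init__(self, in_dim, hid_dim, lat_dim):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin_mu = nn.Linear(hid_dim, lat_dim)
        self.lin_logvar = nn.Linear(hid_dim, lat_dim)

    def forward(self, x, a_norm):
        # a_norm: symmetrically normalized adjacency matrix (N, N)
        h = torch.relu(a_norm @ self.lin1(x))                  # one GCN layer
        mu = a_norm @ self.lin_mu(h)
        logvar = a_norm @ self.lin_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        adj_rec = torch.sigmoid(z @ z.t())                     # inner-product decoder
        return adj_rec, mu, logvar

def vgae_loss(adj_rec, adj, mu, logvar):
    rec = nn.functional.binary_cross_entropy(adj_rec, adj)    # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```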
70: Adversarial Propensity Weighting for Debiasing in Collaborative Filtering
Authors: Kuiyu Zhu, Tao Qin, Pinghui Wang, Xin Wang
Location: Guangzhou | Day: TBD
Debiased recommendation focuses on alleviating the negative impact of various biases on recommendation quality to achieve fairer personalized recommendations. Current research mainly relies on propensity score estimation or causal inference methods to alleviate selection bias; at the same time, research on popularity bias has proposed a variety of methods based on causal graphs and contrastive learning. However, these methods have shortcomings in dealing with unstable propensity score estimates, bias interactions, and the decoupling of interest and bias signals, which limits the performance improvement of recommender systems. To this end, this paper proposes APWCF, a collaborative filtering debiasing method that combines dynamic propensity modeling and adversarial learning. APWCF solves the problem of high variance in propensity scores through a dynamic propensity factor, and decouples user interests and bias signals through adversarial learning to effectively remove multiple biases. Experiments show that APWCF significantly outperforms existing methods across various benchmark datasets from different domains. Compared with the strongest baseline, PDA, Recall@10 and NDCG@10 improve by 0.10%-5.42% and 1.01%-8.60%, respectively.
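The propensity-weighting backbone can be summarized by the classical clipped IPS estimator sketched below; the paper's dynamic propensity factor and adversarial decoupling are not reproduced, and the clipping threshold is an assumed knob for variance control.

```python
# Sketch of clipped inverse-propensity-scored (IPS) training. Clipping the
# propensities away from zero bounds the weights, trading a little bias
# for much lower variance.
import numpy as np

def ips_loss(errors, propensities, clip=0.1):
    """errors: per-interaction losses; propensities: estimated P(observed)."""
    w = 1.0 / np.clip(propensities, clip, 1.0)   # clipped inverse propensities
    return np.mean(w * errors)

loss = ips_loss(np.random.rand(100), np.random.uniform(0.05, 1.0, 100))
```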
71: GSDNet: Revisiting Incomplete Multimodality-Diffusion Emotion Recognition from the Perspective of Graph Spectrum
Authors: Yuntao Shou, Jun Yao, Tao Meng, Wei Ai, Cen Chen, Keqin Li
Location: Guangzhou | Day: TBD
Multimodal Emotion Recognition (MER) combines technologies from multiple fields (e.g., computer vision, natural language processing, and audio signal processing), aiming to infer an individual’s emotional state by analyzing information from different sources (i.e., video, audio, and text). Compared with a single modality, by fusing complementary semantic information from different modalities, the model can obtain a more robust knowledge representation. However, the modality missing problem limits the performance of MER in practical scenarios. Recent work has achieved impressive performance on modality completion using graph neural networks and diffusion models. This inspires us to combine these two lines of work in the completion network to obtain more powerful representation capabilities. However, we argue that directly running a full-rank score-based diffusion model on the entire graph adjacency matrix space may adversely affect the learning process of the diffusion model. This is because the model assumes a direct relationship between each pair of nodes and ignores local structural features and sparse connections between nodes, thereby significantly reducing the quality of the generated data. Based on the above ideas, we propose a novel Graph Spectral Diffusion Network (GSDNet), which utilizes a low-rank score-based diffusion model to map Gaussian noise to the graph spectral distribution space of missing modalities and recover the missing data according to its original distribution. Extensive experiments have demonstrated that GSDNet achieves state-of-the-art emotion recognition performance in various modality-missing scenarios.
73: A Methodological Framework for Measuring Spatial Labeling Similarity
Authors: Yihang Du, Jiaying Hu, Suyang Hou, Yueyang Ding, Xiaobo Sun
Location: Guangzhou | Day: TBD
Spatial labeling assigns labels to specific spatial locations to characterize their spatial properties and relationships, with broad applications in scientific research and practice. Measuring the similarity between two spatial labelings is essential for understanding their differences and the contributing factors, such as changes in location properties or labeling methods. An adequate and unbiased measurement of spatial labeling similarity should consider the number of matched labels (label agreement), the topology of spatial label distribution, and the heterogeneous impacts of mismatched labels. However, existing methods often fail to account for all these aspects. To address this gap, we propose a methodological framework to guide the development of methods that meet these requirements.
Given two spatial labelings, the framework transforms them into graphs based on location organization, labels, and attributes (e.g., location significance). The distributions of their graph attributes are then extracted, enabling an efficient computation of distributional discrepancy to reflect the dissimilarity level between the two labelings. We further provide a concrete implementation of this framework, termed Spatial Labeling Analogy Metric (SLAM), along with an analysis of its theoretical foundation, for evaluating spatial labeling results in spatial transcriptomics (ST) as per their similarity with ground truth labeling. Through a series of carefully designed experimental cases involving both simulated and real ST data, we demonstrate that SLAM provides a comprehensive and accurate reflection of labeling quality compared to other well-established evaluation metrics. Our code is available at https://github.com/YihDu/SLAM.
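A toy rendering of the framework's recipe follows, with every design choice (k-NN graph, neighborhood label purity as the graph attribute, Wasserstein distance as the discrepancy) assumed for illustration rather than taken from SLAM itself.

```python
# Illustrative sketch: turn each spatial labeling into a k-NN graph over
# locations, record a per-node attribute (fraction of neighbors sharing
# the node's label), then compare the two attribute distributions.
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import wasserstein_distance

def edge_agreement(coords, labels, k=6):
    tree = cKDTree(coords)
    _, idx = tree.query(coords, k=k + 1)         # self + k nearest neighbors
    agree = (labels[idx[:, 1:]] == labels[:, None]).mean(axis=1)
    return agree                                  # per-node neighborhood purity

def labeling_dissimilarity(coords, labels_a, labels_b, k=6):
    return wasserstein_distance(edge_agreement(coords, labels_a, k),
                                edge_agreement(coords, labels_b, k))

coords = np.random.rand(200, 2)
la = np.random.randint(0, 4, 200)
print(labeling_dissimilarity(coords, la, la))    # identical labelings -> 0.0
```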
79: Dual-Perspective United Transformer for Object Segmentation in Optical Remote Sensing Images
Authors: Yanguang Sun, Jiexi Yan, Jianjun Qian, Chunyan Xu, Jian Yang, Lei Luo
Location: Guangzhou | Day: TBD
Automatically segmenting objects from optical remote sensing images (ORSIs) is an important task. Most existing models are primarily based on either convolutional or Transformer features, each offering distinct advantages. Exploiting both advantages is valuable research, but it presents several challenges, including the heterogeneity between the two types of features, high model complexity, and large parameter counts. However, these issues are often overlooked by existing ORSI methods, causing sub-optimal segmentation. To this end, we propose a novel Dual-Perspective United Transformer (DPU-Former) with a unique structure designed to simultaneously integrate long-range dependencies and spatial details. In particular, we design a global-local mixed attention, which captures diverse information through two perspectives and introduces a Fourier-space merging strategy to obviate deviations for efficient fusion. Furthermore, we present a gated linear feed-forward network to increase the expressive ability. Additionally, we construct a DPU-Former decoder to aggregate and strengthen features at different layers. Consequently, the DPU-Former model outperforms the state-of-the-art methods on multiple datasets. Code: https://github.com/CSYSI/DPU-Former.
80: Rethinking Contrastive Learning in Graph Anomaly Detection: A Clean-View Perspective
Authors: Di Jin, Jingyi Cao, Xiaobao Wang, Bingdao Feng, Dongxiao He, Longbiao Wang, Jianwu Dang
Location: Guangzhou | Day: TBD
Graph anomaly detection aims to identify unusual patterns in graph-based data, with wide applications in fields such as web security and financial fraud detection. Existing methods typically rely on contrastive learning, assuming that a lower similarity between a node and its local subgraph indicates abnormality. However, these approaches overlook a crucial limitation: the presence of interfering edges invalidates this assumption, since it introduces disruptive noise that compromises the contrastive learning process. Consequently, this limitation impairs the ability to effectively learn meaningful representations of normal patterns, leading to suboptimal detection performance. To address this issue, we propose a Clean-View Enhanced Graph Anomaly Detection framework (CVGAD), which includes a multi-scale anomaly awareness module to identify key sources of interference in the contrastive learning process. Moreover, to mitigate bias from the one-step edge removal process, we introduce a novel progressive purification module. This module incrementally refines the graph by iteratively identifying and removing interfering edges, thereby enhancing model performance. Extensive experiments on five benchmark datasets validate the effectiveness of our approach.
86: A Dynamic Knowledge Update-Driven Model with Large Language Models for Fake News Detection
Authors: Di Jin, Jun Yang, Xiaobao Wang, Junwei Zhang, Shuqi Li, Dongxiao He
Location: Guangzhou | Day: TBD
As the Internet and social media evolve rapidly, distinguishing credible news from a vast amount of complex information poses a significant challenge. Due to the suddenness and instability of news events, the authenticity labels of news can potentially shift as events develop, making it crucial for fake news detection to obtain the latest event updates. Existing methods employ retrieval-augmented generation to fill knowledge gaps, but they suffer from issues such as insufficient credibility of retrieved content and interference from noisy information. We propose a dynamic knowledge update-driven model for fake news detection (DYNAMO), which leverages knowledge graphs to continuously update new knowledge and integrates large language models to fulfill dual functions, news authenticity detection and verification of new knowledge, thereby addressing the two key problems of ensuring the authenticity of new knowledge and deeply mining news semantics. Specifically, we first construct a news-domain-specific knowledge graph. Then, we use Monte Carlo Tree Search to decompose complex news and verify them step by step. Finally, we extract and update new knowledge from verified real news texts and reasoning paths. Experimental results demonstrate that DYNAMO achieves the best performance on two real-world datasets.
99: Multi-Modal Point Cloud Completion with Interleaved Attention Enhanced Transformer
Authors: Chenghao Fang, Jianqing Liang, Jiye Liang, Hangkun Wang, Kaixuan Yao, Feilong Cao
Location: Guangzhou | Day: TBD
Multi-modal point cloud completion, which utilizes a complete image and a partial point cloud as input, is a crucial task in 3D computer vision. Previous methods commonly employ a cross-attention mechanism to fuse point clouds and images. However, these approaches often fail to fully leverage image information and overlook the intrinsic geometric details of point clouds that could complement the image modality. To address these challenges, we propose an interleaved attention enhanced Transformer (IAET) with three main components, i.e., token embedding, bidirectional token supplement, and coarse-to-fine decoding. IAET incorporates a novel interleaved attention mechanism to enable bidirectional information supplementation between the point cloud and image modalities. Additionally, to maximize the use of the supplemented image information, we introduce a view-guided upsampling module that leverages image tokens as queries to guide the generation of detailed point cloud structures. Extensive experiments demonstrate the effectiveness of IAET, highlighting its state-of-the-art performance on multi-modal point cloud completion benchmarks in various scenarios. The source code is freely accessible at https://github.com/doldolOuO/IAET.
101: ActiveHAI: Active Collection Based Human-AI Diagnosis with Limited Expert Predictions
Authors: Xuehan Zhao, Jiaqi Liu, Xin Zhang, Zhiwen Yu, Bin Guo
Location: Guangzhou | Day: TBD
Recent studies indicate that human-AI collaboration performs better than either alone, particularly in medical diagnosis. Beyond collaboration methods that focus on assigning tasks to humans or AI, like deferral, combining human and AI decisions with their confidence scores is emerging as a promising strategy. Due to high cognitive load, doctors often struggle to provide confidence assessments, necessitating explicit human uncertainty evaluation through a limited number of additional expert predictions. This raises two challenges: (1) how to actively collect limited yet representative expert predictions, and (2) how to accurately evaluate human uncertainty with limited expert predictions. To address these challenges, we propose ActiveHAI, an active human-AI diagnosis method that reduces expert costs through a median-window sampling strategy, which actively selects representative samples near the estimated median, and evaluates expert confidence through an evaluator module that integrates sample features and expert predictions, converting them into probability distributions. Experiments on three real-world datasets show that ActiveHAI surpasses doctors alone and other human-AI methods by 16.3% and 3.6% in accuracy, respectively. Furthermore, ActiveHAI reaches 97.2% relative accuracy, even with just eight expert predictions per class.
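A guess at the shape of median-window sampling, under the assumption that "near the estimated median" refers to model confidence scores; the function name and selection rule below are hypothetical, not the paper's definition.

```python
# Hypothetical median-window sampler: pick the unlabeled samples whose
# scores fall closest to the estimated median, so the few purchased
# expert predictions are representative of typical cases.
import numpy as np

def median_window_sample(scores, budget):
    med = np.median(scores)
    order = np.argsort(np.abs(scores - med))   # nearest to the median first
    return order[:budget]                       # indices to send to experts

picked = median_window_sample(np.random.rand(1000), budget=8)
```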
113: INFP: INdustrial Video Anomaly Detection via Frequency Prioritization
Authors: Qianzi Yu, Kai Zhu, Yang Cao, Yu Kang
Location: Guangzhou | Day: TBD
Industrial video anomaly detection aims to perform real-time analysis of video streams from industrial production lines and provide anomaly alerts. Conventional video anomaly detection methods focus more on the overall image, as they aim to identify anomalies among multiple normal samples appearing simultaneously. However, industrial scenarios, where the primary focus is on a single type of product, require attention to local areas to capture fine-grained details and specific patterns. Directly applying conventional methods to industrial scenarios can result in an inability to focus on products moving along fixed trajectories, ineffective utilization of their equidistant periodicity, and greater susceptibility to lighting variations. To address these issues, we propose FreqNet, an encoder-decoder framework that learns frequency-domain features from videos to capture periodic and dynamic characteristics, enhancing the model’s robustness. Specifically, a trajectory filter is proposed that takes advantage of the significant difference between moving objects and static backgrounds in the frequency domain by assigning higher weights to fixed moving trajectories. Moreover, a multi-feature fusion module is proposed, in which the frequency-domain features of the video are first extracted to leverage the unique equidistant periodicity of videos from industrial production lines. The extracted frequency-domain features are subsequently fused with spatio-temporal features, and contextual information is further integrated from the fused representation, effectively mitigating the impact of lighting variations on production lines. Extensive experiments on the benchmark IPAD dataset demonstrate the superiority of our proposed method over the state-of-the-art.
125: Optimizing Personalized Federated Learning Through Adaptive Layer-Wise Learning
Authors: Weihang Chen, Cheng Yang, Jie Ren, Zhiqiang Li, Zheng Wang
Location: Guangzhou | Day: TBD
Real-life deployment of federated learning (FL) often faces non-IID data, which leads to poor accuracy and slow convergence. Personalized FL (pFL) tackles these issues by tailoring local models to individual data sources and using weighted aggregation methods for client-specific learning. However, existing pFL methods often fail to provide each local model with global knowledge on demand while maintaining low computational overhead. Additionally, local models tend to over-personalize their data during the training process, potentially dropping previously acquired global information. We propose FLAYER, a novel layer-wise learning method for pFL that optimizes local model personalization performance. FLAYER considers the different roles and learning abilities of the neural network layers of individual local models. It incorporates global information for each local model as needed to initialize the local model cost-effectively. It then dynamically adjusts learning rates for each layer during local training, optimizing the personalized learning process for each local model while preserving global knowledge. Additionally, to enhance global representation in pFL, FLAYER selectively uploads parameters for global aggregation in a layer-wise manner. We evaluate FLAYER on four representative datasets in the computer vision and natural language processing domains. Compared to eight state-of-the-art pFL methods, FLAYER improves the inference accuracy, on average, by 5.20% (up to 14.29%). Code is available at https://github.com/lancasterJie/FLAYER/.
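Layer-wise learning rates are straightforward to express with optimizer parameter groups, as in the hedged PyTorch sketch below; the depth-based scaling rule is a stand-in, since FLAYER adjusts rates dynamically during local training rather than by a fixed schedule.

```python
# Sketch of per-layer learning rates via optimizer parameter groups.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
groups = []
for depth, (name, param) in enumerate(model.named_parameters()):
    # hypothetical rule: deeper (more task-specific) parameters learn faster
    groups.append({"params": [param], "lr": 1e-3 * (1 + 0.5 * depth)})
optimizer = torch.optim.SGD(groups)   # every group carries its own lr
```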
128: Pre-defined Keypoints Promote Category-level Articulation Pose Estimation via Multi-Modal Alignment
Authors: Wenbo Xu, Li Zhang, Liu Liu, Yan Zhong, Haonan Jiang, Xue Wang, Rujing Wang
Location: Guangzhou | Day: TBD
Articulations are essential in everyday interactions, yet traditional RGB-based pose estimation methods often struggle with issues such as lighting variations and shadows. To overcome these challenges, we propose a novel Pre-defined keypoint based framework for category-level articulation pose estimation via multi-modal Alignment, coined PAGE. Specifically, we first propose a customized keypoint estimation method, aiming to avoid the divergent distance pattern between heuristically generated keypoints and visible points. In addition, to reduce the mutual information redundancy between point clouds and RGB images, we design the geometry-color alignment, which fuses the features after aligning two modalities. This is followed by decoding the radius for each visible point, and applying our proposal integration scoring strategy to predict keypoints. Ultimately, the framework outputs the per-part 6D pose of the articulation. We conduct extensive experiments to evaluate PAGE across a variety of datasets, from synthetic to real-world scenarios, demonstrating its robustness and superior performance.
149: Robust Misinformation Detection by Visiting Potential Commonsense Conflict
Authors: Bing Wang, Ximing Li, Changchun Li, Bingrui Zhao, Bo Fu, Renchu Guan, Shengsheng Wang
Location: Guangzhou | Day: TBD
The development of Internet technology has led to an increased prevalence of misinformation, causing severe negative effects across diverse domains. To mitigate this challenge, Misinformation Detection (MD), aiming to detect online misinformation automatically, emerges as a rapidly growing research topic in the community. In this paper, we propose a novel plug-and-play augmentation method for the MD task, namely Misinformation Detection with Potential Commonsense Conflict (MD-PCC). We take inspiration from prior studies indicating that fake articles are more likely to involve commonsense conflict. Accordingly, we construct commonsense expressions for articles, which express potential commonsense conflicts inferred from the difference between commonsense triplets extracted from an article and golden ones generated by the well-established commonsense reasoning tool COMET. These expressions are then appended to each article as augmentation, and any specific MD method can be trained on the commonsense-augmented articles. Besides, we also collect a novel commonsense-oriented dataset named CoMis, in which all fake articles are caused by commonsense conflict. We integrate MD-PCC with various existing MD backbones and compare them across both 4 public benchmark datasets and CoMis. Empirical results demonstrate that MD-PCC consistently outperforms the existing MD baselines.
164: An Out-Of-Distribution Membership Inference Attack Approach for Cross-Domain Graph Attacks
Authors: Jinyan Wang, Liu Yang, Yuecen Wei, Jiaxuan Si, Chenhao Guo, Qingyun Sun, Xianxian Li, Xingcheng Fu
Location: Guangzhou | Day: TBD
Graph Neural Network-based methods face privacy leakage risks due to the introduction of topological structures about the targets, which allows attackers to bypass the target’s prior knowledge of the sensitive attributes and realize membership inference attacks (MIA) by observing and analyzing the topology distribution. As privacy concerns grow, the assumption of MIA, which presumes that attackers can obtain an auxiliary dataset with the same distribution, is increasingly deviating from reality. In this paper, we categorize the distribution diversity issue in real-world MIA scenarios as an Out-Of-Distribution (OOD) problem, and propose a novel Graph OOD Membership Inference Attack (GOOD-MIA) to achieve cross-domain graph attacks. Specifically, we construct shadow subgraphs with distributions from different domains to model the diversity of real-world data. We then explore the stable node representations that remain unchanged under external influences and consider eliminating redundant information from confounding environments and extracting task-relevant key information to more clearly distinguish between the characteristics of training data and unseen data. This OOD-based design makes cross-domain graph attacks possible. Finally, we perform risk extrapolation to optimize the attack’s domain adaptability during attack inference to generalize the attack to other domains. Experimental results demonstrate that GOOD-MIA achieves superior attack performance in datasets designed for multiple domains.
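The risk-extrapolation step the abstract mentions is commonly instantiated as V-REx (Krueger et al., 2021), which penalizes the variance of per-domain risks; a generic torch version is sketched below, assuming one mean loss per shadow-subgraph domain, without reproducing GOOD-MIA's attack model.

```python
# V-REx-style risk extrapolation: low variance across domain risks
# encourages a model that generalizes to unseen domains.
import torch

def vrex_loss(domain_risks, beta=10.0):
    """domain_risks: 1-D tensor of mean losses, one entry per domain."""
    return domain_risks.mean() + beta * domain_risks.var()

loss = vrex_loss(torch.tensor([0.42, 0.55, 0.47]))
```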
186: SOTA: Spike-Navigated Optimal TrAnsport Saliency Region Detection in Composite-bias Videos
Authors: Wenxuan Liu, Yao Deng, Kang Chen, Xian Zhong, Zhaofei Yu, Tiejun Huang
Location: Guangzhou | Day: TBD
Existing saliency detection methods struggle in real-world scenarios due to motion blur and occlusions. In contrast, spike cameras, with their high temporal resolution, significantly enhance visual saliency maps. However, the composite noise inherent to spike camera imaging introduces discontinuities in saliency detection. Low-quality samples further distort model predictions, leading to saliency bias. To address these challenges, we propose Spike-navigated Optimal TrAnsport Saliency Region Detection (SOTA), a framework that leverages the strengths of spike cameras while mitigating biases in both spatial and temporal dimensions. Our method introduces Spike-based Micro-debias (SM) to capture subtle frame-to-frame variations and preserve critical details, even under minimal scene or lighting changes. Additionally, Spike-based Global-debias (SG) refines predictions by reducing inconsistencies across diverse conditions. Extensive experiments on real and synthetic datasets demonstrate that SOTA outperforms existing methods by eliminating composite noise bias. Our code and dataset will be released at https://github.com/lwxfight/sota.
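For readers unfamiliar with the optimal-transport machinery, a standard entropic Sinkhorn solver is sketched below; how SOTA actually couples spike features to saliency predictions is not reproduced here.

```python
# Entropic optimal transport via Sinkhorn iterations: alternately rescale
# rows and columns of the Gibbs kernel until both marginals match.
import numpy as np

def sinkhorn(cost, a, b, eps=0.05, iters=200):
    """cost: (n, m) cost matrix; a, b: source/target marginal distributions."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan with marginals a, b

n, m = 5, 7
plan = sinkhorn(np.random.rand(n, m), np.ones(n) / n, np.ones(m) / m)
```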
190: Constructive Conflict-Driven Multi-Agent Reinforcement Learning for Strategic Diversity
Authors: Yuxiang Mai, Qiyue Yin, Wancheng Ni, Pei Xu, Kaiqi Huang
Location: Guangzhou | Day: TBD
In recent years, diversity has emerged as a useful mechanism to enhance the efficiency of multi-agent reinforcement learning (MARL). However, existing methods predominantly focus on designing policies based on individual agent characteristics, often neglecting the interplay and mutual influence among agents during policy formation. To address this gap, we propose Competitive Diversity through Constructive Conflict (CoDiCon), a novel approach that incorporates competitive incentives into cooperative scenarios to encourage policy exchange and foster strategic diversity among agents. Drawing inspiration from sociological research, which highlights the benefits of moderate competition and constructive conflict in group decision-making, we design an intrinsic reward mechanism using ranking features to introduce competitive motivations. A centralized intrinsic reward module generates and distributes varying reward values to agents, ensuring an effective balance between competition and cooperation. By optimizing the parameterized centralized reward module to maximize environmental rewards, we reformulate the constrained bilevel optimization problem to align with the original task objectives. We evaluate our algorithm against state-of-the-art methods in the SMAC and GRF environments. Experimental results demonstrate that CoDiCon achieves superior performance, with competitive intrinsic rewards effectively promoting diverse and adaptive strategies among cooperative agents.
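A toy version of a ranking-based competitive intrinsic reward: agents are ranked by a performance statistic each step and receive a bonus proportional to rank. CoDiCon learns this mapping with a centralized, parameterized module; the fixed linear rule below is only an assumed placeholder.

```python
# Hypothetical rank-proportional intrinsic reward for competitive diversity.
import numpy as np

def ranking_intrinsic_reward(agent_scores, scale=0.1):
    ranks = np.argsort(np.argsort(agent_scores))      # 0 = worst, n-1 = best
    return scale * ranks / max(len(agent_scores) - 1, 1)

r_int = ranking_intrinsic_reward(np.array([0.3, 0.9, 0.5]))  # [0., 0.1, 0.05]
```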
196: Logic Distillation: Learning from Code Function by Function for Decision-making Tasks
Authors: Dong Chen, Shilin Zhang, Fei Gao, Yueting Zhuang, Siliang Tang, Qidong Liu, Mingliang Xu
Location: Guangzhou | Day: TBD
Large language models (LLMs) have garnered increasing attention owing to their powerful comprehension and generation capabilities. Generally, larger LLMs (L-LLMs) that require paid interfaces exhibit significantly superior performance compared to smaller LLMs (S-LLMs) that can be deployed on a variety of devices. Knowledge distillation (KD) aims to empower S-LLMs with the capabilities of L-LLMs, but S-LLMs merely mimic the outputs of L-LLMs, failing to acquire the decision-making capability needed for new situations. Consequently, S-LLMs are helpless when it comes to continuous decision-making tasks that require logical reasoning. To tackle the identified challenges, we propose a novel framework called Logic Distillation (LD). Initially, LD employs L-LLMs to instantiate complex instructions into discrete functions and illustrates their usage to establish a function base. Subsequently, LD fine-tunes S-LLMs based on the function base to learn the logic employed by L-LLMs in decision-making. During testing, S-LLMs yield decision-making outcomes, function by function, based on current states. Experiments demonstrate that with the assistance of LD, S-LLMs can achieve outstanding results in continuous decision-making tasks, comparable to, or even surpassing, those of L-LLMs. The code and data for the proposed method are provided for research purposes at https://github.com/Anfeather/Logic-Distillation.
200: MTGIB-UNet: A Multi-Task Graph Information Bottleneck and Uncertainty Weighted Network for ADMET Prediction
Authors: Xuqiang Li, Wenjie Du, Jun Xia, Jianmin Wang, Xiaoqi Wang, Yang Yang, Yang Wang
Location: Guangzhou | Day: TBD
Accurate prediction of ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties is crucial in drug development, as these properties directly impact a drug’s efficacy and safety. However, existing multi-task learning models often face challenges related to noise interference and task conflicts when dealing with complex molecular structures. To address these issues, we propose a novel multi-task Graph Neural Network (GNN) model, MTGIB-UNet. The model begins by encoding molecular graphs to capture intricate molecular structure information. Subsequently, based on the Graph Information Bottleneck (GIB) principle, the model compresses the information flow by extracting subgraphs, retaining task-relevant features while removing noise for each task. These embeddings are then fused through a gated network that dynamically adjusts the contribution weights of auxiliary tasks to the primary task. Specifically, an uncertainty weighting (UW) strategy is applied, with additional emphasis placed on the primary task, allowing dynamic adjustment of task weights while strengthening the influence of the primary task on model training. Experiments on standard ADMET datasets demonstrate that our model outperforms existing methods. Additionally, the model shows good interpretability by identifying key molecular substructures related to specific ADMET endpoints.
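The UW strategy is usually instantiated as homoscedastic uncertainty weighting (Kendall et al., 2018); the sketch below adds the abstract's "additional emphasis on the primary task" as an assumed fixed multiplier, a simplification of whatever the model actually learns.

```python
# Uncertainty-weighted multi-task loss: each task i gets a learnable
# log-variance; the 0.5*log_var term prevents the trivial solution of
# inflating all variances.
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    def __init__(self, n_tasks, primary=0, primary_boost=2.0):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))
        self.primary, self.boost = primary, primary_boost   # assumed emphasis

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            term = 0.5 * torch.exp(-self.log_vars[i]) * loss + 0.5 * self.log_vars[i]
            total = total + (self.boost * term if i == self.primary else term)
        return total

uw = UncertaintyWeighting(n_tasks=3)
total = uw([torch.tensor(0.8), torch.tensor(0.3), torch.tensor(0.5)])
```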
206: MTPNet: Multi-Grained Target Perception for Unified Activity Cliff Prediction
Authors: Zishan Shu, Yufan Deng, Hongyu Zhang, Zhiwei Nie, Jie Chen
Location: Guangzhou | Day: TBD
Activity cliff prediction is a critical task in drug discovery and material design. Existing computational methods are limited to handling single binding targets, which restricts the applicability of these prediction models. In this paper, we present the Multi-Grained Target Perception network (MTPNet) to incorporate prior knowledge of interactions between molecules and their target proteins. Specifically, MTPNet is a unified framework for activity cliff prediction, which consists of two components: Macro-level Target Semantic (MTS) guidance and Micro-level Pocket Semantic (MPS) guidance. In this way, MTPNet dynamically optimizes molecular representations through multi-grained protein semantic conditions. To our knowledge, this is the first work to employ receptor proteins as guiding information to effectively capture critical interaction details. Extensive experiments on 30 representative activity cliff datasets demonstrate that MTPNet significantly outperforms previous approaches, achieving an average RMSE improvement of 18.95% on top of several mainstream GNN architectures. Overall, MTPNet internalizes interaction patterns through conditional deep learning to achieve unified predictions of activity cliffs, helping to accelerate compound optimization and design. Codes are available at: https://github.com/ZishanShu/MTPNet.
216: Point Cloud Mixture-of-Domain-Experts Model for 3D Self-supervised Learning
Authors: Yaohua Zha, Tao Dai, Hang Guo, Yanzi Wang, Bin Chen, Ke Chen, Shu-Tao Xia
Location: Guangzhou | Day: TBD
Point clouds, as a primary representation of 3D data, can be categorized into scene domain point clouds and object domain point clouds. Point cloud self-supervised learning (SSL) has become a mainstream paradigm for learning 3D representations. However, existing point cloud SSL primarily focuses on learning domain-specific 3D representations within a single domain, neglecting the complementary nature of cross-domain knowledge, which limits the learning of 3D representations. In this paper, we propose to learn a comprehensive Point cloud Mixture-of-Domain-Experts model (Point-MoDE) via a block-to-scene pre-training strategy. Specifically, we first propose a mixture-of-domain-experts model consisting of scene domain experts and multiple shared object domain experts. Furthermore, we propose a block-to-scene pre-training strategy, which leverages the features of point blocks in the object domain to regress their initial positions in the scene domain through object-level block mask reconstruction and scene-level block position regression. By integrating the complementary knowledge between object and scene, this strategy simultaneously facilitates the learning of both object-domain and scene-domain representations, leading to a more comprehensive 3D representation. Extensive experiments in downstream tasks demonstrate the superiority of our model.
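As a generic illustration of the mixture-of-domain-experts idea, here is a soft-routing MoE layer in PyTorch; Point-MoDE's actual scene/object expert layout and routing are more structured, so treat this purely as a sketch.

```python
# Generic soft-routed mixture-of-experts layer: a gate weighs each
# expert's output per token.
import torch
import torch.nn as nn

class MixtureOfDomainExperts(nn.Module):
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, x):                                    # x: (N, dim) tokens
        weights = torch.softmax(self.gate(x), dim=-1)        # (N, E) routing
        out = torch.stack([e(x) for e in self.experts], dim=-1)  # (N, dim, E)
        return (out * weights.unsqueeze(1)).sum(-1)          # weighted mixture

layer = MixtureOfDomainExperts(dim=64)
y = layer(torch.randn(128, 64))
```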
220: Expanding the Category of Classifiers with LLM Supervision
Authors: Derui Lyu, Xiangyu Wang, Taiyu Ban, Lyuzhou Chen, Xiren Zhou, Huanhuan Chen
Location: Guangzhou | Day: TBD
Zero-shot learning has shown significant potential for creating cost-effective and flexible systems to expand classifiers to new categories. However, existing methods still rely on manually created attributes designed by domain experts. Motivated by the widespread success of large language models (LLMs), we introduce an LLM-driven framework for class-incremental learning that removes the need for human intervention, termed Classifier Expansion with Multi-vIew LLM knowledge (CEMIL). In CEMIL, an LLM agent autonomously generates detailed textual multi-view descriptions for unseen classes, offering richer and more flexible class representations than traditional expert-constructed vectorized attributes. These LLM-derived textual descriptions are integrated through a contextual filtering attention mechanism to produce discriminative class embeddings. Subsequently, a weight injection module maps the class embeddings to classifier weights, enabling seamless expansion to new classes. Experimental results show that CEMIL outperforms existing methods using expert-constructed attributes, demonstrating its effectiveness for fully automated classifier expansion without human participation.
239: Towards Regularized Mixture of Predictions for Class-Imbalanced Semi-Supervised Facial Expression Recognition
Authors: Hangyu Li, Yixin Zhang, Jiangchao Yao, Nannan Wang, Bo Han
Location: Guangzhou | Day: TBD
Semi-supervised facial expression recognition (SSFER) effectively assigns pseudo-labels to confident unlabeled samples when only limited emotional annotations are available. Existing SSFER methods are typically built upon an assumption of the class-balanced distribution. However, they are far from real-world applications due to biased pseudo-labels caused by class imbalance. To alleviate this issue, we propose Regularized Mixture of Predictions (ReMoP), a simple yet effective method to generate high-quality pseudo-labels for imbalanced samples. Specifically, we first integrate feature similarity into the linear prediction to learn a mixture of predictions. Furthermore, we introduce a class regularization term that constrains the feature geometry to mitigate imbalance bias. Being practically simple, our method can be integrated with existing semi-supervised learning and SSFER methods to tackle the challenge associated with class-imbalanced SSFER effectively. Extensive experiments on four facial expression datasets demonstrate the effectiveness of the proposed method across various imbalanced conditions. The source code is made publicly available at https://github.com/hangyu94/ReMoP.
240: Template3D-AD: Point Cloud Template Matching Method Based on Center Points for 3D Anomaly Detection
Authors: Yi Liu, Changsheng Zhang, Yufei Yang
Location: Guangzhou | Day: TBD
Existing 3D anomaly detection methods mainly include reconstruction-based methods and memory-based methods. However, reconstruction-based methods rely on anomaly simulation strategies, while the memory bank of memory-based methods cannot cover the features of all points. Different from existing methods, this paper proposes Template3D-AD, a 3D anomaly detection method based on template matching. Template3D-AD matches the test sample with the template based on center points, and extracts global features and local features for each center point. Considering that the appearance of anomalies is related to changes in surface shape, this paper proposes a curvature-based local feature representation method, which increases the feature difference between abnormal and normal surfaces. Then, this paper designs a global-local detection strategy, which combines global feature differences and local feature differences for anomaly detection. Extensive experiments show that Template3D-AD outperforms the state-of-the-art methods, achieving 84.4% (1.5% ↑) I-AUROC on the Real3D-AD dataset and 86.5% (11.6% ↑) I-AUROC on the Anomaly-ShapeNet dataset. Code is available at https://github.com/CaedmonLY/Template3D-AD.
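Local PCA gives one common way to build the kind of curvature-based descriptor the abstract describes: the smallest eigenvalue's share of the neighborhood covariance spectrum approximates surface variation (flat patches near 0, sharp ones near 1/3). The neighborhood size and the exact statistic are assumptions, not Template3D-AD's definition.

```python
# Surface-variation (curvature proxy) estimation by local PCA.
import numpy as np
from scipy.spatial import cKDTree

def surface_variation(points, k=16):
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)               # k nearest neighbors (incl. self)
    curv = np.empty(len(points))
    for i, nbrs in enumerate(idx):
        cov = np.cov(points[nbrs].T)               # 3x3 local covariance
        eig = np.sort(np.linalg.eigvalsh(cov))     # ascending eigenvalues
        curv[i] = eig[0] / (eig.sum() + 1e-12)     # flat -> 0, sharp -> 1/3
    return curv

curv = surface_variation(np.random.rand(500, 3))
```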
245: Enhancing Semantic Clarity: Discriminative and Fine-grained Information Mining for Remote Sensing Image-Text Retrieval
Authors: Yu Liu, Haipeng Chen, Yuheng Liang, Yuheng Yang, Xun Yang, Yingda Lyu
Location: Guangzhou | Day: TBD
Remote sensing image-text retrieval is a fundamental task in remote sensing multimodal analysis, promoting the alignment of visual and language representations. The mainstream approaches commonly focus on capturing shared semantic representations between visual and textual modalities. However, the inherent characteristics of remote sensing image-text pairs lead to a semantic confusion problem, stemming from redundant visual representations and high inter-class similarity. To tackle this problem, we propose a novel Discriminative and Fine-grained Information Mining (DFIM) model, which aims to enhance semantic clarity by reducing visual redundancy and increasing the semantic gap between different classes. Specifically, the Dynamic Visual Enhancement (DVE) module adaptively enhances the visual discriminative features under the guidance of multimodal fusion information. Meanwhile, the Fine-grained Semantic Matching (FSM) module cleverly models the matching relationship between image regions and text words as an optimal transport problem, thereby refining intra-instance matching. Extensive experiments on two benchmark datasets justify the superiority of DFIM in terms of retrieval accuracy and visual interpretability over the leading methods.
255: KnowRA: Knowledge Retrieval Augmented Method for Document-level Relation Extraction with Comprehensive Reasoning Abilities
Authors: Chengcheng Mai, Yuxiang Wang, Ziyu Gong, Hanxiang Wang, Yihua Huang
Location: Guangzhou | Day: TBD
Document-level relation extraction (Doc-RE) aims to extract relations between entities across multiple sentences. Doc-RE therefore requires more comprehensive, human-like reasoning abilities, involving complex cross-sentence interactions between entities, contexts, and external general knowledge, compared to sentence-level RE. However, most existing Doc-RE methods focus on optimizing a single reasoning ability and lack the ability to utilize external knowledge for comprehensive reasoning on long documents. To solve these problems, we propose a knowledge retrieval augmented method, named KnowRA, with comprehensive reasoning abilities that autonomously determines whether to accept external knowledge to assist Doc-RE. Firstly, we construct a document graph for semantic encoding and integrate a co-reference resolution model to augment the co-reference reasoning ability. Then, we expand the document graph into a document knowledge graph by retrieving an external knowledge base for common-sense reasoning, and present a novel knowledge filtration method to filter out irrelevant knowledge. Finally, we propose an axis attention mechanism to build direct and indirect associations with intermediary entities for cross-sentence logical reasoning. Extensive experiments conducted on two datasets verify the effectiveness of our method compared to the state-of-the-art baselines. Our code is available at https://anonymous.4open.science/r/KnowRA.
257: Causal Learning Meet Covariates: Empowering Lightweight and Effective Nationwide Air Quality Forecasting
Authors: Jiaming Ma, Zhiqing Cui, Binwu Wang, Pengkun Wang, Zhengyang Zhou, Zhe Zhao, Yang Wang
Location: Guangzhou | Day: TBD
Air quality prediction plays a crucial role in the development of smart cities, garnering significant attention from both academia and industry. Current air quality prediction models encounter two major limitations: their high computational complexity limits scalability to nationwide datasets, and they often regard weather covariates as optional auxiliary information. In reality, weather covariates can have a substantial impact on air quality indices (AQI), exhibiting a significant causal association. In this paper, we first present a nationwide air quality dataset to address the lack of open-source, large-scale datasets in this field. Then we propose a causal learning model, CauAir, for air quality prediction that harnesses the powerful representation capabilities of the Transformer to explicitly model the causal association between weather covariates and AQI. To address the high complexity of traditional Transformers, we design CachLormer, which features two key innovations: a simplified architecture with redundant components removed, and a cache-attention mechanism that employs learnable embeddings to perceive the causal association between AQI and weather covariates from a coarse-grained perspective. We use information theory to illustrate the superiority of the proposed model. Finally, experimental results on three datasets against 28 baselines demonstrate that our model achieves competitive performance while maintaining high training efficiency and low memory consumption. The source code is available at the CauAir Official Repository.
274: Optimized View and Geometry Distillation from Multi-view Diffuser
Authors: Youjia Zhang, Zikai Song, Junqing Yu, Yawei Luo, Wei Yang
Location: Guangzhou | Day: TBD
Generating multi-view images from a single input view using image-conditioned diffusion models is a recent advancement and has shown considerable potential. However, issues such as the lack of consistency in synthesized views and over-smoothing in extracted geometry persist. Previous methods integrate multi-view consistency modules or impose additional supervision to enhance view consistency, while compromising on the flexibility of camera positioning and limiting the versatility of view synthesis. In this study, we consider the radiance field optimized during geometry extraction as a more rigid consistency prior, compared to the volume and ray aggregation used in previous works. We further identify and rectify a critical bias in the traditional radiance field optimization process through score distillation from a multi-view diffuser. We introduce an Unbiased Score Distillation (USD) that utilizes unconditioned noises from a 2D diffusion model, greatly refining the radiance field fidelity. We leverage the rendered views from the optimized radiance field as the basis and develop a two-step specialization process of a 2D diffusion model, which is adept at conducting object-specific denoising and generating high-quality multi-view images. Finally, we recover faithful geometry and texture directly from the refined multi-view images. Empirical evaluations demonstrate that our optimized geometry and view distillation technique generates results comparable to state-of-the-art models trained on extensive datasets, all while maintaining freedom in camera positioning. Source code of our work is publicly available at: https://youjiazhang.github.io/USD/.
275: Mask Does Not Matter: A Unified Latent Diffusion-Enhanced Framework for Mask-Free Virtual Try-On
Authors: Chenghu Du, Junyin Wang, Kai Liu, Shengwu Xiong, Yi Rong
Location: Guangzhou | Day: TBD
A good virtual try-on model should introduce minimal redundant conditional information to avoid instability and increase inference efficiency. Existing methods rely on inpainting masks to guide the generation of the object, but the masks, generated by unstable human parsers, often produce unreliable results with fabric residues due to wrong segmentation. Moreover, large mask regions can lose spatial structure and identity information, requiring extra conditional inputs to compensate, which increases model instability and reduces efficiency. To tackle the problem, we present a novel Mask-Free virtual Try-ON (MFTON) framework. Specifically, we propose a mask-free strategy to eliminate all denoising conditions except for clothing and person images, thereby directly extracting spatial structure and identity information from the person image to improve efficiency and reduce instability. Additionally, to optimize the generated clothing regions, we propose a clothing texture-aware attention mechanism to enable the model to focus on texture generation with significant visual differences. We then introduce a geometric detail capture loss to further enable the model to capture more high-frequency information. Finally, we propose an appearance consistency inference method to reduce the initial randomness of the sampling process significantly. Extensive experiments on popular datasets demonstrate that our method outperforms state-of-the-art virtual try-on methods.
291: Prompt-Free Conditional Diffusion for Multi-object Image Augmentation
Authors: Haoyu Wang, Lei Zhang, Wei Wei, Chen Ding, Yanning Zhang
Location: Guangzhou | Day: TBD
Diffusion models have underpinned much recent progress in dataset augmentation for various computer vision tasks. However, when generating multi-object images as in real scenarios, most existing methods either rely entirely on a text condition, resulting in a deviation between the generated objects and the original data, or rely too much on the original images, resulting in a lack of diversity in the generated images, which is of limited help to downstream tasks. To mitigate both problems at once, we propose a prompt-free conditional diffusion framework for multi-object image augmentation. Specifically, we introduce a local-global semantic fusion strategy to extract semantics from images to replace text, and inject knowledge into the diffusion model through LoRA to alleviate the category deviation between the original model and the target dataset. In addition, we design a reward-model-based counting loss to assist the traditional reconstruction loss during model training. By constraining the object counts of each category instead of imposing pixel-by-pixel constraints, we bridge the quantity deviation between the generated data and the original data while improving the diversity of the generated data. Experimental results demonstrate the superiority of the proposed method over several representative state-of-the-art baselines and showcase strong downstream task gains and out-of-domain generalization capabilities. Code is available at https://github.com/00why00/PFCD.
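A counting loss of this kind can be sketched by summing per-object class probabilities into differentiable per-category counts and penalizing their deviation from the originals; the frozen counting/reward model producing those probabilities is assumed, and the smooth-L1 penalty is an illustrative choice rather than the paper's exact loss.

```python
# Sketch of a category-count constraint for generated images.
import torch

def soft_counts(probs):
    """probs: (B, n_objects, n_categories) per-object class probabilities,
    assumed to come from a frozen counting/reward model; summing over
    objects yields differentiable per-category counts."""
    return probs.sum(dim=1)

def counting_loss(gen_probs, ref_counts):
    return torch.nn.functional.smooth_l1_loss(soft_counts(gen_probs), ref_counts)

probs = torch.rand(2, 12, 5).softmax(-1)
loss = counting_loss(probs, torch.full((2, 5), 2.4))
```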
302: Enhancing Long-Tail Bundle Recommendations Utilizing Composition Pattern Modeling
Authors: Tianhui Ma, Shuyao Wang, Zhi Zheng, Hui Xiong
Location: Guangzhou | Day: TBD
Bundle recommendation aims to provide users with a one-stop service by offering a collection of related items. However, these systems face a significant challenge: a small portion of bundles accumulate most interactions while the long-tail bundles receive few interactions. This imbalance leads to poor performance for long-tail bundles despite their potential to satisfy diverse user needs. Existing long-tail item recommendation methods fail to effectively address this problem, as long-tail bundle recommendation requires capturing not only the user-bundle interactions but also the item compositions in different bundles. Therefore, in this paper, we propose Composition-Aware Long-tail Bundle Recommendation (CALBRec), which leverages the inherent composition patterns shared across different bundles as valuable signals for further representation augmentation and recommendation enhancement. Specifically, to handle the complexity of modeling shared composition patterns, which grows combinatorially with the number of items and bundle sizes, we first introduce a composition-aware tail adapter to capture the shared composition patterns and then adaptively integrate them into individual bundle representations. Moreover, to mitigate the impact of noise in user-bundle interaction data, we propose to map the bundle representations onto a set of learnable prototypes, and we further propose a prototype learning module to combine the composition patterns with interaction signals for tail bundles. Extensive experiments on three public datasets demonstrate that our method significantly improves bundle recommendation performance, especially on long-tail bundles.
309: HiTuner: Hierarchical Semantic Fusion Model Fine-Tuning on Text-Attributed Graphs
Authors: Zihan Fang, Zhiling Cai, Yuxuan Zheng, Shide Du, Yanchao Tan, Shiping Wang
Location: Guangzhou | Day: TBD
Text-Attributed Graphs (TAGs) are vital for modeling entity relationships across various domains. Graph Neural Networks have become a cornerstone for processing graph structures, while the integration of text attributes remains a prominent research focus. The development of Large Language Models (LLMs) provides new opportunities for advancing textual encoding in TAGs. However, LLMs face challenges in specialized domains due to their limited task-specific knowledge, and fine-tuning them for specific tasks demands significant resources. To cope with the above challenges, we propose HiTuner, a novel framework that leverages fine-tuned Pre-trained Language Models (PLMs) with domain expertise as a tuner to enhance the hierarchical LLM contextualized representations for modeling TAGs. Specifically, we first strategically select hierarchical hidden states of the LLM to form a set of diverse and complementary descriptions as input for the sparse projection operator. Concurrently, a hybrid representation learning scheme is developed to amalgamate the broad linguistic comprehension of LLMs with the task-specific insights of the fine-tuned PLMs. Finally, HiTuner employs a confidence network to adaptively fuse the semantically-augmented representations. Empirical results across benchmark datasets spanning various domains validate the effectiveness of the proposed framework. Our codes are available at: https://github.com/ZihanFang11/HiTuner
310: Multi-Sourced Compositional Generalization in Visual Question Answering
Authors: Chuanhao Li, Wenbo Ye, Zhen Li, Yuwei Wu, Yunde Jia
Location: Guangzhou | Day: TBD
Show Abstract
Compositional generalization is the ability to generalize to novel compositions of seen primitives, and has received much attention in vision-and-language (V&L) recently. Due to the multi-modal nature of V&L tasks, the primitives composing a composition may come from different modalities, resulting in multi-sourced novel compositions. However, the ability to generalize over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. The GQA-MSCG dataset is available at https://github.com/NeverMoreLCH/MSCG.
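A minimal sketch of the aggregation step described above (the mean pooling and interpolation weight are illustrative assumptions, not details taken from the paper):

```python
import torch

def refine_primitive(original, retrieved, lam=0.5):
    """Blend a primitive's feature with the pooled features of its retrieved,
    semantically equivalent primitives, possibly from other modalities.

    original:  (D,) feature of one primitive in a training sample
    retrieved: (K, D) features of K retrieved equivalent primitives
    """
    return (1.0 - lam) * original + lam * retrieved.mean(dim=0)
```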
322: Improvements to the Generate-and-Complete Approach to Conformant Planning
Authors: Liangda Fang, Min Zhan, Jin Tong, Xiujie Huang, Ziliang Chen, Quanlong Guan
Location: Guangzhou | Day: TBD
Show Abstract
Conformant planning is a computationally challenging task that generates an action sequence achieving a goal condition under uncertain initial states and non-deterministic actions. The generate-and-complete (GC) approach, which iteratively enumerates a solution of a planning subproblem for a single initial state and attempts to extend it to all initial states until a conformant solution is found, shows superior performance on conformant planning. However, two major drawbacks hinder its performance: the computational overhead of state exploration and the insertion of many redundant actions. To overcome these drawbacks, we improve both the verification and completion procedures. Experimental results show that the improved GC planner significantly outperforms the original GC approach on many instances with a large number of initial states. Our approach also outperforms all state-of-the-art planners, solving 989 instances compared to 784, the most solved by DNF.
324: Directing Mamba to Complex Textures: An Efficient Texture-Aware State Space Model for Image Restoration
Authors: Long Peng, Xin Di, Zhanfeng Feng, Wenbo Li, Renjing Pei, Yang Wang, Xueyang Fu, Yang Cao, Zheng-Jun Zha
Location: Guangzhou | Day: TBD
Show Abstract
Image restoration aims to recover details and enhance contrast in degraded images. With the growing demand for high-quality imaging (e.g., 4K and 8K), achieving a balance between restoration quality and computational efficiency has become increasingly critical. Existing methods, primarily based on CNNs, Transformers, or their hybrid approaches, apply uniform deep representation extraction across the image. However, these methods often struggle to effectively model long-range dependencies and largely overlook the spatial characteristics of image degradation (regions with richer textures tend to suffer more severe damage), making it hard to achieve the best trade-off between restoration quality and efficiency. To address these issues, we propose a novel texture-aware image restoration method, TAMambaIR, which simultaneously perceives image textures and achieves a trade-off between performance and efficiency. Specifically, we introduce a novel Texture-Aware State Space Model, which enhances texture awareness and improves efficiency by modulating the transition matrix of the state-space equation and focusing on regions with complex textures. Additionally, we design a Multi-Directional Perception Block to improve multi-directional receptive fields while maintaining low computational overhead. Extensive experiments on benchmarks for image super-resolution, deraining, and low-light image enhancement demonstrate that TAMambaIR achieves state-of-the-art performance with significantly improved efficiency, establishing it as a robust and efficient framework for image restoration.
325: Flow-based Time-aware Causal Structure Learning for Sequential Recommendation
Authors: Hangtong Xu, Yuanbo Xu, Huayuan Liu, En Wang
Location: Guangzhou | Day: TBD
Show Abstract
Sequential models aim to predict future interactions based on users' historical interaction sequences. Traditional sequential methods primarily focus on capturing intra-sequence dependencies, overlooking the influence of unobserved confounders in recommendation scenarios. Recent studies incorporate time as additional information to help the model capture dynamic user preferences. However, time is only the external manifestation of the influence of confounders, not the actual cause of the dynamics of user preference. Additionally, improperly integrating time with item embeddings can obstruct the model's ability to capture sequence dependencies. To address these challenges, we first revisit the sequential recommendation problem from a causal perspective and incorporate confounders as a new task. We propose a new framework, Flow-based Time-aware Causal Structure for Sequential Recommendation (FCSRec), which explicitly incorporates the influence of unobserved confounders in the recommendation process. Specifically, we use Normalizing Flows to learn the causal graph of confounders and incorporate time information as a condition to capture confounders' time-sensitive representations. To balance the influence of confounders and sequence dependencies, we introduce a classifier-free training paradigm that randomly masks the influence of confounders during training, encouraging the model to learn both sequence dependencies and confounders' influence equally. We validate FCSRec on multiple real-world datasets, and experimental results show that FCSRec outperforms several state-of-the-art methods in recommendation performance. Our code is available at Code-link.
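A sketch of the classifier-free masking paradigm (fusion by addition and the masking probability are assumptions; the paper's architecture is not reproduced here):

```python
import torch

def fuse_with_confounder_dropout(seq_emb, conf_emb, p_mask=0.5, training=True):
    """Randomly mask the confounder representation during training so the
    model learns sequence dependencies and confounder influence equally.

    seq_emb:  (B, D) sequence-dependency representation
    conf_emb: (B, D) time-sensitive confounder representation
    """
    if training:
        keep = torch.rand(conf_emb.size(0), 1, device=conf_emb.device) > p_mask
        conf_emb = conf_emb * keep.float()  # masked rows carry no confounder signal
    return seq_emb + conf_emb
```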
326: Linear Trading Position with Sparse Spectrum
Authors: Zhao-Rong Lai, Haisheng Yang
Location: Guangzhou | Day: TBD
Show Abstract
The principal portfolio approach is an emerging method in signal-based trading. However, principal portfolios may not be sufficiently diversified to explore the key features of the prediction matrix, nor robust across different situations. To address this problem, we propose a novel linear trading position with sparse spectrum that can explore a larger spectral region of the prediction matrix. We also develop a Krasnosel'skii-Mann fixed-point algorithm to optimize this trading position, which possesses the descent property and achieves a linear convergence rate in the objective value. This is a new theoretical result for this type of algorithm. Extensive experiments show that the proposed method achieves good and robust performance in various situations.
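For context, the generic Krasnosel'skii-Mann iteration for a nonexpansive operator T takes the averaged form below; the paper's specific operator, step sizes, and its descent and linear-rate guarantees are its own contributions and are not reproduced here.

```latex
% Krasnosel'skii--Mann iteration: averaged steps toward a fixed point of T
x_{k+1} = (1 - \lambda_k)\, x_k + \lambda_k\, T(x_k), \qquad \lambda_k \in (0, 1)
```

Classically, if T is nonexpansive with a fixed point and the series \sum_k \lambda_k (1 - \lambda_k) diverges, the iterates converge to a fixed point of T.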
336: Boost Embodied AI Models with Robust Compression Boundary
Authors: Chong Yu, Tao Chen, Zhongxue Gan
Location: Guangzhou | Day: TBD
Show Abstract
The rapid progress of deep learning models integrated with the physical world has dramatically improved embodied AI capabilities. Meanwhile, the power and scale of embodied AI models place an increasing burden on deployment efficiency. The efficiency issue is more apparent on embodied AI platforms than in data centers because such platforms have more limited computational resources and memory bandwidth, and most embodied AI scenarios, like autonomous driving and robotics, are more sensitive to fast responses. In theory, traditional model compression techniques can give embodied AI models more efficient computation, lower memory and energy consumption, and reduced latency. Because embodied AI models are expected to interact with the physical world, the corresponding compressed models are also expected to resist natural corruption caused by real-world events such as noise, blur, and weather conditions, and even adversarial corruption. This paper explores a novel paradigm for boosting the efficiency of embodied AI models within a robust compression boundary. Our method is shown to find an optimal balance between accuracy, efficiency, and robustness in real-world conditions.
354: kgMBQA: Quality Knowledge Graph-driven Multimodal Blind Image Assessment
Authors: Wuyuan Xie, Tingcheng Bian, Miaohui Wang
Location: Guangzhou | Day: TBD
Show Abstract
Blind image assessment aims to simulate human prediction of image quality distortion levels and provide quality scores. However, existing unimodal quality indicators have limited representational ability when facing complex contents and distortion types, and the predicted scores also fail to provide explanatory reasons, which further affects the credibility of their prediction results. To address these challenges, we propose a multimodal quality indicator with explanatory text descriptions, called kgMBQA. Specifically, we construct an image quality knowledge graph and conduct in-depth mining to generate explanatory texts. The text modality is further aligned and fused with the image modality, thereby improving the model performance while also outputting its corresponding quality explanatory description. The experimental results demonstrate that our kgMBQA achieves the best performance compared to recent representative methods on the KonIQ-10k, LIVE Challenge, BIQ2021, TID2013, and AIGC-3K datasets.
359: Multi-granularity Knowledge Transfer for Continual Reinforcement Learning
Authors: Chaofan Pan, Lingfei Ren, Yihui Feng, Linbo Xiong, Wei Wei, Yonghao Li, Xin Yang
Location: Guangzhou | Day: TBD
Show Abstract
Continual reinforcement learning (CRL) empowers RL agents with the ability to learn a sequence of tasks, accumulating knowledge learned in the past and using it for problem-solving or future task learning. However, existing methods often focus on transferring fine-grained knowledge across similar tasks, which neglects the multi-granularity structure of human cognitive control and results in insufficient knowledge transfer across diverse tasks. To enhance coarse-grained knowledge transfer, we propose a novel framework called MT-Core (shorthand for Multi-granularity knowledge Transfer for Continual reinforcement learning). MT-Core's key characteristic is multi-granularity policy learning: 1) a coarse-grained policy formulation that utilizes the powerful reasoning ability of a large language model (LLM) to set goals, and 2) fine-grained policy learning through RL, oriented by those goals. We also construct a new policy library (knowledge base) to store policies that can be retrieved for multi-granularity knowledge transfer. Experimental results demonstrate the superiority of the proposed MT-Core in handling diverse CRL tasks versus popular baselines.
363: Injecting Imbalance Sensitivity for Multi-Task Learning
Authors: Zhipeng Zhou, Liu Liu, Peilin Zhao, Wei Gong
Location: Guangzhou | Day: TBD
Show Abstract
Multi-task learning (MTL) has emerged as a promising approach for deploying deep learning models in real-life applications. Recent studies have proposed optimization-based learning paradigms to establish task-shared representations in MTL. However, our paper empirically argues that these studies, specifically gradient-based ones, primarily emphasize the conflict issue while neglecting the potentially more significant impact of imbalance/dominance in MTL. In line with this perspective, we enhance the existing baseline method by injecting imbalance-sensitivity through the imposition of constraints on the projected norms. To demonstrate the effectiveness of our proposed IMbalance-sensitive Gradient (IMGrad) descent method, we evaluate it on multiple mainstream MTL benchmarks, encompassing supervised learning tasks as well as reinforcement learning. The experimental results consistently demonstrate competitive performance.
366: FBQuant: FeedBack Quantization for Large Language Models
Authors: Yijiang Liu, Hengyu Fang, Liulu He, Rongyu Zhang, Yichuan Bai, Yuan Du, Li Du
Location: Guangzhou | Day: TBD
Show Abstract
Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to the limited computational resources of edge devices. In particular, the key bottleneck stems from memory bandwidth constraints related to weight loading.
Weight-only quantization effectively reduces memory access, yet often induces significant accuracy degradation.
Recent efforts to incorporate sub-branches have shown promise for mitigating quantization errors, but these methods either lack robust optimization strategies or rely on suboptimal objectives. To address these gaps, we propose FeedBack Quantization (FBQuant), a novel approach inspired by negative feedback mechanisms in automatic control.
FBQuant inherently ensures that the reconstructed weights remain bounded by the quantization process, thereby reducing the risk of overfitting.
To further offset the additional latency introduced by the sub-branches, we develop an efficient CUDA kernel that cuts the extra inference time by 60%.
Comprehensive experiments demonstrate the efficiency and effectiveness of FBQuant across various LLMs. Notably, for 3-bit Llama2-7B, FBQuant improves zero-shot accuracy by 1.2%.
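A very rough sketch of a negative-feedback-style reconstruction (the uniform quantizer, the sub-branch, and the feedback placement are all assumptions; the paper's exact formulation may differ):

```python
import torch

def fake_quant(w, n_bits=3):
    """Uniform symmetric fake quantization (illustrative only)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def feedback_reconstruct(w, sub_branch, n_bits=3):
    """Let a sub-branch correct the quantization residual, then feed the
    corrected weight back through the quantizer, so the reconstruction
    stays bounded by the quantization process instead of drifting.
    """
    w_q = fake_quant(w, n_bits)
    corrected = w_q + sub_branch(w - w_q)  # sub_branch: hypothetical low-cost module
    return fake_quant(corrected, n_bits)
```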
392: SCOUT: Semi-supervised Camouflaged Object Detection by Utilizing Text and Adaptive Data Selection
Authors: Weiqi Yan, Lvhai Chen, Shengchuan Zhang, Yan Zhang, Liujuan Cao
Location: Guangzhou | Day: TBD
Show Abstract
The difficulty of pixel-level annotation has significantly hindered the development of the Camouflaged Object Detection (COD) field. To save on annotation costs, previous works leverage a semi-supervised COD framework that relies on a small amount of labeled data and a large volume of unlabeled data. We argue that there is still significant room for improvement in the effective utilization of unlabeled data. To this end, we introduce SCOUT, a semi-supervised camouflaged object detection method utilizing text and adaptive data selection. It includes an Adaptive Data Augment and Selection (ADAS) module and a Text Fusion Module (TFM). The ADAS module selects valuable data for annotation through an adversarial augment-and-sampling strategy. The TFM module further leverages the selected data by combining camouflage-related knowledge with text-visual interaction. To support this work, we build a new dataset, RefTextCOD. Extensive experiments show that the proposed method surpasses previous semi-supervised methods in the COD field and achieves state-of-the-art performance. Our code will be released at https://github.com/Heartfirey/UCOD-DPL.
405: Unleashing the Potential of Transformer Flow for Photorealistic Face Restoration
Authors: Kepeng Xu, Li Xu, Gang He, Wei Chen, Xianyun Wu, Wenxin Yu
Location: Guangzhou | Day: TBD
Show Abstract
Face restoration is a challenging task due to the need to remove artifacts and restore details. Traditional methods usually rely on generative model priors, but the restored results remain insufficient in realism and detail. In this paper, we introduce OmniFace, a novel face restoration framework that leverages a Transformer-based diffusion flow. By exploiting the scaling property of Transformers, OmniFace achieves high-resolution restoration with exceptional realism and detail. The framework integrates three key components: (1) a Transformer-driven vector estimation network, (2) a representation-aligned ControlNet, and (3) an adaptive training strategy for face restoration. The inherent scaling law of Transformer architectures enables the restoration of high-quality faces at high resolution. The ControlNet, combined with pre-trained diffusion representations, can be trained easily. The adaptive training strategy provides a vector field better suited to face restoration. Comprehensive experiments demonstrate that OmniFace outperforms existing techniques in restoration quality across multiple benchmark datasets, especially in restoring photographic-level texture details in high-resolution scenes.
410: DO-CoLM: Dynamic 3D Conformation Relationships Capture with Self-Adaptive Ordering Molecular Relational Modeling in Language Models
Authors: Zhuo Chen, Jiahui Zhang, Sihan Wang, Hongxin Xiang, Jianmin Wang, Wenjie Du, Yang Wang
Location: Guangzhou | Day: TBD
Show Abstract
Molecular Relational Learning (MRL) aims to understand interactions between molecular pairs, playing a critical role in advancing biochemical research. Recently, Large Language Models (LLMs), with their extensive knowledge bases and advanced reasoning capabilities, have emerged as powerful tools for MRL. However, existing LLMs, which primarily rely on SMILES strings and molecular graphs, face two major challenges. They struggle to capture molecular stereochemistry and dynamics, as molecules possess multiple 3D conformations with varying reactivity and dynamic transformation relationships that are essential for accurately predicting molecular interactions but cannot be effectively represented by 1D SMILES or 2D molecular graphs. Additionally, these models do not consider the autoregressive nature of LLMs, overlooking the impact of input order on model performance. To address these issues, we propose DO-CoLM: a Dynamic relationship capture and self-adaptive Ordering 3D molecular Conformation LM for MRL. By introducing modules to dynamically model intra-molecular and inter-molecular conformational relationships and adaptively adjust the molecular modality input order, DO-CoLM achieves superior performance, as demonstrated by experimental results on 12 cross-domain datasets.
449: InfVC: An Inference-Enhanced Local Search Algorithm for the Minimum Vertex Cover Problem in Massive Graphs
Authors: Rui Sun, Peiyan Liu, Yiyuan Wang, Zhaohui Liu, Liping Du, Jian Gao
Location: Guangzhou | Day: TBD
Show Abstract
The minimum vertex cover (MVC) problem is a classic NP-hard combinatorial optimization problem with extensive real-world applications. In this paper, we propose an efficient local search algorithm, InfVC, to solve MVC in massive graphs, built on three ideas. First, we introduce an inference-driven optimization strategy that explores better feasible solutions through inference rules. Second, we develop a structure-determined perturbation strategy motivated by the structural features of high-quality solutions, prioritizing high-degree vertices for the candidate solution to guide the search toward potentially high-quality search areas. Third, we design a self-adaptive local search framework that dynamically balances exploration and exploitation through a perturbation management mechanism. Extensive experiments demonstrate that InfVC outperforms all state-of-the-art algorithms on almost all massive instances.
459: Adversarial Attacks on Both Face Recognition and Face Anti-spoofing Models
Authors: Fengfan Zhou, Qianyu Zhou, Hefei Ling, Xuequan Lu
Location: Guangzhou | Day: TBD
Show Abstract
Adversarial attacks on Face Recognition (FR) systems have demonstrated significant effectiveness against standalone FR models. However, their practicality diminishes in complete FR systems that incorporate Face Anti-Spoofing (FAS) models, as these models can detect and mitigate a substantial number of adversarial examples. To address this critical yet under-explored challenge, we introduce a novel attack setting that targets both FR and FAS models simultaneously, thereby enhancing the practicability of adversarial attacks on integrated FR systems. Specifically, we propose a new attack method, termed Reference-free Multi-level Alignment (RMA), designed to improve the capability of black-box attacks against both FR and FAS models. The RMA framework is built upon three key components. Firstly, we propose an Adaptive Gradient Maintenance module to address the imbalance in gradient contributions between FR and FAS models. Secondly, we develop a Reference-free Intermediate Biasing module to improve the transferability of adversarial examples against FAS models. In addition, we introduce a Multi-level Feature Alignment module to reduce feature discrepancies at various levels of representation. Extensive experiments showcase the superiority of our proposed attack method over state-of-the-art adversarial attacks.
486: T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction
Authors: Kun Peng, Chaodong Tong, Cong Cao, Hao Peng, Qian Li, Guanlin Wu, Lei Jiang, Yanbing Liu, Philip S. Yu
Location: Guangzhou | Day: TBD
Show Abstract
Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed of aspect terms, opinion terms, and sentiment polarities from given sentences. The table tagging method is a popular approach to this task: it encodes a sentence into a 2-dimensional table, allowing the tagging of relations between any two words. Previous efforts have focused on designing various downstream relation learning modules to better capture interactions between tokens in the table, revealing that a stronger relation-capture capability leads to greater improvements in the model. Motivated by this, we attempt to directly utilize transformer layers as downstream relation learning modules; given the powerful semantic modeling capability of transformers, this can be expected to yield substantial improvement. However, owing to the quadratic relation between the length of the table and the length of the input sentence, using transformers directly faces two challenges: overly long table sequences and unfair local attention interaction. To address these challenges, we propose a novel Table-Transformer (T-T) for the tagging-based ASTE method. Specifically, we introduce a stripe attention mechanism with a loop-shift strategy: the former restricts the global attention mechanism to a 2-dimensional local attention window, while the latter facilitates interaction between different attention windows. Extensive and comprehensive experiments demonstrate that T-T, as a downstream relation learning module, achieves state-of-the-art performance with lower computational costs.
489: Prompt-Aware Controllable Shadow Removal
Authors: Kerui Chen, Zhiliang Wu, Wenjin Hou, Kun Li, Hehe Fan, Yi Yang
Location: Guangzhou | Day: TBD
Show Abstract
Shadow removal aims to restore the image content in shadowed regions. While deep learning-based methods have shown promising results, they still face key challenges: 1) uncontrolled removal of all shadows, or 2) controllable removal that relies heavily on precise shadow region masks. To address these issues, we introduce a novel paradigm: prompt-aware controllable shadow removal. Unlike existing approaches, our paradigm allows targeted shadow removal from specific subjects based on user prompts (e.g., dots, lines, or subject masks). This approach eliminates the need for shadow annotations and offers flexible, user-controlled shadow removal. Specifically, we propose an end-to-end learnable model, the Prompt-Aware Controllable Shadow Removal Network (PACSRNet). PACSRNet consists of two key modules: a prompt-aware module that generates shadow masks for the specified subject based on the user prompt, and a shadow removal module that uses the shadow prior from the first module to restore the content in the shadowed areas. Additionally, we enhance the shadow removal module by incorporating feature information from the prompt-aware module through a linear operation, providing prompt-guided support for shadow removal. Recognizing that existing shadow removal datasets lack diverse user prompts, we contribute a new dataset specifically designed for prompt-based controllable shadow removal. Extensive experimental results demonstrate the effectiveness and superiority of PACSRNet.
496: PDDFormer: Pairwise Distance Distribution Graph Transformer for Crystal Material Property Prediction
Authors: Xiangxiang Shen, Zheng Wan, Lingfeng Wen, Licheng Sun, Jian Yang, Xuan Tang, Shing-Ho J. Lin, Xiao He, Mingsong Chen, Xian Wei
Location: Guangzhou | Day: TBD
Show Abstract
Crystal structures can be simplified as a periodic point set that repeats across three-dimensional space along an underlying lattice. Traditionally, crystal representation methods rely on descriptors such as lattice parameters, symmetry, and space groups to characterize the structure. However, in reality, atoms in materials always vibrate above absolute zero, causing their positions to fluctuate continuously. This dynamic behavior disrupts the fundamental periodicity of the lattice, making crystal graphs based on static lattice parameters and conventional descriptors discontinuous under slight perturbations. Chemists proposed the pairwise distance distribution (PDD) method to address this. However, the completeness of PDD requires defining a large number of neighboring atoms, leading to high computational costs. Additionally, PDD does not account for atomic information, making it challenging to apply it directly to crystal material property prediction tasks. To tackle these challenges, we introduce the atom-weighted Pairwise Distance Distribution (WPDD) and Unit cell Pairwise Distance Distribution (UPDD) for the first time, applying them to the construction of multi-edge crystal graphs. We demonstrate the continuity and general completeness of crystal graphs under slight atomic position perturbations. Moreover, by modeling PDD as global information and integrating it into matrix-based message passing, we significantly reduce computational costs. Comprehensive evaluation results show that WPDDFormer achieves state-of-the-art predictive accuracy across tasks on benchmark datasets such as the Materials Project and JARVIS-DFT.
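For readers unfamiliar with PDD, below is a minimal sketch of computing the pairwise distance distribution of a periodic point set (unweighted by atom type; the WPDD variant described above would additionally fold atomic information into the rows). The shells parameter is an assumption and must be large enough that every motif point's k nearest neighbors fall inside the tiled region.

```python
import itertools
import numpy as np

def pdd(motif, lattice, k=10, shells=2):
    """Pairwise Distance Distribution of a periodic point set.

    motif:   (m, 3) fractional coordinates of atoms in the unit cell
    lattice: (3, 3) lattice vectors as rows
    Returns row weights and a (rows, k) matrix of sorted neighbor distances.
    """
    motif = np.asarray(motif, dtype=float)
    cells = np.array(list(itertools.product(range(-shells, shells + 1), repeat=3)))
    # All atom positions in the tiled block of (2*shells + 1)^3 unit cells.
    env = (cells[:, None, :] + motif[None, :, :]).reshape(-1, 3) @ lattice
    rows = []
    for p in motif @ lattice:
        d = np.sort(np.linalg.norm(env - p, axis=1))
        rows.append(d[1:k + 1])  # drop the zero self-distance
    rows = np.round(np.array(rows), 8)
    uniq, counts = np.unique(rows, axis=0, return_counts=True)
    return counts / len(motif), uniq  # weights sum to 1
```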
498: Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Authors: Xin He, Longhui Wei, Lingxi Xie, Qi Tian
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal Large Language Models (MLLMs) are experiencing rapid growth, yielding a plethora of novel works recently. The prevailing trend involves adopting data-driven methodologies, wherein diverse instruction-following datasets are collected. However, these approaches always face the challenge of limited visual perception capabilities, as they rely solely on CLIP-like encoders to extract visual information from inputs. Though these encoders are pre-trained on billions of image-text pairs, they still grapple with an information-loss dilemma, given that textual captions only partially capture the contents depicted in images. To address this limitation, this paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. Specifically, this work introduces a novel method that incorporates multi-task encoders and existing visual tools into the MLLM training and inference pipeline, aiming to provide a more comprehensive summarization of visual inputs. Extensive experiments validate its effectiveness in advancing MLLMs, showcasing the improved visual perception capability achieved through the integration of visual experts.
508: View-Association-Guided Dynamic Multi-View Classification
Authors: Xinyan Liang, Li Lv, Qian Guo, Bingbing Jiang, Feijiang Li, Liang Du, Lu Chen
Location: Guangzhou | Day: TBD
Show Abstract
In multi-view classification tasks, integrating information from multiple views effectively is crucial for improving model performance. However, most existing methods fail to fully leverage the complex relationships between views, often treating them independently or using static fusion strategies. In this paper, we propose a View-Association-Guided Dynamic Multi-View Classification method (AssoDMVC) to address these limitations. Our approach dynamically models and incorporates the relationships between different views during the classification process. Specifically, we introduce a view-relation-guided mechanism that captures the dependencies and interactions between views, allowing for more flexible and adaptive feature fusion. This dynamic fusion strategy ensures that each view contributes optimally based on its contextual relevance and the inter-view relationships. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms traditional multi-view classification techniques, offering a more robust and efficient solution for tasks involving complex multi-view data.
509: Phenotypic Profile-Informed Generation of Drug-Like Molecules via Dual-Channel Variational Autoencoders
Authors: Hui Liu, Shiye Tian, Xuejun Liu
Location: Guangzhou | Day: TBD
Show Abstract
The de novo generation of drug-like molecules capable of inducing desirable phenotypic changes is receiving increasing attention. However, previous methods predominantly rely on expression profiles to guide molecule generation while overlooking the perturbative effect of the molecules on cellular contexts. To overcome this limitation, we propose SmilesGEN, a novel generative model based on the variational autoencoder (VAE) architecture to generate molecules with potential therapeutic effects. SmilesGEN integrates a pre-trained drug VAE (SmilesNet) with an expression profile VAE (ProfileNet), jointly modeling the interplay between drug perturbations and transcriptional responses in a common latent space. Specifically, ProfileNet is required to reconstruct pre-treatment expression profiles when drug-induced perturbations are eliminated in the latent space, while SmilesNet is informed by desired expression profiles to generate drug-like molecules. Our empirical experiments demonstrate that SmilesGEN outperforms current state-of-the-art models in generating molecules with a higher degree of validity, uniqueness, and novelty, as well as higher Tanimoto similarity to known ligands targeting the relevant proteins. Moreover, we evaluate SmilesGEN for scaffold-based molecule optimization and generation of therapeutic agents, and confirm its superior performance in generating molecules with higher similarity to approved drugs. SmilesGEN establishes a robust framework that leverages gene signatures to generate drug-like molecules with promising potential to induce desirable cellular phenotypic changes. The source code and datasets are available at: https://github.com/hliulab/SmilesGEN.
519: Drafting and Revision: Advancing High-Fidelity Video Inpainting
Authors: Zhiliang Wu, Kun Li, Hehe Fan, Yi Yang
Location: Guangzhou | Day: TBD
Show Abstract
Video inpainting aims to fill the missing regions of a video with spatially and temporally coherent content. Existing methods usually treat the missing content as a whole and adopt a hybrid objective containing a reconstruction loss and an adversarial loss to train the model. However, these two losses focus on content at different frequencies, so simply combining them may cause inter-frequency conflicts, leading the trained model to generate compromised results. Inspired by the common corrupted-painting restoration process of "drawing a draft first and then revising the details later", this paper proposes a Drafting-and-Revision Completion Network (DRCN) for video inpainting. Specifically, we first design a Drafting Network that utilizes temporal information to complete the low-frequency semantic structure at low resolution. Then, a Revision Network is developed to hallucinate high-frequency details at high resolution using the output of the Drafting Network. In this way, the adversarial and reconstruction losses can be applied to the high- and low-frequency components respectively, effectively mitigating inter-frequency conflicts. Furthermore, the Revision Network can be stacked in a pyramid manner to generate details at even higher resolutions, providing a feasible solution for high-resolution video inpainting. Experiments show that DRCN achieves improvements of 7.43% and 12.64% in E_warp and LPIPS, and can handle higher-resolution videos within limited GPU memory.
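A compact sketch of the frequency-separated supervision (the loss forms and the discriminator disc are assumptions for illustration):

```python
import torch.nn.functional as F

def drafting_revision_losses(draft_lr, target_lr, revised_hr, disc):
    """Supervise the low-frequency draft with reconstruction only and the
    high-frequency revision with an adversarial term only, so the two
    objectives no longer pull one output in conflicting directions.
    """
    rec_loss = F.l1_loss(draft_lr, target_lr)        # low-frequency structure
    adv_loss = F.softplus(-disc(revised_hr)).mean()  # high-frequency details
    return rec_loss, adv_loss
```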
523: OMS: One More Step Noise Searching to Enhance Membership Inference Attacks for Diffusion Models
Authors: Xiaomeng Fu, Xi Wang, Qiao Li, Jin Liu, Jiao Dai, Jizhong Han, Xingyu Gao
Location: Guangzhou | Day: TBD
Show Abstract
The data-intensive nature of Diffusion models amplifies the risks of privacy infringements and copyright disputes, particularly when training on extensive unauthorized data scraped from the Internet. Membership Inference Attacks (MIA) aim to determine whether a data sample has been utilized by the target model during training, thereby serving as a pivotal tool for privacy preservation. Current MIA employs the prediction loss to distinguish between training member samples and non-members.
These methods assume that members, having been encountered by the model during training, yield a smaller prediction loss than non-members. However, this assumption proves ineffective in diffusion models due to the random noise sampled during the training process. Rather than estimating the loss, our approach examines this random noise and reformulates MIA as a noise search problem, assuming that for members it is more feasible to recover the noise used during training.
We formulate this noise search process as an optimization problem and employ the fixed-point iteration to solve it. We analyze current MIA methods through the lens of the noise search framework and reveal that they rely on the first residual as the discriminative metric to differentiate members and non-members. Inspired by this observation, we introduce OMS, which augments existing MIA methods by iterating One More fixed-point Step to include a further residual, i.e., the second residual.
We integrate our method into various MIA methods across different diffusion models. The experimental results validate the efficacy of our proposed approach.
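A schematic of the fixed-point noise search and its residuals (the denoiser signature, the initialization, and the membership scoring are assumptions; eps_model stands in for the diffusion model's noise predictor):

```python
import torch

@torch.no_grad()
def noise_search_residuals(eps_model, x0, t, alpha_bar_t, steps=2):
    """Iterate eps <- eps_model(sqrt(a)*x0 + sqrt(1-a)*eps, t) and record
    residuals between consecutive iterates. The first residual mirrors
    existing loss-based MIA scores; one more step yields the second.
    """
    a = torch.as_tensor(alpha_bar_t, dtype=x0.dtype, device=x0.device)
    eps = torch.randn_like(x0)
    residuals = []
    for _ in range(steps):
        x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps
        eps_next = eps_model(x_t, t)
        residuals.append((eps_next - eps).flatten(1).norm(dim=1))
        eps = eps_next
    return residuals  # smaller residuals suggest membership
```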
549: An Association-based Fusion Method for Speech Enhancement
Authors: Shijie Wang, Qian Guo, Lu Chen, Liang Du, Zikun Jin, Zhian Yuan, Xinyan Liang
Location: Guangzhou | Day: TBD
Show Abstract
Deep learning-based speech enhancement (SE) methods predominantly draw upon two architectural frameworks: generative adversarial networks and diffusion models. In SE, capturing the local and global relations between signal frames is crucial to the success of these methods. These frameworks typically employ a UNet as their backbone, integrating Long Short-Term Memory (LSTM) networks or attention mechanisms within the UNet to model both local and global signal relations. However, this coupled way of modeling relations may not fully harness their potential. In this paper, we propose an Association-based Fusion Speech Enhancement method (AFSE), a decoupled approach. AFSE first constructs a graph that encapsulates the associations between time windows of the speech signal, and then models the global relations between frames by fusing the features of these time windows in a manner akin to graph neural networks. Furthermore, AFSE leverages a UNet with dilated convolutions to model the local relations, enabling the network to maintain a high-resolution representation while benefiting from a wider receptive field. Experimental results demonstrate that AFSE significantly improves performance on speech enhancement tasks, validating the effectiveness and superiority of our approach. The code is available at https://github.com/jie019/AFSE_IJCAI2025.
588: Learnable Frequency Decomposition for Image Forgery Detection and Localization
Authors: Dong Li, Jiaying Zhu, Yidi Liu, Xin Lu, Xueyang Fu, Jiawei Liu, Aiping Liu, Zheng-Jun Zha
Location: Guangzhou | Day: TBD
Show Abstract
Concern for image authenticity spurs research in image forgery detection and localization (IFDL). Most deep learning-based methods focus primarily on spatial domain modeling and have not fully explored frequency domain strategies. In this paper, we observe and analyze the frequency characteristic changes caused by image tampering. Observations indicate that manipulation traces are especially prominent in phase components and span both low and high-frequency bands. Based on these findings, we propose a forensic frequency decomposition network (F2D-Net), which incorporates deep Fourier transforms and leverages both phase information and high and low-frequency components to enhance IFDL. Specifically, F2D-Net consists of the Spectral Decomposition Subnetwork (SDSN) and the Frequency Separation Subnetwork (FSSN). The former decomposes the image into amplitude and phase, focusing on learning the semantic content in the phase spectrum to identify forged objects, thus improving forgery detection accuracy. The latter further adaptively decomposes the output of the SDSN to obtain corresponding high and low frequencies, and applies a divide-and-conquer strategy to refine each frequency band, mitigating the optimization difficulties caused by coupled forgery traces across different frequencies, thereby better capturing the pixels belonging to the forged object to improve localization accuracy. Experiments on multiple datasets demonstrate that our method outperforms state-of-the-art image forgery detection and localization techniques both qualitatively and quantitatively.
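The amplitude/phase split that SDSN builds on is the standard Fourier decomposition of an image; a minimal sketch:

```python
import torch

def spectral_decompose(img):
    """Split an image into amplitude and phase spectra and recompose it.

    img: (B, C, H, W) real-valued tensor
    """
    spec = torch.fft.fft2(img, norm="ortho")
    amplitude, phase = spec.abs(), spec.angle()
    # The image is exactly recoverable from the two components.
    recon = torch.fft.ifft2(torch.polar(amplitude, phase), norm="ortho").real
    return amplitude, phase, recon
```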
602: Exploring Semantic Masked Autoencoder for Self-supervised Point Cloud Understanding
Authors: Yixin Zha, Chuxin Wang, Wenfei Yang, Tianzhu Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Point cloud understanding aims to acquire robust and general feature representations from unlabeled data. Masked point modeling-based methods have recently shown significant performance across various downstream tasks. However, these pre-training methods rely on random masking strategies, which establish the perception of point clouds by restoring corrupted inputs but prevent self-supervised models from capturing reasonable semantic relationships. To address this issue, we propose the Semantic Masked Autoencoder, which comprises two main components: a prototype-based component semantic modeling module and a component semantic-enhanced masking strategy. Specifically, in the component semantic modeling module, we design a component semantic guidance mechanism that directs a set of learnable prototypes to capture the semantics of different components of objects. Leveraging these prototypes, we develop a component semantic-enhanced masking strategy that addresses the limitations of random masking in effectively covering complete component structures. Furthermore, we introduce a component semantic-enhanced prompt-tuning strategy, which further leverages these prototypes to improve the performance of pre-trained models on downstream tasks. Extensive experiments on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of the proposed modules.
605: Multi-Label Text Classification with Label Attention Aware and Correlation Aware Contrastive Learning
Authors: Zhengzhong Zhu, Pei Zhou, Zeting Li, Kejiang Chen, Jiangping Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Multi-label text classification (MLTC) is a challenging task where each document can be associated with multiple interdependent labels. This task is complicated by two key issues: the intricate correlations among labels and the partial overlap between labels and text relevance. Existing methods often fail to capture the semantic dependencies between labels or struggle to handle the ambiguities caused by partial overlaps, resulting in suboptimal representation learning.
To address these challenges, we propose the Unified Contextual and Label-Aware Framework (UCLAF), which integrates a Label Attention Aware Network (LAN) and Correlation Aware Contrastive Learning (CACL) in a synergistic design. The Label Attention Aware Network explicitly models label dependencies by embedding labels and texts into a shared semantic space, aligning text representations with label semantics. Meanwhile, Correlation Aware Contrastive Learning refines these representations by dynamically modeling sample-level relationships, leveraging a contrastive loss function that accounts for the proportional overlap of labels between samples. This complementary approach enables UCLAF to jointly address complex label correlations and partial label overlaps.
Extensive experiments on benchmark datasets demonstrate that UCLAF significantly outperforms state-of-the-art methods, showcasing its effectiveness in improving both representation learning and classification performance in MLTC tasks. We will release our code after the paper is accepted.
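One way to realize a contrastive loss with proportional label overlap as soft targets (a sketch of the idea; the exact loss in the paper may differ):

```python
import torch
import torch.nn.functional as F

def correlation_aware_contrastive(z, labels, tau=0.1):
    """Contrast samples using Jaccard label overlap as soft targets.

    z:      (B, D) text representations
    labels: (B, L) multi-hot label matrix
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau
    y = labels.float()
    inter = y @ y.t()
    union = y.sum(1, keepdim=True) + y.sum(1) - inter
    overlap = inter / union.clamp(min=1)  # proportional overlap in [0, 1]
    mask = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    target = overlap * mask
    target = target / target.sum(1, keepdim=True).clamp(min=1e-8)
    log_prob = sim.masked_fill(~mask, float("-inf")).log_softmax(dim=1)
    return -(target * log_prob.masked_fill(~mask, 0.0)).sum(1).mean()
```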
614: Learn to Think: Bootstrapping LLM Logic Through Graph Representation Learning
Authors: Hang Gao, Chenhao Zhang, Tie Wang, Junsuo Zhao, Fengge Wu, Changwen Zheng, Huaping Liu
Location: Guangzhou | Day: TBD
Show Abstract
Large Language Models (LLMs) have achieved remarkable success across various domains. However, they still face significant challenges, including high computational costs for training and limitations in solving complex reasoning problems. Although existing methods have extended the reasoning capabilities of LLMs through structured paradigms, these approaches often rely on task-specific prompts and predefined reasoning processes, which constrain their flexibility and generalizability. To address these limitations, we propose a novel framework that leverages graph learning to enable more flexible and adaptive reasoning capabilities for LLMs. Specifically, this approach models the reasoning process of a problem as a graph and employs LLM-based graph learning to guide the adaptive generation of each reasoning step. To further enhance the adaptability of the model, we introduce a Graph Neural Network (GNN) module to perform representation learning on the generated reasoning process, enabling real-time adjustments to both the model and the prompt. Experimental results demonstrate that this method significantly improves reasoning performance across multiple tasks without requiring additional training or task-specific prompt design. Code can be found at https://github.com/zch65458525/L2T.
623: Learning Heterogeneous Performance-Fairness Trade-offs in Federated Learning
Authors: Rongguang Ye, Ming Tang
Location: Guangzhou | Day: TBD
Show Abstract
Recent methods leverage a hypernet to handle the performance-fairness trade-offs in federated learning. This hypernet maps clients' preferences between model performance and fairness to preference-specific models on the trade-off curve, known as the local Pareto front. However, existing methods typically adopt a uniform preference sampling distribution to train the hypernet across clients, neglecting the inherent heterogeneity of their local Pareto fronts. Meanwhile, from the perspective of generalization, they do not consider the gap between local and global Pareto fronts on the global dataset. To address these limitations, we propose HetPFL to effectively learn both local and global Pareto fronts. HetPFL comprises Preference Sampling Adaptation (PSA) and Preference-aware Hypernet Fusion (PHF). PSA adaptively determines the optimal preference sampling distribution for each client to accommodate heterogeneous local Pareto fronts, while PHF performs preference-aware fusion of clients' hypernets to ensure the performance of the global Pareto front. We prove that HetPFL converges linearly with respect to the number of rounds, under weaker assumptions than existing methods. Extensive experiments on four datasets show that HetPFL significantly outperforms seven baselines in terms of the quality of the learned local and global Pareto fronts.
632: Multimodal Knowledge Retrieval-Augmented Iterative Alignment for Satellite Commonsense Conversation
Authors: Qian Li, Xuchen Li, Zongyu Chang, Yuzheng Zhang, Cheng Ji, Shangguang Wang
Location: Guangzhou | Day: TBD
Show Abstract
Satellite technology has significantly influenced our daily lives, as seen in applications such as navigation and communication. With its development, a vast amount of multimodal satellite commonsense data has been generated, creating an urgent demand for conversation over satellite data. However, existing large language models suffer from prevalent hallucinations and poor comprehensibility on multimodal satellite data, owing to its highly specialized content and partial information opacity. To address these issues, we propose a multimodal satellite knowledge retrieval-augmented iterative alignment framework (Sat-RIA) for satellite commonsense conversation. We first construct multi-view retrieval expert knowledge, incorporating a satellite expert database, satellite rules, a satellite image database, and a satellite knowledge graph, to reduce hallucinations and enhance the interpretability of responses. We next design commonsense conversation instructions to make the answers more legible and understandable. Furthermore, a retrieval-augmented iterative alignment module refines response precision by aligning outputs with task-specific standards through multi-stage evaluations.
Finally, we construct satellite multi-turn dialogue and visual question-answer datasets for a more comprehensive evaluation of satellite commonsense conversation. Experimental results demonstrate that Sat-RIA outperforms existing large language models and provides more comprehensible answers with fewer hallucinations.
636: Variational Multi-Modal Hypergraph Attention Network for Multi-Modal Relation Extraction
Authors: Qian Li, Cheng Ji, Shu Guo, Kun Peng, Qianren Mao, Shangguang Wang
Location: Guangzhou | Day: TBD
Show Abstract
Multi-modal relation extraction (MMRE) is a challenging task that seeks to identify relationships between entities with textual and visual attributes. However, existing methods struggle to handle the complexities posed by multiple entity pairs within a single sentence that share similar contextual information (e.g., identical text and image content). These scenarios amplify the difficulty of distinguishing relationships and hinder accurate extraction. To address these limitations, we propose the variational multi-modal hypergraph attention network (VM-HAN), a novel and robust framework for MMRE. Unlike previous approaches, VM-HAN constructs a multi-modal hypergraph for each sentence-image pair, explicitly modeling high-order intra-/inter-modal correlations among different entity pairs in the same context. This design enables a more detailed and nuanced understanding of entity relationships by capturing intricate cross-modal interactions that are often overlooked. Additionally, we introduce the variational hypergraph attention network (V-HAN). This variational attention mechanism dynamically refines the hypergraph structure, enabling the model to effectively handle the inherent ambiguity and complexity of multi-modal data. Comprehensive experiments on benchmark MMRE datasets demonstrate that VM-HAN achieves state-of-the-art performance, significantly surpassing existing methods in both accuracy and efficiency.
637: AlphaGAT: A Two-Stage Learning Approach for Adaptive Portfolio Selection
Authors: Shicheng Li, Jinshan Zhang, Feng Wang
Location: Guangzhou | Day: TBD
Show Abstract
Portfolio selection is a critical task in finance, involving the allocation of resources across various assets. However, current methods often struggle to maintain robust performance due to the inherent low signal-to-noise ratio in raw financial data and shifts in data distribution. We propose AlphaGAT, a novel two-stage learning approach for portfolio selection, designed to adapt to different market scenarios. Inspired by the concept of alpha factors, which transform historical market data into actionable signals, the first stage introduces an advanced model named CATimeMixer for alpha factor generation with a novel loss function to improve the effectiveness and robustness. CATimeMixer integrates TimeMixer with Conv1D (C) and cross-asset Attention (A). Specifically, Conv1D enhances TimeMixer by capturing trend and seasonal features across different scales, while cross-asset attention enables TimeMixer to extract interrelationships between different assets. The second stage applies reinforcement learning to dynamically adjust weights, integrating alpha factors into trading signals. Recognizing the varying effectiveness of alpha factors across different periods, our RL agent innovatively transforms the alpha factors into graphs and employs graph attention networks (GAT) to discern the significance of different alpha factors, enhancing policy robustness. Extensive experiments on real-world market data show that our approach outperforms state-of-the-art methods.
641: Free Lunch of Image-mask Alignment for Anomaly Image Generation and Segmentation
Authors: Xiangyue Li, Xiaoyang Wang, Zhibin Wan, Quan Zhang, Yupei Wu, Tao Deng, Mingjie Sun
Location: Guangzhou | Day: TBD
Show Abstract
This paper aims at generating anomalous images and their segmentation labels to address the lack of real-world anomaly samples and privacy issues. Departing from conventional approaches that use masks solely to guide the generation of anomaly images, we propose a dual-branch training strategy for the generative model. This strategy enables the simultaneous production of anomaly images and masks, with an alignment regularization loss that ensures the coherence between the generated images and their masks. During inference, only the image-generation branch is activated to produce synthetic samples for training the downstream segmentation model. Furthermore, we propose to integrate the well-trained generative model into the training of segmentation models, utilizing a generative feedback loss to refine the segmentation model’s performance. Experiments show our method’s IoU metrics exceed previous methods by 5.03%, 5.68% and 16.63% on Real-IAD (industrial), polyp (medical), and Floor Dirty (indoor) datasets. The code is publicly accessible at https://github.com/huan-yin/anomaly-alignment.
651: LensNet: An End-to-End Learning Framework for Empirical Point Spread Function Modeling and Lensless Imaging Reconstruction
Authors: Jiesong Bai, Yuhao Yin, Yihang Dong, Xiaofeng Zhang, Chi-Man Pun, Xuhang Chen
Location: Guangzhou | Day: TBD
Show Abstract
Lensless imaging stands out as a promising alternative to conventional lens-based systems, particularly in scenarios demanding ultracompact form factors and cost-effective architectures. However, such systems are fundamentally governed by the Point Spread Function (PSF), which dictates how a point source contributes to the final captured signal. Traditional lensless techniques often require explicit calibrations and extensive pre-processing, relying on static or approximate PSF models. These rigid strategies can result in limited adaptability to real-world challenges, including noise, system imperfections, and dynamic scene variations, thus impeding high-fidelity reconstruction. In this paper, we propose LensNet, an end-to-end deep learning framework that integrates spatial-domain and frequency-domain representations in a unified pipeline. Central to our approach is a learnable Coded Mask Simulator (CMS) that enables dynamic, data-driven estimation of the PSF during training, effectively mitigating the shortcomings of fixed or sparsely calibrated kernels. By embedding a Wiener filtering component, LensNet refines global structure and restores fine-scale details, thus alleviating the dependency on multiple handcrafted pre-processing steps. Extensive experiments demonstrate LensNet’s robust performance and superior reconstruction quality compared to state-of-the-art methods, particularly in preserving high-frequency details and attenuating noise. The proposed framework establishes a novel convergence between physics-based modeling and data-driven learning, paving the way for more accurate, flexible, and practical lensless imaging solutions for applications ranging from miniature sensors to medical diagnostics. The link of code is https://github.com/baijiesong/Lensnet.
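The Wiener component referenced above is the classical frequency-domain filter; a minimal numpy sketch (LensNet learns the PSF through its Coded Mask Simulator rather than taking a fixed calibrated one as below):

```python
import numpy as np

def wiener_deconvolve(measurement, psf, k=1e-3):
    """Classical Wiener deconvolution in the frequency domain.

    measurement: (H, W) lensless sensor capture
    psf:         (H, W) point spread function, same size, roughly centered
    k:           noise-to-signal regularizer
    """
    H = np.fft.fft2(np.fft.ifftshift(psf))
    Y = np.fft.fft2(measurement)
    X = np.conj(H) * Y / (np.abs(H) ** 2 + k)
    return np.real(np.fft.ifft2(X))
```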
653: OS-GCL: A One-Shot Learner in Graph Contrastive Learning
Authors: Cheng Ji, Chenrui He, Qian Li, Qingyun Sun, Xingcheng Fu, Jianxin Li
Location: Guangzhou | Day: TBD
Show Abstract
Graph contrastive learning (GCL) enhances the self-supervised learning capacity of graph representation learning. Nevertheless, previous research has neglected one fundamental nature of GCL: graph contrastive learning operates as a one-shot learner, guided by the widely utilized noise contrastive estimation (e.g., the InfoNCE loss). To investigate the factors contributing to this one-shot learner essence, we theoretically analyze the InfoNCE-based objective and derive its equivalent form as a softmax-based cross-entropy function. We conclude that InfoNCE-based GCL is effectively a (2n-1)-way 1-shot classifier, where n is the number of nodes: each sample represents a unique ideational class, and each class contains only one sample. Consequently, the one-shot learning nature of GCL leads to the issue of a limited self-supervised signal. To address this issue, we propose a One-Shot Learner in Graph Contrastive Learning (OS-GCL). Firstly, we estimate the potential probability distributions of the deterministic node features and the discrete graph topology. Secondly, we develop a probabilistic message-passing mechanism to propagate probability (of features) over probability (of topology). Thirdly, we propose the ProbNCE loss function to contrast distributions. Extensive experimental results demonstrate the superiority of OS-GCL. To the best of our knowledge, this is the first study to examine the one-shot learning essence and the limited self-supervised signal issue of GCL.
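The (2n-1)-way reading can be made concrete. With two augmented views u and v of n nodes, similarity s and temperature tau, the standard InfoNCE term for anchor u_i is a softmax cross-entropy whose candidate set holds one positive and 2n-2 negatives, i.e., 2n-1 single-sample classes:

```latex
\mathcal{L}_i = -\log
\frac{\exp\!\big(s(u_i, v_i)/\tau\big)}
     {\exp\!\big(s(u_i, v_i)/\tau\big)
      + \sum_{j \neq i} \exp\!\big(s(u_i, u_j)/\tau\big)
      + \sum_{j \neq i} \exp\!\big(s(u_i, v_j)/\tau\big)}
```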
680: Distribution-Aware Online Learning for Urban Spatiotemporal Forecasting on Streaming Data
Authors: Chengxin Wang, Gary Tan, Swagato Barman Roy, Beng Chin Ooi
Location: Guangzhou | Day: TBD
Show Abstract
The intrinsic non-stationarity of urban spatiotemporal (ST) streams, particularly unique distribution shifts that evolve over time, poses substantial challenges for accurate urban ST forecasting. Existing works often overlook these dynamic shifts, limiting their ability to adapt to evolving trends effectively. To address this challenge, we propose DOL, a novel Distribution-aware Online Learning framework designed to handle the unique shifts in urban ST streams. DOL introduces a streaming update mechanism that leverages streaming memories to strategically adapt to gradual distribution shifts. By aligning network updates with these shifts, DOL avoids unnecessary updates, reducing computational overhead while improving prediction accuracy. DOL also incorporates an adaptive spatiotemporal network with a location-specific learner, enabling it to handle diverse urban distribution shifts across locations. Experimental results on four real-world datasets confirm DOL’s superiority over state-of-the-art models. The source code is available at https://github.com/cwang-nus/DOL.
690: Riding the Wave: Multi-Scale Spatial-Temporal Graph Learning for Highway Traffic Flow Prediction Under Overload Scenarios
Authors: Xigang Sun, Jiahui Jin, Hancheng Wang, Xiangguo Sun, Xiaoliang Wang, Jun Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Highway traffic flow prediction under overload scenarios (HIPO) is a critical problem in intelligent transportation systems, which aims to forecast future traffic patterns on highway segments during periods of exceptionally high demand. Despite its importance, this problem has rarely been explored in recent research due to the unique challenges posed by irregular flow patterns, complex traffic behaviors, and sparse contextual data. In this paper, we propose a Heterogeneous Spatial-Temporal graph network With Adaptive contrastiVE learning (HST-WAVE) to address the HIPO problem. Specifically, we first construct a heterogeneous traffic graph according to the physical highway structure. Then, we develop a multi-scale temporal weaving Transformer and a coupled heterogeneous graph attention network to capture the irregular traffic flow patterns and complex transition behaviors. Furthermore, we introduce an adaptive temporal enhancement contrastive learning strategy to bridge the gap between divergent temporal patterns and mitigate data sparsity. We conduct extensive experiments on two real-world highway network datasets (No. G56 and G60 in Hangzhou, China), showing that our model can effectively handle the HIPO problem and achieve state-of-the-art performance. The source code is available at https://github.com/luck-seu/HST-WAVE.
693: Incorporating Legal Logic into Deep Learning: An Intelligent Approach to Probation Prediction
Authors: Qinghua Wang, Xu Zhang, Lingyan Yang, Rui Shao, Bonan Wang, Fang Wang, Cunquan Qu
Location: Guangzhou | Day: TBD
Show Abstract
Probation is a crucial institution in modern criminal law, embodying the principles of fairness and justice while contributing to the harmonious development of society. Despite its importance, the current Intelligent Judicial Assistant System (IJAS) lacks dedicated methods for probation prediction, and research on the underlying factors influencing probation eligibility remains limited. In addition, probation eligibility requires a comprehensive analysis of both criminal circumstances and remorse. Much of the existing research in IJAS relies primarily on data-driven methodologies, which often overlook the legal logic underpinning judicial decision-making. To address this gap, we propose a novel approach that integrates legal logic into deep learning models for probation prediction, implemented in three distinct stages. First, we construct a specialized probation dataset that includes fact descriptions and probation legal elements (PLEs). Second, we design a dedicated probation prediction model named the Multi-Task Dual-Theory Probation Prediction Model (MT-DT), which is grounded in the legal logic of probation and the Dual-Track Theory of Punishment. Finally, our experiments on the probation dataset demonstrate that the MT-DT model outperforms baseline models, and an analysis of the underlying legal logic further validates the effectiveness of the proposed approach.
694: SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation
Authors: Zhaoxi Mu, Xinyu Yang, Gang Wang
Location: Guangzhou | Day: TBD
Show Abstract
While contemporary speech separation technologies adeptly process lengthy mixed audio waveforms, they are frequently challenged by the intricacies of real-world environments, including noisy and reverberant settings, which can result in artifacts or distortions in the separated speech. To overcome these limitations, we introduce SepALM, a pioneering approach that employs audio language models (ALMs) to rectify and re-synthesize speech within the text domain following preliminary separation. SepALM comprises four core components: a separator, a corrector, a synthesizer, and an aligner. By integrating an ALM-based end-to-end error correction mechanism, we mitigate the risk of error accumulation and circumvent the optimization hurdles typically encountered in conventional methods that amalgamate automatic speech recognition (ASR) with large language models (LLMs). Additionally, we have developed Chain-of-Thought (CoT) prompting and knowledge distillation techniques to facilitate the reasoning and training processes of the ALM. Our experiments substantiate that SepALM not only elevates the precision of speech separation but also markedly bolsters adaptability in novel acoustic environments.
695: Subgraph Information Bottleneck with Causal Dependency for Stable Molecular Relational Learning
Authors: Peiliang Zhang, Jingling Yuan, Chao Che, Yongjun Zhu, Lin Li
Location: Guangzhou | Day: TBD
Show Abstract
Molecular Relational Learning (MRL) is widely applied in the molecular sciences. Recent studies attempt to retain molecular core information (e.g., substructures) via the Graph Information Bottleneck but focus primarily on information compression without considering the causal dependencies of chemical reactions among substructures. This oversight neglects the core factors that determine molecular relationships, making it challenging to maintain stable MRL on distribution-shifted data. To bridge this gap, we propose the Causal Subgraph Information Bottleneck (CausalGIB) for stable MRL. CausalGIB leverages causal dependency to guide substructure representation and integrates a subgraph information bottleneck to optimize the core substructure representation, generating stable representations. Specifically, we distinguish causal and confounding substructures by noise injection and substructure interaction based on causal analysis. Furthermore, by minimizing the discrepancy between causal and confounding information within the subgraph information bottleneck, CausalGIB captures core substructures composed of causal substructures and aggregates them into molecular representations to improve their stability. Experimental results on nine datasets demonstrate that CausalGIB outperforms state-of-the-art models on two tasks and significantly enhances the model's stability on distribution-shifted data.
719: Revealing Concept Shift in Spatio-Temporal Graphs via State Learning
Authors: Kuo Yang, Yunhe Guo, Qihe Huang, Zhengyang Zhou, Yang Wang
Location: Guangzhou | Day: TBD
Show Abstract
Dynamic graphs are ubiquitous in the real world, presenting the temporal evolution of individuals within spatial associations. Recently, research on dynamic graph learning has flourished, striving to capture evolutionary patterns and spatial correlations more effectively. However, existing methods still fail to address the issue of concept shift in dynamic graphs. Concept shift manifests as a distribution shift in the mapping pattern between historical observations and future evolution. The reason is that some environment variables in dynamic graphs exert varying effects on evolution patterns, yet these variables are not effectively captured by the models, leading to the intractable concept shift issue. To tackle this issue, we propose a State-driven environment inference framework (Samen) that equips dynamic graph learning with concept generalization ability. Firstly, we propose a two-stage environment inference and compression strategy: from the perspective of state space, we introduce a prefix-suffix collaborative state learning mechanism to bidirectionally model the spatio-temporal states, and a hierarchical state compressor is further designed to refine the state information that gives rise to concept shift. Secondly, we propose a skip-connection spatio-temporal prediction module, which effectively utilizes the inferred environments to improve the model's generalization capability. Finally, we select seven datasets from different domains to validate the effectiveness of our model. By comparing the performance of different models on samples with concept shift, we verify that Samen attains a generalization capacity that existing methods fail to achieve.
728: Advancing Stain Transfer for Multi-Biomarkers: A Human Annotation-Free Method Based on Auxiliary Task Supervision
Authors: Siyuan Xu, Haofei Song, Yingjiao Deng, Jiansheng Wang, Yan Wang, Qingli Li
Location: Guangzhou | Day: TBD
Show Abstract
Histopathological examination primarily relies on hematoxylin and eosin (H&E) and immunohistochemical (IHC) staining. Though IHC provides more crucial molecular information for diagnosis, it is more costly than H&E staining. Stain transfer technology seeks to efficiently generate virtual IHC images from H&E images. While current deep learning-based methods have made progress, they still struggle to maintain pathological and structural consistency across biomarkers without pixel-level aligned references. To address this problem, we propose an Auxiliary Task supervision-based Stain Transfer method for multi-biomarkers (ATST-Net), which pioneeringly employs human annotation-free masks as ground truth (GT). ATST-Net ensures pathological consistency, structural preservation, and style transfer. It automatically annotates H&E masks in a cost-effective manner by utilizing consecutive IHC sections. Multiple auxiliary tasks provide diverse supervisory information on the location and intensity of biomarker expression, ensuring model accuracy and interpretability. We design a pretrained model-based generator to extract deep features from H&E images, improving generalization performance. Extensive experiments demonstrate the effectiveness of ATST-Net's components. Compared to existing methods, ATST-Net achieves state-of-the-art (SOTA) accuracy on datasets with multiple biomarkers and intensity levels, while also demonstrating high practical value. Code is available at https://github.com/SikangSHU/ATST-Net.
740: Beyond Individual and Point: Next POI Recommendation via Region-aware Dynamic Hypergraph with Dual-level Modeling
Authors: Xixi Li, Zhuo Gu, Rui Yao, Yong Zhou, Hancheng Zhu, Jiaqi Zhao, Wen-liang Du
Location: Guangzhou | Day: TBD
Show Abstract
Next POI recommendation contributes to the prosperity of various intelligent location-based services. Existing studies focus on exploring sequential patterns and POI interactions using sequential and graph-based methods to enhance recommendation performance. However, they do not effectively exploit geographical information. In addition, methods that model mobility patterns from an individual's limited data may suffer from data sparsity and the information-cocoon problem. Moreover, most graph structures focus on adjacent nodes, failing to capture potential high-order associations among POIs. To address these challenges, we propose the Region-aware dynamic Hypergraph learning method with Dual-level interaction Modeling (ReHDM), which exploits users' dynamic mobility beyond the individual and point levels. Specifically, ReHDM utilizes regional encoding to mine the potential spatial relationships among POIs with coarse-grained geographical information. By incorporating POI-level and trajectory-level associations within a hypergraph convolutional network, ReHDM comprehensively captures cross-user collaborative information. Furthermore, ReHDM captures not only dependencies among POIs within each trajectory of a single user, but also the high-order collaborative information across individual user trajectories and associated users' trajectories. Experimental results on three public datasets demonstrate the superiority of ReHDM over the state-of-the-art.
749: Coupling Category Alignment for Graph Domain Adaptation
Authors: Nan Yin, Xiao Teng, Zhiguang Cao, Mengzhu Wang
Location: Guangzhou | Day: TBD
Show Abstract
Graph domain adaptation (GDA), which transfers knowledge from a labeled source domain to an unlabeled target graph domain, attracts considerable attention in numerous fields. However, existing methods commonly employ message-passing neural networks (MPNNs) to learn domain-invariant representations by aligning the entire domain distribution, inadvertently neglecting category-level distribution alignment and potentially causing category confusion. To address the problem, we propose an effective framework named Coupling Category Alignment (CoCA) for GDA, which effectively addresses the category alignment issue with theoretical guarantees. CoCA incorporates a graph convolutional network branch and a graph kernel network branch, which explore graph topology in implicit and explicit manners. To mitigate category-level domain shifts, we leverage knowledge from both branches, iteratively filtering highly reliable samples from the target domain using one branch and fine-tuning the other accordingly. Furthermore, with these reliable target domain samples, we incorporate the coupled branches into a holistic contrastive learning framework. This framework includes multi-view contrastive learning to ensure consistent representations across the dual branches, as well as cross-domain contrastive learning to achieve category-level domain consistency. Theoretically, we establish a sharper generalization bound, which ensures the effectiveness of category alignment. Extensive experiments on benchmark datasets validate the superiority of the proposed CoCA compared with baselines.
751: Sharpness-aware Zeroth-order Optimization for Graph Transformers
Authors: Yang Liu, Chuan Zhou, Yuhan Lin, Shuai Zhang, Yang Gao, Zhao Li, Shirui Pan
Location: Guangzhou | Day: TBD
Show Abstract
Graph Transformers (GTs) have emerged as powerful tools for handling graph-structured data through global attention mechanisms. While GTs can effectively capture long-range dependencies, they introduce optimization difficulties due to their complex, non-differentiable operators, which cannot be directly handled by standard gradient-based optimizers (such as Adam or AdamW). To investigate these issues, this work adopts the Zeroth-Order Optimization (ZOO) technique. However, directly integrating ZOO incurs considerable challenges due to the sharp loss landscape and steep gradients of the GT parameter space. Motivated by these observations, we propose a Sharpness-aware Zeroth-order Optimizer (SZO) that combines the Sharpness-Aware Minimization (SAM) technique, which facilitates convergence within a flatter neighborhood, with parallel computing for efficient gradient estimation. Theoretically, we provide a comprehensive analysis of the optimizer from both convergence and generalization perspectives. Empirically, we conduct extensive experiments on various classical GTs across a wide range of benchmark datasets, which underscore the superior performance of SZO over state-of-the-art optimizers.
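As a rough illustration of the ingredients named above (not the authors' SZO; function names and hyperparameters are ours), a two-point zeroth-order gradient estimate can be combined with a SAM-style ascent step as follows:

```python
import numpy as np

def zo_grad(loss_fn, theta, mu=1e-3, n_dirs=8, rng=None):
    """Two-point zeroth-order gradient estimate averaged over random Gaussian
    directions; the n_dirs loss evaluations are independent, so they could be
    parallelized as the abstract suggests."""
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(theta)
    for _ in range(n_dirs):
        u = rng.standard_normal(theta.shape)
        g += (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu) * u
    return g / n_dirs

def sharpness_aware_zo_step(loss_fn, theta, lr=1e-2, rho=0.05):
    """One SAM-flavored zeroth-order step: estimate the gradient, move to the
    (approximate) worst point within a rho-ball, then descend with the
    gradient estimated there, biasing convergence toward flatter regions."""
    g = zo_grad(loss_fn, theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # SAM ascent direction
    return theta - lr * zo_grad(loss_fn, theta + eps)
```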
754: Towards Equilibrium: An Instantaneous Probe-and-Rebalance Multimodal Learning Approach
Authors: Yang Yang, Xixian Wu, Qing-Yuan Jiang
Location: Guangzhou | Day: TBD
Show Abstract
The multimodal imbalance problem has been extensively studied to prevent the undesirable scenario where multimodal performance falls below that of unimodal models. However, existing methods typically assess the strength of modalities and perform learning simultaneously, i.e., while the model is still in an imbalanced state. This deferred strategy fails to rebalance multimodal learning instantaneously, leading to performance degradation. To address this, we propose a novel multimodal learning approach, termed instantaneous probe-and-rebalance multimodal learning (IPRM), which employs a two-pass forward method to first probe (but not learn) and then perform rebalanced learning in a balanced state. Concretely, we first employ the geodesic multimodal mixup (GMM) to incorporate the fusion representation and probe modality strength in the first forward phase. The weights are then instantaneously recalibrated based on the probed strength, facilitating balanced training via the second forward pass. This process is applied dynamically throughout the entire training process. Extensive experiments reveal that our proposed IPRM outperforms all baselines, achieving state-of-the-art (SOTA) performance on numerous widely used datasets. The code is available at https://github.com/njustkmg/IJCAI25-IPRM.
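A minimal sketch of the two-pass idea as the abstract describes it (the `unimodal_loss` interface and the softmax weighting rule are our assumptions, and the GMM fusion step is omitted):

```python
import torch

def probe_and_rebalance_step(model, x_audio, x_visual, y, optimizer):
    # Pass 1 (probe, no learning): measure per-modality strength via loss.
    with torch.no_grad():
        l_a = model.unimodal_loss(x_audio, y, modality="audio")
        l_v = model.unimodal_loss(x_visual, y, modality="visual")
    # The weaker modality (higher loss) receives the larger weight, instantly.
    w = torch.softmax(torch.stack([l_a, l_v]), dim=0)
    # Pass 2 (learn): rebalanced objective under the freshly probed weights.
    optimizer.zero_grad()
    loss = w[0] * model.unimodal_loss(x_audio, y, modality="audio") \
         + w[1] * model.unimodal_loss(x_visual, y, modality="visual")
    loss.backward()
    optimizer.step()
    return loss.item()
```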
769: BILE: An Effective Behavior-based Latent Exploration Scheme for Deep Reinforcement Learning
Authors: Yiming Wang, Kaiyan Zhao, Yan Li, Leong Hou U
Location: Guangzhou | Day: TBD
Show Abstract
Efficient exploration of state spaces is critical for the success of deep reinforcement learning (RL). While many methods leverage exploration bonuses to encourage exploration instead of relying solely on extrinsic rewards, these bonus-based approaches often face challenges with learning efficiency and scalability, especially in environments with high-dimensional state spaces. To address these issues, we propose BehavIoral metric-based Latent Exploration (BILE). The core idea is to learn a compact representation within the behavioral metric space that preserves value differences between states. By introducing additional rewards to encourage exploration in this latent space, BILE drives the agent to visit states with higher value diversity and exhibit more behaviorally distinct actions, leading to more effective exploration of the state space. Additionally, we present a novel behavioral metric for efficient and robust training of the state encoder, backed by theoretical guarantees. Extensive experiments on high-dimensional environments, including realistic indoor scenarios in Habitat, robotic tasks in Robosuite, and challenging discrete Minigrid benchmarks, demonstrate the superiority and scalability of our method over other approaches.
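One plausible shape of a latent exploration bonus of this kind (a sketch under our assumptions; BILE's actual bonus and metric are specified in the paper):

```python
import torch

def latent_exploration_bonus(encoder, state, visited_states, beta=0.1):
    """Reward novelty in the behavioral-metric latent space: if the encoder
    maps states so that latent distance tracks value differences, distant
    latents correspond to behaviorally distinct, value-diverse states."""
    with torch.no_grad():
        z = encoder(state.unsqueeze(0))          # (1, d)
        zs = encoder(visited_states)             # (N, d)
        return beta * torch.cdist(z, zs).min().item()
```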
791: ESBN: Estimation Shift of Batch Normalization for Source-free Universal Domain Adaptation
Authors: Jiao Li, Houcheng Su, Bingli Wang, Yuandong Min, Mengzhu Wang, Nan Yin, Shanshan Wang, Jingcai Guo
Location: Guangzhou | Day: TBD
Show Abstract
Domain adaptation (DA) is crucial for transferring models trained in one domain to perform well in a different, often unseen domain. Traditional methods, including unsupervised domain adaptation (UDA) and source-free domain adaptation (SFDA), have made significant progress. However, most existing DA methods rely heavily on Batch Normalization (BN) layers, which are not optimal in source-free settings, where the source domain is unavailable for comparison. In this study, we propose a novel method, ESBN, which addresses the challenge of domain shift by adjusting the placement of normalization layers and replacing BN with Batch-free Normalization (BFN). Unlike BN, BFN is less dependent on batch statistics and provides more robust feature representations through instance-specific statistics. We systematically investigate the effects of different BN layer placements across various network configurations and demonstrate that selective replacement with BFN improves generalization performance. Extensive experiments on multiple domain adaptation benchmarks show that our approach outperforms state-of-the-art methods, particularly in challenging scenarios such as Open-Partial Domain Adaptation (OPDA).
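For intuition, a batch-free normalization layer can be built from instance-specific statistics, as sketched below; this generic per-instance variant (essentially instance normalization with affine parameters) is our stand-in and need not match the BFN used by ESBN:

```python
import torch
import torch.nn as nn

class BatchFreeNorm2d(nn.Module):
    """Normalizes each sample with its own per-channel statistics, so no
    batch-level running statistics (and no source batches) are required."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_features, 1, 1))

    def forward(self, x):                               # x: (N, C, H, W)
        mu = x.mean(dim=(2, 3), keepdim=True)           # per-instance stats
        var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        return self.weight * (x - mu) / torch.sqrt(var + self.eps) + self.bias

def replace_bn(module):
    """Recursively swap BatchNorm2d layers for the batch-free variant."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, BatchFreeNorm2d(child.num_features))
        else:
            replace_bn(child)
```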
799: Learning Real Facial Concepts for Independent Deepfake Detection
Authors: Ming-Hui Liu, Harry Cheng, Tianyi Wang, Xin Luo, Xin-Shun Xu
Location: Guangzhou | Day: TBD
Show Abstract
Deepfake detection models often struggle with generalization to unseen datasets, manifesting as misclassifying real instances as fake in target domains. This is primarily due to an overreliance on forgery artifacts and a limited understanding of real faces. To address this challenge, we propose a novel approach RealID to enhance generalization by learning a comprehensive concept of real faces while assessing the probabilities of belonging to the real and fake classes independently. RealID comprises two key modules: the Real Concept Capture Module (RealC^2) and the Independent Dual-Decision Classifier (IDC). With the assistance of a Multi-Real Memory, RealC^2 maintains various prototypes for real faces, allowing the model to capture a comprehensive concept of real class. Meanwhile, IDC redefines the classification strategy by making independent decisions based on the concept of the real class and the presence of forgery artifacts. Through the combined effect of the above modules, the influence of forgery-irrelevant patterns is alleviated, and extensive experiments on five widely used datasets demonstrate that RealID significantly outperforms existing state-of-the-art methods, achieving a 1.74% improvement in average accuracy.
803: AdaMixT: Adaptive Weighted Mixture of Multi-Scale Expert Transformers for Time Series Forecasting
Authors: Huanyao Zhang, Jiaye Lin, Wentao Zhang, Haitao Yuan, Guoliang Li
Location: Guangzhou | Day: TBD
Show Abstract
Multivariate time series forecasting involves predicting future values based on historical observations. However, existing approaches primarily rely on predefined single-scale patches or lack effective mechanisms for multi-scale feature fusion. These limitations hinder them from fully capturing the complex patterns inherent in time series, leading to constrained performance and insufficient generalizability. To address these challenges, we propose a novel architecture named Adaptive Weighted Mixture of Multi-Scale Expert Transformers (AdaMixT). Specifically, AdaMixT introduces various patches and leverages both General Pre-trained Models (GPM) and Domain-specific Models (DSM) for multi-scale feature extraction. To accommodate the heterogeneity of temporal features, AdaMixT incorporates a gating network that dynamically allocates weights among different experts, enabling more accurate predictions through adaptive multi-scale fusion. Comprehensive experiments on eight widely used benchmarks, including Weather, Traffic, Electricity, ILI, and four ETT datasets, consistently demonstrate the effectiveness of AdaMixT in real-world scenarios.
831: Can We Verify Step by Step for Incorrect Answer Detection?
Authors: Xin Xu, Shizhe Diao, Can Yang, Yang Wang
Location: Guangzhou | Day: TBD
Show Abstract
Chain-of-Thought (CoT) prompting has marked a significant advancement in enhancing the reasoning capabilities of large language models (LLMs). Previous studies have developed various extensions of CoT, which focus primarily on enhancing end-task performance. In addition, there has been research on assessing the quality of reasoning chains in CoT. This raises an intriguing question: Is it possible to predict the accuracy of LLM outputs by scrutinizing the reasoning chains they generate? To answer this research question, we introduce R2PE, a benchmark designed specifically to explore the relationship between reasoning chains and performance in various reasoning tasks spanning five different domains. This benchmark aims to measure the falsehood of the final output of LLMs based on the reasoning steps. To make full use of the information in multiple reasoning chains, we propose the process discernibility score (PDS) framework, which beats the answer-checking baseline by a large margin, yielding an average increase of 5.1% in F1 score and 2.97% in AUC-PR across all 45 subsets within R2PE. We further demonstrate PDS's efficacy in advancing open-domain QA accuracy. Codes and data are available at https://github.com/XinXU-USTC/R2PE.git. For further details on the appendix, please refer to https://arxiv.org/abs/2402.10528.
832: Shaping a Stabilized Video by Mitigating Unintended Changes for Concept-Augmented Video Editing
Authors: Mingce Guo, Jingxuan He, Yufei Yin, Zhangye Wang, Shengeng Tang, Lechao Cheng
Location: Guangzhou | Day: TBD
Show Abstract
Text-driven video editing powered by generative diffusion models holds significant promise for applications spanning film production, advertising, and beyond. However, the limited expressiveness of pre-trained word embeddings often restricts nuanced edits, especially when targeting novel concepts with specific attributes. In this work, we present a novel Concept-Augmented Textual Inversion (CATI) framework that flexibly integrates new object information from user-provided concept videos. By fine-tuning only the V (Value) projection in attention via Low-Rank Adaptation (LoRA), our approach preserves the original attention distribution of the diffusion model while efficiently incorporating external concept knowledge. To further stabilize editing results and mitigate the issue of attention dispersion when prompt keywords are modified, we introduce a Dual Prior Supervision (DPS) mechanism. DPS supervises cross-attention between the source and target prompts, preventing undesired changes to non-target areas and improving the fidelity of novel concepts. Extensive evaluations demonstrate that our plug-and-play solution not only maintains spatial and temporal consistency but also outperforms state-of-the-art methods in generating lifelike and stable edited videos. The source code is publicly available at https://guomc9.github.io/STIVE-PAGE/.
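The LoRA-on-V idea is standard enough to sketch: wrap the frozen V projection with a low-rank residual while leaving Q and K untouched. Attribute names such as `to_v` are hypothetical and depend on the diffusion backbone:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update
    W x + (alpha / r) * B A x, the usual LoRA parameterization."""
    def __init__(self, base: nn.Linear, r=4, alpha=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pre-trained weights frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Wrapping only the V projection of an attention block (name hypothetical):
# attn.to_v = LoRALinear(attn.to_v, r=4, alpha=4)
```

Because `lora_b` is zero-initialized, the wrapped layer starts out exactly equal to the pre-trained one, which is what allows the original attention distribution to be preserved at the start of fine-tuning.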
848: Streaming Multi-agent Pathfinding
Authors: Mingkai Tang, Lu Gan, Kaichen Zhang
Location: Guangzhou | Day: TBD
Show Abstract
The task of the multi-agent pathfinding (MAPF) problem is to navigate a team of agents from their start points to their goal points. However, this setup is unsuitable for assembly-line scenarios, which are periodic with long working hours. To address this issue, this study formalizes the streaming MAPF (S-MAPF) problem, which assumes that agents in the same agent stream have periodic start times and share the same action sequence. The proposed solution, Agent Stream Conflict-Based Search (ASCBS), tackles this problem by incorporating cyclic vertex/edge constraints to handle conflicts. Additionally, this work explores the potential use of the disjoint splitting strategy within ASCBS. Experimental results indicate that ASCBS surpasses traditional MAPF solvers in terms of runtime for scenarios with prolonged working hours.
850: SyncGaussian: Stable 3D Gaussian-Based Talking Head Generation with Enhanced Lip Sync via Discriminative Speech Features
Authors: Ke Liu, Jiwei Wei, Shiyuan He, Zeyu Ma, Chaoning Zhang, Ning Xie, Yang Yang
Location: Guangzhou | Day: TBD
Show Abstract
Generating high-fidelity talking heads that maintain stable head poses and achieve robust lip sync remains a significant challenge. Although methods based on 3D Gaussian Splatting (3DGS) offer a promising solution via point-based deformation, they suffer from inconsistent head dynamics and mismatched mouth movements due to unstable Gaussian initialization and incomplete speech features. To overcome these limitations, we introduce SyncGaussian, a 3DGS-based framework that ensures stable head poses, enhanced lip sync, and realistic appearances with real-time rendering. SyncGaussian employs a stable head Gaussian initialization strategy to mitigate head jitter by optimizing commonly used rough head pose parameters. To enhance lip sync, we propose a sync-enhanced encoder that leverages audio-to-text and audio-to-visual speech features. Guided by a tailored cosine similarity loss function, the encoder integrates discriminative speech features through a multi-level sync adaptation mechanism, enabling the learning of an adaptive speech feature space. Extensive experiments demonstrate that SyncGaussian outperforms state-of-the-art methods in image quality, dynamic motion, and lip sync, with the potential for real-time applications.
852: METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection
Authors: Yongqi Wang, Xinxiao Wu, Shuo Yang
Location: Guangzhou | Day: TBD
Show Abstract
Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories. Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories. They typically adopt a cascaded pipeline to first detect objects and then classify relationships based on the detected objects, which may lead to error propagation and thus suboptimal performance. In this paper, we propose Mutual EnhancemenT of Objects and Relationships (METOR), a query-based unified framework to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios. Under this framework, we first design a CLIP-based contextual refinement encoding module that extracts visual contexts of objects and relationships to refine the encoding of text features and object queries, thus improving the generalization of encoding to novel categories. Then we propose an iterative enhancement module to alternatively enhance the representations of objects and relationships by fully exploiting their interdependence to improve recognition performance. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate that our framework achieves state-of-the-art performance. Codes are at https://github.com/wangyongqi558/METOR.
854: Efficient Dynamic Ensembling for Multiple LLM Experts
Authors: Jinwu Hu, Yufeng Wang, Shuhai Zhang, Kai Zhou, Guohao Chen, Yu Hu, Bin Xiao, Mingkui Tan
Location: Guangzhou | Day: TBD
Show Abstract
Large language models (LLMs) have demonstrated impressive performance across various language tasks. However, the strengths of LLMs can vary due to different architectures, model sizes, areas of training data, etc. Therefore, ensembling the strengths of different LLM experts is critical to achieving consistent and satisfactory performance on diverse inputs across a wide range of tasks. However, existing LLM ensemble methods are either computationally intensive or incapable of leveraging complementary knowledge among LLM experts for various inputs. In this paper, we propose an efficient Dynamic Ensemble Reasoning paradigm, called DER, to integrate the strengths of multiple LLM experts conditioned on dynamic inputs. Specifically, we model the LLM ensemble reasoning problem as a Markov Decision Process, wherein an agent sequentially takes inputs, requests knowledge from an LLM candidate, and passes the output to a subsequent LLM candidate. Moreover, we devise a reward function to train a DER-Agent to dynamically select an optimal answering route given the input questions, aiming to achieve the highest performance with as few computational resources as possible. Finally, to fully transfer expert knowledge from the prior LLMs, we develop a Knowledge Transfer Prompt that enables subsequent LLM candidates to transfer complementary knowledge effectively. Experiments demonstrate that our method uses fewer computational resources to achieve better performance compared to state-of-the-art baselines. Code and appendix are available at https://github.com/Fhujinwu/DER.
903: NuMDS: An Efficient Local Search Algorithm for Minimum Dominating Set Problem
Authors: Rui Sun, Zhaohui Liu, Yiyuan Wang, Han Xiao, Jiangnan Li, Jiejiang Chen
Location: Guangzhou | Day: TBD
Show Abstract
The minimum dominating set (MDS) problem is a crucial NP-hard combinatorial optimization problem with wide applications in real-world scenarios. In this paper, we propose an efficient local search algorithm named NuMDS to solve the MDS problem, which comprises three key ideas. First, we introduce a domination propagation-based reduction method that fixes a portion of the vertices in a given graph. Second, we develop a novel two-phase initialization method based on the decomposition method. Third, we propose a multi-stage local search procedure, which adopts three different search manners according to the current stage of the search. We conduct extensive experiments to demonstrate the outstanding effectiveness of NuMDS, and the results clearly indicate that NuMDS outperforms previous state-of-the-art algorithms on almost all instances.
908: Imagination-Limited Q-Learning for Offline Reinforcement Learning
Authors: Wenhui Liu, Zhijian Wu, Jingchao Wang, Dingjiang Huang, Shuigeng Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Offline reinforcement learning seeks to derive improved policies entirely from historical data but often struggles with over-optimistic value estimates for out-of-distribution (OOD) actions. This issue is typically mitigated via policy constraints or conservative value regularization. However, these approaches may impose overly strict constraints or biased value estimates, potentially limiting performance improvements. To balance exploitation and restriction, we propose an Imagination-Limited Q-learning (ILQ) method, which aims to maintain the optimism that OOD actions deserve within appropriate limits. Specifically, we utilize the dynamics model to imagine OOD action-values, and then clip the imagined values with the maximum behavior values. This design maintains a reasonable evaluation of OOD actions to the furthest extent while avoiding over-optimism. Theoretically, we prove the convergence of the proposed ILQ under tabular Markov decision processes. In particular, we demonstrate that the error bound between the estimated and optimal values of OOD state-actions has the same magnitude as that of in-distribution ones, indicating that the bias in value estimates is effectively mitigated. Empirically, our method achieves state-of-the-art performance on a wide range of tasks in the D4RL benchmark.
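The clipping rule itself is simple enough to state in code (a sketch of the stated idea only; how the imagined and behavior values are estimated is the substance of the paper):

```python
import torch

def imagination_limited_value(q_imagined: torch.Tensor,
                              q_behavior_max: torch.Tensor) -> torch.Tensor:
    """Clip model-imagined OOD action-values at the maximum behavior value,
    retaining deserved optimism while keeping it within dataset support."""
    return torch.minimum(q_imagined, q_behavior_max)

# e.g., inside a Bellman target (all names hypothetical):
# target_q = r + gamma * imagination_limited_value(model_q(s2, a_ood),
#                                                  behavior_q_max(s2))
```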
952: EAVIT: Efficient and Accurate Human Value Identification From Text Data via LLMs
Authors: Wenhao Zhu, Yuhang Xie, Guojie Song, Xin Zhang
Location: Guangzhou | Day: TBD
Show Abstract
The rapid evolution of large language models (LLMs) has revolutionized various fields, including the identification and discovery of human values within text data. While traditional NLP models, such as BERT, have been employed for this task, their ability to represent textual data is significantly outperformed by emerging LLMs like GPTs. However, the performance of online LLMs often degrades when handling long contexts required for value identification, which also incurs substantial computational costs. To address these challenges, we propose EAVIT, an efficient and accurate framework for human value identification that combines the strengths of both locally fine-tunable and online black-box LLMs. Our framework employs a value detector—a small, local language model—to generate initial value estimations. These estimations are then used to construct concise input prompts for online LLMs, enabling accurate final value identification. To train the value detector, we introduce explanation-based training and data generation techniques specifically tailored for value identification, alongside sampling strategies to optimize the brevity of LLM input prompts. Our approach effectively reduces the number of input tokens by up to 1/6 compared to directly querying online LLMs, while consistently outperforming traditional NLP methods and other LLM-based strategies.
957: EyeSeg: An Uncertainty-Aware Eye Segmentation Framework for AR/VR
Authors: Zhengyuan Peng, Jianqing Xu, Shen Li, Jiazhen Ji, Yuge Huang, Jingyun Zhang, Jinmin Li, Shouhong Ding, Rizen Guo, Xin Tan, Lizhuang Ma
Location: Guangzhou | Day: TBD
Show Abstract
Human-machine interaction through augmented reality (AR) and virtual reality (VR) is increasingly prevalent, requiring accurate and efficient gaze estimation, which hinges on the accuracy of eye segmentation to enable smooth user experiences. We introduce EyeSeg, a novel eye segmentation framework designed to overcome key challenges that existing approaches struggle with: motion blur, eyelid occlusion, and train-test domain gaps. In these situations, existing models struggle to extract robust features, leading to suboptimal performance. Noting that these challenges can generally be quantified by uncertainty, we design EyeSeg as an uncertainty-aware eye segmentation framework for AR/VR, wherein we explicitly model the uncertainties by performing Bayesian uncertainty learning of a posterior under a closed-set prior. Theoretically, we prove that a statistic of the learned posterior indicates segmentation uncertainty levels, and empirically this statistic yields improvements over existing methods in downstream tasks such as gaze estimation. EyeSeg outputs an uncertainty score alongside the segmentation result, weighting and fusing multiple gaze estimates for robustness, which proves effective especially under motion blur, eyelid occlusion, and cross-domain challenges. Moreover, empirical results show that EyeSeg achieves segmentation improvements in MIoU, E1, F1, and ACC, surpassing previous approaches.
968: Heterogeneous Federated Learning with Scalable Server Mixture-of-Experts
Authors: Jingang Jiang, Yanzhao Chen, Xiangyang Liu, Haiqi Jiang, Chenyou Fan
Location: Guangzhou | Day: TBD
Show Abstract
Classical Federated Learning (FL) encounters significant challenges when deploying large models on power-constrained clients. To tackle this, we propose an asymmetric FL mechanism that enables the aggregation of compact client models into a comprehensive server model. We design the server model as a Mixture-of-Experts (MoE), where each expert has the same architecture as each client model. This uniformity allows for efficient fusion of the most pertinent client models to update each server expert, based on the measured relevance between each client and server expert. To address the Non-IID data issue, we further optimize the server-side MoE architecture by incorporating a main expert that always activates alongside a set of selectively activated routed experts. This configuration ensures a balance between learning general knowledge and specific data distribution. Our Fed-MoE framework is model-agnostic and has demonstrated notable improvements on vision FL tasks with million-scale ResNet backbones, and language tasks with billion-scale BERT and GPT-2 backbones.
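A minimal sketch of the described server architecture, assuming experts map (batch, d) features to (batch, d); the gating, top-k routing, and fusion details are our own choices:

```python
import torch
import torch.nn as nn

class FedMoEServer(nn.Module):
    """Main expert always active (general knowledge) plus top-k selectively
    activated routed experts (specific distributions), every expert sharing
    the client model architecture."""
    def __init__(self, make_expert, d_in, n_routed=8, k=2):
        super().__init__()
        self.main = make_expert()
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.gate = nn.Linear(d_in, n_routed)
        self.k = k

    def forward(self, x):                          # x: (batch, d_in)
        out = self.main(x)                         # always-on main expert
        top = self.gate(x).topk(self.k, dim=-1)
        w = torch.softmax(top.values, dim=-1)      # (batch, k)
        extra = torch.zeros_like(out)
        for j in range(self.k):
            for e in top.indices[:, j].unique().tolist():
                m = top.indices[:, j] == e         # samples routed to expert e
                extra[m] += w[m, j].unsqueeze(-1) * self.routed[e](x[m])
        return out + extra

# moe = FedMoEServer(lambda: nn.Sequential(nn.Linear(64, 64), nn.ReLU(),
#                                          nn.Linear(64, 64)), d_in=64)
```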
1002: Dynamic Multiple High-order Correlations Fusion with Noise Filtering for Incomplete Multi-view Noisy-label Learning
Authors: Kaixiang Wang, Xiaojian Ding, Fan Yang
Location: Guangzhou | Day: TBD
Show Abstract
Multi-view multi-label data often suffers from incomplete feature views and label noise. This paper is the first to address both challenges simultaneously, rectifying critical deficiencies in existing methodologies, which inadequately extract and fuse high-order structural correlations across views while lacking robust solutions to mitigate label noise. We introduce a dynamic multiple high-order correlations fusion method with noise filtering, specifically designed for incomplete multi-view noisy-label learning. By capitalizing on a dynamic multi-hypergraph neural network, inspired by the principles of ensemble learning, we adeptly capture and integrate high-order correlations among samples from different views. The model's capability is further augmented through an innovative hypergraph fusion technique based on random-walk theory, which empowers it to seamlessly amalgamate both structural and feature information. Moreover, we propose sophisticated noise-filtering matrices that are tightly embedded within the hypergraph neural network, devised to counteract the detrimental impact of label noise. Recognizing that label noise perturbs the data distribution in the label space, these filtering matrices exploit the distributional disparities between the feature and label spaces. The high-order structural information derived from both domains underpins the learning and efficacy of the noise-filtering matrices. Empirical evaluations on benchmark datasets unequivocally demonstrate that our method significantly outperforms contemporary state-of-the-art techniques.
1007: Towards Improved Risk Bounds for Transductive Learning
Authors: Bowei Zhu, Shaojie Li, Yong Liu
Location: Guangzhou | Day: TBD
Show Abstract
Transductive learning is a popular setting in statistical learning theory, reasoning from observed, specific training cases to specific test cases; it has been widely used in fields such as graph neural networks and semi-supervised learning. Existing results provide fast convergence rates based on traditional localization techniques, which require the surrogate function upper-bounding the uniform error within a localized region to be “sub-root”. We derive a new version of the concentration inequality for empirical processes in transductive learning and apply the generic chaining technique to relax these assumptions and obtain tighter results for empirical risk minimization. Furthermore, we focus on the generalization of the moment penalization algorithm. We design a novel estimator based on second-moment (variance) penalization and derive its learning rates; this is the first theoretical generalization analysis considering variance-based algorithms.
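For orientation, a generic second-moment (variance) penalization estimator, in the style of Maurer and Pontil's sample variance penalization, takes the form below; the paper's transductive variant may differ in its penalty and normalization:

```latex
\hat{f} \;=\; \operatorname*{arg\,min}_{f \in \mathcal{F}}\;
  \hat{R}_n(f) \;+\; \lambda \sqrt{\frac{\widehat{\mathrm{Var}}_n(\ell \circ f)}{n}}
```

Here $\hat{R}_n$ is the empirical risk and $\widehat{\mathrm{Var}}_n(\ell \circ f)$ the empirical variance of the losses; penalizing the variance trades a small amount of empirical risk for hypotheses whose losses concentrate faster.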
1010: Zero-Shot Machine Unlearning with Proxy Adversarial Data Generation
Authors: Huiqiang Chen, Tianqing Zhu, Xin Yu, Wanlei Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Machine unlearning aims to remove the influence of specific samples from a trained model. A key challenge in this process is over-unlearning, where the model’s performance on the remaining data significantly drops due to the change in the model’s parameters. Existing unlearning algorithms depend on the remaining data to prevent this issue. As such, these methods are inapplicable in a more practical scenario, where only the unlearning samples are available (i.e., zero-shot unlearning). This paper presents a novel framework, ZS-PAG, to fill this gap. Our approach offers three key innovations: (1) we approximate the inaccessible remaining data by generating adversarial samples; (2) leveraging the generated samples, we pinpoint a specific subspace to perform the unlearning process, therefore preventing over-unlearning in the challenging zero-shot scenario; and (3) we consider the influence of the unlearning process on the remaining samples and design an influence-based pseudo-labeling strategy. As a result, our method further improves the model’s performance after unlearning. The proposed method holds a theoretical guarantee, and experiments on various benchmarks validate the effectiveness and superiority of our proposed method over several baselines.
1012: Non-collective Calibrating Strategy for Time Series Forecasting
Authors: Bin Wang, Yongqi Han, Minbo Ma, Tianrui Li, Junbo Zhang, Feng Hong, Yanwei Yu
Location: Guangzhou | Day: TBD
Show Abstract
Deep learning-based approaches have demonstrated significant advancements in time series forecasting. Despite these ongoing developments, the complex dynamics of time series make it challenging to establish a rule of thumb for designing a golden model architecture. In this study, we argue that refining existing advanced models through a universal calibrating strategy can deliver substantial benefits at minimal resource cost, as opposed to designing and training a new model from scratch. We first identify a multi-target learning conflict in the calibrating process, which arises when optimizing variables across time steps, leading to the underutilization of the model's learning capabilities. To address this issue, we propose an innovative calibrating strategy called Socket+Plug (SoP). This approach retains an exclusive optimizer and early-stopping monitor for each predicted target within each Plug, while keeping the fully trained Socket backbone frozen. The model-agnostic nature of SoP allows it to directly calibrate the performance of any trained deep forecasting model, regardless of its specific architecture. Extensive experiments on various time series benchmarks and a spatio-temporal meteorological ERA5 dataset demonstrate the effectiveness of SoP, achieving up to a 22% improvement even when employing a simple MLP as the Plug (highlighted in Figure 1).
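A minimal sketch of the Socket+Plug recipe as the abstract describes it (the linear Plug, the MSE objective, and all names are our assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sop_calibrate(socket, train_loader, val_loader, d_model, n_targets,
                  max_epochs=50, patience_limit=5, lr=1e-3):
    """Freeze the trained Socket; give each predicted target its own Plug
    with a dedicated optimizer and early-stopping monitor."""
    for p in socket.parameters():
        p.requires_grad = False
    plugs = nn.ModuleList(nn.Linear(d_model, 1) for _ in range(n_targets))
    optims = [torch.optim.Adam(pl.parameters(), lr=lr) for pl in plugs]
    best = [float("inf")] * n_targets
    wait = [0] * n_targets
    for _ in range(max_epochs):
        for x, y in train_loader:                    # y: (batch, n_targets)
            h = socket(x).detach()                   # frozen Socket features
            for t in range(n_targets):
                if wait[t] >= patience_limit:        # target t early-stopped
                    continue
                loss = F.mse_loss(plugs[t](h).squeeze(-1), y[:, t])
                optims[t].zero_grad(); loss.backward(); optims[t].step()
        with torch.no_grad():                        # per-target monitors
            for t in range(n_targets):
                v = sum(F.mse_loss(plugs[t](socket(x)).squeeze(-1),
                                   y[:, t]).item() for x, y in val_loader)
                best[t], wait[t] = (v, 0) if v < best[t] else (best[t], wait[t] + 1)
    return plugs
```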
1020: Stability and Generalization for Stochastic (Compositional) Optimizations
Authors: Xiaokang Pan, Jin Liu, Hulin Kuang, Youqi Li, Lixing Chen, Zhe Qu
Location: Guangzhou | Day: TBD
Show Abstract
The use of estimators instead of raw stochastic gradients for updates has been shown to improve algorithm convergence rates, but their impact on generalization remains under-explored. In this paper, we investigate how estimators influence generalization. Our focus is on two widely studied problems: stochastic optimization (SO) and stochastic compositional optimization (SCO), both under convex and non-convex settings. For SO problems, we first analyze the generalization error of the STORM algorithm as a foundational step. We then extend our analysis to SCO problems by introducing an algorithmic framework that encompasses several popular algorithmic approaches. Through this framework, we conduct a generalization analysis, uncovering new insights into the impact of estimators on generalization. Subsequently, we provide a detailed analysis of three specific algorithms within this framework: SCGD, SCSC, and COVER, to explore the effects of different estimator strategies. Furthermore, in the context of SCO, we propose a novel definition of stability and a new decomposition of excess risk in the non-convex setting. Our analysis yields two key findings: (1) in SCO problems, eliminating the estimator for the gradient of the inner function does not impact generalization performance while significantly reducing computational and storage overhead; (2) faster convergence rates are consistently associated with better generalization performance.
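For reference, the STORM estimator analyzed as the foundational SO case replaces the plain stochastic gradient with a recursive momentum correction (Cutkosky and Orabona, 2019):

```latex
d_t = \nabla f(x_t;\, \xi_t) \;+\; (1 - a_t)\bigl(d_{t-1} - \nabla f(x_{t-1};\, \xi_t)\bigr),
\qquad
x_{t+1} = x_t - \eta_t\, d_t
```

The correction term reuses the same sample $\xi_t$ at two iterates, which is what reduces estimator variance without requiring giant batches; the paper asks how this same device affects stability and hence generalization.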
1024: MATCH: Modality-Calibrated Hypergraph Fusion Network for Conversational Emotion Recognition
Authors: Jiandong Shi, Ming Li, Lu Bai, Feilong Cao, Ke Lu, Jiye Liang
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal emotion recognition aims to identify emotions by integrating multimodal features derived from spoken utterances. However, existing work often neglects the calibration of conversational entities, focusing mainly on extracting potential intra- or cross-modal information. This leads to the underutilization of utterance information that is essential for accurately characterizing emotion. Additionally, the lack of effective modeling of conversational patterns limits the ability to capture emotional pathways across contexts, modalities and speakers, impacting the overall emotional understanding. In this study, we propose the modality-calibrated hypergraph fusion network (MATCH), which leverages multimodal fusion and hypergraph learning techniques to address these challenges. In particular, we introduce an entity calibration strategy that refines the representations of conversational entities both at the modality and context levels, allowing for deeper insights into emotion-related cues. Furthermore, we present an emotion-aligned hypergraph fusion method that incorporates a line graph to explore conversational patterns, facilitating flexible knowledge transfer across modalities through hyperedge-level and graph-level alignments. Experiments demonstrate that MATCH outperforms state-of-the-art approaches on two benchmark datasets.
1031: Counterfactual Thinking Driven Emotion Regulation for Image Sentiment Recognition
Authors: Xinyue Zhang, Zhaoxia Wang, Hailing Wang, Guitao Cao
Location: Guangzhou | Day: TBD
Show Abstract
Image sentiment recognition (ISR) facilitates the practical application of affective computing on rapidly growing social platforms. Nowadays, region-based ISR methods that use affective regions to guide emotion prediction have gained significant attention. However, existing methods lack a causality-based mechanism to guide affective region generation and effective tools to quantitatively evaluate their quality. Inspired by the psychological theory of Emotion Regulation, we propose a counterfactual thinking driven emotion regulation network (CTERNet), which simulates the Emotion Regulation Theory by modeling the entire process of ISR based on human causality-driven mechanisms. Specifically, we first use multi-scale perception for feature extraction to simulate the stage of situation selection. Next, we combine situation modification, attentional deployment, and cognitive change into a counterfactual thinking based cognitive reappraisal module, which learns both affective regions (factual) and other potential affective regions (counterfactual). In the response modulation stage, we compare the factual and counterfactual outcomes to encourage the network to discover the most emotionally representative regions, thereby quantifying the quality of affective regions for ISR tasks. Experimental results demonstrate that our method outperforms or matches the state-of-the-art approaches, proving its effectiveness in addressing the key challenges of region-based ISR.
1032: HyperTrans: Efficient Hypergraph-Driven Cross-Domain Pattern Transfer in Image Anomaly Detection
Authors: Tengyu Zhang, Deyu Zeng, Baoqiang Li, Wei Wang, Wei Liu, Zongze Wu
Location: Guangzhou | Day: TBD
Show Abstract
Anomaly detection plays a pivotal role in industrial quality assurance processes, and cross-domain problems, exemplified by model upgrades from RGB to 3D, are prevalent in real-world scenarios yet remain systematically underexplored. To address the severe challenges posed by the extreme scarcity of data in the target domain, we retain knowledge from source models and explore a novel solution for anomaly detection through cross-domain learning, introducing HyperTrans. Targeting few-shot scenarios, HyperTrans centers around hypergraphs to model the relationships among the limited patch features and employs a perturbation-rectification-scoring architecture. The domain perturbation module injects and adapts channel-level statistical perturbations, mitigating style shifts during domain transfer. Subsequently, a residual hypergraph restoration module utilizes a cross-domain hypergraph to capture higher-order correlations in patches and align them across domains. Ultimately, with feature patterns exhibiting reduced domain shifts, an inter-domain scoring module aggregates similarity information between patches and normal patterns within the multi-domain subhypergraphs to make an integrated decision, generating multi-level anomaly predictions. Extensive experiments demonstrate that HyperTrans offers significant advantages in anomaly classification and anomaly segmentation tasks, outperforming state-of-the-art non-cross-domain methods in image-wise ROCAUC by 13%, 12%, and 15% in the 1-shot, 2-shot, and 5-shot settings on MVTec3D AD.
1035: Reliable and Diverse Hierarchical Adapter for Zero-shot Video Classification
Authors: Wenxuan Ge, Peng Huang, Rui Yan, Hongyu Qu, Guosen Xie, Xiangbo Shu
Location: Guangzhou | Day: TBD
Show Abstract
Adapting pre-trained vision-language models to downstream tasks has emerged as a novel paradigm for zero-shot learning. Existing test-time adaptation (TTA) methods such as TPT attempt to fine-tune visual or textual representations to accommodate downstream tasks but still incur expensive optimization costs. To this end, the Training-free Dynamic Adapter (TDA) maintains a cache containing visual features for each category in a parameter-free manner and measures sample confidence based on the prediction entropy of test samples. Inspired by TDA, this work aims to develop the first training-free adapter for zero-shot video classification. Capturing the intrinsic temporal relationships within video data to construct and maintain the video cache is key to extending TDA to the video domain. In this work, we propose a reliable and diverse Hierarchical Adapter for zero-shot video classification, which consists of a Frame-level Cache Refiner and a Video-level Cache Updater. Before each video sample enters the corresponding cache, it is refined at the frame level based on prediction entropy and temporal probability difference. Due to the limited capacity of the cache, we update the cache during inference based on the principle of diversity. Experiments on four popular video classification benchmarks demonstrate the effectiveness of the Hierarchical Adapter. The code is available at https://github.com/Gwxer/Hierarchical-Adapter.
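To fix ideas, the entropy half of the frame-level refinement could look like the sketch below (the temporal-probability-difference criterion and the cache bookkeeping are omitted; the keep-ratio rule is our assumption):

```python
import torch

def refine_frames(frame_logits, keep_ratio=0.5):
    """Keep only the most confident (lowest-entropy) frames of a video before
    its features enter the cache; frame_logits: (T, n_classes)."""
    probs = torch.softmax(frame_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)   # (T,)
    k = max(1, int(keep_ratio * frame_logits.size(0)))
    keep = entropy.topk(k, largest=False).indices               # confident frames
    return frame_logits[keep]
```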
1044: DriftRemover: Hybrid Energy Optimizations for Anomaly Images Synthesis and Segmentation
Authors: Siyue Yao, Haotian Xu, Mingjie Sun, Siyue Yu, Jimin Xiao, Eng Gee Lim
Location: Guangzhou | Day: TBD
Show Abstract
This paper tackles the challenge of anomaly image synthesis and segmentation, generating diverse anomaly images and their segmentation labels to mitigate the issue of data scarcity. Existing approaches employ precise masks to guide the generation and rely on additional mask generators, leading to increased computational costs and limited anomaly diversity. Although a few works use coarse masks as guidance to expand diversity, they lack effective label generation for the synthetic images, reducing their practicality. Therefore, our proposed method simultaneously generates anomaly images and their corresponding masks by utilizing coarse masks and anomaly categories. The framework utilizes attention maps from the synthesis process as mask labels and employs two optimization modules to tackle drift challenges, i.e., mismatches between synthetic results and real situations. Our evaluation demonstrates that our method improves pixel-level AP by 1.3% and F1-MAX by 1.8% in anomaly detection tasks on the MVTec dataset. Additionally, its successful application in practical scenarios highlights its effectiveness, improving IoU by 37.2% and F-measure by 25.1% on the Floor Dirt dataset. The code is available at https://github.com/JJessicaYao/DriftRemover.
1049: Interactive Multimodal Learning via Flat Gradient Modification
Authors: Qing-Yuan Jiang, Zhouyang Chi, Yang Yang
Location: Guangzhou | Day: TBD
Show Abstract
Due to the notorious modality imbalance phenomenon, multimodal learning (MML) struggles to achieve satisfactory performance. Recently, multimodal learning with alternating unimodal adaptation (MLA) has proven effective in mitigating interference between modalities by capturing interaction through orthogonal projection, thus relieving the modality imbalance phenomenon to some extent. However, a projection strategy orthogonal to the original space can lead to poor plasticity as alternating learning proceeds, thereby affecting model performance. To address this issue, we propose a novel interactive multimodal learning method via flat gradient modification (IGM), which employs a flat gradient modification strategy to enhance interactive MML. Specifically, we first employ a flat projection-based gradient modification strategy that is independent of the original space, aiming to avoid the poor-plasticity issue. Then we introduce a sharpness-aware minimization (SAM)-based optimization strategy to fully exploit the flatness of the learning objective and further enhance interaction during learning. As a result, the plasticity problem is avoided and overall performance improves. Extensive experiments on widely used datasets demonstrate that IGM outperforms various state-of-the-art (SOTA) baselines. The source code is available at https://anonymous.4open.science/r/method-CC45.
1050: Endogenous Recovery via Within-modality Prototypes for Incomplete Multimodal Hashing
Authors: Sa Zhu, Dayan Wu, Chenming Wu, Pengwen Dai, Bo Li
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal hashing projects multimodal data into compact binary codes, enabling rapid and storage-efficient retrieval of large-scale multimedia content. In practical scenarios, the issue of missing modalities frequently arises when dealing with multimodal data. Existing incomplete multimodal hashing techniques directly recover missing modalities with neural networks, resulting in a disjointed representation space between the recovered and true data. In this paper, we present a novel recovery paradigm, namely Prototype-based Modality Completion Hashing (PMCH). Instead of directly synthesizing missing modality data from the available modalities, PMCH adaptively aggregates associated within-modality prototypes to recover it. Specifically, PMCH introduces a within-modality prototype learning module to optimize representative prototypes for each modality. These prototypes act as recovery anchors and reside within the same representation space as their corresponding modality data. Subsequently, PMCH adaptively aggregates the associated within-modality prototypes with coefficients derived from the modality-specific Weight-Net. By utilizing prototypes from the same modality, the semantic disparity between the reconstructed and authentic data can be substantially diminished. Extensive experiments on three widely used benchmark datasets demonstrate that PMCH effectively recovers the missing modality and attains state-of-the-art performance in both complete and incomplete multimodal retrieval scenarios. Code is available at https://github.com/Sasa77777779/PMCH.git.
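The aggregation step the abstract describes reduces to a convex combination of within-modality prototypes; a sketch (the Weight-Net producing the coefficients is the paper's, everything else here is our simplification):

```python
import torch

def recover_missing_modality(prototypes, weight_logits):
    """Recover a missing-modality feature as an adaptive mixture of that
    modality's own prototypes, so the result lies in the true modality's
    representation space rather than a synthesized, disjoint one.
    prototypes: (K, d) recovery anchors; weight_logits: (batch, K)."""
    w = torch.softmax(weight_logits, dim=-1)   # aggregation coefficients
    return w @ prototypes                      # (batch, d)
```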
1057: Asynchronous Credit Assignment for Multi-Agent Reinforcement Learning
Authors: Yongheng Liang, Hejun Wu, Haitao Wang, Hao Cai
Location: Guangzhou | Day: TBD
Show Abstract
Credit assignment is a critical problem in multi-agent reinforcement learning (MARL), aiming to identify agents’ marginal contributions for optimizing cooperative policies. Current credit assignment methods typically assume synchronous decision-making among agents. However, many real-world scenarios require agents to act asynchronously without waiting for others. This asynchrony introduces conditional dependencies between actions, which pose great challenges to current methods. To address this issue, we propose an asynchronous credit assignment framework, incorporating a Virtual Synchrony Proxy (VSP) mechanism and a Multiplicative Value Decomposition (MVD) algorithm. VSP enables physically asynchronous actions to be virtually synchronized during credit assignment. We theoretically prove that VSP preserves both task equilibrium and algorithm convergence. Furthermore, MVD leverages multiplicative interactions to effectively model dependencies among asynchronous actions, offering theoretical advantages in handling asynchronous tasks. Extensive experiments show that our framework consistently outperforms state-of-the-art MARL methods on challenging tasks while providing improved interpretability for asynchronous cooperation.
1063: Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network
Authors: Junyan Wu, Wenbo Xu, Wei Lu, Xiangyang Luo, Rui Yang, Shize Guo
Location: Guangzhou | Day: TBD
Show Abstract
Audio temporal forgery localization (ATFL) aims to find the precise forged regions of partially spoofed audio that has been purposefully modified. Existing ATFL methods rely on training efficient networks using fine-grained annotations, which are costly and challenging to obtain in real-world scenarios. To meet this challenge, in this paper, we propose a progressive audio-language co-learning network (LOCO) that adopts co-learning and self-supervision to improve localization performance under weak supervision. Specifically, an audio-language co-learning module is first designed to capture forgery consensus features by aligning semantics from temporal and global perspectives. In this module, forgery-aware prompts are constructed by using utterance-level annotations together with learnable prompts, which can dynamically incorporate semantic priors into temporal content features. In addition, a forgery localization module is applied to produce forgery proposals based on fused forgery-class activation sequences. Finally, a progressive refinement strategy is introduced to generate pseudo frame-level labels and leverage supervised semantic contrastive learning to amplify the semantic distinction between real and fake content, thereby continuously optimizing forgery-aware features. Extensive experiments show that the proposed LOCO achieves SOTA performance on three public benchmarks.
1066: PeSANet: Physics-encoded Spectral Attention Network for Simulating PDE-Governed Complex Systems
Authors: Han Wan, Rui Zhang, Qi Wang, Yang Liu, Hao Sun
Location: Guangzhou | Day: TBD
Show Abstract
Accurately modeling and forecasting complex systems governed by partial differential equations (PDEs) is crucial in various scientific and engineering domains. However, traditional numerical methods struggle in real-world scenarios due to incomplete or unknown physical laws. Meanwhile, machine learning approaches often fail to generalize effectively when faced with scarce observational data and the challenge of capturing local and global features. To this end, we propose the Physics-encoded Spectral Attention Network (PeSANet), which integrates local and global information to forecast complex systems with limited data and incomplete physical priors. The model consists of two key components: a physics-encoded block that uses hard constraints to approximate local differential operators from limited data, and a spectral-enhanced block that captures long-range global dependencies in the frequency domain. Specifically, we introduce a novel spectral attention mechanism to model inter-spectrum relationships and learn long-range spatial features. Experimental results demonstrate that PeSANet outperforms existing methods across all metrics, particularly in long-term forecasting accuracy, providing a promising solution for simulating complex systems with limited data and incomplete physics.
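As a rough illustration of working in the frequency domain, the sketch below implements a simple learned mixing of low-frequency Fourier modes (in the spirit of Fourier neural operators). PeSANet's spectral attention additionally models inter-spectrum relationships, which this simplified, assumption-laden sketch does not attempt; the number of retained modes is arbitrary.

```python
import torch
import torch.nn as nn

class SpectralMix(nn.Module):
    """Simplified frequency-domain block: FFT the field, reweight the lowest
    modes with learned complex weights, and transform back."""
    def __init__(self, channels, modes=8):
        super().__init__()
        self.modes = modes
        self.w = nn.Parameter(
            torch.randn(channels, modes, modes, dtype=torch.cfloat) * 0.02)

    def forward(self, x):                       # x: (B, C, H, W)
        X = torch.fft.rfft2(x)                  # (B, C, H, W//2 + 1), complex
        out = torch.zeros_like(X)
        m = self.modes
        out[:, :, :m, :m] = X[:, :, :m, :m] * self.w   # mix low frequencies only
        return torch.fft.irfft2(out, s=x.shape[-2:])

x = torch.randn(2, 4, 64, 64)
print(SpectralMix(4)(x).shape)   # torch.Size([2, 4, 64, 64])
```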
1086: Gaussian Mixture Model for Graph Domain Adaptation
Authors: Mengzhu Wang, Wenhao Ren, Yu Zhang, Yanlong Fan, Dianxi Shi, Luoxi Jing, Nan Yin
Location: Guangzhou | Day: TBD
Show Abstract
Unsupervised domain adaptation (UDA) has been widely studied with the goal of transferring knowledge from a label-rich source domain to a related but unlabeled target domain. Most UDA techniques achieve this by reducing the feature discrepancies between the two domains to learn domain-invariant feature representations. While domain-invariant representations can reduce the differences between the source and target domains, excessively simplifying these differences may cause the model to overlook important domain-specific features, resulting in a decline in transfer learning effectiveness. To address this issue, this paper proposes a novel Gaussian Mixture Model for graph domain adaptation (GMM). This model effectively reduces the distributional bias between the source and target domains by modeling the distribution differences on a graph structure. GMM leverages the local structural information of the graph and the clustering capability of the Gaussian mixture model to automatically learn the latent mapping relationships between the source and target domains. To the best of our knowledge, this is the first work to introduce a Gaussian mixture model into UDA. Extensive experimental results on three standard benchmarks demonstrate that the proposed GMM algorithm outperforms state-of-the-art unsupervised domain adaptation methods.
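As a toy illustration of measuring cross-domain discrepancy with Gaussian mixtures, the snippet below fits one mixture per domain over (synthetic stand-in) node embeddings and compares cross-domain likelihoods. The paper's graph-structured modeling and learned source-to-target mapping go well beyond this sketch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-ins for source/target node embeddings from a GNN encoder.
rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(500, 16))
tgt = rng.normal(0.5, 1.2, size=(500, 16))

# Fit one mixture per domain; the log-likelihood of target samples under the
# source mixture is a crude proxy for distributional bias between domains.
gmm_src = GaussianMixture(n_components=4, covariance_type="diag",
                          random_state=0).fit(src)
gmm_tgt = GaussianMixture(n_components=4, covariance_type="diag",
                          random_state=0).fit(tgt)
print("target LL under source GMM:", gmm_src.score(tgt))
print("target LL under target GMM:", gmm_tgt.score(tgt))
```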
1095: Brain-Inspired Stepwise Patch Merging for Vision Transformers
Authors: Yonghao Yu, Dongcheng Zhao, Guobin Shen, Yiting Dong, Yi Zeng
Location: Guangzhou | Day: TBD
Show Abstract
The hierarchical architecture has become a mainstream design paradigm for Vision Transformers (ViTs), with Patch Merging serving as the pivotal component that transforms a columnar architecture into a hierarchical one. Drawing inspiration from the brain’s ability to integrate global and local information for comprehensive visual understanding, we propose Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism’s ability to ‘see’ better. SPM consists of Multi-Scale Aggregation (MSA) and Guided Local Enhancement (GLE), striking a proper balance between long-range dependency modeling and local feature enhancement. Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation. Meanwhile, experiments show that combining SPM with different backbones can further improve performance. The code has been released at https://github.com/Yonghao-Yu/StepwisePatchMerging.
1116: Enhancing Multimodal Protein Function Prediction Through Dual-Branch Dynamic Selection with Reconstructive Pre-Training
Authors: Xiaoling Luo, Peng Chen, Chengliang Liu, Xiaopeng Jin, Jie Wen, Yumeng Liu, Junsong Wang
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal protein features play a crucial role in protein function prediction. However, these features encompass a wide range of information, ranging from structural data and sequence features to protein attributes and interaction networks, making it challenging to decipher their complex interconnections. In this work, we propose a multimodal protein function prediction method (DSRPGO) by utilizing dynamic selection and reconstructive pre-training mechanisms. To acquire complex protein information, we introduce reconstructive pre-training to mine more fine-grained information with low semantic levels. Moreover, we put forward the Bidirectional Interaction Module (BInM) to facilitate interactive learning among multimodal features. Additionally, to address the difficulty of hierarchical multi-label classification in this task, a Dynamic Selection Module (DSM) is designed to select the feature representation that is most conducive to current protein function prediction. Our proposed DSRPGO model improves significantly in BPO, MFO, and CCO on human datasets, thereby outperforming other benchmark models.
1117: Top-I2P: Explore Open-Domain Image-to-Point Cloud Registration Using Topology Relationship
Authors: Pei An, Jiaqi Yang, Muyao Peng, You Yang, Qiong Liu, Jie Ma, Liangliang Nan
Location: Guangzhou | Day: TBD
Show Abstract
Image-to-point cloud (I2P) registration is a fundamental task in computer vision, which aims to align pixels in 2D images with corresponding points in 3D point clouds. While deep learning based methods dominate this field, they often fail to generalize to the open domain. In this paper, we address open-domain I2P registration from the topology relationship perspective. First, we find that topology relationships reflect sparse connections between pixels and points, which shows significant potential for enhancing cross-modality feature interaction in the open domain. Building on this insight, we develop an I2P registration framework using topology relationships. To construct and leverage the topology relationships between the heterogeneous 2D and 3D spaces, we design a registration network, Top-I2P, with correction-based topology reasoning and fast topology feature interaction modules. Extensive experiments on 7-Scenes, RGBD-V2, ScanNet, and self-collected I2P datasets demonstrate that Top-I2P achieves superior registration performance in open-domain scenarios.
1132: Decoupled Imbalanced Label Distribution Learning
Authors: Yongbiao Gao, Xiangcheng Sun, Miaogen Ling, Chao Tan, Yi Zhai, Guohua Lv
Location: Guangzhou | Day: TBD
Show Abstract
Label Distribution Learning (LDL) has been successfully implemented in numerous practical applications. However, the imbalance in label distributions presents a significant challenge due to the substantial variation in annotation information. To tackle this issue, we introduce Decoupled Imbalanced Label Distribution Learning (DILDL), which decomposes the imbalanced label distribution into a dominant label distribution and a non-dominant label distribution. Our empirical findings reveal that an excessively high description degree of dominant labels can result in substantial gradient information attenuation for non-dominant labels during the learning process. Therefore, we employ the decoupling approach to balance the description degrees of both dominant and non-dominant labels independently. Furthermore, we align the feature representations with the representations of dominant and non-dominant labels separately, aiming to effectively mitigate the distribution shift problem. Experimental results demonstrate that our proposed DILDL outperforms other state-of-the-art methods for imbalanced label distribution learning.
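One plausible reading of the decoupling step, sketched here with hypothetical details: split each label distribution into its dominant label and a renormalized non-dominant remainder, so the two parts can be supervised at comparable scales. The exact decomposition used by DILDL may well differ.

```python
import numpy as np

def decouple(label_dist):
    """Split a label distribution into a dominant part and a renormalized
    non-dominant part (an illustrative reading of the decoupling idea)."""
    d = np.asarray(label_dist, dtype=float)
    k = d.argmax()
    dominant = np.zeros_like(d)
    dominant[k] = 1.0                         # degenerate dominant part
    non_dom = d.copy()
    non_dom[k] = 0.0
    if non_dom.sum() > 0:
        non_dom = non_dom / non_dom.sum()     # rescale so it is not drowned out
    return dominant, non_dom

dom, non_dom = decouple([0.7, 0.1, 0.15, 0.05])
print(dom)      # [1. 0. 0. 0.]
print(non_dom)  # [0.    0.333 0.5   0.167] (approx.)
```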
1134: On the Generalization of Feature Incremental Learning
Authors: Chao Xu, Xijia Tang, Lijun Zhang, Chenping Hou
Location: Guangzhou | Day: TBD
Show Abstract
In many real applications, the data attributes are incremental and the samples are stored with accumulated feature spaces gradually. Although there are several elegant approaches to tackling this problem, the theoretical analysis is still limited. There exist at least two challenges and fundamental questions. 1) How to derive the generalization bounds of these approaches? 2) Under what conditions do these approaches have a strong generalization guarantee? To solve these crucial but rarely studied problems, we provide a comprehensive theoretical analysis in this paper. We begin by summarizing and refining four strategies for addressing feature incremental data. Subsequently, we derive their generalization bounds, providing rigorous and quantitative insights. The theoretical findings highlight the key factors influencing the generalization abilities of different strategies. In tackling the above two fundamental problems, we also provide valuable guidance for exploring other learning challenges in dynamic environments. Finally, the comprehensive experimental and theoretical results mutually validate each other, underscoring the reliability of our conclusions.
1137: Suit the Node Pair to the Case: A Multi-Scale Node Pair Grouping Strategy for Graph-MLP Distillation
Authors: Rui Dong, Jiaxing Li, Weihuang Zheng, Youyong Kong
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) are powerful in solving various graph-related tasks, but their message passing mechanism may introduce latency at inference time. Multi-Layer Perceptrons (MLPs) achieve fast inference but with limited performance. One solution to fill this gap is knowledge distillation. However, current distillation methods follow a "node-to-node" paradigm; given the complex relationships between different node pairs, such direct distillation fails to capture the multiple-granularity features in GNNs. Furthermore, current methods, which focus on aligning logits in the final layer, ignore further learning within the layers of the student MLP. Therefore, in this paper, we introduce a multi-scale knowledge distillation method (MSN-GDM) aiming to transfer multi-granularity knowledge from GNN to MLP. We first propose a multi-scale node-pair grouping strategy that assigns node pairs to different-scale groups according to node-pair similarity metrics. The similarity metrics consider both the node features and the topological structure of the given node pair. Then, based on the preprocessed node-set groups, we design a multi-scale distillation method that captures comprehensive knowledge in the corresponding node-set groups. The hierarchical weighted sum of each layer is applied as the final output. Extensive experiments on eight real-world datasets demonstrate the effectiveness of our proposed method.
1141: Efficient Differentiable Approximation of Generalized Low-rank Regularization
Authors: Naiqi Li, Yuqiu Xie, Peiyuan Liu, Tao Dai, Yong Jiang, Shu-Tao Xia
Location: Guangzhou | Day: TBD
Show Abstract
Low-rank regularization (LRR) has been widely applied in various machine learning tasks, but the associated optimization is challenging. Directly optimizing the rank function under constraints is NP-hard in general. To overcome this difficulty, various relaxations of the rank function were studied. However, optimization of these relaxed LRRs typically depends on singular value decomposition, which is a time-consuming and nondifferentiable operator that cannot be optimized with gradient-based techniques. To address these challenges, in this paper we propose an efficient differentiable approximation of the generalized LRR. The considered LRR form subsumes many popular choices like the nuclear norm, the Schatten-p norm, and various nonconvex relaxations. Our method enables LRR terms to be appended to loss functions in a plug-and-play fashion, and the GPU-friendly operations enable efficient and convenient implementation. Furthermore, convergence analysis is presented, which rigorously shows that both the bias and the variance of our rank estimator rapidly reduce with increased sample size and iteration steps. In the experimental study, the proposed method is applied to various tasks, which demonstrates its versatility and efficiency. Code is available at https://github.com/naiqili/EDLRR.
1157: Adaptive Language-Aware Image Reflection Removal Network
Authors: Siyan Fang, Yuntao Wang, Jinpu Zhang, Ziwen Li, Yuehuan Wang
Location: Guangzhou | Day: TBD
Show Abstract
Existing image reflection removal methods struggle to handle complex reflections. Accurate language descriptions can help the model understand the image content to remove complex reflections. However, due to blurred and distorted interferences in reflected images, machine-generated language descriptions of the image content are often inaccurate, which harms the performance of language-guided reflection removal. To address this, we propose the Adaptive Language-Aware Network (ALANet) to remove reflections even with inaccurate language inputs. Specifically, ALANet integrates both filtering and optimization strategies. The filtering strategy reduces the negative effects of language while preserving its benefits, whereas the optimization strategy enhances the alignment between language and visual features. ALANet also utilizes language cues to decouple specific layer content from feature maps, improving its ability to handle complex reflections. To evaluate the model’s performance under complex reflections and varying levels of language accuracy, we introduce the Complex Reflection and Language Accuracy Variance (CRLAV) dataset. Experimental results demonstrate that ALANet surpasses state-of-the-art methods for image reflection removal. The code and dataset are available at https://github.com/fashyon/ALANet.
1167: Open-World Semi-Supervised Learning with Class Semantic Correlations
Authors: Yuxin Fan, Junbiao Cui, Jiye Liang, Jianqing Liang
Location: Guangzhou | Day: TBD
Show Abstract
Open-world semi-supervised learning (OWSSL) aims to recognize both known and unknown classes, but the labeled samples only cover the known classes. Existing OWSSL methods primarily represent classes as symbolic variables, ignoring the rich internal semantic information associated with the classes and thus hampering their ability to recognize unknown classes. Recent studies incorporate textual descriptions of classes to facilitate training, but these methods overlook the class semantic correlations, which constrains their effectiveness in recognizing unknown classes. To address these issues, we propose a novel OWSSL method. Our method fine-tunes only the image encoder during training while keeping the text encoder frozen, thereby preserving the rich semantic correlations learned during the pre-training phase. Furthermore, we employ a semantic margin to extract class semantic correlations from textual descriptions, which are then utilized to enhance image representation discriminability. Experimental results across multiple datasets demonstrate that our method significantly outperforms representative OWSSL methods in the recognition of both known and unknown classes.
1169: TCDM: A Temporal Correlation-Empowered Diffusion Model for Time Series Forecasting
Authors: Huibo Xu, Likang Wu, Xianquan Wang, Zhiding Liu, Qi Liu
Location: Guangzhou | Day: TBD
Show Abstract
Although previous studies have applied diffusion models to time series forecasting, these efforts have struggled to preserve the intrinsic temporal correlations within the series, leading to suboptimal predictive outcomes. This failure primarily results from the introduction of independent, identically distributed (i.i.d.) noise. In the forward process, the addition of i.i.d. noise to the time series gradually diminishes these temporal correlations. The reverse process starts with i.i.d. noise and lacks priors related to temporal correlations, which can result in directional biases during sampling. From a frequency-domain perspective, noise disrupts the low-frequency-dominated structure of trend components, making it difficult for the model to learn long-term temporal dependencies. To address these limitations, we introduce a decomposition prediction framework to complement the novel Temporal Correlation-Empowered Diffusion Model. Overall, we decompose the time series into trend and residual components, predict them using a base model and a diffusion model, and then combine the results. Specifically, a frequency-domain MLP is adopted as the base model because it does not distort the original sequence and better captures long-range temporal dependencies. The diffusion model incorporates two key modules to capture short- and mid-range temporal correlations: the Maintaining Temporal Correlation Module and the Redesigned Initial Module. Extensive experiments across multiple datasets demonstrate that the proposed method significantly outperforms strong baselines.
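The decomposition step itself can be illustrated with a standard moving-average split into trend and residual; the window size and padding below are illustrative choices. In TCDM's pipeline the trend would go to the frequency-domain MLP and the residual to the diffusion model; here we only show the split.

```python
import numpy as np

def decompose(x, window=24):
    """Moving-average trend plus residual, a common decomposition step."""
    kernel = np.ones(window) / window
    pad = np.pad(x, (window - 1, 0), mode="edge")   # causal edge padding
    trend = np.convolve(pad, kernel, mode="valid")
    return trend, x - trend

t = np.linspace(0, 8 * np.pi, 400)
series = 0.05 * t + np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
trend, residual = decompose(series)
print(trend.shape, residual.shape)   # (400,) (400,)
```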
1175: Underground Diagnosis in 3D GPR Data by Learning in CuCoRes Model Space
Authors: Xiren Zhou, Shikang Liu, Xinyu Yan, Xiangyu Wang, Huanhuan Chen
Location: Guangzhou | Day: TBD
Show Abstract
Ground Penetrating Radar (GPR) provides detailed subterranean insights. Nevertheless, underground diagnosis via GPR is hindered by the fact that training data typically contain only normal samples, along with the complexity of GPR data’s wave-collection characteristics. This paper proposes subsurface anomaly detection within the Cubic Correlation Reservoir Network (CuCoRes) model space. CuCoRes incorporates three reservoirs with spatial correlation adjustment in each direction to adequately and accurately capture multi-directional dynamics (i.e., changing information) within GPR data. By fitting GPR data with CuCoRes and representing the data with the fitted models, the original GPR data is mapped into a category-discriminative CuCoRes model space, where anomalies can be efficiently identified and categorized based on model dissimilarities. Our approach leverages only limited, easily accessible normal GPR data to support subsequent anomaly detection and categorization, enhancing its applicability in practical scenarios. Experiments on real-world data demonstrate its effectiveness, outperforming state-of-the-art methods.
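The model-space idea can be illustrated with a plain echo state network: fit a ridge readout to each signal and treat the readout weights as that signal's representation, so dissimilar dynamics yield distant weight vectors. CuCoRes's cubic, spatially correlated reservoirs are considerably more elaborate than this single-reservoir sketch, and all hyperparameters below are illustrative.

```python
import numpy as np

def esn_readout(signal, n_res=50, rho=0.9, seed=0):
    """Fit a one-step-ahead ESN readout to a 1-D signal; the ridge-regression
    readout weights serve as the signal's model-space representation."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, size=n_res)
    W = rng.normal(size=(n_res, n_res))
    W *= rho / max(abs(np.linalg.eigvals(W)))        # fix the spectral radius
    x = np.zeros(n_res)
    states = []
    for u in signal[:-1]:
        x = np.tanh(W @ x + W_in * u)
        states.append(x.copy())
    S, y = np.array(states), signal[1:]
    # Ridge readout: (S^T S + lambda I)^-1 S^T y
    return np.linalg.solve(S.T @ S + 1e-3 * np.eye(n_res), S.T @ y)

normal = np.sin(np.linspace(0, 20, 300))
anomalous = np.sign(np.sin(np.linspace(0, 20, 300)))  # different dynamics
w_ref, w_test = esn_readout(normal), esn_readout(anomalous)
print("model-space distance:", np.linalg.norm(w_ref - w_test))
```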
1179: Fault Diagnosis in REDNet Model Space
Authors: Xiren Zhou, Ziyu Tang, Shikang Liu, Ao Chen, Xiangyu Wang, Huanhuan Chen
Location: Guangzhou | Day: TBD
Show Abstract
Fault Diagnosis (FD) in time-varying data must contend with limited training data, intra- and inter-dimensional correlations, and constraints on training time. In response, this paper introduces FD in the Reservoir-Embedded-Directional Network (REDNet) model space. Model-oriented methods utilize well-fitted networks or functions, denoted as "models", that capture the data’s changing information as more stable and parsimonious representations of the data. Our approach employs REDNet for data fitting, wherein multiple reservoirs are organized along intrinsic correlation directions to establish intra- and inter-dimensional dependencies, thereby capturing multi-directional dynamics in high-dimensional data.
Representing each data instance with an independently fitted REDNet model maps these instances into a class-separable REDNet model space, where FD can be performed on the models rather than the original data. By concentrating on the data-intrinsic dynamics, our method achieves rapid training and maintains robust performance even with minimal training data. Experiments on several datasets demonstrate its effectiveness.
1183: Stabilizing Holistic Semantics in Diffusion Bridge for Image Inpainting
Authors: Jinjia Peng, Mengkai Li, Huibing Wang
Location: Guangzhou | Day: TBD
Show Abstract
Image inpainting aims to restore the original image from a damaged version. Recently, a special type of diffusion bridge model has achieved promising performance by directly mapping the degradation process and restoring corrupted images through the corresponding reverse process. However, due to the lack of explicit semantic priors during the denoising process, the inpainted results typically exhibit inferior context-stability and semantic consistency. To this end, this paper proposes a novel Global Structure-Guided Diffusion Bridge framework (GSGDiff), which incorporates an additional structure restorer to stabilize the generation of holistic semantics. Specifically, to acquire richer semantic structure priors, this paper proposes a posterior sampling approach that captures semantically global and consistent structures at each timestep, efficiently integrating them into the texture generation through the corresponding guidance module. Additionally, considering the characteristics of diffusion models with low denoising levels at larger timesteps, this paper proposes a semantic fusion schedule to avoid noise interference by reducing the weight of ineffective guided semantics in the early stages. By applying the proposed posterior sampling to the texture denoising process, GSGDiff can achieve more stable and superior inpainting results over competitive baselines. Experiments on Places2, Paris Street View and CelebA-HQ datasets validate the efficacy of the proposed method.
1206: Binary Event-Driven Spiking Transformer
Authors: Honglin Cao, Zijian Zhou, Wenjie Wei, Yu Liang, Ammar Belatreche, Dehao Zhang, Malu Zhang, Yang Yang, Haizhou Li
Location: Guangzhou | Day: TBD
Show Abstract
Transformer-based Spiking Neural Networks (SNNs) introduce a novel event-driven self-attention paradigm that combines the high performance of Transformers with the energy efficiency of SNNs. However, the larger model size and increased computational demands of the Transformer structure limit their practicality in resource-constrained scenarios. In this paper, we integrate binarization techniques into Transformer-based SNNs and propose the Binary Event-Driven Spiking Transformer, i.e., BESTformer. The proposed BESTformer significantly reduces storage and computational demands by representing weights and attention maps with a mere 1 bit. However, BESTformer suffers from a severe performance drop relative to its full-precision counterpart due to the limited representation capability of binarization. To address this issue, we propose a Coupled Information Enhancement (CIE) method, which consists of a reversible framework and information enhancement distillation. By maximizing the mutual information between the binary model and its full-precision counterpart, the CIE method effectively mitigates the performance degradation of the BESTformer. Extensive experiments on static and neuromorphic datasets demonstrate that our method achieves superior performance to other binary SNNs, showcasing its potential as a compact yet high-performance model for resource-limited edge devices. The repository of this paper is available at https://github.com/CaoHLin/BESTFormer.
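Binarizing weights while keeping training differentiable is typically done with a straight-through estimator (STE); a minimal PyTorch sketch of that standard building block follows. BESTformer's coupled information enhancement and distillation are separate mechanisms not shown here.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign-binarize in the forward pass; pass gradients straight through
    (clipped to |w| <= 1) in the backward pass."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()   # zero gradient outside [-1, 1]

w = torch.randn(4, 4, requires_grad=True)
y = BinarizeSTE.apply(w).sum()
y.backward()
print(w.grad)   # nonzero only where |w| <= 1
```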
1212: Optical Flow Estimation for Tiny Objects: New Problem, Specialized Benchmark, and Bioinspired Scheme
Authors: Xueyao Ji, Gang Wang, Yizheng Wang
Location: Guangzhou | Day: TBD
Show Abstract
Optical flow is pivotal in video-based tasks, yet existing methods mostly focus on medium-/large-size objects, while underperforming when characterizing the motion of tiny objects. To bridge this gap, we introduce the On-off Time-delay with Hassenstein-Reichardt correlator (OTHR), a computationally efficient scheme inspired by the primate visual cortex’s direction selectivity mechanism. OTHR kernels, applied across multiple frames, discern bright/dark luminance changes along a specific direction over a time delay, effectively estimating motion of tiny objects amidst noise and static backgrounds. Notably, OTHR integrates seamlessly with leading deep learning flow estimation models such as RAFT and FlowFormer. We also propose refined evaluation metrics for tiny objects and contribute a new dataset featuring such objects to aid algorithm development. Our experiments confirm OTHR’s superiority over competing methods, particularly in enhancing state-of-the-art models’ performance on tiny object motion estimation at minimal cost. Specifically, for objects less than 100 pixels, OTHR reduces RAFT and FlowFormer’s errors by 22.03% and 83.50%, respectively. The codes will be accessible at https://github.com/JaneEliot/OTHR.
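The Hassenstein-Reichardt correlator at the core of OTHR is a classic direction-selective motion detector; a minimal NumPy version on a stack of 1-D frames is sketched below. The on/off channel split and the specific time-delay kernels of the full OTHR scheme are omitted, and the offsets used are illustrative.

```python
import numpy as np

def reichardt(frames, dx=1, delay=1):
    """Classic Hassenstein-Reichardt correlator on a (T, W) frame stack:
    multiply each pixel's delayed signal by its neighbor's current signal
    in both mirror-symmetric arms; the difference is direction-selective."""
    left_delayed = frames[:-delay, :-dx]    # pixel x at time t - delay
    right_now = frames[delay:, dx:]         # pixel x + dx at time t
    right_delayed = frames[:-delay, dx:]    # pixel x + dx at time t - delay
    left_now = frames[delay:, :-dx]         # pixel x at time t
    return left_delayed * right_now - right_delayed * left_now  # > 0: rightward

T, W = 50, 64
frames = np.zeros((T, W))
for t in range(T):                          # a tiny bright dot drifting right
    frames[t, (5 + t) % W] = 1.0
print("mean response:", reichardt(frames).mean())   # positive -> rightward motion
```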
1223: One-step Label Shift Adaptation via Robust Weight Estimation
Authors: Ruidong Fan, Xiao Ouyang, Tingjin Luo, Lijun Zhang, Chenping Hou
Location: Guangzhou | Day: TBD
Show Abstract
Label shift is a prevalent phenomenon in open environments, characterized by a notable discrepancy in the label distributions between the source (training) and target (test) domains, whereas the conditional distributions given the labels remain invariant. Existing label shift methods adopt a two-step strategy: initially computing the importance weight and subsequently utilizing it to calibrate the target outputs. However, this conventional strategy overlooks the intricate interplay between output adjustment and weight estimation. In this paper, we introduce a novel approach termed One-step Label Shift Adaptation (OLSA). Our methodology jointly learns the predictive model and the corresponding weights through a bi-level optimization framework, with the objective of minimizing an upper bound on the target risk. To enhance the robustness of our proposed model, we incorporate a debiasing term into the upper-level classifier training and devise a regularization term for the lower-level weight estimation. Furthermore, we present theoretical analyses of the generalization bounds, offering guarantees for the model’s performance. Extensive experimental results substantiate the efficacy of our proposal.
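For context, the conventional two-step strategy that OLSA replaces can be sketched with the classic confusion-matrix estimator of label shift (BBSE-style): estimate the target label marginal from the model's predictions, then form per-class importance weights. All numbers below are toy values.

```python
import numpy as np

# Classical two-step label shift estimation (BBSE-style), i.e. the strategy
# that OLSA folds into a single joint bi-level optimization.
C = np.array([[0.8, 0.1, 0.1],      # C[i, j] = P(pred = i | true = j),
              [0.1, 0.8, 0.1],      # estimated on held-out source data
              [0.1, 0.1, 0.8]])
mu_t = np.array([0.5, 0.3, 0.2])    # average predicted class probs on target
p_s = np.array([1/3, 1/3, 1/3])     # source label marginal

q_t = np.linalg.solve(C, mu_t)      # estimated target label marginal
w = np.clip(q_t / p_s, 0.0, None)   # per-class importance weights
print("weights:", w)                # ~[1.71, 0.86, 0.43]; reweight source loss
```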
1250: GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype
Authors: Changxi Chi, Jun Xia, Jingbo Zhou, Jiabei Cheng, Chang Yu, Stan Z. Li
Location: Guangzhou | Day: TBD
Show Abstract
Predicting genetic perturbations enables the identification of potentially crucial genes prior to wet-lab experiments, significantly improving overall experimental efficiency. Since genes are the foundation of cellular life, building gene regulatory networks (GRNs) is essential to understand and predict the effects of genetic perturbations. However, current methods fail to fully leverage gene-related information and rely solely on simple evaluation metrics to construct coarse-grained GRNs. More importantly, they ignore functional differences between biotypes, limiting their ability to capture potential gene interactions. In this work, we leverage a pre-trained large language model and a DNA sequence model to extract features from gene descriptions and DNA sequence data, respectively, which serve as the initialization for gene representations. Additionally, we introduce gene biotype information for the first time in genetic perturbation prediction, simulating the distinct roles of genes with different biotypes in regulating cellular processes, while capturing implicit gene relationships through graph structure learning (GSL). We propose GRAPE, a heterogeneous graph neural network (HGNN) that leverages gene representations initialized with features from descriptions and sequences, models the distinct roles of genes with different biotypes, and dynamically refines the GRN through GSL. Results on publicly available datasets show that our method achieves state-of-the-art performance. The code for reproducing the results is available at https://github.com/ChangxiChi/GRAPE.
1280: Self-Classification Enhancement and Correction for Weakly Supervised Object Detection
Authors: Yufei Yin, Lechao Cheng, Wengang Zhou, Jiajun Deng, Zhou Yu, Houqiang Li
Location: Guangzhou | Day: TBD
Show Abstract
In recent years, weakly supervised object detection (WSOD) has attracted much attention due to its low labeling cost. The success of recent WSOD models is often ascribed to the two-stage multi-class classification (MCC) task, i.e., multiple instance learning and online classification refinement. Despite achieving non-trivial progress, these methods overlook potential classification ambiguities between the two MCC tasks and fail to leverage their unique strengths. In this work, we introduce a novel WSOD framework to ameliorate these two issues. First, we propose a self-classification enhancement module that integrates intra-class binary classification (ICBC) to bridge the gap between the two distinct MCC tasks. The ICBC task enhances the network’s discrimination between positive and mis-located samples in a class-wise manner and forges a mutually reinforcing relationship with the MCC task. Second, we propose a self-classification correction algorithm during inference, which combines the results of both MCC tasks to effectively reduce mis-classified predictions. Extensive experiments on the prevalent VOC 2007 & 2012 datasets demonstrate the superior performance of our framework.
1281: QuantileFormer: Probabilistic Time Series Forecasting with a Pattern-Mixture Decomposed VAE Transformer
Authors: Yimiao Shao, Wenzhong Li, Kang Xia, Kaijie Lin, Mingkai Lin, Sanglu Lu
Location: Guangzhou | Day: TBD
Show Abstract
Probabilistic time series forecasting has attracted increasing attention in the machine learning community for its potential applications in renewable energy, traffic management, healthcare, and beyond. Previous research mainly focused on extracting long-range dependencies for point-wise prediction, which fails to capture the complex temporal patterns and statistical characteristics needed for probabilistic analysis. In this paper, we propose a novel pattern-mixture decomposition method that decomposes long-term series into quantile drift, divergence patterns, and Gaussian mixture components, effectively capturing the intricate temporal patterns and stochastic characteristics of time series. Based on pattern-mixture decomposition, we propose a novel Transformer-based model called QuantileFormer for probabilistic time series forecasting. It takes the comprehensive drift-divergence mixture patterns as features and designs a variational-inference-based fusion Transformer architecture to generate quantile predictions. Extensive experiments show that the proposed method consistently boosts the baseline methods by a large margin and achieves state-of-the-art performance on six real-world benchmarks.
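Quantile prediction is usually trained with the pinball loss, sketched below. Assuming QuantileFormer's decomposition and variational fusion sit on top of an objective of roughly this shape (the abstract does not state the loss), this is the standard building block.

```python
import torch

def pinball_loss(pred, target, quantile):
    """Standard quantile (pinball) loss: an asymmetric penalty that makes the
    model output the requested conditional quantile of the target."""
    err = target - pred
    return torch.mean(torch.maximum(quantile * err, (quantile - 1) * err))

pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.5, 1.5, 3.5])
for q in (0.1, 0.5, 0.9):
    # Low quantiles penalize over-prediction; high quantiles penalize under-prediction.
    print(q, pinball_loss(pred, target, q).item())
```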
1293: CFII-Net: Explicit Class Embeddings and Feature Maps Through Iterative Interaction for Boosting Medical Image Segmentation
Authors: Xinyu Zhu, Xiwen Liu, Lianghua He, Yin Wen
Location: Guangzhou | Day: TBD
Show Abstract
Prior knowledge of category structure is essential in medical image segmentation, especially with significant organ structure differences. However, current hybrid architectures primarily focus on enhancing pixel-level representation learning, often neglecting or weakening the key prior knowledge of categorical structures, which poses challenges for capturing category relationships and accurate segmentation. To address this concern, we propose a novel network using Explicit Class Embeddings and Feature Maps through Iterative Interaction (CFII-Net) for boosting medical image segmentation. CFII-Net effectively segments images by exploring the relationship between explicit class embeddings and pixels in images. Specifically, we propose an Explicit Class Embedding Generator (ECEG) to obtain high-quality class semantic embeddings that incorporate category structure priors and guide high-accuracy segmentation. We then introduce an iterative Interactor, which utilizes transformers to facilitate the interaction between feature maps and class embeddings, thereby exploring pixel-to-class relationships. Furthermore, we propose updating strategies to refine the class embeddings and feature maps during the iteration process to achieve refined image segmentation. Extensive empirical evidence shows that any codec can be easily integrated into CFII-Net and yields improvements over state-of-the-art methods on four public benchmarks.
1295: Robust Graph Contrastive Learning for Incomplete Multi-view Clustering
Authors: Deyin Zhuang, Jian Dai, Xingfeng Li, Xi Wu, Yuan Sun, Zhenwen Ren
Location: Guangzhou | Day: TBD
Show Abstract
In recent years, multi-view clustering (MVC) has become a promising approach for analyzing heterogeneous multi-source data. However, during the collection of multi-view data, factors such as environmental interference or sensor failure often lead to the loss of view sample data, resulting in incomplete multi-view clustering (IMVC). Graph contrastive IMVC, which typically uses in-graph instances as positive pairs and out-of-graph instances as negative pairs, has demonstrated promising performance as an effective solution. However, the construction of positive and negative pairs in this paradigm inevitably leads to graph noise correspondence (GNC). To this end, we propose a new IMVC framework, namely robust graph contrastive learning (RGCL). Specifically, RGCL first completes the missing data using a multi-view consistency transfer relationship graph. Then, to mitigate the impact of false negative pairs in graph contrastive learning, we propose noise-robust graph contrastive learning to accurately mine intra-view consistency. Finally, we present cross-view graph-level alignment to fully exploit the complementary information across different views. Experimental results on six multi-view datasets demonstrate that our RGCL exhibits superiority and effectiveness compared with 9 state-of-the-art IMVC methods. The source code is available at https://github.com/DYZ163/RGCL.git.
1296: Misclassification-driven Fingerprinting for DNNs Using Frequency-aware GANs
Authors: Weixing Liu, Shenghua Zhong
Location: Guangzhou | Day: TBD
Show Abstract
Deep neural networks (DNNs) have become valuable assets due to their success in various tasks, but their high training costs also make them targets for model theft. Fingerprinting techniques are commonly used to verify model ownership, but existing methods either require training many additional models, leading to increased costs, or rely on GANs to generate fingerprints near decision boundaries, which may compromise image quality. To address these challenges, we propose a GAN-based fingerprint generation method that applies frequency-domain perturbations to normal samples, effectively creating fingerprints. This approach not only resists intellectual property (IP) threats, but also improves fingerprint acquisition efficiency while maintaining high imperceptibility. Extensive experiments demonstrate that our method achieves a state-of-the-art (SOTA) AUC of 0.98 on the Tiny-ImageNet dataset under IP removal attacks, outperforming existing methods by 8%, and consistently achieves the best ABP for three types of IP detection and erasure attacks on the GTSRB dataset. Our source code is available at https://github.com/wason981/Frequency-Fingerprinting.
1298: Exploiting Label Skewness for Spiking Neural Networks in Federated Learning
Authors: Di Yu, Xin Du, Linshan Jiang, Huijing Zhang, Shuiguang Deng
Location: Guangzhou | Day: TBD
Show Abstract
The energy efficiency of deep spiking neural networks (SNNs) aligns with the constraints of resource-limited edge devices, positioning SNNs as a promising foundation for intelligent applications leveraging the extensive data collected by these devices. To safeguard data privacy, federated learning (FL) facilitates collaborative SNN-based model training by leveraging data distributed across edge devices without transmitting local data to a central server. However, existing FL approaches encounter challenges in handling label-skewed data across devices, inducing drift in the local SNN model and consequently impairing the performance of the global SNN model. To tackle these problems, we propose a novel framework called FedLEC, which incorporates intra-client label weight calibration to balance the learning intensity across local labels and inter-client knowledge distillation to mitigate local SNN model bias caused by label absence. Extensive experiments with three different structured SNNs across five datasets (i.e., three non-neuromorphic and two neuromorphic datasets) demonstrate the efficiency of FedLEC. Compared to seven state-of-the-art FL algorithms, FedLEC achieves an average accuracy improvement of approximately 11.59% for the global SNN model under various label skew distribution settings.
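A minimal sketch of intra-client label weight calibration, under assumptions of our own: reweight the local cross-entropy by inverse label frequency so scarce or absent labels do not destabilize training. FedLEC's actual calibration and its inter-client knowledge distillation are richer than this.

```python
import torch
import torch.nn as nn

# Skewed label counts on one client; class 3 is entirely absent locally.
local_counts = torch.tensor([500.0, 30.0, 5.0, 0.0])
weights = 1.0 / (local_counts + 1.0)              # +1 guards absent labels
weights = weights / weights.sum() * len(weights)  # normalize around 1.0
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 4)
labels = torch.randint(0, 3, (8,))                # only locally present classes
print(criterion(logits, labels))                  # calibrated local loss
```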
1306: AdaR: An Adaptive Gradient Method with Cyclical Restarting of Moment Estimations
Authors: Yangchuan Wang, Lianhong Ding, Peng Shi
Location: Guangzhou | Day: TBD
Show Abstract
Adaptive gradient methods, primarily based on Adam, are prevalent in training neural networks, adjusting step sizes via exponentially decaying averages of gradients and squared gradients. Adam assigns small weights to distant gradients, termed long-tail gradients in this paper. However, these gradients persistently influence update behavior, potentially degrading generalization performance. To address this issue, we incorporate a restart mechanism into moment estimations, proposing AdaR (ADAptive gradient methods via Restarting moment estimations). Specifically, AdaR divides a training epoch into fixed-iteration intervals, alternating between two sets of moment estimations for parameter updates and discarding prior moment estimations at the beginning of each interval. Within each interval, one set updates parameters and will be discarded in the subsequent interval, while the other is reset at the midpoint to estimate moments for updates in the subsequent interval. The restart mechanism cyclically discards distant gradients, initiates fresh moment estimations for parameter updates, and stabilizes training. By prioritizing recent gradients, the method increases estimation accuracy and enhances step size adjustment. Empirically, AdaR outperforms state-of-the-art optimization algorithms on image classification and language modeling tasks, demonstrating superior generalization and faster convergence.
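A simplified, single-set version of the restart mechanism can be sketched by periodically zeroing Adam's moment buffers at fixed-iteration intervals; AdaR itself alternates two moment sets with a midpoint reset, which this sketch does not replicate. The state keys below are those used internally by PyTorch's Adam.

```python
import torch

def reset_moments(optimizer):
    """Discard Adam's first/second moment estimates so upcoming steps are
    driven only by recent gradients (single-set simplification of AdaR)."""
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p)
            if state:
                state["exp_avg"].zero_()
                state["exp_avg_sq"].zero_()
                if torch.is_tensor(state["step"]):   # restart bias correction too
                    state["step"].zero_()
                else:
                    state["step"] = 0

model = torch.nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1, 301):
    loss = model(torch.randn(32, 10)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:          # fixed-iteration interval boundary
        reset_moments(opt)
```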
1311: Top-Down Guidance for Learning Object-Centric Representations
Authors: Junhong Zou, Xiangyu Zhu, Zhaoxiang Zhang, Zhen Lei
Location: Guangzhou | Day: TBD
Show Abstract
Humans’ innate ability to decompose scenes into objects allows for efficient understanding, predicting, and planning. In light of this, Object-Centric Learning (OCL) attempts to endow networks with similar capabilities, learning to represent scenes as compositions of objects. However, existing OCL models learn only by reconstructing the input images, which does not help the model distinguish objects, resulting in suboptimal object-centric representations. This flaw limits current object-centric models to relatively simple downstream tasks. To address this issue, we draw on humans’ top-down vision pathway and propose the Top-Down Guided Network (TDGNet), which includes a top-down pathway to improve object-centric representations. During training, the top-down pathway constructs guidance with high-level object-centric representations to optimize the low-level grid features output by the backbone. During inference, it refines object-centric representations by detecting and resolving conflicts between low- and high-level features. We show that TDGNet outperforms current object-centric models on multiple datasets of varying complexity. In addition, we expand the downstream task scope of object-centric representations by applying TDGNet to the field of robotics, validating its effectiveness in downstream tasks including video prediction and visual planning. Code will be available at https://github.com/zoujunhong/RHGNet.
1316: Detecting Hallucination in Large Language Models Through Deep Internal Representation Analysis
Authors: Luan Zhang, Dandan Song, Zhijing Wu, Yuhang Tian, Changzhi Zhou, Jing Xu, Ziyi Yang, Shuhao Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Large language models (LLMs) have shown exceptional performance across various domains. However, LLMs are prone to hallucinate facts and generate non-factual responses, which can undermine their reliability in real-world applications. Current hallucination detection methods suffer from external resource demands, substantial time overhead, difficulty overcoming LLMs’ intrinsic limitation, and insufficient modeling. In this paper, we propose MHAD, a novel internal-representation-based hallucination detection method. MHAD utilizes linear probing to select neurons and layers within LLMs. The selected neurons and layers are demonstrated with significant awareness of hallucinations at the initial and final generation steps. By concatenating the outputs from these selected neurons of selected layers at the initial and final generation steps, a hallucination awareness vector is formed, enabling precise hallucination detection via an MLP. Additionally, we introduce SOQHD, a novel benchmark for evaluating hallucination detection in Open-Domain QA (ODQA). Extensive experiments show that MHAD outperforms existing hallucination detection methods across multiple LLMs, demonstrating superior effectiveness.
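The probing idea behind MHAD is easy to demonstrate: train a linear classifier on hidden-state vectors labeled as faithful or hallucinated. The sketch below uses synthetic vectors as stand-ins for the selected neuron/layer activations at the initial and final generation steps; MHAD's selection procedure and MLP detector are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-ins for concatenated hidden-state features per response.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 64)),   # faithful responses
               rng.normal(0.8, 1.0, (200, 64))])  # hallucinated responses
y = np.array([0] * 200 + [1] * 200)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("probe accuracy:", probe.score(Xte, yte))   # high if features separate
```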
1326: ST-USleepNet: A Spatial-Temporal Coupling Prominence Network for Multi-Channel Sleep Staging
Authors: Jingying Ma, Qika Lin, Ziyu Jia, Mengling Feng
Location: Guangzhou | Day: TBD
Show Abstract
Sleep staging is critical to assess sleep quality and diagnose disorders. Despite advancements in artificial intelligence enabling automated sleep staging, significant challenges remain: (1) Simultaneously extracting prominent temporal and spatial sleep features from multi-channel raw signals, including characteristic sleep waveforms and salient spatial brain networks. (2) Capturing the spatial-temporal coupling patterns essential for accurate sleep staging. To address these challenges, we propose a novel framework named ST-USleepNet, comprising a spatial-temporal graph construction module (ST) and a U-shaped sleep network (USleepNet). The ST module converts raw signals into a spatial-temporal graph based on signal similarity, temporal, and spatial relationships to model spatial-temporal coupling patterns. The USleepNet employs a U-shaped structure for both the temporal and spatial streams, mirroring its original use in image segmentation to isolate significant targets. Applied to raw sleep signals and graph data from the ST module, USleepNet effectively segments these inputs, simultaneously extracting prominent temporal and spatial sleep features. Testing on three datasets demonstrates that ST-USleepNet outperforms existing baselines, and model visualizations confirm its efficacy in extracting prominent sleep features and temporal-spatial coupling patterns across various sleep stages. The code is available at https://github.com/Majy-Yuji/ST-USleepNet.
1337: SCNNs: Spike-based Coupling Neural Networks for Understanding Structural-Functional Relationships in the Human Brain
Authors: Shaolong Wei, Shu Jiang, Mingliang Wang, Liang Sun, Haonan Rao, Weiping Ding, Jiashuang Huang
Location: Guangzhou | Day: TBD
Show Abstract
Structural-functional coupling (SC-FC coupling) offers an effective approach for analyzing structural-functional relationships, capable of revealing the dependency of functional activity on the underlying white matter architecture. However, extant SC-FC coupling analysis methods primarily center on disclosing the statistical association between the topological patterns of structural connectivity (SC) and functional connectivity (FC), while often neglecting the neurobiological mechanisms by which the brain typically transmits and processes information in the form of spikes. To address this, we propose a biologically inspired deep-learning model called spike-based coupling neural networks (SCNNs). It can simulate spiking neural activity to more realistically reproduce the interaction between brain regions and the dynamic behavior of neuronal networks. Specifically, we first use spike neurons to capture the FC temporal characteristics of the original functional magnetic resonance imaging (fMRI) time series and the SC spatial characteristics of the structural brain network. Then, we use synaptic and neuronal filter effects to simulate the coupling mechanism of SC and FC in the brain at different temporal and spatial scales, thereby quantifying SC-FC coupling and providing support for the identification of brain diseases. The results on real datasets show that the proposed method can identify brain diseases and provide a new perspective for understanding SC-FC relationships.
1341: TreeKV: Smooth Key-Value Cache Compression with Tree Structures
Authors: Ziwei He, Jian Yuan, Haoli Bai, Jingwen Leng, Bo Jiang
Location: Guangzhou | Day: TBD
Show Abstract
Efficient key-value (KV) cache compression is critical for scaling transformer-based Large Language Models (LLMs) in long sequences and resource-limited settings. Existing methods evict tokens based on their positions or importance, but position-based strategies can miss crucial information outside predefined regions, while those relying on global importance scores exhibit strong regional biases, limiting the KV cache’s overall context retention and potentially impairing the performance of LLMs on complex tasks. Our wavelet analysis reveals that as tokens approach the end of the sequence, their contributions to generation gradually increase and tend to diverge more from neighboring tokens, indicating a smooth transition, with increasing complexity and variability, from distant to nearby context. Motivated by this observation, we propose TreeKV, an intuitive, training-free method that employs a tree structure for smooth cache compression. TreeKV maintains a fixed cache size, allowing LLMs to deliver high-quality output in long-text scenarios, and is applicable during both the generation and prefilling stages. TreeKV consistently surpasses all baseline models in language modeling tasks on PG19 and OpenWebText2, allowing LLMs trained with a short context window to generalize to longer windows with a 16x cache reduction. On the LongBench benchmark, TreeKV achieves the best performance with only 6% of the budget at optimal efficiency.
1356: ADC-GS: Anchor-Driven Deformable and Compressed Gaussian Splatting for Dynamic Scene Reconstruction
Authors: He Huang, Qi Yang, Mufan Liu, Yiling Xu, Zhu Li
Location: Guangzhou | Day: TBD
Show Abstract
Existing 4D Gaussian Splatting methods rely on per-Gaussian deformation from a canonical space to target frames, which overlooks redundancy among adjacent Gaussian primitives and results in suboptimal performance. To address this limitation, we propose Anchor-Driven Deformable and Compressed Gaussian Splatting (ADC-GS), a compact and efficient representation for dynamic scene reconstruction. Specifically, ADC-GS organizes Gaussian primitives into an anchor-based structure within the canonical space, enhanced by a temporal significance-based anchor refinement strategy. To reduce deformation redundancy, ADC-GS introduces a hierarchical coarse-to-fine pipeline that captures motions at varying granularities. Moreover, a rate-distortion optimization is adopted to achieve an optimal balance between bitrate consumption and representation fidelity. Experimental results demonstrate that ADC-GS outperforms per-Gaussian deformation approaches in rendering speed by 300%-800% while achieving state-of-the-art storage efficiency without compromising rendering quality. The code is released at https://github.com/H-Huang774/ADC-GS.git.
1358: Multimodal Inference with Incremental Tabular Attributes
Authors: Xinda Chen, Zhen Xing, Zixian Zhang, Weimin Tan, Bo Yan
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal learning with visual and tabular modalities has become increasingly popular, especially in healthcare. As new equipment is adopted or new factors are introduced, the tabular modality keeps changing. However, the standard process of training multimodal AI models requires tables to have fixed columns at training and inference time, making it unsuitable for handling dynamically changing tables. Therefore, new methods are needed to handle such tables efficiently in multimodal learning. In this paper, we introduce a new task, multimodal inference with incremental tabular attributes, which aims to enable trained multimodal models to efficiently leverage incremental attributes in the tabular modality during inference. We implement a specialized encoder that disentangles the latent representation of incremental tabular attributes, both internally and with respect to the old attributes, to reduce information redundancy, and further aligns the incremental attributes with the visual modality via a consistency loss to improve information richness. Experimental results across five public datasets show that our method effectively utilizes incremental tabular attributes, achieving state-of-the-art performance in general scenarios. Beyond inference, we also find that our method achieves better performance in fully supervised settings, suggesting a new training style for multimodal learning with tables.
1369: Beyond Statistical Analysis: Multimodal Framework for Time Series Forecasting with LLM-Driven Temporal Pattern
Authors: Jiahong Xiong, Chengsen Wang, Haifeng Sun, Yuhan Jing, Qi Qi, Zirui Zhuang, Lei Zhang, Jianxin Liao, Jingyu Wang
Location: Guangzhou | Day: TBD
Show Abstract
Accurate forecasting of time series is crucial for many real-world applications. Conventional methods rely primarily on statistical analysis of historical data, often leading to overfitting and failing to account for background information and constraints imposed by external events. Introducing large language models (LLMs) with robust textual capabilities therefore holds significant potential. However, due to their inherent limitations in handling numerical data, LLMs do not exhibit advantages in precise numerical prediction tasks. We therefore propose a framework that integrates LLMs with conventional methods synergistically. Rather than directly outputting numerical predictions, we leverage the LLM to generate textual temporal patterns, fully utilizing its inherent knowledge and reasoning abilities. Additionally, we introduce a memory network designed to decode these textual representations into a format that numerical models can effectively interpret. This approach not only capitalizes on the LLM’s strengths in text processing but also bridges the gap between textual and numerical data, enhancing the overall predictive performance of the model. Our experimental results demonstrate the framework’s effectiveness, achieving state-of-the-art performance on various benchmark datasets.
1377: Prototype-guided Knowledge Propagation with Adaptive Learning for Lifelong Person Re-identification
Authors: Zhijie Lu, Wuxuan Shi, He Li, Mang Ye
Location: Guangzhou | Day: TBD
Show Abstract
Lifelong Person Re-identification (LReID) is essential in dynamic camera networks, which continually adapts to new environments while preserving previously acquired knowledge. Existing LReID techniques often preserve samples from past datasets to maintain old knowledge, potentially leading to privacy risks. While prototype-based methods offer privacy advantages, current approaches primarily focus on adjusting classifiers for image classification tasks, neglecting representation biases between old and new identities in person re-identification. This study introduces a novel Prototype-guided Knowledge Propagation (PKP) method, which mitigates discrepancies in similar identity images between old and new tasks by guiding prototype construction through triplet loss constraints. Additionally, to address disparities between prototypes and the updated feature extractor, an Adaptive Parameter Evolution (APE) strategy is proposed. APE optimizes the integration of the old and new models by assessing the importance of the new tasks, dynamically selecting the most pertinent parameters for updates according to their contribution to the current task. Extensive experiments on the LReID benchmark demonstrate that our approach surpasses state-of-the-art prototype-based LReID methods in terms of mAP and rank-1 accuracy. Code is available at https://github.com/joyner-7/IJCAI2025-PKA.
1386: Reliable and Calibrated Semantic Occupancy Prediction by Hybrid Uncertainty Learning
Authors: Song Wang, Zhongdao Wang, Jiawei Yu, Wentong Li, Bailan Feng, Junbo Chen, Jianke Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Vision-centric semantic occupancy prediction plays a crucial role in autonomous driving, which requires accurate and reliable predictions from low-cost sensors. Although camera-based models have notably narrowed the accuracy gap with LiDAR, little research has explored the reliability and calibration of predicting semantic occupancy from cameras. In this paper, we conduct a comprehensive evaluation of existing semantic occupancy prediction models from a reliability perspective for the first time. Despite the gradual alignment of camera-based models with LiDAR in terms of accuracy, a significant reliability gap still persists. To address this concern, we propose ReliOcc, a method designed to enhance the reliability of camera-based occupancy networks. ReliOcc provides a plug-and-play scheme for existing models, which integrates hybrid uncertainty from individual voxels with sampling-based noise and from relative voxels through mix-up learning. Besides, an uncertainty-aware calibration strategy is devised to further improve model reliability in offline mode. Extensive experiments under various settings demonstrate that ReliOcc significantly enhances the reliability of the learned model while maintaining accuracy for both geometric and semantic predictions. Notably, our proposed approach exhibits robustness to sensor failures and out-of-domain noise during inference.
1393: Transferable Relativistic Predictor: Mitigating Cross-Task Cold-Start Issue in NAS
Authors: Nan Li, Bing Xue, Lianbo Ma, Mengjie Zhang
Location: Guangzhou | Day: TBD
Show Abstract
In neural architecture search (NAS), the relativistic predictor has recently emerged as an attractive technique to solve the ranking issue in performance evaluation by predicting the relativistic ranking of an architecture pair rather than the absolute performance of an architecture. However, it suffers from a significant cold-start issue, requiring a large number of evaluated architectures to train an effective predictor on new datasets. In this paper, we propose a transferable relativistic predictor (TRP). Specifically, we construct a proxy dataset using transferable, cheaper-to-obtain performance estimations to softly label the rank between architecture pairs. The soft label, combined with a smooth and easy-to-optimize loss function, facilitates the learning of expressive and generalizable representations on the proxy dataset. Furthermore, we construct a Chebyshev interpolation for the correlation curve to adaptively determine the number of evaluated architectures required on each dataset. Extensive experimental results in different search spaces show the superior performance of TRP compared with state-of-the-art predictors. TRP requires only 54 and 73 evaluated architectures for a warm start on CIFAR-10 and CIFAR-100, respectively, under the DARTS search space.
1406: Few-shot Novel Category Discovery
Authors: Chunming Li, Shidong Wang, Haofeng Zhang
Location: Guangzhou | Day: TBD
Show Abstract
The transductive learning paradigm adopted by the recently proposed Novel Category Discovery (NCD) hinders its application in more realistic scenarios. In fact, a few labeled samples from some of the new categories can greatly alleviate this burden, which matches how easily people can label a handful of new-category examples. Therefore, this paper presents a new setting in which a trained agent can flexibly switch between identifying examples of known (labelled) classes and clustering novel (completely unlabeled) classes as the number of query examples increases, by leveraging knowledge learned from only a few (handful of) support examples. Drawing inspiration from the discovery of novel categories using prior-based clustering algorithms, we introduce a novel framework that further relaxes their assumptions to the real-world open-set level by unifying the concept of model adaptability in few-shot learning. We refer to this setting as Few-Shot Novel Category Discovery (FSNCD) and propose Semi-supervised Hierarchical Clustering (SHC) and Uncertainty-aware K-means Clustering (UKC) to examine the model's reasoning capabilities. Extensive experiments and detailed analysis on five commonly used datasets demonstrate that our methods achieve leading performance across different task settings and scenarios. Code is available at: https://github.com/Ashengl/FSNCD.
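One way to picture uncertainty-aware clustering of queries is the toy sketch below: queries that confidently match a known-class centroid are assigned to it, and the rest are clustered as novel classes with plain Lloyd iterations. The threshold, shapes, and names are illustrative assumptions, not the paper's UKC.

```python
# Toy sketch, assuming enough queries fall below the confidence threshold to
# seed the novel clusters; this is not the FSNCD implementation.
import numpy as np

def uncertainty_aware_kmeans(queries, known_centroids, tau=0.7, n_novel=3, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Cosine similarity to known-class centroids serves as a confidence proxy.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = known_centroids / np.linalg.norm(known_centroids, axis=1, keepdims=True)
    conf = (q @ c.T).max(axis=1)
    uncertain = queries[conf < tau]               # candidates for novel classes
    centers = uncertain[rng.choice(len(uncertain), n_novel, replace=False)]
    for _ in range(iters):                        # plain Lloyd iterations
        d = ((uncertain[:, None] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        centers = np.stack([uncertain[assign == k].mean(0) if (assign == k).any()
                            else centers[k] for k in range(n_novel)])
    # Returns: known-class mask over all queries, novel assignments, novel centers.
    return conf >= tau, assign, centers
```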
1407: A Medical Image Classification Network Based on Multi-View Consistent Momentum Contrastive Learning
Authors: Chuangui Cao, Shifei Ding, Lili Guo
Location: Guangzhou | Day: TBD
Show Abstract
Due to variations in imaging conditions, images often exhibit discrepancies in color reproduction. Furthermore, motion-induced blur can lead to edge degradation, making color sensitivity and edge blurriness two prevalent and challenging issues in both natural image processing and medical image analysis. To address these challenges, we propose a model termed the Three-View Consistency Momentum Contrastive with Sobel Operator (SVCMC). Specifically, we first design a three-view momentum-update architecture that employs a Sobel-augmented ResNet as the backbone. We then introduce a novel contrastive loss, referred to as the Three-View Consistency Momentum Contrastive Loss. Next, to mitigate the oscillations and slow convergence commonly observed in contrastive learning, we construct a dynamic contrastive loss function that adapts in real time over the training process. Finally, we validate the superiority of our model on two medical image datasets and one natural image dataset, where it significantly outperforms existing state-of-the-art contrastive models in classification accuracy and convergence speed.
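Two of the stated ingredients, a Sobel-filtered view and a momentum-updated encoder, can be sketched as follows; this is a generic rendering of those components, not the SVCMC code.

```python
# Generic sketch of a Sobel edge view and an EMA (momentum) encoder update,
# two standard building blocks consistent with the description above.
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])

def sobel_view(img):
    """img: (B, C, H, W) -> edge-magnitude view of the same shape."""
    kx = SOBEL_X.to(img).view(1, 1, 3, 3).repeat(img.size(1), 1, 1, 1)
    ky = SOBEL_X.t().to(img).view(1, 1, 3, 3).repeat(img.size(1), 1, 1, 1)
    gx = F.conv2d(img, kx, padding=1, groups=img.size(1))  # depthwise gradients
    gy = F.conv2d(img, ky, padding=1, groups=img.size(1))
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

@torch.no_grad()
def momentum_update(query_enc, key_enc, m=0.999):
    # EMA update: the key encoder slowly tracks the query encoder.
    for pq, pk in zip(query_enc.parameters(), key_enc.parameters()):
        pk.mul_(m).add_(pq, alpha=1.0 - m)
```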
1408: Enhancing User-Oriented Proactivity in Open-Domain Dialogues with Critic Guidance
Authors: Yufeng Wang, Jinwu Hu, Ziteng Huang, Kunyang Lin, Zitian Zhang, Peihao Chen, Yu Hu, Qianyue Wang, Zhuliang Yu, Bin Sun, Xiaofen Xing, Qingfang Zheng, Mingkui Tan
Location: Guangzhou | Day: TBD
Show Abstract
Open-domain dialogue systems aim to generate natural and engaging conversations, providing significant practical value in real applications such as social robotics and personal assistants. The advent of large language models (LLMs) has greatly advanced this field by improving context understanding and conversational fluency. However, existing LLM-based dialogue systems often fall short in proactively understanding a user's chatting preferences and guiding conversations toward user-centered topics. This lack of user-oriented proactivity can leave users feeling unappreciated, reducing their satisfaction and willingness to continue the conversation in human-computer interactions. To address this issue, we propose a User-oriented Proactive Chatbot (UPC) to enhance user-oriented proactivity. Specifically, inspired by the LLM-as-a-judge strategy, we first construct a critic to evaluate this proactivity. Given the scarcity of high-quality training data, we then employ the critic to guide dialogues between the chatbot and user agents, generating a corpus with enhanced user-oriented proactivity. To ensure diverse user backgrounds, we introduce ISCO-800, a diverse user-background dataset for constructing user agents. Moreover, considering that communication difficulty varies among users, we propose an iterative curriculum learning method that trains the chatbot from easy-to-communicate users to more challenging ones, thereby gradually enhancing its performance. Experiments demonstrate that our proposed training method is applicable to different LLMs, improving user-oriented proactivity and attractiveness in open-domain dialogues. Code and appendix are available at github.com/wang678/LLM-UPC.
1412: Multi-player Multi-armed Bandits with Delayed Feedback
Authors: Jingqi Fan, Zilong Wang, Shuai Li, Linghe Kong
Location: Guangzhou | Day: TBD
Show Abstract
Multi-player multi-armed bandits (MP-MAB) have been extensively studied due to their application in cognitive radio networks. In this setting, multiple players simultaneously select arms and instantly receive feedback. However, in realistic decentralized networks, feedback is often delayed due to sensing latency and signal processing. Without a central coordinator, explicit communication is impossible, and delayed feedback disrupts implicit coordination, which depends on synchronous observations. As a result, collisions are frequent and system performance degrades significantly. In this paper, we propose an algorithm for MP-MAB with stochastic delayed feedback. Each player independently maintains an estimate of the optimal arm set based on its own delayed rewards, but only pulls arms from this set, which is, with high probability, identical to the sets of the other players, thus avoiding collisions. The identical arm set also enables implicit communication, allowing players to utilize the exploration results of others. We establish a regret upper bound and derive a lower bound to prove the algorithm is near-optimal. Numerical experiments on both synthetic and real-world datasets validate the effectiveness of our algorithm.
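For intuition on delayed bandit feedback, here is a toy single-player sketch in which rewards update the UCB statistics only after their delay elapses; the paper's multi-player coordination and arm-set estimation are omitted.

```python
# Toy sketch: UCB1 where feedback for the pull at round t arrives at t + delay.
import heapq, math, random

def delayed_ucb(n_arms, horizon, reward_fn, delay_fn):
    counts, sums, pending = [0] * n_arms, [0.0] * n_arms, []
    for t in range(1, horizon + 1):
        # Apply any feedback whose delay has expired by round t.
        while pending and pending[0][0] <= t:
            _, arm, r = heapq.heappop(pending)
            counts[arm] += 1
            sums[arm] += r
        def index(a):  # UCB1 index on the *observed* statistics only
            if counts[a] == 0:
                return float('inf')
            return sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
        arm = max(range(n_arms), key=index)
        heapq.heappush(pending, (t + delay_fn(), arm, reward_fn(arm)))
    return [sums[a] / max(counts[a], 1) for a in range(n_arms)]

# Example: Bernoulli arms with uniformly random delays of 1 to 20 rounds.
est = delayed_ucb(3, 5000,
                  lambda a: float(random.random() < [0.3, 0.5, 0.7][a]),
                  lambda: random.randint(1, 20))
```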
1420: Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation
Authors: Jie Li, Shifei Ding, Lili Guo, Xuan Li
Location: Guangzhou | Day: TBD
Show Abstract
Emotion Recognition in Conversation (ERC) aims to detect the emotion of each utterance within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted by different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations, and MAGTKD achieves state-of-the-art performance in emotion recognition. Our code is available at: https://github.com/JieLi-dd/MAGTKD.
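The distillation step can be pictured with a standard temperature-scaled KD loss, where the stronger (e.g., textual) modality teaches a weaker one; this is the classic Hinton-style recipe, shown here as a hedged stand-in for the paper's exact objective.

```python
# Standard KD loss sketch: the student (weak modality) matches the teacher's
# softened class distribution plus ordinary hard-label supervision.
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction='batchmean') * (T * T)   # rescale for temperature
    ce = F.cross_entropy(student_logits, labels)     # hard-label supervision
    return alpha * kd + (1 - alpha) * ce
```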
1423: Collaborative Multi-LoRA Experts with Achievement-based Multi-Tasks Loss for Unified Multimodal Information Extraction
Authors: Li Yuan, Yi Cai, Xudong Shen, Qing Li, Qingbao Huang, Zikun Deng, Tao Wang
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal Information Extraction (MIE) has gained attention for extracting structured information from multimedia sources. Traditional methods tackle MIE tasks separately, missing opportunities to share knowledge across tasks. Recent approaches unify these tasks into a generation problem using instruction-based T5 models with visual adaptors, optimized through full-parameter fine-tuning. However, this method is computationally intensive, and multi-task fine-tuning often faces gradient conflicts, limiting performance.
To address these challenges, we propose collaborative multi-LoRA experts with achievement-based multi-task loss (C-LoRAE) for MIE tasks. C-LoRAE extends the low-rank adaptation (LoRA) method by incorporating a universal expert to learn shared multimodal knowledge from cross-MIE tasks and task-specific experts to learn specialized instructional task features. This configuration enhances the model’s generalization ability across multiple tasks while maintaining the independence of various instruction tasks and mitigating gradient conflicts. Additionally, we propose an achievement-based multi-task loss to balance training progress across tasks, addressing the imbalance caused by varying numbers of training samples in MIE tasks. Experimental results on seven benchmark datasets across three key MIE tasks demonstrate that C-LoRAE achieves superior overall performance compared to traditional fine-tuning methods and LoRA methods while utilizing a comparable number of training parameters to LoRA.
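The universal-plus-task-specific expert layout can be sketched roughly as follows, with one always-on shared LoRA adapter and one adapter selected per task; the routing and all names are assumptions for illustration only.

```python
# Speculative sketch of "universal + task-specific" LoRA experts on a frozen
# linear layer; not the C-LoRAE implementation.
import torch
import torch.nn as nn

class LoRA(nn.Module):
    def __init__(self, d_in, d_out, rank=8, alpha=16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_out))  # zero-init: no-op at start
        self.scale = alpha / rank

    def forward(self, x):
        return (x @ self.A @ self.B) * self.scale

class MultiLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, n_tasks: int, rank=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # frozen backbone weight
        self.universal = LoRA(base.in_features, base.out_features, rank)
        self.experts = nn.ModuleList(
            LoRA(base.in_features, base.out_features, rank) for _ in range(n_tasks))

    def forward(self, x, task_id: int):
        # Shared knowledge (universal expert) + specialized knowledge (task expert).
        return self.base(x) + self.universal(x) + self.experts[task_id](x)
```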
1428: Smoothed Online Convex Optimization with Delayed Feedback
Authors: Sifan Yang, Wenhao Yang, Wei Jiang, Yuanyu Wan, Lijun Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Smoothed online convex optimization (SOCO), in which the online player incurs both a hitting cost and a switching cost for changing its decisions, has garnered significant attention in recent years. While existing studies typically assume that the gradient information is revealed immediately, this assumption may not hold in some real-world applications. To overcome this limitation, we investigate SOCO with delayed feedback and develop two online algorithms that minimize the dynamic regret with switching cost. First, we extend Mild-OGD, an existing algorithm that adopts the meta-expert framework for online convex optimization with delayed feedback, to account for the switching cost. Specifically, we analyze the switching cost in the expert-algorithm of Mild-OGD, and then modify its meta-algorithm to incorporate this cost when assigning a weight to each expert. We demonstrate that our proposed method, Smelt-DOGD, achieves an O(√(dT(P_T+1))) dynamic regret bound with switching cost, where d is the maximum delay and P_T is the path-length. Second, we develop an efficient variant that reduces the number of projections per round from O(log T) to 1 while maintaining the same theoretical guarantee. The key idea is to construct a new surrogate loss defined over a simpler domain for the expert-algorithms, so that the experts do not need to perform complex projection operations in each round. Finally, we conduct experiments to validate the effectiveness and efficiency of our algorithms.
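A stripped-down sketch of online gradient descent under delayed feedback is given below: the gradient observed at round t is applied only once it arrives d rounds later. The meta-expert construction and switching-cost analysis of the paper are not reproduced.

```python
# Minimal sketch, assuming the domain is a Euclidean ball; the gradient of each
# round is queued and consumed when its delay expires.
import numpy as np

def delayed_ogd(loss_grads, delays, dim, eta=0.1, radius=1.0):
    """loss_grads: list of functions x -> gradient; delays: per-round lags."""
    x = np.zeros(dim)
    inbox = {}                                        # arrival round -> gradients
    history = [x.copy()]
    for t, (g_fn, d) in enumerate(zip(loss_grads, delays)):
        inbox.setdefault(t + d, []).append(g_fn(x))   # feedback arrives at t + d
        for g in inbox.pop(t, []):                    # use whatever arrived now
            x = x - eta * g
            norm = np.linalg.norm(x)
            if norm > radius:                         # projection onto the ball
                x = x * (radius / norm)
        history.append(x.copy())
    return history
```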
1430: Device-Cloud Collaborative Correction for On-Device Recommendation
Authors: Tianyu Zhan, Shengyu Zhang, Zheqi Lv, Jieming Zhu, Jiwei Li, Fan Wu, Fei Wu
Location: Guangzhou | Day: TBD
Show Abstract
With the rapid development of recommendation models and device computing power, on-device recommendation has become an important research area due to its better real-time performance and privacy protection. Transformer-based sequential recommendation models have been widely applied in this field because they outperform Recurrent Neural Network (RNN)-based recommendation models. However, as the length of interaction sequences increases, Transformer-based models incur significantly more space and computational overhead than RNN-based models, posing challenges for on-device recommendation. To balance real-time performance and accuracy on devices, we propose the Device-Cloud Collaborative Correction Framework for On-Device Recommendation (CoCorrRec). CoCorrRec uses a self-correction network (SCN) to correct parameters at extremely low time cost. By updating model parameters during testing based on the input token, it achieves performance comparable to current optimal but more complex Transformer-based models. Furthermore, to prevent SCN from overfitting, we design a global correction network (GCN) that processes hidden states uploaded from devices and provides a global correction solution. Extensive experiments on multiple datasets show that CoCorrRec outperforms existing Transformer-based and RNN-based on-device recommendation models, with fewer parameters and lower FLOPs, thereby achieving a balance between real-time performance and high efficiency. Code is available at https://github.com/Yuzt-zju/CoCorrRec.
1441: TextMEF: Text-guided Prompt Learning for Multi-exposure Image Fusion
Authors: Jinyuan Liu, Qianjun Huang, Guanyao Wu, Di Wang, Zhiying Jiang, Long Ma, Risheng Liu, Xin Fan
Location: Guangzhou | Day: TBD
Show Abstract
Multi-exposure image fusion (MEF) aims to integrate a set of low dynamic range images into a single image with a higher dynamic range than any of the inputs. Despite significant advancements, current MEF approaches still struggle to handle extremely over- or under-exposed conditions, resulting in unsatisfactory visual effects such as hallucinated details and distorted color tones. In this regard, we propose TextMEF, a prompt-driven fusion method enhanced by prompt learning, for multi-exposure image fusion. Specifically, we learn a set of prompts based on text-image similarity among negative and positive samples (over-exposed, under-exposed, and well-exposed images). These learned prompts are seamlessly integrated into the loss function, providing high-level guidance for constraining non-uniformly exposed regions. Furthermore, we develop an attention Mamba module that effectively translates over-/under-exposed regional features into an exposure-invariant space and ensures they build efficient long-range dependencies to the high dynamic range image. Extensive experimental results on three publicly available benchmarks demonstrate that our TextMEF significantly outperforms state-of-the-art approaches in both visual inspection and objective analysis.
1445: Dual-Balancing for Physics-Informed Neural Networks
Authors: Chenhong Zhou, Jie Chen, Zaifeng Yang, Ching Eng Png
Location: Guangzhou | Day: TBD
Show Abstract
Physics-informed neural networks (PINNs) have emerged as a new learning paradigm for solving partial differential equations (PDEs) by enforcing the constraints of physical equations, boundary conditions (BCs), and initial conditions (ICs) in the loss function. Despite their successes, vanilla PINNs still suffer from poor accuracy and slow convergence due to the intractable multi-objective optimization issue. In this paper, we propose a novel Dual-Balanced PINN (DB-PINN), which dynamically adjusts loss weights by integrating inter-balancing and intra-balancing to alleviate two imbalance issues in PINNs. Inter-balancing aims to mitigate the gradient imbalance between the PDE residual loss and the condition-fitting losses by determining an aggregated weight that offsets their gradient distribution discrepancies. Intra-balancing acts on the condition-fitting losses to tackle the imbalance in fitting difficulty across diverse conditions. By evaluating the fitting difficulty from the loss records, intra-balancing allocates the aggregated weight proportionally to each condition loss according to its fitting difficulty level. We further introduce a robust weight update strategy to prevent abrupt spikes and arithmetic overflow in instantaneous weight values caused by large loss variances, enabling smooth weight updating and stable training. Extensive experiments demonstrate that DB-PINN significantly outperforms popular gradient-based weighting methods in both convergence speed and prediction accuracy. Our code and supplementary material are available at https://github.com/chenhong-zhou/DualBalanced-PINNs.
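The inter-balancing idea can be approximated with a generic gradient-norm weighting scheme, sketched below: each condition-fitting loss is re-weighted so that its gradient magnitude is commensurate with the PDE residual's. This follows the common recipe for gradient-based loss balancing, not DB-PINN's exact update.

```python
# Generic gradient-norm loss weighting for a PINN; assumes every loss depends
# on the model parameters. Illustrative only.
import torch

def balanced_weights(model, loss_pde, cond_losses, eps=1e-8):
    g_pde = torch.autograd.grad(loss_pde, list(model.parameters()),
                                retain_graph=True, allow_unused=True)
    pde_norm = torch.sqrt(sum((g ** 2).sum() for g in g_pde if g is not None))
    weights = []
    for lc in cond_losses:                     # BC/IC fitting losses
        g = torch.autograd.grad(lc, list(model.parameters()),
                                retain_graph=True, allow_unused=True)
        n = torch.sqrt(sum((x ** 2).sum() for x in g if x is not None))
        weights.append((pde_norm / (n + eps)).detach())
    # Total loss would then be: loss_pde + sum(w * lc for w, lc in zip(weights, cond_losses))
    return weights
```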
1451: Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization
Authors: Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Jiang Zong, Hao Peng, Jianwei Yin
Location: Guangzhou | Day: TBD
Show Abstract
Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches often fail to compute "multi-stage" influence and lack scalability to billion-scale LLMs.

In this paper, we propose multi-stage influence functions to attribute the downstream predictions of fine-tuned LLMs to pre-training data under the full-parameter fine-tuning paradigm. To enhance the efficiency and practicality of our multi-stage influence function, we leverage Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation.
Empirical results validate the superior scalability of EK-FAC approximation and the effectiveness of our multi-stage influence function. Additionally, case studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power, with exemplars illustrating insights provided by multi-stage influence estimates.
1473: OT-DETECTOR: Delving into Optimal Transport for Zero-shot Out-of-Distribution Detection
Authors: Yu Liu, Hao Tang, Haiqi Zhang, Jing Qin, Zechao Li
Location: Guangzhou | Day: TBD
Show Abstract
Out-of-distribution (OOD) detection is crucial for ensuring the reliability and safety of machine learning models in real-world applications. While zero-shot OOD detection, which requires no training on in-distribution (ID) data, has become feasible with the emergence of vision-language models like CLIP, existing methods primarily focus on semantic matching and fail to fully capture distributional discrepancies. To address these limitations, we propose OT-DETECTOR, a novel framework that employs Optimal Transport (OT) to quantify both semantic and distributional discrepancies between test samples and ID labels. Specifically, we introduce cross-modal transport mass and transport cost as semantic-wise and distribution-wise OOD scores, respectively, enabling more robust detection of OOD samples. Additionally, we present a semantic-aware content refinement (SaCR) module, which utilizes semantic cues from ID labels to amplify the distributional discrepancy between ID and hard OOD samples. Extensive experiments on several benchmarks demonstrate that OT-DETECTOR achieves state-of-the-art performance across various OOD detection tasks, particularly in challenging hard-OOD scenarios.
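To illustrate how optimal transport can yield two OOD scores, the sketch below runs entropic (Sinkhorn) OT between a batch of image features and label text features, then reads off per-sample transported mass and transport cost; it mirrors the description loosely and is not the OT-DETECTOR implementation.

```python
# Illustrative sketch (assumptions throughout): entropic OT between image and
# label-text features, with mass and cost as two per-sample OOD scores.
import numpy as np

def sinkhorn(cost, a, b, reg=0.05, iters=200):
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(iters):                      # standard Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]          # transport plan

def ood_scores(img_feats, txt_feats):
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    cost = 1.0 - img @ txt.T                    # cosine cost matrix
    n, m = cost.shape
    plan = sinkhorn(cost, np.full(n, 1 / n), np.full(m, 1 / m))
    mass_score = plan.max(axis=1)               # mass sent to best-matching label
    cost_score = (plan * cost).sum(axis=1)      # per-sample transport cost
    return mass_score, cost_score
```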
1474: Curriculum Hierarchical Knowledge Distillation for Bias-Free Survival Prediction
Authors: Chaozhuo Li, Zhihao Tang, Mingji Zhang, Zhiquan Liu, Litian Zhang, Xi Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Survival prediction is a pivotal task for estimating mortality risk within a given timeframe based on whole slide images (WSIs). Conventional models typically assume that WSIs across patients are independent and identically distributed, an assumption that may not hold due to inherent variability in WSI preparation and the uncertain condition of infected tissues. These uncontrollable external factors introduce significant variability in the numbers and resolutions of WSIs across patients, leading to bias and compromised performance, particularly for tail patients with limited data. In this paper, we propose a novel approach, PathoKD, based on knowledge distillation. Recognizing the hierarchical nature of disease progression and the data scarcity issues associated with vanilla knowledge distillation methods, PathoKD integrates a novel curriculum learning framework with hierarchical knowledge distillation. This integration effectively mitigates the performance gap between head and tail patients, thereby enhancing prediction accuracy across patient groups. Our proposal is extensively evaluated on popular datasets, and the experimental results demonstrate its superiority.
1476: An Efficient Core-Guided Solver for Weighted Partial MaxSAT
Authors: Shiwei Pan, Yiyuan Wang, Shaowei Cai
Location: Guangzhou | Day: TBD
Show Abstract
The maximum satisfiability problem (MaxSAT) is a crucial combinatorial optimization problem with widespread applications across various critical domains. This paper presents CASHWMaxSAT, an efficient core-guided MaxSAT solver based on two novel ideas.
The first and most important idea is the introduction of an extended stratification technique that progressively focuses on solving high-weight soft clauses. Second, we integrate disjoint unsatisfiable cores with the goal of minimizing the unsatisfiable core, allowing the solver to learn multiple high-quality clauses in a single conflict-analysis step. These innovations enable our MaxSAT solver to efficiently identify key constraints and reduce redundant reasoning, significantly enhancing solving efficiency. Experimental results on benchmarks from the complete weighted track of the MaxSAT Evaluations 2022-2024 demonstrate that the proposed methods lead to substantial improvements, with CASHWMaxSAT outperforming state-of-the-art MaxSAT solvers across all benchmarks. Additionally, CASHWMaxSAT achieved the top two positions in the exact weighted category of the MaxSAT Evaluation 2024.
1480: DiffusionIMU: Diffusion-Based Inertial Navigation with Iterative Motion Refinement
Authors: Xiaoqiang Teng, Chenyang Li, Shibiao Xu, Zhihao Hao, Deke Guo, Jingyuan Li, Haisheng Li, Weiliang Meng, Xiaopeng Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Inertial navigation enables self-contained localization using only Inertial Measurement Units (IMUs), making it widely applicable in domains such as navigation, augmented reality, and robotics. However, existing methods suffer from drift accumulation due to sensor noise and difficulty capturing long-range temporal dependencies, limiting their robustness and accuracy. To address these challenges, we propose DiffusionIMU, a novel diffusion-based framework for inertial navigation. DiffusionIMU enhances direct velocity regression from IMU data through an iterative generative denoising process, progressively refining motion state estimation. It integrates noise-adaptive feature modulation to handle sensor variability, a feature alignment mechanism for representation consistency, and diffusion-based temporal modeling to reduce accumulated drift. Experiments show that DiffusionIMU consistently outperforms existing methods, demonstrating superior generalization to unseen users while alleviating the impact of sensor noise.
1484: Beyond Feature Mapping GAP: Integrating Real HDRTV Priors for Superior SDRTV-to-HDRTV Conversion
Authors: Gang He, Kepeng Xu, Li Xu, Wenxin Yu, Xianyun Wu
Location: Guangzhou | Day: TBD
Show Abstract
The rise of HDR-WCG display devices has highlighted the need to convert SDRTV to HDRTV, as most video sources are still in SDR. Existing methods primarily focus on designing neural networks to learn a single-style mapping from SDRTV to HDRTV. However, the limited information in SDRTV and the diversity of styles in real-world conversions render this process an ill-posed problem, thereby constraining the performance and generalization of these methods. Inspired by generative approaches, we propose a novel method for SDRTV to HDRTV conversion guided by real HDRTV priors. Despite the limited information in SDRTV, introducing real HDRTV as reference priors significantly constrains the solution space of the originally high-dimensional ill-posed problem. This shift transforms the task from solving an unreferenced prediction problem to making a referenced selection, thereby markedly enhancing the accuracy and reliability of the conversion process. Specifically, our approach comprises two stages: the first stage employs a Vector Quantized Generative Adversarial Network to capture HDRTV priors, while the second stage matches these priors to the input SDRTV content to recover realistic HDRTV outputs. We evaluate our method on public datasets, demonstrating its effectiveness with significant improvements in both objective and subjective metrics across real and synthetic datasets.
1499: Progressive Prefix-Memory Tuning for Complex Logical Query Answering on Knowledge Graphs
Authors: Xingrui Zhuo, Shirui Pan, Jiapu Wang, Gongqing Wu, Zan Zhang, Rui Li, Zizhong Wei, Xindong Wu
Location: Guangzhou | Day: TBD
Show Abstract
Conducting complex logical queries over knowledge graphs remains a significant challenge. Recent research has successfully leveraged Pre-trained Language Models (PLMs) to tackle Knowledge Graph Complex Query Answering (KGCQA) tasks, owing to PLMs' ability to comprehend the logical semantics of queries through context learning. However, existing PLM-based KGCQA methods usually overlook the harm of disordered syntax or fragmented contexts within a serialized query, posing an "impossible language" problem that limits PLMs in grasping logical semantics. To address this problem, we propose a Progressive Prefix-Memory Tuning (PPMT) framework for KGCQA tasks, which effectively rectifies erroneous segments in serialized queries to assist PLMs in query answering. First, we propose a prefix-memory rectification mechanism embedded in a PLM module. This mechanism assigns rectification parameters in memory stores to polish the language segments of entities, relations, and queries through specific prefixes. To further capture the logical semantics in queries, we design a progressive fine-tuning strategy, which optimizes our model through a conditional gradient update process guided by knowledge translation constraints. Extensive experiments on widely used KGCQA benchmarks demonstrate the significant superiority of PPMT in terms of HR@3 and MRR. Our codes are available at https://github.com/lazyloafer/PPMT.
1521: Causal View of Time Series Imputation: Some Identification Results on Missing Mechanism
Authors: Ruichu Cai, Kaitao Zheng, Junxian Huang, Zijian Li, Zhengming Chen, Boyan Xu, Zhifeng Hao
Location: Guangzhou | Day: TBD
Show Abstract
Time series imputation is one of the most challenging problems and has broad applications in various fields like health care and the Internet of Things. Existing methods mainly aim to model the temporal latent dependencies and the generation process of the observed time series data. In real-world scenarios, different types of missing mechanisms, like MAR (Missing At Random) and MNAR (Missing Not At Random), can occur in time series data. However, existing methods often overlook the differences among the aforementioned missing mechanisms and use a single model for time series imputation, which can easily lead to misleading results due to mechanism mismatching. In this paper, we propose a framework for the time series imputation problem that explores Different Missing Mechanisms (DMM in short) and tailors solutions accordingly. Specifically, we first analyze the data generation processes with temporal latent states and missing cause variables for different mechanisms. Subsequently, we model these generation processes via variational inference and estimate prior distributions of latent variables via a normalizing flow-based neural architecture. Furthermore, we establish identifiability results under the nonlinear independent component analysis framework to show that the latent variables are identifiable. Experimental results show that our method surpasses existing time series imputation techniques across various datasets with different missing mechanisms, demonstrating its effectiveness in real-world applications.
1532: Critical Node-aware Augmentation for Hypergraph Contrastive Learning
Authors: Zhuo Li, Yuena Lin, Yipeng Wang, Wenmao Liu, Mingliang Yu, Zhen Yang, Gengyu Lyu
Location: Guangzhou | Day: TBD
Show Abstract
Hypergraph contrastive learning enables effective representation learning for hypergraphs without requiring labels. However, existing methods typically rely on randomly deleting or replacing nodes during hypergraph augmentation, which may lead to the absence of critical nodes and further disrupt the higher-order structural relationships within augmented hypergraphs. To address this issue, we propose a Critical Node-aware hypergraph contrastive learning method, which is the first attempt to leverage hyperedge prediction to retain critical nodes and thereby maintain reliable higher-order structural relationships within augmented hypergraphs. Specifically, we first employ contrastive learning to align the augmented hypergraphs, and then generate hyperedge embeddings to characterize node representations and their structural correlations. During the hyperedge embedding encoding process, we introduce a hyperedge prediction discriminator to score these embeddings, quantifying each node's contribution in order to identify critical nodes and maintain the higher-order structural relationships within augmented hypergraphs. Compared with previous studies, our proposed method can effectively alleviate the erroneous deletion or replacement of critical nodes and steadily maintain the inherent structural relationships between the original hypergraph and the augmented hypergraphs, naturally leading to better hypergraph representations for downstream tasks. Extensive experiments on various tasks demonstrate that our method is significantly superior to state-of-the-art methods.
1533: Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models
Authors: Xin Huang, Ruibin Li, Tong Jia, Wei Zheng, Ya Wang
Location: Guangzhou | Day: TBD
Show Abstract
Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model’s discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs’ performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.
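A difficulty-adaptive margin can be rendered in a few lines, as below: negatives whose similarity approaches the positive pair's receive a larger margin. This is a hypothetical illustration of the dynamic-margin idea, not AHNPL itself.

```python
# Hypothetical dynamic-margin contrastive loss: harder negatives get a larger
# margin. Shapes and names are assumptions for illustration.
import torch
import torch.nn.functional as F

def dynamic_margin_loss(sim_pos, sim_neg, base_margin=0.2, scale=0.3):
    """sim_pos: (B,) similarity of matched image-text pairs.
    sim_neg: (B, K) similarities to K hard negatives per pair."""
    # Difficulty in (0, 1): how close each negative is to the positive score.
    difficulty = torch.sigmoid(sim_neg - sim_pos.unsqueeze(1))
    margin = base_margin + scale * difficulty        # larger margin when harder
    return F.relu(sim_neg - sim_pos.unsqueeze(1) + margin).mean()
```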
1545: Adaptive Deep Learning from Crowds
Authors: Hang Yang, Zhiwu Li, Witold Pedrycz
Location: Guangzhou | Day: TBD
Show Abstract
In the data-driven era, collecting high-quality labeled data through human labor, known as crowdsourcing, is a common approach to training data-hungry models. Recently, end-to-end learning from crowds has shown its flexibility and practicality. However, existing end-to-end works focus on learning only after labels have been collected, which yields noisy annotations and incurs unnecessary cost. Inspired by computerized adaptive testing, we argue that the characteristics of workers should be mined as early as possible to make the best use of their talents. To this end, we propose an adaptive learning-from-crowds method, AdaCrowd, as a cost-effective solution. Specifically, we propose a probabilistic model to capture the informativeness of candidate instances for each worker, where informativeness is measured by the uncertainty of the annotation prediction model's output in its current state. The adaptive learning procedure is optimized by maximizing data likelihood and can be used with existing crowdsourcing models. Extensive experiments are conducted on the real-world datasets LabelMe and CIFAR-10H. The experimental results, e.g., a reduction in annotations without performance degradation, demonstrate the effectiveness of our method.
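The adaptive routing idea can be pictured with a simple entropy criterion: for each worker, select the candidate instance on which the current annotation-prediction model is least certain. The sketch below is a toy stand-in for the paper's probabilistic informativeness model.

```python
# Toy stand-in for informativeness: pick the candidate instance with maximum
# predictive entropy under one worker's current annotation model.
import numpy as np

def pick_most_informative(probs):
    """probs: (N, C) predicted label distributions over N candidate instances;
    returns the index of the instance with the highest predictive entropy."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return int(entropy.argmax())
```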
1551: Mixture-of-Queries Transformer: Camouflaged Instance Segmentation via Queries Cooperation and Frequency Enhancement
Authors: Weiwei Feng, Nanqing Xu, Tengfei Liu, Weiqiang Wang
Location: Guangzhou | Day: TBD
Show Abstract
Due to the high similarity between camouflaged instances and their surroundings, and the prevalence of camouflage-like scenarios, the recently proposed camouflaged instance segmentation (CIS) is a challenging and relevant task. Previous approaches have achieved some progress on CIS, while many overlook the color and contour nature of camouflaged objects and then decide on each candidate instinctively. In this paper, we contribute a Mixture-of-Queries Transformer (MoQT) in an end-to-end manner for CIS, based on two key designs (a Frequency Enhancement Feature Extractor and a Mixture-of-Queries Decoder). First, the Frequency Enhancement Feature Extractor is responsible for capturing camouflaged clues in the frequency domain. To expose camouflaged instances, the extractor enhances the effectiveness of contours, eliminates interfering colors, and obtains suitable features simultaneously. Second, the Mixture-of-Queries Decoder utilizes multiple newly initialized experts of queries (a group of queries is considered an expert) in each layer to spot camouflaged characteristics cooperatively. These experts collaborate to generate outputs with the mixture-of-queries mechanism, refined hierarchically to a fine-grained level for more accurate instance masks. Coupling these two components enables MoQT to use multiple experts to integrate effective clues of camouflaged objects in both the spatial and frequency domains. Extensive experimental results demonstrate that our MoQT outperforms 19 state-of-the-art CIS approaches on both the COD10K and NC4K datasets.
1588: GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer
Authors: Yihong Lin, Zhaoxin Fan, Xianjia Wu, Lingyu Xiong, Xiandong Li, Wenxiong Kang, Liang Peng, Songju Lei, Huang Xu
Location: Guangzhou | Day: TBD
Show Abstract
Speech-driven talking head generation is a critical yet challenging task with applications in augmented reality and virtual human modeling. While recent approaches using autoregressive and diffusion-based models have achieved notable progress, they often suffer from modality inconsistencies, particularly misalignment between audio and mesh, leading to reduced motion diversity and lip-sync accuracy. To address this, we propose GLDiTalker, a novel speech-driven 3D facial animation model based on a Graph Latent Diffusion Transformer. GLDiTalker resolves modality misalignment by diffusing signals within a quantized spatiotemporal latent space. It employs a two-stage training pipeline: the Graph-Enhanced Quantized Space Learning Stage ensures lip-sync accuracy, while the Space-Time Powered Latent Diffusion Stage enhances motion diversity. Together, these stages enable GLDiTalker to generate realistic, temporally stable 3D facial animations. Extensive evaluations on standard benchmarks demonstrate that GLDiTalker outperforms existing methods, achieving superior results in both lip-sync accuracy and motion diversity.
1607: SE(3)-Equivariant Diffusion Models for 3D Object Analysis
Authors: Xie Min, Zhao Jieyu, Shen Kedi, Chen Kangxin
Location: Guangzhou | Day: TBD
Show Abstract
SE(3)-equivariance is a critical property for capturing pose information in 3D vision tasks, enabling models to handle transformations such as rotations and translations effectively. While equivariant diffusion models have recently demonstrated promise in 3D object reassembly due to their generative and denoising capabilities, they face key challenges when applied to this task. Specifically, traditional diffusion models rely on fixed input sizes, which limits their adaptability to varying part quantities, and their linear noise addition and removal processes struggle to address the inherently nonlinear transformations of 3D parts. To overcome these limitations, this paper proposes an SE(3)-equivariant diffusion model for pose denoising and 3D object reassembly from fragmented parts. The model incorporates an equivariant encoder to extract SE(3)-equivariant features, a Lie algebra mapping to linearize noise addition and removal, and an elastic diffusion framework capable of adapting to varying part quantities and nonlinear transformations. By leveraging these components, the method achieves accurate and robust pose predictions across diverse input configurations. Experiments conducted on the Breaking Bad dataset, the real-world RePAIR dataset, and a self-constructed 3D mannequin dataset demonstrate the effectiveness of the proposed model, outperforming state-of-the-art methods on metrics such as root mean square error and part accuracy. Ablation studies further validate the critical contributions of key modules, emphasizing their roles in improving accuracy and robustness for 3D part reassembly tasks.
1636: Decision-Aware Preference Modeling for Multi-Behavior Recommendation
Authors: Qingfeng Li, Wei Liu, Zaiqiao Meng, Jian Yin
Location: Guangzhou | Day: TBD
Show Abstract
In recommender systems, multi-behavior methods have demonstrated significant effectiveness in addressing issues such as data sparsity—challenges commonly encountered by traditional single-behavior recommendation methods. These methods typically infer user preferences from various auxiliary behaviors and apply them to recommendations for the target behavior. However, existing methods face challenges in uncovering the interaction patterns for different behaviors from multi-behavior implicit feedback, as users exhibit varying preference strengths for different items across behaviors. To address this issue, this paper introduces a novel approach, Decision-Aware Preference Modeling (DAPM), for multi-behavior recommendation. We first construct a behavior-agnostic graph to learn comprehensive representations that are not affected by behavior factors, complementing the behavior-specific representations. Subsequently, we introduce an innovative contrastive learning paradigm that emphasizes inter-behavior consistency and intra-behavior uniformity to alleviate the “false repulsion” problem in traditional contrastive learning. Furthermore, we propose a multi-behavior hinge loss with boundary constraints to explicitly model users’ decision boundaries across different behaviors, thereby enhancing the model’s ability to accurately capture users’ inconsistent preference intensities. Extensive experiments on three real-world datasets demonstrate the consistent improvements achieved by DAPM over thirteen state-of-the-art baselines. We release our code at https://github.com/Breeze-del/DAPM.
1646: G3PT: Unleash the Power of Autoregressive Modeling in 3D Generation via Cross-Scale Querying Transformer
Authors: Jinzhi Zhang, Feng Xiong, Guangyu Wang, Mu Xu
Location: Guangzhou | Day: TBD
Show Abstract
Autoregressive transformers have revolutionized generative models in language processing and shown substantial promise in image and video generation. However, these models face significant challenges when extended to 3D generation tasks due to their reliance on next-token prediction to learn token sequences, which is incompatible with the unordered nature of 3D data. Instead of imposing an artificial order on 3D data, in this paper, we introduce G3PT, a scalable, coarse-to-fine 3D native generative model with cross-scale vector quantization and cross-scale autoregressive modeling. The key is to map point-based 3D data into discrete tokens with different levels of detail, naturally establishing a sequential relationship across a variety of scales suitable for autoregressive modeling. Remarkably, our method connects tokens globally across different levels of detail without manually specified ordering. Benefiting from this approach, G3PT features a versatile 3D generation pipeline that effortlessly supports the generation of 3D shapes under diverse conditional modalities. Extensive experiments demonstrate that G3PT achieves superior generation quality and generalization ability compared to previous baselines. Most importantly, for the first time in 3D generation, scaling up G3PT reveals distinct power-law scaling behaviors.
1649: Picturized and Recited with Dialects: A Multimodal Chinese Representation Framework for Sentiment Analysis of Classical Chinese Poetry
Authors: Xiaocong Du, Haoyu Pei, Haipeng Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Classical Chinese poetry is a vital and enduring part of Chinese literature, conveying profound emotional resonance. Existing studies analyze sentiment based on textual meaning, overlooking the unique rhythmic and visual features inherent in poetry, especially since it is often recited and accompanied by Chinese paintings. In this work, we propose a dialect-enhanced multimodal framework for sentiment analysis of classical Chinese poetry. We extract sentence-level audio features from the poetry and incorporate audio from multiple dialects, which may retain regional ancient Chinese phonetic features, enriching the phonetic representation. Additionally, we generate sentence-level visual features, and the multimodal features are fused with textual features enhanced by LLM translation through multimodal contrastive representation learning. Our framework outperforms state-of-the-art methods on two public datasets, achieving at least a 2.51% improvement in accuracy and 1.63% in macro F1. We open-source the code to facilitate research in this area and to provide insights for general multimodal Chinese representation.
1655: BTPG: A Platform and Benchmark for Behavior Tree Planning in Everyday Service Robots
Authors: Xinglin Chen, Yishuai Cai, Minglong Li, Yunxin Mao, Zhou Yang, Wenjing Yang, Weixia Xu, Ji Wang
Location: Guangzhou | Day: TBD
Show Abstract
Behavior Trees (BTs) are a widely used control architecture in robotics, renowned for their robustness and safety, which are especially crucial for everyday service robots. Recently, several methods have been proposed to automatically plan BTs to accomplish specific tasks. However, existing research in BT planning has two main gaps: (1) the absence of a standard platform for modeling and planning BTs, along with testing benchmarks; and (2) insufficient metrics for a comprehensive evaluation of BT planning algorithms. In this paper, we propose Behavior Tree Planning Gym (BTPG), the first platform and benchmark for BT planning in everyday service robots.
In BTPG, behavior nodes are represented by predicate logic, and objects are categorized to better define the predicate domains and action models. The BT planning problem is then formulated in the STRIPS style. We support four environments and three simulators with different action models, which cover most of the needs of everyday service activities. We design a dataset generator for each environment and test three state-of-the-art BT planning algorithms, as well as one proposed by us, using various common metrics. In addition, we design three advanced metrics, planning progress, region distance, and execution robustness, to gain deeper insights into these BT planning algorithms. With a standard test benchmark, we hope BTPG can inspire and accelerate progress in the field of BT planning. Our codes are available at https://github.com/DIDS-EI/BTPG.
1685: Model Rake: A Defense Against Stealing Attacks in Split Learning
Authors: Qinbo Zhang, Xiao Yan, Yanfeng Zhao, Fangcheng Fu, Quanqing Xu, Yukai Ding, Xiaokai Zhou, Chuang Hu, Jiawei Jiang
Location: Guangzhou | Day: TBD
Show Abstract
Split learning is a prominent framework for vertical federated learning, where multiple clients collaborate with a central server to train a model by exchanging intermediate embeddings. Recently, it has been shown that an adversarial server can exploit the intermediate embeddings to train surrogate models that replace the bottom models on the clients (i.e., model stealing). The surrogate models can also be used to reconstruct private training data of the clients (i.e., data stealing).
To defend against these stealing attacks, we propose Model Rake (i.e., Rake), which runs two bottom models on each client and differentiates their output spaces to make the two models distinct. Rake hinders the stealing attacks because it is difficult for a surrogate model to approximate two distinct bottom models. We prove that, under some assumptions, the surrogate model converges to the average of the two bottom models and thus will be inaccurate. Extensive experiments show that Rake is much more effective than existing methods in defending against both model and data stealing attacks, and the accuracy of normal model training is not affected.
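A speculative sketch of the two-bottom-model setup is given below: both models embed the same private features, and a decorrelation penalty pushes their output spaces apart so that a single surrogate struggles to imitate both. The exact mechanism in Rake may differ; every name here is illustrative.

```python
# Speculative sketch (not the Rake implementation): two bottom models per
# client with a cosine decorrelation penalty between their outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBottom(nn.Module):
    def __init__(self, d_in, d_emb):
        super().__init__()
        self.f1 = nn.Sequential(nn.Linear(d_in, d_emb), nn.ReLU(), nn.Linear(d_emb, d_emb))
        self.f2 = nn.Sequential(nn.Linear(d_in, d_emb), nn.ReLU(), nn.Linear(d_emb, d_emb))

    def forward(self, x):
        z1, z2 = self.f1(x), self.f2(x)
        # Penalty is low when the two embeddings point in different directions,
        # encouraging distinct output spaces as the description suggests.
        decorrelation = F.cosine_similarity(z1, z2, dim=-1).abs().mean()
        return z1, z2, decorrelation
```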
1692: Approximately EFX and fPO Allocations for Bivalued Chores
Authors: Zehan Lin, Xiaowei Wu, Shengwei Zhou
Location: Guangzhou | Day: TBD
Show Abstract
We consider the computation of allocations of indivisible chores that are approximately EFX and fractional Pareto optimal (fPO). For bi-valued instances, in which the cost of an item to an agent is either 1 or k (where k > 1), it has been shown that 3-EFX and fPO allocations always exist and can be obtained by rounding the (fractional) earning-restricted equilibrium. In this work, we improve the approximation ratio to (2 - 1/k) while preserving fractional Pareto optimality. Instead of rounding a fractional equilibrium, our algorithm starts with the integral EF1 equilibrium for bi-valued chores and reallocates items until approximate EFX is achieved. We further improve our result for the case k = 2 and devise an algorithm that computes EFX and fPO allocations.
1698: Long-Term Individual Causal Effect Estimation via Identifiable Latent Representation Learning
Authors: Ruichu Cai, Junjie Wan, Weilin Chen, Zeqin Yang, Zijian Li, Peng Zhen, Jiecheng Guo
Location: Guangzhou | Day: TBD
Show Abstract
Estimating long-term causal effects by combining long-term observational and short-term experimental data is a crucial but challenging problem in many real-world scenarios. Existing methods rely on several idealized assumptions, e.g., the latent unconfoundedness assumption or the additive equi-confounding bias assumption, to address the latent confounder problem raised by the observational data. However, in real-world applications these assumptions are typically violated, which limits their practical effectiveness. In this paper, we tackle the problem of estimating long-term individual causal effects without the aforementioned assumptions. Specifically, we propose to utilize the natural heterogeneity of data, such as data from multiple sources, to identify latent confounders, thereby avoiding reliance on idealized assumptions. Practically, we devise a latent representation learning-based estimator of long-term causal effects. Theoretically, we establish the identifiability of the latent confounders, with which we further achieve long-term effect identification. Extensive experimental studies, conducted on multiple synthetic and semi-synthetic datasets, demonstrate the effectiveness of our proposed method.
1699: Wave-wise Discriminative Tracking by Phase-Amplitude Separation, Augmentation and Mixture
Authors: Huibin Tan, Mingyu Cao, Kun Hu, Xihuai He, Zhe Wang, Hao Li, Long Lan, Mengzhu Wang
Location: Guangzhou | Day: TBD
Show Abstract
Distinguishing key features in complex visual tasks is challenging. A promising approach treats image patches (tokens) as waves: by using both phase and amplitude, it captures richer semantics and specific invariances compared to pixel-based methods, and allows feature fusion across regions for a holistic image representation. Based on this, we propose the Wave-wise Discriminative Transformer Tracker (WDT). During tracking, WDT represents features via phase-amplitude separation, augmentation, and mixture. First, we design a Mutual Exclusive Phase-Amplitude Extractor (MEPAE) to separate phase and amplitude features with distinct semantics, representing spatial target information and background brightness, respectively. Then, wave-wise feature augmentation is carried out with two submodules: Phase-Amplitude Feature Augmentation and Mixture. The augmentation module perturbs the separated features within the same batch, and the mixture module recombines them to generate positive and negative waves; the original features are aggregated into the original wave. Positive waves share the same phase but differ in amplitude, while negative waves have different phase components. Finally, self-supervised and tracking-supervised losses guide global and local representation learning for the original, positive, and negative waves, enhancing wave-level discrimination. Experiments on five benchmarks demonstrate the effectiveness of our method.
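The wave view of tokens can be made concrete with an FFT-based sketch: split each token's spectrum into phase and amplitude, perturb the amplitude to form a positive view, and recombine. This is purely illustrative; the paper's extractor and augmentation modules are more elaborate.

```python
# Illustrative sketch of phase-amplitude separation and recombination over
# token features; names and the noise model are assumptions.
import torch

def split_phase_amplitude(tokens):
    """tokens: (B, N, D) real-valued token features."""
    spec = torch.fft.fft(tokens, dim=-1)
    return torch.angle(spec), torch.abs(spec)

def recombine(phase, amplitude):
    spec = amplitude * torch.exp(1j * phase)
    return torch.fft.ifft(spec, dim=-1).real

def make_positive_wave(tokens, amp_noise=0.1):
    # Same phase, perturbed amplitude -> a "positive" view per the description.
    phase, amp = split_phase_amplitude(tokens)
    amp = amp * (1 + amp_noise * torch.randn_like(amp))
    return recombine(phase, amp)
```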
1714: Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition
Authors: Xiaogang Xu, Kun Zhou, Tao Hu, Jiafei Wu, Ruixing Wang, Hao Peng, Bei Yu
Location: Guangzhou | Day: TBD
Show Abstract
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. In this paper, we present an innovative video decomposition strategy that incorporates view-independent and view-dependent components to enhance the performance of LLVE. We leverage dynamic cross-frame correspondences for the view-independent term (which primarily captures intrinsic appearance) and impose a scene-level continuity constraint on the view-dependent term (which mainly describes the shading condition) to achieve consistent and satisfactory decomposition results. To further ensure consistent decomposition, we introduce a dual-structure enhancement network featuring a cross-frame interaction mechanism. By supervising different frames simultaneously, this network encourages them to exhibit matching decomposition features. This mechanism can seamlessly integrate with encoder-decoder single-frame networks, incurring minimal additional parameter costs. Extensive experiments are conducted on widely recognized LLVE benchmarks, covering diverse scenarios. Our framework consistently outperforms existing methods, establishing a new SOTA performance.
1726: Egocentric Object-Interaction Anticipation with Retentive and Predictive Learning
Authors: Guo Chen, Yifei Huang, Yin-dong Zheng, Yicheng Liu, Jiahao Wang, Tong Lu
Location: Guangzhou | Day: TBD
Show Abstract
Egocentric object-interaction anticipation is critical for applications like augmented reality and robotics, but existing methods struggle with misaligned egocentric encoding, insufficient supervision, and underutilized historical context. These limitations stem from a lack of focus on retention, i.e., retaining long-term object-centric interactions, and prediction, i.e., future-centric encoding and future uncertainty modeling. We introduce EgoAnticipator, a novel Retentive and Predictive Learning framework that addresses these challenges. Our approach combines retentive pre-training for domain-specific encoding, predictive pre-training for future uncertainty modeling, and mirror distillation to transfer future-informed knowledge. Additionally, we propose long-term memory prompting to integrate historical interaction cues. We evaluate the effectiveness of our framework using the Ego4D short-term object interaction anticipation benchmark, covering both STAv1 and STAv2. Extensive experiments demonstrate that our framework outperforms existing methods, while ablation studies highlight the effectiveness of each design inside our retentive and predictive learning framework.
1738: A Novel Local Search Algorithm for the Vertex Bisection Minimization Problem
Authors: Rui Sun, Xinyu Wang, Yiyuan Wang, Jiangnan Li, Yi Zhou
Location: Guangzhou | Day: TBD
Show Abstract
The vertex bisection minimization problem (VBMP) is a fundamental graph partitioning problem with numerous real-world applications. In this study, we propose a (k, l, S)-cluster guided local search algorithm to address this challenge.
First, we propose a novel (k, l, S)-cluster enumeration procedure, which is based on two key concepts: the (k, l, S)-cluster and the local cluster core. The (k, l, S)-cluster limits both the connectivity and distinct boundaries of a given vertex set, and the local cluster core represents the most cohesive substructure within a (k, l, S)-cluster. Building on this (k, l, S)-cluster enumeration procedure, we present a novel (k, l, S)-cluster guided perturbation mechanism designed to escape from local optima.
Next, we propose a two-manner local search procedure that employs two distinct search models to explore the neighboring search space efficiently. Experimental results demonstrate that the proposed algorithm performs best on nearly all instances.
1748: Few-Shot Incremental Multi-modal Learning via Touch Guidance and Imaginary Vision Synthesis
Authors: Lina Wei, Yuhang Ma, Zhongsheng Lin, Fangfang Wang, Canghong Jin, Hanbin Zhao, Dapeng Chen
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal perception, which integrates vision and touch, is increasingly demonstrating its significance in domains such as embodied intelligence and human-computer interaction. However, in open-world scenarios, multimodal data streams face significant challenges during few-shot class incremental learning (FSCIL), including catastrophic forgetting and overfitting, leading to severe degradation in model performance. In this work, we propose a novel approach named Few-Shot Incremental Multi-modal Learning via Touch Guidance and Imaginary Vision Synthesis (TIFS). Our method leverages imaginary vision synthesis to enhance semantic understanding and integrates touch-vision fusion to mitigate the problem of modal imbalance. Specifically, we introduce a framework that employs touch-guided vision information for cross-modal contrastive learning to address the challenges of few-shot learning. Additionally, we incorporate multiple learning mechanisms, including regularization, memory mechanisms, and attention mechanisms, to mitigate catastrophic forgetting across incremental steps. Experimental results on the Touch and Go and VisGel datasets demonstrate that the TIFS framework exhibits robust continual learning capability and strong generalization in touch-vision few-shot incremental learning tasks. Our code is available at https://github.com/Vision-Multimodal-Lab-HZCU/TIFS.
1755: Backdoor Attack on Vertical Federated Graph Neural Network Learning
Authors: Jirui Yang, Peng Chen, Zhihui Lu, Jianping Zeng, Qiang Duan, Xin Du, Ruijun Deng
Location: Guangzhou | Day: TBD
Show Abstract
Federated Graph Neural Networks (FedGNNs) integrate federated learning (FL) with graph neural networks (GNNs) to enable privacy-preserving training on distributed graph data. Vertical Federated Graph Neural Network (VFGNN), a key branch of FedGNN, handles scenarios where data features and labels are distributed among participants. Despite the robust privacy-preserving design of VFGNN, we find that it still faces the risk of backdoor attacks, even when labels are inaccessible. This paper proposes BVG, a novel backdoor attack method that leverages multi-hop triggers and backdoor retention, requiring only four target-class nodes to execute effective attacks. Experimental results demonstrate that BVG achieves nearly 100% attack success rates across three commonly used datasets and three GNN models, with minimal impact on main-task accuracy. We also evaluate various defense methods, and BVG maintains high attack effectiveness even under existing defenses. This finding highlights the need for advanced defense mechanisms to counter sophisticated backdoor attacks in practical VFGNN applications.
1774: RLBCD: Residual-guided Latent Brownian-bridge Co-Diffusion for Anatomical-to-Metabolic Image Synthesis
Authors: Tianxu Lv, Hongnian Tian, Jiansong Fan, Yuan Liu, Lihua Li, Xiang Pan
Location: Guangzhou | Day: TBD
Show Abstract
While metabolic imaging can facilitate early diagnosis by revealing physiological changes of lesions, it is limited by high cost, high radiation risk, and potential renal impairment. Thus, an effective approach for Anatomical-to-Metabolic Image Synthesis (A2MIS) is urgently needed. However, existing methods are heavily hindered by the gap between distinct domains, and fail to provide a confidence score for the synthesized images, severely restricting their clinical applications. Here, we propose a novel Residual-guided Latent Brownian-bridge Co-Diffusion (RLBCD) model for A2MIS. Specifically, RLBCD starts with a co-diffusion process that leverages a residual diffusion branch to capture inter-domain differences, which are injected into an enhanced diffusion branch to maximally reconstruct modality-specific details. Furthermore, to exploit the desired residual guidance, we investigate the encoder and decoder features in diffusion models, and accordingly design a Hybrid-Granularity Fusion to integrate consistent semantics and complementary information for fine-grained reconstruction. Additionally, a latent consistency score is developed to enhance the restoration of modality-specific information, which also serves as an indicator of the inherent confidence of the synthesized images. Extensive experiments conducted on five public and in-house datasets demonstrate that RLBCD not only outperforms state-of-the-art methods for A2MIS, but is also valuable for downstream clinical applications.
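Illustrative sketch (not from the paper): a Brownian bridge pinned at the anatomical latent (t = 0) and the metabolic latent (t = T) can be sampled in closed form; the mixing and variance schedule below follow the common Brownian-bridge diffusion parameterization, and the residual branch, hybrid-granularity fusion, and confidence score of RLBCD are not reproduced.

```python
# Closed-form sample from a Brownian bridge between two latents
# (hypothetical stand-in for the paper's co-diffusion process).
import math
import torch

def brownian_bridge_sample(x0, xT, t, T):
    """x0, xT: endpoint latents; t: scalar step in [0, T]."""
    m = t / T                          # mixing coefficient along the bridge
    var = 2.0 * m * (1.0 - m)          # zero at both endpoints, peaks mid-bridge
    return (1.0 - m) * x0 + m * xT + math.sqrt(var) * torch.randn_like(x0)
```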
1788: Projection, Interaction and Fusion: A Progressive Difference Fusion Network for Salient Object Detection
Authors: Xiao Ke, Weijie Zhou, Yuzhen Niu
Location: Guangzhou | Day: TBD
Show Abstract
In recent years, deep learning-based Salient Object Detection (SOD) methods have made tremendous progress; however, their performance in complex scenarios has reached a bottleneck. In this paper, we propose a novel Progressive Difference Fusion Network (PDFNet) based on fine-grained feature fusion. First, to address the scale variability of salient objects, we introduce a Self-Guided Module (SGM) with dynamic receptive fields. Second, to tackle the shape variability of salient objects, we design a Feature Aggregation Module (FAM) incorporating cross convolutions and a feedback loop. Finally, to alleviate the issue of confusion between global and detail information during multi-scale feature fusion in existing models, we develop a Progressive Difference Fusion Unit (PDFU) to project multi-scale features into fine-grained nodes and enhance them through node interaction based on difference features. Additionally, we propose a Conditional Random Field Based on Patch (CRFbp), which focuses on handling discrete points, further improving the model’s performance. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance on five benchmark datasets. Code is available at: https://github.com/pdfnet2025/PDFNet.git.
1790: Recalling The Forgotten Class Memberships: Unlearned Models Can Be Noisy Labelers to Leak Privacy
Authors: Zhihao Sui, Liang Hu, Jian Cao, Dora D. Liu, Usman Naseem, Zhongyuan Lai, Qi Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Machine Unlearning (MU) technology facilitates the removal of the influence of specific data instances from trained models on request. Despite rapid advancements in MU technology, its vulnerabilities are still underexplored, posing potential risks of privacy breaches through leaks of ostensibly unlearned information. The limited existing research on MU attacks requires access to original models containing private data, which violates the critical privacy-preserving objective of MU. To address this gap, we initiate the study of recalling the forgotten class memberships from unlearned models (ULMs) without requiring access to the original model. Specifically, we implement a Membership Recall Attack (MRA) framework with a teacher-student knowledge distillation architecture, where ULMs serve as noisy labelers to transfer knowledge to student models. The problem is then translated into a Learning with Noisy Labels (LNL) problem of inferring the correct labels of the forgotten instances. Extensive experiments on state-of-the-art MU methods with multiple real datasets demonstrate that the proposed MRA strategy exhibits high efficacy in recovering class memberships of unlearned instances. As a result, our study and evaluation establish a benchmark for future research on MU vulnerabilities.
1791: Progressive Modality-Adaptive Interactive Network for Multi-Modality Image Fusion
Authors: Chaowei Huang, Yaru Su, Huangbiao Xu, Xiao Ke
Location: Guangzhou | Day: TBD
Show Abstract
Multi-modality image fusion (MMIF) integrates features from distinct modalities to enhance visual quality and improve downstream task performance. However, existing methods often overlook the sparsity variations and dynamic correlations between infrared and visible images, potentially limiting the utilization of both modalities. To address these challenges, we propose the Progressive Modality-Adaptive Interactive Network (PoMAI), a novel framework that not only dynamically adapts to the sparsity and structural disparities of each modality but also enhances inter-modal correlations, thereby optimizing fusion quality. The training process consists of two stages: in the first stage, the Neighbor-Group Matching Model (NGMM) models the high sparsity of infrared features, while the Context-Aware Modeling Network (CAMN) captures rich structural details in visible features, jointly refining modality-specific characteristics for fusion. In the second stage, the Modality-Interactive Compensation Module (MICM) refines inter-modal correlations via a dynamic compensation mechanism, while the first-stage modules are frozen so that MICM focuses solely on the compensation task. Extensive experiments on benchmark datasets demonstrate that PoMAI surpasses state-of-the-art methods in fusion quality and excels in downstream tasks.
1792: Concentrate on Weakness: Mining Hard Prototypes for Few-Shot Medical Image Segmentation
Authors: Jianchao Jiang, Haofeng Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Few-Shot Medical Image Segmentation (FSMIS) has been widely used to train models that can perform segmentation from only a few annotated images. However, most existing prototype-based FSMIS methods generate multiple prototypes from the support image solely by random sampling or local averaging, which can cause particularly severe boundary blurring because normal features tend to account for the majority of the features of a given category. Consequently, we propose to pay more attention to the weaker features that are crucial for a clear segmentation boundary. Specifically, we design a Support Self-Prediction (SSP) module to identify such weak features by comparing the true support mask with the one predicted from the global support prototype. Then, a Hard Prototypes Generation (HPG) module is employed to generate multiple hard prototypes based on these weak features. Subsequently, a Multiple Similarity Maps Fusion (MSMF) module is devised to generate the final segmentation mask in a dual-path fashion, mitigating the imbalance between foreground and background in medical images. Furthermore, we introduce a boundary loss to further constrain the edges of the segmentation. Extensive experiments on three publicly available medical image datasets demonstrate that our method achieves state-of-the-art performance. Code is available at https://github.com/jcjiang99/CoW.
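Illustrative sketch (not from the paper): the Support Self-Prediction step can be read as "predict the support mask from the global prototype and harvest the disagreement region as weak features"; the threshold and shapes below are assumptions.

```python
# Identify weak foreground features missed by the global prototype
# (hypothetical simplification of the SSP module).
import torch
import torch.nn.functional as F

def find_weak_features(feat, mask, thresh=0.5):
    """feat: (C, H, W) support features; mask: (H, W) binary support mask."""
    mask = mask.float()
    # Global support prototype via masked average pooling.
    proto = (feat * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1.0)   # (C,)
    sim = F.cosine_similarity(feat, proto[:, None, None], dim=0)        # (H, W)
    pred = (sim > thresh).float()            # self-predicted support mask
    weak = (mask == 1) & (pred == 0)         # foreground the prototype misses
    return feat[:, weak]                     # (C, N_weak) weak feature set
```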
1793: FGeo-HyperGNet: Geometric Problem Solving Integrating FormalGeo Symbolic System and Hypergraph Neural Network
Authors: Xiaokai Zhang, Yang Li, Na Zhu, Cheng Qin, Zhenbing Zeng, Tuo Leng
Location: Guangzhou | Day: TBD
Show Abstract
Geometric problem solving has always been a long-standing challenge in the fields of mathematical reasoning and artificial intelligence. We built a neural-symbolic system, called FGeo-HyperGNet, to automatically perform human-like geometric problem solving. The symbolic component is a formal system built on FormalGeo, which can automatically perform geometric relational reasoning and algebraic calculations and organize the solution into a hypergraph with conditions as hypernodes and theorems as hyperedges. The neural component, called HyperGNet, is a hypergraph neural network based on the attention mechanism, including an encoder to effectively encode the structural and semantic information of the hypergraph and a theorem predictor to provide guidance in solving problems. The neural component predicts theorems according to the hypergraph, and the symbolic component applies theorems and updates the hypergraph, thus forming a predict-apply cycle to ultimately achieve readable and traceable automatic solving of geometric problems. Experiments demonstrate the correctness and effectiveness of this neural-symbolic architecture. We achieved state-of-the-art results with a TPA of 93.50% and a PSSR of 88.36% on the FormalGeo7K dataset.
1801: Open-Vocabulary Fine-Grained Hand Action Detection
Authors: Ting Zhe, Mengya Han, Xiaoshuai Hao, Yong Luo, Zheng He, Xiantao Cai, Jing Zhang
Location: Guangzhou | Day: TBD
Show Abstract
In this work, we address the new challenge of open-vocabulary fine-grained hand action detection, which aims to recognize hand actions from both known and novel categories using textual descriptions. Traditional hand action detection methods are limited to closed-set detection, making it difficult for them to generalize to new, unseen hand action categories. While current open-vocabulary detection (OVD) methods are effective at detecting novel objects, they face challenges with fine-grained action recognition, particularly when data is limited and heterogeneous. This often leads to poor generalization and performance bias between base and novel categories. To address these issues, we propose a novel approach, Open-FGHA (Open-vocabulary Fine-Grained Hand Action), which learns to distinguish fine-grained features across multiple modalities from limited heterogeneous data. It then identifies optimal matching relationships among these features, enabling accurate open-vocabulary fine-grained hand action detection. Specifically, we introduce three key components: Hierarchical Heterogeneous Low-Rank Adaptation, Bidirectional Selection and Fusion Mechanism, and Cross-Modality Query Generator. These components work in unison to enhance the alignment and fusion of multimodal fine-grained features. Extensive experiments demonstrate that Open-FGHA outperforms existing OVD methods, showing its strong potential for open-vocabulary hand action detection. The source code is available at OV-FGHAD.
1806: DaringFed: A Dynamic Bayesian Persuasion Pricing for Online Federated Learning Under Two-sided Incomplete Information
Authors: Yun Xin, Jianfeng Lu, Shuqin Cao, Gang Li, Haozhao Wang, Guanghui Wen
Location: Guangzhou | Day: TBD
Show Abstract
Online Federated Learning (OFL) is a real-time learning paradigm that sequentially executes parameter aggregation immediately for each randomly arriving client. To motivate clients to participate in OFL, it is crucial to offer appropriate incentives to offset the training resource consumption. However, the design of incentive mechanisms in OFL is constrained by the dynamic variability of Two-sided Incomplete Information (TII) concerning resources, where the server is unaware of the clients’ dynamically changing computational resources, while clients lack knowledge of the real-time communication resources allocated by the server. To incentivize clients to participate in training by offering dynamic rewards to each arriving client, we design a novel Dynamic Bayesian persuasion pricing for online Federated learning (DaringFed) under TII. Specifically, we begin by formulating the interaction between the server and clients as a dynamic signaling and pricing allocation problem within a Bayesian persuasion game, and then demonstrate the existence of a unique Bayesian persuasion Nash equilibrium. By deriving the optimal design of DaringFed under one-sided incomplete information, we further analyze the approximately optimal design of DaringFed with a specific bound under TII. Finally, extensive evaluations conducted on real datasets demonstrate that DaringFed improves accuracy and convergence speed by 16.99%, while experiments on synthetic datasets validate the convergence of the estimated unknown values and the effectiveness of DaringFed in improving the server’s utility by up to 12.6%.
1808: AdaptPFL: Unlocking Cross-Device Palmprint Recognition via Adaptive Personalized Federated Learning with Feature Decoupling
Authors: Zirui Zhang, Donghai Guan, Çetin Kaya Koç, Jie Wen, Qi Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Contactless palmprint recognition has recently emerged as a promising biometric technology. However, traditional methods that require sharing user data introduce substantial security risks. While federated learning offers privacy-preserving solutions, it often compromises recognition accuracy due to feature distribution drift caused by external factors such as lighting and devices. To address this issue, we propose an adaptive personalized federated learning framework (AdaptPFL). The central innovation lies in decomposing palmprint features into identity-related and contextual-related components using a feature decoupling mechanism. This design isolates the influence of external environmental factors on identity recognition through de-entanglement. Furthermore, two adaptive aggregation strategies are introduced to correct client drift: (1) Intra-Local Adaptive Aggregation (ILAA), which addresses intra-client drift by adaptively combining the two decoupled feature types; (2) Global-Local Adaptive Aggregation (GLAA), which corrects inter-client drift by adaptively aggregating model parameters. Experimental results demonstrate that AdaptPFL achieves superior performance compared to existing state-of-the-art methods.
1815: Credit Assignment and Fine-Tuning Enhanced Reinforcement Learning for Collaborative Spatial Crowdsourcing
Authors: Wei Chen, Yafei Li, Baolong Mei, Guanglei Zhu, Jiaqi Wu, Mingliang Xu
Location: Guangzhou | Day: TBD
Show Abstract
Collaborative spatial crowdsourcing leverages distributed workers’ collective intelligence to accomplish spatial tasks. A central challenge is to efficiently assign suitable workers to collaborate on these tasks. Although mainstream reinforcement learning (RL) methods have proven effective in task allocation, they face two key obstacles: delayed reward feedback and non-stationary data distributions, both hindering optimal allocation and collaborative efficiency. To address these limitations, we propose CAFE (credit assignment and fine-tuning enhanced), a novel multi-agent RL framework for spatial crowdsourcing. CAFE introduces a credit assignment mechanism that distributes rewards based on workers’ contributions and spatiotemporal constraints, coupled with bi-level meta-optimization to jointly optimize credit assignment and RL policy. To handle non-stationary spatial task distributions, CAFE employs an adaptive fine-tuning procedure that efficiently adjusts credit assignment parameters while preserving collaborative knowledge. Experiments on two real-world datasets validate the effectiveness of our framework, demonstrating superior performance in terms of task completion and equitable reward redistribution.
1823: CADP: Towards Better Centralized Learning for Decentralized Execution in MARL
Authors: Yihe Zhou, Shunyu Liu, Yunpeng Qing, Tongya Zheng, Kaixuan Chen, Jie Song, Mingli Song
Location: Guangzhou | Day: TBD
Show Abstract
Centralized Training with Decentralized Execution (CTDE) has recently emerged as a popular framework for cooperative Multi-Agent Reinforcement Learning (MARL), where agents can use additional global state information to guide training in a centralized way and make their own decisions based only on decentralized local policies. Despite the encouraging results achieved, CTDE makes an independence assumption on agent policies, which prevents agents from adopting global cooperative information from each other during centralized training. Therefore, we argue that the existing CTDE framework cannot fully utilize global information for training, leading to inefficient joint exploration and perception, which can degrade the final performance. In this paper, we introduce a novel Centralized Advising and Decentralized Pruning (CADP) framework for MARL, which not only enables efficacious message exchange among agents during training but also guarantees independent policies for decentralized execution. First, CADP endows agents with an explicit communication channel to seek and take advice from different agents for better centralized training. To further ensure decentralized execution, we propose a smooth model pruning mechanism to progressively constrain agent communication into a closed one without degrading agent cooperation capability. Empirical evaluations on different benchmarks and across various MARL backbones demonstrate that the proposed framework achieves superior performance compared with state-of-the-art counterparts. Our code is available at https://github.com/zyh1999/CADP.
1829: BEVTrack: A Simple and Strong Baseline for 3D Single Object Tracking in Bird’s-Eye View
Authors: Yuxiang Yang, Yingqi Deng, Mian Pan, Zheng-Jun Zha, Jing Zhang
Location: Guangzhou | Day: TBD
Show Abstract
3D Single Object Tracking (SOT) is a fundamental task in computer vision and plays a critical role in applications like autonomous driving. However, existing algorithms often involve complex designs and multiple loss functions, making model training and deployment challenging. Furthermore, their reliance on fixed probability distribution assumptions (e.g., Laplacian or Gaussian) hinders their ability to adapt to diverse target characteristics such as varying sizes and motion patterns, ultimately affecting tracking precision and robustness. To address these issues, we propose BEVTrack, a simple yet effective motion-based tracking method. BEVTrack directly estimates object motion in Bird’s-Eye View (BEV) using a single regression loss. To enhance accuracy for targets with diverse attributes, it learns adaptive likelihood functions tailored to individual targets, avoiding the limitations of fixed distribution assumptions in previous methods. This approach provides valuable priors for tracking and significantly boosts performance. Comprehensive experiments on KITTI, NuScenes, and Waymo Open Dataset demonstrate that BEVTrack achieves state-of-the-art results while operating at 200 FPS, enabling real-time applicability. The code will be released at https://github.com/xmm-prio/BEVTrack.
1833: VQCounter: Designing Visual Prompt Queue for Accurate Open-World Counting
Authors: Fanfan Ye, Yiqi Fan, Qiaoyong Zhong, Shicai Yang, Di Xie, Jie Song, Mingli Song
Location: Guangzhou | Day: TBD
Show Abstract
Class-agnostic counting enables enumerating arbitrary object classes beyond those seen during training. Recent studies have attempted to exploit the potential of visual foundation models such as GroundingDINO. Despite the considerable progress, we observe certain shortcomings, including the limited diversity of visual prompts and a suboptimal training regimen.
To address these issues, we introduce VQCounter, which incorporates a visual prompt queue mechanism designed to enrich the diversity of visual prompts.
A random modality switching strategy is proposed during training to strengthen both textual and visual modalities.
Besides, in light of weak point supervision, a Voronoi diagram-based cost (VoronoiCost) is designed to improve Hungarian matching, leading to more stable and faster convergence.
Building upon the Voronoi diagram, we also propose a novel set of more stringent evaluation metrics, which take point localization into account.
Extensive experiments on the FSC-147 and CARPK datasets demonstrate that VQCounter achieves state-of-the-art performance in both zero-shot and few-shot settings, significantly outperforming existing methods across nearly all evaluations.
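Illustrative sketch (not from the paper): for point sets, a prediction lies in the Voronoi cell of a ground-truth point exactly when that point is its nearest ground truth, so a Voronoi-aware matching cost can penalize assignments that cross cell boundaries before running Hungarian matching. The penalty weight is an assumption; the paper's VoronoiCost may differ in detail.

```python
# Voronoi-aware assignment cost for point-supervised counting
# (hypothetical simplification of VoronoiCost).
import numpy as np
from scipy.optimize import linear_sum_assignment

def voronoi_matching(pred, gt, penalty=100.0):
    """pred: (N, 2) predicted points; gt: (M, 2) ground-truth points."""
    dist = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)    # (N, M)
    nearest = dist.argmin(axis=1)              # Voronoi cell owner of each pred
    outside = np.ones_like(dist)
    outside[np.arange(len(pred)), nearest] = 0.0   # no penalty inside own cell
    return linear_sum_assignment(dist + penalty * outside)  # row_ind, col_ind
```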
1840: Physical Adversarial Camouflage Through Gradient Calibration and Regularization
Authors: Jiawei Liang, Siyuan Liang, Jianjie Huang, Chenxi Si, Ming Zhang, Xiaochun Cao
Location: Guangzhou | Day: TBD
Show Abstract
The advancement of deep object detectors has greatly affected safety-critical fields like autonomous driving. However, physical adversarial camouflage poses a significant security risk by altering object textures to deceive detectors. Existing techniques struggle with variable physical environments, facing two main challenges: 1) inconsistent sampling point densities across distances hinder the gradient optimization from ensuring local continuity, and 2) updating texture gradients from multiple angles causes conflicts, reducing optimization stability and attack effectiveness. To address these issues, we propose a novel adversarial camouflage framework based on gradient optimization. First, we introduce a gradient calibration strategy, which ensures consistent gradient updates across distances by propagating gradients from sparsely sampled to unsampled texture points, thereby expanding the attack’s effective range. Additionally, we develop a gradient decorrelation method, which prioritizes and orthogonalizes gradients based on loss values, enhancing stability and effectiveness in multi-angle optimization by eliminating redundant or conflicting updates. Extensive experimental results on various detection models, angles, and distances show that our method significantly surpasses the state-of-the-art, with an average attack success rate (ASR) increase of 13.46% across distances and 11.03% across angles. Furthermore, experiments in real-world settings confirm the method’s threat potential, highlighting the urgent need for more robust autopilot systems less prone to spoofing.
1844: DDPA-3DVG: Vision-Language Dual-Decoupling and Progressive Alignment for 3D Visual Grounding
Authors: Hongjie Gu, Jinlong Fan, Liang Zheng, Jing Zhang, Yuxiang Yang
Location: Guangzhou | Day: TBD
Show Abstract
3D visual grounding aims to localize target objects in point clouds based on free-form natural language, which often describes both target and reference objects. Effective alignment between visual and text features is crucial for this task. However, existing two-stage methods that rely solely on object-level features can yield suboptimal accuracy, while one-stage methods that align only point-level features can be prone to noise. In this paper, we propose DDPA-3DVG, a novel framework that progressively aligns visual locations and language descriptions at multiple granularities. Specifically, we decouple natural language descriptions into distinct representations of target objects, reference objects, and their mutual relationships, while disentangling 3D scenes into object-level, voxel-level, and point-level features. By progressively fusing these dual-decoupled features from coarse to fine, our method enhances cross-modal alignment and achieves state-of-the-art performance on three challenging benchmarks—ScanRefer, Nr3D, and Sr3D. The code will be released at https://github.com/HDU-VRLab/DDPA-3DVG.
1852: DONIS: Importance Sampling for Training Physics-Informed DeepONet
Authors: Shudong Huang, Rui Huang, Ming Hu, Wentao Feng, Jiancheng Lv
Location: Guangzhou | Day: TBD
Show Abstract
Deep Operator Network (DeepONet) effectively learns complex operator mappings, especially for systems governed by differential equations. Physics-informed DeepONet (PI-DeepONet) extends these capabilities by integrating physical constraints, enabling robust performance with limited or no labeled data. However, combining operator learning with these constraints increases computational complexity, which makes training more difficult and convergence slower, particularly for nonlinear or high-dimensional problems. In this work, we present an enhanced PI-DeepONet framework that applies importance sampling to both DeepONet inputs (i.e., the input functions and the collocation points) to alleviate these training challenges. By focusing on critical data regions in both input domains, our approach achieves accelerated convergence and improved accuracy across various complex applications.
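Illustrative sketch (not from the paper): for the collocation-point side, importance sampling can draw points with probability proportional to the current PDE residual magnitude; the interface below is an assumption, and the paper additionally reweights the input functions.

```python
# Residual-driven sampling of collocation points (hypothetical
# simplification of the collocation-side importance sampling).
import torch

def sample_collocation(residual_fn, candidate_pts, n_select, power=1.0):
    """residual_fn: maps (N, d) candidate points to (N,) PDE residuals."""
    with torch.no_grad():
        res = residual_fn(candidate_pts).abs() ** power
    probs = res / res.sum()                    # residual-proportional weights
    idx = torch.multinomial(probs, n_select, replacement=False)
    return candidate_pts[idx]
```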
1866: A Dynamic Stiefel Graph Neural Network for Efficient Spatio-Temporal Time Series Forecasting
Authors: Jiankai Zheng, Liang Xie
Location: Guangzhou | Day: TBD
Show Abstract
Spatio-temporal time series (STTS) have been widely used in many applications. However, accurately forecasting STTS is challenging due to complex dynamic correlations in both time and space dimensions. Existing graph neural networks struggle to balance effectiveness and efficiency in modeling dynamic spatio-temporal relations. To address this problem, we propose the Dynamic Spatio-Temporal Stiefel Graph Neural Network (DST-SGNN) to efficiently process STTS. For DST-SGNN, we first introduce the novel Stiefel Graph Spectral Convolution (SGSC) and Stiefel Graph Fourier Transform (SGFT). The SGFT matrix in SGSC is constrained to lie on the Stiefel manifold, and SGSC can be regarded as a filtered graph spectral convolution. We also propose the Linear Dynamic Graph Optimization on Stiefel Manifold (LDGOSM), which can efficiently learn the SGFT matrix from the dynamic graph and significantly reduce the computational complexity. Finally, we propose a multi-layer SGSC (MSGSC) that efficiently captures complex spatio-temporal correlations. Extensive experiments on seven spatio-temporal datasets show that DST-SGNN outperforms state-of-the-art methods while maintaining relatively low computational costs.
1890: FreqMoE: Dynamic Frequency Enhancement for Neural PDE Solvers
Authors: Tianyu Chen, Haoyi Zhou, Ying Li, Hao Wang, Zhenzhe Zhang, Tianchen Zhu, Shanghang Zhang, Jianxin Li
Location: Guangzhou | Day: TBD
Show Abstract
Fourier Neural Operators (FNO) have emerged as promising solutions for efficiently solving partial differential equations (PDEs) by learning infinite-dimensional function mappings through frequency-domain transformations. However, the sparsity of high-frequency signals limits computational efficiency for high-dimensional inputs, and fixed-pattern truncation often causes high-frequency signal loss, reducing performance in scenarios such as high-resolution inputs or long-term predictions. To address these challenges, we propose FreqMoE, an efficient and progressive training framework that exploits the dependency of high-frequency signals on low-frequency components. The model first learns low-frequency weights and then applies a sparse upward-cycling strategy to construct a mixture of experts (MoE) in the frequency domain, effectively extending the learned weights to high-frequency regions. Experiments on both regular- and irregular-grid PDEs demonstrate that FreqMoE achieves up to 16.6% accuracy improvement while using merely 2.1% of the parameters (a 47.32× reduction) compared to dense FNO. Furthermore, the approach demonstrates remarkable stability in long-term predictions and generalizes seamlessly to various FNO variants and grid structures, establishing a new “Low-frequency Pretraining, High-frequency Fine-tuning” paradigm for solving PDEs.
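Illustrative sketch (not from the paper): the core idea, dense learned weights on low frequency modes with gated experts extending coverage to a higher band, can be prototyped as a spectral layer; the band split, gating rule, and shapes below are assumptions, and the progressive upward-cycling training is omitted.

```python
# Frequency-domain mixture-of-experts spectral layer (hypothetical
# simplification of FreqMoE). Requires n_modes + band <= L // 2 + 1.
import torch
import torch.nn as nn

class FreqExpertLayer(nn.Module):
    def __init__(self, channels, n_modes, n_experts, band):
        super().__init__()
        self.n_modes, self.band = n_modes, band
        # Complex spectral weights stored as (real, imag) pairs.
        self.low = nn.Parameter(torch.randn(channels, channels, n_modes, 2) * 0.02)
        self.experts = nn.Parameter(
            torch.randn(n_experts, channels, channels, band, 2) * 0.02)
        self.gate = nn.Linear(channels, n_experts)

    def forward(self, x):                          # x: (B, C, L)
        spec = torch.fft.rfft(x, dim=-1)           # (B, C, L//2+1) complex
        out = torch.zeros_like(spec)
        # Dense low-frequency weights (the pretrained backbone).
        w_low = torch.view_as_complex(self.low)
        out[..., :self.n_modes] = torch.einsum(
            "bcm,dcm->bdm", spec[..., :self.n_modes], w_low)
        # Gated experts handle the adjacent high-frequency band.
        scores = self.gate(x.mean(dim=-1)).softmax(-1)          # (B, E)
        w_hi = torch.view_as_complex(self.experts)              # (E, D, C, band)
        hi = spec[..., self.n_modes:self.n_modes + self.band]
        mixed = torch.einsum("bcm,edcm->bedm", hi, w_hi)        # (B, E, D, band)
        out[..., self.n_modes:self.n_modes + self.band] = (
            scores[:, :, None, None].to(mixed.dtype) * mixed).sum(dim=1)
        return torch.fft.irfft(out, n=x.size(-1), dim=-1)
```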
1895: Physics-Assisted and Topology-Informed Deep Learning for Weather Prediction
Authors: Jiaqi Zheng, Qing Ling, Yerong Feng
Location: Guangzhou | Day: TBD
Show Abstract
Although deep learning models have demonstrated remarkable potential in weather prediction, most of them overlook either the physics of the underlying weather evolution or the topology of the Earth’s surface. In light of these disadvantages, we develop PASSAT, a novel Physics-ASSisted And Topology-informed deep learning model for weather prediction. PASSAT attributes the weather evolution to two key factors: (i) the advection process, which can be characterized by the advection equation and the Navier-Stokes equation; and (ii) the Earth-atmosphere interaction, which is difficult to both model and calculate. PASSAT also takes the topology of the Earth’s surface into consideration, rather than simply treating it as a plane. With these considerations, PASSAT numerically solves the advection equation and the Navier-Stokes equation on the spherical manifold, utilizes a spherical graph neural network to capture the Earth-atmosphere interaction, and generates the initial velocity fields that are critical to solving the advection equation from the same spherical graph neural network. On the 5.625-degree-resolution ERA5 data set, PASSAT outperforms both the state-of-the-art deep learning-based weather prediction models and the operational numerical weather prediction model IFS T42.
1915: Conditional Information Bottleneck-Based Multivariate Time Series Forecasting
Authors: Xinhui Li, Liang Duan, Lixing Yu, Kun Yue, Yuehua Li
Location: Guangzhou | Day: TBD
Show Abstract
Multivariate time series (MTS) forecasting endeavors to anticipate the forthcoming sequence of interdependent variables through the utilization of past observations. The prevailing methodologies, relying on deep neural networks, Transformers, or information bottleneck frameworks, persist in confronting challenges such as overlooking or inadequately capturing the inter-/intra-series correlations evident in practical MTS datasets. In response to these challenges, we introduce a conditional information bottleneck-based strategy for MTS forecasting, grounded in information theory. Initially, we establish a conditional information bottleneck principle to capture the inter-series correlations via conditioning on non-target variables. Subsequently, a conditional mutual information-based technique is introduced to extract intra-series correlations by conditioning on historical data, ensuring temporal consistency within each variable. Lastly, we devise a unified optimization objective and propose a training algorithm to collectively capture inter-/intra-series correlations. Empirical investigations on authentic datasets underscore the superiority of our proposed approach over other cutting-edge competitors. Our code is available at https://github.com/Xinhui-Lee/CIB-MTSF.
1921: FreEformer: Frequency Enhanced Transformer for Multivariate Time Series Forecasting
Authors: Wenzhen Yue, Yong Liu, Xianghua Ying, Bowei Xing, Ruohao Guo, Ji Shi
Location: Guangzhou | Day: TBD
Show Abstract
This paper presents FreEformer, a simple yet effective model that leverages a Frequency Enhanced Transformer for multivariate time series forecasting. Our work is based on the assumption that the frequency spectrum provides a global perspective on the composition of series across various frequencies and is highly suitable for robust representation learning. Specifically, we first convert time series into the complex frequency domain using the Discrete Fourier Transform (DFT). The Transformer architecture is then applied to the frequency spectra to capture cross-variate dependencies, with the real and imaginary parts processed independently. However, we observe that the vanilla attention matrix exhibits a low-rank characteristic, thus limiting representation diversity. To address this, we enhance the vanilla attention mechanism by introducing an additional learnable matrix to the original attention matrix, followed by row-wise L1 normalization. Theoretical analysis demonstrates that this enhanced attention mechanism improves both feature diversity and gradient flow. Extensive experiments demonstrate that FreEformer consistently outperforms state-of-the-art models on eighteen real-world benchmarks covering electricity, traffic, weather, healthcare and finance. Notably, the enhanced attention mechanism also consistently improves the performance of state-of-the-art Transformer-based forecasters. Code is available at https://anonymous.4open.science/r/FreEformer.
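The enhanced attention itself is concrete enough to sketch directly from the abstract: a learnable matrix is added to the softmax attention matrix and rows are re-normalized in L1. The module below is an illustrative minimal version (single head, real-valued input); the paper applies the mechanism to the real and imaginary spectra separately.

```python
# Vanilla attention lifted by a learnable matrix, then row-wise
# L1-normalized (minimal single-head sketch of the enhanced attention).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancedAttention(nn.Module):
    def __init__(self, dim, n_tokens):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.bias = nn.Parameter(torch.zeros(n_tokens, n_tokens))  # learnable lift
        self.scale = dim ** -0.5

    def forward(self, x):                            # x: (B, N, D)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        attn = attn + self.bias            # raise the rank of the mixing matrix
        attn = attn / attn.abs().sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return attn @ v
```

At initialization the learnable matrix is zero, so the module reduces exactly to vanilla softmax attention; training then moves it away from the low-rank regime the abstract identifies.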
1941: FCKT: Fine-Grained Cross-Task Knowledge Transfer with Semantic Contrastive Learning for Targeted Sentiment Analysis
Authors: Wei Chen, Zhao Zhang, Meng Yuan, Kepeng Xu, Fuzhen Zhuang
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we address the task of targeted sentiment analysis , which involves two sub-tasks, i.e., identifying specific aspects from reviews and determining their corresponding senti-ments. Aspect extraction forms the foundation for sentiment prediction, highlighting the critical dependency between these two tasks for effective cross-task knowledge transfer.
While most existing studies adopt a multi-task learning paradigm to align task-specific features in the latent space, they predominantly rely on coarse-grained knowledge transfer. Such approaches lack fine-grained control over aspect-sentiment relationships, often assuming uniform sentiment polarity within related aspects. This oversimplification neglects contextual cues that differentiate sentiments, leading to negative transfer.
To overcome these limitations, we propose FCKT, a fine-grained cross-task knowledge transfer framework tailored for TSA. By explicitly incorporating aspect-level information into sentiment prediction, our framework achieves fine-grained knowledge transfer, effectively mitigating negative transfer and enhancing task performance.
Extensive experiments on three real-world datasets, including comparisons with various baselines and large language models (LLMs), demonstrate the effectiveness of FCKT. The source code
is available on https://github.com/cwei01/FCKT.
1950: Multi-Objective Neural Bandits with Random Scalarization
Authors: Ji Cheng, Bo Xue, Chengyu Lu, Ziqiang Cui, Qingfu Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Multi-objective multi-armed bandit (MOMAB) problems are crucial for complex decision-making scenarios where multiple conflicting objectives must be simultaneously optimized. However, most existing works are based on the linear assumption of the feedback rewards, which significantly constrains their applicability and efficacy in capturing the intricate dynamics of real-world environments. This paper explores a multi-objective neural bandit (MONB) framework, which integrates neural networks, as universal approximators, with classical MOMABs. We adopt random scalarization to accommodate the specific needs of a practitioner by setting an appropriate distribution over the regions of interest. Using the trade-off capabilities of upper confidence bound (UCB) and Thompson sampling (TS) strategies, we propose two novel algorithms, MONeural-UCB and MONeural-TS. Theoretical and empirical analyses demonstrate the superiority of our methods in multi-objective and multi-task bandit problems, offering substantial improvements over classical linear MOMABs.
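Illustrative sketch (not from the paper): one round of random scalarization is simple to state — draw a preference vector from the practitioner's chosen distribution, scalarize the per-objective optimistic estimates, and pull the best arm. The neural estimators of MONeural-UCB are abstracted into plain arrays here, and the Dirichlet choice is an assumption.

```python
# One round of randomly scalarized UCB over K arms and M objectives
# (hypothetical simplification of MONeural-UCB).
import numpy as np

def scalarized_ucb_round(means, widths, rng):
    """means, widths: (K, M) per-arm, per-objective estimates and
    confidence radii; returns the index of the arm to pull."""
    K, M = means.shape
    w = rng.dirichlet(np.ones(M))      # random preference over objectives
    ucb = means + widths               # optimistic estimate per objective
    return int(np.argmax(ucb @ w))     # weighted-sum scalarization
```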
1959: Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly
Authors: Ruiyuan Zhang, Qi Wang, Jiaxiang Liu, Yuchi Huo, Chao Wu
Location: Guangzhou | Day: TBD
Show Abstract
3D part assembly aims to understand part relationships and predict their 6-DoF poses to construct realistic 3D shapes, addressing the growing demand for autonomous assembly, which is crucial for robots. Existing methods mainly estimate the transformation of each part by training neural networks under supervision, which requires a substantial quantity of manually labeled data. However, the high cost of data collection and the immense variability of real-world shapes and parts make traditional methods impractical for large-scale applications. In this paper, we propose the first zero-shot part assembly method, which utilizes pre-trained point cloud diffusion models as discriminators in the assembly process, guiding the manipulation of parts to form realistic shapes. Specifically, we theoretically demonstrate that utilizing a diffusion model for zero-shot part assembly can be transformed into an Iterative Closest Point (ICP) process. Then, we propose a novel pushing-away strategy to handle overlapping parts, further enhancing the robustness of the method. To verify our work, we conduct extensive experiments and quantitative comparisons against several strong baseline methods, demonstrating the effectiveness of the proposed approach, which even surpasses supervised learning methods. The code has been released at https://github.com/Ruiyuan-Zhang/Zero-Shot-Assembly.
1963: Trace: Structural Riemannian Bridge Matching for Transferable Source Localization in Information Propagation
Authors: Li Sun, Suyang Zhou, Bowen Fang, Hechuan Zhang, Junda Ye, Yutong Ye, Philip S. Yu
Location: Guangzhou | Day: TBD
Show Abstract
Source localization, the inverse problem of information diffusion, is of fundamental importance for understanding social dynamics. While achieving notable progress, existing solutions are typically exposed to the risk of error accumulation and require a large number of observations for effective inference. However, it is often impractical to obtain large quantities of observations in real scenarios, highlighting the need for a transferable model with broad applicability. Recently, Riemannian geometry has demonstrated its effectiveness in information diffusion and offers guidance for knowledge transfer, but has yet to be explored in source localization. In light of the issues above, we propose to study transferable source localization from a fresh geometric perspective, and present a novel approach (Trace) on the Riemannian manifold. Concretely, we establish a structural Schrödinger bridge to directly model the map between source and final distributions, where a functional curvature, encapsulating the graph structure, is formulated to govern the Schrödinger bridge and facilitate domain adaptation. Furthermore, we design a simple yet effective learning algorithm for Riemannian Schrödinger bridges (geodesic bridge matching), in which we prove that the optimal projection holds for the Riemannian measure, so that the expensive iterative procedure is avoided. Extensive experiments demonstrate the effectiveness and transferability of Trace on both synthetic and real datasets.
1980: A Centrality-based Graph Learning Framework
Authors: Jiajun Yu, Zhihao Wu, Jielong Lu, Tianyue Wang, Haishuai Wang
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) have become powerful models for both node- and graph-level tasks. While node-level learning focuses on individual nodes and their local structures, graph-level learning encounters challenges in capturing the global properties of graphs. In this paper, we conduct a theoretical and experimental analysis of existing graph-level learning frameworks and find that these frameworks typically adopt a single-view perspective based solely on node degree, which limits their ability to capture comprehensive graph characteristics.
To address these issues, we propose a multi-view approach that leverages different types of centrality measures to capture diverse aspects of graph structure. We design an attention-based mechanism to adaptively integrate these multiple views, and use it as a readout function to perform a weighted summation of node embeddings, termed Adaptive Centrality Readout (ACRead). ACRead demonstrates enhanced flexibility and effectiveness when integrated with various GNN architectures, outperforming state-of-the-art readout methods, including KerRead and Set Transformer.
Additionally, this multi-view centrality approach can serve as a standalone graph-level learning framework without relying on GNNs, referred to as Adaptive Centrality-based Graph Learning (ACGL), which achieves competitive performance by effectively combining different centrality perspectives.
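Illustrative sketch (not from the paper): the multi-view centrality readout can be prototyped with a few NetworkX centralities and a softmax over view logits; the real ACRead learns the attention end-to-end, and node order is assumed to match the embedding rows.

```python
# Attention-weighted readout over multiple centrality views
# (hypothetical simplification of ACRead).
import networkx as nx
import numpy as np

def centrality_readout(G, node_emb, view_logits=None):
    """G: networkx graph with sortable node ids; node_emb: (N, D) array."""
    views = np.stack([
        np.array([c for _, c in sorted(nx.degree_centrality(G).items())]),
        np.array([c for _, c in sorted(nx.closeness_centrality(G).items())]),
        np.array([c for _, c in sorted(nx.betweenness_centrality(G).items())]),
    ])                                             # (V, N) per-view node scores
    if view_logits is None:
        view_logits = np.zeros(len(views))         # uniform attention over views
    a = np.exp(view_logits) / np.exp(view_logits).sum()
    weights = (a[:, None] * views).sum(axis=0)     # (N,) fused node weights
    return weights @ node_emb                      # (D,) graph-level embedding
```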
1997: SDDiff: Boosting Radar Perception via Spatial-Doppler Diffusion
Authors: Shengpeng Wang, Xin Luo, Yulong Xie, Wei Wang
Location: Guangzhou | Day: TBD
Show Abstract
Point cloud extraction (PCE) and ego velocity estimation (EVE) are key capabilities gaining attention in 3D radar perception. However, existing work typically treats these two tasks independently, which may neglect the interplay between radar’s spatial and Doppler domain features, potentially introducing additional bias. In this paper, we observe an underlying correlation between 3D points and ego velocity, which offers reciprocal benefits for PCE and EVE. To fully unlock this inspiring potential, we take the first step to design a Spatial-Doppler Diffusion (SDDiff) model for simultaneously performing dense PCE and accurate EVE. To seamlessly tailor it to radar perception, SDDiff improves the conventional latent diffusion process in three major aspects. First, we introduce a representation that embodies both spatial occupancy and Doppler features. Second, we design a directional diffusion with radar priors to streamline the sampling. Third, we propose Iterative Doppler Refinement to enhance the model’s adaptability to density variations and ghosting effects. Extensive evaluations show that SDDiff significantly outperforms state-of-the-art baselines, achieving 59% higher EVE accuracy and 4× greater valid generation density while boosting PCE effectiveness and reliability. The code and dataset will be available at https://github.com/StellarEsti/SDDiff.
2003: Seeing the Unseen: Composing Outliers for Compositional Zero-Shot Learning
Authors: Chenchen Jing, Mingyu Liu, Hao Chen, Yuling Xi, Xingyuan Bu, Dong Gong, Chunhua Shen
Location: Guangzhou | Day: TBD
Show Abstract
Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by learning from seen compositions. The distribution shift between unseen and seen compositions poses challenges to CZSL models, especially when test images mix both seen and unseen compositions. The challenge is addressed more easily if a model can distinguish unseen from seen compositions and treat them with specific recognition strategies. However, identifying images with unseen compositions is non-trivial, considering that unseen compositions are absent in training and usually contain only subtle differences from seen compositions. In this paper, we propose a novel compositional zero-shot learning method called COMO, which composes outliers during training to distinguish seen from unseen compositions and then applies specific strategies to each. Specifically, we compose attribute-object representations for unseen compositions, based on primitive representations of training images, as outliers that enable the model to identify unseen compositions at inference. At test time, the method distinguishes images containing seen/unseen compositions and uses different weights for composition classification and primitive classification to recognize them. Experimental results on three datasets show the effectiveness of our method in both the closed-world and open-world settings.
2020: Automated Strategy Invention for Confluence of Term Rewrite Systems
Authors: Liao Zhang, Fabian Mitterwallner, Jan Jakubuv, Cezary Kaliszyk
Location: Guangzhou | Day: TBD
Show Abstract
Term rewriting plays a crucial role in software verification and compiler optimization. With dozens of highly parameterizable techniques developed to prove various system properties, automatic term rewriting tools work in an extensive parameter space. This complexity exceeds human capacity for parameter selection, motivating an investigation into automated strategy invention. In this paper, we focus on confluence of term rewrite systems, and apply AI techniques to invent strategies for automatic confluence proving. Moreover, we randomly generate a large dataset to analyze confluence for term rewrite systems. We improve the state-of-the-art automatic confluence prover CSI: When equipped with our invented strategies, it surpasses its human-designed strategies both on the augmented dataset and on the original human-created benchmark dataset ARI-COPS, proving/disproving the confluence of several term rewrite systems for which no automated proofs were known before.
2035: Enhancing Transferability of Audio Adversarial Example for Both Frequency- and Time-domain
Authors: Zilin Tian, Yunfei Long, Liguo Zhang, Jiahong Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Audio adversarial examples impose acoustically imperceptible perturbations to clean audio examples, fooling classification models into producing incorrect results. Transferability is a critical property of audio adversarial examples, making black-box attacks applicable in practice and attracting increasing interest. Despite recent studies achieving transferability across models within the same domain, they consistently fail to achieve transferability across different domains. Given that time-domain and frequency-domain models are the two predominant approaches in audio classification, we observe that adversarial examples generated for one domain demonstrate significantly constrained transferability to the other. To address this limitation, we propose an Adaptive Inter-domain Ensemble (AIE) attack, which integrates transferable adversarial information from both domains and dynamically optimizes their contributions through adaptive weighting, improving the cross-domain transferability of audio adversarial examples. Extensive evaluations on diverse datasets consistently demonstrate that AIE outperforms existing methods, establishing its effectiveness in enhancing adversarial transferability across domains.
2036: Model-Based Closed-Loop Control Algorithm for Stochastic Partial Differential Equation Control
Authors: Peiyan Hu, Haodong Feng, Yue Wang, Zhiming Ma
Location: Guangzhou | Day: TBD
Show Abstract
Neural operators have demonstrated promise in modeling and controlling systems governed by Partial Differential Equations (PDEs). Beyond PDEs, Stochastic Partial Differential Equations (SPDEs) play a critical role in modeling systems influenced by randomness, with applications in finance, physics, and beyond. However, controlling SPDE-governed systems remains a significant challenge. On the one hand, the regularity of the system’s state (which can be intuitively understood as smoothness) deteriorates, making modeling and generalization more challenging. On the other hand, this stochasticity also renders control more unstable and thus less accurate. To address this gap, we propose the Model-Based Closed-Loop Control Algorithm (MB-CC), the first model-based closed-loop control method for SPDEs. MB-CC introduces two key innovations to enhance control robustness and efficiency: a Regularity Feature (RF) block and a closed-loop strategy with an operator-encoded policy network. The RF block, inspired by the regularity structure theory of SPDEs, addresses noise-induced irregularities by transforming the network’s input, including the system state and noise-perturbed external forces, into a refined feature space for improved forward prediction. Compared to previous works using regularity features, we introduce a new parameterization and data augmentation, and extend the RF block into a plug-and-play component. Additionally, to achieve closed-loop control, we introduce an operator-encoded policy network that maps the current state to the optimal control; it integrates physical priors and swiftly makes decisions based on states returned by the environment. We conduct a systematic evaluation of MB-CC on two notable SPDEs, showcasing its effectiveness and efficiency. Ablation studies show its ability to handle stochasticity effectively.
2038: Partial Label Clustering
Authors: Yutong Xie, Fuchao Yang, Yuheng Jia
Location: Guangzhou | Day: TBD
Show Abstract
Partial label learning (PLL) is a significant weakly supervised learning framework, where each training example corresponds to a set of candidate labels, of which only one is the ground-truth label. For the first time, this paper investigates the partial label clustering problem, which takes advantage of the limited available partial labels to improve clustering performance. Specifically, we first construct a weight matrix of examples based on their relationships in the feature space, and disambiguate the candidate labels based on the weight matrix to estimate the ground-truth label. Then, we construct a set of must-link and cannot-link constraints based on the disambiguation results. Moreover, we propagate the initial must-link and cannot-link constraints via an adversarial-prior-promoted dual-graph learning approach. Finally, we integrate weight matrix construction, label disambiguation, and pairwise constraint propagation into a joint model to achieve mutual enhancement. We also theoretically prove that a better disambiguated label matrix can help improve clustering performance. Comprehensive experiments demonstrate that our method achieves superior performance compared with state-of-the-art constrained clustering methods, and outperforms PLL and semi-supervised PLL methods when only limited samples are annotated. The code and appendix are publicly available at https://github.com/xyt-ml/PLC.
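Illustrative sketch (not from the paper): the first two steps, similarity-weighted disambiguation of candidate labels followed by constraint construction, can be prototyped in a few lines; the joint optimization and constraint propagation of the full model are omitted, and cannot-links are drawn only from provably conflicting (disjoint) candidate sets.

```python
# Disambiguate candidate labels and derive pairwise constraints
# (hypothetical simplification of the paper's first two steps).
import numpy as np
from sklearn.neighbors import kneighbors_graph

def disambiguate_and_constrain(X, candidates, k=5):
    """X: (N, D) features; candidates: (N, C) binary candidate-label matrix."""
    W = kneighbors_graph(X, k, mode="distance").toarray()
    W[W > 0] = np.exp(-W[W > 0])                 # distances -> similarities
    votes = (W @ candidates) * candidates        # only candidate labels get votes
    labels = votes.argmax(axis=1)                # disambiguated label estimate
    same = labels[:, None] == labels[None, :]
    disjoint = (candidates @ candidates.T) == 0  # true labels must differ
    must_link = np.argwhere(np.triu(same, 1))
    cannot_link = np.argwhere(np.triu(disjoint, 1))
    return labels, must_link, cannot_link
```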
2056: HSRMamba: Contextual Spatial-Spectral State Space Model for Single Hyperspectral Image Super-Resolution
Authors: Shi Chen, Lefei Zhang, Liangpei Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Mamba has demonstrated exceptional performance in visual tasks due to its powerful global modeling capabilities and linear computational complexity, offering considerable potential in hyperspectral image super-resolution (HSISR). However, in HSISR, Mamba faces challenges as transforming images into 1D sequences neglects the spatial-spectral structural relationships between locally adjacent pixels, and its performance is highly sensitive to input order, which affects the restoration of both spatial and spectral details. In this paper, we propose HSRMamba, a contextual spatial-spectral modeling state space model for HSISR, to address these issues both locally and globally. Specifically, a local spatial-spectral partitioning mechanism is designed to establish patch-wise causal relationships among adjacent pixels in 3D features, mitigating the local forgetting issue. Furthermore, a global spectral reordering strategy based on spectral similarity is employed to enhance the causal representation of similar pixels across both spatial and spectral dimensions. Finally, experimental results demonstrate our HSRMamba outperforms the state-of-the-art methods in quantitative quality and visual results. Code is available at: https://github.com/Tomchenshi/HSRMamba.
2068: Towards Anytime Retrieval: A Benchmark for Anytime Person Re-Identification
Authors: Xulin Li, Yan Lu, Bin Liu, Jiaze Li, Qinhong Yang, Tao Gong, Qi Chu, Mang Ye, Nenghai Yu
Location: Guangzhou | Day: TBD
Show Abstract
In real applications, person re-identification (ReID) is expected to retrieve the target person at any time, both daytime and nighttime, and over horizons ranging from short-term to long-term. However, existing ReID tasks and datasets cannot meet this requirement, as they are constrained by the time available and only provide training and evaluation for specific scenarios. Therefore, we investigate a new task called Anytime Person Re-identification (AT-ReID), which aims to achieve effective retrieval in multiple scenarios based on variations in time. To address the AT-ReID problem, we collect the first large-scale dataset, AT-USTC, which contains 135k images of individuals wearing multiple outfits captured by RGB and IR cameras. Our data collection spans an entire year, and each of the 270 volunteers was photographed on average 29.1 times across different dates or scenes, 4–15 times more than in current datasets, laying the groundwork for follow-up investigations of AT-ReID. Furthermore, to tackle the new challenge of multi-scenario retrieval, we propose a unified model named Uni-AT, which comprises a multi-scenario ReID (MS-ReID) framework for scenario-specific feature learning, a Mixture-of-Attribute-Experts (MoAE) module to alleviate inter-scenario interference, and a Hierarchical Dynamic Weighting (HDW) strategy to ensure balanced training across all scenarios. Extensive experiments show that our model achieves satisfactory results and exhibits excellent generalization to all scenarios.
2078: InstGAN: Instant Actor-Critic-Driven GAN for De Novo Molecule Generation and Property Optimization
Authors: Huidong Tang, Chen Li, Sayaka Kamei, Yoshihiro Yamanishi, Yasuhiko Morimoto
Location: Guangzhou | Day: TBD
Show Abstract
Deep generative models, such as generative adversarial networks (GANs), have been employed for de novo molecular generation in drug discovery. Most prior studies have utilized reinforcement learning (RL) algorithms, particularly Monte Carlo tree search (MCTS), to handle the discrete nature of molecular representations in GANs. However, due to the inherent instability in training GANs and RL models, along with the high computational cost associated with MCTS sampling, MCTS-RL-based GANs struggle to scale to large chemical databases. To tackle these challenges, this study introduces a novel GAN based on actor-critic RL with instant and global rewards, called InstGAN, to generate molecules at the token level with multi-property optimization. Furthermore, maximized information entropy is leveraged to alleviate mode collapse. The experimental results demonstrate that InstGAN outperforms other baselines, achieves comparable performance to state-of-the-art models, and efficiently generates molecules with multi-property optimization. The code is available at https://github.com/tang777777/InstGAN.
2090: Screening, Rectifying, and Re-Screening: A Unified Framework for Tuning Vision-Language Models with Noisy Labels
Authors: Chaowei Fang, Hangfei Ma, Zhihao Li, De Cheng, Yue Zhang, Guanbin Li
Location: Guangzhou | Day: TBD
Show Abstract
Pre-trained vision-language models have shown remarkable potential for downstream tasks. However, their fine-tuning under noisy labels remains an open problem due to challenges like self-confirmation bias and the limitations of conventional small-loss criteria. In this paper, we propose a unified framework to address these issues, consisting of three key steps: Screening, Rectifying, and Re-Screening. First, a dual-level semantic matching mechanism is introduced to categorize samples into clean, ambiguous, and noisy samples by leveraging both macro-level and micro-level textual prompts. Second, we design tailored pseudo-labeling strategies to rectify noisy and ambiguous labels, enabling their effective incorporation into the training process. Finally, a re-screening step, utilizing cross-validation with an auxiliary vision-language model, mitigates self-confirmation bias and enhances the robustness of the framework. Extensive experiments across ten datasets demonstrate that the proposed method significantly outperforms existing approaches for tuning vision-language pre-trained models with noisy labels.
2098: Inverse Game Theory: An Incenter-Based Approach
Authors: Lvye Cui, Haoran Yu, Pierre Pinson, Dario Paccagnan
Location: Guangzhou | Day: TBD
Show Abstract
Estimating player utilities from observed equilibria is crucial for many applications. Existing approaches to this problem are either limited to specific games or do not scale well with the number of players. Our work addresses these issues by proposing a novel utility estimation method for general multi-player non-cooperative games. Our main idea is to reformulate the inverse game problem as an inverse variational inequality problem and to select, among all utility parameters consistent with the data, the so-called incenter. We show that this choice of the incenter can produce parameters that are most robust to the observed equilibrium behaviors. However, its computation is challenging, as the number of constraints in the corresponding optimization problem increases with the number of players and the size of the behavior space. To tackle this challenge, we propose a loss-function-based algorithm, making our method scalable to games with many players or a continuous action space. Furthermore, we show that our method can be extended to incorporate prior knowledge of player utilities, and that it can handle inconsistent data, i.e., data where players do not play exact equilibria. Numerical experiments on three game applications demonstrate that our methods outperform the state of the art. The code, datasets, and supplementary material are available at https://github.com/cuilvye/Incenter-Project.
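Illustrative sketch (not from the paper): when the observed equilibria reduce to finitely many linear constraints A·theta <= b on the utility parameters, the incenter, i.e. the consistent point deepest inside all constraints, is exactly a Chebyshev-center linear program. The paper's loss-function-based algorithm scales this idea past explicitly enumerated constraints.

```python
# Incenter of {theta : A @ theta <= b} via a Chebyshev-center LP
# (hypothetical finite-constraint special case).
import numpy as np
from scipy.optimize import linprog

def incenter(A, b):
    """Maximize r subject to A @ theta + r * ||a_i|| <= b_i."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    n = A.shape[1]
    # Decision variables: [theta (n entries), r]; maximize r == minimize -r.
    res = linprog(c=np.r_[np.zeros(n), -1.0],
                  A_ub=np.hstack([A, norms]), b_ub=b,
                  bounds=[(None, None)] * n + [(0, None)])
    return res.x[:n], res.x[n]        # the incenter and its inradius
```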
2101: DcDsDiff: Dual-Conditional and Dual-Stream Diffusion Model for Generative Image Tampering Localization
Authors: Qixian Hao, Shaozhang Niu, Jiwei Zhang, Kai Wang
Location: Guangzhou | Day: TBD
Show Abstract
Generative Image Tampering (GIT), due to its high diversity and realism, poses a significant challenge to traditional image tampering localization techniques. Consequently, this paper introduces DcDsDiff, a denoising diffusion probabilistic model-based approach comprising a Dual-View Conditional Network (DVCN) and a Dual-Stream Denoising Network (DSDN). DVCN provides clues about the tampered areas. It extracts tampering features in the high-frequency view and integrates them with spatial-domain features using attention mechanisms. DSDN jointly generates the mask image and the detail image, enhancing the generalization capability of the model against new tampering forms through iterative denoising. A multi-stream interaction mechanism enables the two generative tasks to promote each other, prompting the model to generate detailed and complete localization results. Experiments show that DcDsDiff outperforms mainstream methods in accurate localization, generalization, extensibility, and robustness. Code page: https://github.com/QixianHao/DcDsDiff-and-GIT10K.
2104: RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations
Authors: Zunhai Su, Hanyu Wei, Zhe Chen, Wang Shen, Linge Li, Huangqi Yu, Kehong Yuan
Location: Guangzhou | Day: TBD
Show Abstract
Key-Value (KV) cache facilitates efficient large language models (LLMs) inference by avoiding recomputation of past KVs.
As the batch size and context length increase, the oversized KV caches become a significant memory bottleneck, highlighting the need for efficient compression.
Existing KV quantization methods rely on fine-grained quantization or on retaining a significant portion of the cache at high bit-widths, both of which compromise the compression ratio and often fail to maintain robustness at extremely low average bit-widths.
In this work, we explore the potential of rotation technique for 2-bit KV quantization and propose RotateKV, which achieves accurate and robust performance through the following innovations:
(i) Outlier-Aware Rotation, which utilizes channel-reordering to adapt the rotations to varying channel-wise outlier distributions without sacrificing the computational efficiency of the fast Walsh-Hadamard transform (FWHT);
(ii) Pre-RoPE Grouped-Head Rotation, which mitigates the impact of rotary position embedding (RoPE) on proposed outlier-aware rotation and further smooths outliers across heads;
(iii) Attention-Sink-Aware Quantization, which leverages the massive activations to precisely identify and protect attention sinks.
RotateKV achieves less than 0.3 perplexity (PPL) degradation with 2-bit quantization on WikiText-2 using LLaMA-2-13B, maintains strong CoT reasoning and long-context capabilities with less than 1.7% degradation on GSM8K, and outperforms existing methods even at lower average bit-widths.
RotateKV also showcases a 3.97× reduction in peak memory usage, supports 5.75× larger batch sizes, and achieves a 2.32× speedup in the decoding stage.
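A minimal sketch of the rotate-then-quantize pipeline using a plain (non-adaptive) Walsh-Hadamard rotation; the paper's channel reordering, pre-RoPE grouped-head rotation, and attention-sink protection are omitted, and the group size and round-trip dequantization here are illustrative:

    import torch
    from scipy.linalg import hadamard

    def rotate_and_quantize_2bit(kv, group=32):
        # kv: (tokens, dim) float keys or values; dim must be a power of two.
        d = kv.shape[-1]
        H = torch.tensor(hadamard(d), dtype=kv.dtype) / d ** 0.5  # orthonormal rotation
        x = kv @ H                                   # spread channel outliers across channels
        x = x.reshape(-1, group)                     # per-group asymmetric 2-bit quantization
        lo = x.min(dim=1, keepdim=True).values
        hi = x.max(dim=1, keepdim=True).values
        scale = (hi - lo).clamp_min(1e-8) / 3.0      # 2 bits -> 4 levels: 0..3
        q = ((x - lo) / scale).round().clamp(0, 3)
        deq = (q * scale + lo).reshape(kv.shape)
        return deq @ H.T                             # rotate back after dequantization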
2112: Q-MiniSAM2: A Quantization-based Benchmark for Resource-Efficient Video Segmentation
Authors: Xuanxuan Ren, Xiangyu Li, Kun Wei, Xu Yang, Yanhua Yang
Location: Guangzhou | Day: TBD
Show Abstract
Segment Anything Model 2 (SAM2) is a new-generation, high-precision model for image and video segmentation, offering extensive application prospects across numerous computer vision fields. However, as a large-scale model, its huge memory demands and expensive computational costs pose challenges for practical deployment. This paper presents Q-MiniSAM2, an efficient Quantization-based segmentation benchmark tailored to optimize SAM2 by Minimizing memory consumption and accelerating computations. We begin by applying Post-Training Quantization (PTQ) to SAM2, requiring only a relatively small dataset for network calibration, thereby eliminating the need for retraining. Building upon PTQ, we further introduce a Hierarchy-based Video Quantization method to enhance the model’s capacity to capture video semantics and temporal correlations across different time scales. Furthermore, we observe that SAM2’s memory overhead is predominantly concentrated on processing historical frames, and the redundant cross-attention computations significantly increase memory and computational costs due to the imperceptible changes over the short time intervals between these frames. To tackle this issue, an Adaptive Mutual-KV mechanism is proposed to mitigate excessive cross-attention by leveraging inter-frame similarities. Comprehensive experiments demonstrate that the proposed approach achieves superior performance compared to state-of-the-art methods, underscoring its potential for efficient and scalable video segmentation.
2116: ARPDL: Adaptive Relational Prior Distribution Loss as an Adapter for Document-Level Relation Extraction
Authors: Huangming Xu, Fu Zhang, Jingwei Cheng, Xin Li
Location: Guangzhou | Day: TBD
Show Abstract
The goal of document-level relation extraction (DocRE) is to identify relations between entities from multiple sentences. As a multi-label classification task, a common approach is to determine whether there are relations for an entity pair by selecting a multi-label classification threshold, with scores of relations above the threshold predicted as positive and the rest as negative. However, we find that predicting multiple relations for entity pairs causes the predicted scores of positive classes to decrease. This could lead to many positive classes being incorrectly predicted as negative. Additionally, our analysis suggests that fitting the distribution of predicted relations to the prior distribution of relations can help improve prediction performance. However, previous studies have not explored or leveraged the prior distribution of relations. To address these issues and findings, we propose, for the first time, incorporating the relational prior distribution into the loss calculation of DocRE tasks. We innovatively propose an Adaptive Relational Prior Distribution Loss (ARPDL), which can adaptively adjust relation prediction scores based on the relational prior distribution. Our designed relational prior distribution component can also be integrated as an adapter into other threshold-based losses to improve prediction performance. Experimental results demonstrate that ARPDL consistently improves the performance of existing DocRE models, achieving new state-of-the-art results. Furthermore, integrating our relational prior distribution adapter into other losses significantly enhances their performance in DocRE tasks, validating the effectiveness and generality of our approach. Code is available at https://github.com/xhm-code/ARPDL.
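One simple way to inject a relational prior into threshold-based scoring is a logit-adjustment-style shift; this is an interpretation for illustration only, not the paper's exact ARPDL:

    import torch

    def prior_adjusted_logits(logits, prior, tau=1.0):
        # logits: (batch, num_relations) relation scores; prior: (num_relations,)
        # empirical relation frequencies. Shifting scores by the log prior nudges
        # the predicted relation distribution toward the prior before a
        # threshold-based loss is applied.
        return logits + tau * torch.log(prior.clamp_min(1e-12))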
2117: EFX Feasible Scheduling for Time-dependent Resources
Authors: Jiazhu Fang, Qizhi Fang, Minming Li, Wenjing Liu
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we study a fair resource scheduling problem involving the assignment of a set of interval jobs among a group of heterogeneous machines. Each job is associated with a release time, a deadline, and a processing time. A machine can process a job if the entire processing period falls within the release time and deadline of the job. Each machine can process at most one job at any given time, and different jobs yield different utilities for the machines. The goal is to find a fair and efficient schedule of the jobs. We discuss the compatibility between envy-freeness up to any item (EFX) and various efficiency concepts. Additionally, we present polynomial-time algorithms for various settings.
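For reference, the EFX condition itself can be checked in a few lines (the scheduling-feasibility constraints of the paper are ignored here; bundles and utility are illustrative stand-ins):

    def is_efx(bundles, utility):
        # bundles: machine -> set of jobs; utility(m, bundle) -> value of bundle to m.
        # EFX: no machine envies another's bundle after removing *any* single job from it.
        for m in bundles:
            for other in bundles:
                if m == other:
                    continue
                for job in bundles[other]:
                    if utility(m, bundles[m]) < utility(m, bundles[other] - {job}):
                        return False
        return True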
2120: A Correlation Manifold Self-Attention Network for EEG Decoding
Authors: Chen Hu, Rui Wang, Xiaoning Song, Tao Zhou, Xiao-Jun Wu, Nicu Sebe, Ziheng Chen
Location: Guangzhou | Day: TBD
Show Abstract
Riemannian neural networks, which generalize the deep learning paradigm to non-Euclidean geometries, have garnered widespread attention across diverse applications in artificial intelligence. Among these, the representative attention models have been studied on various non-Euclidean spaces to geometrically capture the spatiotemporal dependencies inherent in time series data, e.g., electroencephalography (EEG). Recent studies have highlighted the full-rank correlation matrix as an advantageous alternative to the covariance matrix for data representation, owing to its invariance to the scale of variables. Motivated by these advancements, we propose the Correlation Attention Network (CorAtt) tailored for full-rank correlation matrices and implement it under the permutation-invariant and computationally efficient Off-Log and Log-Scaled geometries, respectively. Extensive evaluations on three benchmark EEG datasets provide substantial evidence for the effectiveness of our introduced CorAtt. The code and supplementary material can be found at https://github.com/ChenHu-ML/CorAtt.
2131: Exploiting Text Semantics for Few and Zero Shot Node Classification on Text-attributed Graph
Authors: Yuxiang Wang, Xiao Yan, Shiyu Jin, Quanqing Xu, Chuang Hu, Yuanyuan Zhu, Bo Du, Jia Wu, Jiawei Jiang
Location: Guangzhou | Day: TBD
Show Abstract
Text-attributed graph (TAG) provides a text description for each graph node, and few- and zero-shot node classification on TAGs has many applications in fields such as academia and social networks. Existing work utilizes various graph-based augmentation techniques to train the node and text embeddings, while text-based augmentations are largely unexplored. In this paper, we propose Text Semantics Augmentation (TSA) to improve accuracy by introducing more text semantic supervision signals. Specifically, we design two augmentation techniques, i.e., positive semantics matching and negative semantics contrast, to provide more reference texts for each graph node or text description. Positive semantics matching retrieves texts with similar embeddings to match with a graph node. Negative semantics contrast adds a negative prompt to construct a text description with the opposite semantics, which is contrasted with the original node and text. We evaluate TSA on 5 datasets and compare with 13 state-of-the-art baselines. The results show that TSA consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%. The code is at https://github.com/wyx11112/TSA.
2135: Unleashing the Semantic Adaptability of Controlled Diffusion Model for Image Colorization
Authors: Xiangcheng Du, Zhao Zhou, Yanlong Wang, Yingbin Zheng, Xingjiao Wu, Peizhu Gong, Cheng Jin
Location: Guangzhou | Day: TBD
Show Abstract
Recent data-driven image colorization methods have leveraged pre-trained Text-to-Image (T2I) diffusion models as a generative prior, while still suffering from unsatisfactory and inaccurate semantic-level color control. To address these issues, we propose a Semantic Adaptation method (SeAda) that enhances the prior while considering the semantic discrepancy between color and grayscale image pairs. SeAda employs a semantic adapter to produce refined semantic embeddings and a controlled T2I diffusion model to create reasonably colored images. Specifically, the semantic adapter transfers the embedding from the grayscale to the color domain, while the diffusion model utilizes the refined embedding and prior knowledge to achieve realistic and diverse results. We also design a three-stage training strategy to improve semantic comprehension and prior integration for further performance improvement. Extensive experiments on public datasets demonstrate that our method outperforms existing state-of-the-art techniques, yielding superior performance in image colorization.
2164: Leveraging Peer-Informed Label Consistency for Robust Graph Neural Networks with Noisy Labels
Authors: Kailai Li, Jiawei Sun, Jiong Lou, Zhanbo Feng, Hefeng Zhou, Chentao Wu, Guangtao Xue, Wei Zhao, Jie Li
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) excel in many applications but struggle when trained with noisy labels, especially as noise can propagate through the graph structure.
Despite recent progress in developing robust GNNs, few methods exploit the intrinsic properties of graph data to filter out noise.
In this paper, we introduce ProCon, a novel framework that identifies mislabeled nodes by measuring label consistency among semantically similar peers, which are determined by feature similarity and graph adjacency.
Mislabeled nodes typically exhibit lower consistency with these peers, a signal we measure using pseudo-labels derived from representational prototypes.
A Gaussian Mixture Model is fitted to the consistency distribution to identify clean samples, which refine prototype quality in an iterative feedback loop.
Experiments on multiple datasets demonstrate that ProCon significantly outperforms state-of-the-art methods, effectively mitigating label noise and enhancing GNN robustness.
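A minimal sketch of the clean/noisy split via a two-component Gaussian mixture over consistency scores, as described above; the 0.5 cutoff and the higher-mean-is-clean convention are illustrative assumptions:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def split_clean_noisy(consistency):
        # consistency: (n_nodes,) label-consistency scores with semantically similar peers.
        gmm = GaussianMixture(n_components=2, random_state=0)
        gmm.fit(consistency.reshape(-1, 1))
        clean_comp = int(np.argmax(gmm.means_.ravel()))   # higher-mean component = clean
        p_clean = gmm.predict_proba(consistency.reshape(-1, 1))[:, clean_comp]
        return p_clean > 0.5                              # boolean mask of likely-clean nodes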
2169: Zero-shot Federated Unlearning via Transforming from Data-Dependent to Personalized Model-Centric
Authors: Wenhan Wu, Huanghuang Liang, Jingling Yuan, Jiawei Jiang, Kanye Ye Wang, Chuang Hu, Xiaobo Zhou, Dazhao Cheng
Location: Guangzhou | Day: TBD
Show Abstract
Federated Unlearning (FU) addresses the "right to be forgotten" in federated learning by removing specific client data’s contribution without retraining from scratch. Existing FU methods are data-dependent, assuming that systems can access the original training data or stored historical parameter updates during unlearning. However, this assumption does not always hold in practice, as users often request the deletion of client data and historical parameter updates due to privacy concerns or storage limitations. It is therefore crucial to develop a zero-shot FU method that requires no such data access. The key challenge is how to distinguish and remove the impact of target clients without data-level information. Motivated by the idea that FU can be model-centric and data-free if client-specific personalized information is learned from the model instead of the data, we present ZeroFU, the first zero-shot FU framework. By embedding client contributions into the model during learning via conditional computation, ZeroFU enables the model to carry personalized features for unlearning. The unlearning is achieved using a proposed GAN-based distillation framework that obfuscates the personalized feature of the target client. Evaluations demonstrate its effectiveness in unlearning under non-IID settings.
2173: Training-free Fourier Phase Diffusion for Style Transfer
Authors: Siyuan Zhang, Wei Ma, Libin Liu, Zheng Li, Hongbin Zha
Location: Guangzhou | Day: TBD
Show Abstract
Diffusion models have shown significant potential for image style transfer tasks. However, achieving effective stylization while preserving content in a training-free setting remains a challenging issue due to the tightly coupled representation space and inherent randomness of the models. In this paper, we propose a Fourier phase diffusion model that addresses this challenge. Given that the Fourier phase spectrum encodes an image’s edge structures, we propose modulating the intermediate diffusion samples with the Fourier phase of a content image to conditionally guide the diffusion process. This ensures content retention while fully utilizing the diffusion model’s style generation capabilities. To implement this, we introduce a content phase spectrum incorporation method that aligns with the characteristics of the diffusion process, preventing interference with generative stylization. To further enhance content preservation, we integrate homomorphic semantic features extracted from the content image at each diffusion stage. Extensive experimental results demonstrate that our method outperforms state-of-the-art models in both content preservation and stylization. Code is available at https://github.com/zhang2002forwin/Fourier-Phase-Diffusion-for-Style-Transfer.
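The phase-injection idea can be sketched in a few lines: keep the magnitude of the intermediate diffusion sample but impose the content image's Fourier phase. The paper's incorporation method is more careful about the diffusion schedule; this shows only the core operation:

    import torch

    def inject_content_phase(x_t, content):
        # Keep the magnitude spectrum of the diffusion sample x_t but impose the
        # Fourier phase of the content image, preserving edge structure while
        # leaving style free to change.
        Xt = torch.fft.fft2(x_t)
        C = torch.fft.fft2(content)
        return torch.fft.ifft2(Xt.abs() * torch.exp(1j * C.angle())).real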
2182: Risk-Aware Task Migration for Multiplex Unmanned Swarm Networks in Adversarial Environments
Authors: Kai Di, Tienyu Zuo, Pan Li, Yuanshuang Jiang, Fulin Chen, Yichuan Jiang
Location: Guangzhou | Day: TBD
Show Abstract
With the rapid development and deep integration of artificial intelligence and automation technologies, autonomous unmanned swarms dynamically organize into multiplex network structures based on diverse task requirements in adversarial environments. Frequent task variations lead to load imbalances among agents and between network layers, significantly increasing the risk of enemy detection and destruction. Existing approaches typically simplify multiplex networks into single-layer structures for task scheduling, failing to address these load imbalance issues. Moreover, the coupling between task dynamics and network multiplexity dramatically increases the complexity of designing task migration strategies, and it is proven NP-hard to achieve such load balancing. To address these challenges, this paper proposes a risk-aware task migration method that achieves dynamic load balancing by matching task requirements with both intra-layer agent capabilities and inter-layer swarm capabilities. Simulation results demonstrate that our approach significantly outperforms benchmark algorithms in task completion cost, task completion proportion, and system robustness. In particular, the algorithm achieves solutions statistically indistinguishable from the optimal solutions computed by the CPLEX solver, while exhibiting significantly reduced computational overhead.
2204: DHTAGK: Deep Hierarchical Transitive-Aligned Graph Kernels for Graph Classification
Authors: Xinya Qin, Lu Bai, Lixin Cui, Ming Li, Ziyu Lyu, Hangyuan Du, Edwin Hancock
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we propose a family of novel Deep Hierarchical Transitive-Aligned Graph Kernels (DHTAGK) for graph classification. To this end, we commence by developing a new Hierarchical Aligned Graph Auto-Encoder (HA-GAE) to construct transitive-aligned embedding graphs that encapsulate the structural correspondence information between graphs. The DHTAGK kernels then measure either the Jensen-Shannon Divergence between the adjacency matrices or the Gaussian kernel between the node feature matrices of the embedding graphs. Unlike the classical R-convolution kernels and node-based alignment kernels, the DHTAGK kernels can capture the transitive structural correspondence information and thus ensure the positive definiteness. Furthermore, the HA-GAE enables the DHTAGK kernels to simultaneously reflect both local and global graph structures and identify common structural patterns. Experimental results show that the DHTAGK kernels outperform state-of-the-art graph kernels and deep learning methods on benchmark datasets.
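For the feature-based variant, a minimal sketch of a Gaussian kernel between the node-feature matrices of two transitively aligned embedding graphs; the alignment itself, produced by the HA-GAE step above, is assumed given:

    import numpy as np

    def gaussian_graph_kernel(X1, X2, sigma=1.0):
        # X1, X2: (n, d) node-feature matrices of two aligned embedding graphs,
        # so row i of X1 corresponds to row i of X2.
        return np.exp(-np.linalg.norm(X1 - X2) ** 2 / (2 * sigma ** 2))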
2234: AKBR: Learning Adaptive Kernel-based Representations for Graph Classification
Authors: Lu Bai, Feifei Qian, Lixin Cui, Ming Li, Hangyuan Du, Yue Wang, Edwin Hancock
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we propose a new model to learn Adaptive Kernel-based Representations (AKBR) for graph classification. Unlike state-of-the-art R-convolution graph kernels that are defined by merely counting pairs of isomorphic substructures between graphs and cannot provide an end-to-end learning mechanism for the classifier, the proposed AKBR approach aims to define an end-to-end representation learning model to construct an adaptive kernel matrix for graphs. To this end, we commence by leveraging a novel feature-channel attention mechanism to capture the interdependencies between different substructure invariants of original graphs. The proposed AKBR model can thus effectively identify the structural importance of different substructures, and compute the R-convolution kernel between pairwise graphs associated with the more significant substructures specified by their structural attentions. Furthermore, the proposed AKBR model employs all sample graphs as the prototype graphs, naturally providing an end-to-end learning architecture between the kernel computation and the classifier. Experimental results show that the proposed AKBR model outperforms existing state-of-the-art graph kernels and deep learning methods on standard graph benchmarks.
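A minimal sketch of the adaptive-kernel idea: channel attention over substructure counts followed by a differentiable kernel matrix. Names and shapes are illustrative, and the paper's attention mechanism is richer than this softmax weighting:

    import torch

    def adaptive_kernel_matrix(counts, attn_logits):
        # counts: (n_graphs, n_substructures) substructure-count features;
        # attn_logits: learnable (n_substructures,) channel-attention parameters.
        w = torch.softmax(attn_logits, dim=0)   # structural importance per substructure
        weighted = counts * w                   # emphasize informative substructures
        return weighted @ weighted.T            # R-convolution-style kernel, end-to-end differentiable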
2243: Sentiment-enhanced Multi-hop Connected Graph Attention Network for Multimodal Aspect-Based Sentiment Analysis
Authors: Linlin Zhu, Heli Sun, Xiaoyong Huang, Qi Zhang, Ruichen Cao, Liang He
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal aspect-based sentiment analysis aims to extract aspects from different data sources and recognize the corresponding sentiments. While current research has broadly focused on syntax relation-driven semantic comprehension, the impact of the importance of different syntactic relations on semantic understanding has not been adequately investigated. To address this issue, we propose a Sentiment-enhanced Multi-hop Connected Graph Attention Network (MCG), aiming to enhance the discriminative capability of the model for sentiments and to delve into the syntactic relationships within the text. Firstly, we design a contrastive sentiment-enhanced pre-training task that expands the diversity and complexity of training samples to improve the recognition of multiple sentiments. Secondly, we construct a multi-hop connected syntactic dependency graph to deeply explore the rich syntactic dependencies in the text and to reveal the differences among syntactic relations. Moreover, we develop a multi-hop connected graph attention mechanism that enables the model to focus on the key syntactic relations within the syntactic structure, thereby enhancing the comprehension and predictive capabilities of the model in multimodal sentiment analysis. Experimental results on two benchmark datasets demonstrate that our method outperforms state-of-the-art methods. The source code is provided in the supplementary materials.
2265: A Dual Stream Visual Tokenizer for LLM Image Generation
Authors: Yongqian Li, Yong Luo, Xiantao Cai, Zheng He, Zhennan Meng, Nidong Wang, Yunlin Chen, Zhifei Li
Location: Guangzhou | Day: TBD
Show Abstract
We propose a novel visual tokenizer that combines high-level semantic tokens and low-level pixel tokens to represent images, aiming to address the challenges of image-to-sequence conversion for Large Language Models (LLMs). Existing visual tokenizers, such as VQ-VAE and diffusion-based models, either struggle with token explosion as image resolution increases or fail to capture detailed structural information. Our method introduces a dual-token system: high-level semantic tokens capture the main content of the image, while low-level pixel tokens preserve structural details. By integrating these tokens in a hybrid architecture, we leverage a VQ-VAE branch to generate low-resolution guidance and a diffusion process to reconstruct high-resolution images with both semantic coherence and structural accuracy. This approach significantly reduces the number of required tokens and enhances image reconstruction quality, offering an efficient solution for tasks like image generation and understanding based on LLMs.
2286: Graph Embedded Contrastive Learning for Multi-View Clustering
Authors: Hongqing He, Jie Xu, Guoqiu Wen, Yazhou Ren, Na Zhao, Xiaofeng Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Recently, numerous multi-view clustering (MVC) and multi-view graph clustering (MVGC) methods have been proposed. Despite significant progress, they still face two issues: I) MVC and MVGC are often developed independently for multi-view and multi-graph data. They have redundancy but lack a unified methodology to combine their strengths. II) Contrastive learning is usually adopted to explore the associations across multiple views. However, traditional contrastive losses ignore the neighbor relationship in multi-view scenarios and easily lead to false associations in sample pairs. To address these issues, we propose Graph Embedded Contrastive Learning for Multi-View Clustering. Concretely, we propose a view-specific pre-training process with adaptive graph convolution to make our method compatible with both multi-view and multi-graph data, which aggregates the graph information into the data and leverages autoencoders to learn view-specific representations. Furthermore, to explore the cross-view associations, we introduce a cross-view contrastive learning and clustering process, where we propose graph-guided contrastive learning, which generates a global graph to mitigate the false-association issue, and cluster-guided contrastive clustering to improve model robustness. Finally, extensive experiments demonstrate that our method achieves superior performance on both MVC and MVGC tasks.
2292: Exploiting Self-Refining Normal Graph Structures for Robust Defense against Unsupervised Adversarial Attacks
Authors: Bingdao Feng, Di Jin, Xiaobao Wang, Dongxiao He, Jingyi Cao, Zhen Wang
Location: Guangzhou | Day: TBD
Show Abstract
Defending against adversarial attacks on graphs has become increasingly important. Graph refinement to enhance the quality and robustness of representation learning is a critical area that requires thorough investigation. We observe that representations learned from attacked graphs are often ineffective for refinement due to perturbations that cause the endpoints of perturbed edges to become more similar, complicating the defender’s ability to distinguish them. To address this challenge, we propose a robust unsupervised graph learning framework that utilizes cleaner graphs to learn effective representations. Specifically, we introduce an anomaly detection model based on contrastive learning to obtain a rough graph that excludes a large number of perturbed structures. We then propose the Graph Pollution Degree (GPD), a mutual information-based measure that leverages the encoder’s representation capability on the rough graph to assess the trustworthiness of the predicted graph and refine the learned representations. Extensive experiments on four benchmark datasets demonstrate that our method outperforms nine state-of-the-art defense models, effectively defending against adversarial attacks and enhancing node classification performance.
2296: GBGC: Efficient and Adaptive Graph Coarsening via Granular-ball Computing
Authors: Shuyin Xia, Guan Wang, Gaojie Xu, Sen Zhao, Guoyin Wang
Location: Guangzhou | Day: TBD
Show Abstract
The objective of graph coarsening is to generate smaller, more manageable graphs while preserving key information of the original graph. Previous work was mainly based on a spectrum-preserving perspective, using predefined coarsening rules to make the eigenvalues of the Laplacian matrices of the original and coarsened graphs match as closely as possible. However, it largely overlooked the fact that the original graph is composed of subregions at different levels of granularity, where highly connected and similar nodes should be more inclined to be aggregated together as nodes in the coarsened graph. By combining the multi-granularity characteristics of the graph structure, we can generate the coarsened graph at the optimal granularity. To this end, inspired by the application of granular-ball computing in multi-granularity, we propose a new multi-granularity, efficient, and adaptive coarsening method via granular-ball (GBGC), which significantly improves the coarsening results and efficiency. Specifically, GBGC introduces an adaptive granular-ball graph refinement mechanism, which splits the original graph from coarse to fine into granular-balls of different sizes and optimal granularity, and constructs the coarsened graph using these granular-balls as supernodes. In addition, compared with other state-of-the-art graph coarsening methods, this method is tens to hundreds of times faster and has lower time complexity. The accuracy of GBGC is almost always higher than that of the original graph due to the good robustness and generalization of granular-ball computing, so it has the potential to become a standard graph data preprocessing method.
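A coarse sketch of the coarse-to-fine granular-ball splitting described above; the paper's adaptive refinement criterion is abstracted into a purity_fn stand-in, and 2-way k-means splitting is an illustrative choice:

    import numpy as np
    from sklearn.cluster import KMeans

    def granular_balls(X, purity_fn, threshold=0.9):
        # Keep a ball if purity_fn(ball) passes the threshold, otherwise 2-split it.
        # The returned balls act as supernodes of the coarsened graph.
        balls, queue = [], [X]
        while queue:
            ball = queue.pop()
            if len(ball) <= 2 or purity_fn(ball) >= threshold:
                balls.append(ball)
                continue
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(ball)
            queue.extend([ball[labels == 0], ball[labels == 1]])
        return balls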
2311: CFDONEval: A Comprehensive Evaluation of Operator-Learning Neural Network Models for Computational Fluid Dynamics
Authors: Menghan Liu, Jianhuan Cen, Ziyang Zhou, Haolong Fan, Hongji Li, Ping Wei, Guohang Peng, Changye He, Yuzhe Qin, Yutong Lu, Qingsong Zou
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we introduce CFDONEval, a comprehensive evaluation of 12 operator-learning-based neural network (ON) models to simulate 7 benchmark fluid dynamics problems. These problems cover a range of 2D scenarios, including Darcy flow, two-phase flow, Taylor-Green vortex, lid-driven cavity flow, tube flow, circular cylinder flow, and 3D periodic hill flow. For a rigorous evaluation, we establish 22 fluid dynamics datasets for these benchmark problems, 18 of which are newly generated using traditional numerical methods, such as the finite element method. Our evaluation tackles 5 key challenges: multiscale phenomena, convection dominance, long-term predictions, multiphase flows, and unstructured meshes over complex geometries. We assess computational accuracy, efficiency, and flow field visualization, offering valuable insights into the application of ON models in fluid dynamics research. Our findings show that attention-based models perform well in handling almost all challenges; models with a U-shaped structure excel in handling multiscale problems; and the NU-FNO model demonstrates the smallest relative error in L2 norm when processing nonuniform grid data. The related code, dataset, and appendix are publicly available at: https://github.com/Sysuzqs/CFDNNEval.
2320: Enhanced Graph Similarity Learning via Adaptive Multi-scale Feature Fusion
Authors: Cuifang Zou, Guangquan Lu, Wenzhen Zhang, Xuxia Zeng, Shilong Lin, Longqing Du, Shichao Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Graph similarity computation plays a crucial role in a variety of fields such as chemical molecular structure comparison, social network analysis and code clone detection. However, due to inadequate feature representation, existing methods often struggle to cope with complex graph structures, which in turn limits the feature fusion capability and leads to low accuracy of similarity computation. To address these issues, this paper introduces an Adaptive Multi-scale Feature Fusion (AMFF) framework. AMFF first enhances feature extraction through a residual graph neural network, which robustly captures key information in complex graph structures. Based on this, a multi-pooled attention network is used to aggregate multi-scale features and accurately extract key node features while minimizing information loss. Finally, the adaptive multi-scale feature fusion mechanism dynamically adjusts the feature fusion weights according to the interactions between nodes and graph embeddings, thus improving the accuracy and sensitivity of similarity computation. Extensive experiments on benchmark datasets including AIDS700nef, LINUX, IMDBMulti, and PTC show that AMFF significantly outperforms existing methods on several metrics. These results confirm the efficiency and robustness of AMFF in graph similarity computation, providing a promising solution for assessing the similarity of complex graph data.
2322: Rolling in Classical Planning with Conditional Effects and Constraints
Authors: Matteo Cardellini, Enrico Giunchiglia
Location: Guangzhou | Day: TBD
Show Abstract
In classical planning, conditional effects (CEs) allow modelling non-idempotent actions, where the resulting state may depend on how many times each action is consecutively repeated.
Though CEs have been widely studied in the literature, no one has ever studied how to exploit rolling, i.e., how to effectively model the consecutive repetition of an action.
In this paper, we fill this void by (i) showing that planning with CEs remains PSPACE-complete even in the limit case of problems with a single action, (ii) presenting a correct and complete planning as satisfiability encoding exploiting rolling while effectively dealing with constraints imposed on the set of reachable states, and (iii) theoretically and empirically showing its substantial benefits.
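To make the rolling semantics concrete, here is a naive Python unrolling of a non-idempotent action with a conditional effect; the dict-based state and (condition, variable, update) encoding are illustrative, and the paper's contribution is precisely to encode this repetition compactly in the SAT formulation rather than simulating it step by step:

    def roll(state, action, k):
        # Apply a (possibly non-idempotent) action k consecutive times; with
        # conditional effects, the resulting state genuinely depends on k.
        for _ in range(k):
            updates = {var: update(state) for cond, var, update in action if cond(state)}
            state = {**state, **updates}
        return state

    # e.g. filling a tank by 5 units per application, with a conditional guard:
    fill = [(lambda s: s["level"] < 20, "level", lambda s: s["level"] + 5)]
    print(roll({"level": 0}, fill, 6))   # {'level': 20}, not 30: the CE stops the filling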
2327: ST-TAR: An Efficient Spatio-Temporal Learning Framework for Traffic Accident Risk Forecasting
Authors: Hongyu Wang, Lisi Chen, Shuo Shang, Peng Han, Christian S. Jensen
Location: Guangzhou | Day: TBD
Show Abstract
Traffic accidents represent a significant concern due to their devastating consequences. The ability to predict future traffic accident risks is of key importance to accident prevention activities in transportation systems. Although existing studies have made substantial efforts to model spatio-temporal correlations, they fall short when it comes to addressing the zero-inflated data issue and capturing spatio-temporal heterogeneity, which reduces their predictive abilities. In addition, improving efficiency is an urgent requirement for traffic accident forecasting. To overcome these limitations, we propose an efficient Spatio-Temporal learning framework for Traffic Accident Risk forecasting (ST-TAR). Taking long-term and short-term data as separate inputs, the ST-TAR model integrates hierarchical multi-view GCN and long short-term cross-attention mechanism to encode spatial dependencies and temporal patterns. We leverage long-term periodicity and short-term proximity for spatio-temporal contrastive learning to capture spatio-temporal heterogeneity. A tailored adaptive risk-level weighted loss function based on efficient locality-sensitive hashing is introduced to alleviate the zero-inflated issue. Extensive experiments on two real-world datasets offer evidence that ST-TAR is capable of advancing state-of-the-art forecasting accuracy with improved efficiency. This makes ST-TAR suitable for applications that require accurate real-time forecasting.
2328: ProMEA: Prompt-driven Expansion and Alignment for Single Domain Generalization
Authors: Yunyun Wang, Yi Guo, Xiaodong Liu, Songcan Chen
Location: Guangzhou | Day: TBD
Show Abstract
In single Domain Generalization (single-DG), data scarcity in the single source domain hampers the learning of invariant features, leading to overfitting on the source domain and poor generalization to unseen target domains. Existing single-DG methods primarily augment the source domain by adversarial generation. However, two key challenges remain. i) With simple feature perturbation to confuse the classifier, they may generate unnatural samples with semantic ambiguity or distortion. ii) It is still difficult to cover the sufficient shift of a real domain by generating samples indistinguishable from the source data; thus, the learning model cannot escape overfitting to the single source domain. To this end, we turn to augmenting the domain prompt, considering that text prompt perturbations are easier to generate and generalize.
The source domain is then expanded under the guidance of augmented text prompts, which are learnable with both semantic consistency and style diversity. Specifically, we propose a ProMpt-driven Expansion and Alignment (ProMEA) method for single-DG, in which a Domain Prompt Expansion module is first developed to expand the single source domain with frequency features of the augmented text prompts, whose amplitude spectrum predominantly harbors the domain style information. With source prompts, a Domain Prompt Alignment module is further designed at inference for adapting target samples to the expanded source domains, in order to reduce the domain discrepancy. Finally, empirical results on single-DG benchmarks demonstrate the superiority of our proposal.
2330: DisPIM: Distilling PreTrained Image Models for Generalizable Visuo-Motor Control
Authors: Haitao Wang, Hejun Wu
Location: Guangzhou | Day: TBD
Show Abstract
We introduce DisPIM, a framework that leverages pretrained image models (PIMs) for visuo-motor control. Applying PIMs to visuo-motor control is difficult because of the distribution shift between visual environmental states and the pretraining datasets. Due to this shift, fine-tuning PIMs specifically for visuo-motor control may hurt their generalizability, while adding extra tunable parameters for specific actions leads to high computational costs. DisPIM addresses these challenges with a novel feature distillation approach that yields a compact model which not only inherits the generalization capability of PIMs but also acquires task-specific skills for visuo-motor control. This dual benefit is mainly achieved through a target Q-ensemble mechanism inspired by double Q-learning, which adaptively adjusts the distillation rate to balance the objectives of generalization and task-specific ability during training. With this balancing mechanism, DisPIM achieves both task-specific and generalizable control at a low computational cost. Across a series of algorithms, task domains, and evaluation metrics in both simulation and on a real robot, DisPIM demonstrates significant improvements in generalization and overall performance with low computational overhead.
2331: HA-SCN: Learning Hierarchical Aligned Subtree Convolutional Networks for Graph Classification
Authors: Xinya Qin, Lu Bai, Lixin Cui, Ming Li, Hangyuan Du, Yue Wang, Edwin Hancock
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we propose a Hierarchical Aligned Subtree Convolutional Network (HA-SCN) for graph classification. Our idea is to transform graphs of arbitrary sizes into fixed-sized aligned graphs and construct a normalized K-layer m-ary subtree for each node in the aligned graphs. By sliding convolutional filters over the entire subtree at each node, we define a novel subtree convolution and pooling operation that hierarchically abstracts node-level information. We demonstrate that the proposed HA-SCN model not only realizes the convolution mechanism similar to the Convolutional Neural Networks (CNNs), which have the characteristics of weight sharing and fixed-sized receptive fields, but also effectively mitigates the over-squashing problem. Meanwhile, it establishes the correspondence information between nodes, alleviating the information loss issue. Experimental results on various benchmark graph datasets show that our approach achieves state-of-the-art performance in graph classification tasks.
2332: Towards Micro-Action Recognition with Limited Annotations: An Asynchronous Pseudo Labeling and Training Approach
Authors: Yan Zhang, Lechao Cheng, Yaxiong Wang, Zhun Zhong, Meng Wang
Location: Guangzhou | Day: TBD
Show Abstract
Micro-Action Recognition (MAR) aims to classify subtle human actions in video. However, annotating MAR datasets is particularly challenging due to the subtlety of actions. To this end, we introduce the setting of Semi-Supervised MAR (SSMAR), where only a portion of the samples is labeled. We first apply traditional Semi-Supervised Learning (SSL) methods to SSMAR and find that these methods tend to overfit on inaccurate pseudo-labels, leading to error accumulation and degraded performance. This issue primarily arises from the common practice of directly using the predictions of the classifier as pseudo-labels to train the model. To solve this issue, we propose a novel framework, called Asynchronous Pseudo Labeling and Training (APLT), which explicitly separates the pseudo-labeling process from model training. Specifically, we introduce a semi-supervised clustering method during the offline pseudo-labeling phase to generate more accurate pseudo-labels. Moreover, a self-adaptive thresholding strategy is proposed to dynamically filter noisy labels of different classes. We then build a memory-based prototype classifier based on the filtered pseudo-labels, which is fixed and used to guide the subsequent model training phase. By alternating the pseudo-labeling and model-training phases in an asynchronous manner, the model can not only be learned with more accurate pseudo-labels but also avoid the overfitting issue. Experiments on three MAR datasets show that our APLT largely outperforms state-of-the-art SSL methods. For instance, APLT improves accuracy by 14.5% over FixMatch on the MA-12 dataset when using only 50% labeled data. Code is available at https://github.com/zy-hfut/APLT
2334: TESTN: A Triad-Enhanced Spatio-Temporal Network for Multi-Temporal POI Relationship Inference
Authors: Hongyu Wang, Lisi Chen, Shuo Shang
Location: Guangzhou | Day: TBD
Show Abstract
Multi-temporal Point-of-Interest (POI) relationship inference aims to identify evolving relationships among locations over time, providing critical insights for location-based services. While existing studies have made substantial efforts to model relationships with custom-designed graph neural networks, they face the challenge of leveraging POI contextual information characterized by spatial dependencies and temporal dynamics, as well as capturing the heterogeneity of multi-type relationships. To address these challenges, we propose a Triad-Enhanced Spatio-Temporal Network (TESTN), which conceptualizes triads as interactions between relationships for capturing potential interplay. Specifically, TESTN incorporates the spatial 2-hop aggregation layer to capture geographical and semantic information beyond first-order neighbors and the temporal context extractor to integrate relational dynamics within adjacent time segments. Furthermore, we introduce a self-supervised pairwise neighboring relation consistency detection scheme to preserve the heterogeneity of multi-type relationships. Extensive experiments on three real-world datasets demonstrate the superior performance of our TESTN framework.
2337: Mitigating Message Imbalance in Fraud Detection with Dual-View Graph Representation Learning
Authors: Yudan Song, Yuecen Wei, Yuhang Lu, Qingyun Sun, Minglai Shao, Li-e Wang, Chunming Hu, Xianxian Li, Xingcheng Fu
Location: Guangzhou | Day: TBD
Show Abstract
Graph representation learning has become a mainstream method for fraud detection due to its strong expressive power, which focuses on enhancing node representations through improved neighborhood knowledge capture. However, the focus on local interactions leads to imbalanced transmission of global topological information and increased risk of node-specific information being overwhelmed during aggregation due to the imbalance between fraud and benign nodes. In this paper, we first summarize the impact of topology and class imbalance on downstream tasks in GNN-based fraud detection, as the problem of imbalanced supervisory messages is caused by fraudsters’ topological behavior obfuscation and identity feature concealment. Based on statistical validation, we propose a novel dual-view graph representation learning method to mitigate Message imbalance in Fraud Detection (MimbFD). Specifically, we design a topological message reachability module for high-quality node representation learning to penetrate fraudsters’ camouflage and alleviate insufficient propagation. Then, we introduce a local confounding debiasing module to adjust node representations, enhancing the stable association between node representations and labels to balance the influence of different classes. Finally, we conducted experiments on three public fraud datasets, and the results demonstrate that MimbFD exhibits outstanding performance in fraud detection.
2345: Granular-Ball-Induced Multiple Kernel K-Means
Authors: Shuyin Xia, Yifan Wang, Lifeng Shen, Guoyin Wang
Location: Guangzhou | Day: TBD
Show Abstract
Most existing multi-kernel clustering algorithms, such as multi-kernel K-means, often struggle with computational efficiency and robustness when faced with complex data distributions. These challenges stem from their dependence on point-to-point relationships for optimization, which can make it difficult to accurately capture a data set’s inherent structure and diversity. Additionally, the intricate interplay between multiple kernels in such algorithms can further exacerbate these issues, ultimately impairing their ability to cluster data points in high-dimensional spaces. In this paper, we leverage granular-ball computing to improve the multi-kernel clustering framework.
The core of granular-ball computing is to adaptively fit data distribution by balls from coarse to acceptable levels.
Each ball can enclose data points based on a density consistency measurement.
Such a ball-based data description thus improves computational efficiency and robustness to unknown noise. Specifically, based on granular-ball representations, we introduce the granular-ball kernel (GBK) and its corresponding granular-ball multi-kernel K-means framework (GB-MKKM) for efficient clustering.
Using granular-ball relationships in multiple kernel spaces, the proposed GB-MKKM framework shows its superiority in efficiency and clustering performance in the empirical evaluation of various clustering tasks.
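A minimal sketch of the multiple-kernel combination at the heart of MKKM-style methods, here over granular-ball kernel matrices; the alternating optimization of the weights against the kernel k-means objective is omitted, and the squared-weight parameterization is one common convention:

    import numpy as np

    def combined_kernel(kernels, weights):
        # kernels: list of (m, m) granular-ball kernel matrices, computed between
        # ball centers rather than raw points; weights: (p,) non-negative weights.
        # MKKM-style combination: K = sum_i w_i^2 * K_i.
        w = np.asarray(weights)
        return sum((wi ** 2) * K for wi, K in zip(w, kernels))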
2353: Modality-Fair Preference Optimization for Trustworthy MLLM Alignment
Authors: Songtao Jiang, Yan Zhang, Ruizhe Chen, Tianxiang Hu, Yeying Jin, Qinglin He, Yang Feng, Jian Wu, Zuozhu Liu
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal large language models (MLLMs) have achieved remarkable success across various tasks. However, separate training of visual and textual encoders often results in a misalignment between the modalities. Such misalignment may lead models to generate content that is absent from the input image, a phenomenon referred to as hallucination. These inaccuracies severely undermine the trustworthiness of MLLMs in real-world applications. Despite attempts to optimize text preferences to mitigate this issue, our initial investigation indicates that the trustworthiness of MLLMs remains inadequate. Specifically, these models tend to provide preferred answers even when the input image is heavily distorted. Analysis of visual token attention also indicates that the model focuses primarily on the surrounding context rather than the key object referenced in the question. These findings highlight a misalignment between the modalities, where answers inadequately leverage input images. Motivated by our findings, we propose Modality-Fair Preference Optimization (MFPO), which comprises three components: the construction of a multimodal preference dataset in which dispreferred images differ from originals solely in key regions; an image reward loss function encouraging the model to generate answers better aligned with the input images; and an easy-to-hard iterative alignment strategy to stabilize joint modality training. Extensive experiments on three trustworthiness benchmarks demonstrate that MFPO significantly enhances the trustworthiness of MLLMs. In particular, it enables 7B models to attain trustworthiness levels on par with, or even surpassing, those of 13B, 34B, and larger models.
2372: Towards Recognizing Spatial-temporal Collaboration of EEG Phase Brain Networks for Emotion Understanding
Authors: Jiangfeng Sun, Kaiwen Xue, Qika Lin, Yufei Qiao, Yifan Zhu, Zhonghong Ou, Meina Song
Location: Guangzhou | Day: TBD
Show Abstract
Emotion recognition from EEG signals is crucial for understanding complex brain dynamics. Existing methods typically rely on static frequency bands and graph convolutional networks (GCNs) to model brain connectivity. However, EEG signals are inherently non-stationary and exhibit substantial individual variability, making static-band approaches inadequate for capturing their dynamic properties. Moreover, spatial-temporal dependencies in EEG often lead to feature degradation during node aggregation, ultimately limiting recognition performance. To address these challenges, we propose the Spatial-Temporal Electroencephalograph Collaboration framework (Stella). Our approach introduces an Adaptive Bands Selection module (ABS) that dynamically extracts low- and high-frequency components, generating dual-path features comprising phase brain networks for connectivity modeling and time-series representations for local dynamics. To further mitigate feature degradation, the Fourier Graph Operator (FGO) operates in the spectral domain, while the Spatial-Temporal Encoder (STE) enhances representation stability and density. Extensive experiments on benchmark EEG datasets demonstrate that Stella achieves state-of-the-art performance in emotion recognition, offering valuable insights for graph-based modeling of non-stationary neural signals. The code is available at https://github.com/sun2017bupt/EEGBrainNetwork.
2390: Preventing Latent Diffusion Model-Based Image Mimicry via Angle Shifting and Ensemble Learning
Authors: Minghao Li, Rui Wang, Ming Sun, Lihua Jing
Location: Guangzhou | Day: TBD
Show Abstract
The remarkable progress of Latent Diffusion Models (LDMs) in image generation has raised concerns about the potential for unauthorized image mimicry. To address these concerns, studies on adversarial attacks against LDMs have gained increasing attention in recent years. However, existing methods face bottlenecks when attacking the denoising module. In this work, we reveal that the robustness of the denoising module stems from two key factors: the cancellation effect between adversarial perturbations and estimated noise, and unstable gradients caused by randomly sampled timesteps and Gaussian noise. Based on these insights, we introduce a cosine similarity adversarial loss to prevent the generation of perturbations that are easily impaired and develop a more stable optimization strategy by ensembling gradients and fixing the noise in the latent space. Additionally, we propose an alternating iterative framework to reduce memory usage by mathematically dividing the optimization process into two spaces: latent space and pixel space. Compared to previous strategies, our proposed framework reduces video memory demands without sacrificing attack effectiveness. Extensive experiments demonstrate that the alternating iterative framework and the stable optimization strategy on cosine similarity loss are more efficient and more effective. Code is available at https://github.com/MinghaoLi01/cosattack.
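A minimal sketch of a cosine-similarity term of the kind described above, measuring the angle between the denoiser's estimated noise and the adversarial perturbation; the exact operands and optimization direction in the paper may differ, so this only shows the loss shape that targets the cancellation effect:

    import torch
    import torch.nn.functional as F

    def cosine_angle_term(eps_pred, delta):
        # eps_pred: (B, C, H, W) noise estimated by the denoising module;
        # delta: (B, C, H, W) adversarial perturbation in the same space.
        # The attack optimizes delta to shift this angle so that the estimated
        # noise cannot simply cancel the perturbation during denoising.
        return F.cosine_similarity(eps_pred.flatten(1), delta.flatten(1), dim=1).mean()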
2415: Adversarial Training for Graph Convolutional Networks: Stability and Generalization Analysis
Authors: Chang Cao, Han Li, Yulong Wang, Rui Wu, Hong Chen
Location: Guangzhou | Day: TBD
Show Abstract
Recently, numerous methods have been proposed to enhance the robustness of the Graph Convolutional Networks (GCNs) for their vulnerability against adversarial attacks. Despite their empirical success, a significant gap remains in understanding GCNs’ adversarial robustness from the theoretical perspective. This paper addresses this gap by analyzing generalization against both node and structure attacks for multi-layer GCNs through the framework of uniform stability. Under the smoothness assumption of the loss function, we establish the first adversarial generalization bound of GCNs in expectation. Our theoretical analysis contributes to a deeper understanding of how adversarial perturbations and graph architectures influence generalization performance, which provides meaningful insights for designing robust models. Experimental results on benchmark datasets confirm the validity of our theoretical findings, highlighting their practical significance.
2417: Identifying and Reusing Learnwares Across Different Label Spaces
Authors: Jian-Dong Liu, Zhi-Hao Tan, Zhi-Hua Zhou
Location: Guangzhou | Day: TBD
Show Abstract
The learnware paradigm focuses on leveraging numerous established high-performing models to solve machine learning tasks instead of starting from scratch. As the key concept of this paradigm, a learnware consists of a well-trained model of any structure and a specification that characterizes the model’s capabilities, allowing it to be identified and reused for future tasks. Given the existence of numerous real-world models trained on diverse label spaces, effectively identifying and combining these models to address tasks involving previously unseen label spaces represents a critical challenge in this paradigm. In this paper, we make the first attempt to identify and reuse effective learnware combinations for tackling learning tasks across different label spaces, extending their applicability beyond the original purposes of individual learnwares. To this end, we introduce a statistical class-wise specification for establishing similarity relations between various label spaces. Leveraging these relations, we model the utility of a learnware combination as a minimum-cost maximum-flow problem, and further develop fine-grained learnware identification and assembly methods. Extensive experiments with thousands of heterogeneous models validate our approach, demonstrating that reusing identified learnware combinations can outperform both training from scratch and fine-tuning a generic pre-trained model.
2426: RLMiniStyler: Light-weight RL Style Agent for Arbitrary Sequential Neural Style Generation
Authors: Jing Hu, Chengming Feng, Shu Hu, Ming-Ching Chang, Xin Li, Xi Wu, Xin Wang
Location: Guangzhou | Day: TBD
Show Abstract
Arbitrary style transfer aims to apply the style of any given artistic image to another content image. However, existing deep learning-based methods often require significant computational costs to generate diverse stylized results. Motivated by this, we propose RLMiniStyler, a novel reinforcement learning-based framework for arbitrary style transfer. This framework leverages a unified reinforcement learning policy to iteratively guide the style transfer process by exploring and exploiting stylization feedback, generating smooth sequences of stylized results while keeping the model lightweight. Furthermore, we introduce an uncertainty-aware multi-task learning strategy that automatically adjusts loss weights to adapt to the content and style balance requirements at different training stages, thereby accelerating model convergence. Through a series of experiments across various image resolutions, we have validated the advantages of RLMiniStyler over other state-of-the-art methods in generating high-quality, diverse artistic image sequences at a lower cost. Codes are available at https://github.com/fengxiaoming520/RLMiniStyler.
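One standard way to realize uncertainty-aware loss weighting is the homoscedastic-uncertainty formulation of Kendall et al.; whether RLMiniStyler uses exactly this form is an assumption, but it illustrates how loss weights can adjust themselves during training:

    import torch

    class UncertaintyWeighting(torch.nn.Module):
        # Learnable log-variances weight each task loss; large-uncertainty tasks are
        # down-weighted, and the additive log_var term prevents the trivial solution.
        def __init__(self, n_tasks=2):
            super().__init__()
            self.log_vars = torch.nn.Parameter(torch.zeros(n_tasks))

        def forward(self, losses):
            total = 0.0
            for i, loss in enumerate(losses):
                total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
            return total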
2430: Self-Consistent Model-based Adaptation for Visual Reinforcement Learning
Authors: Xinning Zhou, Chengyang Ying, Yao Feng, Hang Su, Jun Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Visual reinforcement learning agents typically face serious performance declines in real-world applications caused by visual distractions. Existing methods rely on fine-tuning the policy’s representations with hand-crafted augmentations. In this work, we propose Self-Consistent Model-based Adaptation (SCMA), a novel method that fosters robust adaptation without modifying the policy. By transferring cluttered observations to clean ones with a denoising model, SCMA can mitigate distractions for various policies as a plug-and-play enhancement. To optimize the denoising model in an unsupervised manner, we derive an unsupervised distribution matching objective with a theoretical analysis of its optimality. We further present a practical algorithm to optimize the objective by estimating the distribution of clean observations with a pre-trained world model. Extensive experiments on multiple visual generalization benchmarks and real robot data demonstrate that SCMA effectively boosts performance across various distractions and exhibits better sample efficiency.
2440: CMFS: CLIP-Guided Modality Interaction for Mitigating Noise in Multi-Modal Image Fusion and Segmentation
Authors: Guilin Su, Yuqing Huang, Chao Yang, Zhenyu He
Location: Guangzhou | Day: TBD
Show Abstract
Infrared-visible image fusion and semantic segmentation are pivotal tasks for robust scene understanding under challenging conditions such as low light. However, existing methods often struggle with high noise, modality inconsistencies, and inefficient cross-modal interactions, limiting fusion quality and segmentation accuracy. To this end, we propose CMFS, a unified framework that leverages CLIP-guided modality interaction to mitigate noise in multi-modal image fusion and segmentation. Our approach features a region-aware Modal Interaction Alignment module that combines a VMamba-based encoder with an additional shuffle layer to obtain more robust features and a CLIP-guided, regionally constrained multi-modal feature interaction block to emphasize foreground targets while suppressing low-light noise. Additionally, a Frequency-Spatial Collaboration module uses selective scanning and integrates wavelet-, spatial-, and Fourier-domain features to achieve adaptive denoising and balanced feature allocation. Furthermore, we employ a low-rank mixture-of-experts with dynamic routing to improve region-specific fusion and enhance pixel-level accuracy. Extensive experiments on several benchmarks show that, compared with state-of-the-art methods, the proposed approach demonstrates effectiveness in both image fusion quality and semantic segmentation accuracy, especially in complex environments. The source code will be released at IJCAI2025-CMFS.
2446: Exact Algorithms with New Upper Bounds for the Maximum k-plex Problem
Authors: Jiongzhi Zheng, Mingming Jin, Kun He
Location: Guangzhou | Day: TBD
Show Abstract
The Maximum k-plex Problem (MKP) is a degree relaxation of the widely known Maximum Clique Problem. As a practical NP-hard problem, MKP has many important real-world applications, such as the analysis of various complex networks. Branch-and-bound (BnB) algorithms are well-studied and effective exact algorithms for MKP, and the key to a BnB algorithm is its bound design. Recent BnB MKP algorithms involve two kinds of upper bounds, based on graph coloring and partition respectively, which work from different perspectives and are thus complementary to each other. We first propose a new coloring-based upper bound, termed Relaxed Graph Color Bound (RelaxGCB), that significantly outperforms the previous coloring-based upper bound. We then propose another new upper bound, termed RelaxPUB, that incorporates RelaxGCB and a partition-based upper bound in a novel way, making use of their complementarity. We apply RelaxGCB and RelaxPUB to state-of-the-art BnB MKP algorithms and produce eight new BnB algorithms. Extensive experiments using diverse k values on hundreds of instances based on dense or massive sparse graphs demonstrate the excellent performance and robustness of our proposed methods.
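To make the coloring idea concrete, below is a minimal sketch of the classical coloring-based bound that this line of work refines (it illustrates the baseline, not RelaxGCB itself): since every vertex of a k-plex may miss at most k-1 neighbours inside it, any independent set contributes at most k vertices, so each color class of a greedy coloring is capped at k.

```python
import networkx as nx

def coloring_upper_bound(G: nx.Graph, candidates, k: int) -> int:
    """Classical coloring bound for the maximum k-plex within `candidates`:
    greedily color the candidate subgraph, then cap each color class
    (an independent set) at k vertices."""
    coloring = nx.coloring.greedy_color(G.subgraph(candidates),
                                        strategy="largest_first")
    class_sizes = {}
    for _, color in coloring.items():
        class_sizes[color] = class_sizes.get(color, 0) + 1
    return sum(min(size, k) for size in class_sizes.values())
```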
2449: Modality-Guided Dynamic Graph Fusion and Temporal Diffusion for Self-Supervised RGB-T Tracking
Authors: Shenglan Li, Rui Yao, Yong Zhou, Hancheng Zhu, Kunyang Sun, Bing Liu, Zhiwen Shao, Jiaqi Zhao
Location: Guangzhou | Day: TBD
Show Abstract
To reduce the reliance on large-scale annotations, self-supervised RGB-T tracking approaches have garnered significant attention. However, erroneous pseudo-labels that omit the object region or introduce background noise reduce the effectiveness of modality fusion, while pseudo-label noise triggered by similar objects can further degrade tracking performance. In this paper, we propose GDSTrack, a novel approach that introduces dynamic graph fusion and temporal diffusion to address the above challenges in self-supervised RGB-T tracking. GDSTrack dynamically fuses the modalities of neighboring frames, treats neighboring-frame features as distractor noise, and leverages the denoising capability of a generative model. Specifically, by constructing an adjacency matrix via an Adjacency Matrix Generator (AMG), the proposed Modality-guided Dynamic Graph Fusion (MDGF) module uses a dynamic adjacency matrix to guide graph attention, focusing on and fusing the object’s coherent regions. Temporal Graph-Informed Diffusion (TGID) models MDGF features from neighboring frames as interference, thus improving robustness against similar-object noise. Extensive experiments conducted on four public RGB-T tracking datasets demonstrate that GDSTrack outperforms the existing state-of-the-art methods.
The source code is available at https://github.com/LiShenglana/GDSTrack.
2452: Trajectory-Dependent Generalization Bounds for Pairwise Learning with φ-mixing Samples
Authors: Liyuan Liu, Hong Chen, Weifu Li, Tieliang Gong, Hao Deng, Yulong Wang
Location: Guangzhou | Day: TBD
Show Abstract
Recently, a mathematical tool from fractal geometry (i.e., the fractal dimension) has been employed to investigate the optimization trajectory-dependent generalization ability of some pointwise learning models with independent and identically distributed (i.i.d.) observations. This paper goes beyond the limitations of pointwise learning and i.i.d. samples, and establishes generalization bounds for pairwise learning with uniformly strong mixing samples. The derived theoretical results fill the gap of trajectory-dependent generalization analysis for pairwise learning, and can be applied to a wide range of learning paradigms, e.g., metric learning, ranking, and gradient learning. Technically, our framework brings concentration estimation with Rademacher complexity and the trajectory-dependent fractal dimension together in a coherent way for learning theory analysis. In addition, the efficient computation of the fractal dimension can be guaranteed for random algorithms (e.g., the stochastic gradient descent algorithm for deep neural networks) by bridging topological data analysis tools and the trajectory-dependent fractal dimension.
2453: Self-calibration Enhanced Whole Slide Pathology Image Analysis
Authors: Haoming Luo, Xiaotian Yu, Shengxuming Zhang, Jiabin Xia, Jian Yang, Yuning Sun, Xiuming Zhang, Jing Zhang, Zunlei Feng
Location: Guangzhou | Day: TBD
Show Abstract
Pathology images are considered the “gold standard” for cancer diagnosis and treatment, with gigapixel images providing extensive tissue and cellular information. Existing methods fail to efficiently extract both global structural and local detail features for comprehensive pathology image analysis. To address these limitations, we propose a self-calibration enhanced framework for whole slide pathology image analysis, comprising three components: a global branch, a focus predictor, and a detailed branch. The global branch initially classifies using the pathological thumbnail, while the focus predictor identifies relevant regions for classification based on the last-layer features of the global branch. The detailed extraction branch then assesses whether the magnified regions correspond to the lesion area. Finally, a feature consistency constraint between the global and detail branches ensures that the global branch focuses on the appropriate region and extracts sufficient discriminative features for final identification. These focused discriminative features can facilitate the discovery of novel prognostic tumor markers, from the perspective of feature uniqueness and tissue spatial distribution. Extensive experimental results demonstrate that the proposed framework can rapidly deliver accurate and explainable results for pathological grading and prognosis tasks.
2466: RDPA: Real-Time Distributed-Concentrated Penetration Attack for Point Cloud Learning
Authors: Youtong Shi, Lixin Chen, Yu Zang, Chenhui Yang, Cheng Wang
Location: Guangzhou | Day: TBD
Show Abstract
Partial point attack approaches focus on leveraging the fewest points to achieve the best attack efficiency for easy implementation in the physical domain. For the first time, this paper proposes that a partial point attack strategy should pay attention not only to the selection and disturbance of points, but also to the penetration of current defense methods. By re-examining the characteristics of previous partial point attack approaches that lead to performance improvement, we discover two fundamental principles: first, the selection of attacked points should consider not only favourable visual salience but also proper position concentration, so as to achieve effective structural destruction while remaining imperceptible; second, the perturbation of target points should form meaningful structures rather than outliers. To achieve this, we first propose a novel distributed-concentrated point selection (DPS) strategy, which makes it easier to concentrate salient points containing rich local information in a few tiny regions. Additionally, to enhance the penetration efficacy and real-time performance of attack point clouds against defenses, we further design a perturbation network based on a multi-scale penetration loss (L_msp), which can generate adversarial samples with as few outliers as possible through only a single forward propagation. Experimental results demonstrate that the real-time distributed-concentrated penetration attack (RDPA) framework achieves state-of-the-art (SOTA) success rates by perturbing only 3.5% of points, and offers the best penetration of mainstream defense methods such as SRS and SOR.
2480: S-EPOA: Overcoming the Indistinguishability of Segments with Skill-Driven Preference-Based Reinforcement Learning
Authors: Ni Mu, Yao Luan, Yiqin Yang, Bo Xu, Qing-Shan Jia
Location: Guangzhou | Day: TBD
Show Abstract
Preference-based reinforcement learning (PbRL) stands out by utilizing human preferences as a direct reward signal, eliminating the need for intricate reward engineering. However, despite its potential, traditional PbRL methods are often constrained by the indistinguishability of segments, which impedes the learning process. In this paper, we introduce the Skill-Enhanced Preference Optimization Algorithm (S-EPOA), which addresses the segment indistinguishability issue by integrating skill mechanisms into the preference learning framework. Specifically, we first conduct unsupervised pretraining to learn useful skills. Then, we propose a novel query selection mechanism to balance the information gain and distinguishability over the learned skill space. Experimental results on a range of tasks, including robotic manipulation and locomotion, demonstrate that S-EPOA significantly outperforms conventional PbRL methods in terms of both robustness and learning efficiency. The results highlight the effectiveness of skill-driven learning in overcoming the challenges posed by segment indistinguishability.
2495: Enhancing Chemical Reaction and Retrosynthesis Prediction with Large Language Model and Dual-task Learning
Authors: Xuan Lin, Qingrui Liu, Hongxin Xiang, Daojian Zeng, Xiangxiang Zeng
Location: Guangzhou | Day: TBD
Show Abstract
Chemical reaction and retrosynthesis prediction are fundamental tasks in drug discovery. Recently, large language models (LLMs) have shown potential in many domains. However, directly applying LLMs to these tasks faces two major challenges: (i) the lack of a large-scale chemical synthesis-related instruction dataset; (ii) existing fine-tuning strategies ignoring the close correlation between reaction and retrosynthesis prediction. To address these challenges, we propose ChemDual, a novel LLM framework for accurate chemical synthesis. Specifically, considering the high cost of data acquisition for reaction and retrosynthesis, ChemDual regards the reaction-and-retrosynthesis of molecules as a related recombination-and-fragmentation process and constructs a large-scale instruction dataset of 4.4 million instances. Furthermore, ChemDual introduces an enhanced LLaMA, equipped with a multi-scale tokenizer and a dual-task learning strategy, to jointly optimize the recombination-and-fragmentation process as well as the tasks of reaction and retrosynthesis prediction. Extensive experiments on the Mol-Instruction and USPTO-50K datasets demonstrate that ChemDual achieves state-of-the-art performance in both reaction and retrosynthesis prediction, outperforming existing conventional single-task approaches and general open-source LLMs. Through molecular docking analysis, ChemDual generates compounds with diverse and strong protein binding affinity, further highlighting its strong potential in drug design.
2503: Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling
Authors: Rui Qin, Qijie Wang, Ming Sun, Haowei Zhu, Chao Zhou, Bin Wang
Location: Guangzhou | Day: TBD
Show Abstract
Diffusion models have gained attention for their success in modeling complex distributions, achieving impressive perceptual quality in super-resolution (SR) tasks. However, existing diffusion-based SR methods often suffer from high computational costs, requiring numerous iterative steps for training and inference. Existing acceleration techniques, such as distillation and solver optimization, are generally task-agnostic and do not fully leverage the specific characteristics of low-level tasks like SR. In this study, we analyze the frequency- and spatial-domain properties of diffusion-based SR methods, revealing key insights into the temporal and spatial dependencies of high-frequency signal recovery. Specifically, high-frequency details benefit from concentrated optimization during early and late diffusion iterations, while spatially textured regions demand adaptive denoising strategies. Building on these observations, we propose the Time-Spatial-aware Sampling strategy (TSS) for the acceleration of diffusion SR without any extra training cost. TSS combines Time Dynamic Sampling (TDS), which allocates more iterations to refining textures, and Spatial Dynamic Sampling (SDS), which dynamically adjusts strategies based on image content. Extensive evaluations across multiple benchmarks demonstrate that TSS achieves state-of-the-art (SOTA) performance with significantly fewer iterations, improving MUSIQ scores by 0.2~3.0 and outperforming current acceleration methods with only half the number of steps.
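As a rough illustration of time-dynamic sampling (not the paper's TDS; the U-shaped density below is our assumption, loosely mirroring the observation that high-frequency details benefit from early and late iterations):

```python
import numpy as np

def u_shaped_timesteps(total_steps: int, budget: int, sharpness: float = 2.0,
                       seed: int = 0) -> np.ndarray:
    """Sketch: pick `budget` of `total_steps` diffusion timesteps with a
    U-shaped density, spending more of the step budget at the start and
    end of the trajectory and fewer steps in the middle."""
    t = np.linspace(0.0, 1.0, total_steps)
    weight = np.abs(t - 0.5) ** sharpness + 1e-6  # U-shaped importance
    prob = weight / weight.sum()
    rng = np.random.default_rng(seed)
    chosen = rng.choice(total_steps, size=budget, replace=False, p=prob)
    return np.sort(chosen)
```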
2512: Multimodal Cancer Survival Analysis via Hypergraph Learning with Cross-Modality Rebalance
Authors: Mingcheng Qu, Guang Yang, Donglin Di, Tonghua Su, Yue Gao, Yang Song, Lei Fan
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal pathology-genomic analysis has become increasingly prominent in cancer survival prediction. However, existing studies mainly utilize multi-instance learning to aggregate patch-level features, neglecting the loss of contextual and hierarchical details within pathology images. Furthermore, the disparity in data granularity and dimensionality between pathology and genomics leads to a significant modality imbalance. The high spatial resolution inherent in pathology data gives it a dominant role, overshadowing genomics in multimodal integration. In this paper, we propose a multimodal survival prediction framework that incorporates hypergraph learning to effectively capture both contextual and hierarchical details from pathology images. Moreover, it employs a modality rebalance mechanism and an interactive alignment fusion strategy to dynamically reweight the contributions of the two modalities, thereby mitigating the pathology-genomics imbalance. Quantitative and qualitative experiments are conducted on five TCGA datasets, demonstrating that our model outperforms advanced methods by over 3.4% in C-Index performance. Code: https://github.com/MCPathology/MRePath.
2522: SyncAnimation: A Real-Time End-to-End Framework for Audio-Driven Human Pose and Talking Head Animation
Authors: Yujian Liu, Shidang Xu, Jing Guo, Dingbin Wang, Zairan Wang, Xianfeng Tan, Xiaoli Liu
Location: Guangzhou | Day: TBD
Show Abstract
Generating a talking avatar driven by audio remains a significant challenge. Existing methods typically require high computational costs and often lack sufficient facial detail and realism, making them unsuitable for applications that demand high real-time performance and visual quality. Additionally, while some methods can synchronize lip movement, they still face issues with consistency between facial expressions and upper body movement, particularly during silent periods. In this paper, we introduce SyncAnimation, the first NeRF-based method that achieves audio-driven, stable, and real-time generation of a speaking avatar by combining generalized audio-to-pose matching and audio-to-expression synchronization. By integrating the AudioPose Syncer and AudioEmotion Syncer, SyncAnimation achieves high-precision pose and expression generation, progressively producing audio-synchronized upper body, head, and lip shapes. Furthermore, the High-Synchronization Human Renderer ensures seamless integration of the head and upper body, and achieves audio-synchronized lip motion. The project page can be found at https://syncanimation.github.io.
2523: Denoising Diffusion Models are Good General Gaze Feature Learners
Authors: Guanzhong Zeng, Jingjing Wang, Pengwei Yin, Zefu Xu, Mingyang Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Since the collection of labeled gaze data is laborious and time-consuming, methods that can learn generalizable features by leveraging large-scale available unlabeled data are desirable. In recent years, we have witnessed the tremendous capabilities of diffusion models in generating images as well as their potential in feature representation learning. In this paper, we investigate whether they can acquire discriminative representations for gaze estimation via generative pre-training. To achieve this goal, we propose a self-supervised learning framework with diffusion models for gaze estimation, called GazeDiff. Specifically, as the pre-training task, we utilize a conditional diffusion model to generate a target image with the gaze direction specified by a reference image. To facilitate the diffusion model in learning gaze-related features as the condition, we propose a disentangling feature learning strategy, which first learns appearance, head pose, and eye direction features respectively, and then combines them as the conditional features. Extensive experiments demonstrate that denoising diffusion models are also good general gaze feature learners.
2551: An End-to-End Simple Clustering Hierarchical Pooling Operation for Graph Learning Based on Top-K Node Selection
Authors: Zhehan Zhao, Lu Bai, Ming Li, Lixin Cui, Hangyuan Du, Yue Wang, Edwin Hancock
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) are powerful tools for graph learning, but one of the important challenges is how to effectively extract representations for graph-level tasks. In this paper, we propose an end-to-end Simple Clustering Hierarchical Pooling (SCHPool) operation, which is based on Top-K node selection for learning expressive graph representations. Specifically, SCHPool considers each node and its local neighborhood as a cluster, and introduces a novel multi-view scoring function to evaluate node importance. Based on these scores, clusters centered around the Top-K nodes are retained. This design eliminates the need for complex clustering operations, significantly reducing computational overhead. Furthermore, during the coarsening process, SCHPool employs a lightweight yet comprehensive attention mechanism to adaptively aggregate both the node features within clusters and the edge connectivity strengths between clusters. This facilitates the construction of more informative coarsened graphs, enhancing model performance. Experimental results demonstrate the effectiveness of the proposed model.
2552: SIFAR: A Simple Faster Accelerated Variance-Reduced Gradient Method
Authors: Zhize Li
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we propose a simple faster accelerated gradient method called SIFAR for solving finite-sum optimization problems. Concretely, we consider both general convex and strongly convex settings: i) For general convex finite-sum problems, SIFAR improves the previous state-of-the-art result given by Varag. In particular, for large-scale problems or when the convergence error is not very small, SIFAR obtains the first optimal result O(n), matching the lower bound. ii) For strongly convex finite-sum problems, we also show that SIFAR can achieve the optimal convergence rate matching the lower bound. Besides, SIFAR enjoys a simpler loopless algorithmic structure, while previous algorithms use double-loop structures. Moreover, we provide a novel dynamic multi-stage convergence analysis, which is the key to improving previous results to the optimal rates. Our new theoretical rates and novel convergence analysis for the fundamental finite-sum problem can directly lead to key improvements for many other related problems, such as distributed/federated/decentralized optimization problems. Finally, numerical experiments show that SIFAR converges faster than the previous state-of-the-art Varag, validating our theoretical results and confirming the practical superiority of SIFAR.
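The "loopless" structure can be pictured with a plain (non-accelerated) loopless SVRG sketch, where the usual outer loop is replaced by a coin flip that refreshes the snapshot; SIFAR's accelerated update is more involved, so treat this only as an illustration of the structure:

```python
import numpy as np

def loopless_svrg(grad_i, x0, n, lr=0.1, p=None, iters=1000, seed=0):
    """Sketch of a loopless variance-reduced method. `grad_i(x, i)` returns
    the gradient of the i-th component function at x. With probability p
    per step, the snapshot point (and its full gradient) is refreshed,
    replacing the outer loop of classical SVRG."""
    rng = np.random.default_rng(seed)
    p = p if p is not None else 1.0 / n
    x, w = x0.copy(), x0.copy()
    full_grad = np.mean([grad_i(w, j) for j in range(n)], axis=0)
    for _ in range(iters):
        i = rng.integers(n)
        g = grad_i(x, i) - grad_i(w, i) + full_grad  # variance-reduced estimate
        x = x - lr * g
        if rng.random() < p:  # loopless snapshot refresh
            w = x.copy()
            full_grad = np.mean([grad_i(w, j) for j in range(n)], axis=0)
    return x
```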
2554: Label Distribution Learning with Biased Annotations Assisted by Multi-Label Learning
Authors: Zhiqiang Kou, Si Qin, Hailin Wang, Jing Wang, Mingkun Xie, Shuo Chen, Yuheng Jia, Tongliang Liu, Masashi Sugiyama, Xin Geng
Location: Guangzhou | Day: TBD
Show Abstract
Multi-label learning (MLL) has gained attention for its ability to represent real-world data. Label Distribution Learning (LDL), an extension of MLL to learning from label distributions, faces challenges in collecting accurate label distributions. To address the issue of biased annotations, existing works, based on the low-rank assumption, recover true distributions from biased observations by exploring label correlations. However, recent evidence shows that the label distribution tends to be full-rank, and naively applying low-rank approximation to biased observations leads to inaccurate recovery and performance degradation. In this paper, we address the problem of LDL with biased annotations from a novel perspective: we first degenerate the soft label distribution into a hard multi-hot label and then recover the true label information for each instance. This idea stems from the insight that assigning hard multi-hot labels is often easier than assigning a soft label distribution, and shows stronger immunity to noise disturbances, leading to smaller label bias. Moreover, assuming that the multi-label space for predicting label distributions is low-rank offers a more reasonable approach to capturing label correlations. Theoretical analysis and experiments confirm the effectiveness and robustness of our method on real-world datasets.
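A minimal sketch of the degeneration step (the mean-based threshold here is our assumption for illustration; the paper's rule may differ):

```python
import numpy as np

def degenerate_to_multihot(label_dist: np.ndarray) -> np.ndarray:
    """Collapse a soft label distribution (rows sum to 1) into a hard
    multi-hot vector by keeping labels whose description degree exceeds
    the per-instance mean; hard labels are more immune to annotation
    noise than the full distribution."""
    tau = label_dist.mean(axis=-1, keepdims=True)  # per-instance threshold
    return (label_dist >= tau).astype(np.float32)
```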
2595: Probabilistic Multimodal Learning with von Mises-Fisher Distributions
Authors: Peng Hu, Yang Qin, Yuanbiao Gou, Yunfan Li, Mouxing Yang, Xi Peng
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal learning is pivotal for the advancement of artificial intelligence, enabling machines to integrate complementary information from diverse data sources for holistic perception and understanding. Despite significant progress, existing methods struggle with challenges such as noisy inputs, noisy correspondence, and the inherent uncertainty of multimodal data, limiting their reliability and robustness. To address these issues, this paper presents a novel Probabilistic Multimodal Learning framework (PML) that models each data point as a von Mises-Fisher (vMF) distribution, effectively capturing intrinsic uncertainty and enabling robust fusion. Unlike traditional Gaussian-based models, PML learns directional representation with a concentration parameter to quantify reliability directly, enhancing stability and interpretability. To enhance discrimination, we propose a von Mises-Fisher Prototypical Contrastive Learning paradigm (vMF-PCL), which projects data onto a hypersphere by pulling within-class samples closer to their class prototype while pushing between-class prototypes apart, adaptively learning the reliability estimations. Building upon the estimated reliability, we develop a Reliable Multimodal Fusion mechanism (RMF) that dynamically adjusts the contribution and conflict of each modality, ensuring robustness against noisy data, noisy correspondence, and uncertainty. Extensive experiments on nine benchmarks demonstrate the superiority of PML, consistently outperforming 14 state-of-the-art methods. Code is available at https://github.com/XLearning-SCU/2025-IJCAI-PML.
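For reference, the standard vMF density on the unit hypersphere in R^d, whose concentration parameter κ plays the role of the reliability score described above, is

```latex
f_d(\mathbf{x};\,\boldsymbol{\mu},\,\kappa)
  = C_d(\kappa)\,\exp\!\big(\kappa\,\boldsymbol{\mu}^{\top}\mathbf{x}\big),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)},
```

where ‖μ‖ = 1 is the mean direction and I_ν is the modified Bessel function of the first kind. Larger κ concentrates mass around μ, so κ can be read directly as per-sample reliability. (This is the textbook form; PML's specific parameterization is in the paper.)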
2603: FAST: A Lightweight Mechanism Unleashing Arbitrary Client Participation in Federated Learning
Authors: Zhe Li, Seyedsina Nabavirazavi, Bicheng Ying, Sitharama Iyengar, Haibo Yang
Location: Guangzhou | Day: TBD
Show Abstract
Federated Learning (FL) provides a flexible distributed platform where numerous clients with high data and system heterogeneity can collaborate to learn a model. While previous research has shown that FL can handle diverse data, it often assumes idealized conditions. In practice, real-world factors make it hard to predict or design individual client participation. This complexity results in an unknown participation pattern – arbitrary client participation (ACP). Hence, the key open problem is to understand the impact of client participation and develop a lightweight mechanism to support ACP in FL. In this paper, we first empirically investigate the influence of client participation in FL, revealing that FL algorithms are adversely impacted by ACP. To alleviate the impact, we propose a lightweight solution, Federated Average with Snapshot (FAST), that supports almost arbitrary client participation in FL and can seamlessly integrate with other classic FL algorithms. Specifically, FAST enforces clients to take a snapshot once in a while and facilitates ACP for the majority of the training process. We prove that the convergence rates of FAST in non-convex and strongly-convex cases match those under ideal client participation. Furthermore, we empirically introduce an adaptive strategy to dynamically configure the snapshot frequency, tailored to accommodate diverse FL systems. Extensive experiments show that FAST significantly improves performance under ACP and high data heterogeneity.
2627: Parallel Belief Contraction via Order Aggregation
Authors: Jake Chandler, Richard Booth
Location: Guangzhou | Day: TBD
Show Abstract
The standard “serial” (aka “singleton”) model of belief contraction models the manner in which an agent’s corpus of beliefs responds to the removal of a single item of information. One salient extension of this model introduces the idea of “parallel” (aka “package” or “multiple”) change, in which an entire set of items of information are simultaneously removed. Existing research on the latter has largely focussed on single-step parallel contraction: understanding the behaviour of beliefs after a single parallel contraction. It has also focussed on generalisations to the parallel case of serial contraction operations whose characteristic properties are extremely weak. Here we consider how to extend serial contraction operations that obey stronger properties. Potentially more importantly, we also consider the iterated case: the behaviour of beliefs after a sequence of parallel contractions. We propose a general method for extending serial iterated belief change operators to handle parallel change based on an n-ary generalisation of Booth & Chandler’s TeamQueue binary order aggregators.
2629: Parallel Belief Revision via Order Aggregation
Authors: Jake Chandler, Richard Booth
Location: Guangzhou | Day: TBD
Show Abstract
Despite efforts to better understand the constraints that operate on single-step parallel (aka “package”, “multiple”) revision, very little work has been carried out on how to extend the model to the iterated case. A recent paper by Delgrande & Jin outlines a range of relevant rationality postulates. While many of these are plausible, they lack an underlying unifying explanation. We draw on recent work on iterated parallel contraction to offer a general method for extending serial iterated belief revision operators to handle parallel change. This method, based on a family of order aggregators known as TeamQueue aggregators, provides a principled way to recover the independently plausible properties that can be found in the literature, without yielding the more dubious ones.
2639: Consensus-Guided Incomplete Multi-view Clustering via Cross-view Affinities Learning
Authors: Qian Liu, Huibing Wang, Jinjia Peng, Yawei Chen, Mingze Yao, Xianping Fu, Yang Wang
Location: Guangzhou | Day: TBD
Show Abstract
Incomplete multi-view clustering (IMC) has garnered substantial attention due to its capacity to handle unlabeled data. Existing methods predominantly explore pairwise consistency between every two views. However, such consistency is highly susceptible to missing samples and outliers within a certain view and thus deviates from the true clustering distribution. Moreover, dual-view interaction neglects the collaboration effects of multiple views, making it challenging to capture the holistic characteristics across views. In response to these issues, we propose a novel Consensus-Guided Incomplete Multi-view Clustering via Cross-view Affinities Learning (CAL). Specifically, CAL reconstructs views with available instances to mine sample-wise affinities and harness comprehensive content information within views. Subsequently, to extract clean structural information, CAL imposes a structured sparse constraint on the representation tensor to eliminate biased errors. Furthermore, by integrating the consensus representation into a representation tensor, CAL can employ high-order interaction of multiple views to depict the semantic correlation between views while acquiring a unified structural graph across multiple views. Extensive experiments on seven benchmark datasets demonstrate that CAL outperforms some state-of-the-art methods in clustering performance. The code is available at https://github.com/whbdmu/CAL.
2642: A Multi-view Fusion Approach for Enhancing Speech Signals via Short-time Fractional Fourier Transform
Authors: Zikun Jin, Yuhua Qian, Xinyan Liang, Haijun Geng
Location: Guangzhou | Day: TBD
Show Abstract
Deep learning-based speech enhancement (SE) methods focus on reconstructing speech from the time or frequency domain. However, these domains cannot provide enough information to accurately capture the dynamics of non-stationary signals. To enrich the information, this work proposes a multi-view fusion SE method (MFSE). Specifically, MFSE extends the representation space of speech to the dynamic domain (also called the fractional domain) between the time and frequency domains by using the short-time fractional Fourier transform (STFrFT). Subsequently, we construct the inputs from the primary short-time Fourier transform (STFT) spectrum view and the auxiliary STFrFT spectrum view, and adaptively identify the optimal STFrFT spectrum from the infinitely continuous fractional domain by leveraging average spectral centroids. The framework extracts potential features through multiple designed convolutional modules and captures the correlation between different speech frequencies through multi-granularity attention.
Experimental results show that the proposed method significantly improves performance in several metrics compared to existing single-channel SE methods based on time and frequency domains. Furthermore, the results of its generalizability evaluation show that the multi-view method outperforms the single-view method under a wide range of SNR conditions.
2648: Enhancing Counterfactual Estimation: A Focus on Temporal Treatments
Authors: Xin Wang, Shengfei Lyu, Kangyang Luo, Lishan Yang, Huanhuan Chen, Chunyan Miao
Location: Guangzhou | Day: TBD
Show Abstract
In the medical field, treatment sequences significantly influence future outcomes through complex temporal interactions. Therefore, highlighting the role of temporal treatments within the model is crucial for accurate counterfactual estimation, which is often overlooked in current methods. To address this, we employ Koopman theory, known for its capability to model complex dynamic systems, and introduce a novel model named the Counterfactual Temporal Dynamics Network via Neural Koopman Operators (CTD-NKO). This model utilizes Koopman operators to encapsulate sequential treatment data, aiming to capture the causal dynamics within the system induced by temporal interactions between treatments. Moreover, CTD-NKO implements a weighting strategy that aligns joint and marginal distributions of the system state and the current treatment to mitigate time-varying confounding bias. This deviates from the balanced representation strategy employed by existing methods, as we demonstrate that such a strategy may suffer from the potential information loss of historical treatments. These designs allow CTD-NKO to exploit treatment information more thoroughly and effectively, resulting in superior performance on both synthetic and real-world datasets.
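To fix ideas on the Koopman component, here is a minimal sketch of a treatment-conditioned neural Koopman step (module names and sizes are our assumptions; CTD-NKO's actual architecture and weighting strategy are described in the paper):

```python
import torch
import torch.nn as nn

class KoopmanStep(nn.Module):
    """Sketch: encode the system state into a latent space, evolve it with a
    linear operator K generated from the current treatment, then decode the
    predicted next state. Linearity in the latent space is the Koopman idea."""
    def __init__(self, state_dim: int, treat_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.decoder = nn.Linear(latent_dim, state_dim)
        self.to_K = nn.Linear(treat_dim, latent_dim * latent_dim)
        self.latent_dim = latent_dim

    def forward(self, state, treatment):
        z = self.encoder(state)                               # lift to latent space
        K = self.to_K(treatment).view(-1, self.latent_dim, self.latent_dim)
        z_next = torch.bmm(K, z.unsqueeze(-1)).squeeze(-1)    # linear evolution
        return self.decoder(z_next)                           # predicted next state
```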
2661: Prototype-based Optimal Transport for Out-of-Distribution Detection
Authors: Ao Ke, Wenlong Chen, Chuanwen Feng, Yukun Cao, Xike Xie, S. Kevin Zhou, Lei Feng
Location: Guangzhou | Day: TBD
Show Abstract
Detecting Out-of-Distribution (OOD) inputs is crucial for improving the reliability of deep neural networks in the real-world deployment. In this paper, inspired by the inherent distribution shift between in-distribution (ID) and OOD data, we propose a novel method that leverages optimal transport to measure the distribution discrepancy between test inputs and ID prototypes. The resulting transport costs are used to quantify the individual contribution of each test input to the overall discrepancy, serving as a desirable measure for OOD detection. To address the issue that solely relying on the transport costs to ID prototypes is inadequate for identifying OOD inputs closer to ID data, we generate virtual outliers to approximate the OOD region via linear extrapolation. By combining the transport costs to ID prototypes with the costs to virtual outliers, the detection of OOD data near ID data is emphasized, thereby enhancing the distinction between ID and OOD inputs. Extensive evaluations demonstrate the superiority of our method over state-of-the-art methods.
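A compact numpy sketch of the two ingredients (entropic transport costs as OOD scores, and virtual outliers by linear extrapolation); the regularization, uniform marginals, and extrapolation factor are our illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def transport_costs(X, P, reg=0.1, iters=200):
    """Per-sample entropic-OT transport cost between test features X (m x d)
    and ID prototypes P (c x d), both assumed L2-normalized; a larger cost
    suggests the sample sits farther from the ID distribution."""
    m, c = X.shape[0], P.shape[0]
    C = 1.0 - X @ P.T                      # cosine-distance cost matrix
    K = np.exp(-C / reg)
    a, b = np.ones(m) / m, np.ones(c) / c  # uniform marginals
    v = np.ones(c)
    for _ in range(iters):                 # Sinkhorn fixed-point iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    return (plan * C).sum(axis=1)          # each sample's share of the discrepancy

def virtual_outliers(P, alpha=1.5):
    """Approximate the near-ID OOD region by extrapolating prototypes away
    from their global mean, then re-normalizing onto the unit sphere."""
    mu = P.mean(axis=0, keepdims=True)
    out = mu + alpha * (P - mu)
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```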
2663: Variational Offline Multi-agent Skill Discovery
Authors: Jiayu Chen, Tian Lan, Vaneet Aggarwal
Location: Guangzhou | Day: TBD
Show Abstract
Skills are effective temporal abstractions established for sequential decision making, which enable efficient hierarchical learning for long-horizon tasks and facilitate multi-task learning through their transferability. Despite extensive research, research gaps remain in multi-agent scenarios, particularly for automatically extracting subgroup coordination patterns in a multi-agent task. To this end, we propose two novel auto-encoder schemes, VO-MASD-3D and VO-MASD-Hier, to simultaneously capture subgroup- and temporal-level abstractions and form multi-agent skills, which is the first solution to the aforementioned challenge. An essential algorithmic component of these schemes is a dynamic grouping function that can automatically detect latent subgroups based on agent interactions in a task. Further, our method can be applied to offline multi-task data, and the discovered subgroup skills can be transferred across relevant tasks without retraining. Empirical evaluations on StarCraft tasks indicate that our approach significantly outperforms existing hierarchical multi-agent reinforcement learning (MARL) methods. Moreover, skills discovered using our method can effectively reduce the learning difficulty in MARL scenarios with delayed and sparse reward signals. The codebase is available at: https://github.com/LucasCJYSDL/VOMASD.
2666: POMP: Pathology-omics Multimodal Pre-training Framework for Cancer Survival Prediction
Authors: Suixue Wang, Shilin Zhang, Huiyuan Lai, Weiliang Huo, Qingchen Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Cancer survival prediction is an important direction in precision medicine, aiming to help clinicians tailor treatment regimens for patients. With the rapid development of high-throughput sequencing and computational pathology technologies, survival prediction has shifted from clinical features to joint modeling of multi-omics data and pathology images. However, existing multimodal learning methods struggle to effectively learn pathology-omics interactions due to the lack of proper alignment of multimodal data before fusion. In this paper, we propose POMP, a pathology-omics multimodal pre-training framework jointly learned with three training tasks for integrating pathological images and omics data for cancer survival prediction. To better perform cross-modal learning, we introduce a pathology-omics contrastive learning method to align the pathology and omics information. POMP leverages the principle of pre-trained models and explores the benefit of aligning multimodal information from the same patient, achieving state-of-the-art results on six cancer datasets from the Cancer Genome Atlas (TCGA). We also show that our contrastive learning method allows us to exploit the cosine similarity of pathological images and omics data as the survival risk score, which can further boost prediction performance compared with other commonly used methods. The code is available at https://github.com/SuixueWang/POMP.
2667: How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback
Authors: Manzong Huang, Chenyang Bu, Yi He, Xindong Wu
Location: Guangzhou | Day: TBD
Show Abstract
Knowledge Graph (KG)-augmented Large Language Models (LLMs) have recently propelled significant advances in complex reasoning tasks, thanks to their broad domain knowledge and contextual awareness. Unfortunately, current methods often assume KGs to be complete, which is impractical given the inherent limitations of KG construction and the potential loss of contextual cues when converting unstructured text into entity-relation triples.
In response, this paper proposes the Triple Context Restoration and Query-driven Feedback (TCR-QF) framework, which reconstructs the textual context underlying each triple to mitigate information loss, while dynamically refining the KG structure by iteratively incorporating query-relevant missing knowledge.
Experiments on five benchmark question-answering datasets substantiate the effectiveness of TCR-QF in KG and LLM integration, where it achieves a 29.1% improvement in Exact Match and a 15.5% improvement in F1 over its state-of-the-art GraphRAG competitors. The code is publicly available at https://github.com/HFUT-DMiC-Lab/TCR-QF.git.
2676: Inconsistency-Based Federated Active Learning
Authors: Chen-Chen Zong, Tong Jin, Sheng-Jun Huang
Location: Guangzhou | Day: TBD
Show Abstract
Federated learning (FL) enables distributed collaborative learning across local clients while preserving data privacy. However, its practical application in weakly supervised learning (WSL), where only a small subset of data is labeled, remains underexplored. Active learning (AL) is a promising solution for label-limited scenarios, but its adaptation to federated settings presents unique challenges, such as data heterogeneity and noise. In this paper, we propose Inconsistency-based Federated Active Learning (IFAL), a novel approach to address these challenges. First, we introduce a data-driven probability formulation that aligns the biases between local and global models in heterogeneous FL settings. Next, to mitigate noise, we propose an inter-model inconsistency criterion that filters out noisy examples and focuses on those with beneficial prediction discrepancies. Additionally, we introduce an intra-model inconsistency criterion to query examples that help refine the model’s decision boundaries. By combining these strategies with clustering, IFAL effectively selects a diverse and informative query set. Extensive experiments on benchmark datasets demonstrate that IFAL outperforms state-of-the-art methods.
2687: CSF-GAN: Cross-modal Semantic Fusion-based Generative Adversarial Network for Text-guided Image Inpainting
Authors: Shilin Zhang, Suixue Wang, Qingchen Zhang, Liang Zhao, Weiliang Huo, Sijia Hou, Chunjiang Fu
Location: Guangzhou | Day: TBD
Show Abstract
Most visual-guided image inpainting methods based on generative adversarial networks (GANs) struggle when the missing region has weak correlations with the surrounding visual context. Recently, diffusion-based methods guided by textual context have been proposed to address this limitation by leveraging additional semantic information to restore corrupted objects. However, these models typically involve more parameters and exhibit slower generation speeds compared to GAN-based approaches. To address this problem, we propose a novel text-guided image inpainting model, the cross-modal semantic fusion generative adversarial network (CSF-GAN). CSF-GAN is designed as a one-stage GAN with the following key contributions. First, a novel semantic fusion module (SFM) is introduced to integrate sentence- and word-level textual context into the inpainting process, enabling more effective guidance from multi-granularity semantic information. Second, a newly designed word-level local discriminator provides detailed feedback to the generator, enhancing the accuracy of generated content in alignment with word-level semantics. Third, two loss functions, the inpainting loss and edge loss, are employed to enhance both structural coherence and textural realism in the generated results. Extensive experiments on two benchmark datasets demonstrate that CSF-GAN outperforms state-of-the-art methods.
2693: K-Buffers: A Plug-in Method for Enhancing Neural Fields with Multiple Buffers
Authors: Haofan Ren, Zunjie Zhu, Xiang Chen, Ming Lu, Rongfeng Lu, Chenggang Yan
Location: Guangzhou | Day: TBD
Show Abstract
Neural fields are now the central focus of research in 3D vision and computer graphics. Existing methods mainly focus on various scene representations, such as neural points and 3D Gaussians. However, few works have studied the rendering process to enhance neural fields. In this work, we propose a plug-in method named K-Buffers that leverages multiple buffers to improve the rendering performance. Our method first renders K buffers from scene representations and constructs K pixel-wise feature maps. Then, we introduce a K-Feature Fusion Network (KFN) to merge the K pixel-wise feature maps. Finally, we adopt a feature decoder to generate the rendering image. We also introduce an acceleration strategy to improve rendering speed and quality. We apply our method to well-known radiance field baselines, including neural point fields and 3D Gaussian Splatting (3DGS). Extensive experiments demonstrate that our method effectively enhances the rendering performance of neural point fields and 3DGS.
2695: From Sparse to Complete: Semantic Understanding Based on Stroke Evolution in On-the-fly Sketch-based Image Retrieval
Authors: Yingge Liu, Dawei Dai, Xiangling Hou, Shilin Zhao, Guoyin Wang
Location: Guangzhou | Day: TBD
Show Abstract
In contrast with human sketching, which pre-conceptualizes outlines and features, conventional sketch retrieval models rely primarily on pixel-level processing and feature extraction, limiting their ability to capture early sketch intent. Consequently, these models are susceptible to subjective stroke noise, reducing retrieval accuracy. To address this issue, we propose a novel on-the-fly noise-stroke retrieval framework designed to align with human sketch-drawing cognition. The proposed framework introduces two core innovations. (i) A stroke consistency detection module that effectively discriminates and suppresses noise strokes by quantifying the structural similarity between the current stroke and the target image, as well as its alignment with key skeletal components. (ii) An adaptive gated mixture-of-experts module that dynamically selects and integrates features from multiple expert networks during the early, sparse stages of sketching, thereby capturing relevant information with greater precision. Experimental results across diverse sketch datasets demonstrate that the proposed method effectively identifies and suppresses early noise strokes, significantly enhances sketch retrieval performance, and exhibits strong robustness across varying sketch styles.
2696: Strategyproofness and Monotone Allocation of Auction in Social Networks
Authors: Yuhang Guo, Dong Hao, Bin Li, Mingyu Xiao, Bakh Khoussainov
Location: Guangzhou | Day: TBD
Show Abstract
Strategyproofness in network auctions requires that bidders not only report their valuations truthfully, but also do their best to invite neighbours from the social network. In contrast to canonical auctions, where the value-monotone allocation in Myerson’s Lemma is a cornerstone, a general principle of allocation rules for strategyproof network auctions is still missing. We show that, due to the absence of such a principle, even extensions to multi-unit network auctions with single-unit demand present unexpected difficulties, and all pioneering studies fail to be strategyproof.
For the first time in this field, we identify two categories of monotone allocation rules on networks: Invitation-Depressed Monotonicity (ID-MON) and Invitation-Promoted Monotonicity (IP-MON). They encompass all existing allocation rules of network auctions as specific instances. For any given ID-MON or IP-MON allocation rule, we characterize the existence and sufficient conditions for the strategyproof payment rules, and show that among all such payment rules, the revenue-maximizing one exists and is computationally feasible.
With these results, the obstacle of combinatorial network auction with single-minded bidders is now resolved.
2698: Higher-order Logical Knowledge Representation Learning
Authors: Suixue Wang, Weiliang Huo, Shilin Zhang, Qingchen Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Real-world knowledge graphs abound with higher-order logical relations that simple triples, limited to pairwise connections, fail to represent. Thus, capturing higher-order logical relations involving multiple entities has garnered significant attention. However, existing methods ignore the structural information in higher-order relations. To this end, we propose a higher-order logical knowledge representation learning method, named LORE, which leverages network motifs, the patterns/subgraphs that naturally capture the structural information in graphs, to extract higher-order features and ultimately, learn effective representations of knowledge graphs. Compared to existing approaches, LORE aggregates the attribute features of entities with the extracted higher-order logical relations to form enhanced representations of knowledge graphs. In particular, three aggregators (i.e., Hadamard, Connection, and Summation) are proposed and employed. Extensive experiments have been conducted on six real-world datasets for two downstream tasks (i.e., entity classification and link prediction). The results show that LORE outperforms baselines significantly and consistently.
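The three aggregators are plausibly simple feature-combination operators; a sketch under that assumption (combining an entity's attribute features with its motif-derived higher-order features; the exact definitions are in the paper):

```python
import torch

def hadamard(entity_feat: torch.Tensor, motif_feat: torch.Tensor) -> torch.Tensor:
    """Element-wise product of attribute and higher-order features."""
    return entity_feat * motif_feat

def connection(entity_feat: torch.Tensor, motif_feat: torch.Tensor) -> torch.Tensor:
    """Concatenation along the feature dimension (doubles the width)."""
    return torch.cat([entity_feat, motif_feat], dim=-1)

def summation(entity_feat: torch.Tensor, motif_feat: torch.Tensor) -> torch.Tensor:
    """Element-wise sum of the two feature views."""
    return entity_feat + motif_feat
```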
2724: TrajCogn: Leveraging LLMs for Cognizing Movement Patterns and Travel Purposes from Trajectories
Authors: Zeyu Zhou, Yan Lin, Haomin Wen, Shengnan Guo, Jilin Hu, Youfang Lin, Huaiyu Wan
Location: Guangzhou | Day: TBD
Show Abstract
Spatio-temporal trajectories are crucial for data mining tasks, requiring versatile learning methods that can accurately extract movement patterns and travel purposes. While large language models (LLMs) have shown remarkable versatility through training on extensive datasets, and trajectories share similarities with natural language, standard LLMs cannot directly handle spatio-temporal features or extract trajectory-specific information.
We propose TrajCogn, a model that effectively adapts LLMs for trajectory learning. TrajCogn incorporates a novel trajectory semantic embedder to process spatio-temporal features and extract movement patterns and travel purposes, along with a trajectory prompt that integrates this information into LLMs for various downstream tasks. Experiments on three real-world datasets and four representative tasks demonstrate TrajCogn’s effectiveness.
2735: M4Bench: A Benchmark of Multi-domain Multi-granularity Multi-image Understanding for Multi-modal Large Language Models
Authors: Xiaojun Ye, Guanbao Liang, Chun Wang, Liangcheng Li, Pengfei Ke, Rui Wang, Bingxin Jia, Gang Huang, Qiao Sun, Sheng Zhou
Location: Guangzhou | Day: TBD
Show Abstract
The increasing demand for analyzing complex associated scenes makes research into multi-image understanding abilities necessary. Compared with understanding individual images, both the alignments and the differences between images are essential for understanding the intricate relationships involved in multi-image inference tasks. However, existing benchmarks struggle to address both of these aspects simultaneously, creating obstacles to modeling relationships across various granularities and domains of images. In this paper, we introduce M4Bench to enhance the capability of aligning and distinguishing multiple images with multi-domain, multi-granularity comparison. We carefully design five comparison tasks related to coarse and fine granularities in single and multiple domains of images and evaluate them on 13 state-of-the-art multi-modal large language models of various sizes. Besides, we analyze the evaluation results and provide several observations and viewpoints for multi-image understanding research. The data and evaluation code are available at https://github.com/eaglelab-zju/M4Bench.
2740: Neuron Similarity-Based Neural Network Verification via Abstraction and Refinement
Authors: Yuehao Liu, Yansong Dong, Liang Zhao, Wensheng Wang, Cong Tian
Location: Guangzhou | Day: TBD
Show Abstract
Deep neural networks (DNNs) have become integral to numerous safety-critical applications, necessitating rigorous verification of their trustworthiness. However, the problem of verifying DNNs has high computational complexity, and existing techniques have limited efficiency, insufficient to deal with large-scale network models. To address this challenge, we propose a novel abstraction-refinement verification method that reduces network size while maintaining verification accuracy. Specifically, the method quantifies the similarity between neurons based on various factors such as their interval outputs, and then merges similar neurons to generate a smaller abstract network. In addition, a counterexample-guided refinement process is developed to mitigate the impact of potential spurious counterexamples, so that verification results from the abstract network remain applicable to the original network. We have implemented this method as a tool named ARVerifier and integrated it with three state-of-the-art verification tools for evaluation on the ACAS Xu and MNIST benchmarks. Experimental results demonstrate that ARVerifier significantly reduces network size and reduces verification time by 11.61%, 18.70%, and 12.20% compared to α,β-CROWN, Verinet, and Marabou, respectively. Moreover, ARVerifier exhibits efficiency improvements of 26.64% and 46.87% compared to the existing abstraction-refinement methods NARv and CEGAR-NN, respectively.
2743: PALA: Class-imbalanced Graph Domain Adaptation via Prototype-anchored Learning and Alignment
Authors: Xin Ma, Yifan Wang, Siyu Yi, Wei Ju, Bei Wu, Ziyue Qiao, Chenwei Tang, Jiancheng Lv
Location: Guangzhou | Day: TBD
Show Abstract
Graph domain adaptation is a key subfield of graph transfer learning that aims to bridge domain gaps by transferring knowledge from a label-rich source graph to an unlabeled target graph. However, most existing methods assume balanced labels in the source graph, which often fails in practice and leads to biased knowledge transfer. To address this, in this paper, we propose a prototype-anchored learning and alignment framework for class-imbalanced graph domain adaptation. Specifically, we incorporate pointwise node mutual information into the graph encoder to capture high-order topological proximity and learn generalized node representations. Leveraging this, we then introduce categorical prototypes with adversarial proto-instances for prototype-anchored learning and recalibration to represent the source graph under an imbalanced class distribution. Finally, we introduce a weighted prototype contrastive adaptation strategy that aligns target pseudo-labels with source prototypes to handle class imbalance during adaptation. Extensive experiments show that our PALA outperforms the state-of-the-art methods. Our code is available at https://github.com/maxin88scu/PALA.
2774: Beyond Low-rankness: Guaranteed Matrix Recovery via Modified Nuclear Norm
Authors: Jiangjun Peng, Yisi Luo, Xiangyong Cao, Shuang Xu, Deyu Meng
Location: Guangzhou | Day: TBD
Show Abstract
The nuclear norm (NN) has been widely explored in matrix recovery problems, such as Robust PCA and matrix completion (MC), leveraging the inherent global low-rank structure of the data. In this study, we introduce a new modified nuclear norm (MNN) framework, where the MNN family norms are defined by adopting suitable transformations and performing the NN on the transformed matrix. The MNN framework offers two main advantages: (1) it jointly captures both local information and global low-rankness without requiring trade-off parameter tuning; (2) under mild assumptions on the transformation, we provide theoretical recovery guarantees for both Robust PCA and MC tasks—an achievement not shared by existing methods that combine local and global information. Thanks to its general and flexible design, MNN can accommodate various proven transformations, enabling a unified and effective approach to structured low-rank recovery. Extensive experiments demonstrate the effectiveness of our method. Code and supplementary material are available at https://github.com/andrew-pengjj/modified_nuclear_norm.
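Reading off the abstract's definition (a transformation followed by the nuclear norm), the MNN family and its Robust PCA instance can be written, for a suitable transformation \mathcal{T}, as

```latex
\|\mathbf{X}\|_{\mathrm{MNN}} = \big\|\mathcal{T}(\mathbf{X})\big\|_{*},
\qquad
\min_{\mathbf{L},\,\mathbf{S}}\ \|\mathbf{L}\|_{\mathrm{MNN}} + \lambda\,\|\mathbf{S}\|_{1}
\quad \text{s.t.}\quad \mathbf{M} = \mathbf{L} + \mathbf{S},
```

where ‖·‖_* is the nuclear norm, M the observed matrix, L the recovered low-rank part, and S the sparse corruption; the precise assumptions on \mathcal{T} that yield the recovery guarantees are given in the paper.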
2778: Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization
Authors: Xinhao Yao, Hongjin Qian, Xiaolin Hu, Gengze Xu, Wei Liu, Jian Luan, Bin Wang, Yong Liu
Location: Guangzhou | Day: TBD
Show Abstract
Large Language Models (LLMs), built on Transformer architectures, exhibit remarkable generalization across a wide range of tasks. However, fine-tuning these models for specific tasks remains resource-intensive due to their extensive parameterization. In this paper, we explore two notable phenomena related to the attention mechanism during the fine-tuning of LLMs (where Wq, Wk, and Wv denote the weights of the query, key, and value layers, respectively). The first phenomenon, termed “Unequal Importance of Attention Matrices”, highlights the impact of fine-tuning different weight matrices. It shows that optimizing the Wv matrix yields significantly better performance than optimizing the Wk matrix. Fine-tuning only the Wq and Wv matrices is computationally efficient while delivering results comparable to, or even better than, fine-tuning all three matrices (Wq, Wk, and Wv). The second phenomenon, “Attention Matrices with Customized Learning Rate Lead to Better Convergence”, emphasizes the importance of assigning distinct learning rates to these matrices. Specifically, a higher learning rate for the Wv matrix compared to Wq and Wk accelerates convergence and improves performance. Building on these insights, we propose a new strategy that improves fine-tuning efficiency in terms of both storage and time. Experimental results on benchmark datasets validate the effectiveness of this approach, supporting our theoretical findings. Our analysis lays the theoretical groundwork for configuring and improving algorithms in LLM fine-tuning.
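In PyTorch, the customized-learning-rate idea maps naturally onto optimizer parameter groups; a sketch, assuming modules named `q_proj`/`k_proj`/`v_proj` (implementation-dependent) and an illustrative multiplier:

```python
from torch.optim import AdamW

def build_optimizer(model, base_lr: float = 1e-4, v_lr_mult: float = 5.0):
    """Sketch: give value projections a larger learning rate than query/key
    ones. The name matching and the 5x multiplier are assumptions, not the
    paper's prescribed values."""
    v_params, qk_params, other = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "v_proj" in name:
            v_params.append(p)
        elif "q_proj" in name or "k_proj" in name:
            qk_params.append(p)
        else:
            other.append(p)
    return AdamW([
        {"params": v_params, "lr": base_lr * v_lr_mult},  # faster Wv updates
        {"params": qk_params, "lr": base_lr},
        {"params": other, "lr": base_lr},
    ])
```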
2793: TP-Eval: Tap Multimodal LLMs’ Potential in Evaluation by Customizing Prompts
Authors: Yuxuan Xie, Tianhua Li, Wenqi Shao, Kaipeng Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Recently, multimodal large language models (MLLMs) have received much attention for their impressive capabilities. The evaluation of MLLMs is becoming critical for analyzing attributes of MLLMs and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity – minor prompt variations may lead to significant performance fluctuations. Thus, inappropriate prompts may obscure the models’ capabilities and lead to underestimates of their performance. Moreover, different models have different preferences for different prompts, so using the same prompt for all models will cause evaluation bias. This paper analyzes this deficiency in existing benchmarks and introduces a new evaluation framework named TP-Eval, which uses a prompt customization method to reduce evaluation biases and tap models’ potential. TP-Eval rewrites the original prompts into different customized prompts for different models. In particular, we propose several well-designed modules for prompt customization tailored to the scenario of MLLM evaluation. Extensive experiments demonstrate the effectiveness of our approach in uncovering models’ capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks.
2810: Multi-Task Curriculum Graph Contrastive Learning with Clustering Entropy Guidance
Authors: Chusheng Zeng, Bocheng Wang, Jinghui Yuan, Mulin Chen, Xuelong Li
Location: Guangzhou | Day: TBD
Show Abstract
Recent advances in unsupervised deep graph clustering have been significantly promoted by contrastive learning. Despite these strides, most graph contrastive learning models face two challenges: 1) graph augmentation is used to improve learning diversity, but commonly used random augmentation methods may destroy inherent semantics and introduce noise; 2) the fixed positive- and negative-sample selection strategy ignores the difficulty distribution of samples when dealing with complex real data, thereby impeding the model’s capability to capture fine-grained patterns and trapping it in solutions that are sub-optimal for clustering. To reduce these problems, we propose the Clustering-guided Curriculum Graph contrastive Learning (CurGL) framework. CurGL uses clustering entropy to guide the subsequent graph augmentation and contrastive learning. Specifically, according to the clustering entropy, intra-class edges and important features are emphasized in augmentation. Then, a multi-task curriculum learning scheme is proposed, which employs the clustering guidance to shift the focus from the discrimination task to the clustering task. In this way, the sample selection strategy of contrastive learning can be adjusted adaptively from the early to the late stage, which enhances the model’s flexibility for complex data structures. Experimental results demonstrate that CurGL achieves excellent performance compared to state-of-the-art competitors.
2815: Template-based Uncertainty Multimodal Fusion Network for RGBT Tracking
Authors: Zhaodong Ding, Chenglong Li, Shengqing Miao, Jin Tang
Location: Guangzhou | Day: TBD
Show Abstract
RGBT tracking aims to localize predefined targets in video sequences by effectively leveraging information from both the visible light (RGB) and thermal infrared (TIR) modalities. However, the quality of the two modalities changes dynamically in complex scenes, and effectively perceiving modal quality for multimodal fusion remains a significant challenge. To address this challenge, we propose to employ the reliability of the initial template to explore the uncertainty across different modalities, and design a novel template-based uncertainty computation framework for robust multimodal fusion in RGBT tracking.
In particular, we introduce an Uncertainty-aware Multimodal Fusion Module (UMFM), which estimates the uncertainty of each modality by leveraging the correlation between the template and the search region within the Subjective Logic framework, aiming to achieve robust multimodal fusion. In addition, existing methods focus on dynamic template updates while overlooking the potential role of a reliable initial template in the template updating process. To this end, we design a simple yet effective Contrastive Template Update Module (CTUM) to assess the reliability of a new template by comparing its quality with that of the initial template. Extensive experiments show that our method outperforms existing approaches on four RGBT tracking benchmarks.
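For readers unfamiliar with the Subjective Logic framework, the standard evidence-to-uncertainty mapping looks as follows; this is the textbook formulation, not necessarily UMFM's exact computation.

```python
# Illustrative sketch of Subjective Logic: evidence e_k over K classes gives
# belief b_k = e_k / S and uncertainty u = K / S, with S = sum(e_k) + K.
import torch

def subjective_logic(evidence):
    """evidence: non-negative tensor of shape (K,)."""
    K = evidence.numel()
    S = evidence.sum() + K
    belief = evidence / S
    uncertainty = K / S          # little evidence -> large uncertainty
    return belief, uncertainty

# A modality whose template matches the search region poorly yields little
# evidence, hence a high u, and could be down-weighted during fusion.
b, u = subjective_logic(torch.tensor([4.0, 1.0]))  # u = 2/7
```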
2834: A Simple yet Effective Hypergraph Clustering Network
Authors: Qianqian Wang, Bowen Zhao, Zhengming Ding, Xiangdong Zhang, Quanxue Gao
Location: Guangzhou | Day: TBD
Show Abstract
Hypergraph clustering has gained significant attention due to its capability of capturing high-order structural information. Among different approaches, contrastive learning-based methods leverage self-supervised learning and data augmentation, exhibiting impressive performance. However, most of them come with the following limitations: 1) augmentation strategies like feature dropout can potentially disrupt the intrinsic clustering structure of hypergraphs; 2) high computational demands hinder their real-world application. To address these issues, we propose a simple yet effective Hypergraph Clustering Network framework (HCN). Specifically, HCN replaces the hypergraph convolution operation with smoothing preprocessing, which avoids high computational complexity. Besides, to retain the intrinsic structure, it develops two key modules: the self-diagonal consistency module and the structure alignment module. They respectively align the similarity matrix with the identity matrix and the structural affinity matrix, which ensures intra-cluster compactness and inter-cluster separability. Extensive experiments on five benchmark datasets demonstrate HCN’s superiority over state-of-the-art methods.
2846: Wavelet Multi-scale Region-Enhanced Network for Medical Image Segmentation
Authors: Hang Lu, Liang Du, Peng Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Medical image segmentation is an important task in medical artificial intelligence. Traditional segmentation methods often suffer from information loss, especially on medical images that contain organs or tissues at many different scales. To address this problem, we propose a novel medical image segmentation method called Wavelet Multi-scale Region-Enhanced Network (WMREN), which has a UNet structure. In the encoder, we design a bi-branch feature extraction architecture, which simultaneously learns representations with the Haar wavelet transform and with residual blocks. The bi-branch architecture can effectively tackle the information loss problem when extracting features. In the decoder, we design an innovative Spatial Adaptive Fusion Module to enhance the regions of interest. Since object boundaries play an important role in segmentation, we also carefully design a Contrast Refinement Enhancement Module to highlight the boundaries of medical objects. Extensive experiments on several benchmark datasets show that our method outperforms state-of-the-art medical image segmentation methods, demonstrating its effectiveness and superiority. The source code is publicly available at https://github.com/C101812/WMREN/tree/master.
2852: Meta Label Correction with Generalization Regularizer
Authors: Tao Tong, Yujie Mo, Yucheng Xie, Songyue Cai, Xiaoshuang Shi, Xiaofeng Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Deep neural networks easily over-fit under the influence of noisy labels. However, previous label correction methods for dealing with noisy labels often require expensive computation to be effective and ignore the generalization ability of the model. To address these issues, in this paper, we propose a new meta-based self-correction method that accurately filters noisy labels and enhances the generalization ability of the label correction model. Specifically, we first investigate a new gradient score method to filter noisy labels at a lower computation cost, and then theoretically design a new generalization regularizer for both the meta-learner and the base learner, correcting noisy labels while achieving generalization. Experimental results on real datasets verify the effectiveness of our proposed method across different classification tasks.
2866: Inter3D: A Benchmark and Strong Baseline for Human-Interactive 3D Object Reconstruction
Authors: Gan Chen, Ying He, Mulin Yu, F.Richard Yu, Gang Xu, Fei Ma, Ming Li, Guang Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Recent advancements in implicit 3D reconstruction methods, e.g., neural radiance fields and Gaussian splatting, have primarily focused on novel view synthesis of static or dynamic objects with continuous motion states. However, these approaches struggle to efficiently model a human-interactive object with n movable parts, requiring 2^n separate models to represent all discrete states. To overcome this limitation, we propose Inter3D, a new benchmark and approach for novel state synthesis of human-interactive objects. We introduce a self-collected dataset featuring commonly encountered interactive objects and a new evaluation pipeline, where only individual part states are observed during training, while part combination states remain unseen. We also propose a strong baseline approach that leverages Space Discrepancy Tensors to efficiently model all states of an object. To relax the impractical constraints on camera trajectories across training states, we propose a Mutual State Regularization mechanism to enhance the spatial density consistency of movable parts. In addition, we explore two occupancy grid sampling strategies to improve training efficiency. We conduct extensive experiments on the proposed benchmark, showcasing the challenges of the task and the superiority of our approach. The code and data are publicly available at https://github.com/Inter3D-ui/Inter3D.
2870: Find and Perceive: Tell Visual Change with Fine-Grained Comparison
Authors: Feixiao Lv, Rui Wang, Lihua Jing, Lijun Liu
Location: Guangzhou | Day: TBD
Show Abstract
The goal of the image change captioning task is to capture the differences between two similar images and describe them in natural language. In this paper, we decompose this task into two sub-problems, i.e., fine-grained change feature learning and discrimination of changed regions. In contrast to existing methods, which focus only on change feature learning, we propose a novel change captioning learning paradigm, Find and Perceive (F&P). F&P consists of two main components: the Fine-Grained Semantic Change Perception (FGSCP) module, which improves the model’s perception of subtle changes, and the Weakly-Supervised Discriminator (WSD) of changed regions, which improves the model’s ability to localize important regions. Specifically, FGSCP adopts a two-step scheme, first introducing fine-grained categorization and then enhancing the interaction between the two paired images. The WSD uses the contribution of each image region to the final generated captions, accurately indicating which regions matter for change captions without any extra annotations. Finally, we conduct extensive experiments on four change captioning datasets, and the results show that F&P outperforms existing change captioning methods and achieves new state-of-the-art performance.
2886: MCF-Spouse: A Multi-Label Causal Feature Selection Method with Optimal Spouses Discovery
Authors: Lin Ma, Liang Hu, Qiang Huang, Pingting Hao, Juncheng Hu
Location: Guangzhou | Day: TBD
Show Abstract
Multi-label causal feature selection has garnered considerable attention for its ability to identify the most informative features while accounting for the causal dependencies between labels and features. However, previous work often overlooks the unique contributions of labels to the target variables in multi-label settings, focusing instead on prioritizing feature variables. Moreover, existing methods typically rely on traditional Markov Blanket (MB) discovery to construct an initial MB, which often fails to explore the most valuable forms of spouse variables for feature selection in multi-label scenarios, leading to significant computational overhead due to the redundant Conditional Independence (CI) tests required for spouse search. To address these challenges, we propose the Multi-label Causal Feature Selection Method with Optimal Spouses Discovery, MCF-Spouse, which leverages mutual information to quantify the contributions of both labels and features, ensuring the retention of the most informative variables in multi-label settings. Moreover, we systematically analyze all potential forms of spouse variables to identify the optimal spouse case, significantly reducing the spouse search space and alleviating the time overhead associated with CI tests. Experiments conducted on diverse real-world datasets demonstrate that MCF-Spouse consistently outperforms state-of-the-art methods across multiple metrics, offering a scalable and interpretable solution for multi-label causal feature selection.
2893: Disentangled and Personalized Representation Learning for Next Point-of-Interest Recommendation
Authors: Xuan Rao, Shuo Shang, Lisi Chen, Renhe Jiang, Peng Han
Location: Guangzhou | Day: TBD
Show Abstract
Next Point-of-Interest (POI) recommendation predicts a user’s next move and facilitates location-based services such as navigation and travel planning. SOTA methods fuse each POI and its contexts (e.g., time, category, and region) into a single representation to model sequential user movement. This hinders the effective utilization of context information and neglects diverse user preferences. To tackle these limitations, we propose Disentangled and Personalized Representation Learning (DPRL) as a novel method for next POI recommendation. DPRL decouples POIs and contexts during representation learning, capturing their sequential regularities independently using separate recurrent neural networks (RNNs). To model each user’s preference, DPRL adopts an aggregation mechanism that integrates dynamic user preferences and spatial-temporal factors into the learned representations. We compare DPRL with 16 state-of-the-art baselines. The results show that DPRL outperforms all baselines and achieves an average accuracy improvement of 10.53% over the best-performing baseline.
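A minimal sketch of the decoupling idea, with hypothetical names and dimensions rather than DPRL's actual architecture: POI-id sequences and context sequences each get their own RNN, and the resulting states are merged afterwards.

```python
# Illustrative sketch: separate RNNs for POI and context sequences.
import torch
import torch.nn as nn

class DecoupledEncoder(nn.Module):
    def __init__(self, n_pois, n_ctx, dim=64):
        super().__init__()
        self.poi_emb = nn.Embedding(n_pois, dim)
        self.ctx_emb = nn.Embedding(n_ctx, dim)
        self.poi_rnn = nn.GRU(dim, dim, batch_first=True)
        self.ctx_rnn = nn.GRU(dim, dim, batch_first=True)  # independent regularities

    def forward(self, poi_seq, ctx_seq):
        _, h_poi = self.poi_rnn(self.poi_emb(poi_seq))
        _, h_ctx = self.ctx_rnn(self.ctx_emb(ctx_seq))
        return torch.cat([h_poi[-1], h_ctx[-1]], dim=-1)   # merged per user
```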
2900: Do You Steal My Model? Signature Diffusion Embedded Dual-Verification Watermarking for Protecting Intellectual Property of Hyperspectral Image Classification Models
Authors: Yufei Yang, Song Xiao, Lixiang Li, Wenqian Dong, Jiahui Qu
Location: Guangzhou | Day: TBD
Show Abstract
Due to the high cost of data collection and training, well-performing hyperspectral image (HSI) classification models are of great value and vulnerable to piracy during transmission and use. Model watermarking is a promising technology for the intellectual property (IP) protection of models. However, existing model watermarking methods for RGB image classification models ignore the complexity of ground objects and the high dimensionality of HSIs, which makes trigger samples easy to detect and forge. To address this problem, we propose a signature diffusion embedded dual-verification watermarking method, which generates imperceptible trigger samples with explicit owner information to achieve dual verification of both model ownership and the legality of the trigger set. Specifically, a subpixel-space owner signature diffusion method is proposed for imperceptible trigger set generation, embedding the owner signature into the abundance matrix of seed samples via a diffusion model in subpixel space, thus balancing the perceptual quality of trigger samples against signature extraction capability. To resist ownership confusion, dual-stamp ownership verification is proposed, which queries the suspicious model with trigger samples for ownership verification and further extracts the signature from trigger samples to guarantee their legality. Extensive experiments demonstrate that the proposed method can effectively protect the IP of HSI classification models.
2902: Bi-DiffCD: Bidirectional Diffusion Guided Collaborative Change Detection for Arbitrary-Modal Remote Sensing Images
Authors: Jingyu Zhao, Jiahui Qu, Wenqian Dong
Location: Guangzhou | Day: TBD
Show Abstract
Change detection aims to identify land cover changes by analyzing multitemporal images that cover the same area. However, it may be difficult to obtain high-quality multitemporal images of the same modality in real dynamic scenarios. The rapid development of remote sensing technology enables collaborative observation with multimodal images, but it is challenging for uni-modal image-specific methods to overcome modal discrepancy and exploit complementary advantages for detection. To this end, we propose a bidirectional diffusion guided collaborative change detection model (Bi-DiffCD) for arbitrary-modal images, which eliminates the modal discrepancy between arbitrary-modal images through bidirectional diffusion and makes full use of multilevel complementary features to improve detection accuracy. Specifically, a conditional diffusion-based bidirectional modal alignment module (CDBMA) is designed to align the modal attributes bidirectionally in a step-wise manner while preserving the multimodal complementary features. Furthermore, a multilevel complementary feature collaborative change detection module (MLCCD) is proposed to collaborate the multilevel enhanced complementary change information from transformed images and latent features for change detection. Experiments have been conducted on three widely used multimodal datasets and one self-made multimodal dataset, demonstrating the effectiveness of the proposed method with different combinations of modalities. Code is available at https://github.com/Jiahuiqu/Bi-DiffCD.
2918: Can Retelling Have Adequate Information for Reasoning? An Enhancement Method for Imperfect Video Understanding with Large Language Model
Authors: Mingxin Li, Wenhao Wang, Hongru Ji, Xianghua Li, Chao Gao
Location: Guangzhou | Day: TBD
Show Abstract
Large Language Models (LLMs) demonstrate strong capabilities in video understanding. However, they exhibit hallucinations and factual errors in video description. On the one hand, existing Multimodal Large Language Models (MLLMs) are primarily trained by combining language models and vision models, so their visual understanding capabilities depend on the performance of the backbone. Moreover, video descriptions often suffer from incomplete content and possible errors. Given the proven strong reasoning capabilities of LLMs, this paper proposes ERSR, a novel Entity and Relationship based Self-Enhanced Reasoning method for imperfect video understanding. Specifically, an entity-and-relationship strategy is designed to construct scene graphs from the limited observed entity relationships, thereby enhancing video descriptions. Furthermore, by providing question feedback, a self-enhanced forward and feedback reasoning strategy is introduced to strengthen reasoning logic. Finally, the predicted question answering results are re-validated through rethinking and verification using the LLMs. Extensive experiments show that the proposed method achieves competitive results on real-world video understanding datasets, with an overall improvement of no less than 1.4%.
2926: FedBG: Proactively Mitigating Bias in Cross-Domain Graph Federated Learning Using Background Data
Authors: Sheng Huang, Lele Fu, Tianchi Liao, Bowen Deng, Chuanfu Zhang, Chuan Chen
Location: Guangzhou | Day: TBD
Show Abstract
Federated graph learning focuses on aggregating knowledge from multi-source graph data and training graph neural networks. Unlike traditional federated learning, federated graph learning must also handle topological information. Further, there are biases in features and topologies among clients, which increases the difficulty of training models. Previous methods usually seek global calibration information; however, this approach may suffer from information bias caused by data skews, and it is also difficult to naturally combine feature and topology information. Therefore, adjusting the bias before it occurs can address the learning difficulties caused by the skew. In view of this, we employ background graph data, which serves as reference information for local training, to proactively correct bias before it occurs. As a kind of graph data, background graphs are naturally capable of combining feature and topology information to accomplish bias correction among clients in a comprehensive way. A mixing strategy is applied to the background graph to additionally provide privacy protection, and graph generation methods are employed to restore the diversity of background graphs blurred by the mixing. Extensive experiments on two real-world datasets demonstrate the motivation and effectiveness of the proposed method.
2929: Enhancing the Performance of Global Model by Improving the Adaptability of Local Models in Federated Learning
Authors: Wujun Zhou, Shu Ding, Zelin Li, Wei Wang
Location: Guangzhou | Day: TBD
Show Abstract
Federated learning enables clients to collaboratively train a global model, which is aggregated from local models. Due to the heterogeneous data distributions over clients and data privacy constraints in federated learning, it is difficult to train local models that yield a well-performing global model. In this paper, we introduce the adaptability of local models, i.e., the average performance of local models on the data distributions over clients, and enhance the performance of the global model by improving the adaptability of local models. Since each client does not know the data distributions of other clients, the adaptability of the local model cannot be directly optimized. First, we characterize the property of an appropriate local model with good adaptability to the data distributions over clients. Then, we formalize this property into the local training objective as a constraint and propose a feasible solution to train the local model. Extensive experiments on federated learning benchmarks demonstrate that our method significantly improves the adaptability of local models and achieves a well-performing global model that consistently outperforms the baseline methods.
2942: LLM-TPF: Multiscale Temporal Periodicity-Semantic Fusion LLMs for Time Series Forecasting
Authors: Qihong Pan, Haofei Tan, Guojiang Shen, Xiangjie Kong, Mengmeng Wang, Chenyang Xu
Location: Guangzhou | Day: TBD
Show Abstract
Large language models have demonstrated remarkable generalization capabilities and strong performance across various fields. Recent research has highlighted their significant potential in time series forecasting. However, time series data often exhibit complex periodic characteristics, posing a substantial challenge in enabling these models to effectively capture latent patterns. To address this challenge, we propose a novel framework, LLM-TPF, which leverages individuality and commonality fusion to enhance time series forecasting. In the frequency domain, periodic features are extracted to reveal the intrinsic periodicity of the data, while textual prototypes are used to indicate temporal trends. In the time domain, carefully designed prompts are employed to guide the models in comprehending global information. A commonality fusion mechanism further aggregates heterogeneous information across dimensions, and three distinct language models are utilized to independently process different types of information. Extensive real-world experiments demonstrate that LLM-TPF is a powerful tool for time series forecasting, achieving superior performance compared to state-of-the-art specialized models and exhibiting exceptional generalization ability in zero-shot scenarios. Code is available at https://github.com/switchsky/LLM-TPF.
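Extracting periodic features in the frequency domain commonly amounts to reading dominant periods off the FFT amplitude spectrum; the sketch below shows this generic recipe (LLM-TPF's actual module may differ).

```python
# Illustrative sketch: dominant-period extraction from the FFT spectrum.
import torch

def dominant_periods(x, k=2):
    """x: (batch, length) real-valued series -> list of dominant periods."""
    amp = torch.fft.rfft(x, dim=-1).abs().mean(dim=0)  # averaged amplitude spectrum
    amp[0] = 0                                         # drop the DC component
    top = torch.topk(amp, k).indices                   # strongest frequency bins
    return (x.shape[-1] // top).tolist()               # frequency bin -> period

t = torch.arange(96, dtype=torch.float32)
x = torch.sin(2 * torch.pi * t / 24).unsqueeze(0)      # series with a daily cycle
print(dominant_periods(x, k=1))                        # -> [24]
```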
2943: Guiding LLM-based Smart Contract Generation with Finite State Machine
Authors: Hao Luo, Yuhao Lin, Xiao Yan, Xintong Hu, Yuxiang Wang, Qiming Zeng, Hao Wang, Jiawei Jiang
Location: Guangzhou | Day: TBD
Show Abstract
A smart contract is self-executing code based on blockchain technology with a wide range of application scenarios, but traditional development relies on manual coding and expert auditing, which has a high barrier to entry and low efficiency. Although Large Language Models (LLMs) show great potential in programming tasks, they still face challenges in smart contract generation w.r.t. effectiveness and security. To solve these problems, we propose FSM-SCG, a smart contract generation framework based on finite state machines (FSMs) and LLMs, which significantly improves the quality of the generated code by abstracting user requirements into an FSM, using the FSM to guide the LLM in generating smart contracts, and iteratively optimizing the code with feedback from compilation and security checks. Experimental results show that FSM-SCG significantly improves the quality of smart contract generation. Compared to the best baseline, FSM-SCG improves the compilation success rate of generated smart contract code by up to 48% and reduces the average vulnerability risk score by approximately 68%.
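A minimal sketch of an FSM-guided generate-and-refine loop of the kind described above; `llm`, `compile_solidity`, and `security_scan` are hypothetical stand-ins for the model call, compiler, and vulnerability scanner, not real APIs.

```python
# Illustrative sketch: FSM abstraction, guided generation, feedback-driven repair.
def generate_contract(requirement, llm, compile_solidity, security_scan,
                      max_rounds=5):
    fsm = llm("Abstract these requirements into a finite state machine "
              f"(states, transitions, guards):\n{requirement}")
    code = llm(f"Write a Solidity contract implementing this FSM:\n{fsm}")
    for _ in range(max_rounds):
        ok, compile_log = compile_solidity(code)       # (bool, str)
        issues = security_scan(code) if ok else []     # list of findings
        if ok and not issues:
            return code                                # compiles cleanly, no findings
        feedback = compile_log if not ok else "\n".join(issues)
        code = llm(f"Revise the contract to fix:\n{feedback}\n\n{code}")
    return code
```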
2949: Hybrid Relational Graphs with Sentiment-laden Semantic Alignment for Multimodal Emotion Recognition in Conversation
Authors: Hongru Ji, Xianghua Li, Mingxin Li, Meng Zhao, Chao Gao
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal Emotion Recognition in Conversation (MERC) focuses on detecting the emotions expressed by speakers in each utterance. Recent research has increasingly leveraged graph-based models to capture interactive relationships in conversations, enhancing the ability to extract emotional cues. However, existing methods primarily focus on explicit utterance-level relationships, neglecting both the implicit connections within individual modalities and the differences in implicit relationships across modalities. Moreover, these methods often overlook the role of sentiment features from the conversation history in cross-modal semantic alignment. To address these issues, we propose a novel model that employs modality-adaptive hybrid relational graphs to enrich the dialogue graph by inferring implicit relationships between nodes within each modality. Furthermore, we introduce historical sentiment through a progressive strategy that utilizes contrastive learning to refine cross-modal semantic alignment. Experimental results demonstrate the superior performance of our approach over state-of-the-art methods on the IEMOCAP and MELD datasets. Our code is available at https://github.com/cgao-comp/HRG-SSA.
2974: Pseudo-Label Reconstruction for Partial Multi-Label Learning
Authors: Yu Chen, Fang Li, Na Han, Guanbin Li, Hongbo Gao, Sixian Chan, Xiaozhao Fang
Location: Guangzhou | Day: TBD
Show Abstract
In Partial Multi-Label Learning (PML), each instance is associated with a candidate label set containing multiple relevant labels along with other false positive labels. Currently, most PML methods extract instance correlation directly from instance features while ignoring the candidate labels, which may contain more discriminative instance-related information. This paper argues that, with a well-designed model, more accurate instance correlation can be mined from the candidate labels to facilitate label disambiguation. To this end, we propose a novel PML method based on pseudo-label reconstruction (PML-PLR). Specifically, we first propose a novel orthogonal candidate label reconstruction method, which is jointly optimized with instance features to extract more consistent instance correlation. Then, we use the instance correlation as the reconstruction coefficient to reconstruct pseudo-labels. Subsequently, through local manifold learning, the reconstructed pseudo-labels are leveraged to propagate the consistency relationship between labels and instances, thereby improving the accuracy of the pseudo-labels. Extensive experiments and analyses demonstrate that the proposed PML-PLR outperforms state-of-the-art methods.
2994: Spatially Resolved Transcriptomics Data Clustering with Tailored Spatial-scale Modulation
Authors: Yuang Xiao, Yanran Zhu, Chang Tang, Xiao Zheng, Yuanyuan Liu, Kun Sun, Xinwang Liu
Location: Guangzhou | Day: TBD
Show Abstract
Spatial transcriptomics, comprising spatial location and high-throughput gene expression information, provides revolutionary insights into disease discovery and cellular evolution. Spatial transcriptomic clustering, which pinpoints distinct spatial domains within tissues, reveals cellular interactions and enhances our understanding of the intricate architecture of tissues. Existing methods typically construct spatial graphs using a static radius based on spatial coordinates, which hinders the accurate identification of spatial domains and complicates the precise partitioning of boundary nodes within clusters. To address this issue, we introduce a novel spatially resolved transcriptomics data clustering network (TSstc). Specifically, we employ a tailored spatial-scale modulation approach, constructing different spatial graphs incrementally as the radius of the spatial domain expands, and propose a Spatiality-Aware Sampling (SAS) strategy to aggregate node representations by considering the spatial dependencies between spots. We then use GCN encoders to learn a gene embedding from the gene graph and multiple spatial embeddings from the spatial graphs. During training, we incorporate cross-view correlation-based tailored spatial regularization constraints to preserve high-quality neighbor relationships across spatial embeddings at different scales. Finally, a zero-inflated negative binomial model is utilized to capture the global probability distribution of gene expression profiles. Extensive experimental results demonstrate that our approach surpasses existing state-of-the-art methods in clustering tasks and related downstream applications.
2998: IE-PMMA: Point Cloud Completion Through Inverse Edge-aware Upsampling and Precise Multi-Modal Feature Alignment
Authors: Ran Jia, Junpeng Xue, Shuai Ma, Wenbo Lu, Kelei Wang
Location: Guangzhou | Day: TBD
Show Abstract
Point cloud completion is a crucial task in 3D computer vision. Multi-modal completion approaches have gained attention among the popular two-stage point cloud completion methods. However, there is a notable lack of research focused on accurately aligning data from different modalities within these methods. Additionally, in other point cloud-based tasks, edge point information often provides unexpected positive contributions. In this paper, we propose a novel point cloud completion method that leverages edge point information for the first time in the completion task, which also addresses the precise alignment of multi-modal data. In particular, we implement a two-step local-to-global module to achieve better alignment of multi-modal data during the preliminary point cloud generation process. Besides, we introduce a new spatial representation structure capable of extracting a fixed number of edge points. Moreover, with the assistance of edge information, we further design an inverse edge-aware upsampler to refine the point cloud. We evaluate our method on three typical datasets, and the results demonstrate that our IE-PMMA outperforms the existing state-of-the-art methods quantitatively and visually.
3012: Exploring Transferable Homogenous Groups for Compositional Zero-Shot Learning
Authors: Zhijie Rao, Jingcai Guo, Miaoge Li, Yang Chen, Mengzhu Wang
Location: Guangzhou | Day: TBD
Show Abstract
Conditional dependency presents one of the trickiest problems in Compositional Zero-Shot Learning, leading to significant property variations of the same state (object) across different objects (states). To address this problem, existing approaches often adopt either all-to-one or one-to-one representation paradigms. However, these extremes create an imbalance in the seesaw between transferability and discriminability, favoring one at the expense of the other. Comparatively, humans are adept at analogizing and reasoning in a hierarchical clustering manner, intuitively grouping categories with similar properties to form cohesive concepts. Motivated by this, we propose Homogeneous Group Representation Learning (HGRL), a new perspective that formulates state (object) representation learning as multiple homogeneous sub-group representation learning. HGRL seeks a balance between semantic transferability and discriminability by adaptively discovering and aggregating categories with shared properties, learning distributed group centers that retain group-specific discriminative features. Our method integrates three core components designed to simultaneously enhance both the visual and prompt representation capabilities of the model. Extensive experiments on three benchmark datasets validate the effectiveness of our method. Code is available at https://github.com/zjrao/HGRL.
3015: Indirect Online Preference Optimization via Reinforcement Learning
Authors: En Wang, Xingyu Lin, Du Su, Chenfu Bao, Zhonghou Lv, Funing Yang, Yuanbo Xu, Wenbin Liu
Location: Guangzhou | Day: TBD
Show Abstract
Human preference alignment (HPA) aims to ensure that Large Language Models (LLMs) respond appropriately to meet human moral and ethical requirements. Existing methods, such as RLHF and DPO, rely heavily on high-quality human annotation, which restricts the efficiency of iterative online model refinement.
To address the inefficiency of human annotation acquisition, the iterated online strategy advocates using fine-tuned LLMs to self-generate preference data. However, this approach is prone to distribution bias, owing to differences between human and model annotations as well as modeling errors between simulators and real-world contexts. To mitigate the impact of distribution bias, we adopt the principles of adversarial training, framing a zero-sum two-player game between a protagonist agent and an adversarial agent. With the adversarial agent challenging the alignment of the protagonist agent, we continuously refine the protagonist’s performance. By utilizing min-max equilibrium and Nash equilibrium strategies, we propose the Indirect Online Preference Optimization (IOPO) mechanism, which enables the protagonist agent to converge without bias while maintaining linear computational complexity. Extensive experiments across three real-world datasets demonstrate that IOPO outperforms state-of-the-art alignment methods in both offline and online scenarios, as evidenced by standard alignment metrics and human evaluations. This innovation reduces the time required for model iterations from months to one week, alleviates distribution shifts, and significantly cuts annotation costs.
3032: Learning to Extrapolate and Adjust: Two-Stage Meta-Learning for Concept Drift in Online Time Series Forecasting
Authors: Weiqi Chen, Zhaoyang Zhu, Yifan Zhang, Lefei Shen, Linxiao Yang, Qingsong Wen, Liang Sun
Location: Guangzhou | Day: TBD
Show Abstract
The inherent non-stationarity of time series in practical applications poses significant challenges for accurate forecasting. This paper tackles the concept drift problem, where the underlying distribution or environment of a time series changes. To better describe the characteristics of concept drifts and model them effectively, we first classify them into macro-drift (stable, long-term changes) and micro-drift (sudden, short-term fluctuations). Next, we propose a unified meta-learning framework called LEAF (Learning to Extrapolate and Adjust for Forecasting), where an extrapolation module first tracks and extrapolates the prediction model in latent space to handle macro-drift, and an adjustment module then incorporates a meta-learnable surrogate loss to capture sample-specific micro-drift patterns. LEAF’s dual-stage approach effectively addresses diverse concept drifts and is model-agnostic, compatible with any deep prediction model. We further provide theoretical analysis to justify why the proposed framework can handle macro-drift and micro-drift. To facilitate further research in this field, we release three electric load time series datasets collected from real-world scenarios, exhibiting diverse and typical concept drifts. Extensive experiments on multiple datasets demonstrate the effectiveness of LEAF.
3035: BRIGHT-VO: Brightness-Guided Hybrid Transformer for Visual Odometry with Multi-modality Refinement Module
Authors: Dongzhihan Wang, Yang Yang, Xuyang Chen, Liang Xu
Location: Guangzhou | Day: TBD
Show Abstract
Visual odometry (VO) plays a crucial role in autonomous driving, robotic navigation, and other related tasks by estimating the position and orientation of a camera based on visual input. Significant progress has been made in data-driven VO methods, particularly those leveraging deep learning techniques to extract image features and estimate camera poses. However, these methods often struggle in low-light conditions because of the reduced visibility of features and the increased difficulty of matching keypoints. To address this limitation, we introduce BrightVO, a novel VO model based on Transformer architecture, which not only performs front-end visual feature extraction, but also incorporates a multi-modality refinement module in the back-end that integrates Inertial Measurement Unit (IMU) data. Using pose graph optimization, this module iteratively refines pose estimates to reduce errors and improve both accuracy and robustness. Furthermore, we create a synthetic low-light dataset, KiC4R, which includes a variety of lighting conditions to facilitate the training and evaluation of VO frameworks in challenging environments. Experimental results demonstrate that BrightVO achieves state-of-the-art performance on both the KiC4R dataset and the KITTI benchmarks. Specifically, it provides an average improvement of 20% in pose estimation accuracy in normal outdoor environments and 25% in low-light conditions, outperforming existing methods. This work is open-source at https://github.com/Anastasiawd/BrightVO.
3044: Structure-Aware Handwritten Text Recognition via Graph-Enhanced Cross-Modal Mutual Learning
Authors: Ji Gan, Yupeng Zhou, Yanming Zhang, Jiaxu Leng, Xinbo Gao
Location: Guangzhou | Day: TBD
Show Abstract
Existing handwriting recognition methods only focus on learning visual patterns by modeling low-level relationships of adjacent pixels, while overlooking the intrinsic geometric structures of characters. In this paper, we propose a novel graph-enhanced cross-modal mutual learning network GCM to fully process handwritten text images alongside their corresponding geometric graphs, which consists of one shared cross-modal encoder and two parallel inverse decoders. Specifically, the encoder simultaneously extracts visual and geometric information from the cross-modal inputs, and the decoders fuse the multi-modal features for prediction under the guidance of cross-modal fusion. Moreover, two parallel decoders sequentially aggregate cross-modal features in inverse orders (V→G and G→V) but are enhanced through mutual distillation at each time-step, which involves one-to-one knowledge transfer and fully leverages complementary cross-modal information from both directions. Notably, only one branch of GCM is activated in inference, thus avoiding the increase of the model parameters and computation costs for testing. Experiments show that our method outperforms previous state-of-the-art methods on public benchmarks such as IAM, RIMES, and ICDAR-2013 when no extra training data is utilized.
3048: Richer Semantics, Better Alignment: Aligning Visual Features with Explicit and Enriched Semantics for Visible-Infrared Person Re-Identification
Authors: Neng Dong, Shuanglin Yan, Liyan Zhang, Jinhui Tang
Location: Guangzhou | Day: TBD
Show Abstract
Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities. Existing methods learn visual features solely from images, failing to align them into a modality-invariant semantic space. In this paper, we propose a novel framework, termed Richer Semantics, Better Alignment (RSBA), to align visual features with explicit and enriched semantics. Specifically, we first develop an Explicit Semantics-Guided Feature Alignment (ESFA) module, which supplements textual descriptions for cross-modality images and aligns image-text pairs within each modality, alleviating the distribution discrepancy of visual features. We then devise a Consistent Similarity-Guided Indirect Alignment (CSIA) module, which constrains the similarity between intra-modality image-text pairs to be consistent with that between inter-modality text-text pairs, indirectly aligning visual features with cross-modality semantics. Furthermore, we design a Cross-View Semantics Compensation (CVSC) module, which integrates multi-view texts and upgrades the image-text matching in ESFA and CSIA from one-to-one to one-to-many, further strengthening the alignment of visual features within the semantic space. Extensive experimental results on three public datasets demonstrate the effectiveness and superiority of our proposed RSBA.
3053: Pixel-wise Divide and Conquer for Federated Vessel Segmentation
Authors: Tian Chen, Wenke Huang, Zhihao Wang, Zekun Shi, He Li, Wenhui Dong, Mang Ye, Bo Du, Yongchao Xu
Location: Guangzhou | Day: TBD
Show Abstract
Accurate vessel segmentation is essential for diagnosing and managing vascular and ophthalmic diseases. Traditional learning-based vessel segmentation methods heavily rely on high-quality, pixel-level annotated datasets. However, segmentation performance suffers significantly in federated learning settings due to vessel morphology inconsistency and vessel-background imbalance. The former limits the ability of models to capture fine-grained vessels, while the latter overemphasizes background pixels and biases the model towards them. To address these challenges, we propose a novel method named Federated Vessel-Aware Calibration (FVAC), which leverages global uncertainty to provide differentiated guidance for clients, focusing on pixels of various morphologies that are difficult to distinguish. Furthermore, we introduce a foreground-background decoupling alignment strategy that utilizes more stable and balanced global features to mitigate the semantic drift caused by vessel-background imbalance in local clients. Comprehensive experiments confirm the effectiveness of our method.
3057: MMGIA: Gradient Inversion Attack Against Multimodal Federated Learning via Intermodal Correlation
Authors: Lele Zheng, Yang Cao, Leo Yu Zhang, Wei Wang, Yulong Shen, Xiaochun Cao
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal federated learning (MMFL) enables collaborative model training across multiple modalities, such as images and text, without requiring direct data sharing. However, the inherent correlations between modalities introduce new privacy vulnerabilities, making MMFL more susceptible to gradient inversion attacks. In this work, we propose MMGIA, an intermodal correlation-driven gradient inversion attack that systematically exploits multimodal correlation to enhance data reconstruction quality. MMGIA consists of a two-stage optimization framework: the first stage independently reconstructs each modality using traditional gradient inversion techniques, while the second stage refines these reconstructions through pre-trained feature extractors to align modalities in a shared latent space. To further improve reconstruction accuracy, we introduce a quality-weighted fusion strategy, which dynamically integrates multimodal embeddings into a global fused representation that serves as a guiding signal for refining each modality’s reconstruction. This ensures that high-quality reconstructions contribute more to the optimization process, preventing degradation in well-reconstructed modalities while enhancing weaker ones. We conduct extensive experiments on multiple multimodal scenarios, demonstrating that MMGIA outperforms both the only existing multimodal attack and state-of-the-art single-modal attacks, revealing the heightened privacy risks in MMFL.
3079: Secure and Efficient Watermarking for Latent Diffusion Models in Model Distribution Scenarios
Authors: Liangqi Lei, Keke Gai, Jing Yu, Liehuang Zhu, Qi Wu
Location: Guangzhou | Day: TBD
Show Abstract
Latent diffusion models have exhibited considerable potential in generative tasks. Watermarking is considered an effective means to safeguard the copyright of generative models and prevent their misuse. However, in model distribution scenarios, the accessibility of models to a large number of users brings new challenges to the security, efficiency, and robustness of existing watermark solutions. To address these issues, we propose a secure and efficient watermarking solution. A new security mechanism is designed to prevent watermark leakage and watermark escape, which treats watermark randomness and watermark-model association as two constraints for mandatory watermark injection. To reduce the time cost of training the security module, watermark injection and the security mechanism are decoupled, ensuring that fine-tuning the VAE only implements the security mechanism without the burden of learning watermark patterns. A watermark distribution-based verification strategy is proposed to enhance robustness against diverse attacks in model distribution scenarios. Experimental results show that our watermarking consistently outperforms six existing baselines in effectiveness and robustness against ten image processing attacks and adversarial attacks, while enhancing security in distribution scenarios. The code is available at https://anonymous.4open.science/r/DistriMark-F11F/.
3118: MiniMal: Hard-Label Adversarial Attack Against Static Malware Detection with Minimal Perturbation
Authors: Chengyi Li, Zhiyuan Jiang, Yongjun Wang, Tian Xia, Yayuan Zhang, Yuhang Mao
Location: Guangzhou | Day: TBD
Show Abstract
Static malware detectors based on machine learning are integral to contemporary antivirus systems, but they are vulnerable to adversarial attacks. While existing research has demonstrated success with adversarial attacks in black-box hard-label scenarios, challenges such as high perturbation rates and incomplete retention of functional integrity remain. To address these issues, we propose a novel black-box hard-label attack method, MiniMal. MiniMal begins with initialized adversarial examples and utilizes binary search and particle swarm optimization algorithms to streamline the perturbation content, significantly reducing the perturbation rate of the adversarial examples. Furthermore, we propose a functionality verification method grounded in file format parsing and control flow graph comparisons to ensure the functional integrity of the adversarial examples. Experimental results indicate that MiniMal achieves an attack success rate of over 98% against three leading machine learning detectors, improving performance by approximately 4.8% to 7.1% compared to state-of-the-art methods. MiniMal reduces perturbation rates to below 40%, making them 9 to 11 times lower than those of previous methods. Additionally, functional verification via Cuckoo Sandbox revealed that the adversarial examples generated by MiniMal retained 100% functional integrity, even with various modifications applied.
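The binary-search component can be illustrated generically: assuming evasiveness is roughly monotone in payload length (an assumption for this sketch), shrink an adversarial payload to the smallest prefix that still flips the hard-label detector. `apply_payload` and `is_evasive` are hypothetical stand-ins, not MiniMal's actual interfaces.

```python
# Illustrative sketch: minimize an evasive payload via binary search.
def minimize_payload(sample, payload, apply_payload, is_evasive):
    # Precondition: the full payload must already evade the detector.
    assert is_evasive(apply_payload(sample, payload))
    lo, hi = 0, len(payload)           # invariant: payload[:hi] still evades
    while lo < hi:
        mid = (lo + hi) // 2
        if is_evasive(apply_payload(sample, payload[:mid])):
            hi = mid                   # a shorter payload still works
        else:
            lo = mid + 1
    return payload[:hi]                # smallest still-evasive prefix
```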
3126: CrossVTON: Mimicking the Logic Reasoning on Cross-Category Virtual Try-On Guided by Tri-Zone Priors
Authors: Donghao Luo, Yujie Liang, Xu Peng, Xiaobin Hu, Boyuan Jiang, Chengming Xu, Taisong Jin, Chengjie Wang, Yanwei Fu
Location: Guangzhou | Day: TBD
Show Abstract
Despite remarkable progress in image-based virtual try-on systems, generating realistic and robust fitting images for cross-category virtual try-on remains a challenging task. The primary difficulty arises from the absence of human-like reasoning, which involves addressing size mismatches between garments and models while recognizing and leveraging the distinct functionalities of various regions within the model images. To address this issue, we draw inspiration from human cognitive processes and disentangle the complex reasoning required for cross-category try-on into a structured framework. This framework systematically decomposes the model image into three distinct regions: try-on, reconstruction, and imagination zones. Each zone plays a specific role in accommodating the garment and facilitating realistic synthesis. To endow the model with robust reasoning capabilities for cross-category scenarios, we propose an iterative data constructor. This constructor encompasses diverse scenarios, including intra-category try-on, any-to-dress transformations (replacing any garment category with a dress), and dress-to-any transformations (replacing a dress with another garment category). Utilizing the generated dataset, we introduce a tri-zone priors generator that intelligently predicts the try-on, reconstruction, and imagination zones by analyzing how the input garment is expected to align with the model image. Guided by these tri-zone priors, our proposed method, CrossVTON, achieves state-of-the-art performance, surpassing existing baselines in both qualitative and quantitative evaluations. Notably, it demonstrates superior capability in handling cross-category virtual try-on, meeting the complex demands of real-world applications.
3130: Continuous Diffusive Prediction Network for Multi-Station Weather Prediction
Authors: Chujie Xu, Yuqing Ma, Haoyuan Deng, Yajun Gao, Yudie Wang, Kai Lv, Xianglong Liu
Location: Guangzhou | Day: TBD
Show Abstract
Multi-station weather prediction provides weather forecasts for specific geographical locations, playing an important role in various aspects of daily life. Existing methods consider the relationships between individual stations discretely, making it difficult to model the continuous spatiotemporal processes of atmospheric motion, which results in suboptimal prediction outcomes. This paper proposes the Continuous Diffusive Prediction Network (CDPNet) to model the real-world continuous weather change process from discrete station observation data. CDPNet consists of two core modules: the Continuous Calibrated Initialization (CCI) and the Diffusive Difference Estimation (DDE). The CCI module interpolates data between observation stations to construct a spatially continuous physical field and ensures temporal continuity by integrating directional information from a global perspective. It accurately represents the current physical state and provides a foundation for future weather prediction. Moreover, the DDE module explicitly captures the spatial diffusion process and estimates the diffusive differences between consecutive time steps, effectively modeling spatio-temporally continuous atmospheric motion. Likewise, directional information on weather changes is introduced from the entire historical series to mitigate estimation uncertainty and improve the performance of weather prediction. Extensive experiments on the Weather2K and Global Wind/Temp datasets demonstrate that CDPNet outperforms state-of-the-art models.
3133: DenseSAM: Semantic Enhance SAM for Efficient Dense Object Segmentation
Authors: Linyun Zhou, Jiacong Hu, Shengxuming Zhang, Xiangtong Du, Mingli Song, Xiuming Zhang, Zunlei Feng
Location: Guangzhou | Day: TBD
Show Abstract
Dense object segmentation is essential for various applications, particularly in pathology image and remote sensing image analysis. However, distinguishing numerous similar and densely packed objects in this task presents significant challenges. Several methods, including CNN- and ViT-based approaches, have been proposed to tackle these issues. Yet, models trained on limited datasets exhibit limited generalization ability. The Segment Anything Model (SAM) has recently achieved significant progress in zero-shot segmentation but relies heavily on precise positional guidance. However, providing numerous accurate location prompts in dense scenarios is time-consuming. To overcome this limitation, we conducted an in-depth exploration of the SAM mechanism and found that its strong generalization ability stems from the encoder’s edge detection capability, which is semantically independent, making location prompts essential for segmentation. This insight inspired the development of DenseSAM, which replaces location prompts with semantic guidance for automatic segmentation in dense scenarios. Specifically, it uses local details to weaken the edges of background objects, leverages global context to enhance intra-class feature similarity, while further increasing contrast with the background, and integrates a dual-head decoding process to enable lightweight automatic semantic segmentation. Extensive experiments on pathology images demonstrate that DenseSAM delivers remarkable performance with minimal training parameters, providing a cost-effective and efficient solution. Moreover, experiments on remote sensing images further validate its excellent scalability, making DenseSAM suitable for various dense object segmentation domains. The code is available at https://github.com/imAzhou/DenseSAM.
3135: FedSaaS: Class-Consistency Federated Semantic Segmentation via Global Prototype Supervision and Local Adversarial Harmonization
Authors: Xiaoyang Yu, Xiaoming Wu, Xin Wang, Dongrun Li, Ming Yang, Peng Cheng
Location: Guangzhou | Day: TBD
Show Abstract
Federated semantic segmentation enables pixel-level classification in images through collaborative learning while maintaining data privacy. However, existing research commonly overlooks the fine-grained class relationships within the semantic space when addressing heterogeneity, particularly domain shift. This oversight results in ambiguities between class representations. To overcome this challenge, we propose a novel federated segmentation framework that achieves class consistency, termed FedSaaS. Specifically, we introduce class exemplars as a criterion for both local- and global-level class representations. On the server side, the uploaded class exemplars are leveraged to model class prototypes, which supervise the global branch of clients, ensuring alignment with the global-level representation. On the client side, we incorporate an adversarial mechanism to harmonize the contributions of the global and local branches, leading to consistent output. Moreover, multilevel contrastive losses are employed on both sides to enforce consistency between the two levels of representation in the same semantic space. Extensive experiments on five driving scene segmentation datasets demonstrate that our framework outperforms state-of-the-art methods, significantly improving average segmentation accuracy and effectively addressing the class-consistency representation problem.
3136: A Theoretical Perspective on Why Stochastic Population Update Needs an Archive in Evolutionary Multi-objective Optimization
Authors: Shengjie Ren, Zimin Liang, Miqing Li, Chao Qian
Location: Guangzhou | Day: TBD
Show Abstract
Evolutionary algorithms (EAs) have been widely applied to multi-objective optimization due to their population-based nature. Population update, a key component in multi-objective EAs (MOEAs), is usually performed in a greedy, deterministic manner. However, recent studies have questioned this practice and shown that stochastic population update (SPU), which gives inferior solutions a chance to be preserved, can help MOEAs jump out of local optima more easily. Nevertheless, SPU risks losing high-quality solutions, potentially requiring a large population. Intuitively, a possible remedy is to introduce an archive that stores the best solutions ever found. In this paper, we theoretically show that using an archive allows a small population and may enhance the search performance of SPU-based MOEAs. We examine two classic algorithms, SMS-EMOA and NSGA-II, on the bi-objective problem OneJumpZeroJump, and prove that using an archive can reduce the upper bound on the expected running time (even exponentially). The comparison between SMS-EMOA and NSGA-II also suggests that the (μ+μ) update mode may be more suitable for SPU than the (μ+1) update mode. We also validate our findings empirically. We hope this work provides theoretical support for exploring different algorithm design ideas in evolutionary multi-objective optimization.
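The archive itself is conceptually simple: keep every non-dominated solution found so far, so the population can stay small and stochastic without losing the best trade-offs. A minimal sketch for a maximization problem (illustrative, not the paper's experimental code):

```python
# Illustrative sketch: maintaining an archive of non-dominated solutions.
def dominates(a, b):
    """Maximization: a dominates b if >= on every objective and > on some."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def update_archive(archive, candidate):
    """archive: list of objective vectors; candidate: new objective vector."""
    if any(dominates(kept, candidate) for kept in archive):
        return archive                                # candidate is dominated
    # Drop anything the candidate dominates, then keep the candidate.
    return [s for s in archive if not dominates(candidate, s)] + [candidate]
```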
3137: Dual-level Fuzzy Learning with Patch Guidance for Image Ordinal Regression
Authors: Chunlai Dong, Haochao Ying, Qibo Qiu, Jinhong Wang, Danny Chen, Jian Wu
Location: Guangzhou | Day: TBD
Show Abstract
Ordinal regression bridges regression and classification by assigning objects to ordered classes. While human experts rely on discriminative patch-level features for decisions, current approaches are limited by the availability of only image-level ordinal labels, overlooking fine-grained patch-level characteristics. In this paper, we propose a Dual-level Fuzzy Learning with Patch Guidance framework, named DFPG that learns precise feature-based grading boundaries from ambiguous ordinal labels, with patch-level supervision. Specifically, we propose patch-labeling and filtering strategies to enable the model to focus on patch-level features exclusively with only image-level ordinal labels available. We further design a dual-level fuzzy learning module, which leverages fuzzy logic to quantitatively capture and handle label ambiguity from both patch-wise and channel-wise perspectives. Extensive experiments on various image ordinal regression datasets demonstrate the superiority of our proposed method, further confirming its ability in distinguishing samples from difficult-to-classify categories. The code is available at https://github.com/ZJUMAI/DFPG-ord.
3142: GraphAD: Interaction Scene Graph for End-to-end Autonomous Driving
Authors: Yunpeng Zhang, Deheng Qian, Ding Li, Yifeng Pan, Yong Chen, Zhenbao Liang, Zhiyao Zhang, Yingzong Liu, Jianhui Mei, Maolei Fu, Yun Ye, Zhujin Liang, Yi Shan, Dalong Du
Location: Guangzhou | Day: TBD
Show Abstract
Modeling complicated interactions among the ego-vehicle, road agents, and map elements has been a crucial part for safety-critical autonomous driving. Previous work on end-to-end autonomous driving relies on the attention mechanism to handle heterogeneous interactions, which fails to capture geometric priors and is also computationally intensive. In this paper, we propose the Interaction Scene Graph (ISG) as a unified method to model the interactions among the ego-vehicle, road agents, and map elements. With the representation of the ISG, the driving agents aggregate essential information from the most influential elements, including the road agents with potential collisions and the map elements to follow. Since a mass of unnecessary interactions are omitted, the more efficient scene-graph-based framework is able to focus on indispensable connections and leads to better performance. We evaluate the proposed method for end-to-end autonomous driving on the nuScenes dataset. Compared with strong baselines, our method significantly outperforms in full-stack driving tasks.
3148: Dyn-D^2P: Dynamic Differentially Private Decentralized Learning with Provable Utility Guarantee
Authors: Zehan Zhu, Yan Huang, Xin Wang, Shouling Ji, Jinming Xu
Location: Guangzhou | Day: TBD
Show Abstract
Most existing decentralized learning methods with differential privacy (DP) guarantees rely on constant gradient clipping bounds and fixed-level DP Gaussian noise for each node throughout the training process, leading to significant accuracy degradation compared to non-private counterparts. In this paper, we propose a new Dynamic Differentially Private Decentralized learning approach (termed Dyn-D^2P) tailored for general time-varying directed networks. Leveraging the Gaussian DP (GDP) framework for privacy accounting, Dyn-D^2P dynamically adjusts gradient clipping bounds and noise levels based on gradient convergence. This dynamic noise strategy enables us to enhance model accuracy while preserving the total privacy budget. Extensive experiments on benchmark datasets demonstrate the superiority of Dyn-D^2P over its counterparts employing fixed-level noise, especially under strong privacy guarantees. Furthermore, we provide a provable utility bound for Dyn-D^2P that establishes an explicit dependency on network-related parameters, with a scaling factor of 1/√n in the number of nodes n, up to a bias error term induced by gradient clipping. To our knowledge, this is the first model utility analysis for differentially private decentralized non-convex optimization with dynamic gradient clipping bounds and noise levels.
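The per-step privatization primitive behind such methods can be sketched as follows; the geometric decay schedule is a hypothetical stand-in for Dyn-D^2P's convergence-based adjustment of the clipping bound and noise level.

```python
import torch

def private_grad(grads, clip_bound, sigma):
    """Clip the concatenated gradient to `clip_bound`, then add Gaussian
    noise whose std scales with the bound (the standard DP-SGD recipe)."""
    flat = torch.cat([g.reshape(-1) for g in grads])
    flat = flat * min(1.0, clip_bound / (flat.norm().item() + 1e-12))
    return flat + torch.randn_like(flat) * sigma * clip_bound

# Hypothetical dynamic schedule: as training converges, shrink the clipping
# bound so the injected noise (proportional to it) shrinks as well.
clip0, decay, sigma = 1.0, 0.99, 0.8
for t in range(100):
    grads = [torch.randn(10)]            # stand-in for real per-node gradients
    g_t = private_grad(grads, clip0 * decay ** t, sigma)
```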
3150: Cost-Effective On-Device Sequential Recommendation with Spiking Neural Networks
Authors: Di Yu, Changze Lv, Xin Du, Linshan Jiang, Qing Yin, Wentao Tong, Xiaoqing Zheng, Shuiguang Deng
Location: Guangzhou | Day: TBD
Show Abstract
On-device sequential recommendation (SR) systems are designed to make local inferences using real-time features, thereby alleviating the communication burden on server-based recommenders when handling concurrent requests from millions of users.
However, the resource constraints of edge devices, including limited memory and computational capacity, pose significant challenges to deploying efficient SR models.
Inspired by the energy-efficient and sparse computing properties of deep Spiking Neural Networks (SNNs), we propose a cost-effective on-device SR model named SSR, which encodes dense embedding representations into sparse spike-wise representations and integrates novel spiking filter modules to extract temporal patterns and critical features from item sequences, optimizing computational and memory efficiency without sacrificing recommendation accuracy.
Extensive experiments on real-world datasets demonstrate the superiority of SSR. Compared to other SR baselines, SSR achieves comparable recommendation performance while reducing energy consumption by an average of 59.43%. In addition, SSR significantly lowers memory usage, making it particularly well-suited for deployment on resource-constrained edge devices.
3152: Fine-grained Prompt Screening: Defending Against Backdoor Attack on Text-to-Image Diffusion Models
Authors: Yiran Xu, Nan Zhong, Guobiao Li, Anda Cheng, Yinggui Wang, Zhenxing Qian, Xinpeng Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Text-to-image (T2I) diffusion models have exhibited impressive generation capabilities in recent studies. However, they are vulnerable to backdoor attacks, where model outputs are manipulated by malicious triggers. In this paper, we propose a novel input-level defense method, called Fine-grained Prompt Screening (GrainPS). Our method is motivated by a phenomenon we term Semantics Misalignment, where the backdoor trigger causes inconsistency between the cross-attention projections of object words (the key words that determine the main content of the generated image) and their true semantics. In particular, we divide each prompt into pieces and conduct fine-grained analysis by examining the impact of the trigger on object words in the cross-attention layers rather than its global influence on the entire generated image. To assess the impact of each word on object words, we formulate a "semantics alignment score" as the metric, with a carefully crafted detection strategy to identify the trigger. As a result, our method can detect backdoor input prompts and localize triggers simultaneously. Evaluations across four advanced backdoor attack scenarios demonstrate the effectiveness of our proposed defense method.
3154: TEST-V: TEst-time Support-set Tuning for Zero-shot Video Classification
Authors: Rui Yan, Jin Wang, Hongyu Qu, Xiaoyu Du, Dong Zhang, Jinhui Tang, Tieniu Tan
Location: Guangzhou | Day: TBD
Show Abstract
Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification by tuning class embeddings with a few prompts (Test-time Prompt Tuning, TPT) or replacing class names with generated visual samples (a support-set) has shown promising results. However, TPT cannot avoid the semantic gap between modalities, while the support-set cannot be tuned. To this end, we combine the strengths of both and propose a novel framework, namely TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts inquired from LLMs to enrich the diversity of the support-set. ii) TSE tunes the support-set with factorized learnable weights according to temporal prediction consistency in a self-supervised manner to mine pivotal supporting cues for each class. TEST-V achieves state-of-the-art results across four benchmarks and shows good interpretability.
3164: SocialMP: Learning Social Aware Motion Patterns via Additive Fusion for Pedestrian Trajectory Prediction
Authors: Tianci Gao, Yuzhen Zhang, Hang Guo, Pei Lv
Location: Guangzhou | Day: TBD
Show Abstract
Accurately capturing social interaction in complex scenarios is essential for the pedestrian trajectory prediction task. The uncertainty in pedestrian interactions and the physical constraints imposed by the environment make this task challenging. To address it, existing methods adopt dimensionality reduction algorithms to capture explainable human motions and behaviors. However, these approaches not only suffer from weak social awareness due to inadequate feature extraction, but also overlook physical constraints, so the predicted trajectories often cross unwalkable areas. To overcome these problems, we build an attention-based motion pattern representation, named SocialMP, which can effectively enhance the social awareness and environmental perception of motion patterns. Specifically, our method first characterizes the motion patterns through singular value decomposition and defines a visual field-based rule to model environmental social interaction. Then, an attention-based additive fusion mechanism is designed to enhance the social awareness and environmental perception of motion patterns. Within this mechanism, we integrate social interactions into motion patterns through cross-attention to generate latent motion patterns, and feed them into our devised additive fusion structure with a backward connection for multiple iterations. Lastly, we design a map loss function by applying an additional penalty to the average displacement error to prevent pedestrians from passing through unwalkable areas. Extensive experiments on the ETH-UCY and SDD datasets demonstrate that our SocialMP can not only improve prediction accuracy but also generate plausible trajectories.
3167: ECC-SNN: Cost-Effective Edge-Cloud Collaboration for Spiking Neural Networks
Authors: Di Yu, Changze Lv, Xin Du, Linshan Jiang, Wentao Tong, Zhenyu Liao, Xiaoqing Zheng, Shuiguang Deng
Location: Guangzhou | Day: TBD
Show Abstract
Most edge-cloud collaboration frameworks rely on the substantial computational and storage capabilities of cloud-based artificial neural networks (ANNs). However, this reliance results in significant communication overhead between edge devices and the cloud, as well as high computational energy consumption, especially when applied to resource-constrained edge devices. To address these challenges, we propose ECC-SNN, a novel edge-cloud collaboration framework that incorporates energy-efficient spiking neural networks (SNNs) to offload more computational workload from the cloud to the edge, thereby improving cost-effectiveness and reducing reliance on the cloud. ECC-SNN employs a joint training approach that integrates ANN and SNN models, enabling edge devices to leverage knowledge from cloud models for enhanced performance while reducing energy consumption and processing latency. Furthermore, ECC-SNN features an on-device incremental learning algorithm that enables edge models to continuously adapt to dynamic environments, reducing the communication overhead and resource consumption associated with frequent cloud update requests. Extensive experimental results on four datasets demonstrate that ECC-SNN improves accuracy by 4.15%, reduces average energy consumption by 79.4%, and lowers average processing latency by 39.1%.
3176: A Priori Estimation of the Approximation, Optimization and Generalization Errors of Random Neural Networks for Solving Partial Differential Equations
Authors: Xianliang Xu, Ye Li, Zhongyi Huang
Location: Guangzhou | Day: TBD
Show Abstract
In recent years, neural networks have achieved remarkable progress in various fields, and their application to scientific problems has also drawn much attention. A line of methods involving neural networks for solving partial differential equations (PDEs), such as Physics-Informed Neural Networks (PINNs) and the Deep Ritz Method (DRM), has emerged. Although these methods outperform classical numerical methods in certain cases, the optimization problems involving neural networks are typically non-convex and non-smooth, which can result in unsatisfactory solutions for PDEs. In contrast to deterministic neural networks, the hidden weights of random neural networks are sampled from some prior distribution and only the output weights participate in training. This makes training much simpler, but it remains unclear how to select the prior distribution. In this paper, we focus on Barron-type functions and approximate them under Sobolev norms by random neural networks with a clearly specified prior distribution. In addition to the approximation error, we also derive bounds for the optimization and generalization errors of random neural networks for solving PDEs when the solutions are Barron-type functions.
3180: LRGR: Self-Supervised Incomplete Multi-View Clustering via Local Refinement and Global Realignment
Authors: Yanwanyu Xi, Xiao Zheng, Chang Tang, Xingchen Hu, Yuanyuan Liu, Jun-Jie Huang, Xinwang Liu
Location: Guangzhou | Day: TBD
Show Abstract
Incomplete Multi-View Clustering (IMVC) aims to explore comprehensive representations from multiple views with missing samples.
Recent studies have revealed that IMVC methods benefit from Graph Convolutional Networks (GCNs) in achieving robust feature imputation and effective representation learning. Despite these notable improvements, GCN imputation methods often cause a distribution shift between the imputed and original representations, particularly when the neighbors of the imputed nodes are assigned to different groups. Moreover, GCN learning methods tend to produce homogeneous imputed representations, which blur cluster boundaries and hinder effective discriminative clustering.
To address these challenges, the Local Refinement and Global Realignment (LRGR) self-supervised model is proposed for incomplete multi-view clustering, which comprises two stages.
In the first stage, a local imputed refinement module is designed to enhance the versatility of imputed representations through cross-view contrastive learning guided by view-specific prototypes.
In the second stage, a global realignment module is introduced to achieve semantic consistency across views, alleviating distribution shifts by leveraging pseudo-labels and their corresponding confidence scores as guidance.
Experiments on five widely used multi-view datasets demonstrate the competitiveness and superiority of our method compared to state-of-the-art approaches.
3185: Single-Node Trigger Backdoor Attacks in Graph-Based Recommendation Systems
Authors: Runze Li, Di Jin, Xiaobao Wang, Dongxiao He, Bingdao Feng, Zhen Wang
Location: Guangzhou | Day: TBD
Show Abstract
Graph recommendation systems have been widely studied due to their ability to effectively capture the complex interactions between users and items. However, these systems also exhibit certain vulnerabilities when faced with attacks. The prevailing shilling attack methods typically manipulate recommendation results by injecting a large number of fake nodes and edges. However, such attack strategies suffer from two primary drawbacks: low stealth and high destructiveness. To address these issues, this paper proposes a novel graph backdoor attack method that aims to enhance the exposure of target items to the target user in a covert manner, without affecting other unrelated nodes. Specifically, we design a single-node trigger generator, which can effectively expose multiple target items to the target user by inserting only one fake user node. Additionally, we introduce constraint conditions between the target nodes and irrelevant nodes to mitigate the impact of fake nodes on the recommendation system's performance. Experimental results show that the exposure of the target items reaches at least 50% for 99% of the target users, while the impact on the recommendation system's performance is kept within approximately 5%.
3197: CorrDetail: Visual Detail Enhanced Self-Correction for Face Forgery Detection
Authors: Binjia Zhou, Hengrui Lou, Lizhe Chen, Haoyuan Li, Dawei Luo, Shuai Chen, Jie Lei, Zunlei Feng, Yijun Bei
Location: Guangzhou | Day: TBD
Show Abstract
With the swift progression of image generation technology, the widespread emergence of facial deepfakes poses significant challenges to the field of security, amplifying the urgent need for effective deepfake detection. Existing techniques for face forgery detection can broadly be categorized into two primary groups: visual-based methods and multimodal approaches. The former often lacks clear explanations for forgery details, while the latter, which merges visual and linguistic modalities, is more prone to hallucinations. To address these shortcomings, we introduce a visual detail enhanced self-correction framework, designated CorrDetail, for interpretable face forgery detection. CorrDetail is designed to rectify forgery details when provided with error-guided questioning, with the aim of fostering the ability to uncover forgery details rather than yielding hallucinated responses. Additionally, to bolster the reliability of its findings, a visual fine-grained detail enhancement module is incorporated, supplying CorrDetail with more precise visual forgery details. Finally, a fusion decision strategy is devised to further augment the model's discriminative capacity on extreme samples, through the integration of visual information compensation and model bias reduction. Experimental results demonstrate that CorrDetail not only achieves state-of-the-art performance compared to the latest methodologies but also excels in accurately identifying forged details, all while exhibiting robust generalization capabilities.
3203: Antibody Design and Optimization with Multi-scale Equivariant Graph Diffusion Models for Accurate Complex Antigen Binding
Authors: Jiameng Chen, Xiantao Cai, Jia Wu, Wenbin Hu
Location: Guangzhou | Day: TBD
Show Abstract
Antibody design remains a critical challenge in therapeutic and diagnostic development, particularly for complex antigens with diverse binding interfaces. Current computational methods face two main limitations: (1) capturing geometric features while preserving symmetries, and (2) generalizing novel antigen interfaces. Despite recent advancements, these methods often fail to accurately capture molecular interactions and maintain structural integrity. To address these challenges, we propose AbMEGD, an end-to-end framework integrating Multi-scale Equivariant Graph Diffusion for antibody sequence and structure co-design. Leveraging advanced geometric deep learning, AbMEGD combines atomic-level geometric features with residue-level embeddings, capturing local atomic details and global sequence-structure interactions. Its E(3)-equivariant diffusion method ensures geometric precision, computational efficiency, and robust generalizability for complex antigens. Furthermore, experiments using the SAbDab database demonstrate a 10.13% increase in amino acid recovery, 3.32% rise in improvement percentage, and a 0.062 Å reduction in root mean square deviation within the critical CDR-H3 region compared to DiffAb, a leading antibody design model. These results highlight AbMEGD’s ability to balance structural integrity with improved functionality, establishing a new benchmark for sequence-structure co-design and affinity optimization. The code is available at: https://github.com/Patrick221215/AbMEGD.
3214: MPPQ: Enhancing Post-Training Quantization for LLMs via Mixed Supervision, Proxy Rounding, and Pre-Searching
Authors: Mingrun Wei, Yeyu Yan, Dong Wang
Location: Guangzhou | Day: TBD
Show Abstract
Recently, post-training quantization (PTQ) methods for large language models (LLMs) have primarily focused on tackling the challenges caused by outliers. Scaling transformation has proven to be effective, but how to enhance the performance of extremely low-bitwidth (e.g., 2-bit) PTQ under it remains largely unexplored. In this work, a new PTQ framework, namely MPPQ, is established. Specifically, MPPQ first proposes an enhanced reconstruction loss based on Mixed metric supervision to mitigate the distribution inconsistency caused by quantization while providing strong regularization for learnable parameters.
Secondly, we introduce a Proxy-based adaptive rounding scheme for weight quantization, which replaces the round-to-nearest (RTN) function to minimize the overall quantization error through element-wise scaling. Furthermore, a coarse factor Pre-searching mechanism is presented to ensure proper coordination between quantization and clipping patterns while achieving optimal initialization of clipping factors before training.
Extensive experiments show that MPPQ consistently outperforms state-of-the-art methods in low-bit quantization settings. For instance, the perplexity on WikiText2 can be dramatically reduced to 8.85 (3.9 ↓ vs. 12.75 for the latest method, LRQuant) for the LLaMA-2-7B model quantized with W4A4.
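For context, the round-to-nearest baseline that MPPQ's proxy-based rounding replaces can be sketched as follows (per-tensor asymmetric quantization; a simplification, not MPPQ itself).

```python
import torch

def rtn_quantize(w, bits=4):
    """Round-to-nearest (RTN) uniform quantization, per-tensor and
    asymmetric for brevity; returns the dequantized weights."""
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()) / qmax
    zero = torch.round(-w.min() / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    return (q - zero) * scale

w = torch.randn(128, 128)
mse = (w - rtn_quantize(w, bits=4)).pow(2).mean()   # reconstruction error
```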
3217: Exploring the Over-smoothing Problem of Graph Neural Networks for Graph Classification: An Entropy-based Viewpoint
Authors: Feifei Qian, Lu Bai, Lixin Cui, Ming Li, Hangyuan Du, Yue Wang, Edwin Hancock
Location: Guangzhou | Day: TBD
Show Abstract
Over-smoothing has emerged as a major challenge in the development of Graph Neural Networks (GNNs). While existing state-of-the-art methods effectively mitigate the diminishing distance between nodes and improve the performance of node classification, they tend to fall short on graph-level tasks. This paper introduces a novel entropy-based perspective to explore the over-smoothing problem while simultaneously enhancing the distinguishability of non-isomorphic graphs. We provide a theoretical analysis of the relationship between smoothness and entropy for graphs, highlighting how over-smoothing in high-entropic regions negatively impacts graph classification performance. To tackle this issue, we propose a simple yet effective method to Sample and Discretize node features in high-Entropic regions (SDE), aiming to preserve critical and complicated structural information. Moreover, we introduce a new evaluation metric to assess over-smoothing for graph-level tasks, focusing on node distributions. Experimental results demonstrate that the proposed SDE method significantly outperforms existing state-of-the-art methods, establishing a new benchmark in the field of GNNs.
3233: Unlocking the Potential of Lightweight Quantized Models for Deepfake Detection
Authors: Renshuai Tao, Ziheng Qin, Yifu Ding, Chuangchuang Tan, Jiakai Wang, Wei Wang
Location: Guangzhou | Day: TBD
Show Abstract
Deepfake detection is increasingly crucial due to the rapid rise of AI-generated content. Existing methods achieve high performance by relying on computationally intensive large models, making real-time detection on resource-constrained edge devices challenging. Given that deepfake detection is a binary classification task, there is potential for model compression and acceleration. In this paper, we propose a low-bit quantization framework for lightweight and efficient deepfake detection. The Connected Quantized Block extracts common forgery features via the quantized path and retains method-specific textures through shortcut connections. Additionally, the Shifted Logarithmic Redistribution Quantizer mitigates information loss in near-zero domains by unfolding the unbalanced activations, enabling finer quantization granularity. Comprehensive experiments demonstrate that this new framework reduces computational costs by 10.8x and storage requirements by 12.4x while maintaining high detection performance, even surpassing SOTA methods with less than 5% of the FLOPs, paving the way for efficient deepfake detection in resource-limited scenarios.
3248: MedualTime: A Dual-Adapter Language Model for Medical Time Series-Text Multimodal Learning
Authors: Jiexia Ye, Weiqi Zhang, Ziyue Li, Jia Li, Meng Zhao, Fugee Tsung
Location: Guangzhou | Day: TBD
Show Abstract
The recent rapid advancements in language models (LMs) have garnered attention in medical time series-text multimodal learning.
However, existing contrastive learning-based and prompt-based LM approaches tend to be biased, often assigning a primary role to the time series modality while treating the text modality as secondary. We classify these approaches under a temporal-primary paradigm, which may overlook the unique and critical task-relevant information embedded in the text modality, such as clinical reports, thus failing to fully leverage the mutual benefits and complementarity of different modalities.
To fill this gap, we propose a novel textual-temporal multimodal learning paradigm that enables either modality to serve as the primary one while being enhanced by the other, thereby effectively capturing modality-specific information and fostering cross-modal interaction. Specifically, we design MedualTime, a language model composed of dual adapters that implement temporal-primary and textual-primary modeling simultaneously. Within each adapter, lightweight adaptation tokens are injected into the top layers of the LM to encourage high-level modality fusion. The shared LM pipeline of the dual adapters not only achieves adapter alignment but also enables efficient fine-tuning, reducing computational resources. Empirically, MedualTime demonstrates superior performance on medical data, achieving notable improvements of 8% in accuracy and 12% in F1 in supervised settings.
Furthermore, MedualTime's transferability is validated by few-shot transfer experiments from coarse-grained to fine-grained medical data.
3251: IterMeme: Expert-Guided Multimodal LLM for Interactive Meme Creation with Layout-Aware Generation
Authors: Yaqi Cai, Shancheng Fang, Yadong Qu, Xiaorui Wang, Meng Shao, Hongtao Xie
Location: Guangzhou | Day: TBD
Show Abstract
Meme creation is a creative process that blends images and text. However, existing methods lack critical components, failing to support intent-driven caption-layout generation and personalized generation, making it difficult to generate high-quality memes. To address this limitation, we propose IterMeme, an end-to-end interactive meme creation framework that utilizes a unified Multimodal Large Language Model (MLLM) to facilitate seamless collaboration among multiple components. To overcome the absence of a caption-layout generation component, we develop a robust layout representation method and construct a large-scale image-caption-layout dataset, MemeCap, which enhances the model’s ability to comprehend emotions and coordinate caption-layout generation effectively.
To address the lack of a personalization component, we introduce a parameter-shared dual-LLM architecture that decouples the intricate representations of reference images and text. Furthermore, we incorporate the expert-guided M³OE for fine-grained identity properties (IP) feature extraction and cross-modal fusion. By dynamically injecting features into every layer of the model, we enable adaptive refinement of both visual and semantic information.
Experimental results demonstrate that IterMeme significantly advances the field of meme creation by delivering consistently high-quality outcomes. The code, model, and dataset will be open-sourced to the community.
3263: Flow Matching Based Sequential Recommender Model
Authors: Feng Liu, Lixin Zou, Xiangyu Zhao, Min Tang, Liming Dong, Dan Luo, Xiangyang Luo, Chenliang Li
Location: Guangzhou | Day: TBD
Show Abstract
Generative models, particularly diffusion models, have emerged as powerful tools for sequential recommendation. However, accurately modeling user preferences remains challenging due to the noise perturbations inherent in the forward and reverse processes of diffusion-based methods. To this end, this study introduces FMRec, a Flow Matching based model that employs a straight flow trajectory and a modified loss tailored for the recommendation task. Additionally, from the diffusion-model perspective, we integrate a reconstruction loss to improve robustness against noise perturbations, thereby retaining user preferences during the forward process. In the reverse process, we employ a deterministic reverse sampler, specifically an ODE-based updating function, to eliminate unnecessary randomness, thereby ensuring that the generated recommendations closely align with user needs. Extensive evaluations on four benchmark datasets reveal that FMRec achieves an average improvement of 6.53% over state-of-the-art methods. The replication code is available at https://github.com/FengLiu-1/FMRec.
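The deterministic reverse process can be illustrated with a plain Euler ODE sampler over a learned velocity field; `velocity_net`, `cond`, and `dim` are hypothetical placeholders, and FMRec's actual updating function may differ.

```python
import torch

@torch.no_grad()
def sample(velocity_net, cond, dim, steps=50):
    """Deterministic reverse process: integrate dx/dt = v(x, t, cond) from
    noise (t=0) to data (t=1) with Euler steps; no noise is re-injected, so
    repeated runs with the same condition yield the same recommendation."""
    x = torch.randn(cond.shape[0], dim)        # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.shape[0],), i * dt)
        x = x + velocity_net(x, t, cond) * dt  # Euler update along the flow
    return x
```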
3275: State Revisit and Re-explore: Bridging Sim-to-Real Gaps in Offline-and-Online Reinforcement Learning with An Imperfect Simulator
Authors: Xingyu Chen, Jiayi Xie, Zhijian Xu, Ruixun Liu, Shuai Yang, Zeyang Liu, Lipeng Wan, Xuguang Lan
Location: Guangzhou | Day: TBD
Show Abstract
In reinforcement learning (RL) based robot skill acquisition, a high-fidelity simulator is usually indispensable yet unattainable, since real environment dynamics are difficult to model, which leads to severe sim-to-real gaps. Existing methods address this problem by combining offline and online RL to jointly learn transferable policies from limited offline data and imperfect simulators. However, due to unrestricted exploration in the imperfect simulator, hybrid offline-and-online RL methods inevitably suffer from low sample efficiency and insufficient state-action space coverage during training. To solve this problem, we propose a State Revisit and Re-exploration (SR2) hybrid offline-and-online RL framework. In particular, the proposed algorithm employs a meta-policy and a sub-policy, where the meta-policy aims to find high-quality states in the offline trajectories for online exploration, and the sub-policy learns the robot skill using mixed offline and online data. By introducing the state revisit and re-exploration mechanism, our approach efficiently improves performance on a set of sim-to-real robotic tasks. Through extensive simulation and real-world tasks, we demonstrate the superior performance of our approach against other state-of-the-art methods.
3276: A Multi-Granularity Clustering Approach for Federated Backdoor Defense with the Adam Optimizer
Authors: Jidong Yuan, Qihang Zhang, Naiyue Chen, Shengbo Chen, Baomin Xu
Location: Guangzhou | Day: TBD
Show Abstract
Federated learning is vulnerable to backdoor attacks due to its distributed nature and the inability to access local datasets. Meanwhile, the heterogeneity of distributed data further complicates the detection of such attacks. However, existing defense strategies often overlook the presence of non-stationary objectives and noisy gradients across multiple clients, making it challenging to accurately and efficiently identify malicious participants. To address these challenges, we propose a backdoor defense method for Federated Learning with the Adam optimizer and multi-granularity Clustering (FLAC), incorporating both coarse-grained and fine-grained clustering mechanisms to neutralize backdoor attacks. First, the Adam optimizer accelerates the learning process by mitigating the impact of noisy gradients and addressing the non-stationary objectives posed by different clients under attack. Second, a multi-granularity clustering process is used to differentiate between benign clients and potential attackers. This is followed by an adaptive clipping strategy to further alleviate the influence of malicious attackers. Our theoretical analysis demonstrates the consistent convergence of Adam in a federated backdoor defense environment. Extensive experimental results validate the effectiveness of our defense approach.
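A server-side Adam step over the aggregated client pseudo-gradient can be sketched as below (a FedAdam-style construction under our own assumptions; FLAC's exact integration with clustering and clipping may differ).

```python
import numpy as np

class ServerAdam:
    """Server-side Adam applied to the averaged client update, treated as a
    pseudo-gradient; the moment estimates damp noisy, non-stationary updates."""
    def __init__(self, dim, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
        self.m, self.v = np.zeros(dim), np.zeros(dim)
        self.lr, self.b1, self.b2, self.eps, self.t = lr, b1, b2, eps, 0

    def step(self, w, client_updates):
        g = -np.mean(client_updates, axis=0)   # pseudo-gradient from deltas
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)   # bias correction
        v_hat = self.v / (1 - self.b2 ** self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```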
3279: Multi-Scale Temporal Neural Network for Stock Trend Prediction Enhanced by Temporal Hyperedge Learning
Authors: Lingyun Song, Haodong Li, Siyu Chen, Xinbiao Gan, Binze Shi, Jie Ma, Yudai Pan, Xiaoqi Wang, Xuequn Shang
Location: Guangzhou | Day: TBD
Show Abstract
Existing research in Stock Trend Prediction (STP) focuses on temporal features extracted from a temporal sequence of stock data with a look-back window, which frequently leads to the omission of important periodic patterns, such as weekly and monthly variations in stock prices. Furthermore, these methods examine stocks individually, ignoring the temporal variation patterns among stocks that share higher-order relationships, like those within the same industry. These relationships typically provide contextual insights into market investments influencing stock price fluctuations. To tackle these issues, we propose a Multi-Scale Temporal Neural Network (MSTNN) framework tailored for STP. This architecture explores the periodic fluctuation behaviors of individual stocks through an innovative 3D convolutional neural network, alongside examining temporal variation patterns of stocks linked to specific industries via a temporal hypergraph attention mechanism. Empirical results from two real-world benchmark datasets show that MSTNN significantly outperforms prior state-of-the-art STP methods. The code of our MSTNN is available at https://github.com/sunlitsong/MSTNN.
3295: Volumetric Axial Disentanglement Enabling Advancing in Medical Image Segmentation
Authors: Xingru Huang, Jian Huang, Yihao Guo, Tianyun Zhang, Zhao Huang, Yaqi Wang, Ruipu Tang, Guangliang Cheng, Shaowei Jiang, Zhiwen Zheng, Jin Liu, Renjie Ruan, Xiaoshuai Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Information retrieved from three dimensions is treated uniformly in CNN-based volumetric segmentation methods. However, such neglect of axial disparities fails to capture true spatio-temporal variations. This paper introduces the volumetric axial disentanglement to address the disparities in spatial information along different axial dimensions. Building on this concept, we propose the Post-Axial Refiner (PaR) module to refine segmentation masks by implementing axial disentanglement on the specific axis of the volumetric medical sequences. As a plug-and-play enhancement to existing volumetric segmentation architecture, PaR further utilizes specialized attention approaches to learn disentangled post-decoding features, enhancing spatial representation and structural detail. Validation on various datasets demonstrates PaR’s consistent elevation of segmentation precision and boundary clarity across 11 baselines and different imaging modalities, achieving state-of-the-art performance on multiple datasets. Experimental tests demonstrate the ability of volumetric axial disentanglement to refine the segmentation of volumetric medical images. Code is released at https://github.com/IMOP-lab/PaR-Pytorch.
3314: SSTrack: Sample-interval Scheduling for Lightweight Visual Object Tracking
Authors: Yutong Kou, Shubo Lin, Liang Li, Bing Li, Weiming Hu, Jin Gao
Location: Guangzhou | Day: TBD
Show Abstract
In recent years, CPU real-time object tracking has gained significant attention due to its broad applications such as UAV tracking. To maintain computational efficiency, most existing CPU real-time object trackers rely on lightweight backbones and employ a single initial template image without intermediate online templates. Although the appearance variance between the template and the search region is larger under this single-template setting, the representation ability of lightweight backbones is weaker, which poses a challenge when training lightweight object trackers. To address this issue, we propose SSTrack, a new easier-to-harder training schedule for lightweight object trackers. From the data perspective, our method designs a success-aware sample scheduler that gradually increases the share of difficult training samples with longer template-search time intervals and reduces the number of easier samples, so the training cost remains unchanged. From the optimization perspective, we utilize a gradient scaling strategy that retains the original training objective of easier samples despite the reduction in their quantity. With the collective effort from both perspectives, our method achieves state-of-the-art CPU real-time accuracy on 5 UAV-tracking benchmarks and 5 general object tracking benchmarks. Codes and models will be available at https://github.com/Kou-99/SSTrack.
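One way to realize such a success-aware schedule is sketched below; the class name, EMA form, and IoU threshold are illustrative assumptions rather than SSTrack's published scheduler.

```python
import random

class SuccessAwareSampler:
    """Hypothetical easier-to-harder sampler: the cap on the template-search
    frame interval grows with a running training-success estimate, so harder
    (longer-interval) pairs appear only once the tracker copes with easy ones."""
    def __init__(self, max_interval=100, ema=0.99):
        self.max_interval, self.ema, self.success = max_interval, ema, 0.0

    def update(self, iou):
        # exponential moving average of "was this prediction good enough?"
        self.success = self.ema * self.success + (1 - self.ema) * (iou > 0.5)

    def sample_interval(self):
        cap = max(1, int(self.max_interval * self.success))
        return random.randint(1, cap)   # frame gap between template and search
```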
3315: UniCT Depth: Event-Image Fusion Based Monocular Depth Estimation with Convolution-Compensated ViT Dual SA Block
Authors: Luoxi Jing, Dianxi Shi, Zhe Liu, Songchang Jin, Chunping Qiu, Ziteng Qiao, Yuxian Li, Jianqiang Xia
Location: Guangzhou | Day: TBD
Show Abstract
Depth estimation plays a crucial role in 3D scene understanding and is extensively used in a wide range of vision tasks. Image-based methods struggle in challenging scenarios, while event cameras offer high dynamic range and temporal resolution but face difficulties with sparse data. Combining event and image data provides significant advantages, yet effective integration remains challenging. Existing CNN-based fusion methods struggle with occlusions and depth disparities due to limited receptive fields, while Transformer-based fusion methods often lack deep modality interaction. To address these issues, we propose UniCT Depth, an event-image fusion method that unifies CNNs and Transformers to model local and global features. We propose the Convolution-compensated ViT Dual SA (CcViT-DA) Block, designed for the encoder, which integrates Context Modeling Self-Attention (CMSA) to capture spatial dependencies and Modal Fusion Self-Attention (MFSA) for effective cross-modal fusion. Furthermore, we design the tailored Detail Compensation Convolution (DCC) Block to improve texture details and enhance edge representations. Extensive experiments show that UniCT Depth outperforms existing image, event, and fusion-based monocular depth estimation methods across key metrics.
3322: ElaD-Net: An Elastic Semantic Decoupling Network for Lesion Segmentation in Breast Ultrasound Images
Authors: Lijuan Xu, Kai Wang, Fuqiang Yu, Fenghua Tong, Mengran Li, Dawei Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Breast diseases pose a significant threat to women's health. Automatic lesion segmentation in breast ultrasound images (BUSI) plays a crucial role in fast diagnosis. While various enhanced U-Net-based models have achieved success in multi-scale feature analysis and handling blurred boundaries, two key challenges persist that could guide the improvement of BUSI segmentation networks: 1) significant fluctuations in the similarity of pixel intensity distributions between the lesion and surrounding tissues, and 2) inconsistent transmission of spatial detail due to multi-scale lesion sampling. These issues highlight the necessity of elastic semantic understanding and consistency control. To this end, we propose ElaD-Net, an Elastic Semantic Decoupling Network for lesion segmentation in BUSI. The network uses a pre-trained EfficientNet-B2 for multi-scale encoding of BUSI. The decoding stage features two key modules: Elastic Semantic Decoupling (ESD) and Spatial Semantic Reconstruction (SSR). ESD learns and decouples multi-frequency semantics in multi-scale channels with a self-calibration mechanism, enabling dynamic adjustment of receptive depth to resist similarity fluctuations. SSR further optimizes ESD outputs via feature branching, compression, and excitation to ensure spatial semantic consistency, thereby reconstructing the edge and body separately.
3390: PanComplex: Leveraging Complex-Valued Neural Networks for Enhanced Pansharpening
Authors: Chunhui Luo, Dong Li, Xiaoliang Ma, Xin Lu, Zhiyuan Wang, Jiangtong Tan, Xueyang Fu
Location: Guangzhou | Day: TBD
Show Abstract
Pansharpening combines panchromatic and low-resolution multispectral images to generate high-resolution multispectral images. Previous studies have explored the connection between pansharpening and the frequency domain, but mostly in the real-valued domain, leaving the complex domain relatively unexplored. To redefine the pansharpening task, we propose a complex-valued spatial-frequency dual-domain framework, PanComplex. To achieve this, we first establish complex representations and introduce basic complex operators tailored to pansharpening, enabling the transformation of multispectral real-valued signals into the complex domain for learning. We then model both spatial and frequency branches to capture global frequency features and local spatial features comprehensively. Finally, we employ a complex-based interaction module to fuse the spatial and frequency features, achieving complementary information across both domains. By using the representation power of the complex domain, PanComplex effectively extracts complementary features from PAN and MS images, thereby enhancing pansharpening performance. Experiments on multiple datasets demonstrate that our method achieves optimal performance with the fewest parameters and exhibits strong generalization ability to other tasks. The source code for this work is publicly available at https://github.com/lch-ustc/PanComplex.
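A basic complex operator of the kind the abstract describes can be built from two real convolutions, following the identity (a + ib)(w + iv) = (aw - bv) + i(av + bw); this is a generic construction, not necessarily PanComplex's exact operator.

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex-valued convolution realized with two real convolutions:
    real and imaginary parts of the kernel act on both input components."""
    def __init__(self, in_ch, out_ch, k=3, padding=1):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, k, padding=padding)  # kernel real part
        self.conv_i = nn.Conv2d(in_ch, out_ch, k, padding=padding)  # kernel imag part

    def forward(self, real, imag):
        out_real = self.conv_r(real) - self.conv_i(imag)
        out_imag = self.conv_r(imag) + self.conv_i(real)
        return out_real, out_imag

# usage: feed the real/imaginary channels of a frequency-domain feature map
re, im = ComplexConv2d(4, 8)(torch.randn(1, 4, 32, 32), torch.randn(1, 4, 32, 32))
```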
3399: EfficientPIE: Real-Time Prediction on Pedestrian Crossing Intention with Sole Observation
Authors: Fang Qu, Pengzhan Zhou, Yuepeng He, Kaixin Gao, Youyu Luo, Xin Feng, Yu Liu, Songtao Guo
Location: Guangzhou | Day: TBD
Show Abstract
Present Advanced Driving Assistance Systems (ADAS) respond to dangerous pedestrian crossings only after the incident occurs, occasionally causing severe accidents due to the stringent response window. Inferring pedestrian crossing intention may help vehicles act in advance and enhance safety by predicting the crossing probability. Recent studies usually ignore the demand for real-time forecasting required in realistic driving scenarios, and mainly focus on improving model representation capacity on public datasets by increasing modality and observation time. Consequently, a new framework named EfficientPIE is proposed to predict pedestrian crossing intention in real time with a sole observation of the incident. To achieve reliable predictions, we propose intention-domain-based incremental learning to relieve forgetting and promote performance with a progressive perturbation method. Our EfficientPIE outperforms all SOTA models on the PIE and JAAD datasets, running nearly 7.4x faster than the previously fastest model. Our code is available at https://github.com/heinideyibadiaole/EfficientPIE.
3425: FedDLAD: A Federated Learning Dual-Layer Anomaly Detection Framework for Enhancing Resilience Against Backdoor Attacks
Authors: Binbin Ding, Penghui Yang, Sheng-Jun Huang
Location: Guangzhou | Day: TBD
Show Abstract
In Federated Learning (FL), the decentralized nature of client training introduces vulnerabilities, notably backdoor attacks. Prevailing anomaly detection approaches typically perform binary classification, dividing clients into trusted and untrusted groups. However, these methods face two critical challenges: the insider threat, where malicious clients concealed within the trusted group compromise the global model, and the benign exclusion, where legitimate contributions from benign clients are mistakenly classified as untrusted and disregarded. These issues weaken both the robustness and fairness of FL systems, exposing inherent defense vulnerabilities. To address these challenges,
we propose FedDLAD, a Federated Learning Dual-Layer Anomaly Detection framework designed to enhance resilience against backdoor attacks. The framework leverages the Connectivity-Based Outlier Factor (COF) module to perform a robust initial classification of clients by analyzing structural data connectivity. The Interquartile Range (IQR) module further reinforces this by mitigating the insider threat through the removal of residual malicious influences within the trusted group. Furthermore, the Pardon module dynamically reintegrates misclassified benign clients from the untrusted group, thereby preserving their valuable contributions and addressing the benign exclusion. We conduct extensive evaluations of FedDLAD against state-of-the-art defenses on real-world datasets, demonstrating its superior ability to reduce backdoor attack success rates while maintaining robust model performance. Code is available at: https://github.com/dingbinb/FedDLAD.
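The IQR module's outlier rule can be illustrated with the standard Tukey-fence test; the anomaly scores here are hypothetical, e.g., distances of client updates to a trusted centroid.

```python
import numpy as np

def iqr_filter(scores):
    """Flag indices whose score falls outside the Tukey fences
    (Q1 - 1.5*IQR, Q3 + 1.5*IQR), the classic IQR outlier rule."""
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [i for i, s in enumerate(scores) if not (lo <= s <= hi)]

# e.g., scores as distances of each trusted-group update to the group mean
outliers = iqr_filter([0.9, 1.0, 1.1, 0.95, 4.2])
# -> [4]: the fifth client is flagged as a residual malicious update
```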
3426: T2S: High-resolution Time Series Generation with Text-to-Series Diffusion Models
Authors: Yunfeng Ge, Jiawei Li, Yiji Zhao, Haomin Wen, Zhao Li, Meikang Qiu, Hongyan Li, Ming Jin, Shirui Pan
Location: Guangzhou | Day: TBD
Show Abstract
Text-to-Time Series generation holds significant potential to address challenges such as data sparsity, imbalance, and limited availability of multimodal time series data across domains. While diffusion models have achieved remarkable success in Text-to-X (e.g., vision and audio data) generation, their use in time series generation remains limited. Existing approaches face two critical limitations: (1) reliance on domain-specific captions that generalize poorly, and (2) inability to generate time series of arbitrary length, limiting real-world use. In this work, we first introduce a new multimodal dataset containing over 600,000 high-resolution text-time series pairs. Second, we propose Text-to-Series (T2S), a diffusion-based framework that bridges the gap between natural language and time series in a domain-agnostic manner. It employs a length-adaptive VAE to encode time series of varying lengths into consistent latent embeddings. On top of that, T2S effectively aligns textual representations with latent embeddings by utilizing Flow Matching and employing DiT as the denoiser. We train T2S in an interleaved paradigm across multiple lengths, allowing it to generate sequences of arbitrary length. Extensive evaluations demonstrate that T2S achieves state-of-the-art performance across 13 datasets spanning 12 domains.
3436: MMET: A Multi-Input and Multi-Scale Transformer for Efficient PDEs Solving
Authors: Yichen Luo, Jia Wang, Dapeng Lan, Yu Liu, Zhibo Pang
Location: Guangzhou | Day: TBD
Show Abstract
Partial Differential Equations (PDEs) are fundamental for modeling physical systems, yet solving them in a generic and efficient manner using machine learning-based approaches remains challenging due to limited multi-input and multi-scale generalization capabilities, as well as high computational costs. This paper proposes the Multi-input and Multi-scale Efficient Transformer (MMET), a novel framework designed to address the above challenges. MMET decouples mesh and query points as two sequences and feeds them into the encoder and decoder, respectively, and uses a Gated Condition Embedding (GCE) layer to embed input variables or functions with varying dimensions, enabling effective solutions for multi-scale and multi-input problems. Additionally, a Hilbert curve-based reserialization and patch embedding mechanism decrease the input length. This significantly reduces the computational cost when dealing with large-scale geometric models. These innovations enable efficient representations and support multi-scale resolution queries for large-scale and multi-input PDE problems. Experimental evaluations on diverse benchmarks spanning different physical fields demonstrate that MMET outperforms SOTA methods in both accuracy and computational efficiency. This work highlights the potential of MMET as a robust and scalable solution for real-time PDE solving in engineering and physics-based applications, paving the way for future explorations into pre-trained large-scale models in specific domains. This work is open-sourced at https://github.com/YichenLuo-0/MMET.
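The Hilbert-curve reserialization relies on the classic 2-D Hilbert index, sketched below in its standard form; MMET's actual serialization of mesh points would build on this kind of mapping, and the grid setup here is an assumption.

```python
def hilbert_index(order, x, y):
    """Map cell (x, y) on a 2^order x 2^order grid to its position along the
    Hilbert curve; nearby cells in 2-D stay nearby in the 1-D sequence, which
    is why reserializing points this way preserves locality before patching."""
    d = 0
    s = 2 ** (order - 1)
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                       # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

# cells sorted by hilbert_index can then be grouped into contiguous patches
```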
3483: GSDet: Gaussian Splatting for Oriented Object Detection
Authors: Zeyu Ding, Jiaqi Zhao, Yong Zhou, Wen-liang Du, Hancheng Zhu, Rui Yao
Location: Guangzhou | Day: TBD
Show Abstract
Oriented object detection has advanced with the development of convolutional neural networks (CNNs) and transformers. However, modern detectors still rely on predefined object candidates, such as anchors in CNN-based methods or queries in transformer-based methods, which struggle to capture spatial information effectively. To address these limitations, we propose GSDet, a novel framework that formulates oriented object detection as Gaussian splatting. Specifically, our approach performs detection within a 3D feature space constructed from image features, where 3D Gaussians are employed to represent oriented objects. These 3D Gaussians are projected onto the image plane to form 2D Gaussians, which are then transformed into oriented boxes. Furthermore, we optimize the mean, anisotropic covariance, and confidence scores of these randomly initialized 3D Gaussians using a decoder that incorporates 3D Gaussian sampling. Moreover, our method exhibits flexibility, enabling adaptive control and a dynamic number of Gaussians during inference. Experiments on 3 datasets indicate that GSDet achieves AP50 gains of 0.7% on DIOR-R, 0.3% on DOTA-v1.0, and 0.55% on DOTA-v1.5 when evaluated with adaptive control, outperforming mainstream detectors.
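The 3-D-to-2-D Gaussian projection follows the standard EWA-style linearization used in Gaussian splatting; a sketch under assumed pinhole intrinsics (fx, fy) is given below, with GSDet's box conversion and confidence handling omitted.

```python
import numpy as np

def project_gaussian(mu, Sigma, R, t, fx, fy):
    """Project a 3-D Gaussian (mean mu, covariance Sigma) to the image plane:
    move to camera space, then linearize the perspective projection with its
    Jacobian J, giving a 2-D Gaussian N(pi(mu_c), J Sigma_c J^T)."""
    mu_c = R @ mu + t                         # camera-space mean
    X, Y, Z = mu_c
    J = np.array([[fx / Z, 0.0, -fx * X / Z**2],
                  [0.0, fy / Z, -fy * Y / Z**2]])
    Sigma_c = R @ Sigma @ R.T                 # rotate covariance to camera space
    mean2d = np.array([fx * X / Z, fy * Y / Z])
    cov2d = J @ Sigma_c @ J.T
    return mean2d, cov2d

# an oriented box could then be read off the eigenvectors/eigenvalues of cov2d
```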
3486: Distributed Cascaded Manifold Hashing Network for Compact Image Set Representation
Authors: Xiaxin Wang, Haoyu Cai, Xiaobo Shen, Xia Wu
Location: Guangzhou | Day: TBD
Show Abstract
Conventional image set methods typically learn from image sets stored in a single location. However, in real-world applications, image sets are often distributed across different locations. Learning from such distributed sets using deep neural networks poses challenges for efficient image set classification and retrieval. To address this, we propose the Distributed Cascaded Manifold Hashing Network (DCMHN) for compact image set representation. DCMHN represents each image set using an SPD manifold and utilizes a manifold hashing network to generate hash codes, enabling efficient classification and retrieval. The network is trained in a cascaded manner, where the bilinear mapping in the BiMap layer is learned first, followed by joint learning of the hash function and classifier in the hash layer. DCMHN enforces local consistency on global variables across neighboring nodes, allowing parallel optimization. Extensive experiments on three benchmark image set datasets demonstrate that the proposed DCMHN achieves competitive accuracies in distributed settings and outperforms state-of-the-art methods in terms of computation and storage efficiency.
3493: Rotation Invariant Spatial Networks for Single-View Point Cloud Classification
Authors: Feng Luan, Jiarui Hu, Changshi Zhou, Zhipeng Wang, Jiguang Yue, Yanmin Zhou, Bin He
Location: Guangzhou | Day: TBD
Show Abstract
Point cloud classification is critical for three-dimensional scene understanding. However, in real-world scenarios, depth cameras often capture partial, single-view point clouds of objects in different poses, making their accurate classification a challenge. In this paper, we propose a novel point cloud classification network that captures the detailed spatial structure of objects by constructing tetrahedra, in contrast to point-wise operations. Specifically, we propose a RISpaNet block to extract rotation-invariant features. A rotation-invariant property generation module is designed in RISpaNet for constructing rotation-invariant tetrahedron properties (RITPs). Meanwhile, a multi-scale pooling module and a hybrid encoder process the RITPs to generate integrated rotation-invariant features. Further, for single-view point clouds, a complete point cloud auxiliary branch and a part-whole correlation module are jointly employed to obtain complete point cloud features from partial point clouds. Experimental results on four public datasets show that our network outperforms other state-of-the-art methods. We achieve an overall accuracy of 94.7% (+2.0%) on ModelNet40, 93.4% (+5.9%) on MVP, 94.7% (+6.3%) on PCN, and 94.8% (+1.7%) on ScanObjectNN. Our project website is https://luxurylf.github.io/RISpaNet_project/.
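Rotation-invariant tetrahedron properties can be as simple as the six edge lengths plus the volume, all unchanged by rigid rotation; this is an illustrative choice, and the paper's RITPs may differ.

```python
import numpy as np

def tetra_invariants(p):
    """Rotation-invariant properties of a tetrahedron with vertices p (4, 3):
    edge lengths and volume depend only on relative geometry, so they are
    unaffected by rotating (or translating) the whole point cloud."""
    edges = [np.linalg.norm(p[i] - p[j])
             for i in range(4) for j in range(i + 1, 4)]
    vol = np.dot(p[1] - p[0], np.cross(p[2] - p[0], p[3] - p[0])) / 6.0
    return np.array(edges + [vol])

# sanity check: features match after a random rotation
p = np.random.randn(4, 3)
q, _ = np.linalg.qr(np.random.randn(3, 3))     # random orthogonal matrix
assert np.allclose(tetra_invariants(p)[:6], tetra_invariants(p @ q.T)[:6])
```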
3500: Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution
Authors: Zihang Liu, Zhenyu Zhang, Hao Tang
Location: Guangzhou | Day: TBD
Show Abstract
Diffusion-based image super-resolution (SR) methods have demonstrated remarkable performance. Recent advancements have introduced deterministic sampling processes that reduce inference from 15 iterative steps to a single step, thereby significantly improving the inference speed of existing diffusion models. However, their effectiveness remains limited when handling complex semantic regions due to the single-step inference.
To address this limitation, we propose SAMSR, a semantic-guided diffusion framework that incorporates semantic segmentation masks into the sampling process. Specifically, we introduce the SAM-Noise Module, which refines Gaussian noise using segmentation masks to preserve spatial and semantic features. Furthermore, we develop a pixel-wise sampling strategy that dynamically adjusts the residual transfer rate and noise strength based on pixel-level semantic weights, prioritizing semantically rich regions during the diffusion process. To enhance model training, we also propose a semantic consistency loss, which aligns pixel-wise semantic weights between predictions and ground truth.
Extensive experiments on both real-world and synthetic datasets demonstrate that SAMSR significantly improves perceptual quality and detail recovery, particularly in semantically complex images.
3502: Counterfactual Knowledge Maintenance for Unsupervised Domain Adaptation
Authors: Yao Li, Yong Zhou, Jiaqi Zhao, Wen-liang Du, Rui Yao, Bing Liu
Location: Guangzhou | Day: TBD
Show Abstract
Traditional unsupervised domain adaptation (UDA) struggles to extract rich semantics due to backbone limitations. Recent large-scale pre-trained visual-language models (VLMs) have shown strong zero-shot learning capabilities in UDA tasks. However, directly using VLMs results in a mixture of semantic and domain-specific information, complicating knowledge transfer. Complex scenes with subtle semantic differences are prone to misclassification, which in turn can result in the loss of features that are crucial for distinguishing between classes. To address these challenges, we propose a novel counterfactual knowledge maintenance UDA framework. Specifically, we employ counterfactual disentanglement to separate the representation of semantic information from domain features, thereby reducing domain bias. Furthermore, to clarify ambiguous class-specific visual information, we maintain the discriminative knowledge of both visual and textual information. This approach synergistically leverages multimodal information to preserve modality-specific distinguishable features. We conducted extensive experimental evaluations on several public datasets to demonstrate the effectiveness of our method. The source code is available at https://github.com/LiYaolab/CMKUDA.
3503: Where and How to Enhance: Discovering Bit-Width Contribution for Mixed Precision Quantization
Authors: Haidong Kang, Lianbo Ma, Guo Yu, Shangce Gao
Location: Guangzhou | Day: TBD
Show Abstract
Mixed precision quantization (MPQ) is an effective approach to achieving an accuracy-complexity trade-off for neural networks by assigning different bit-widths to network activations and weights in each layer. The typical way of existing MPQ methods is to optimize quantization policies (i.e., bit-width allocations) in a gradient descent manner, termed Differentiable MPQ (DMPQ). At the end of the search, the bit-width associated with the quantization parameter that has the largest value is selected to form the final mixed precision quantization policy, with the implicit assumption that the values of quantization parameters reflect each operation's contribution to accuracy improvement. While much has been discussed about MPQ's improvements, the bit-width selection process has received little attention. We study this problem and argue that the magnitude of quantization parameters does not necessarily reflect the actual contribution of a bit-width to task performance. We therefore propose a Shapley-based MPQ (SMPQ) method, which measures each bit-width operation's direct contribution to the MPQ task. To reduce computation cost, a Monte Carlo sampling-based approximation strategy is proposed for Shapley computation. Extensive experiments on mainstream benchmarks demonstrate that our SMPQ consistently outperforms gradient-based competitors, achieving state-of-the-art performance.
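The Monte Carlo approximation of Shapley values follows the classic permutation-sampling scheme; in this setting a "player" would be a candidate bit-width operation and `value_fn` a task-performance evaluator, both placeholders here.

```python
import random

def mc_shapley(players, value_fn, samples=200):
    """Monte Carlo Shapley estimate: average each player's marginal
    contribution over random orderings of all players."""
    phi = {p: 0.0 for p in players}
    for _ in range(samples):
        perm = random.sample(players, len(players))   # one random ordering
        coalition, prev = set(), value_fn(set())
        for p in perm:
            coalition.add(p)
            cur = value_fn(coalition)
            phi[p] += (cur - prev) / samples          # marginal contribution
            prev = cur
    return phi
```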
3505: Dual-Agent Reinforcement Learning for Automated Feature Generation
Authors: Wanfu Gao, Zengyao Man, Hanlin Pan, Kunpeng Liu
Location: Guangzhou | Day: TBD
Show Abstract
Feature generation involves creating new features from raw data to capture complex relationships among the original features, improving model robustness and machine learning performance. Current methods using reinforcement learning for feature generation have made feature exploration more flexible and efficient. However, several challenges remain: first, during feature expansion, a large number of redundant features are generated. When removing them, current methods only retain the best features each round, neglecting those that perform poorly initially but could improve later. Second, the state representation used by current methods fails to fully capture complex feature relationships. Third, there are significant differences between discrete and continuous features in tabular data, requiring different operations for each type. To address these challenges, we propose a novel dual-agent reinforcement learning method for feature generation. Two agents are designed: the first generates new features, and the second determines whether they should be preserved. A self-attention mechanism enhances state representation, and diverse operations distinguish interactions between discrete and continuous features. The experimental results on multiple datasets demonstrate that the proposed method is effective.
3512: LLM-enhanced Score Function Evolution for Causal Structure Learning
Authors: Zidong Wang, Fei Liu, Qi Feng, Qingfu Zhang, Xiaoguang Gao
Location: Guangzhou | Day: TBD
Show Abstract
Causal structure learning (CSL) plays a pivotal role in causality research and is often formulated as an optimization problem within score-and-search methods. Under the assumption of an infinite dataset and a predefined distribution, several well-established, consistent score functions have been shown to be both optimal and reliable for identifying ground-truth causal graphs. However, in practice, these idealized assumptions are often infeasible, which can result in CSL algorithms learning suboptimal structures. In this paper, we introduce L-SFE, a framework designed to automatically discover effective score functions by exploring the "score function space". L-SFE addresses this task from a bi-level optimization perspective. First, it leverages a Large Language Model (LLM) to interpret the characteristics of score functions and generate the corresponding code implementations. Next, L-SFE employs evolutionary algorithms, along with carefully designed operators, to search for solutions with higher fitness. Additionally, we take the BIC as an example and prove the consistency of the generated score functions. Experimental evaluations, conducted on discrete, continuous, and real datasets, demonstrate the high stability, generality, and effectiveness of L-SFE.
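For reference, the BIC score the abstract takes as an example decomposes per node; a sketch for discrete data is below, where the column-index conventions and helper names are our own assumptions.

```python
import numpy as np
from itertools import product

def bic_family(data, child, parents, arities):
    """BIC of one family (child given its parents) for discrete data:
    maximum log-likelihood minus (log N / 2) x free parameters.
    Summing over all nodes gives the decomposable BIC of the whole DAG."""
    N = len(data)
    r = arities[child]
    q_states = list(product(*[range(arities[p]) for p in parents]))
    ll = 0.0
    for ps in q_states:
        rows = data[np.all(data[:, parents] == ps, axis=1)] if parents else data
        if len(rows) == 0:
            continue
        counts = np.bincount(rows[:, child], minlength=r)
        nz = counts[counts > 0]
        ll += np.sum(nz * np.log(nz / len(rows)))   # sum_k n_jk log(n_jk/n_j)
    penalty = 0.5 * np.log(N) * (r - 1) * len(q_states)
    return ll - penalty

# BIC(G) = sum of bic_family(data, v, pa(v), arities) over all nodes v
```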
3514: Aggregation Mechanism Based Graph Heterogeneous Networks Distillation
Authors: Xiaobin Hong, Mingkai Lin, Xiangkai Ma, Wenzhong Li, Sanglu Lu
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness across various tasks but are often hindered by their high computational overhead. GNN-to-MLP distillation provides a promising remedy by transferring knowledge from complex GNNs to lightweight MLPs. However, existing methods largely overlook the differences in aggregation mechanisms and heterogeneous architectures. Simplifying such intricate information into an MLP potentially causes information loss or distortion, ultimately resulting in suboptimal performance. This paper proposes an aggregation-mechanism-enhanced GNN distillation framework (AMEND). AMEND introduces multi-scope aggregation context preservation to replicate the teacher’s broad aggregation scopes and an aggregation-enhanced centered kernel alignment method to match the teacher’s aggregation patterns. To ensure efficient and robust knowledge transfer, we integrate a manifold mixup strategy, enabling the student to capture the teacher’s insights into mixed data distributions. Experimental results on 8 standard and 4 large-scale datasets demonstrate that AMEND consistently outperforms state-of-the-art distillation methods.
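Centered kernel alignment (CKA) is a standard representation-similarity measure; the sketch below shows plain linear CKA between teacher and student feature matrices. The paper's aggregation-enhanced variant is not specified in the abstract, so this only illustrates the base quantity it builds on:

```python
# Minimal linear CKA between teacher and student feature matrices
# (n nodes x d features). The paper's aggregation-enhanced variant is not
# reproduced here; this only shows the standard CKA similarity it builds on.
import numpy as np

def linear_cka(X, Y):
    # Center features column-wise so CKA is invariant to mean shifts.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-based similarity: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, 'fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
teacher = rng.normal(size=(128, 64))           # e.g., GNN embeddings
student = teacher @ rng.normal(size=(64, 32))  # e.g., MLP embeddings
print(linear_cka(teacher, student))            # close to 1 for linear maps
```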
3518: Frequency-Aware Deep Depth from Focus
Authors: Tao Yan, Yingying Wang, Jiangfeng Zhang, Yuhua Qian, Jieru Jia, Lu Chen, Feijiang Li
Location: Guangzhou | Day: TBD
Show Abstract
In large aperture imaging, the shallow depth of field (DoF) phenomenon requires capturing multiple images at different focal levels, allowing us to infer depth information using depth from focus (DFF) techniques. However, most previous works design convolutional neural networks from a time domain perspective, often leading to blurred fine details in depth estimation. In this work, we propose a frequency-aware deep DFF network (FAD) that couples multi-scale spatial domain local features with frequency domain global structural features. Our main innovations include two key points: First, we introduce a frequency domain feature extraction module that uses the Fourier transform to transfer latent focus features into the frequency domain. This module adaptively captures essential frequency information for focus changes through element-wise multiplication, enhancing fine details in depth results while preserving global structural integrity. Second, the time-frequency joint module of FAD improves the consistency of depth information in sparse texture regions and the continuity in transition areas from both local and global complementary perspectives. Comprehensive experiments demonstrate that our model achieves compelling generalization and state-of-the-art depth prediction across various datasets. Additionally, it can be quickly adapted to real-world applications as a pre-trained model.
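The frequency-domain feature extraction idea can be pictured with a short PyTorch sketch: features are moved to the Fourier domain, reweighted element-wise by a learnable filter, and transformed back. Module and parameter names are illustrative assumptions, not the FAD implementation:

```python
# Sketch of a frequency-domain feature block in the spirit of FAD: move
# features to the Fourier domain, reweight frequencies with a learnable
# element-wise filter, and return to the spatial domain.
import torch
import torch.nn as nn

class FrequencyFilter(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        # One complex weight per channel/frequency (rfft2 halves the width).
        self.weight = nn.Parameter(
            torch.randn(channels, height, width // 2 + 1,
                        dtype=torch.cfloat) * 0.02
        )

    def forward(self, x):                      # x: (B, C, H, W)
        freq = torch.fft.rfft2(x, norm="ortho")
        freq = freq * self.weight              # adaptive frequency reweighting
        out = torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")
        return x + out                         # residual keeps spatial detail

x = torch.randn(2, 16, 32, 32)
print(FrequencyFilter(16, 32, 32)(x).shape)    # torch.Size([2, 16, 32, 32])
```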
3531: Towards Debiased Generalized Category Discovery
Authors: Pengcheng Guo, Yonghong Song, Boyu Wang
Location: Guangzhou | Day: TBD
Show Abstract
Generalized Category Discovery (GCD) aims at classifying unlabeled training data coming from old and novel classes by leveraging the information of partially labeled old classes. In this paper, we reveal that existing methods often suffer from competition between new and old classes, where the focus on learning new classes often results in a notable performance degradation on the old classes. Moreover, we delve into the reason behind this problem: the GCD classifier can be overconfident and biased towards the new class. With this insight, we propose Debiased GCD (DeGCD), a simple but effective approach that mitigates the bias caused by overconfidence in new categories via a debiased head. Specifically, we first propose a semantic calibration loss that aids the GCD classifier in debiasing by enforcing neighborhood prediction consistency with the latent representation of the debiased head. Furthermore, a debiased contrastive objective is proposed to refine the similarity matrix from the GCD classifier and the debiased classifier, suppressing the overconfidence in new classes in unlabeled data. In addition, an alignment constraint loss is designed to prevent the distribution of the old categories from being damaged by overconfidence in the new categories. Experiments on various datasets show that DeGCD achieves state-of-the-art performance and maintains a good balance between new and old classes. Moreover, the method can be seamlessly adapted to other GCD methods, not only achieving further performance gains but also effectively balancing the performance of new and old classes.
3532: Boosting Zero-shot Stereo Matching Using Large-Scale Mixed Images Sources in the Real World
Authors: Yuran Wang, Yingping Liang, Ying Fu
Location: Guangzhou | Day: TBD
Show Abstract
Stereo matching methods rely on dense pixel-wise ground truth labels, which are laborious to obtain, especially for real-world datasets. The scarcity of labeled data and domain gaps between synthetic and real-world images also pose notable challenges. In this paper, we propose a novel framework, BooSTer, that leverages both vision foundation models and large-scale mixed image sources, including synthetic, real, and single-view images. First, to fully unleash the potential of large-scale single-view images, we design a data generation strategy combining monocular depth estimation and diffusion models to generate dense stereo matching data from single-view images. Second, to tackle sparse labels in real-world datasets, we transfer knowledge from monocular depth estimation models, using pseudo-mono depth labels and a dynamic scale- and shift-invariant loss for additional supervision. Furthermore, we incorporate a vision foundation model as an encoder to extract robust and transferable features, boosting accuracy and generalization. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach, achieving significant improvements in accuracy over existing methods, particularly in scenarios with limited labeled data and domain shifts.
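Scale- and shift-invariant supervision against pseudo monocular depth follows a well-known pattern; a minimal sketch of the static core (the dynamic variant is not specified in the abstract), assuming a least-squares alignment of the prediction to the pseudo label before penalizing the residual:

```python
# Sketch of a scale- and shift-invariant depth loss for supervising
# predictions with pseudo monocular depth labels: solve least squares for
# the scale/shift aligning prediction to the pseudo label, then penalize
# the residual. Names are illustrative.
import torch

def ssi_loss(pred, pseudo, mask):
    p = pred[mask]
    t = pseudo[mask]
    # Closed-form least squares for scale s and shift b: min ||s*p + b - t||^2
    A = torch.stack([p, torch.ones_like(p)], dim=1)   # (N, 2)
    sol = torch.linalg.lstsq(A, t.unsqueeze(1)).solution
    aligned = p * sol[0, 0] + sol[1, 0]
    return (aligned - t).abs().mean()

pred = torch.rand(1, 64, 64)
pseudo = 2.5 * pred + 0.3 + 0.01 * torch.randn_like(pred)
mask = pseudo > 0
print(ssi_loss(pred, pseudo, mask))   # small: alignment removes scale/shift
```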
3533: Going Beyond Consistency: Target-oriented Multi-view Graph Neural Network
Authors: Sujia Huang, Lele Fu, Shuman Zhuang, Yide Qiu, Bo Huang, Zhen Cui, Tong Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Multi-view learning has emerged as a pivotal research area driven by the growing heterogeneity of real-world data, and graph neural network-based models, which model multi-view data as multi-view graphs, have achieved remarkable performance by revealing its deep semantics. However, by assuming cross-view consistency, most approaches collect not only task-relevant (determinative) semantics but also symbiotic yet task-irrelevant (incidental) factors that obscure model inference. Furthermore, these approaches often lack rigorous theoretical analysis that bridges training data to test data. To address these issues, we propose the Target-oriented Graph Neural Network (TGNN), a novel framework that goes beyond traditional consistency by prioritizing task-relevant information, ensuring alignment with the target. Specifically, TGNN employs a class-level dual-objective loss to minimize the classification similarity between determinative and incidental factors, accentuating the former while suppressing the latter during model inference. Meanwhile, to ensure consistency between the learned semantics and predictions in representation learning, we introduce a penalty term that aims to amplify the divergence between these two types of factors. Furthermore, we derive an upper bound on the loss discrepancy between training and test data, providing formal guarantees for generalization to test domains. Extensive experiments conducted on three types of multi-view datasets validate the superiority of TGNN.
3551: BMIP: Bi-directional Modality Interaction Prompt Learning for VLM
Authors: Song-Lin Lv, Yu-Yang Chen, Zhi Zhou, Ming Yang, Lan-Zhe Guo
Location: Guangzhou | Day: TBD
Show Abstract
Vision-language models (VLMs) have exhibited remarkable generalization capabilities, and prompt learning for VLMs has attracted great attention for its ability to adapt pre-trained VLMs to specific downstream tasks. However, existing studies mainly focus on single-modal prompts or uni-directional modality interaction, overlooking the powerful alignment effects resulting from the interaction between the vision and language modalities. To this end, we propose a novel prompt learning method called Bi-directional Modality Interaction Prompt (BMIP), which dynamically weights bi-modal information by learning information from the attention layer, enhancing trainability and inter-modal consistency compared to simple information aggregation methods. To evaluate the effectiveness of prompt learning methods, we propose a more realistic evaluation paradigm called open-world generalization, complementing the widely adopted cross-dataset transfer and domain generalization tasks. Comprehensive experiments on various datasets reveal that BMIP not only outperforms current state-of-the-art methods across all three evaluation paradigms but is also flexible enough to be combined with other prompt-based methods for consistent performance enhancement.
3575: Learn from Global Rather Than Local: Consistent Context-Aware Representation Learning for Multi-View Graph Clustering
Authors: Lele Fu, Bowen Deng, Sheng Huang, Tianchi Liao, Chuanfu Zhang, Chuan Chen
Location: Guangzhou | Day: TBD
Show Abstract
Multi-view graph clustering (MVGC) has attracted widespread interest owing to its ability to capture complementary information among views, thereby enhancing the performance of node clustering. Despite the impressive achievements of existing methods, they are limited by a common deficiency, namely the curse of the local manifold: they fail to perceive the global manifold structure. In light of this drawback, we propose a Consistent Context-Aware Representation Learning (CCARL) method for MVGC, aiming to learn node representations from the global space rather than just the local topology. Concretely, we define a set of anchors to establish a global coordinate system, which are optimally mapped to multi-view graphs with minimal cost via fused Gromov-Wasserstein optimal transport. To fuse the complementary information in various views, an attention mechanism is employed to integrate multiple graph embeddings into a consistent representation. By transforming to the global coordinate system connected with the anchors, the consistent representation captures contextual information, and its clustering-friendliness is further enhanced through a self-training strategy. Finally, extensive experiments on four multi-view graph datasets demonstrate the effectiveness of the proposed CCARL over existing MVGC methods.
3579: On the Learning with Augmented Class via Forests
Authors: Fan Xu, Wuyang Chen, Wei Gao
Location: Guangzhou | Day: TBD
Show Abstract
Decision trees and forests have achieved success in various real applications, most of which work with all testing classes known in the training data. In this work, we focus on learning with an augmented class via forests, where an augmented class may appear in testing data yet not in training data. We incorporate information about the augmented class into the trees’ splitting; that is, a new splitting criterion, the augmented Gini impurity, is introduced to exploit unlabeled data from the testing distribution. We then develop the Learning with Augmented Class via Forests (LACForest for short) approach, which constructs shallow forests according to the augmented Gini impurity and then splits forests with pseudo-labeled augmented instances for better performance. We also develop deep neural forests via an optimization objective based on our augmented Gini impurity, which essentially utilizes the representation power of neural networks for forests. Theoretically, we present a convergence analysis for our augmented Gini impurity, and we finally conduct experiments to evaluate our approaches. The code is available at https://github.com/nju-xuf/LACForest.
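The abstract does not spell out the augmented Gini impurity, so the following is only one plausible illustration (all names and the estimation rule are assumptions): the node's class distribution over known labels is extended with an estimated mass for the augmented class, derived from unlabeled test-distribution data falling into the node:

```python
# Illustrative sketch only: a standard Gini impurity extended with a
# pseudo-count for an augmented (unseen) class, estimated from the surplus
# of unlabeled test-distribution data that falls into this node. Not the
# paper's exact criterion.
import numpy as np

def augmented_gini(label_counts, unlabeled_in_node, unlabeled_total,
                   labeled_total, novelty_weight=1.0):
    labeled_in_node = label_counts.sum()
    # Expected unlabeled mass if the node held no augmented-class points.
    expected = unlabeled_total * labeled_in_node / max(labeled_total, 1)
    # Surplus unlabeled mass is attributed to the augmented class.
    aug_count = novelty_weight * max(unlabeled_in_node - expected, 0.0)
    counts = np.append(label_counts.astype(float), aug_count)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A node holding 30 class-0 and 10 class-1 labeled points, but attracting
# far more unlabeled points than expected, scores as more impure.
print(augmented_gini(np.array([30, 10]), unlabeled_in_node=80,
                     unlabeled_total=100, labeled_total=100))
```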
3581: Noise-Resistant Label Reconstruction Feature Selection for Partial Multi-Label Learning
Authors: Wanfu Gao, Hanlin Pan, Qingqi Han, Kunpeng Liu
Location: Guangzhou | Day: TBD
Show Abstract
The “curse of dimensionality” is prevalent across various data patterns, increasing the risk of model overfitting and leading to a decline in classification performance. However, few studies have focused on this issue in Partial Multi-label Learning (PML), where each sample is associated with a set of candidate labels, at least one of which is correct. Existing PML methods addressing this problem are mainly based on the low-rank assumption. However, the low-rank assumption is difficult to satisfy in practical situations and may lead to loss of high-dimensional information. Furthermore, we find that existing methods have poor ability to identify positive labels, which is important in real-world scenarios. In this paper, a PML feature selection method is proposed that considers two important characteristics of the dataset: the noise resistance of label relationships and label connectivity. Our proposed method utilizes the noise resistance of label relationships to disambiguate labels. Then the learning process is designed through a reformed low-rank assumption. Finally, representative labels are found through label connectivity, and the weight matrix is reconstructed to select features with strong ability to identify these labels. The experimental results on benchmark datasets demonstrate the superiority of the proposed method.
3585: FS-KEN: Few-shot Knowledge Graph Reasoning by Adversarial Negative Enhancing
Authors: Lingyuan Meng, Ke Liang, Zeyu Zhu, Xinwang Liu, Wenpeng Lu
Location: Guangzhou | Day: TBD
Show Abstract
Few-shot knowledge graph reasoning (FS-KGR) tries to infer missing facts in a knowledge graph using limited data (e.g., only 3 or 5 samples). Existing strategies have shown good performance by mining more supervised information for few-shot learning through meta-learning and self-supervised learning. However, the problem of insufficient samples has not been fundamentally solved. In this paper, we propose a novel algorithm based on adversarial learning for Enhancing Negative samples in few-shot FS-KGR scenarios, termed FS-KEN. Specifically, we are the first to use a GAN to conduct data augmentation in the FS-KGR scenario. FS-KEN uses policy-gradient GANs for negative sample augmentation, solving the gradient back-propagation issue in traditional GANs. The generator aims to produce high-quality negative entities, while the objective of the discriminator is to distinguish between generated entities and real entities. Comprehensive experiments conducted on two few-shot knowledge graph completion datasets reveal that FS-KEN surpasses other baseline models, achieving state-of-the-art results.
3591: Two-Stage Feature Generation with Transformer and Reinforcement Learning
Authors: Wanfu Gao, Zengyao Man, Zebin He, Yuhao Tang, Jun Gao, Kunpeng Liu
Location: Guangzhou | Day: TBD
Show Abstract
Feature generation is a critical step in machine learning, aiming to enhance model performance by capturing complex relationships within the data and generating meaningful new features. Traditional feature generation methods heavily rely on domain expertise and manual intervention, making the process labor-intensive and challenging to adapt to different scenarios. Although automated feature generation techniques address these issues to some extent, they often face challenges such as feature redundancy, inefficiency in feature space exploration, and limited adaptability to diverse datasets and tasks. To address these problems, we propose a Two-Stage Feature Generation (TSFG) framework, which integrates a Transformer-based encoder-decoder architecture with Proximal Policy Optimization (PPO). The encoder-decoder model in TSFG leverages the Transformer’s self-attention mechanism to efficiently represent and transform features, capturing complex dependencies within the data. PPO further enhances TSFG by dynamically adjusting the feature generation strategy based on task-specific feedback, optimizing the process for improved performance and adaptability. TSFG dynamically generates high-quality feature sets, significantly improving the predictive performance of machine learning models. Experimental results demonstrate that TSFG outperforms existing state-of-the-art methods in terms of feature quality and adaptability.
3600: R2DQG: A Quality Meets Diversity Framework for Question Generation over Knowledge Bases
Authors: Yimeng Ren, Yanhua Yu, Lizi Liao, Yuhu Shang, Kangkang Lu, Mingliang Yan
Location: Guangzhou | Day: TBD
Show Abstract
The task of Knowledge-Based Question Generation (KBQG) involves generating natural language questions from structured knowledge sources, posing unique challenges in balancing linguistic diversity and semantic relevance. Existing models often focus on maximizing surface-level similarity to ground-truth questions, neglecting the need for diverse syntactic forms and leading to semantic drift during generation. To overcome these challenges, we propose Refine-Reinforced Diverse Question Generation (R2DQG), a two-phase framework leveraging a generation-then-refinement paradigm. The Generator first constructs a diverse set of expressive templates using dependency parse tree similarity, capturing a wide range of syntactic patterns and styles. These templates guide the creation of question drafts, ensuring both diversity and semantic relevance. In the second phase, a Corrector module refines the drafts to mitigate semantic drift and enhance overall coherence and quality. Experiments on public datasets show that R2DQG outperforms state-of-the-art models in generating diverse, contextually accurate questions. Moreover, synthetic datasets generated by R2DQG enhance downstream QA performance, underscoring the practical utility of our approach.
3621: PNAct: Crafting Backdoor Attacks in Safe Reinforcement Learning
Authors: Weiran Guo, Guanjun Liu, Ziyuan Zhou, Ling Wang
Location: Guangzhou | Day: TBD
Show Abstract
Reinforcement Learning (RL) is widely used in tasks where agents interact with an environment to maximize rewards. Building on this foundation, Safe Reinforcement Learning (Safe RL) incorporates a cost metric alongside the reward metric, ensuring that agents adhere to safety constraints during decision-making. In this paper, we identify that Safe RL is vulnerable to backdoor attacks, which can manipulate agents into performing unsafe actions. First, we introduce the relevant concepts and evaluation metrics for backdoor attacks in Safe RL. We then present the first attack framework in the Safe RL field that uses both Positive and Negative Action samples (PNAct) to implant backdoors, where positive action samples provide reference actions and negative action samples indicate actions to be avoided. We theoretically point out the properties of PNAct and design an attack algorithm. Finally, we conduct experiments to evaluate the effectiveness of our proposed backdoor attack framework, evaluating it with the established metrics. This paper highlights the potential risks associated with Safe RL and underscores the feasibility of such attacks. Our code and supplementary material are available at https://github.com/azure-123/PNAct.
3632: Optimal Policy Adaptation Under Covariate Shift
Authors: Xueqing Liu, Qinwei Yang, Zhaoqing Tian, Ruocheng Guo, Peng Wu
Location: Guangzhou | Day: TBD
Show Abstract
Transfer learning of prediction models has been extensively studied, while the corresponding policy learning approaches are rarely discussed. In this paper, we propose principled approaches for learning the optimal policy in the target domain by leveraging two datasets: one with full information from the source domain and the other from the target domain with only covariates. First, in the setting of covariate shift, we formulate the problem from a perspective of causality and present the identifiability assumptions for the reward induced by a given policy. Then, we derive the efficient influence function and the semiparametric efficiency bound for the reward. Based on this, we construct a doubly robust and semiparametric efficient estimator for the reward and then learn the optimal policy by optimizing the estimated reward. Moreover, we theoretically analyze the bias and the generalization error bound for the learned policy. Furthermore, in the presence of both covariate and concept shifts, we propose a novel sensitivity analysis method to evaluate the robustness of the proposed policy learning approach. Extensive experiments demonstrate that the approach not only estimates the reward more accurately but also yields a policy that closely approximates the theoretically optimal policy.
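The doubly robust estimator referenced above follows the standard construction combining an outcome model with inverse propensity weighting, so the estimate stays consistent if either model is correct. A minimal sketch on synthetic logged data (function names are illustrative):

```python
# Sketch of the standard doubly robust value estimator: a direct term from
# the outcome model m(x, a) plus an importance-weighted correction on
# actions that match the target policy.
import numpy as np

def doubly_robust_value(x, a, r, policy, outcome_model, propensity):
    """Estimate the reward of `policy` from logged (x, a, r) tuples."""
    pi_a = np.array([policy(xi) for xi in x])          # target policy actions
    m_pi = np.array([outcome_model(xi, ai) for xi, ai in zip(x, pi_a)])
    m_logged = np.array([outcome_model(xi, ai) for xi, ai in zip(x, a)])
    e = np.array([propensity(xi, ai) for xi, ai in zip(x, a)])
    match = (a == pi_a).astype(float)
    # Direct term plus importance-weighted correction on matched actions.
    return np.mean(m_pi + match / e * (r - m_logged))

rng = np.random.default_rng(0)
x = rng.normal(size=100)
a = (rng.random(100) < 0.5).astype(int)                # logging policy
r = x * a + rng.normal(scale=0.1, size=100)
value = doubly_robust_value(
    x, a, r,
    policy=lambda xi: int(xi > 0),                     # treat if x > 0
    outcome_model=lambda xi, ai: xi * ai,              # true outcome here
    propensity=lambda xi, ai: 0.5,                     # logging probability
)
print(value)
```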
3645: TSTAI: A Time-varying Brain Effective Connectivity Network Construction Method Combining with Brain Active Information
Authors: Qi Chen, Zhiqiong Wang, Jiaxin Li, Jinying Tao, Junchang Xin
Location: Guangzhou | Day: TBD
Show Abstract
More accurate construction of brain effective connectivity networks remains a great challenge for accurate auxiliary diagnosis of brain diseases and in-depth exploration of brain function. However, existing methods only consider higher-order or non-stationary assumptions, rather than simultaneously constructing higher-order and non-stationary networks. Among existing methods, Bayesian network methods demonstrate superior network structure learning ability. In this work, the forward-backward search (FBS) method is optimized using brain active information and improved into a higher-order network structure learning method, called TSTAI. First, in the process of non-stationary network structure learning, a two-stage strategy is used to search for change points. Then, in the process of learning the higher-order network structure, the FBS method is combined with two kinds of brain active information to improve the condition-set filtering process and the scoring function, respectively. Finally, a pruning strategy is used to reduce the search space. Extensive experiments on simulated and real data demonstrate the effectiveness of TSTAI. Compared with state-of-the-art higher-order network construction methods, the proposed method achieves improvements of 3.6% and 17.4%, respectively, in network construction accuracy.
3647: Efficient Inter-Operator Scheduling for Concurrent Recommendation Model Inference on GPU
Authors: Shuxi Guo, Zikang Xu, Jiahao Liu, Jinyi Zhang, Qi Qi, Haifeng Sun, Jun Huang, Jianxin Liao, Jingyu Wang
Location: Guangzhou | Day: TBD
Show Abstract
Deep learning-based recommendation systems are increasingly important in the industry. To meet strict SLA requirements, serving frameworks must efficiently handle concurrent queries. However, current serving systems fail to serve concurrent queries due to the following problems: (1) inefficient operator (op) scheduling due to the query-wise op launching mechanism, and (2) heavy contention caused by the mutable nature of recommendation model inference. This paper presents RecOS, a system designed to optimize concurrent recommendation model inference on GPUs. RecOS efficiently schedules ops from different queries by monitoring GPU workloads and assigning ops to the most suitable streams. This approach reduces contention and enhances inference efficiency by leveraging inter-op parallelism and op characteristics. To maintain correctness across multiple CUDA streams, RecOS introduces a unified asynchronous tensor management mechanism. Evaluations demonstrate that RecOS improves online service performance, reducing latency by up to 68%.
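The inter-op parallelism RecOS exploits can be illustrated with plain PyTorch CUDA streams; the sketch below shows only the stream mechanism, not RecOS's workload-aware scheduler or tensor management:

```python
# Sketch of inter-op parallelism on CUDA streams: independent ops from
# different queries are launched on separate streams so the GPU may
# overlap them. Illustrative only; not the RecOS scheduler.
import torch

assert torch.cuda.is_available()
a = torch.randn(2048, 2048, device="cuda")
b = torch.randn(2048, 2048, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
torch.cuda.synchronize()
with torch.cuda.stream(s1):          # op from query 1
    out1 = a @ a
with torch.cuda.stream(s2):          # op from query 2, may overlap with out1
    out2 = b @ b
torch.cuda.synchronize()             # join both streams before reading
print(out1.shape, out2.shape)
```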
3663: Distilling A Universal Expert from Clustered Federated Learning
Authors: Zeqi Leng, Chunxu Zhang, Guodong Long, Riting Xia, Bo Yang
Location: Guangzhou | Day: TBD
Show Abstract
Clustered Federated Learning (CFL) addresses the challenges posed by non-IID data by training multiple group- or cluster-specific expert models. However, existing methods often overlook the shared information across clusters, which represents the generalizable knowledge valuable to all participants in the Federated Learning (FL) system. To overcome this limitation, this paper introduces a novel FL framework that distills a universal expert model from the knowledge of multiple clusters. This universal expert captures globally shared information across all clients and is subsequently distributed to each client as the initialization for the next round of model training. The proposed FL framework operates in three iterative steps: (1) local model training at each client, (2) cluster-specific model aggregation, and (3) universal expert distillation. This three-step learning paradigm ensures the preservation of fine-grained non-IID characteristics while effectively incorporating shared knowledge across clusters. Compared to traditional gradient-based aggregation methods, the distillation-based model aggregation introduces greater flexibility in handling model heterogeneity and reduces conflicts among cluster-specific experts. Extensive experimental results demonstrate the superior performance of the proposed method across various scenarios, highlighting its potential to advance the state of CFL by balancing personalized and shared knowledge more effectively.
3672: Exploring Efficient and Effective Sequence Learning for Visual Object Tracking
Authors: Dongdong Li, Zhinan Gao, Yangliu Kuai, Rui Chen
Location: Guangzhou | Day: TBD
Show Abstract
Sequence learning based tracking frameworks are popular in the tracking community. In practice, their auto-regressive sequence generation leads to inferior performance and high latency compared with the latest advanced trackers. In this paper, to mitigate this issue, we propose an efficient and effective sequence-to-sequence tracking framework named FastSeqTrack. FastSeqTrack differs from previous sequence learning based trackers in its token initialization and sequence generation manner. Four tracking tokens are appended to patch embeddings and generated in the encoder as initial guesses for the bounding box sequence, which improves tracking accuracy compared with randomly initialized tokens. The tracking tokens are then fed into the decoder in parallel in a one-pass manner, greatly boosting forward inference speed compared with the auto-regressive manner. Inspired by the early-exit mechanism, we inject internal classifiers after each decoder layer to terminate forward inference early when the softmax confidence is sufficiently reliable. In easy tracking frames, early exits avoid network overthinking and unnecessary computation. Extensive experiments on multiple benchmarks demonstrate that FastSeqTrack runs at over 100 fps and showcases superior performance against state-of-the-art trackers. Codes and models are available at https://github.com/vision4drones/FastSeqTrack.
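The early-exit idea can be sketched as follows: after each decoder layer, an internal classifier scores the tracking tokens, and inference stops once softmax confidence passes a threshold. All module names and sizes are illustrative placeholders, not the released FastSeqTrack code:

```python
# Sketch of an early-exit decoder: internal classifiers after each layer
# terminate inference once the minimum token confidence passes tau.
import torch
import torch.nn as nn

class EarlyExitDecoder(nn.Module):
    def __init__(self, dim=256, num_layers=6, num_bins=1000, tau=0.9):
        super().__init__()
        layer = lambda: nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.layers = nn.ModuleList(layer() for _ in range(num_layers))
        self.exits = nn.ModuleList(nn.Linear(dim, num_bins)
                                   for _ in range(num_layers))
        self.tau = tau

    def forward(self, tokens, memory):
        for layer, exit_head in zip(self.layers, self.exits):
            tokens = layer(tokens, memory)
            logits = exit_head(tokens)                   # (B, 4, num_bins)
            conf = logits.softmax(-1).max(-1).values.min()
            if conf > self.tau:                          # confident enough:
                break                                    # skip later layers
        return logits

tokens = torch.randn(1, 4, 256)    # four tracking tokens (box sequence)
memory = torch.randn(1, 64, 256)   # encoder patch embeddings
print(EarlyExitDecoder()(tokens, memory).shape)
```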
3680: RAMer: Reconstruction-based Adversarial Model for Multi-party Multi-modal Multi-label Emotion Recognition
Authors: Xudong Yang, Yizhang Zhu, Hanfeng Liu, Zeyi Wen, Nan Tang, Yuyu Luo
Location: Guangzhou | Day: TBD
Show Abstract
Conventional multi-modal multi-label emotion recognition (MMER) assumes complete access to visual, textual, and acoustic modalities. However, real-world multi-party settings often violate this assumption, as non-speakers frequently lack acoustic and textual inputs, leading to a significant degradation in model performance. Existing approaches also tend to unify heterogeneous modalities into a single representation, overlooking each modality’s unique characteristics. To address these challenges, we propose RAMer (Reconstruction-based Adversarial Model for Emotion Recognition), which refines multi-modal representations not only by exploring modality commonality and specificity but also, crucially, by leveraging reconstructed features, enhanced by contrastive learning, to overcome data incompleteness and enrich feature quality. RAMer also introduces a personality auxiliary task to complement missing modalities using modality-level attention, improving emotion reasoning. To further strengthen the model’s ability to capture label and modality interdependency, we propose a stack shuffle strategy to enrich correlations between labels and modality-specific features. Experiments on three benchmarks, i.e., MEmoR, CMU-MOSEI, and M³ED, demonstrate that RAMer achieves state-of-the-art performance in dyadic and multi-party MMER scenarios.
3700: Where and When: Predict Next POI and Its Explicit Timestamp in Sequential Recommendation
Authors: Yuanbo Xu, Hongxu Shen, Yiheng Jiang, En Wang
Location: Guangzhou | Day: TBD
Show Abstract
Sequential point-of-interest (POI) recommendation aims to recommend the next POI for users in accordance with their historical check-in information. However, few attempts treat the timestamps of check-ins as a core factor for sequence models, leading to insufficient insight into user behavior and subsequently suboptimal recommendations. To address these limitations, we propose to assign equal importance to both POIs and their timestamps, shifting the point of view to recommending the next POI and predicting the corresponding timestamp. Along these lines, we present the Time-Aware POI Recommender with Timestamp Prediction (TAPT), a multi-task learning framework for explainable POI recommendations. Specifically, we begin by decoupling timestamps into multi-dimensional vectors and propose a timestamp encoding module to explicitly encode these vectors. Additionally, we design a specialized timestamp prediction module built on a traditional sequence-based POI recommender backbone, effectively learning the strong correlation between POIs and their corresponding timestamps through these two modules. We evaluated the proposed model on three real-world LBSN datasets and demonstrated that TAPT achieves comparable or superior performance in POI recommendation compared to the baseline backbone. Moreover, TAPT can not only recommend the next POI but also predict its explicit timestamp.
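The timestamp decoupling and encoding idea can be sketched by splitting a unix timestamp into calendar fields and embedding each field separately; the field choices and dimensions below are illustrative assumptions, not the paper's exact design:

```python
# Sketch of decoupling a check-in timestamp into a multi-dimensional
# vector and encoding each field with its own embedding table.
import datetime
import torch
import torch.nn as nn

class TimestampEncoder(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.month = nn.Embedding(12, dim)
        self.weekday = nn.Embedding(7, dim)
        self.hour = nn.Embedding(24, dim)

    def decouple(self, ts):
        dt = datetime.datetime.fromtimestamp(ts)
        return torch.tensor([dt.month - 1, dt.weekday(), dt.hour])

    def forward(self, timestamps):                 # list of unix times
        fields = torch.stack([self.decouple(t) for t in timestamps])
        parts = [self.month(fields[:, 0]), self.weekday(fields[:, 1]),
                 self.hour(fields[:, 2])]
        return torch.cat(parts, dim=-1)            # (B, 3 * dim)

enc = TimestampEncoder()
print(enc([1700000000, 1700086400]).shape)         # torch.Size([2, 48])
```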
3716: Metapath and Hypergraph Structure-based Multi-Channel Graph Contrastive Learning for Student Performance Prediction
Authors: Lingyun Song, Xiaofan Sun, Xinbiao Gan, Yudai Pan, Xiaolin Han, Jie Ma, Jun Liu, Xuequn Shang
Location: Guangzhou | Day: TBD
Show Abstract
Considerable attention has been paid to predicting student performance on exercises. The performance of prior studies is determined by the quality of the trait features of students and exercises. Nevertheless, most prior studies primarily examine simple pairwise interactions when learning trait features, such as those between students and exercises or between exercises and concepts, while disregarding the complex higher-order interactions that typically exist among these components, which in turn hinders prediction results. In this paper, we propose an innovative Multi-Channel Graph Contrastive Learning (MCGCL) framework that integrates various high-order interactions for predicting student performance. MCGCL characterizes graph structures reflecting various high-order relationships among students, exercises, and concepts through multiple channels, thereby enhancing the trait features of both students and exercises. Moreover, graph contrastive learning is employed to enhance the representation of trait features acquired from high-order graph structures in diverse views. Extensive experiments on real-world datasets show that MCGCL achieves state-of-the-art results on the task of predicting student performance. The code is available at https://github.com/sunlitsong/MCGCL.
3768: Efficient Diversity-based Experience Replay for Deep Reinforcement Learning
Authors: Kaiyan Zhao, Yiming Wang, Yuyang Chen, Yan Li, Leong Hou U, Xiaoguang Liu
Location: Guangzhou | Day: TBD
Show Abstract
Experience replay is widely used to improve learning efficiency in reinforcement learning by leveraging past experiences. However, existing experience replay methods, whether based on uniform or prioritized sampling, often suffer from low efficiency, particularly in real-world scenarios with high-dimensional state spaces. To address this limitation, we propose a novel approach, Efficient Diversity-based Experience Replay (EDER). EDER employs a determinantal point process to model the diversity between samples and prioritizes replay accordingly. To further enhance learning efficiency, we incorporate Cholesky decomposition for handling large state spaces in realistic environments. Additionally, rejection sampling is applied to select samples with higher diversity, thereby improving overall learning efficacy. Extensive experiments are conducted on robotic manipulation tasks in MuJoCo, Atari games, and realistic indoor environments in Habitat. The results demonstrate that our approach not only significantly improves learning efficiency but also achieves superior performance in high-dimensional, realistic environments.
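Diversity-based selection with a determinantal point process flavor can be sketched with the standard greedy log-determinant heuristic, which uses exactly the kind of incremental Cholesky-style update the abstract alludes to. A simplified illustration, not the EDER code:

```python
# Greedy DPP-style subset selection: repeatedly pick the transition that
# most increases the log-determinant of an RBF similarity kernel over the
# chosen set, via an incremental Cholesky-style update.
import numpy as np

def greedy_diverse_subset(states, k, bandwidth=1.0):
    n = len(states)
    d2 = ((states[:, None, :] - states[None, :, :]) ** 2).sum(-1)
    L = np.exp(-d2 / (2 * bandwidth ** 2))         # RBF similarity kernel
    chosen = []
    # c[i] holds the conditional variance of item i given the chosen set;
    # picking the argmax maximizes the log-det gain at each step.
    c = np.diag(L).copy()
    V = np.zeros((n, k))
    for t in range(k):
        i = int(np.argmax(c))
        chosen.append(i)
        e = (L[:, i] - V[:, :t] @ V[i, :t]) / np.sqrt(c[i])
        V[:, t] = e
        c = c - e ** 2                              # Cholesky-style update
        c[chosen] = -np.inf                         # never re-pick
    return chosen

buffer = np.random.default_rng(0).normal(size=(500, 8))   # replay states
print(greedy_diverse_subset(buffer, k=5))
```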
3776: Graph Random Walk with Feature-Label Space Alignment: A Multi-Label Feature Selection Method
Authors: Wanfu Gao, Jun Gao, Qingqi Han, Hanlin Pan, Kunpeng Liu
Location: Guangzhou | Day: TBD
Show Abstract
The rapid growth in feature dimension may introduce implicit associations between features and labels in multi-label datasets, making the relationships between features and labels increasingly complex. Moreover, existing methods often adopt low-dimensional linear decomposition to explore the associations between features and labels. However, linear decomposition struggles to capture complex nonlinear associations and may lead to misalignment between the feature space and the label space. To address these two critical challenges, we propose innovative solutions. First, we design a random walk graph that integrates feature-feature, label-label, and feature-label relationships to accurately capture nonlinear and implicit indirect associations, while optimizing the latent representations of associations between features and labels after low-rank decomposition. Second, we align the variable spaces by leveraging low-dimensional representation coefficients, while preserving the manifold structure between the original high-dimensional multi-label data and the low-dimensional representation space. Extensive experiments and ablation studies conducted on seven benchmark datasets and three representative datasets using various evaluation metrics demonstrate the superiority of the proposed method.
3777: Unveiling Maternity and Infant Care Conversations: A Chinese Dialogue Dataset for Enhanced Parenting Support
Authors: Bo Xu, Liangzhi Li, Junlong Wang, Xuening Qiao, Erchen Yu, Yiming Qian, Linlin Zong, Hongfei Lin
Location: Guangzhou | Day: TBD
Show Abstract
The rapid development of large language models has greatly advanced human-computer dialogue research. However, applying these models to specialized fields like maternity and infant care often leads to subpar performance due to a lack of domain-specific datasets. To address this problem, we have created MicDialogue, a Chinese dialogue dataset for maternity and infant care. MicDialogue involves a wide range of specialized topics, including gynecological health, pediatric care, pregnancy preparation, emotional counseling and other related topics. This dataset is curated from two types of Chinese social media: short videos and blog posts. Short videos capture real-time interactions and pragmatic dialogue patterns, while blog posts offer comprehensive coverage of various topics within the domain. We have also included detailed annotations for topics, diseases, symptoms, and causes, enabling in-depth research. Additionally, we developed a knowledge-driven benchmark model using LLM-based prompt learning and multiple knowledge graphs to address diverse dialogue topics. Experiments validate MicDialogue’s usability, providing benchmarks for future research and essential data for fine-tuning language models in maternity and infant care.
3796: Unlocking Dark Vision Potential for Medical Image Segmentation
Authors: Hongpeng Yang, Xiangyu Hu, Yingxin Chen, Siyu Chen, Srihari Nelakuditi, Yan Tong, Shiqiang Ma, Fei Guo
Location: Guangzhou | Day: TBD
Show Abstract
Accurate segmentation of lesions is crucial for disease diagnosis and treatment planning. However, blurring and low contrast in the imaging process can degrade segmentation results. We observe that noninvasive medical imaging shares considerable similarities with natural images under low-light conditions and that nocturnal animals possess extremely strong night vision capabilities. Inspired by the dark vision of these nocturnal animals, we propose a novel plug-and-play dark vision network (DVNet) to enhance the model’s perception of low-contrast medical images. Specifically, by employing the wavelet transform, we decompose medical images into subbands of varying frequencies, mimicking the sensitivity of photoreceptor cells to different light intensities. To simulate the antagonistic receptive fields of horizontal cells and bipolar cells, we design a Mamba-Enhanced Fusion Module to achieve global information correlation and enhance contrast between lesions and surrounding healthy tissues. Extensive experiments demonstrate that DVNet achieves SOTA performance in various medical image segmentation tasks.
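The wavelet decomposition step can be illustrated with PyWavelets: a 2-D discrete wavelet transform splits an image into a low-frequency approximation and three high-frequency detail subbands, which separate branches can then process. A minimal sketch on synthetic data (not the DVNet code):

```python
# One level of 2-D discrete wavelet decomposition with PyWavelets.
import numpy as np
import pywt

image = np.random.rand(128, 128)                  # stand-in for a scan slice
cA, (cH, cV, cD) = pywt.dwt2(image, "haar")       # one decomposition level

print(cA.shape)   # (64, 64) low-frequency structure (global contrast)
print(cH.shape)   # horizontal details; cV, cD: vertical and diagonal
# Subbands can be processed by separate branches and recombined:
reconstructed = pywt.idwt2((cA, (cH, cV, cD)), "haar")
print(np.allclose(reconstructed, image))          # True: transform is exact
```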
3814: Endowing Interpretability for Neural Cognitive Diagnosis by Efficient Kolmogorov-Arnold Networks
Authors: Shangshang Yang, Linrui Qin, Xiaoshan Yu, Ziwen Wang, Xueming Yan, Haiping Ma, Ye Tian
Location: Guangzhou | Day: TBD
Show Abstract
Cognitive diagnosis is crucial for intelligent education because of its ability to reveal students’ proficiency in knowledge concepts. Although neural network-based cognitive diagnosis models (CDMs) have exhibited significantly better performance than traditional models, neural cognitive diagnosis is criticized for poor model interpretability due to the multi-layer perceptron (MLP) employed, even with the monotonicity assumption. Therefore, this paper proposes to empower the interpretability of neural cognitive diagnosis models through efficient Kolmogorov-Arnold networks (KANs), named KAN2CD, where KANs are used to enhance interpretability in two manners. In the first manner, KANs directly replace the MLPs used in existing neural CDMs; in the second manner, the student embedding, exercise embedding, and concept embedding are each processed by separate KANs, whose outputs are further combined and learned in a unified KAN to produce final predictions. Besides, the implementation of the original KANs is modified, without affecting interpretability, to overcome their slow training. Extensive experiments show that KAN2CD outperforms traditional CDMs and slightly surpasses existing neural CDMs, and its learned structures ensure interpretability on par with traditional CDMs and better than neural CDMs. The datasets, associated code, and more experimental results are available at https://github.com/null233QAQ/KAN2CD.
3822: Reliable Disentanglement Multi-view Learning Against View Adversarial Attacks
Authors: Xuyang Wang, Siyuan Duan, Qizhi Li, Guiduo Duan, Yuan Sun, Dezhong Peng
Location: Guangzhou | Day: TBD
Show Abstract
Trustworthy multi-view learning has attracted extensive attention because evidence learning can provide reliable uncertainty estimation to enhance the credibility of multi-view predictions. Existing trusted multi-view learning methods implicitly assume that multi-view data is secure. However, in safety-sensitive applications such as autonomous driving and security monitoring, multi-view data often faces threats from adversarial perturbations, thereby deceiving or disrupting multi-view models. This inevitably leads to the adversarial unreliability problem (AUP) in trusted multi-view learning. To overcome this tricky problem, we propose a novel multi-view learning framework, namely Reliable Disentanglement Multi-view Learning (RDML). Specifically, we first propose evidential disentanglement learning to decompose each view into clean and adversarial parts under the guidance of corresponding evidence, which is extracted by a pretrained evidence extractor. Then, we employ a feature recalibration module to mitigate the negative impact of adversarial perturbations and extract potentially informative features from them. Finally, to further discount irreparable adversarial interference, a view-level evidential attention mechanism is designed. Extensive experiments on multi-view classification tasks with adversarial attacks show that RDML outperforms the state-of-the-art methods by a relatively large margin. Our code is available at https://github.com/Willy1005/2025-IJCAI-RDML.
3825: RePST: Language Model Empowered Spatio-Temporal Forecasting via Semantic-Oriented Reprogramming
Authors: Hao Wang, Jindong Han, Wei Fan, Leilei Sun, Hao Liu
Location: Guangzhou | Day: TBD
Show Abstract
Spatio-temporal forecasting is pivotal in numerous real-world applications, including transportation planning, energy management, and climate monitoring.
In this work, we aim to harness the reasoning and generalization abilities of Pre-trained Language Models (PLMs) for more effective spatio-temporal forecasting, particularly in data-scarce scenarios.
However, recent studies uncover that PLMs, which are primarily trained on textual data, often falter when tasked with modeling the intricate correlations in numerical time series, thereby limiting their effectiveness in comprehending spatio-temporal data.
To bridge the gap, we propose RePST, a semantic-oriented PLM reprogramming framework tailored for spatio-temporal forecasting.
Specifically, we first propose a semantic-oriented decomposer that adaptively disentangles spatially correlated time series into interpretable sub-components, which facilitates PLM to understand sophisticated spatio-temporal dynamics via a divide-and-conquer strategy.
Moreover, we propose a selective discrete reprogramming scheme, which introduces an expanded spatio-temporal vocabulary space to project spatio-temporal series into discrete representations. This scheme minimizes the information loss during reprogramming and enriches the representations derived by PLMs.
Extensive experiments on real-world datasets show that the proposed RePST outperforms twelve state-of-the-art baseline methods, particularly in data-scarce scenarios, highlighting the effectiveness and superior generalization capabilities of PLMs for spatio-temporal forecasting.
Codes and Appendix can be found at https://github.com/usail-hkust/REPST.
3838: Graph Prompts: Adapting Video Graph for Video Question Answering
Authors: Yiming Li, Xiaoshan Yang, Bing-Kun Bao, Changsheng Xu
Location: Guangzhou | Day: TBD
Show Abstract
Due to the dynamic nature of videos, it is evident that perceiving and reasoning about temporal information is the key focus of Video Question Answering (VideoQA). In recent years, several methods have explored relationship-level temporal modeling with graph-structured video representations. Unfortunately, these methods rely heavily on the question text, making it challenging to perceive and reason about video content that is not explicitly mentioned in the question. To address this challenge, we propose Graph Prompts-based VideoQA (GP-VQA), which adopts a video-based graph structure for enhanced video understanding. The proposed GP-VQA contains two stages, i.e., pre-training and prompt tuning. In pre-training, we define a pretext task that requires GP-VQA to reason about randomly masked nodes or edges in the video graph, prompting GP-VQA to learn reasoning ability with video-guided information. In prompt tuning, we organize the textual question into a question graph and implement message passing from the video graph to the question graph, thereby transferring the video-based reasoning ability from video graph completion to VideoQA. Extensive experiments on various datasets demonstrate the promising performance of GP-VQA.
3845: From Individual to Universal: Regularized Multi-view Joint Representation for Multi-view Subspace-Preserving Recovery
Authors: Libin Wang, Yulong Wang, Xinwei He, Qiwei Xie, Kit Ian Kou, Yuan Yan Tang
Location: Guangzhou | Day: TBD
Show Abstract
Recent years have witnessed an explosion of Multi-view Subspace Classification (MSCla) and Multi-view Subspace Clustering (MSClu) methods for various applications. However, their theoretical foundations have not been well explored or understood. In this paper, we investigate the multi-view subspace-preserving recovery theory, which provides the theoretical underpinnings for MSCla and MSClu methods. Specifically, we derive novel geometrically interpretable conditions for the success of multi-view subspace-preserving recovery. Compared with prior related works, we make the following innovations: First, our theory does not require the equality constraint, a common requirement in prior theoretical works that may be too restrictive in reality. Second, we provide both an Individual Theoretical Guarantee (ITG) and a Universal Theoretical Guarantee (UTG) for multi-view subspace-preserving recovery, while prior works only give the UTG. Third, we apply the proposed theory to establish theoretical guarantees for MSCla and MSClu, respectively. Numerical results validate the proposed theory for multi-view subspace-preserving recovery.
3856: HIPP: Protecting Image Privacy via High-Quality Reversible Protected Version
Authors: Xi Ye, Lina Wang, Run Wang, Jiatong Liu, Geying Yang
Location: Guangzhou | Day: TBD
Show Abstract
With the rapid development of the internet, sharing photos through Social Network Platforms (SNPs) has become a new way for people to socialize, which poses serious threats to personal privacy. Recently, a thumbnail-preserving image privacy protection technique has emerged and garnered widespread attention. However, the existing schemes based on this technique often introduce noticeable noise into the protected image, resulting in poor visual quality. Motivated by the observation that a latent vector can be decoupled into the detail and contour components, in this paper, we propose HIPP, a thumbnail-preserving image privacy protection scheme that decouples the detail and contour information contained in the latent vector corresponding to the original image and reconstructs details by generation model. As a result, the generated protected image appears natural and has a thumbnail similar to the original one. Moreover, the protected images can be restored to versions that are indistinguishable from the original images. Experiments on CelebA, Helen, and LSUN datasets show that the SSIM between the restored and original images achieves 0.9899. Furthermore, compared to the previous works, HIPP achieves the lowest runtime and file expansion rate, with values of 0.07 seconds and 1.1046, respectively.
3867: Human-Imperceptible, Machine-Recognizable Images
Authors: Fusheng Hao, Fengxiang He, Yikai Wang, Fuxiang Wu, Jing Zhang, Dacheng Tao, Jun Cheng
Location: Guangzhou | Day: TBD
Show Abstract
Massive human-related data is collected to train neural networks for computer vision tasks. This exposes a major conflict for software engineers between developing better AI systems and keeping their distance from sensitive training data. To reconcile this conflict, the paper proposes an efficient privacy-preserving learning paradigm, where images are encrypted to become “human-imperceptible, machine-recognizable” via one of two encryption strategies: (1) randomly shuffling equally-sized patches and (2) mixing up sub-patches. Then, minimal adaptations are made to the vision transformer to enable it to learn on the encrypted images for vision tasks, including image classification and object detection. Extensive experiments on ImageNet and COCO show that the proposed paradigm achieves accuracy comparable with competitive methods. Decrypting the encrypted images requires solving an NP-hard jigsaw puzzle or an ill-posed inverse problem, which is empirically shown to be intractable for various attackers, including a powerful vision transformer-based attacker. We thus show that the proposed paradigm can ensure the encrypted images become human-imperceptible while preserving machine-recognizable information.
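The first encryption strategy, shuffling equally-sized patches under a secret key, is easy to make concrete; a minimal numpy sketch (not the paper's code), where the permutation seed plays the role of the key:

```python
# Patch-shuffle "encryption": split an image into equally-sized patches and
# permute them with a keyed RNG. Without the key, recovery is a jigsaw puzzle.
import numpy as np

def shuffle_patches(img, patch=16, key=42):
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    # Rearrange into a (gh*gw, patch, patch, c) stack of patches.
    grid = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    flat = grid.reshape(gh * gw, patch, patch, c)
    perm = np.random.default_rng(key).permutation(gh * gw)
    enc = flat[perm].reshape(gh, gw, patch, patch, c) \
                    .transpose(0, 2, 1, 3, 4).reshape(h, w, c)
    return enc, perm

def unshuffle_patches(enc, perm, patch=16):
    h, w, c = enc.shape
    gh, gw = h // patch, w // patch
    flat = enc.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4) \
              .reshape(gh * gw, patch, patch, c)
    inv = np.argsort(perm)                        # invert the permutation
    return flat[inv].reshape(gh, gw, patch, patch, c) \
                    .transpose(0, 2, 1, 3, 4).reshape(h, w, c)

img = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
enc, perm = shuffle_patches(img)
print(np.array_equal(unshuffle_patches(enc, perm), img))   # True
```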
3881: DANCE: Resource-Efficient Neural Architecture Search with Data-Aware and Continuous Adaptation
Authors: Maolin Wang, Tianshuo Wei, Sheng Zhang, Ruocheng Guo, Wangyu Wang, Shanshan Ye, Lixin Zou, Xuetao Wei, Xiangyu Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Neural Architecture Search (NAS) has emerged as a powerful approach for automating neural network design. However, existing NAS methods face critical limitations in real-world deployments: architectures lack adaptability across scenarios, each deployment context requires costly separate searches, and performance consistency across diverse platforms remains challenging. We propose DANCE (Dynamic Architectures with Neural Continuous Evolution), which reformulates architecture search as a continuous evolution problem through learning distributions over architectural components. DANCE introduces three key innovations: a continuous architecture distribution enabling smooth adaptation, a unified architecture space with learned selection gates for efficient sampling, and a multi-stage training strategy for effective deployment optimization. Extensive experiments across five datasets demonstrate DANCE’s effectiveness. Our method consistently outperforms state-of-the-art NAS approaches in terms of accuracy while significantly reducing search costs. Under varying computational constraints, DANCE maintains robust performance while smoothly adapting architectures to different hardware requirements. The code and appendix can be found at https://github.com/Applied-Machine-Learning-Lab/DANCE.
3882: Towards Generalizable Neural Simulators: Addressing Distribution Shifts Induced by Environmental and Temporal Variations
Authors: Jiaqi Liu, Jiaxu Cui, Shiang Sun, Yizhu Zhao, Bo Yang
Location: Guangzhou | Day: TBD
Show Abstract
With advancements in deep learning, neural simulators have become increasingly important for improving the efficiency and effectiveness of simulating complex dynamical systems in various scientific and technological fields. This paper presents a novel neural simulator called Context-informed Polymorphic Neural ODE Processes (CoPoNDP), aimed at addressing the challenges of modeling dynamical systems encountering concurrent environmental and temporal distribution shifts, which are common in real-world scenarios. CoPoNDP employs a context-driven neural stochastic process governed by a combination of basic differential equations in a time-sensitive manner to adaptively modulate the evolution of system states. This allows for flexible adaptation to changing temporal dynamics and generalization across different environments. Extensive experiments conducted on dynamical systems from ecology, chemistry, physics, and energy demonstrate that by effectively utilizing contextual information, CoPoNDP outperforms the state-of-the-art models in handling joint distribution shifts. It also shows robustness in sparse and noisy settings, making it a promising approach for modeling dynamical systems in complex real-world applications.
3883: Dual Encoder Contrastive Learning with Augmented Views for Graph Anomaly Detection
Authors: Nannan Wu, Hongdou Dong, Wenjun Wang, Yiming Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Graph anomaly detection (GAD), which aims to identify patterns that deviate significantly from normal nodes in attributed networks, is widely used in financial fraud, cybersecurity, and bioinformatics. Paradigms that jointly optimize contrastive learning and reconstruction learning have shown significant potential in this field. However, when GNNs are used as the encoder, this paradigm still faces over-smoothing and struggles to effectively capture the fine-grained topology information of the graph. In this paper, we introduce an innovative approach: Dual Encoder Contrastive Learning with Augmented Views for Graph Anomaly Detection, named DECLARE. Specifically, the dual encoder integrates the strengths of GNNs and Graph Transformers to comprehensively learn graph representations from multiple perspectives. Although contrastive learning enhances the model’s ability to learn discriminative features, it cannot directly identify anomalous patterns. To address this, the reconstruction module independently reconstructs graph structures and attributes, helping the model focus on learning the normal patterns of both structure and attributes. Through extensive experimental analysis, we demonstrate the superiority of DECLARE over state-of-the-art baselines on six benchmark datasets.
3894: Graph OOD Detection via Plug-and-Play Energy-based Evaluation and Propagation
Authors: Yunxia Zhang, Mingchen Sun, Yutong Zhang, Funing Yang, Ying Wang
Location: Guangzhou | Day: TBD
Show Abstract
Existing graph neural network (GNN) methods are typically built upon the i.i.d. assumption, emphasizing the enhancement of test performance on in-distribution (ID) data. However, there has been limited exploration of their adaptability to scenarios involving data from unknown distributions. On the one hand, in real-world application scenarios, graph data often expands continuously with the acquisition of external knowledge, which means that new nodes with unknown categories may be added to the graph. The gap between the new node distribution and the original node distribution can make existing GNN methods less effective. On the other hand, existing out-of-distribution (OOD) detection methods often rely on the softmax confidence score, which leaves OOD data with overconfident posterior distributions. To address these issues, we propose an Energy Propagation-based Graph Neural Network (EPGNN), which improves OOD generalization ability by endowing the GNN with the capacity to detect OOD nodes in the graph. Specifically, we first construct a GNN encoder to obtain node embeddings that incorporate neighborhood structural information. Then, we design a plug-and-play energy-based OOD evaluator by assigning corresponding energy values to different nodes. Finally, we construct a plug-and-play structure-aware energy propagation module and a joint alignment regularization, which make the node energies more flexible during the training process. Extensive experiments on benchmark datasets demonstrate the superiority of our method.
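The energy evaluator and propagation steps can be sketched compactly: node energy is the negative log-sum-exp of the GNN logits (avoiding softmax overconfidence), then energies are smoothed over the graph. Hyperparameters and the dense-adjacency form below are illustrative assumptions, not the EPGNN modules:

```python
# Energy-based OOD scoring with one-step graph propagation.
import torch

def energy_scores(logits, adj, alpha=0.5, steps=2):
    """logits: (N, C) GNN outputs; adj: (N, N) dense adjacency."""
    energy = -torch.logsumexp(logits, dim=-1)       # (N,) higher => more OOD
    deg = adj.sum(-1).clamp(min=1)
    norm_adj = adj / deg.unsqueeze(-1)              # row-normalized
    for _ in range(steps):                          # structure-aware smoothing
        energy = alpha * energy + (1 - alpha) * norm_adj @ energy
    return energy

logits = torch.randn(5, 4)
adj = torch.tensor([[0., 1, 0, 0, 0], [1, 0, 1, 0, 0], [0, 1, 0, 1, 0],
                    [0, 0, 1, 0, 1], [0, 0, 0, 1, 0]])
scores = energy_scores(logits, adj)
print(scores)    # threshold the scores to flag OOD nodes
```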
3899: Efficient Quantum Approximate kNN Algorithm via Granular-Ball Computing
Authors: Shuyin Xia, Xiaojiang Tian, Suzhen Yuan, Jeremiah D. Deng
Location: Guangzhou | Day: TBD
Show Abstract
High time complexity is one of the biggest challenges faced by k-Nearest Neighbors (kNN). Although current classical and quantum kNN algorithms have made some improvements, they still face a speed bottleneck on large amounts of data. To address this issue, we propose an innovative algorithm called Granular-Ball based Quantum kNN (GB-QkNN). This approach achieves higher efficiency by first employing granular-balls, which reduces the amount of data that needs to be processed. The search process is then accelerated by adopting a Hierarchical Navigable Small World (HNSW) method. Moreover, we optimize the time-consuming steps of the HNSW, such as distance calculation, via quantization, further reducing the time complexity of the construction and search processes. By combining the use of granular-balls and quantization of the HNSW method, our approach significantly reduces the time complexity of kNN-like algorithms, as revealed by a comprehensive complexity analysis.
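The HNSW building block being accelerated can be illustrated with the hnswlib library: build an index over (here, hypothetical granular-ball centers) and answer approximate kNN queries. Parameters below are generic defaults, not the paper's settings:

```python
# Approximate kNN with an HNSW index from hnswlib.
import numpy as np
import hnswlib

dim, n = 32, 10_000
centers = np.random.default_rng(0).random((n, dim)).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(centers, np.arange(n))
index.set_ef(50)                         # search-time accuracy/speed knob

query = np.random.random((1, dim)).astype(np.float32)
labels, dists = index.knn_query(query, k=5)
print(labels, dists)                     # ids and squared L2 distances
```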
3909: A Reduction-Based Algorithm for the Clique Interdiction Problem
Authors: Chenghao Zhu, Yi Zhou, Haoyu Jiang
Location: Guangzhou | Day: TBD
Show Abstract
The Clique Interdiction Problem (CIP) aims to minimize the size of the largest clique in a given graph by removing a given number of vertices. The CIP models a special Stackelberg game and has important applications in fields such as pandemic control and terrorist identification. However, the CIP is a bilevel graph optimization problem, making it very challenging to solve. Recently, data reduction techniques have been successfully applied in many (single-level) graph optimization problems like vertex cover. Motivated by this, we investigate a set of novel reduction rules and design a reduction-based algorithm, RECIP, for practically solving the CIP. RECIP enjoys an effective preprocessing procedure that systematically reduces the input graph, making the problem much easier to solve. Extensive experiments on 124 large real-world networks demonstrate the superior performance of RECIP and validate the effectiveness of the proposed reduction rules.
3914: State Feedback Enhanced Graph Differential Equations for Multivariate Time Series Forecasting
Authors: Jiaxu Cui, Qipeng Wang, Yiming Zhao, Bingyi Sun, Pengfei Wang, Bo Yang
Location: Guangzhou | Day: TBD
Show Abstract
Multivariate time series forecasting holds significant theoretical and practical importance in various fields, including web analytics and transportation. Recently, graph neural networks and graph differential equations have shown exceptional capabilities in modeling spatio-temporal features. However, existing methods often suffer from over-smoothing, hindering real-world problem-solving. In this work, we analyze the graph propagation process as a dynamical system and propose a novel feedback mechanism to enhance representation power, adaptively adjusting the representations to align with desired performance outcomes, thereby fundamentally mitigating the issue of over-smoothing. Moreover, we introduce an effective multivariate time series forecasting model called SF-GDE, based on the proposed graph propagation with the feedback mechanism. Intensive experiments are conducted on three real-world datasets from diverse fields. Results show that SF-GDE outperforms the state of the art, and the feedback mechanism can serve as a universal booster to improve the performance of graph propagation models.
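The feedback mechanism can be pictured as adding a corrective term to a discretized graph ODE. A minimal sketch, assuming an Euler solver, a fixed gain, and a reference state h_ref (all illustrative; the abstract does not specify them):

```python
import torch
import torch.nn as nn

class FeedbackGraphODE(nn.Module):
    # Euler-discretized graph ODE with a state-feedback term that pulls
    # node representations back toward a reference state, countering
    # over-smoothing. Gain, reference, and step sizes are assumptions.
    def __init__(self, dim, gain=0.1):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.gain = gain

    def forward(self, h, adj_norm, h_ref, steps=4, dt=0.25):
        for _ in range(steps):
            drift = torch.tanh(self.lin(adj_norm @ h)) - h   # graph propagation
            feedback = self.gain * (h_ref - h)               # corrective feedback
            h = h + dt * (drift + feedback)
        return h
```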
3926: LoD: Loss-difference OOD Detection by Intentionally Label-Noisifying Unlabeled Wild Data
Authors: Chuanxing Geng, Qifei Li, Xinrui Wang, Dong Liang, Songcan Chen, Pong C. Yuen
Location: Guangzhou | Day: TBD
Show Abstract
Using unlabeled wild data containing both in-distribution (ID) and out-of-distribution (OOD) data to improve the safety and reliability of models has recently received increasing attention. Existing methods either design customized losses for labeled ID and unlabeled wild data and then perform joint optimization, or first filter out OOD data from the latter and then learn an OOD detector. While achieving varying degrees of success, two potential issues remain: (i) labeled ID data typically dominates the learning of models, inevitably making models tend to fit OOD data as ID; (ii) the selection of thresholds for identifying OOD data in unlabeled wild data usually faces a dilemma due to the unavailability of pure OOD samples. To address these issues, we propose a novel loss-difference OOD detection framework (LoD) that intentionally label-noisifies unlabeled wild data. This operation not only enables labeled ID data and the OOD data in unlabeled wild data to jointly dominate the models’ learning but also ensures the distinguishability of the losses between ID and OOD samples in unlabeled wild data, allowing classic clustering techniques (e.g., K-means) to filter out these OOD samples without requiring thresholds any longer. We also provide a theoretical foundation for LoD’s viability, and extensive experiments verify its superiority.
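A simplified sketch of the loss-difference idea: noisify the wild set with random labels, score samples by loss, and split with 2-means instead of a threshold. In the paper the noisified wild data also participates in training, which is what separates the two loss populations; that loop is omitted here, and the higher-loss-cluster-is-OOD rule is an illustrative assumption.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

@torch.no_grad()
def lod_split(model, wild_x, num_classes, seed=0):
    # Assign random ("noisified") labels to the wild set, score each
    # sample by its loss, and split the losses with 2-means so that no
    # explicit threshold is needed.
    g = torch.Generator().manual_seed(seed)
    noisy_y = torch.randint(0, num_classes, (wild_x.shape[0],), generator=g)
    losses = F.cross_entropy(model(wild_x), noisy_y, reduction="none")
    feats = losses.cpu().numpy().reshape(-1, 1)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats)
    ood_cluster = labels[np.argmax(feats)]   # assume higher-loss cluster is OOD
    return labels == ood_cluster             # boolean mask over wild_x
```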
3931: Rethinking Removal Attack and Fingerprinting Defense for Model Intellectual Property Protection: A Frequency Perspective
Authors: Cheng Zhang, Yang Xu, Tingqiao Huang, Zixing Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Training deep neural networks is resource-intensive, making it crucial to protect their intellectual property from infringement. However, current model ownership resolution (MOR) methods predominantly address general removal attacks that involve weight modifications, with limited research considering alternative attack perspectives. In this work, we propose a frequency-based model ownership removal attack, grounded in a key observation: modifying a model’s high-frequency coefficients does not significantly impact its performance but does alter its weights and decision boundary. This change invalidates existing MOR methods. We further propose a frequency-based fingerprinting technique as a defense mechanism. By extracting frequency-domain characteristics instead of the decision boundary or model weights, our fingerprinting defense is effective against the proposed frequency-based removal attack and demonstrates robustness against existing general removal attacks. The experimental results show that the frequency-based removal attack can easily defeat state-of-the-art white-box watermarking and fingerprinting schemes while preserving model performance, and that the proposed defense method is also effective. Our code is released at: https://github.com/huangtingqiao/RRA-IJCAI25.
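The attack's core observation can be sketched in a few lines: perturb only the high-frequency band of a weight matrix in the 2-D Fourier domain. The radial cutoff and noise scale below are illustrative assumptions; the paper characterizes which coefficients can change without degrading accuracy.

```python
import torch

def perturb_high_frequencies(weight, cutoff=0.75, noise_scale=1e-3):
    # Jitter only the high-frequency band of a 2-D weight matrix: FFT,
    # shift DC to the center, add noise outside a radial cutoff, invert.
    spec = torch.fft.fftshift(torch.fft.fft2(weight))
    h, w = weight.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    cy, cx = (h - 1) / 2, (w - 1) / 2
    # Normalized distance from the (centered) zero frequency, in [0, 1].
    radius = torch.sqrt(((yy - cy) / h) ** 2 + ((xx - cx) / w) ** 2) / (0.5 ** 0.5)
    mask = (radius > cutoff).to(spec.real.dtype)
    spec = spec + mask * noise_scale * torch.randn_like(spec.real)
    return torch.fft.ifft2(torch.fft.ifftshift(spec)).real
```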
3935: Enabling Visual Foundation Models to Teach Compact Students via Mixture of Distillation
Authors: Xinye Yang, Shang Wang, Li Luking, Yipeng Chen
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we present a novel Mixture of Distillation (MoD) framework for distilling lightweight student models using Visual Foundation Models (VFMs) as teachers. Knowledge distillation (KD) is a crucial training strategy for improving model performance. However, conventional KD methods face two main challenges: (1) selecting and training appropriate teacher models and (2) designing effective knowledge distillation techniques. To address the first challenge, we leverage recent VFMs like CLIP, Grounding DINO, and SAM as teachers, capitalizing on their remarkable zero-shot generalization abilities and low fine-tuning requirements for new tasks, thereby avoiding expensive retraining of teachers. For the second challenge, our MoD framework focuses on extracting and decomposing the feature and logit knowledge from VFMs into multiple knowledge experts, which capture modality-specific information across batches, channels, and instances. Each knowledge expert undergoes separate projection, reshaping, normalization, and learnable magnitude operations. Then, we employ sparse knowledge gates with a softmax function followed by a KeepTopK operation for the different knowledge experts. In this way, our MoD not only bridges the distillation gap between VFMs and students but also allows the adaptive transfer of useful knowledge across different domains. Extensive experiments on various classification, detection, and medical segmentation tasks validate the effectiveness of our approach over other models. Moreover, our MoD framework demonstrates the potential for transferring zero-shot abilities from VFMs without relying on ground-truth labels. Notably, our MoD achieves impressive performance, attaining 72.48% for RepViT with a 76.20% CLIP teacher on ImageNet-1K without annotations.
3946: Priority Guided Explanation for Knowledge Tracing with Dual Ranking and Similarity Consistency
Authors: Fan Li, Tiancheng Zhang, Yifang Yin, Minghe Yu, Mengxiang Wang, Ge Yu
Location: Guangzhou | Day: TBD
Show Abstract
Knowledge tracing plays a pivotal role in enabling personalized learning on online platforms. While deep learning-based approaches have achieved impressive predictive performance, their limited interpretability poses a significant barrier to practical adoption. Existing explanation methods primarily focus on specific model architectures and fall short in 1) explicitly prioritizing critical interactions to generate fine-grained explanations, and 2) maintaining similarity consistency across interaction importance. These limitations hinder actionable insights for improving student outcomes. To bridge the gap, we propose a model-agnostic approach that provides enhanced explanations applicable to diverse knowledge tracing methods. Specifically, we propose a novel ranking loss designed to explicitly optimize the importance ranking of past interactions by comparing their corresponding perturbed outputs. Furthermore, we introduce a similarity loss to capture temporal dependencies, ensuring consistency in the assigned importance scores for conceptually similar interactions. Extensive experiments conducted on various knowledge tracing models and benchmark datasets demonstrate substantial enhancements in explanation quality.
3958: Public Signaling in Markets with Information Asymmetry Using a Limited Number of Signals
Authors: Xu Zhao, Ren Liu, Weiran Shen
Location: Guangzhou | Day: TBD
Show Abstract
Consider a market with a seller and many buyers. The seller has one kind of item for sale to the buyers. Each item has a quality, and each buyer has a private type. The quality is only known to the seller, and the buyers only have a prior belief about the quality. A third party (e.g., an intermediary or product reviewer) is able to reveal information about the actual quality by using a so-called signaling scheme. After receiving the information, buyers can update their beliefs accordingly and decide whether to buy the items.
We consider the third party’s problem of maximizing the purchasing probability by sending signals. However, the optimal signaling scheme has implementation issues, as the number of signals in the optimal scheme equals the number of buyer types, which can be exceedingly large or even infinite. We therefore investigate whether a finite and limited set of signals can still approximate the performance of the optimal signaling scheme. Unfortunately, our results show that with a finite number of signals, no signaling scheme can achieve a certain fraction of the performance of the optimal signaling scheme. This limitation persists even under the regularity or the monotone hazard rate assumption. Nevertheless, we identify a mild technical condition under which the third party can approximate the optimal performance within a constant factor by employing only two signals.
We also conduct extensive experiments to substantiate our theoretical results. These experiments compare the performance of small signal sets across different value distributions. Despite the negative results, our experimental results show that using only a small number of signals achieves fairly reasonable performance in the average case.
3991: Dynamic Anchor-based Ensemble Clustering via Hypergraph Reconstruction
Authors: Jiaxuan Xu, Lei Duan, Xinye Wang, Liang Du
Location: Guangzhou | Day: TBD
Show Abstract
Ensemble clustering learns a consensus result by integrating a set of base clustering results. Recently, anchor-based methods construct an anchor similarity matrix to represent the affinity relationships among samples, significantly improving computational efficiency. However, these methods struggle with fixed anchors generated by static anchor learning strategies, which lead to a low-quality anchor similarity matrix and poor clustering accuracy. To address this issue, we propose a novel method named dynamic anchor-based ensemble clustering via hypergraph reconstruction (YACHT). Specifically, YACHT first transforms the base clustering results into a hypergraph and designs a novel hypergraph enhancement strategy to improve the reliability of the initial hypergraph. YACHT then reconstructs the hypergraph through matrix factorization and introduces a mapping matrix to filter out redundant information, capturing a high-quality anchor similarity matrix. Next, YACHT incorporates the hypergraph into the optimization objective to achieve hypergraph updates. To ensure the accuracy of these updates, we impose a hypergraph regularizer and a local consensus information alignment term. The alignment term is implemented by minimizing the discrepancy between the label partition derived from the hypergraph regularizer and the local consensus information indicator matrix extracted from the base clustering results. Extensive experimental results demonstrate the outstanding performance of the proposed YACHT. The code is available at https://github.com/scu-kdde/YACHT.
4015: EVICheck: Evidence-Driven Independent Reasoning and Combined Verification Method for Fact-Checking
Authors: Lingxiao Wang, Lei Shi, Feifei Kou, Ligu Zhu, Chen Ma, Pengfei Zhang, Mingying Xu, Zeyu Li
Location: Guangzhou | Day: TBD
Show Abstract
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) have demonstrated significant potential in automated fact-checking. However, existing methods suffer from insufficient evidence utilization and a lack of explicit verification criteria. Specifically, these approaches aggregate evidence for collective reasoning without independently analyzing each piece, hindering their ability to leverage the available information thoroughly. Additionally, they rely on simple prompts or few-shot learning for verification, which makes truthfulness judgments less reliable, especially for complex claims. To address these limitations, we propose EVICheck, a novel method that enhances evidence utilization and introduces explicit verification criteria. Our approach reasons over each piece of evidence independently and synthesizes the results, enabling more thorough exploration and enhanced interpretability. Additionally, by incorporating fine-grained truthfulness criteria, we make the model’s verification process more structured and reliable, especially when handling complex claims. Experimental results on the public RAWFC dataset demonstrate that EVICheck achieves state-of-the-art performance across all evaluation metrics. Our method shows strong potential in fake news verification, significantly improving accuracy.
4054: DPMamba: Distillation Prompt Mamba for Multimodal Remote Sensing Image Classification with Missing Modalities
Authors: Yueguang Yang, Jiahui Qu, Ling Huang, Wenqian Dong
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal remote sensing image classification (RSIC) has emerged as a key focus in Earth observation, driven by its capacity to extract complementary information from diverse sources. Existing methods struggle with modality absence caused by weather or equipment failures, leading to performance degradation. As a solution, knowledge distillation-based methods train student networks (SNs) using a full-modality teacher, but they usually require training a separate SN for each modality-absence scenario, increasing complexity. To this end, we propose a unified Distillation Prompt Mamba (DPMamba) framework for multimodal RSIC with missing modalities. DPMamba leverages knowledge distillation in a shared text semantic space to optimize learnable prompts, transforming them from “placeholder” to “adaptation” states by enriching missing modality information with full-modality knowledge. To achieve this, we focus on two main aspects: first, we propose a new modality-aware Mamba for dynamically and hierarchically extracting cross-modality interactive features, providing richer, contextually relevant representations for backpropagation-based optimization of prompts; and second, we introduce a novel text-bridging distillation method to efficiently transfer full-modality knowledge, guiding the inclusion of missing modality information into prompts. Extensive evaluations demonstrate the effectiveness and robustness of the proposed DPMamba.
4056: 2D Gaussian Splatting for Outdoor Scene Decomposition and Relighting
Authors: Wei Feng, Kangrui Ye, Qi Zhang, Qian Zhang, Nan Li
Location: Guangzhou | Day: TBD
Show Abstract
Gaussian splatting techniques have recently revolutionized outdoor scene decomposition and relighting through multi-view images. However, achieving high rendering quality still requires a fixed lighting condition across all input views, which is costly or even impractical to capture in outdoor scenes. In this paper, we propose outdoor scene decomposition and relighting with 2D Gaussian splatting (OSDR-GS), a novel inverse rendering strategy under changing and unknown outdoor lighting conditions. Firstly, we present a lighting-based group learning framework that categorizes input images into multiple lighting groups, so as to learn the lighting of each group individually. Secondly, OSDR-GS introduces a fine-grained outdoor lighting component that represents sun-light and sky-light separately, both adjusted adaptively via correlative exposure factors. Finally, we construct a visibility-driven shadow module to realistically characterize the nuanced interplay of light and occlusion, eliminating the uncertainty of dark pixels in lighting-based group learning. Extensive experiments on multiple challenging outdoor datasets validate the effectiveness of OSDR-GS, which achieves state-of-the-art performance in inverse rendering under changing lighting.
4065: Multi-Source Collaborative Style Augmentation and Domain-Invariant Learning for Federated Domain Generalization
Authors: Yikang Wei
Location: Guangzhou | Day: TBD
Show Abstract
Federated domain generalization aims to learn a generalizable model from multiple decentralized source domains for deployment on unseen target domains. Style augmentation methods have achieved great progress on domain generalization. However, existing style augmentation methods either explore the data styles within each isolated source domain or interpolate the style information across existing source domains under the data decentralization scenario, which leads to a limited style space. To address this issue, we propose a Multi-source Collaborative Style Augmentation and Domain-invariant learning method (MCSAD) for federated domain generalization. Specifically, we propose a multi-source collaborative style augmentation module to generate data in a broader style space. Furthermore, we conduct domain-invariant learning between the original data and the augmented data, via cross-domain feature alignment within the same class and class-relation ensemble distillation between different classes, to learn a domain-invariant model. By alternately conducting collaborative style augmentation and domain-invariant learning, the model generalizes well to unseen target domains. Extensive experiments on multiple domain generalization datasets indicate that our method significantly outperforms state-of-the-art federated domain generalization methods.
4066: CAN-ST: Clustering Adaptive Normalization for Spatio-temporal OOD Learning
Authors: Min Yang, Yang An, Jinliang Deng, Xiaoyu Li, Bin Xu, Ji Zhong, Xiankai Lu, Yongshun Gong
Location: Guangzhou | Day: TBD
Show Abstract
Spatio-temporal data mining is crucial for decision-making and planning in diverse domains. However, in real-world scenarios, training and testing data are often not independent or identically distributed due to rapid changes in data distributions over time and space, resulting in spatio-temporal out-of-distribution (OOD) challenges. This non-stationarity complicates accurate predictions and has motivated research efforts focused on mitigating non-stationarity through normalization operations. Existing methods, nonetheless, often address individual time series in isolation, neglecting correlations across series, which limits their capacity to handle complex spatio-temporal dynamics and results in suboptimal solutions. To overcome these challenges, we propose Clustering Adaptive Normalization (CAN-ST), a general and model-agnostic method that mitigates non-stationarity by capturing both localized distributional changes and shared patterns across nodes via adaptive clustering and a parameter register. As a plugin, CAN-ST can be easily integrated into various spatio-temporal prediction models. Extensive experiments on multiple datasets with diverse forecasting models demonstrate that CAN-ST consistently improves performance by over 20% on average and outperforms state-of-the-art normalization methods.
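A minimal sketch of cluster-level normalization, the idea CAN-ST generalizes: group series by simple distributional statistics and normalize each group with shared parameters (kept for de-normalizing predictions). The adaptive clustering and parameter register of the actual method go beyond this stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_adaptive_normalize(x, n_clusters=4):
    # x: (num_nodes, time). Nodes that share a distributional pattern are
    # normalized with shared cluster statistics rather than per-series
    # statistics, so related series stay on a common scale.
    stats = np.stack([x.mean(axis=1), x.std(axis=1)], axis=1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(stats)
    out = np.empty_like(x, dtype=float)
    params = {}
    for c in range(n_clusters):
        idx = labels == c
        mu, sigma = x[idx].mean(), x[idx].std() + 1e-8
        out[idx] = (x[idx] - mu) / sigma
        params[c] = (mu, sigma)   # kept for de-normalizing predictions
    return out, labels, params
```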
4075: CD^2: Constrained Dataset Distillation for Few-Shot Class-Incremental Learning
Authors: Kexin Bao, Daichi Zhang, Hansong Zhang, Yong Li, Yutao Yue, Shiming Ge
Location: Guangzhou | Day: TBD
Show Abstract
Few-shot class-incremental learning (FSCIL), which performs classification continuously with only a few training samples, has received significant attention and suffers from the key problem of catastrophic forgetting. Existing methods usually employ an external memory to store previous knowledge and treat it equally with the incremental classes, which cannot properly preserve previous essential knowledge. To solve this problem, and inspired by recent distillation works on knowledge transfer, we propose a framework termed Constrained Dataset Distillation (CD^2) to facilitate FSCIL, which includes a dataset distillation module (DDM) and a distillation constraint module (DCM). Specifically, the DDM synthesizes highly condensed samples guided by the classifier, forcing the model to learn compact, essential class-related clues from a few incremental samples. The DCM introduces a designed loss to constrain the previously learned class distribution, which preserves distilled knowledge more sufficiently. Extensive experiments on three public datasets show the superiority of our method against other state-of-the-art competitors.
4090: PatternCIR Benchmark and TisCIR: Advancing Zero-Shot Composed Image Retrieval in Remote Sensing
Authors: Zhechun Liang, Tao Huang, Fangfang Wu, Shiwen Xue, Zhenyu Wang, Weisheng Dong, Xin Li, Guangming Shi
Location: Guangzhou | Day: TBD
Show Abstract
Remote sensing composed image retrieval (RSCIR) is a new vision-language task that takes a composed query of an image and text, aiming to search for a target remote sensing image satisfying two conditions from intricate remote sensing imagery. However, the existing attribute-based benchmark Patterncom in RSCIR has significant flaws, including the lack of query text sentences and paired triplets, thus making it unable to evaluate the latest methods. To address this, we propose the Zero-Shot Query Text Generator (ZS-QTG) that can generate full query text sentences based on attributes, and then, by capitalizing on ZS-QTG, we develop the PatternCIR benchmark. PatternCIR rectifies Patterncom’s deficiencies and enables the evaluation of existing methods. Additionally, we explore zero-shot composed image retrieval methods that do not rely on massive pre-collected triplets for training. Existing methods use only the text during retrieval, performing poorly in RSCIR. To improve this, we propose Text-image Sequential Training of Composed Image Retrieval (TisCIR). TisCIR undergoes sequential training of multiple self-masking projection and fine-grained image attention modules, which endows it with the capacity to filter out conflicting information between the image and text, enhancing the retrieval by utilizing both modalities in harmony. TisCIR outperforms existing methods by 12.40% to 62.03% on PatternCIR, achieving state-of-the-art performance in RSCIR. The data and code are available here.
4106: Spatio-temporal Prototype-based Hierarchical Learning for OD Demand Prediction
Authors: Shilu Yuan, Xiaoyu Li, Wenqian Mu, Ji Zhong, Meng Chen, Haoliang Sun, Yongshun Gong
Location: Guangzhou | Day: TBD
Show Abstract
Origin-Destination (OD) demand prediction is a pivotal yet highly challenging task in intelligent transportation systems, aiming to accurately forecast cross-region ridership flows within urban networks. While previous studies have focused on modeling node-to-node relationships, most of them neglect the fact that nodes (regions/stations) exhibit similar spatio-temporal (ST) patterns, which are termed spatio-temporal prototypes. Capturing these prototypes is crucial for understanding the unified ST dependencies across the network. To bridge this gap, we propose STPro, an ST prototype-based hierarchical model with a dual-branch structure that extracts ST features from the micro and macro perspectives. At the micro level, our model learns unified ST features of individual nodes, while at the macro level, it employs dynamic clustering to identify city-wide ST prototypes, thereby uncovering latent patterns of urban mobility. Besides, we leverage the different roles of nodes as origins and destinations by constructing dual O and D branches and learning mutual information to model their intricate interactions and correlations. Extensive experiments on two public datasets demonstrate that our STPro outperforms recent state-of-the-art baselines, achieving remarkable predictive improvements in OD demand prediction.
4129: Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
Authors: Ruixiao Li, Fahao Chen, Peng Li
Location: Guangzhou | Day: TBD
Show Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically exhibit uncertain execution time, which poses a significant challenge of efficiently scheduling requests in these systems. Existing work estimates execution time based solely on predicted output length, which could be inaccurate because execution time depends on both output length and token acceptance rate of verification by the LLM. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a number of inference requests, LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. When token acceptance rate is dynamic and execution time is difficult to estimate, LAPS-SD maintains multiple priority queues and allows request execution preemption across different queues. Once the token acceptance rate becomes stable, LAPS-SD can accurately estimate the execution time and schedule requests accordingly. Extensive experiments show that LAPS-SD reduces inference latency by approximately 39% compared to state-of-the-art scheduling methods.
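The scheduling discipline underneath LAPS-SD resembles least-attained-service with preemption. A generic sketch follows; the acceptance-rate-aware execution-time estimation and queue promotion are left out, as the abstract does not specify them.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    attained: float                  # service received so far; the heap key
    rid: int = field(compare=False)

class LASScheduler:
    # Minimal least-attained-service scheduler: always run the request
    # that has received the least service, preempting after each quantum.
    def __init__(self):
        self.queue = []

    def submit(self, rid):
        heapq.heappush(self.queue, Request(0.0, rid))

    def run(self, run_quantum, quantum=0.05):
        # run_quantum(rid, quantum) decodes for one quantum and returns
        # True once the request has finished (a hypothetical callback).
        while self.queue:
            req = heapq.heappop(self.queue)
            if not run_quantum(req.rid, quantum):   # not done: requeue
                heapq.heappush(self.queue, Request(req.attained + quantum, req.rid))
```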
4146: Squeezing Context into Patches: Towards Memory-Efficient Ultra-High Resolution Semantic Segmentation
Authors: Wang Liu, Puhong Duan, Xudong Kang, Shutao Li
Location: Guangzhou | Day: TBD
Show Abstract
Segmenting ultra-high-resolution (UHR) images poses a significant challenge due to constraints on GPU memory, leading to a trade-off between detailed local information and a comprehensive contextual understanding. Current UHR methods often employ a multi-branch encoder to handle local and contextual information, which can be memory-intensive. To address the need for both high accuracy and low memory usage in processing UHR images, we introduce a memory-efficient semantic segmentation approach by squeezing context information into local patches (SCPSeg). Our method integrates the processing of local and contextual information within a single-branch encoder. Specifically, we introduce a context squeezing module (CSM) designed to compress global context details into local patches, enabling segmentation networks to perceive broader image contexts. Additionally, we propose a super-resolution guided local feature alignment (LFA) technique to improve segmentation precision by aligning local feature relationships. This approach calculates similarities within sliding windows, avoiding heavy computational costs during the training phase. We evaluate the effectiveness of our proposed method on four widely used UHR segmentation benchmarks. Experimental results demonstrate that our approach enhances UHR segmentation accuracy without incurring additional memory overhead during the inference stage. The code is available at https://github.com/StuLiu/SCPSeg.
4147: Identifying Causal Mechanism Shifts Under Additive Models with Arbitrary Noise
Authors: Yewei Xia, Xueliang Cui, Hao Zhang, Yixin Ren, Feng Xie, Jihong Guan, Ruxin Wang, Shuigeng Zhou
Location: Guangzhou | Day: TBD
Show Abstract
In many real-world scenarios, the goal is to identify variables whose causal mechanisms change across related datasets, for example, detecting abnormal root nodes in manufacturing, or identifying key genes that influence cancer by analyzing differences in gene regulatory mechanisms between healthy individuals and cancer patients. This can be done by recovering the causal structure of each dataset independently and then comparing them to identify differences, but the performance is often suboptimal. Typically, existing methods directly identify causal mechanism shifts based on linear additive noise models (ANMs) or by imposing restrictive assumptions on the noise distribution. In this paper, we introduce CMSI, a novel and more general algorithm based on nonlinear ANMs that identifies variables with shifting causal mechanisms under arbitrary noise distributions. Evaluated on various synthetic datasets, CMSI consistently outperforms existing baselines in terms of F1 score. Additionally, we demonstrate CMSI’s applicability on gene expression datasets of ovarian cancer patients at different disease stages.
4173: Breaking the Self-Evaluation Barrier: Reinforced Neuro-Symbolic Planning with Large Language Models
Authors: Jie-Jing Shao, Hong-Jie You, Guohao Cai, Quanyu Dai, Zhenhua Dong, Lan-Zhe Guo
Location: Guangzhou | Day: TBD
Show Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in language understanding and commonsense reasoning, yet they often struggle with constraint satisfaction in planning problems. Previous studies relying on test-time improvement with self-evaluation fail to address this limitation effectively. In this work, we identify this critical gap and propose Reinforced Neuro-Symbolic Planning, a novel neuro-symbolic framework that enhances LLM-powered planning by incorporating a symbolic verifier. The verifier provides explicit feedback on constraint satisfaction, enabling iterative refinement of the state evaluation. Specifically, we utilize the outcome feedback from each logical goal to update the process value along planning paths through a reinforcement value function maximization objective. We further employ T-norms to aggregate the satisfaction levels of multiple constraints, which provides more effective guidance for the test-time search. Our framework bridges the strengths of neural and symbolic methods, leveraging the generative power of LLMs while ensuring rigorous adherence to constraints through symbolic verification. Extensive experiments demonstrate that our approach significantly improves planning accuracy and constraint satisfaction across various domains, outperforming traditional self-evaluation methods and highlighting the potential of hybrid neuro-symbolic systems for complex constrained planning tasks.
4183: MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces
Authors: Shaojun E, Yuchen Yang, Jiaheng Wu, Yan Zhang, Tiejun Zhao, Ziyan Chen
Location: Guangzhou | Day: TBD
Show Abstract
In the latest advancements in multimodal learning, effectively addressing the spatial and semantic losses of visual data after encoding remains a critical challenge. This is because the performance of large multimodal models is positively correlated with the coupling between visual encoders and large language models. Existing approaches often face issues such as vector gaps or semantic disparities, resulting in information loss during the propagation process. To address these issues, we propose MAGE (Multimodal Alignment and Generation Enhancement), a novel framework that bridges the semantic spaces of vision and text through an innovative alignment mechanism. By introducing the Intelligent Alignment Network (IAN), MAGE achieves dimensional and semantic alignment. To reduce the gap between synonymous heterogeneous data, we employ a training strategy that combines cross-entropy and mean squared error, significantly enhancing the alignment effect. Moreover, to enhance MAGE’s “Any-to-Any” capability, we developed a fine-tuning dataset for multimodal tool-calling instructions to expand the model’s output capability boundaries. Finally, our proposed multimodal large model architecture, MAGE, achieved significantly better performance compared to similar works across various evaluation benchmarks, including MME, MMBench, and SEED. Complete code and appendix are available at: https://github.com/GTCOM-NLP/MAGE
4190: Capturing Individuality and Commonality Between Anchor Graphs for Multi-View Clustering
Authors: Zhoumin Lu, Yongbo Yu, Linru Ma, Feiping Nie, Rong Wang
Location: Guangzhou | Day: TBD
Show Abstract
The use of anchors often leads to better efficiency and scalability, making them highly favored. However, anchor-based multi-view subspace learning faces a challenge: a unified anchor graph overemphasizes the commonality between views and fails to adequately capture view-specific individuality. This has led some models to independently explore the individuality of each view before aligning and integrating them, often achieving better performance but making the process more cumbersome. Therefore, this paper proposes a new model that simultaneously captures the individuality and commonality between anchor graphs for multi-view clustering. The model has three notable advantages. First, it allows view-specific anchor graphs to align in real time with a common anchor graph as a reference, eliminating the need for post-alignment. Second, it enforces a cluster-wise structure among anchors and balances the sample distribution among them, providing strong discriminative power. Lastly, it maintains linear complexity with respect to the numbers of samples and anchors, avoiding the significant time costs associated with their increase. Comprehensive experiments demonstrate the effectiveness and efficiency of our method compared to various state-of-the-art algorithms.
4194: Balancing User-Item Structure and Interaction with Large Language Models and Optimal Transport for Multimedia Recommendation
Authors: Haodong Li, Lianyong Qi, Weiming Liu, Xiaolong Xu, Wanchun Dou, Yang Cao, Xuyun Zhang, Amin Beheshti, Xiaokang Zhou
Location: Guangzhou | Day: TBD
Show Abstract
The rapid growth of multimedia content has driven the development of recommender systems. Most previous work focuses on uncovering latent relationships among items to learn better representations. However, this approach does not sufficiently account for user affinities, potentially leading to an imbalance in the structure modeling of users and items. Moreover, the sparsity and imbalance of user-item interactions further hinder effective representation learning. To address these challenges, we propose a framework called BLAST, which balances structures and interactions via large language models and optimal transport for multimodal recommendation. Specifically, we utilize large language models to summarize side information and generate user profiles. Based on these profiles, we design an intra- and inter-entity structure balancing module to capture item-item and user-user relationships, integrating these affinities into the final representations. Furthermore, we impose constraints on negative sample selection, augment the training data with false negative items and the optimal transport algorithm, thereby leading to smoother interactions. We evaluate BLAST on three real-world datasets, and the results demonstrate that our method significantly outperforms state-of-the-art baselines, which validates the superiority and effectiveness of BLAST.
4202: Soft Reasoning Paths for Knowledge Graph Completion
Authors: Yanning Hou, Sihang Zhou, Ke Liang, Lingyuan Meng, Xiaoshu Chen, Ke Xu, Siwei Wang, Xinwang Liu, Jian Huang
Location: Guangzhou | Day: TBD
Show Abstract
Reasoning paths are a reliable source of information in knowledge graph completion (KGC), from which algorithms can extract strong clues about the actual relation between entities. However, in real-world applications, it is difficult to guarantee that computationally affordable paths exist toward all candidate entities. According to our observation, prediction accuracy drops significantly when paths are absent. To make the proposed algorithm more stable under missing-path circumstances, we introduce soft reasoning paths. Concretely, a specific learnable latent path embedding is concatenated to each relation to help better model the characteristics of the corresponding paths. The combination of the relation and the corresponding learnable embedding is termed a soft path in our paper. By aligning the soft paths with the reasoning paths, a learnable embedding is guided to learn a generalized path representation of the corresponding relation. In addition, we introduce a hierarchical ranking strategy to make full use of information about the entity, relation, path, and soft path, improving both the efficiency and accuracy of the model. Extensive experimental results illustrate that our algorithm outperforms the compared state-of-the-art algorithms by a notable margin. Our code will be released at https://github.com/7HHHHH/SRP-KGC.
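The soft-path construction itself is compact: give every relation a second learnable embedding and concatenate it to the relation embedding at scoring time. A minimal sketch; the alignment loss that guides soft paths toward real path encodings is omitted.

```python
import torch
import torch.nn as nn

class SoftPathRelation(nn.Module):
    # Each relation gets a learnable latent "soft path" embedding that is
    # concatenated to its relation embedding, so scoring still has a
    # path-like signal when no explicit reasoning path exists.
    def __init__(self, num_relations, rel_dim, path_dim):
        super().__init__()
        self.rel = nn.Embedding(num_relations, rel_dim)
        self.soft_path = nn.Embedding(num_relations, path_dim)

    def forward(self, rel_ids):
        return torch.cat([self.rel(rel_ids), self.soft_path(rel_ids)], dim=-1)
```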
4221: ADPFedGNN: Adaptive Decoupling Personalized Federated Graph Neural Network
Authors: Zeli Guan, Yawen Li, Junping Du, Runqing Tang, Xiaolong Meng
Location: Guangzhou | Day: TBD
Show Abstract
Personalized federated graph neural networks (PFGNN) are an emerging technology that allows multiple graph data owners to collaboratively train personalized models without sharing raw data. However, the Non-IID nature of graph data can cause the coupling of global and local knowledge parameters, which disrupts the optimization in personalized federated learning. Additionally, node neighbors may carry global and local knowledge, and their direct inclusion in training may introduce noise, degrading federated model performance. In this work, we propose the Adaptive Decoupling Personalized Federated Graph Neural Network (ADPFedGNN), which leverages multi-party collaboration to train personalized models for classifying local client graph nodes. We use two automatically updated masks and mutual information minimization to decouple global and local parameters in FGNN. We employ reinforcement learning to adaptively select appropriate neighbors for training global or local knowledge-related parameters while filtering out irrelevant nodes. We also design a personalized federated masked parameter aggregation mechanism that efficiently updates local personalized model parameters and aggregates the masked parameters. Experimental results on three public datasets demonstrate that ADPFedGNN outperforms existing methods, achieving average improvements of 5.66%, 5.83%, and 12.45% in ACC, F1, and Recall, respectively.
4227: Multimodal Prior Learning with Double Constraint Alignment for Snapshot Spectral Compressive Imaging
Authors: Mingjin Zhang, Longyi Li, Fei Gao, Qiming Zhang, Jie Guo
Location: Guangzhou | Day: TBD
Show Abstract
The objective of snapshot spectral compressive imaging reconstruction is to recover the 3D hyperspectral image (HSI) from a 2D measurement. Existing methods either focus on network architecture design or simply introduce image-level priors into the model; in either case, they lack guiding information for accurate reconstruction. Recognizing that textual descriptions contain rich semantic information that can significantly enhance details, this paper introduces a novel framework, CAMM, which integrates text information into the model to improve performance. The framework comprises two key components: a Fine-grained Alignment Module (FAM) and a Multimodal Fusion Mamba (MFM). Specifically, FAM is used to reduce the knowledge gap between the RGB domain obtained by the pre-trained vision-language model and the HSI domain. Through the double constraints of distribution similarity and entropy, adaptive alignment of features of different complexity is realized, which makes the encoded features more accurate. MFM aims to identify the guiding effect of RGB features and text features on HSI in the spatial and channel dimensions. Instead of fusing features directly, it integrates image-level and text-level priors into Mamba’s state-space equation, so that each scanning step can be accurately guided. This positive feedback adjustment ensures the authenticity of the guiding information. To our knowledge, this is the first text-guided model for compressive spectral imaging. Extensive experimental results on the public datasets demonstrate the superior performance of CAMM, validating the effectiveness of our proposed method.
4263: AccCtr: Accelerating Training-Free Conditional Control For Diffusion Models
Authors: Longquan Dai, He Wang, Yiming Zhang, Shaomeng Wang, Jinhui Tang
Location: Guangzhou | Day: TBD
Show Abstract
In current training-free Conditional Diffusion Models (CDM), the sampling process is steered by the gradient, which measures the discrepancy between the guidance and the condition extracted by a pre-trained condition extraction network. These methods necessitate small guidance steps, resulting in longer sampling times.
To address the issue of slow sampling, we introduce AccCtr, a method that simplifies the conditional sampling algorithm by maximizing the sum of two objectives, where the local maximum set of one objective is contained within that of the other. Leveraging this relationship, we decompose the joint optimization into two parts, alternately maximizing each objective. By analyzing the steps involved in optimizing these objectives, we identify the most time-consuming steps and recommend retraining the condition extraction network, a relatively simple task, to reduce its computational cost.
Integrating AccCtr into current CDMs is a seamless task that does not impose a significant computational burden. Extensive testing has demonstrated that AccCtr offers superior sample quality and faster generation times.
4269: Multimodal Regression for Enzyme Turnover Rates Prediction
Authors: Bozhen Hu, Cheng Tan, Siyuan Li, Jiangbin Zheng, Jun Xia, Stan Z. Li
Location: Guangzhou | Day: TBD
Show Abstract
The enzyme turnover rate is a fundamental parameter in enzyme kinetics, reflecting the catalytic efficiency of enzymes. However, enzyme turnover rates remain scarce across most organisms due to the high cost and complexity of experimental measurements. To address this gap, we propose a multimodal framework for predicting the enzyme turnover rate by integrating enzyme sequences, substrate structures, and environmental factors. Our model combines a pre-trained language model and a convolutional neural network to extract features from protein sequences, while a graph neural network captures informative representations from substrate molecules. An attention mechanism is incorporated to enhance interactions between enzyme and substrate representations. Furthermore, we leverage symbolic regression via Kolmogorov-Arnold Networks to explicitly learn mathematical formulas that govern the enzyme turnover rate, enabling interpretable and accurate predictions. Extensive experiments demonstrate that our framework outperforms both traditional and state-of-the-art deep learning approaches. This work provides a robust tool for studying enzyme kinetics and holds promise for applications in enzyme engineering, biotechnology, and industrial biocatalysis.
4274: MASTER: A Multi-granularity Invariant Structure Clustering Scheme for Multi-view Clustering
Authors: Suixue Wang, Shilin Zhang, Qingchen Zhang, Peng Li, Weiliang Huo
Location: Guangzhou | Day: TBD
Show Abstract
Deep multi-view clustering has attracted increasing attention in pattern mining of data. However, most methods perform self-learning mechanisms in a single space, ignoring the fruitful structural information hidden in feature spaces at different levels. Meanwhile, they use a reconstruction constraint to learn generalized representations of samples, failing to explore the discriminative ability of complementary and consistent information. To address these challenges, a multi-granularity invariant structure clustering scheme (MASTER) is proposed, which defines a bottom-up process that extracts multi-level information at the sample, neighborhood, and category granularities from the low-level, high-level, and semantic feature spaces, respectively. Specifically, it leverages self-learning reconstruction with information-theoretic overclustering to capture the invariant sample structure in the low-level feature space. Then, it models the data diffusion of the clustering process in a reliable neighborhood to capture the invariant local structure in the high-level feature space. Meanwhile, it defines dual divergences induced by the space geometry to capture the invariant global structure in the semantic space. Finally, extensive experiments on 8 real-world datasets show that MASTER achieves state-of-the-art performance compared to 11 baselines.
4280: Towards Region-Adaptive Feature Disentanglement and Enhancement for Small Object Detection
Authors: Yanchao Bi, Yang Ning, Xiushan Nie, Xiankai Lu, Yongshun Gong, Leida Li
Location: Guangzhou | Day: TBD
Show Abstract
Current feature fusion strategies often fail to adequately account for the influence of activation intensity across different scales on small object features, which impedes the effective detection of small objects. To address this limitation, we propose the Region-Adaptive Feature Disentanglement and Enhancement (RAFDE) strategy, which improves both downsampling and feature fusion by leveraging activation intensity variations at multiple scales. First, we introduce the Boundary Transitional Region-enhanced Downsampling (BTRD) module, which enhances boundary transitional regions containing both strongly and weakly activated features, thereby mitigating the loss of crucial boundary information for small objects. Second, we present the Regional-Adaptive Feature Fusion (RAFF) module, which adaptively disentangles and fuses co-activated and uni-activated regions from adjacent levels into the current level, effectively reducing the risk of small objects being overwhelmed. Extensive experiments on several public datasets demonstrate that the RAFDE strategy is highly effective and outperforms state-of-the-art methods. The code is available at https://github.com/b-yanchao/RAFDE.git.
4295: TOTF: Missing-Aware Encoders for Clustering on Multi-View Incomplete Attributed Graphs
Authors: Mengyao Li, Xu Zhou, Jiapeng Zhang, Zhibang Yang, Cen Chen, Kenli Li
Location: Guangzhou | Day: TBD
Show Abstract
As network data in real life become multi-modal and multi-relational, multi-view attributed graphs have garnered significant attention. Numerous methods have achieved excellent performance in multi-view attributed graph clustering; however, they cannot efficiently handle incomplete-attribute scenarios, which are prevalent in many real-life applications. Motivated by this, we investigate the problem of multi-view incomplete attributed graph clustering for the first time. In particular, the TOTF (Train Once Then Freeze) framework is designed to train missing-aware encoders that capture view-specific information while ignoring the impact of incomplete attributes, and then employs the frozen encoders to uncover common information driven by clustering. After that, we propose a correlation strength-aware graph neural network, built on the inherent relationships among attributes, to enhance accuracy. We prove theoretically that traditional Generative Adversarial Networks (GANs) are unable to generate the unique real distribution; to address this issue, we further introduce a missing-position reminder mechanism into our intra-view adversarial games for better clustering results. Extensive experimental results demonstrate that our method achieves up to a 17% improvement in accuracy over state-of-the-art methods. The source code is available at https://anonymous.4open.science/r/TOTF-main.
4297: SecV: LLM-based Secure Verilog Generation with Clue-Guided Exploration on Hardware-CWE Knowledge Graph
Authors: Fanghao Fan, YingJie Xia, Li Kuang
Location: Guangzhou | Day: TBD
Show Abstract
Verilog is the primary Register Transfer Level (RTL) hardware description language, used to design the logical functions between registers in digital circuit systems. Recently, much cutting-edge research has emerged on leveraging Large Language Models (LLMs) to generate Verilog, aiming to effectively reduce errors and costs in the logic design of chips. However, these works mainly focus on the logical correctness or PPA (Power, Performance, Area) measurements of the generated results, while neglecting security problems in Verilog. In this study, we propose SecV, a novel and unified framework that generates secure Verilog through clue-guided exploration on a Common Weakness Enumeration (CWE) knowledge graph (KG) for chips. First, the builder of the KG utilizes an instance-adapted chain of thought (CoT) to extract entities and their relationships from raw Hardware-CWE corpora. Then, a fine-tuned BERT model is employed to verify the Hardware-CWE KG and collaborate with the builder iteratively to achieve a precise KG. Based on the Hardware-CWE KG, a clue-guided graph exploration paradigm is designed to facilitate collaborative inference of knowledge for generating secure Verilog with LLMs. Experiments demonstrate that SecV achieves 82.6% secure Verilog (free of specified CWEs) among the generated functionally correct Verilog, a 21.7% improvement over the SOTA.
4310: Not All Layers of LLMs Are Necessary During Inference
Authors: Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang
Location: Guangzhou | Day: TBD
Show Abstract
Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers. That is, not all layers of LLMs are necessary during inference. If we can predict at which layer the inferred results match the final results (produced by evaluating all layers), we could significantly reduce the inference cost. To this end, we propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance. AdaInfer relies on easily obtainable statistical features and classic classifiers like SVM. Experiments on well-known LLMs like the Llama2 series and OPT show that AdaInfer achieves an average pruning ratio of 17.8%, and up to 43% on sentiment tasks, with nearly no performance drop (<1%). Because AdaInfer does not alter LLM parameters, LLMs incorporated with AdaInfer maintain generalizability across tasks.
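A sketch of the early-exit loop: at each layer, derive cheap statistics from the next-token distribution and let a pre-fitted classifier (e.g., an SVM, as the abstract suggests) decide whether to stop. The exact feature set below is an assumption.

```python
import torch

@torch.no_grad()
def adaptive_early_exit(hidden_states, lm_head, stop_predictor):
    # Walk the per-layer hidden states of one forward pass; at each depth,
    # compute top-1 probability and the top-1/top-2 gap, then ask the
    # classifier whether the intermediate result already matches the
    # final answer closely enough to exit.
    for depth, h in enumerate(hidden_states):
        logits = lm_head(h[:, -1, :])                  # logits at this depth
        top2 = torch.topk(torch.softmax(logits, dim=-1), k=2, dim=-1).values
        feats = torch.stack([top2[:, 0], top2[:, 0] - top2[:, 1]], dim=-1)
        if stop_predictor(feats.cpu().numpy()).all():  # confident: exit early
            return logits, depth
    return logits, depth
```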
4319: High-Fidelity Road Network Generation with Latent Diffusion Models
Authors: Jinming Wang, Hongkai Wen, Geyong Min, Man Luo
Location: Guangzhou | Day: TBD
Show Abstract
Road networks are the veins of modern cities. Yet, maintaining up-to-date and accurate road network information is a persistent challenge, especially in areas with rapid urban changes or limited surveying resources. Crowdsourced trajectories, e.g., from GPS records collected by mobile devices and vehicles, have emerged as a powerful data source for continuously mapping urban areas. However, the inherent noise, irregular and often sparse sampling rates, and the vast variability in movement patterns make the problem of road network generation from trajectories a non-trivial task. Existing methods often approach this from an appearance-based perspective: they typically render trajectories as 2D density maps and then employ heuristic algorithms to extract road networks, leading to inevitable information loss and thus poor performance, especially when trajectories are sparse or ambiguities are present, e.g., flyovers. In this paper, we propose a novel approach, called GraphWalker, to generate high-fidelity road network graphs from raw trajectories in an end-to-end manner. We achieve this by designing a bespoke latent diffusion transformer, T2W-DiT, which treats input trajectories as generation conditions and gradually denoises samples from a latent space to obtain the corresponding walks on the underlying road network graph, then assembles them together as the final road network. Extensive experiments on multiple datasets demonstrate that the proposed GraphWalker can effectively generate high-quality road networks from noisy and sparse trajectories, showcasing significant improvements over the state of the art.
4334: Global Information Compensation Network for Image Denoising
Authors: Shifei Ding, Qidong Wang, Lili Guo
Location: Guangzhou | Day: TBD
Show Abstract
In image denoising research, discriminative models have achieved impressive results, owing mainly to the powerful local feature extraction ability of convolutional networks. However, there is still room for improvement due to insufficient utilization of global information. Although using fully connected layers or increasing network depth can supplement global information, this results in a significant increase in parameters and computational cost. To address these issues, we propose a global information compensation network (GICN) for image denoising. Firstly, in the shallow part of the network, we propose a global feature mining block that enhances the network’s ability to extract global information by combining non-local blocks with the Fourier transform, while improving the interpretability of the model. Secondly, between the encoder and decoder, we propose a cross-scale feature aggregation block to fuse information at different scales. Finally, we employ attention blocks to improve the skip connections and better capture long-distance dependencies. Extensive experimental results show that the proposed GICN effectively compensates for global information, achieves a balance between denoising efficiency and effectiveness, and surpasses mainstream methods on multiple benchmarks.
4336: Modular Deep Reinforcement Learning for Multi-Workload Offloading in Edge Networks
Authors: Hongchang Ke, Yan Ding, Lin Pan, Yang Chen, Jia Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Dynamic edge networks revolutionize mobile edge computing by enabling real-time applications in intelligent transportation, augmented reality, and industrial Internet of Things (IoT). Efficient workload offloading in dynamic edge networks is crucial for addressing the increasing demands of time-varying workloads while contending with limited computational and communication resources. Existing deep reinforcement learning (DRL)-based offloading decision-making schemes are inadequate for managing scenarios involving multiple workloads and edge servers, particularly when faced with time-varying workload arrivals and fluctuating channel states. To this end, we propose a flexible module weighted fusion DRL framework (DRL-MWF) for scalable and robust multi-workload offloading in edge environments. Unlike traditional monolithic networks, DRL-MWF employs a weighted fusion modular architecture that adapts flexibly to diverse workload distributions. Specifically, DRL-MWF introduces a state representation and normalization strategy to model state and workload characteristics, enabling precise and adaptive decision-making. Furthermore, we design two key mechanisms: a weighted policy correction method to stabilize learning and a prioritized experience replay with weighted importance sampling to accelerate convergence by emphasizing critical transitions. Extensive evaluations on real-world datasets demonstrate that DRL-MWF consistently outperforms state-of-the-art baselines. These results reveal DRL-MWF’s potential to transform workload offloading in next-generation edge computing systems, ensuring high performance in dynamic scenarios.
4362: CoderAgent: Simulating Student Behavior for Personalized Programming Learning with Large Language Models
Authors: Yi Zhan, Qi Liu, Weibo Gao, Zheng Zhang, Tianfu Wang, Shuanghong Shen, Junyu Lu, Zhenya Huang
Location: Guangzhou | Day: TBD
Show Abstract
Personalized programming tutoring, such as exercise recommendation, can enhance learners’ efficiency, motivation, and outcomes, which is increasingly important in modern digital education. However, the lack of sufficient and high-quality programming data, combined with the mismatch between offline evaluation and real-world learning, hinders the practical deployment of such systems. To address this challenge, many approaches attempt to simulate learner practice data, yet they often overlook the fine-grained, iterative nature of programming learning, resulting in a lack of interpretability and granularity. To fill this gap, we propose CoderAgent, an LLM-based agent that simulates students’ programming processes in a fine-grained manner without relying on real data. Specifically, we equip each human learner with an intelligent agent, the core of which lies in capturing the cognitive states of the human programming practice process. Inspired by ACT-R, a cognitive architecture framework, we design the structure of CoderAgent to align with human cognitive architecture by focusing on the mastery of programming knowledge and the application of coding ability. Recognizing the inherent patterns in multi-layered cognitive reasoning, we introduce the Programming Tree of Thought (PTOT), which breaks the process down into four steps: why, how, where, and what. This approach enables a detailed analysis of iterative problem-solving strategies. Finally, experimental evaluations on real-world datasets demonstrate that CoderAgent provides interpretable insights into learning trajectories and achieves accurate simulations, paving the way for personalized programming education.
4367: Fast Second-Order Online Kernel Learning Through Incremental Matrix Sketching and Decomposition
Authors: Dongxie Wen, Xiao Zhang, Zhewei Wei, Chenping Hou, Shuai Li, Weinan Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Second-order Online Kernel Learning (OKL) has attracted considerable research interest due to its promising predictive performance in streaming environments. However, existing second-order OKL approaches suffer from at least quadratic time complexity with respect to the pre-set budget, rendering them unsuitable for large-scale datasets. Moreover, the singular value decomposition required to obtain explicit feature mapping is computationally expensive due to the complete decomposition process. To address these issues, we propose FORKS, a fast incremental matrix sketching and decomposition approach tailored for second-order OKL. FORKS constructs an incremental maintenance paradigm for second-order kernelized gradient descent, which includes incremental matrix sketching for kernel approximation and incremental matrix decomposition for explicit feature mapping construction. Theoretical analysis demonstrates that FORKS achieves a logarithmic regret guarantee on par with other second-order approaches while maintaining a linear time complexity w.r.t. the budget, significantly enhancing efficiency over existing methods. We validate the performance of our method through extensive experiments conducted on real-world datasets, demonstrating its superior scalability and robustness against adversarial attacks.
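As background on incremental matrix sketching, the Frequent Directions update below maintains a small sketch B with B.T @ B ≈ A.T @ A over a stream of rows. It is a generic illustration of the idea, not FORKS's own update, whose pairing with incremental decomposition follows the paper.

```python
import numpy as np

def frequent_directions_update(B, row):
    """Insert one streamed row into an (ell x d) sketch B, shrinking when full."""
    zero_rows = np.where(~B.any(axis=1))[0]
    if zero_rows.size == 0:
        # Sketch is full: shrink all directions by the smallest singular value,
        # which zeroes out at least the last row and frees a slot.
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        s = np.sqrt(np.maximum(s ** 2 - s[-1] ** 2, 0.0))
        B = s[:, None] * Vt
        zero_rows = np.array([B.shape[0] - 1])
    B[zero_rows[0]] = row
    return B
```

Each update costs at most one ell-by-d SVD regardless of how many rows have streamed by, which is the kind of budget-bounded behavior second-order OKL methods need.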
4373: Balance-Aware Sequence Sampling Makes Multi-Modal Learning Better
Authors: Zhi-Hao Guan, Qing-Yuan Jiang, Yang Yang
Location: Guangzhou | Day: TBD
Show Abstract
Multi-modal learning (MML) is frequently hindered by modality imbalance, leading to suboptimal performance in real-world applications. To address this issue, existing approaches primarily focus on rebalancing MML from the perspective of optimization or architecture design. However, almost all existing methods ignore the impact of sample sequences, i.e., an inappropriate training order tends to trigger learning bias in the model, further exacerbating modality imbalance. In this paper, we propose Balance-aware Sequence Sampling (BSS) to enhance the robustness of MML. Specifically, we first define a multi-perspective measurer to evaluate the balance degree of each sample in terms of correlation and information criteria. Guided by this evaluation, we employ a heuristic scheduler based on curriculum learning (CL) that incrementally provides training subsets, progressing from balanced to imbalanced samples to alleviate the imbalance. Moreover, we propose a learning-based probabilistic sampling method to dynamically update the training sequence in a more fine-grained manner, further improving MML performance. Extensive experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) baselines. The code is available at https://github.com/njustkmg/IJCAI25-BSS.
4387: Hierarchy Knowledge Graph for Parameter-Efficient Entity Embedding
Authors: Hepeng Gao, Funing Yang, Yongjian Yang, Ying Wang
Location: Guangzhou | Day: TBD
Show Abstract
Traditional knowledge graphs (KGs) provide each entity with a unique embedding as a representation, which contains substantial redundant information. Meanwhile, the space complexity of a KG grows with the number of entities. In this work, we propose a hierarchical representation learning method, namely HRL, which is a parameter-efficient model where the number of model parameters is independent of dataset scales. Specifically, we propose a hierarchical model comprising a Meta Encoder and a Context Encoder to generate the representation of entities and relations. The Meta Encoder captures the common representations shared across entities, while the Context Encoder learns entity-specific representations. We further provide a theoretical analysis of model design by constructing a structural causal model (SCM) when completing a knowledge graph. The SCM outlines the relationships between nodes, where entity embeddings are conditioned on both common and entity-specific representations. Note that our model is designed to reduce model scale while maintaining competitive performance. We evaluate HRL on the knowledge graph completion task using three real-world datasets. The results demonstrate that HRL significantly outperforms existing parameter-efficient baselines, as well as traditional state-of-the-art baselines of similar scale.
4391: EF1 and EFX Orientations
Authors: Argyrios Deligkas, Eduard Eiben, Tiger-Lily Goldsmith, Viktoriia Korchemna
Location: Guangzhou | Day: TBD
Show Abstract
We study the problem of finding fair allocations — EF1 and EFX — of indivisible goods with orientations. In an orientation, every agent gets items from their own predetermined set. For EF1, we show that EF1 orientations always exist when agents have monotone valuations, via a pseudopolynomial-time algorithm. This surprisingly positive result is the main contribution of our paper. We complement this result with a comprehensive set of scenarios where our algorithm, or a slight modification of it, finds an EF1 orientation in polynomial time. For EFX, we focus on the recently proposed graph instances, where every agent corresponds to a vertex on a graph and their allowed set of items consists of the edges incident to their vertex. It was shown that finding an EFX orientation is NP-complete in general. We prove that it remains intractable even when the graph has a vertex cover of size 8, or when we have a multigraph with only 10 vertices. We essentially match these strong negative results with a fixed-parameter tractable algorithm that is virtually the best one could hope for.
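For reference, the standard fairness notions involved are stated below; in an orientation, each bundle A_i is further restricted to agent i's predetermined item set.

```latex
\text{EF1:}\quad \forall i,j\ \exists g \in A_j:\ v_i(A_i) \ge v_i(A_j \setminus \{g\})
\qquad
\text{EFX:}\quad \forall i,j\ \forall g \in A_j:\ v_i(A_i) \ge v_i(A_j \setminus \{g\})
```

EFX strengthens EF1 by requiring that envy vanish after removing any single item from the envied bundle, not just some item, which is why EFX orientations are so much harder to find.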
4424: SeqPose: An End-to-End Framework to Unify Single-frame and Video-based RGB Category-Level Pose Estimation
Authors: Yuzhu Ji, Mingshan Sun, Jianyang Shi, Xiaoke Jiang, Yiqun Zhang, Haijun Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Category-level object pose estimation is a longstanding and fundamental task crucial for augmented reality and robotic manipulation applications. Existing RGB-based approaches struggle with multi-stage settings and heavily rely on off-the-shelf techniques, such as object detectors, depth estimators, non-differentiable NOCS shape alignment, etc. Extra dependencies lead to the accumulation of errors and complicate the whole pipeline, limiting the deployment of these approaches in practical applications. This paper presents a streamlined end-to-end framework that unifies single-frame and video-based category-level pose estimation. Specifically, instead of explicitly introducing extra dependencies, the DINOv2 encoder and depth decoder, as robust semantic and geometric prior extractors, are leveraged to produce intra-frame hierarchical semantic and geometric features. A spatial-temporal sparse query network is developed to model the implicit correspondence and inter-frame correlations between a set of implicit 3D query anchors and intra-frame features. Finally, a pose prediction head based on bipartite matching is employed. Experimental results demonstrate that our model achieves state-of-the-art performance compared with RGB-based categorical pose estimation methods on the REAL275 and CAMERA25 datasets. Our code is available at https://andrewchiyz.github.io/vision.3dv.seqpose/.
4433: Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning
Authors: Xudong Yan, Songhe Feng, Yang Zhang, Jian Yang, Yueguan Lin, Haojun Fei
Location: Guangzhou | Day: TBD
Show Abstract
Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attributes and objects by extracting shared and exclusive parts between the image pair sharing the same attribute (object), as well as aligning them with pretrained word embeddings to improve unseen attribute-object recognition. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) The efficacy of disentanglement is compromised due to the influence of the background and the intricate entanglement of attributes with objects in the same parts. (2) Existing word embeddings fail to capture complex multimodal semantic information. (3) Overconfidence exhibited by existing models in seen compositions hinders their generalization to novel compositions. Aware of these limitations, we propose a novel framework named multimodal large language model (MLLM) embeddings and attribute smoothing guided disentanglement for CZSL. First, we leverage feature adaptive aggregation modules to mitigate the impact of background, and utilize learnable condition masks to capture multi-granularity features for disentanglement. Moreover, the last hidden states of MLLM are employed as word embeddings for their superior representation capabilities. Furthermore, we propose attribute smoothing with auxiliary attributes generated by the large language model (LLM) for seen compositions to address the overconfidence challenge. Extensive experiments demonstrate that our method achieves state-of-the-art performance on three challenging datasets. The supplementary material and source code will be available at https://github.com/xud-yan/Trident.
4437: Hybrid Local Causal Discovery
Authors: Zhaolong Ling, Honghui Peng, Yiwen Zhang, Debo Cheng, Xingyu Wu, Peng Zhou, Kui Yu
Location: Guangzhou | Day: TBD
Show Abstract
Local causal discovery aims to identify and distinguish the direct causes and effects of a target variable from observational data. Due to the inherent incompleteness of local information, popular methods from global causal discovery often face new challenges in local causal discovery tasks, such as 1) erroneous symmetry constraint tests and the resulting cascading errors in constraint-based methods, and 2) confusion within score-based approaches caused by local spurious equivalence classes. To address the above issues, we propose a Hybrid Local Causal Discovery algorithm, called HLCD. Specifically, HLCD initially utilizes a constraint-based approach with the OR rule to obtain a candidate skeleton, which is subsequently refined using a score-based method to eliminate redundant structures. Furthermore, during the local causal orientation phase, HLCD distinguishes between V-structures and equivalence classes by comparing local structure scores between the two, thereby avoiding orientation interference caused by local equivalence class ambiguities. Comprehensive experiments on 14 benchmark Bayesian networks and two real datasets validate that the proposed algorithm outperforms the existing local causal discovery methods.
4439: FreqLLM: Frequency-Aware Large Language Models for Time Series Forecasting
Authors: Shunnan Wang, Min Gao, Zongwei Wang, Yibing Bai, Feng Jiang, Guansong Pang
Location: Guangzhou | Day: TBD
Show Abstract
Large Language Models (LLMs) have recently shown promise in Time Series Forecasting (TSF) by effectively capturing intricate time-domain dependencies. However, our preliminary experiments reveal that standard LLM-based approaches often fail to capture global correlations, limiting predictive performance. We found that embedding frequency-domain signals smooths weight distributions and enhances structured correlations by clearly separating global trends (low-frequency components) from local variations (high-frequency components). Building on these insights, we propose FreqLLM, a novel framework that integrates frequency-domain semantic alignment into LLMs to refine prompts for improved time series analysis. By bridging the gap between frequency signals and textual embeddings, FreqLLM effectively captures multi-scale temporal patterns and provides more robust forecasting results. Extensive experiments on benchmark datasets demonstrate that FreqLLM outperforms state-of-the-art TSF methods in both accuracy and generalization. The code is available at https://github.com/biya0105/FreqLLM.
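The low/high-frequency separation the abstract describes can be illustrated with a plain FFT split; the cutoff ratio and this exact decomposition are illustrative assumptions, not FreqLLM's actual embedding pipeline.

```python
import numpy as np

def split_frequency_bands(x, cutoff_ratio=0.1):
    """Split a 1-D real series into low- (trend) and high-frequency parts."""
    spec = np.fft.rfft(x)
    cutoff = max(1, int(len(spec) * cutoff_ratio))
    low, high = spec.copy(), spec.copy()
    low[cutoff:] = 0.0    # keep global trend components only
    high[:cutoff] = 0.0   # keep local variation components only
    return np.fft.irfft(low, n=len(x)), np.fft.irfft(high, n=len(x))
```

The two components sum back to the original series up to numerical error, so the decomposition loses no information while exposing global trends and local variations as separate inputs.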
4461: GCTAM: Global and Contextual Truncated Affinity Combined Maximization Model For Unsupervised Graph Anomaly Detection
Authors: Xiong Zhang, Hong Peng, Zhenli He, Cheng Xie, Xin Jin, Hua Jiang
Location: Guangzhou | Day: TBD
Show Abstract
Anomalies often occur in real-world information networks/graphs, such as malevolent users, malicious comments, banned users, and fake news in social graphs.
The latest graph anomaly detection methods use a novel mechanism called truncated affinity maximization (TAM) to detect anomaly nodes without using any label information and achieve impressive results.
TAM maximizes the affinities among the normal nodes while truncating the affinities of the anomalous nodes to identify the anomalies.
However, existing TAM-based methods truncate suspicious nodes according to a rigid threshold that ignores the specificity and high-order affinities of different nodes.
This inevitably causes erroneous truncations of both normal and anomalous nodes, limiting the effectiveness of anomaly detection.
To this end, this paper proposes a novel truncation model combining contextual and global affinity to truncate the anomalous nodes.
The core idea of the work is to use contextual truncation to decrease the affinity of anomalous nodes, while global truncation increases the affinity of normal nodes.
Extensive experiments on massive real-world datasets show that our method surpasses peer methods in most graph anomaly detection tasks.
In particular, compared with previous state-of-the-art methods, the proposed method achieves improvements of 15%–20% on two widely used real-world datasets, Amazon and YelpChi.
Notably, our method scales to the large datasets Amazon-all and YelpChi-all and achieves the best results, whereas most previous models cannot complete these tasks.
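The rigid-threshold truncation that the paper improves upon can be sketched as follows. This is a schematic baseline assuming cosine-similarity affinities on node embeddings; the paper's contextual and global truncation replace the single global threshold `tau` used here.

```python
import torch
import torch.nn.functional as F

def truncated_affinity(h, edge_index, tau=0.5):
    """Mean neighbor affinity per node, with edges below tau truncated.

    h: (N, d) node embeddings; edge_index: (2, E) COO edges (long tensor).
    A low resulting affinity serves as an anomaly score.
    """
    src, dst = edge_index
    sim = F.cosine_similarity(h[src], h[dst], dim=-1)
    sim = torch.where(sim >= tau, sim, torch.zeros_like(sim))  # rigid cut-off
    total = h.new_zeros(h.size(0)).scatter_add_(0, src, sim)
    deg = h.new_zeros(h.size(0)).scatter_add_(0, src, torch.ones_like(sim))
    return total / deg.clamp(min=1)
```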
4479: KnowMDD: Knowledge-guided Cross Contrastive Learning for Major Depressive Disorder Diagnosis
Authors: Anchen Lin, Weikun Wang, Haijun Han, Fanwei Zhu, Qi Ma, Zengwei Zheng, Binbin Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Major Depressive Disorder (MDD) is a prevalent and severe mental disease. Functional Magnetic Resonance Imaging (fMRI)-based diagnostic methods, which analyze Functional Connectivity (FC) to identify abnormal functional connections, have shown promise as biomarker-based approaches for diagnosing depression. However, the high costs of fMRI data result in small sample sizes, hindering the effective identification of abnormal FC patterns. Moreover, existing methods often overlook the potential benefits of incorporating domain knowledge into their models. In this paper, we propose KnowMDD, a novel knowledge-guided cross contrastive learning framework for MDD diagnosis. By incorporating domain knowledge and employing data augmentation, KnowMDD addresses data sparsity while improving robustness and interpretability. Specifically, multiple atlases are used to construct complementary brain graph representations. The default mode network, closely associated with depression, is introduced into the contrastive learning paradigm for diverse subgraph augmentations, while an attention mechanism captures global semantic relationships between brain regions. Building on these components, a cross contrastive learning scheme is designed to learn robust representations for accurate diagnosis. Extensive experiments demonstrate the effectiveness, robustness, and interpretability of KnowMDD, which outperforms state-of-the-art methods. We also develop a demonstration system to show its practical application.
4485: Coming Out of the Dark: Human Pose Estimation in Low-light Conditions
Authors: Yong Su, Defang Chen, Meng Xing, Changjae Oh, Xuewei Liu, Jieyang Li
Location: Guangzhou | Day: TBD
Show Abstract
Human pose estimation in low-light conditions is vital for applications such as surveillance and autonomous systems, yet the severe visual distortions hinder both manual annotation and estimation precision. Existing approaches typically rely on additional reference information to mitigate these issues; however, the customized data collection equipment they require limits their scalability. To alleviate this issue, we construct a Low-Light Images and Poses (LLIP) dataset, which includes only paired low-light images and pose annotations obtained using off-the-shelf motion capture devices. Furthermore, we propose a Multi-grained High-frequency Feature Consistency Learning framework (MHFCL), which does not rely on additional reference information. MHFCL employs a Retinex-inspired restoration stream to recover high-frequency details and integrates them into pose estimation using a multi-grained consistency mechanism. Experiments demonstrate that our approach sets a new benchmark in low-light pose estimation, while maintaining competitive performance in well-lit conditions.
4505: Fusion of Granular-Ball Visual Spatial Representations for Enhanced Facial Expression Recognition
Authors: Shuaiyu Liu, Qiyao Shen, Yunxi Wang, Yazhou Ren, Guoyin Wang
Location: Guangzhou | Day: TBD
Show Abstract
Facial Expression Recognition (FER) is a fundamental problem in computer vision. Despite recent advances, significant challenges remain. Current methods primarily focus on extracting visual representations while overlooking other valuable information. To address this limitation, we propose a novel method called Component Separation and Granular-ball Space Bootstrap Fusion (CS-GBSBF), which leverages granular balls to transform visual images into spatial graphs, thereby enlarging the spatial information embedded in images. Our method separates the face into different components and utilizes the spatial information to bootstrap the fusion. More specifically, CS-GBSBF mainly consists of three crucial networks: Represent Extraction Network (REN), Represent Separation Network (RSN) and Represent Fusion Network (RFN). First, granular balls are used to represent expression images as graphs, which are fed into REN along with images. Then, RSN separates basic visual/spatial representations extracted from REN into a set of component visual/spatial representations. Next, RFN utilizes spatial representations to bootstrap component visual integration. A significant challenge in two-stream models is feature alignment, for which we have developed an Attention Guidance Module (AGM) and a Bootstrap Alignment Loss (L_BA) in REN and RFN, respectively. Experimental results on eight databases show that CS-GBSBF consistently achieves higher recognition accuracy than several state-of-the-art methods. The code is available at https://github.com/Lsy235/CS-GBSBF.
4507: Efficient Constraint-based Window Causal Graph Discovery in Time Series with Multiple Time Lags
Authors: Yewei Xia, Yixin Ren, Hong Cheng, Hao Zhang, Jihong Guan, Minchuan Xu, Shuigeng Zhou
Location: Guangzhou | Day: TBD
Show Abstract
We address the identification of direct causes in time series with multiple time lags, and propose a constraint-based window causal graph discovery method. A key advantage of our method is that the number of required conditional independence (CI) tests scales quadratically with the number of sub-series. The method first uses CI tests to find the minimum trek lag between two arbitrary sub-series, followed by designing an efficient CI testing strategy to identify the direct causes between them. We show that the method is both sound and complete under some graph constraints. We compare the proposed method with typical baselines on various datasets. Experimental results show that our method outperforms all the counterparts in both accuracy and running speed.
4509: Multi-view Clustering via Multi-granularity Ensemble
Authors: Jie Yang, Wei Chen, Feng Liu, Peng Zhou, Zhongli Wang, Xinyan Liang, Bingbing Jiang
Location: Guangzhou | Day: TBD
Show Abstract
Multi-view clustering aims to integrate complementary information from multiple views to improve clustering performance. However, existing ensemble-based methods suffer from information loss due to their reliance on single-granularity labels, limiting the discriminative capability of learned representations. Meanwhile, representation and graph fusion-based approaches face challenges such as explicit view alignment and manual weight tuning, making them less effective for heterogeneous views with varying data distributions. To address these limitations, we propose a novel multi-view clustering framework via Multi-granularity Ensemble (MGE), fully exploiting the multi-granularity information across diverse views for accurate and consistent clustering. Specifically, MGE first modifies the hierarchical clustering and then leverages it on each view (including the fused view) to achieve multi-granularity labels. Moreover, the cross-view and cross-granularity fusion strategy is designed to learn a robust co-association similarity matrix, which effectively preserves the fine-grained and coarse-grained structures of multi-view data and facilitates subsequent clustering. Therefore, MGE can provide a comprehensive representation of local and global patterns within data, eliminating the requirement for view alignment and weight tuning. Experiments demonstrate that MGE consistently outperforms state-of-the-art methods across multiple datasets, validating its effectiveness and superiority in handling heterogeneous views.
4516: Let’s Group: A Plug-and-Play SubGraph Learning Method for Memory-Efficient Spatio-Temporal Graph Modeling
Authors: Wenchao Weng, Hanyu Jiang, Mei Wu, Xiao Han, Haidong Gao, Guojiang Shen, Xiangjie Kong
Location: Guangzhou | Day: TBD
Show Abstract
Spatio-temporal graph modeling is widely applied to spatio-temporal data, analyzing the relationships between data to achieve accurate predictions. However, despite the excellent predictive performance of increasingly complex models, their intricate architectures result in significant memory overhead and computational complexity when handling spatio-temporal data, which limits their practical applications. To address these challenges, we propose a plug-and-play SubGraph Learning (SGL) method to reduce the memory overhead without compromising performance. Specifically, we introduce a SubGraph Partition Module (SGPM), which leverages a set of learnable memory vectors to select node groups with similar features from the graph, effectively partitioning the graph into smaller subgraphs. However, partitioning the graph may lead to feature redundancy, as information can overlap across subgraphs. To overcome this, we design a SubGraph Feature Aggregation Module (SGFAM), which mitigates redundancy by averaging node features from different subgraphs. Experiments on four traffic network datasets of various scales demonstrate that SGL can significantly reduce memory overhead, achieving up to a 56.4% reduction in average GPU memory overhead, while maintaining robust prediction performance. The source code is available at https://github.com/wengwenchao123/SubGraph-Learning.
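The subgraph partition idea can be sketched in a few lines. This illustrates only the grouping step (hard nearest-memory assignment); the learnability of the memory vectors, group balancing, and the aggregation module follow the paper.

```python
import torch

def partition_by_memory(node_feats, memory):
    """Assign each node to its most similar memory vector.

    node_feats: (N, d); memory: (G, d) learnable vectors, one per subgraph.
    Returns a list of G index tensors, one per subgraph.
    """
    scores = node_feats @ memory.t()   # (N, G) similarity scores
    assign = scores.argmax(dim=1)      # hard assignment per node
    return [torch.where(assign == g)[0] for g in range(memory.size(0))]
```

Each returned index set induces a smaller subgraph that can be processed independently, which is where the memory savings come from.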
4523: Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization
Authors: Yuanyuan Chang, Yinghua Yao, Tao Qin, Mengmeng Wang, Ivor Tsang, Guang Dai
Location: Guangzhou | Day: TBD
Show Abstract
Text-to-image diffusion models have emerged as powerful tools for high-quality image generation and editing. Many existing approaches rely on text prompts as editing guidance. However, these methods are constrained by the need for manual prompt crafting, which can be time-consuming, introduce irrelevant details, and significantly limit editing performance. In this work, we propose optimizing semantic embeddings guided by attribute classifiers to steer text-to-image models toward desired edits, without relying on text prompts or requiring any training or fine-tuning of the diffusion model. We utilize classifiers to learn precise semantic embeddings at the dataset level. The learned embeddings are theoretically justified as the optimal representation of attribute semantics, enabling disentangled and accurate edits. Experiments further demonstrate that our method achieves high levels of disentanglement and strong generalization across different domains of data. Code is available at https://github.com/Chang-yuanyuan/CASO.
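Classifier-guided optimization of a semantic embedding can be sketched as plain gradient descent against an attribute classifier. This is a minimal sketch under stated assumptions; the paper's dataset-level objective and the way the optimized embedding conditions the diffusion model are not reproduced here.

```python
import torch
import torch.nn.functional as F

def optimize_semantic_embedding(classifier, e_init, target, steps=100, lr=0.05):
    """Steer an embedding toward an attribute classifier's target class.

    classifier: assumed to map a (1, d) embedding to (1, num_classes) logits.
    e_init: (1, d) starting embedding; target: desired attribute class index.
    """
    e = e_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([e], lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(classifier(e), torch.tensor([target]))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return e.detach()  # used downstream to condition the frozen diffusion model
```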
4530: DERI: Cross-Modal ECG Representation Learning with Deep ECG-Report Interaction
Authors: Jian Chen, Xiaoru Dong, Wei Wang, Shaorui Zhou, Lequan Yu, Xiping Hu
Location: Guangzhou | Day: TBD
Show Abstract
Electrocardiogram (ECG) is widely used to diagnose cardiac conditions via deep learning methods. Although existing self-supervised learning (SSL) methods have achieved great performance in learning representations for ECG-based cardiac condition classification, they cannot effectively capture the clinical semantics. To overcome this limitation, we propose to learn cross-modal ECG representations that contain more clinical semantics via a novel framework with Deep ECG-Report Interaction (DERI). Specifically, we design a novel framework combining multiple alignments and mutual feature reconstructions to learn effective representation of the ECG with the clinical report, which fuses the clinical semantics of the report. An RME-Module inspired by masked modeling is proposed to improve the ECG representation learning. Furthermore, we extend ECG representation learning to report generation with a language model, which is significant for evaluating clinical semantics in the learned representations and even clinical applications. Comprehensive experiments with various settings are conducted on multiple datasets to show the superior performance of our DERI. Our code is released at https://github.com/cccccj-03/DERI.
4531: Multimodal Image Matching Based on Cross-Modality Completion Pre-training
Authors: Meng Yang, Fan Fan, Jun Huang, Yong Ma, Xiaoguang Mei, Zhanchuan Cai, Jiayi Ma
Location: Guangzhou | Day: TBD
Show Abstract
The differences in imaging devices cause multimodal images to have modal differences and geometric distortions, complicating the matching task. Deep learning-based matching methods struggle with multimodal images due to the lack of large annotated multimodal datasets. To address these challenges, we propose XCP-Match based on cross-modality completion pre-training. XCP-Match has two phases. (1) Self-supervised cross-modality completion pre-training based on a real multimodal image dataset. We develop a novel pre-training model to learn cross-modal semantic features. The pre-training uses a masked image modeling method for cross-modality completion, and introduces an attention-weighted contrastive loss to emphasize matching in overlapping areas. (2) Supervised fine-tuning for multimodal image matching based on the augmented MegaDepth dataset. XCP-Match constructs a complete matching framework to overcome geometric distortions and achieve precise matching. Two-phase training encourages the model to learn deep cross-modal semantic information, improving adaptation to modal differences without needing large annotated datasets. Experiments demonstrate that XCP-Match outperforms existing algorithms on public datasets.
4545: NAAST-GNN: Neighborhood Adaptive Aggregation and Spectral Tuning for Graph Anomaly Detection
Authors: Ronghui Guo, Xiaowang Zhang, Zhizhi Yu, Minghui Zou, Sai Zhang, Zhiyong Feng
Location: Guangzhou | Day: TBD
Show Abstract
Heterophily emerges as a critical challenge in Graph Anomaly Detection (GAD). Recent studies reveal that neighborhood distributions, rather than heterophily itself, are the fundamental factor for the expressive power of Graph Neural Networks (GNNs). However, two key challenges remain unresolved. First, the overlap in neighborhood distributions between anomalous and normal nodes poses significant difficulties in distinguishing them effectively. Second, the dispersion in neighborhood distributions within the same class prevents the application of a fixed aggregation strategy to accommodate the diverse patterns within the class. To tackle the aforementioned challenges, we propose a novel Graph Neural Network model called Neighborhood Adaptive Aggregation and Spectral Tuning (NAAST-GNN). Specifically, we first design a neighborhood adaptive aggregation module that adjusts the message passing mechanism based on the predicted probabilities for different node classes, ensuring that nodes from distinct classes but with similar neighborhood distributions derive unique aggregated neighborhood information. We then present a spectral tuning module that dynamically selects and combines spectral filters based on the predicted neighborhood distribution, ensuring adaptability to the diverse neighborhood distributions of nodes within the same class. Comprehensive experimental results demonstrate that our method outperforms state-of-the-art baselines.
4551: FedCM: Client Clustering and Migration in Federated Learning via Gradient Path Similarity and Update Direction Deviation
Authors: Peng Wang, Shoupeng Lu, Hao Yin, Banglie Yang, Tianli Zhu, Cheng Dai
Location: Guangzhou | Day: TBD
Show Abstract
Federated learning (FL) enables collaborative training among multiple clients while preserving data privacy. However, its practical application is significantly limited by two major challenges: statistical heterogeneity and data distribution drift. Statistical heterogeneity causes the direction of local model updates to deviate from the global training objective, while data distribution drift leads to a mismatch between local models and their cluster models. To address these challenges, this paper proposes an adaptive clustered federated learning framework, FedCM. Initially, by capturing the dynamic patterns of personalized layer parameters in clients’ models, FedCM effectively characterizes the correlations and distributional similarities among clients, reflecting the underlying statistical heterogeneity. Subsequently, this framework leverages client similarities to construct an undirected graph and adaptively performs effective cluster discovery with minimal dependence on hyperparameters. Furthermore, a monitoring strategy tracks the deviation between clients’ update directions and the dominant update direction of their clusters and then adaptively migrates clients experiencing data drift. Such a dynamic strategy helps maintain intra-cluster homogeneity and addresses the mismatch between local models and their cluster models. Compared to other state-of-the-art methods, experimental results on multiple datasets demonstrate that the proposed FedCM framework effectively addresses the challenges posed by statistical heterogeneity and data drift, significantly improving the performance and robustness of federated learning models.
4560: DiffSQL: Leveraging Diffusion Model for Zero-Shot Self-Supervised Monocular Depth Estimation
Authors: Heyuan Zheng, Yunji Liang, Lei Liu, Zhiwen Yu
Location: Guangzhou | Day: TBD
Show Abstract
Self-supervised monocular depth estimation has attracted significant attention due to its broad applications in autonomous driving and robotics. Although significant performance improvements have been achieved by learning the relative distance of objects with the introduction of Self Query Layer (SQL), it struggles with zero-shot generalization due to the lack of geometric features and its fixed query size. To address these problems, we propose a diffusion-augmented self-supervised depth estimation framework, named DiffSQL, to learn geometric priors for feature augmentation. Additionally, we introduce a dynamic self-query layer that implicitly computes the relative distances between objects by adjusting the query size according to the feature distribution. Experimental results on the KITTI dataset show that DiffSQL outperforms SQLdepth by 1.03% in terms of AbsRel and 2.79% in terms of SqRel. Furthermore, our experiments demonstrate that DiffSQL is superior in zero-shot generalization.
4573: RRG-Mamba: Efficient Radiology Report Generation with State Space Model
Authors: Xiaodi Hou, Xiaobo Li, Mingyu Lu, Simiao Wang, Yijia Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Recent advancements in radiology report generation have utilized deep neural networks such as CNNs and Transformers, achieving notable improvements in generating accurate and detailed reports. However, their practical adoption is hindered by the challenge of balancing global dependency modeling with computational efficiency. The state space model, particularly its enhanced variant Mamba, offers promising linear-complexity solutions for long-range dependency modeling. Despite its strengths, Mamba’s fixed positional encoding limits its ability to effectively capture complex spatial dependencies. To address this gap, we propose RRG-Mamba, an advanced framework for efficient radiology report generation. Within RRG-Mamba, we enhance the vanilla Mamba by integrating rotary position encoding (RoPE), enabling dynamic modeling of relative positional information in visual feature sequences. Furthermore, we design a global dependency learning module to optimize long-range visual feature sequence modeling. Extensive experiments on publicly available datasets, including IU X-Ray and MIMIC-CXR, demonstrate that RRG-Mamba achieves a 3.7% improvement in BLEU-4 score over existing models, along with significant gains in computational and memory efficiency. Our code is available at https://github.com/Eleanorhxd/RRG-Mamba.
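The rotary position encoding (RoPE) the abstract refers to is a standard technique; a minimal half-split variant over a (batch, seq, dim) feature sequence is sketched below, while its wiring into the Mamba blocks follows the paper.

```python
import torch

def apply_rope(x):
    """Apply rotary position encoding to x of shape (batch, seq, dim), dim even."""
    b, n, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(half) / half))  # per-pair frequencies
    angles = torch.arange(n)[:, None] * freqs[None, :]    # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotating each feature pair by a position-dependent angle makes inner
    # products between positions depend only on their relative offset.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```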
4582: MSCI: Addressing CLIP’s Inherent Limitations for Compositional Zero-Shot Learning
Authors: Yue Wang, Shuai Xu, Xuelin Zhu, Yicong Li
Location: Guangzhou | Day: TBD
Show Abstract
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies generally rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architectural and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP’s visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. This key information is progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model’s perception capability for fine-grained local visual information. Additionally, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, as well as different elements within the same combination, allowing it to flexibly adapt to diverse scenarios. Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model. Data and code are available at https://github.com/ltpwy/MSCI.
4624: APIMig: A Project-Level Cross-Multi-Version API Migration Framework Based on Evolution Knowledge Graph
Authors: Li Kuang, Qi Xie, HaiYang Yang, Yang Yang, Xiang Wei, HaoYue Kang, YingJie Xia
Location: Guangzhou | Day: TBD
Show Abstract
API migration is essential for software maintenance due to the rapid evolution of third-party libraries, whose API elements may change continuously through updates. There are two main challenges for API migration at the project level, especially across multiple versions: 1) lack of specific library evolution knowledge across multiple versions; 2) difficulty in identifying the chain of changes at the project level. This paper proposes a project-level cross-multi-version API migration framework, APIMig. We first construct an API evolution knowledge graph (KG) to capture changes between adjacent library versions and then derive coherent cross-version API evolution knowledge by KG reasoning. Second, we design a chain exploration algorithm to track the chain of changes and aggregate the affected code segments. Finally, a large language model is employed to complete the API migration, provided with the API evolution knowledge and the chain of changes. We construct an evolution KG for the Lucene library from version 4.0.0 to 10.1.0 and evaluate our approach through project migration pairs that depend on different major versions. Our framework shows improvements over the baseline in migrating projects across 7 major versions, achieving average increases of 16.52% in CodeBLEU scores and 28.49% in VCEU scores with GPT-4o.
4681: Hyper-graph Video Object Segmentation via Text-depth Collaborative Reasoning
Authors: Jiaqing Fan, Yifan Liao, Fanzhang Li
Location: Guangzhou | Day: TBD
Show Abstract
Current video object segmentation (VOS) solutions often overlook the wealth of information in subtitles and depth cues among video sequences, which are crucial for effectively linking video content. Recognizing the significance of these elements, in this paper, we introduce a novel approach termed "Hyper-graph Text-Depth Collaborative Reasoning Video Object Segmentation" (HTD). It aims to leverage the synergy between textual and depth information to enhance the segmentation of objects in video sequences. The HTD framework integrates textual and depth data into a hyper-graph structure, where nodes represent objects, text, and depth features, and hyper-edges encode complex relationships among them. After capturing the multimodal context of video scenes, the proposed collaborative reasoning mechanism within the hyper-graph iteratively refines object boundaries by considering the interplay between textual cues, depth information, and spatial-temporal coherence. We demonstrate the effectiveness of HTD through extensive experiments on four benchmarks. The results show that our approach outperforms state-of-the-art VOS methods, particularly in scenarios with complex backgrounds, occlusions, and dynamic scenes. The inclusion of text and depth data not only improves segmentation accuracy but also enhances the interpretability of the segmentation process. We have released the training and testing code at https://github.com/zyaireleo/HTD.git.
4707: DFMU: Distribution-based Framework for Modeling Aleatoric Uncertainty in Multimodal Sentiment Analysis
Authors: Chen Tang, Tingrui Shen, Xinrong Gong, Chong Zhao, Tong Zhang
Location: Guangzhou | Day: TBD
Show Abstract
In Multimodal Sentiment Analysis (MSA), data noise arising from various sources gives rise to Aleatoric Uncertainty (AU), significantly impacting model performance. Current efforts to address AU have insufficiently explored its sources. They primarily focus on modeling noise rather than implementing targeted modeling based on its origin. Consequently, these approaches struggle to effectively mitigate the influence of AU, resulting in sustained limitations in model performance. Our research identifies that AU primarily stems from two problems: subjective bias in the annotation process and the complex set relationships of sentiment features. To specifically address them, we propose DFMU, a Distribution-based Framework for Modeling Aleatoric Uncertainty, which incorporates an uncertainty modeling block capable of encoding uncertainty distributions and adaptively adjusting optimization objectives. Furthermore, we introduce distribution-based contrastive learning with sentiment-word replacement to better capture the complex relationships among features. Extensive experiments on three public MSA datasets, i.e., MOSI, MOSEI, and SIMS, demonstrate that the proposed model maintains robust performance even under high noise conditions and achieves state-of-the-art results on these popular datasets.
4708: Reinforced In-Context Black-Box Optimization
Authors: Lei Song, Chen-Xiao Gao, Ke Xue, Chenyang Wu, Dong Li, Jianye Hao, Zongzhang Zhang, Chao Qian
Location: Guangzhou | Day: TBD
Show Abstract
Black-Box Optimization (BBO) has found successful applications in many fields of science and engineering. Recently, there has been a growing interest in meta-learning particular components of BBO algorithms to speed up optimization and get rid of tedious hand-crafted heuristics. As an extension, learning the entire algorithm from data requires the least labor from experts and can provide the most flexibility. In this paper, we propose RIBBO, a method to reinforce-learn a BBO algorithm from offline data in an end-to-end fashion. RIBBO employs expressive sequence models to learn the optimization histories produced by multiple behavior algorithms and tasks, leveraging the in-context learning ability of large models to extract task information and make decisions accordingly. Central to our method is to augment the optimization histories with regret-to-go tokens, which are designed to represent the performance of an algorithm based on cumulative regret over the future part of the histories. The integration of regret-to-go tokens enables RIBBO to automatically generate sequences of query points that are positively correlated to the user-desired regret, verified by its universally good empirical performance on diverse problems, including BBO benchmark, hyper-parameter optimization, and robot control problems.
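The regret-to-go token admits a simple formalization. Assuming maximization of a black-box function with optimum value y* and observed values y_i (a plausible reading of the abstract, not the paper's exact definition), the token at step t is

```latex
R_t \;=\; \sum_{i=t}^{T} \bigl( y^{*} - y_{i} \bigr),
```

i.e., the cumulative regret accumulated over the remaining portion of the history; conditioning generation on a low desired R_t then asks the model for near-optimal future queries.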
4713: Hypernetwork Aggregation for Decentralized Personalized Federated Learning
Authors: Weishi Li, Yong Peng, Mengyao Du, Fuhui Sun, Xiaoyan Wang, Li Shen
Location: Guangzhou | Day: TBD
Show Abstract
Personalized Federated Learning (PFL) meets each user’s personalized needs while still facing high communication costs due to large data transmission volumes and frequent communication. As an alternative, Decentralized PFL (DPFL) discards the central server in PFL, reducing communication pressure and the risk of server failure by using peer-to-peer communication. Nevertheless, DPFL still suffers from significant communication pressure due to the transmission of a large number of model parameters, especially with numerous nodes. To address these issues, we propose a novel personalized framework, DFedHP, in which each client utilizes a hypernetwork to generate the shared part of model parameters and train the personalized parameters separately. The number of parameters in a hypernetwork is much smaller than those in a typical local network, so hypernetwork aggregation reduces communication costs and the risk of privacy leakage. Furthermore, DFedHP can seamlessly integrate into existing DPFL algorithms as a plugin to boost their efficacy. Finally, extensive experiments on various data heterogeneous environments demonstrate that DFedHP can reduce communication costs, accelerate convergence rate, and improve generalization performance compared with state-of-the-art (SOTA) baselines.
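The communication saving comes from exchanging hypernetwork parameters instead of full model weights. A schematic is given below, where the chunking scheme, embedding sizes, and generator shape are illustrative assumptions rather than DFedHP's configuration.

```python
import torch
import torch.nn as nn

class ChunkedHypernetwork(nn.Module):
    """Generate a large weight vector chunk-by-chunk from chunk embeddings.

    Reusing one small generator across many chunks keeps the hypernetwork far
    smaller than the weights it emits (illustrative sizes).
    """

    def __init__(self, num_chunks=256, embed_dim=16, hidden=64, chunk_size=512):
        super().__init__()
        self.chunk_embed = nn.Parameter(torch.randn(num_chunks, embed_dim))
        self.generator = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, chunk_size),
        )

    def forward(self):
        # (num_chunks, chunk_size) -> flat vector of 256 * 512 = 131072 weights,
        # produced by roughly 38k hypernetwork parameters.
        return self.generator(self.chunk_embed).reshape(-1)
```

Peer-to-peer aggregation then operates on the ~38k hypernetwork parameters rather than the 131k generated weights, which is the kind of reduction the abstract describes.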
4718: VimGeo: Efficient Cross-View Geo-Localization with Vision Mamba Architecture
Authors: Jinglin Huang, Maoqiang Wu, Peichun Li, Wen Wu, Rong Yu
Location: Guangzhou | Day: TBD
Show Abstract
Cross-view geo-localization is a crucial task with diverse applications, yet it remains challenging due to the significant variations in viewpoints and visual appearances between images from different perspectives. While recent advancements have been made, existing methods often suffer from high model complexity, excessive resource consumption, and the impact of sample learning difficulty on optimization. To overcome these limitations, we optimize the Vision Mamba (Vim) model, built on a State Space Model (SSM) architecture, by replacing the traditional classification head with Channel Group Pooling (CGP) for efficient feature integration. This optimization reduces model parameters by 1.5% and computational complexity by 0.4%. Additionally, we propose a novel Dynamic Weighted Batch-tuple Loss (DWBL) to dynamically adjust the weighting of negative samples, improving model performance. By combining CGP and DWBL, we develop an efficient end-to-end network, VimGeo, which achieves state-of-the-art performance with enhanced computational efficiency. Specifically, VimGeo achieves a Recall@1 of 81.67% on the CVACT_test dataset, outperforming prior approaches. Extensive experiments on CVUSA, CVACT, and VIGOR datasets validate VimGeo’s effectiveness and competitiveness in cross-view geo-localization tasks, achieving the leading results among sequence modeling-based methods. The implementation is available at: https://github.com/VimGeoTeam/VimGeo.
4720: ForgDiffuser: General Image Forgery Localization with Diffusion Models
Authors: Mengxi Wang, Shaozhang Niu, Jiwei Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Current general image forgery localization (GIFL) methods confront two main challenges: decoder overconfidence causing misidentification of the authentic regions or incomplete predicted masks, and limited accuracy in localizing forgery details. Recently, diffusion models have excelled as a dominant approach for generative models, particularly effective in capturing complex scene details. However, their potential for GIFL remains underexplored. Therefore, we propose a GIFL framework named ForgDiffuser with diffusion models. The core of ForgDiffuser lies in leveraging diffusion models conditioned on the forgery image to efficiently generate the segmentation mask for tampered regions. Specifically, we introduce the attention-guided module (AGM) to aggregate and enhance image feature representations. Meanwhile, we design the boundary-driven module (BDM) with edge supervision to improve the localization accuracy of boundary details. Additionally, the probabilistic modeling and stochastic sampling mechanisms of diffusion models effectively alleviate the overconfidence issue commonly observed in traditional decoders. Experiments on six benchmark datasets demonstrate that ForgDiffuser outperforms existing mainstream GIFL methods in both localization accuracy and robustness, especially under challenging manipulation conditions.
4744: Deduction with Induction: Combining Knowledge Discovery and Reasoning for Interpretable Deep Reinforcement Learning
Authors: Haodi Zhang, Xiangyu Zeng, Junyang Chen, Yuanfeng Song, Rui Mao, Fangzhen Lin
Location: Guangzhou | Day: TBD
Show Abstract
Deep reinforcement learning (DRL) has achieved remarkable success in dynamic decision-making tasks. However, its inherent opacity and cold start problem hinder transparency and training efficiency. To address these challenges, we propose HRL-ID, a neural-symbolic framework that combines automated rule discovery with logical reasoning within a hierarchical DRL structure. HRL-ID dynamically extracts first-order logic rules from environmental interactions, iteratively refines them through success-based updates, and leverages these rules to guide action execution during training. Extensive experiments on Atari benchmarks demonstrate that HRL-ID outperforms state-of-the-art methods in training efficiency and interpretability, achieving higher reward rates and successful knowledge transfer between domains.
4751: Problem-dependent Regret for Lexicographic Multi-Armed Bandits with Adversarial Corruptions
Authors: Bo Xue, Xi Lin, Yuanyu Wan, Qingfu Zhang
Location: Guangzhou | Day: TBD
Show Abstract
This paper studies lexicographic multi-armed bandits (MAB), where after selecting an arm, the agent observes a reward vector including multiple objectives, each with a different level of importance. Although previous literature has proposed an algorithm for lexicographic MAB, it suffers from several limitations: (1) it exhibits poor adversarial robustness due to its reliance on stochastic rewards, (2) its regret bound is suboptimal compared to single-objective counterparts, and (3) its regret bound does not adapt to specific problem instances. To address these limitations, we study lexicographic MAB with adversarial corruptions, where an adversary might corrupt the stochastic rewards with a corruption budget of C. First, when the value of C is known, we propose an algorithm achieving a problem-dependent regret bound of O(∑(log T / Δⁱ(a) + C)) for the i-th objective (i ∈ [M]), where Δⁱ(a) is the reward gap for arm a on the i-th objective, and M is the number of objectives. In the purely stochastic setting (C=0), this regret bound approaches optimality. Second, we introduce another algorithm that does not require the value of C but incurs a less favorable regret bound of O(∑(γ_T / Δⁱ(a) + γ_T)) for the i-th objective, where γ_T = O((log T)² + KC(log T)²). Finally, we conduct experiments on both synthetic and real-world datasets to verify the effectiveness of our algorithms.
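Restated in display form (taking the sums over suboptimal arms, an assumption about the abstract's implicit index set, and with K the number of arms, following standard notation), the two bounds read:

```latex
\mathrm{Reg}^{i}(T) = O\!\Bigl(\sum_{a\,:\,\Delta^{i}(a)>0}\Bigl(\frac{\log T}{\Delta^{i}(a)} + C\Bigr)\Bigr),
\qquad i \in [M],
```

for the algorithm that knows C, and

```latex
\mathrm{Reg}^{i}(T) = O\!\Bigl(\sum_{a\,:\,\Delta^{i}(a)>0}\Bigl(\frac{\gamma_T}{\Delta^{i}(a)} + \gamma_T\Bigr)\Bigr),
\qquad \gamma_T = O\bigl((\log T)^2 + KC(\log T)^2\bigr),
```

for the C-agnostic one.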
4782: EFormer: An Effective Edge-based Transformer for Vehicle Routing Problems
Authors: Dian Meng, Zhiguang Cao, Yaoxin Wu, Yaqing Hou, Hongwei Ge, Qiang Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Recent neural heuristics for the Vehicle Routing Problem (VRP) primarily rely on node coordinates as input, which may be less effective in practical scenarios where real cost metrics—such as edge-based distances—are more relevant. To address this limitation, we introduce EFormer, an Edge-based Transformer model that uses edge information as the sole input for VRPs. Our approach employs a precoder module with a mixed-score attention mechanism to convert edge information into temporary node embeddings. We also present a parallel encoding strategy characterized by a graph encoder and a node encoder, each responsible for processing graph and node embeddings in distinct feature spaces, respectively. This design yields a more comprehensive representation of the global relationships among edges. In the decoding phase, parallel context embedding and multi-query integration are used to compute separate attention mechanisms over the two encoded embeddings, facilitating efficient path construction. We train EFormer using reinforcement learning in an autoregressive manner. Extensive experiments on the Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) reveal that EFormer outperforms established baselines on synthetic datasets, including large-scale and diverse distributions. Moreover, EFormer demonstrates strong generalization on real-world instances from TSPLib and CVRPLib. These findings confirm the effectiveness of EFormer’s core design in solving VRPs.
4816: All Roads Lead to Rome: Exploring Edge Distribution Shifts for Heterophilic Graph Learning
Authors: Yi Wang, Changqin Huang, Ming Li, Tingyi Cai, Zhonglong Zheng, Xiaodi Huang
Location: Guangzhou | Day: TBD
Show Abstract
Heterophilic graph neural networks (GNNs) have gained prominence for their ability to learn effective representations in graphs with diverse, attribute-aware relationships. While existing methods leverage attribute inference during message passing to improve performance, they often struggle with challenging heterophilic graphs. This is due to edge distribution shifts introduced by diverse connection patterns, which blur attribute distinctions and undermine message-passing stability. This paper introduces H₂OGNN, a novel framework that reframes edge attribute inference as an out-of-distribution (OOD) detection problem. H₂OGNN introduces a simple yet effective symbolic energy regularization approach for OOD learning, ensuring robust classification boundaries between homophilic and heterophilic edge attributes. This design significantly improves the stability and reliability of GNNs across diverse connectivity patterns. Through theoretical analysis, we show that H₂OGNN addresses the graph denoising problem by going beyond feature smoothing, offering deeper insights into how precise edge attribute identification boosts model performance. Extensive experiments on nine benchmark datasets demonstrate that H₂OGNN not only achieves state-of-the-art performance but also consistently outperforms other heterophilic GNN frameworks, particularly on datasets with high heterophily.
4836: PAMol: Pocket-Aware Drug Design Method with Hypergraph Representation of Protein Pocket Structure and Feature Fusion
Authors: Xiaoli Lin, Xiongwei Liao, Jun Pang, Bo Li, Xiaolong Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Efficient generation of targeted drug molecules is crucial in the field of drug discovery. Most existing methods neglect the high-order information in the structure of protein pockets, limiting the performance of generated drug molecules. This paper proposes a pocket-aware drug design framework, namely PAMol, constructing the hypergraph to represent the spatial structure of protein pockets, effectively capturing high-order relations and neighborhood information within the pocket structures. This framework also fuses different modal embeddings from proteins and molecules, to generate high-quality molecules. In addition, a conditional molecule generation module uses the high-order structural information in protein pockets as constraints to more accurately generate molecules for specific targets. The performance of PAMol has been assessed by analyzing generated molecules in terms of Vina score, high affinity, QED, SA, LogP, Lipinski, diversity, and time. Experimental results demonstrate the potential of PAMol for targeted drug design. The source code is available at https://github.com/YICHUANSYQ/PAMol.git.
4845: Multi-Omics Analysis for Cancer Subtype Inference via Unrolling Graph Smoothness Priors
Authors: Jielong Lu, Zhihao Wu, Jiajun Yu, Jiajun Bu, Haishuai Wang
Location: Guangzhou | Day: TBD
Show Abstract
Integrating multi-omics datasets through data-driven analysis offers a comprehensive understanding of the complex biological processes underlying various diseases, particularly cancer.
Graph Neural Networks (GNNs) have recently demonstrated remarkable ability to exploit relational structures in biological data, enabling advances in multi-omics integration for cancer subtype classification.
Existing approaches often neglect the intricate coupling between heterogeneous omics, limiting their capacity to resolve subtle cancer subtype heterogeneity critical for precision oncology.
To address these limitations, we propose a framework named Graph Transformer for Multi-omics Cancer Subtype Classification (GTMancer).
This framework builds upon the GNN optimization problem and extends its application to complex multi-omics data.
Specifically, our method leverages contrastive learning to embed multi-omics data into a unified semantic space.
We unroll the multiplex graph optimization problem in that unified space and introduce dual sets of attention coefficients to capture structural graph priors both within and among multi-omics data.
This approach enables global omics information to guide the refinement of individual omics representations.
Empirical experiments on seven real-world cancer datasets demonstrate that GTMancer outperforms existing state-of-the-art algorithms.
4868: Learn Multi-task Anchor: Joint View Imputation and Label Generation for Incomplete Multi-view Clustering
Authors: Xinxin Wang, Yongshan Zhang, Yicong Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Anchor-based incomplete multi-view clustering methods utilize anchors to uncover clustering structures. However, relying on anchor graphs for producing final indicators is indirect, which can lead to information loss and suboptimal outcomes. Besides, most methods neglect the potential of anchors for imputing missing views. To address these limitations, we propose a Joint View Imputation and Label Generation (JVILG) method. JVILG comprises the Anchor-based tensorized Label Generation (ALG) module for generating clustering labels and the Anchor-based sparse regularized Subspace Correlation (ASC) module for recovering missing views. The ALG module explicitly connects data observations, the fine-grained anchor matrix, and soft label matrices within a reconstruction framework through a membership matrix, while imposing tensor Schatten p-norm regularization on the constructed label tensor to capture spatial correlations among views. Meanwhile, the ASC module directly uses fine-grained anchors to impute missing data in respective views. By integrating the ALG and ASC modules, JVILG enhances synergy between different tasks and mitigates the impact of missing information on clustering. Experimental results on six datasets demonstrate the effectiveness of JVILG compared to both shallow and deep state-of-the-art methods. The code is available at https://github.com/W-Xinxin/JVILG.
4876: In-Context Meta LoRA Generation
Authors: Yihua Shao, Minxi Yan, Yang Liu, Siyu Chen, Wenjie Chen, Xinwei Long, Ziyang Yan, Lei Li, Chenyu Zhang, Nicu Sebe, Hao Tang, Yan Wang, Hao Zhao, Mengzhu Wang, Jingcai Guo
Location: Guangzhou | Day: TBD
Show Abstract
Low-rank Adaptation (LoRA) has demonstrated remarkable capabilities for task-specific fine-tuning. However, in scenarios that involve multiple tasks, training a separate LoRA model for each one results in considerable inefficiency in terms of storage and inference. Moreover, existing parameter generation methods fail to capture the correlations among these tasks, making multi-task LoRA parameter generation challenging. To address these limitations, we propose In-Context Meta LoRA (ICM-LoRA), a novel approach that efficiently achieves task-specific customization of large language models (LLMs). Specifically, we use training data from all tasks to train a tailored generator, a Conditional Variational Autoencoder (CVAE), which takes task descriptions as inputs and produces task-aware LoRA weights as outputs. These LoRA weights are then merged with LLMs to create task-specialized models without the need for additional fine-tuning. Furthermore, we utilize in-context meta-learning for knowledge enhancement and task mapping, to capture the relationship between tasks and parameter distributions. As a result, our method achieves more accurate LoRA parameter generation for diverse tasks using the CVAE. ICM-LoRA enables more accurate LoRA parameter reconstruction than current parameter reconstruction methods and is useful for implementing task-specific enhancements of LoRA parameters. At the same time, our method occupies only 283 MB, about 1% of the storage required by the original LoRA. The code is available at https://github.com/YihuaJerry/ICM-LoRA.
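As a rough illustration of the generator described above, the sketch below (PyTorch; all dimensions, layer choices, and names are assumptions, not the paper's architecture) shows a conditional VAE that encodes flattened LoRA A/B matrices together with a task embedding and decodes task-aware LoRA weights.

```python
import torch
import torch.nn as nn

class LoRACVAE(nn.Module):
    """Sketch of a CVAE mapping task embeddings to flattened LoRA weights."""
    def __init__(self, task_dim=256, latent_dim=64, d_model=512, rank=8):
        super().__init__()
        w_numel = 2 * d_model * rank                    # LoRA A and B, flattened
        self.enc = nn.Linear(w_numel + task_dim, 2 * latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + task_dim, 1024), nn.ReLU(),
            nn.Linear(1024, w_numel),
        )

    def forward(self, lora_flat, task_emb):
        # encoder conditions on both the weights and the task description
        mu, logvar = self.enc(torch.cat([lora_flat, task_emb], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.dec(torch.cat([z, task_emb], -1))
        return recon, mu, logvar

    @torch.no_grad()
    def generate(self, task_emb):
        # sample latent, decode task-aware LoRA weights for a new task
        latent_dim = self.dec[0].in_features - task_emb.size(-1)
        z = torch.randn(task_emb.size(0), latent_dim)
        return self.dec(torch.cat([z, task_emb], -1))
```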
4897: Diffusion Guided Propagation Augmentation for Popularity Prediction
Authors: Chaozhuo Li, Tianqi Yang, Litian Zhang, Xi Zhang
Location: Guangzhou | Day: TBD
Show Abstract
The prediction of information popularity propagation is critical for applications such as recommendation systems, targeted advertising, and social media trend analysis. Traditional approaches primarily rely on historical cascade data, often sacrificing timeliness for prediction accuracy. These methods capture aggregate diffusion patterns but fail to account for the complex temporal dynamics of early-stage propagation. In this paper, we introduce Diffusion Guided Propagation Augmentation (DGPA), a novel framework designed to improve early-stage popularity prediction. DGPA models cascade dynamics by leveraging a generative approach, where a temporal conditional interpolator serves as the noising process and forecasting as the denoising process. By iteratively generating cascade representations through a sampling procedure, DGPA effectively incorporates the evolving time steps of diffusion, significantly enhancing prediction timeliness and accuracy. Extensive experiments on benchmark datasets from Twitter, Weibo, and APS demonstrate that DGPA outperforms state-of-the-art methods in early-stage popularity prediction.
4900: Universal Backdoor Defense via Label Consistency in Vertical Federated Learning
Authors: Peng Chen, Haolong Xiang, Xin Du, Xiaolong Xu, Xuhao Jiang, Zhihui Lu, Jirui Yang, Qiang Duan, Wanchun Dou
Location: Guangzhou | Day: TBD
Show Abstract
Backdoor attacks in vertical federated learning (VFL) are particularly concerning as they can covertly compromise VFL decision-making, posing a severe threat to critical applications of VFL. Existing defense mechanisms typically involve either label obfuscation during training or model pruning during inference. However, the inherent limitations on the defender’s access to the global model and complete training data in VFL environments fundamentally constrain the effectiveness of these conventional methods. To address these limitations, we propose the Universal Backdoor Defense (UBD) framework. UBD leverages Label Consistent Clustering (LCC) to synthesize plausible latent triggers associated with the backdoor class. This synthesized information is then utilized for mitigating backdoor threats through Linear Probing (LP), guided by a constraint on Batch Normalization (BN) statistics. Positioned within a unified VFL backdoor defense paradigm, UBD offers a generalized framework for both detection and mitigation that critically does not necessitate access to the entire model or dataset. Extensive experiments across multiple datasets rigorously demonstrate the efficacy of the UBD framework, achieving state-of-the-art performance against diverse backdoor attack types in VFL, including both dirty-label and clean-label variants.
4910: Uncertainty-guided Graph Contrastive Learning from a Unified Perspective
Authors: Zhiqiang Li, Jie Wang, Jianqing Liang, Junbiao Cui, Xingwang Zhao, Jiye Liang
Location: Guangzhou | Day: TBD
Show Abstract
The success of current graph contrastive learning methods largely relies on the choice of data augmentation and contrastive objectives. However, most existing methods tend to optimize these two components independently, neglecting their potential interplay, which leads to suboptimal quality of the learned embeddings. To address this issue, we propose Uncertainty-guided Graph Contrastive Learning (UGCL) from a unified perspective. The core of our method is the introduction of sample uncertainty, a critical metric that quantifies the degree of class ambiguity within individual samples. On this basis, we design a novel multi-scale data augmentation strategy and a weighted graph contrastive loss function, both of which significantly enhance the quality of embeddings. Theoretically, we demonstrate that UGCL can coordinate overall optimization objectives through uncertainty, and through experiments, we show that it improves the performance of tasks such as node classification, node clustering, and link prediction, thereby verifying the effectiveness of our method.
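One plausible reading of "sample uncertainty" is the normalized entropy of a node's soft class assignment; the sketch below (PyTorch; the entropy-based definition and the weighting scheme are our assumptions for illustration, not necessarily the paper's exact formulation) shows how such a score could weight a cross-view InfoNCE loss so that ambiguous samples contribute less.

```python
import torch
import torch.nn.functional as F

def sample_uncertainty(logits: torch.Tensor) -> torch.Tensor:
    """Normalized entropy of soft class assignments: 1 = fully ambiguous."""
    p = logits.softmax(-1)
    ent = -(p * p.clamp_min(1e-9).log()).sum(-1)
    return ent / torch.log(torch.tensor(float(logits.size(-1))))

def weighted_infonce(z1, z2, logits, tau=0.5):
    """InfoNCE between two views, down-weighting class-ambiguous samples."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / tau                        # cross-view similarities
    targets = torch.arange(z1.size(0))             # positives on the diagonal
    loss = F.cross_entropy(sim, targets, reduction="none")
    w = 1.0 - sample_uncertainty(logits)           # confident samples weigh more
    return (w * loss).mean()
```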
4915: Interaction-Data-guided Conditional Instrumental Variables for Debiasing Recommender Systems
Authors: Zhirong Huang, Debo Cheng, Lin Liu, Jiuyong Li, Guangquan Lu, Shichao Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Although instrumental variable (IV) methods are regarded as effective tools for addressing the confounding bias introduced by latent variables, it is often challenging to identify a valid IV. To deal with this issue, an Interaction-Data-guided Conditional IV (IDCIV) debiasing method is proposed for recommender systems, called IDCIV-RS. IDCIV-RS automatically generates the representations of valid CIVs and their corresponding conditioning sets directly from interaction data, significantly reducing the complexity of IV selection while effectively mitigating the confounding bias caused by latent variables in recommender systems. Specifically, IDCIV-RS leverages a variational autoencoder (VAE) to learn both the CIV representations and their conditioning sets from interaction data, followed by the application of least squares to derive causal representations for click prediction. Extensive experiments on two real-world datasets, Movielens-10M and Douban-Movie, demonstrate that IDCIV-RS successfully learns the representations of valid CIVs, effectively reduces bias, and consequently improves recommendation accuracy.
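The final least-squares step is standard two-stage least squares; below is a plain-numpy sketch under the assumption that the VAE has already produced CIV representations Z and a conditioning set C (all shapes and names are illustrative).

```python
import numpy as np

def two_stage_least_squares(Z, C, T, Y):
    """Z: (n, kz) CIV representations, C: (n, kc) conditioning set,
    T: (n,) treatment (e.g., exposure), Y: (n,) outcome (e.g., click)."""
    ones = np.ones((len(T), 1))
    # Stage 1: regress treatment on instruments plus conditioning set
    X1 = np.hstack([Z, C, ones])
    T_hat = X1 @ np.linalg.lstsq(X1, T, rcond=None)[0]
    # Stage 2: regress outcome on the predicted treatment plus conditioning set
    X2 = np.hstack([T_hat[:, None], C, ones])
    beta = np.linalg.lstsq(X2, Y, rcond=None)[0]
    return beta[0]                                 # debiased effect of T on Y

n = 1000
Z, C = np.random.randn(n, 4), np.random.randn(n, 3)
T = Z @ np.ones(4) + C @ np.ones(3) + np.random.randn(n)
Y = 2.0 * T + C @ np.ones(3) + np.random.randn(n)
print(two_stage_least_squares(Z, C, T, Y))         # close to the true effect 2.0
```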
4940: QiMeng-TensorOp: One-Line Prompt is Enough for High-Performance Tensor Operator Generation with Hardware Primitives
Authors: Xuzhi Zhang, Shaohui Peng, Qirui Zhou, Yuanbo Wen, Qi Guo, Ruizhi Chen, Xinguo Zhu, Weiqiang Xiong, Haixin Chen, Congying Ma, Ke Gao, Chen Zhao, Yanjun Wu, Yunji Chen, Ling Li
Location: Guangzhou | Day: TBD
Show Abstract
Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementation takes at least months and lacks portability. LLMs excel at generating high-level language codes, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators.

We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to 1291× performance improvement. Even compared with human experts, QiMeng-TensorOp could reach 251% of OpenBLAS on RISC-V CPUs, and 124% of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp also significantly reduces development costs by 200× compared with human experts.
4942: Spatial-Spectral Similarity-Guided Fusion Network for Pansharpening
Authors: Jiazhuang Xiong, Yongshan Zhang, Xinxin Wang, Lefei Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Pansharpening fuses lower-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images to generate high-resolution multispectral (HRMS) images that preserve both spatial and spectral information. Most deep pansharpening methods face challenges in cross-modal feature extraction and fusion, as well as in exploring the similarities between the fused image and both PAN and LRMS images. In this paper, we propose a spatial-spectral similarity-guided fusion network (S3FNet) for pansharpening. This architecture is composed of three parts. Specifically, a shallow feature extraction layer learns initial spatial, spectral, and fused features from PAN and LRMS images. Then, a multi-branch asymmetric encoder, consisting of spatial, spectral, and fusion branches, generates corresponding high-level features at different scales. A multi-scale reconstruction decoder, equipped with a well-designed cross-feature multi-head attention fusion block, processes the intermediate feature maps to generate HRMS images. To ensure HRMS images retain maximum spatial and spectral information, a similarity-constrained loss is defined for network training. Extensive experiments demonstrate the effectiveness of our S3FNet over state-of-the-art methods. The code is released at https://github.com/ZhangYongshan/S3FNet.
4944: Improving Prediction Certainty Estimation for Reliable Early Exiting via Null Space Projection
Authors: Jianing He, Qi Zhang, Duoqian Miao, Yi Kun, Shufeng Hao, Hongyun Zhang, Zhihua Wei
Location: Guangzhou | Day: TBD
Show Abstract
Early exiting has demonstrated great potential in accelerating the inference of pre-trained language models (PLMs) by enabling easy samples to exit at shallow layers, eliminating the need for executing deeper layers. However, existing early exiting methods primarily rely on class-relevant logits to formulate their exiting signals for estimating prediction certainty, neglecting the detrimental influence of class-irrelevant information in the features on prediction certainty. This leads to an overestimation of prediction certainty, causing premature exiting of samples with incorrect early predictions. To remedy this, we define an NSP score to estimate prediction certainty by considering the proportion of class-irrelevant information in the features. On this basis, we propose a novel early exiting method based on the Certainty-Aware Probability (CAP) score, which integrates insights from both logits and the NSP score to enhance prediction certainty estimation, thus enabling more reliable exiting decisions. The experimental results on the GLUE benchmark show that our method can achieve an average speed-up ratio of 2.19× across all tasks with negligible performance degradation, surpassing the state-of-the-art (SOTA) ConsistentEE by 28%, yielding a better trade-off between task performance and inference efficiency. The code is available at https://github.com/He-Jianing/NSP.git.
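A natural way to operationalize "the proportion of class-irrelevant information" is the share of a feature's norm lying in the null space of the classifier weight matrix; the sketch below (numpy; this concrete definition is our assumption, not necessarily the paper's exact formula) computes such a score, where a high value signals low prediction certainty and argues against an early exit.

```python
import numpy as np

def nsp_score(h: np.ndarray, W: np.ndarray) -> float:
    """h: (d,) feature; W: (num_classes, d) classifier weights.
    Returns the fraction of ||h|| lying in the null space of W."""
    # orthonormal basis of the row space of W via SVD
    _, s, vt = np.linalg.svd(W, full_matrices=False)
    V = vt[s > 1e-10]                   # rows spanning the class-relevant space
    h_row = V.T @ (V @ h)               # projection onto the row space of W
    h_null = h - h_row                  # class-irrelevant (null-space) component
    return float(np.linalg.norm(h_null) / np.linalg.norm(h))

W = np.random.randn(4, 64)              # 4 classes, 64-d features (toy)
h = np.random.randn(64)
print(nsp_score(h, W))                  # high score -> low prediction certainty
```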
4969: Exploiting Position Information in Convolutional Kernels for Structural Re-parameterization
Authors: Tianxiang Hao, Hui Chen, Guiguang Ding
Location: Guangzhou | Day: TBD
Show Abstract
In order to boost the performance of a convolutional neural network (CNN), several approaches have shown the benefit of enhancing the spatial encoding of feature maps. However, few works have paid attention to the positional properties of convolutional kernels. In this paper, we demonstrate that different kernel positions are of different importance, depending on the task, dataset, and architecture, and that adaptively emphasizing the informative parts of convolutional kernels can lead to considerable improvement. Therefore, we propose a novel structural re-parameterization Position Boosting Convolution (PBConv) to exploit and enhance the position information in the convolutional kernel. PBConv consists of several concurrent small convolutional kernels, which can be equivalently converted to the original kernel and bring no extra inference cost. Different from existing structural re-parameterization methods, PBConv searches for the optimal re-parameterized structure by a fast heuristic algorithm based on the dispersion of kernel weights. Such heuristic search is efficient yet effective, adapting well to varying kernel weight distributions. As a result, PBConv can significantly improve the representational power of a model, especially its ability to extract fine-grained low-level features. Importantly, PBConv is orthogonal to procedural re-parameterization methods and can further boost performance based on them. Code is available on GitHub.
4971: Dynamic and Adaptive Feature Generation with LLM
Authors: Xinhao Zhang, Jinghan Zhang, Banafsheh Rekabdar, Yuanchun Zhou, Pengfei Wang, Kunpeng Liu
Location: Guangzhou | Day: TBD
Show Abstract
The representation of feature space is a crucial environment where data points get vectorized and embedded for subsequent modeling. Thus, the efficacy of machine learning (ML) algorithms is closely related to the quality of feature engineering. As one of the most important techniques, feature generation transforms raw data into an optimized feature space conducive to model training and further refines the space. Despite the advancements in automated feature engineering and feature generation, current methodologies often suffer from three fundamental issues: lack of explainability, limited applicability, and inflexible strategy. These shortcomings frequently hinder and limit the deployment of ML models across varied scenarios. Our research introduces a novel approach adopting large language models (LLMs) and feature-generating prompts to address these challenges. We propose a dynamic and adaptive feature generation method that enhances the interpretability of the feature generation process. Our approach broadens the applicability across various data types and tasks and offers advantages in terms of strategic flexibility. A broad range of experiments showcases that our approach is significantly superior to existing methods.
4984: Base-Detail Feature Learning Framework for Visible-Infrared Person Re-Identification
Authors: Zhihao Gong, Lian Wu, Yong Xu
Location: Guangzhou | Day: TBD
Show Abstract
Visible-infrared person re-identification (VIReID) provides a solution for ReID tasks in 24-hour scenarios; however, significant challenges persist in achieving satisfactory performance due to the substantial discrepancies between the visible (VIS) and infrared (IR) modalities. Existing methods inadequately leverage information from different modalities, primarily focusing on mining distinguishing features from modality-shared information while neglecting modality-specific details. To fully utilize these differentiated minutiae, we propose a Base-Detail Feature Learning Framework (BDLF) that enhances the learning of both base and detail knowledge, thereby capitalizing on both modality-shared and modality-specific information. Specifically, the proposed BDLF mines detail and base features through a lossless detail feature extraction module and a complementary base embedding generation mechanism, respectively, supported by a novel correlation restriction method that ensures the features gained by BDLF enrich both detail and base knowledge across VIS and IR features. Comprehensive experiments conducted on the SYSU-MM01, RegDB, and LLCM datasets validate the effectiveness of BDLF.
5007: Asset Pricing with Contrastive Adversarial Variational Bayes
Authors: Ruirui Liu, Huichou Huang, Johannes Ruf
Location: Guangzhou | Day: TBD
Show Abstract
Machine learning techniques have gained considerable attention in the field of empirical asset pricing. Conditioning on a broad set of firm characteristics, one of the most popular no-arbitrage workhorses is a nonlinear conditional asset pricing model that consists of two modules within a neural network structure, i.e., factor and beta estimates, for which we propose a novel contrastive adversarial variational Bayes (CAVB) framework. To exploit the factor structure, we employ adversarial variational Bayes that transforms the maximum-likelihood problem into a zero-sum game between a variational autoencoder (VAE) and a generative adversarial network (GAN), where an auxiliary discriminative network brings in arbitrary expressiveness to the inference model. To tackle the problem of learning indistinguishable feature representations in the beta network, we introduce a contrastive loss to learn distinctive hidden features of the factor loadings in correspondence to conditional quantiles of return distributions. CAVB establishes a robust relation between the cross-section of asset returns and the common latent factors with nonlinear factor loadings. Extensive experiments show that CAVB not only significantly outperforms prominent models in the existing literature in terms of total and predictive R-squares, but also delivers superior Sharpe ratios after transaction costs for both long-only and long-short portfolios.
5011: Fast Explanations via Policy Gradient-Optimized Explainer
Authors: Deng Pan, Nuno Moniz, Nitesh V. Chawla
Location: Guangzhou | Day: TBD
Show Abstract
The challenge of delivering efficient explanations is a critical barrier that prevents the adoption of model explanations in real-world applications. Existing approaches often depend on extensive model queries for sample-level explanations or rely on experts' knowledge of specific model structures, trading general applicability for efficiency. To address these limitations, this paper introduces a novel framework, Fast EXplanation (FEX), that represents attribution-based explanations via probability distributions, which are optimized by leveraging the policy gradient method. The proposed framework offers a robust, scalable solution for real-time, large-scale model explanations, bridging the gap between efficiency and applicability.
We validate our framework on image and text classification tasks; the experiments demonstrate that our method reduces inference time by over 97% and memory usage by 70% compared to traditional model-agnostic approaches while maintaining high-quality explanations and broad applicability.
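To make the idea concrete, here is a toy sketch (PyTorch; the model, reward design, and hyperparameters are illustrative assumptions, not the FEX implementation) of representing an attribution map as per-feature Bernoulli keep-probabilities and optimizing them with the policy gradient (REINFORCE), rewarding masks that preserve the model's confidence in the target class.

```python
import torch

def policy_gradient_step(model, x, target_class, mask_logits, opt, n=8):
    """One REINFORCE update of a per-feature Bernoulli attribution policy."""
    probs = torch.sigmoid(mask_logits)                 # keep-probabilities
    dist = torch.distributions.Bernoulli(probs)
    masks = dist.sample((n,))                          # (n, d) binary masks
    with torch.no_grad():
        scores = model(masks * x)[:, target_class]     # reward: kept confidence
        reward = scores - scores.mean()                # baseline reduces variance
    loss = -(dist.log_prob(masks).sum(-1) * reward).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return probs.detach()                              # current attribution map

# toy usage with a linear "model" over 20 features
d = 20
model = torch.nn.Linear(d, 3)
x = torch.randn(d)
mask_logits = torch.zeros(d, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=0.1)
for _ in range(50):
    attribution = policy_gradient_step(model, x, 0, mask_logits, opt)
```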
5018: Randomised Optimism via Competitive Co-Evolution for Matrix Games with Bandit Feedback
Authors: Shishen Lin
Location: Guangzhou | Day: TBD
Show Abstract
Learning in games is a fundamental problem in machine learning and artificial intelligence, with numerous applications. This work investigates two-player zero-sum matrix games with an unknown payoff matrix and bandit feedback, where each player observes their actions and the corresponding noisy payoff. Prior studies have proposed algorithms for this setting, demonstrating the effectiveness of deterministic optimism (e.g., UCB for matrix games) in achieving sublinear regret. However, the potential of randomised optimism in matrix games remains theoretically unexplored.

We propose Competitive Co-evolutionary Bandit Learning (CoEBL), a novel algorithm that integrates evolutionary algorithms (EAs) into the bandit framework to implement randomised optimism through EA variation operators. We prove that CoEBL achieves sublinear regret, matching the performance of deterministic optimism-based methods. To the best of our knowledge, this is the first theoretical regret analysis of an evolutionary bandit learning algorithm in matrix games.

Empirical evaluations on diverse matrix game benchmarks demonstrate that CoEBL not only achieves sublinear regret but also consistently outperforms classical bandit algorithms, including EXP3, the variant EXP3-IX, and UCB. These results highlight the potential of evolutionary bandit learning, particularly the efficacy of randomised optimism via evolutionary algorithms in game-theoretic settings.
5084: Multimodal Inverse Attention Network with Intrinsic Discriminant Feature Exploitation for Fake News Detection
Authors: Tianlin Zhang, En Yu, Yi Shao, Jiande Sun
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal fake news detection has garnered significant attention due to its profound implications for social security. While existing approaches have contributed to understanding cross-modal consistency, they often fail to leverage modal-specific representations and explicit discrepant features. To address these limitations, we propose a Multimodal Inverse Attention Network (MIAN), a novel framework that explores intrinsic discriminative features based on news content to advance fake news detection. Specifically, MIAN introduces a hierarchical learning module that captures diverse intra-modal relationships through local-to-global and local-to-local interactions, thereby generating enhanced unimodal representations to improve the identification of fake news at the intra-modal level. Additionally, a cross-modal interaction module employs a co-attention mechanism to establish and model dependencies between the refined unimodal representations, facilitating seamless semantic integration across modalities. To explicitly extract inconsistency features, we propose an inverse attention mechanism that effectively highlights the conflicting patterns and semantic deviations introduced by fake news in both intra- and inter-modality. Extensive experiments on benchmark datasets demonstrate that MIAN significantly outperforms state-of-the-art methods, underscoring its pivotal contribution to advancing social security through enhanced multimodal fake news detection.
5090: LLM4VKG: Leveraging Large Language Models for Virtual Knowledge Graph Construction
Authors: Guohui Xiao, Lin Ren, Guilin Qi, Haohan Xue, Marco Di Panfilo, Davide Lanti
Location: Guangzhou | Day: TBD
Show Abstract
Virtual Knowledge Graphs (VKGs) provide an effective solution for data integration but typically require significant expertise for their construction. This process, involving ontology development, schema analysis, and mapping creation, is often hindered by naming ambiguities and matching issues, which traditional rule-based methods struggle to address. Large language models (LLMs), with their ability to process and generate contextually relevant text, offer a potential solution. In this work, we introduce LLM4VKG, a novel framework that leverages LLMs to automate VKG construction. Experimental evaluation on the RODI benchmark demonstrates that LLM4VKG surpasses state-of-the-art methods, achieving an average F1-score improvement of +17% and a peak gain of +39%. Moreover, LLM4VKG proves robust against incomplete ontologies and can handle complex mappings where current methods fail.
5100: Fast Guaranteed Tensor Recovery with Adaptive Tensor Nuclear Norm
Authors: Jiangjun Peng, Hailin Wang, Xiangyong Cao, Shuang Xu
Location: Guangzhou | Day: TBD
Show Abstract
Real-world datasets like multi-spectral images and videos are naturally represented as tensors. However, limitations in data acquisition often lead to corrupted or incomplete tensor data, making tensor recovery a critical challenge. Solving this problem requires exploiting inherent structural patterns, with the low-rank property being particularly vital. An important category of existing low-rank tensor recovery methods relies on the tensor nuclear norms. However, these methods struggle with either computational inefficiency or weak theoretical guarantees for large-scale data. To address these issues, we propose a fast guaranteed tensor recovery framework based on a new tensor nuclear norm. Our approach adaptively extracts a column-orthogonal matrix from the data, reducing a large-scale tensor into a smaller subspace for efficient processing. This dimensionality reduction enhances speed without compromising accuracy. The recovery theories of two typical models are established by introducing an adjusted incoherence condition. Extensive experiments demonstrate the effectiveness of the proposed method, showing improved accuracy and speed over existing approaches. Our code and supplementary material are available at https://github.com/andrew-pengjj/adaptive_tensor_nuclear_norm.
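The dimensionality-reduction step can be pictured as follows: a sketch (numpy; not the authors' implementation, and the mode-3 choice and rank are assumptions) that adaptively extracts a column-orthogonal matrix from the data and compresses the tensor into a smaller subspace, where recovery then runs cheaply before lifting back.

```python
import numpy as np

def adaptive_reduce(X: np.ndarray, r: int):
    """Compress an n1 x n2 x n3 tensor along mode 3 into r frontal slices."""
    n1, n2, n3 = X.shape
    unfold3 = X.reshape(n1 * n2, n3)             # mode-3 unfolding
    # data-adaptive column-orthogonal basis from the Gram matrix
    U, _, _ = np.linalg.svd(unfold3.T @ unfold3)
    Q = U[:, :r]                                 # n3 x r, Q^T Q = I
    small = (unfold3 @ Q).reshape(n1, n2, r)     # recovery runs on this tensor
    return small, Q

X = np.random.rand(32, 32, 100)
small, Q = adaptive_reduce(X, r=10)
X_back = (small.reshape(-1, 10) @ Q.T).reshape(X.shape)  # lift back afterwards
```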
5103: Efficient Multi-view Clustering via Reinforcement Contrastive Learning
Authors: Qianqian Wang, Haiming Xu, Zihao Zhang, Zhiqiang Tao, Quanxue Gao
Location: Guangzhou | Day: TBD
Show Abstract
Contrastive multi-view clustering has demonstrated remarkable potential in complex data analysis, yet existing approaches face two critical challenges: difficulty in constructing high-quality positive and negative pairs and high computational overhead due to static optimization strategies. To address these challenges, we propose an innovative efficient Multi-View Clustering framework with Reinforcement Contrastive Learning (EMVCRCL). Our key innovation is developing a reinforcement contrastive learning paradigm for dynamic clustering optimization. First, we leverage multi-view contrastive learning to obtain latent features, which are then sent to the reinforcement learning module to refine low-quality features. Specifically, it selects high-confidence features to guide the positive/negative pair construction of contrastive learning. For low-confidence features, it utilizes a balanced prior distribution to adjust their assignments. Extensive experimental results showcase the effectiveness and superiority of our proposed method on multiple benchmark datasets.
5109: Fair Incomplete Multi-View Clustering via Distribution Alignment
Authors: Qianqian Wang, Haiming Xu, Meiling Liu, Wei Feng, Xiangdong Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Incomplete multi-view clustering (IMVC) extracts consistent and complementary information from multi-source/modality data with missing views, aiming to partition the data into different clusters. It can effectively address the problem of unsupervised multi-source data analysis in complex environments and has gained considerable attention. However, the fairness of IMVC remains underexplored, particularly when data contains sensitive features (e.g., gender, marital status, and age). To tackle the problem, this work presents a novel Fair Incomplete Multi-View Clustering (FIMVC) method. The proposed FIMVC introduces fairness constraints to ensure clustering results are independent of sensitive features. Additionally, it learns consensus representations to enhance clustering performance by maximizing mutual information and aligning the distributions of different views. Experimental results on three datasets containing sensitive features demonstrate that our method improves the fairness of clustering results while outperforming state-of-the-art IMVC methods in clustering performance.
5124: MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning
Authors: Zikang Guo, Benfeng Xu, Xiaorui Wang, Zhendong Mao
Location: Guangzhou | Day: TBD
Show Abstract
Complex tasks involving tool integration pose significant challenges for Large Language Models (LLMs), leading to the emergence of multi-agent workflows as a promising solution. Reflection has emerged as an effective strategy for correcting erroneous trajectories in agentic workflows. However, existing approaches only exploit such capability in the post-action stage, where the agent observes the execution outcomes. We argue that, like humans, LLMs can also engage in reflection before action execution: the agent can anticipate undesirable outcomes from its own decisions, which not only provides a necessarily complementary perspective to evaluate the decision but also prevents the propagation of errors throughout the trajectory. In this paper, we propose MIRROR, a framework that consists of both intra-reflection, which critically assesses intended actions before execution, and inter-reflection, which further adjusts the trajectory based on observations. This design systematically leverages LLM reflection capabilities to eliminate and rectify erroneous actions on a more comprehensive scope. Evaluations on both the StableToolBench and TravelPlanner benchmarks demonstrate MIRROR’s superior performance, achieving state-of-the-art results compared to existing approaches.
5129: Learning Neural Vocoder from Range-Null Space Decomposition
Authors: Andong Li, Tong Lei, Zhihang Sun, Rilin Chen, Erwei Yin, Xiaodong Li, Chengshi Zheng
Location: Guangzhou | Day: TBD
Show Abstract
Despite the rapid development of neural vocoders in recent years, they usually suffer from intrinsic challenges such as opaque modeling and the parameter-performance trade-off. In this study, we propose an innovative time-frequency (T-F) domain-based neural vocoder to resolve these challenges. Specifically, we bridge the classical signal range-null decomposition (RND) theory and the vocoder task: the reconstruction of the target spectrogram is decomposed into the superimposition of a range-space component and a null-space component, where the former is obtained by a linear domain shift from the original mel-scale domain to the target linear-scale domain, and the latter is instantiated via a learnable network for further spectral detail generation. Accordingly, we propose a novel dual-path framework, where the spectrum is hierarchically encoded/decoded, and cross- and narrow-band modules are elaborately devised for efficient sub-band and sequential modeling. Comprehensive experiments are conducted on the LJSpeech and LibriTTS benchmarks. Quantitative and qualitative results show that while enjoying lightweight network parameters, the proposed approach yields state-of-the-art performance among existing advanced methods. Our code and the pretrained model weights are available at https://github.com/Andong-Li-speech/RNDVoC.
5135: Boosting Few-Shot Open-Set Object Detection via Prompt Learning and Robust Decision Boundary
Authors: Zhaowei Wu, Binyi Su, Qichuan Geng, Hua Zhang, Zhong Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Few-shot Open-set Object Detection (FOOD) poses a challenge in many open-world scenarios. It aims to train an open-set detector to detect known objects while rejecting unknowns with scarce training samples. Existing FOOD methods are subject to limited visual information, and often exhibit an ambiguous decision boundary between known and unknown classes. To address these limitations, we propose the first prompt-based few-shot open-set object detection framework, which exploits additional textual information and delves into constructing a robust decision boundary for unknown rejection. Specifically, as no training data are available for unknown classes, we select pseudo-unknown samples with Attribution-Gradient based Pseudo-unknown Mining (AGPM), which leverages the discrepancy in attribution gradients to quantify uncertainty. Subsequently, we propose Conditional Evidence Decoupling (CED) to decouple and extract distinct knowledge from selected pseudo-unknown samples by eliminating opposing evidence. This optimization process can enhance the discrimination between known and unknown classes. To further regularize the model and form a robust decision boundary for unknown rejection, we introduce Abnormal Distribution Calibration (ADC) to calibrate the output probability distribution of local abnormal features in pseudo-unknown samples. Our method achieves superior performance over previous state-of-the-art approaches, improving the average recall of the unknown class by 7.24% across all shots in the VOC10-5-5 dataset settings and by 1.38% in the VOC-COCO dataset settings. Our source code is available at https://gitee.com/VR_NAVE/ced-food.
5141: Causal-aware Large Language Models: Enhancing Decision-Making Through Learning, Adapting and Acting
Authors: Wei Chen, Jiahao Zhang, Haipeng Zhu, Boyan Xu, Zhifeng Hao, Keli Zhang, Junjian Ye, Ruichu Cai
Location: Guangzhou | Day: TBD
Show Abstract
Large language models (LLMs) have shown great potential in decision-making due to the vast amount of knowledge stored within them. However, these pre-trained models often lack reasoning abilities and are difficult to adapt to new environments, hindering their application to complex real-world tasks. To address these challenges, inspired by the human cognitive process, we propose Causal-Aware LLMs, which integrate the structural causal model (SCM) into the decision-making process to model, update, and utilize structured knowledge of the environment in a "learning-adapting-acting" paradigm. Specifically, in the learning stage, we first utilize an LLM to extract the environment-specific causal entities and their causal relations to initialize a structured causal model of the environment. Subsequently, in the adapting stage, we update the structured causal model through external feedback about the environment, via an idea of causal intervention. Finally, in the acting stage, Causal-Aware LLMs exploit the structured causal knowledge for more efficient policy-making through a reinforcement learning agent. These processes are performed iteratively to learn causal knowledge, ultimately enabling the causal-aware LLM to achieve a more accurate understanding of the environment and make more efficient decisions. Experimental results across 22 diverse tasks within the open-world game "Crafter" validate the effectiveness of our proposed method.
5148: STAMImputer: Spatio-Temporal Attention MoE for Traffic Data Imputation
Authors: Yiming Wang, Hao Peng, Senzhang Wang, Haohua Du, Chunyang Liu, Jia Wu, Guanlin Wu
Location: Guangzhou | Day: TBD
Show Abstract
Traffic data imputation is fundamentally important to support various applications in intelligent transportation systems such as traffic flow prediction. However, existing time-to-space sequential methods often fail to effectively extract features in block-wise missing data scenarios. Meanwhile, the static graph structure for spatial feature propagation significantly constrains the model’s flexibility in handling the distribution shift issue for nonstationary traffic data. To address these issues, this paper proposes a Spatio-Temporal Attention Mixture of experts network named STAMImputer for traffic data imputation. Specifically, we introduce a Mixture of Experts (MoE) framework to capture latent spatio-temporal features and their influence weights, effectively imputing block-wise missing data. A novel Low-rank guided Sampling Graph ATtention (LrSGAT) mechanism is designed to dynamically balance the local and global correlations across road networks. The sampled attention vectors are utilized to generate dynamic graphs that capture real-time spatial correlations. Extensive experiments are conducted on four traffic datasets for evaluation. Results show that STAMImputer achieves significant performance improvements over existing SOTA approaches. Our codes are available at https://github.com/RingBDStack/STAMImupter.
5154: A Hybrid Multi-Factor Network with Dynamic Sequence Modeling for Early Warning of Intraoperative Hypotension
Authors: Mingyue Cheng, Jintao Zhang, Zhiding Liu, Chunli Liu
Location: Guangzhou | Day: TBD
Show Abstract
Intraoperative hypotension (IOH) prediction using past physiological signals is crucial, as IOH may lead to inadequate organ perfusion and significantly elevate the risk of severe complications and mortality. However, current methods often rely on static modeling, overlooking the complex temporal dependencies and the inherently non-stationary nature of physiological signals. We propose a Hybrid Multi-Factor (HMF) network that formulates IOH prediction as a dynamic sequence forecasting task, explicitly capturing both temporal dependencies and physiological non-stationarity. We represent signal dynamics as multivariate time series and decompose them into trend and seasonal components, enabling separate modeling of long-term and periodic variations. Each component is encoded with a patch-based Transformer to balance computational efficiency and feature representation. To address distributional drift from evolving signals, we introduce a symmetric normalization mechanism. Experiments on both public and real-world clinical datasets show that HMF significantly outperforms competitive baselines. We hope HMF offers new insights into IOH prediction and ultimately promotes safer surgical care. Our code is available at https://github.com/Mingyue-Cheng/HMF.
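Two of the named ingredients are easy to sketch. Below (PyTorch; the kernel size and the exact form of the normalization are assumptions, not the paper's specification) is a moving-average trend/seasonal decomposition plus a symmetric normalization that removes per-window statistics before encoding and restores them after forecasting, one common remedy for distributional drift.

```python
import torch
import torch.nn.functional as F

def decompose(x: torch.Tensor, kernel: int = 25):
    """x: (batch, length) -> (trend, seasonal) via a moving average."""
    pad = kernel // 2
    padded = F.pad(x.unsqueeze(1), (pad, pad), mode="replicate")
    trend = F.avg_pool1d(padded, kernel, stride=1).squeeze(1)
    return trend, x - trend                        # seasonal = residual

def symmetric_norm(x: torch.Tensor):
    """Strip per-window statistics before the encoder..."""
    mu = x.mean(-1, keepdim=True)
    sigma = x.std(-1, keepdim=True) + 1e-6
    return (x - mu) / sigma, (mu, sigma)

def symmetric_denorm(y: torch.Tensor, stats):
    """...and symmetrically restore them on the forecast."""
    mu, sigma = stats
    return y * sigma + mu

x = torch.randn(8, 96)                             # toy batch of signal windows
trend, seasonal = decompose(x)                     # modeled by separate encoders
z, stats = symmetric_norm(trend)
```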
5170: ABNet: Mitigating Sample Imbalance in Anomaly Detection Within Dynamic Graphs
Authors: Yifan Hong, Muhammad Asif Ali, Huan Wang, Junyang Chen, Di Wang
Location: Guangzhou | Day: TBD
Show Abstract
In dynamic graphs, detecting anomalous nodes faces challenges due to sample imbalance, stemming from the scarcity of anomalous samples and feature representation bias. Existing methods often use unsupervised or semi-supervised learning to extract anomalous samples from unlabeled data, but struggle to obtain enough anomalous instances due to their low occurrence. Moreover, GNN-based approaches often prioritize normal samples, neglecting rare anomalies. To address these issues, we propose the Anomaly Balance Network (ABNet), designed to alleviate sample imbalance and enhance anomaly detection. ABNet includes three key components: a feature extractor that compares node features across time points to avoid bias, an anomaly augmenter that amplifies anomaly details and generates diverse anomalous samples, and an anomaly detector using meta-learning to adapt to graph evolution. Experimental results show that ABNet outperforms existing methods on three real-world datasets, effectively addressing sample imbalance.
5172: Understanding Visual Detail Hallucinations of Large Vision-Language Models
Authors: Xiaoxi Sun, Jianxin Liang, Yueqian Wang, Huishuai Zhang, Dongyan Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Understanding small visual objects is crucial in fields such as video surveillance, remote sensing, and autonomous driving. In this paper, we investigate the capability of advanced large vision-language models (LVLMs) to recognize and interpret small objects in visual data. To this end, we curate a specialized dataset for evaluating fine-grained visual hallucinations, incorporating two object categories and three types of hallucinations.
First, we assess 11 state-of-the-art LVLMs, yielding several key insights. As anticipated, LVLMs perform significantly worse on queries related to small objects than on regular-sized ones, and performance on regular objects proves to be an unreliable predictor of performance on small objects. This finding underscores the need for dedicated research on fine-grained visual hallucinations. Second, we evaluate three training-free methods: Scaffold, Chain of Thought (CoT), and Image Resizing, all of which result in varying degrees of improvement. Furthermore, we conduct a series of detailed ablation studies on the visual encoders of Eagle-X5, examining their performance across fine-grained visual hallucination tasks. Our findings reveal that the ConvNeXt architecture is critical for object existence recognition tasks. In contrast, for mitigating other types of hallucinations, integrating information from multiple visual encoders is significantly more effective than relying on a single encoder.
These results highlight several promising directions for advancing small object recognition with LVLMs.
5186: Learning Causally Disentangled Representations for Fair Personality Detection
Authors: Yangfu Zhu, Meiling Li, Yuting Wei, Di Liu, Yuqing Li, Bin Wu
Location: Guangzhou | Day: TBD
Show Abstract
Personality detection aims to identify the personality traits implied in social posts. Existing methods mainly focus on learning the mapping between user-generated posts and personality trait labels but inevitably suffer from potential harm caused by individual bias, as these posts are written by authors from different backgrounds. Learning such spurious associations between posts and traits may lead to the formation of stereotypes, ultimately restricting personality detection across different kinds of individuals. To tackle this issue, we first investigate individual bias in personality detection from the causality perspective. We propose an Interventional Personality Detection Network (IPDN) to learn implicit confounders in user-generated posts and exploit the true causal effect to train the detection model. Specifically, our IPDN disentangles the causal and biased features behind user-generated posts, and the biased features are progressively clustered into confounder prototypes as training proceeds. In parallel, the reconstruction network is reused to approximate backdoor adjustment on raw posts, ensuring that each trait is exposed to every confounder equally before detection. Extensive experiments conducted on three real-world datasets demonstrate that our IPDN outperforms state-of-the-art methods in personality detection.
5197: Conditional Causal Representation Learning for Heterogeneous Single-cell RNA Data Integration and Prediction
Authors: Jiayi Dong, Jiahao Li, Fei Wang
Location: Guangzhou | Day: TBD
Show Abstract
Single-cell sequencing technology provides deep insights into gene activity at the individual cell level, facilitating the study of gene regulatory mechanisms. However, observed gene expression is often influenced by confounding factors such as batch effects, perturbations, and spatial position, which obscure the true gene regulatory network that governs the cell’s intrinsic state. To address these challenges, we propose scConCRL, a novel conditional causal representation learning framework designed to extract the true gene regulatory relationships independent of confounding information. By considering both fine-grained molecular gene variables and coarse-grained latent domain variables, scConCRL not only uncovers the intrinsic biological signals but also models the complex relationships between these variables. This dual function enables the separation of genuine cellular states from domain information, providing valuable insights for downstream analyses and biological discovery. We demonstrate the effectiveness of our model on multi-domain datasets from different platforms and perturbation conditions, showing its ability to accurately disentangle confounding influences and discover novel gene relationships. Extensive comparisons across various scenarios illustrate the superior performance of scConCRL in several tasks compared to existing methods.
5201: Two-stage Risk Control with Application to Ranked Retrieval
Authors: Yunpeng Xu, Mufang Ying, Wenge Guo, Zhi Wei
Location: Guangzhou | Day: TBD
Show Abstract
Practical machine learning systems often operate in multiple sequential stages, as seen in ranking and recommendation systems, which typically include a retrieval phase followed by a ranking phase. Effectively assessing prediction uncertainty and ensuring effective risk control in such systems pose significant challenges due to their inherent complexity. To address these challenges, we developed two-stage risk control methods based on the recently proposed learn-then-test (LTT) and conformal risk control (CRC) frameworks. Unlike the methods in prior work that address multiple risks, our approach leverages the sequential nature of the problem, resulting in reduced computational burden. We provide theoretical guarantees for our proposed methods and design novel loss functions tailored for ranked retrieval tasks. The effectiveness of our approach is validated through experiments on two large-scale, widely-used datasets: MSLR-Web and Yahoo LTRC.
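For context, the single-stage CRC building block selects the least conservative threshold whose finite-sample risk bound stays below the target level; a sketch follows (numpy; the loss function, data, and the monotonicity assumption are illustrative, and the two-stage extension the paper develops is not shown).

```python
import numpy as np

def crc_threshold(loss_fn, cal_scores, cal_labels, lambdas, alpha):
    """Smallest lambda whose CRC bound holds; assumes the loss is bounded by 1
    and nonincreasing in lambda, as in conformal risk control."""
    n = len(cal_labels)
    for lam in sorted(lambdas):
        risk = np.mean([loss_fn(s, y, lam)
                        for s, y in zip(cal_scores, cal_labels)])
        if (n * risk + 1.0) / (n + 1) <= alpha:    # finite-sample CRC bound
            return lam
    return max(lambdas)                            # most conservative fallback

# toy usage: control miscoverage of score-threshold prediction sets
scores = np.random.rand(500, 10)                   # calibration softmax scores
labels = np.random.randint(0, 10, size=500)
miss = lambda s, y, lam: float(s[y] < 1.0 - lam)   # 1 if true label excluded
lam_hat = crc_threshold(miss, scores, labels, np.linspace(0, 1, 101), 0.1)
```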
5217: A Cross-Modal Densely Guided Knowledge Distillation Based on Modality Rebalancing Strategy for Enhanced Unimodal Emotion Recognition
Authors: Shuang Wu, Heng Liang, Yong Zhang, Yanlin Chen, Ziyu Jia
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal emotion recognition has garnered significant attention for its ability to integrate data from multiple modalities to enhance performance. However, physiological signals like the electroencephalogram are more challenging to acquire than visual data due to higher collection costs and complexity. This limits the practical application of multimodal networks. To address this issue, this paper proposes a cross-modal knowledge distillation framework for emotion recognition. The framework aims to leverage the strengths of a multimodal teacher network to enhance the performance of a unimodal student network using only the visual modality as input. Specifically, we design a prototype-based modality rebalancing strategy, which dynamically adjusts the convergence rates of different modalities to mitigate the modality imbalance issue. It enables the teacher network to better integrate multimodal information. Building upon this, we develop a Cross-Modal Densely Guided Knowledge Distillation (CDGKD) method, which effectively transfers knowledge extracted by the multimodal teacher network to the unimodal student network. Our CDGKD uses multi-level teacher assistant networks to bridge the teacher-student gap and employs dense guidance to reduce error accumulation during knowledge transfer. Experimental results demonstrate that the proposed framework outperforms existing methods on two public emotion datasets, providing an effective solution for emotion recognition in modality-constrained scenarios.
5223: Divide and Conquer: Coordinating Multiplex Mixture of Graph Learners to Handle Multi-Omics Analysis
Authors: Zhihao Wu, Jielong Lu, Jiajun Yu, Sheng Zhou, Yueyang Pi, Haishuai Wang
Location: Guangzhou | Day: TBD
Show Abstract
Graph learning has shown significant advantages in organizing and leveraging complex data, making it promising for numerous real-world applications with heterogeneous information, particularly multi-omics data analysis. Despite its potential in such scenarios, existing methods are still in their infancy, lacking architectural potential and struggling to handle such complex data. In this paper, we propose the Multiplex Mixture of Graph Learners (MMoG) framework. MMoG first conducts fine-grained processing of consensus and unique information, constructing consistent features and multiplex graph structures. Then, a macroscopically shared group of sub-GNNs with diverse orders and architectures synergistically learn representations, providing a foundation for strong interaction between different views. Inspired by the mixture of experts (MoE), each sample in different omics adaptively determines the neighborhood ranges and architectures for information aggregation, while blocking unsuitable sub-GNNs. MMoG treats the complex multi-omics analysis as a multi-view learning problem, and essentially decomposes it into multiple sub-problems, allowing each omics/view to solve intersecting yet unique sub-problem groups. Additionally, we introduce mutual information-driven orthogonal loss and balancing loss to avoid view collapse. Extensive experiments on multi-omics data across multiple cancer types highlight MMoG’s superiority.
5234: SCVBench: A Benchmark with Multi-turn Dialogues for Story-Centric Video Understanding
Authors: Sisi You, Bowen Yuan, Bing-Kun Bao
Location: Guangzhou | Day: TBD
Show Abstract
Video understanding seeks to enable machines to interpret visual content across three levels: action, event, and story. Existing models are limited in their ability to perform high-level long-term story understanding, due to (1) the oversimplified treatment of temporal information and (2) the training bias introduced by action/event-centric datasets. To address this, we introduce SCVBench, a novel benchmark for story-centric video understanding. SCVBench evaluates LVLMs through an event ordering task decomposed into sub-questions leading to a final question, quantitatively measuring historical dialogue exploration. We collected 1,253 final questions and 6,027 sub-question pairs from 925 videos, constructing continuous multi-turn dialogues. Experimental results show that while closed-source GPT-4o outperforms other models, most open-source LVLMs struggle with story-centric video understanding. Additionally, our StoryCoT model significantly surpasses open-source LVLMs on SCVBench. SCVBench aims to advance research by comprehensively analyzing LVLMs’ temporal reasoning and comprehension capabilities. Code can be accessed at https://github.com/yuanrr/SCVBench.
5240: Query-Based and Unnoticeable Graph Injection Attack from Neighborhood Perspective
Authors: Chang Liu, Hai Huang, Xingquan Zuo
Location: Guangzhou | Day: TBD
Show Abstract
The robustness of Graph Neural Networks (GNNs) has become an increasingly important topic due to their expanding range of applications. Various attack methods have been proposed to explore the vulnerabilities of GNNs, ranging from Graph Modification Attacks (GMA) to the more practical and flexible Graph Injection Attacks (GIA). However, existing methods face two key challenges: (i) their reliance on surrogate models, which often leads to reduced attack effectiveness due to structural differences and prior biases, and (ii) existing GIA methods often sacrifice attack success rates in undefended settings to bypass certain defense models, thereby limiting their overall effectiveness. To overcome these limitations, we propose QUGIA, a Query-based and Unnoticeable Graph Injection Attack. QUGIA injects nodes by first selecting edges based on victim node connections and then generating node features using a Bayesian framework. This ensures that the injected nodes are similar to the original graph nodes, implicitly preserving homophily and making the attack more unnoticeable. Unlike previous methods, QUGIA does not rely on surrogate models, thereby avoiding performance degradation and achieving better generalization. Extensive experiments on six real-world datasets with diverse characteristics demonstrate that QUGIA achieves unnoticeable attacks and outperforms state-of-the-art attackers. Our code is available at: https://anonymous.4open.science/r/QUGIA-588E/.
5246: Federated Multi-view Graph Clustering with Incomplete Attribute Imputation
Authors: Wei Feng, Zeyu Bi, Qianqian Wang, Bo Dong
Location: Guangzhou | Day: TBD
Show Abstract
Federated Multi-View Clustering (FedMVC) aims to uncover consistent clustering structures from distributed multi-view data while preserving data privacy. However, existing FedMVC methods under vertical settings either ignore the ubiquitous incomplete-view issue or require uploading data features, which may lead to privacy leakage or induce high communication costs. To mitigate the view incompleteness issue while maintaining privacy and efficiency, we propose a novel Federated Multi-view Graph Clustering with Incomplete Attribute Imputation (FMVC-IAI). This method constructs a consensus graph structure through complementary multi-view data and then utilizes a non-parametric graph neural network (GNN) to impute missing features. Additionally, it utilizes the adjacency graph as the knowledge carrier to share and fuse the multi-view information. To alleviate the high communication cost due to graph sharing, we propose sharing the anchor graph for global adjacency graph construction, which reduces communication cost and also helps to reduce privacy leakage risk. Extensive experiments demonstrate the superiority of our method in FedMVC tasks with incomplete views.
5250: Tensorial Multi-view Clustering with Deep Anchor Graph Projection
Authors: Wei Feng, Dongyuvan Wei, Qianqian Wang, Bo Dong
Location: Guangzhou | Day: TBD
Show Abstract
Multi-view clustering (MVC) has emerged as an important unsupervised multi-view learning method that leverages consistent and complementary information to enhance clustering performance. Recently, tensorized MVC, which processes multi-view data as a tensor to capture their cross-view information, has received considerable attention.
However, existing tensorized MVC methods generally overlook deep structures within each view and rely on post-processing to derive clustering results, leading to potential information loss and degraded performance. To address these issues, we develop Tensorial Multi-view Clustering with Deep Anchor Graph Projection (TMVC-DAGP), which performs deep projection on the anchor graph, thus improving model scalability. Besides, we utilize a sparsity regularization to eliminate redundancy and enforce the projected anchor graph to retain a clear clustering structure. Furthermore, TMVC-DAGP leverages the weighted tensor Schatten p-norm to exploit consistent and complementary information. Extensive experiments on multiple datasets demonstrate TMVC-DAGP’s effectiveness and superiority.
5268: CRAFT: Time Series Forecasting with Cross-Future Behavior Awareness
Authors: Yingwei Zhang, Ke Bu, Zhuoran Zhuang, Tao Xie, Yao Yu, Dong Li, Yang Guo, Detao Lv
Location: Guangzhou | Day: TBD
Show Abstract
The past decades have witnessed significant advancements in time series forecasting (TSF) across various real-world domains, including e-commerce and disease spread prediction. However, TSF is usually constrained by the uncertainty dilemma of predicting future data with limited past observations. To address this issue, we explore the use of Cross-Future Behavior (CFB) in TSF, which occurs before the current time but takes effect in the future. We leverage CFB features and propose the CRoss-Future Behavior Awareness based Time Series Forecasting method (CRAFT). The core idea of CRAFT is to utilize the trend of cross-future behavior to mine the trend of the time series data to be predicted. Specifically, to address the sparsity and incompleteness of cross-future behavior, CRAFT employs the Koopman Predictor Module to extract the key trend and the Internal Trend Mining Module to supplement the unknown area of the cross-future behavior matrix. Then, we introduce the External Trend Guide Module with a hierarchical structure to acquire more representative trends from higher levels. Finally, we apply the demand-constrained loss to calibrate the distribution deviation of prediction results. Experiments on both an offline large-scale dataset and an online A/B test demonstrate the effectiveness of CRAFT. Our dataset and code are available at https://github.com/CRAFTinTSF/CRAFT.
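A Koopman-style predictor typically fits a linear operator that advances the series one step; a minimal DMD-flavoured sketch (numpy; purely illustrative of the idea, not the paper's module) is shown below.

```python
import numpy as np

def koopman_predict(series: np.ndarray, horizon: int) -> np.ndarray:
    """series: (dims, time). Fit x_{t+1} ~= K x_t, then roll forward."""
    X, Y = series[:, :-1], series[:, 1:]
    K = Y @ np.linalg.pinv(X)                 # least-squares Koopman operator
    preds, x = [], series[:, -1]
    for _ in range(horizon):
        x = K @ x                             # advance the linear dynamics
        preds.append(x)
    return np.stack(preds, axis=1)            # (dims, horizon) trend forecast

t = np.linspace(0, 8 * np.pi, 200)
toy = np.stack([np.sin(t), np.cos(t)])        # toy 2-d oscillatory "behavior"
print(koopman_predict(toy, horizon=5).shape)  # (2, 5)
```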
5275: Improving Consistency Identification in Task-oriented Dialogue Through Multi-Agent Collaboration
Authors: Peng Wang, Shuo Li, Ruoxi Zhou, Qiguang Chen, Xiao Xu, Hao Fei, Dagang Li, Wanxiang Che, Libo Qin
Location: Guangzhou | Day: TBD
Show Abstract
Consistency identification in task-oriented dialog (CI-ToD) typically consists of three sub-tasks: User Query Inconsistency (QI) identification, Dialogue History Inconsistency (HI) identification, and Knowledge Base Inconsistency (KBI) identification, which aim to determine inconsistent relationships between the system response and the user query, dialogue history, and knowledge base. Previous approaches focus on the exploration of deep learning models for CI-ToD. While these models achieve remarkable progress, they still rely on large amounts of labeled data, which is hard to obtain in real-world scenarios. Motivated by this, in this paper we aim to explore large language models for CI-ToD, which do not require any training data. In addition, we further introduce a multi-agent collaboration framework (MAC-CIToD) to model the interaction across the three sub-tasks in CI-ToD, including (1) a Full Connection paradigm, (2) a Cycle Connection paradigm, and (3) a Central Connection paradigm, which effectively builds interaction across QI, HI, and KBI. Experiments on the standard benchmark reveal that our framework achieves superior performance. Additionally, we compare MAC-CIToD with the most advanced trained approaches and find that its zero-shot performance on most metrics even surpasses that of models trained on the CI-ToD dataset.
5295: MMNet: Missing-Aware and Memory-Enhanced Network for Multivariate Time Series Imputation
Authors: Xiaoye Miao, Han Shi, Yi Yuan, Daozhan Pan, Yangyang Wu, Xiaohua Pan
Location: Guangzhou | Day: TBD
Show Abstract
Multivariate time series (MTS) data in real-world scenarios are often incomplete, which hinders effective data analysis. Therefore, MTS imputation has been widely studied to facilitate various MTS tasks. Existing imputation methods primarily initialize missing values with zeros to enable encoding of incomplete MTS, which impedes the model’s capacity to precisely discern the missing-value distribution. Moreover, these methods often overlook global similarity across time series and are limited to local information within each sample. To this end, we propose a novel multivariate time series imputation network model, named MMNet. MMNet introduces a Missing-Aware Embedding (MAE) approach to adaptively represent incomplete MTS, allowing the model to better distinguish between missing and observed data. Furthermore, we design a Memory-Enhanced Encoder (MEE) that models prior knowledge through a memory mechanism, enabling better utilization of the global similarity within the time series. Building upon this, MMNet incorporates a Multi-scale Mixing architecture (MSM) that leverages information from multiple scales to enhance the final imputation. Extensive experiments on four public real-world datasets demonstrate that MMNet yields a more than 25% performance gain compared with the state-of-the-art methods.
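To make the contrast with zero-initialization concrete, here is a minimal sketch of a missing-aware embedding: missing entries receive a learned token instead of zeros, so the encoder can tell "missing" apart from an observed zero. This is an illustrative construction of the general idea, not the authors' MAE module; all names, shapes, and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class MissingAwareEmbedding(nn.Module):
    def __init__(self, n_features: int, d_model: int):
        super().__init__()
        self.value_proj = nn.Linear(n_features, d_model)              # projects feature vectors
        self.miss_token = nn.Parameter(torch.zeros(1, 1, n_features)) # learned "missing" token

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x:    (batch, time, n_features) raw series with arbitrary values at missing slots
        # mask: (batch, time, n_features), 1 = observed, 0 = missing
        filled = x * mask + self.miss_token * (1 - mask)  # substitute learned token, not zeros
        return self.value_proj(filled)

emb = MissingAwareEmbedding(n_features=8, d_model=64)
x = torch.randn(2, 24, 8)
mask = (torch.rand(2, 24, 8) > 0.3).float()
print(emb(x, mask).shape)  # torch.Size([2, 24, 64])
```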
5299: WenyanGPT: A Large Language Model for Classical Chinese Tasks
Authors: Xinyu Yao, Mengdi Wang, Bo Chen, Xiaobing Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Classical Chinese, as the core carrier of Chinese culture, plays a crucial role in the inheritance and study of ancient literature. However, existing natural language processing models are primarily optimized for Modern Chinese, resulting in inadequate performance on Classical Chinese. This paper presents a comprehensive solution for Classical Chinese language processing. By continued pre-training and instruction fine-tuning of the LLaMA3-8B-Chinese model, we construct a large language model, WenyanGPT, which is specifically designed for Classical Chinese tasks. Additionally, we develop an evaluation benchmark dataset, WenyanBENCH. Experimental results on WenyanBENCH demonstrate that WenyanGPT significantly outperforms current advanced LLMs on various Classical Chinese tasks. We make the model’s training data, instruction fine-tuning data, and evaluation benchmark dataset publicly available to promote further research and development in the field of Classical Chinese processing.
5304: Generic Adversarial Attack Framework Against Vertical Federated Learning
Authors: Yimin Liu, Peng Jiang
Location: Guangzhou | Day: TBD
Show Abstract
Vertical federated learning (VFL) enables feature-level collaboration by incorporating scattered attributes from aligned samples, and allows each party to contribute its personalized input to joint training and inference. The injection of adversarial inputs can mislead the joint inference towards the attacker’s will, forcing other benign parties to make negligible contributions and lose rewards tied to the importance of their contributions. However, most attacks require server model queries, subsets of complete test samples, or labeled auxiliary images from the training domain. These extra requirements are not practical for real-world VFL applications. In this paper, we propose PGAC, a novel and practical attack framework for crafting adversarial inputs to dominate joint inference, which does not rely on the above requirements. PGAC advances prior attacks by requiring only access to auxiliary images from non-training domains. PGAC learns generalized label-indicative embeddings and estimates class-transferable probabilities across domains to generate a proxy model that closely approximates the server model. PGAC then augments images by emphasizing salient regions with class activation maps, creating a diverse shadow input set that resembles influential test inputs. With proxy fidelity and input diversity, PGAC crafts transferable adversarial inputs. Evaluation on diverse model architectures confirms the effectiveness of PGAC.
5317: ReplayCAD: Generative Diffusion Replay for Continual Anomaly Detection
Authors: Lei Hu, Zhiyong Gan, Ling Deng, Jinglin Liang, Lingyu Liang, Shuangping Huang, Tianshui Chen
Location: Guangzhou | Day: TBD
Show Abstract
Continual Anomaly Detection (CAD) enables anomaly detection models to learn new classes while preserving knowledge of historical classes. CAD faces two key challenges: catastrophic forgetting and segmentation of small anomalous regions. Existing CAD methods store image distributions or patch features to mitigate catastrophic forgetting, but they fail to preserve pixel-level detailed features for accurate segmentation. To overcome this limitation, we propose ReplayCAD, a novel diffusion-driven generative replay framework that replays high-quality historical data, thus effectively preserving pixel-level detailed features. Specifically, we compress historical data by searching for a class semantic embedding in the conditional space of a pre-trained diffusion model, which can guide the model to replay data with fine-grained pixel details, thus improving segmentation performance. However, relying solely on semantic features results in limited spatial diversity. Hence, we further use spatial features to guide data compression, achieving precise control of the sample space and thereby generating more diverse data. Our method achieves state-of-the-art performance in both classification and segmentation, with notable improvements in segmentation: 11.5% on VisA and 8.1% on MVTec. Our source code is available at https://github.com/HULEI7/ReplayCAD.
5324: ListenNet: A Lightweight Spatio-Temporal Enhancement Nested Network for Auditory Attention Detection
Authors: Cunhang Fan, Xiaoke Yang, Hongyu Zhang, Ying Chen, Lu Li, Jian Zhou, Zhao Lv
Location: Guangzhou | Day: TBD
Show Abstract
Auditory attention detection (AAD) aims to identify the direction of the attended speaker in multi-speaker environments from brain signals, such as Electroencephalography (EEG) signals. However, existing EEG-based AAD methods overlook the spatio-temporal dependencies of EEG signals, limiting their decoding and generalization abilities. To address these issues, this paper proposes a Lightweight Spatio-Temporal Enhancement Nested Network (ListenNet) for AAD. ListenNet has three key components: the Spatio-temporal Dependency Encoder (STDE), Multi-scale Temporal Enhancement (MSTE), and Cross-Nested Attention (CNA). The STDE reconstructs dependencies between consecutive time windows across channels, improving the robustness of dynamic pattern extraction. The MSTE captures temporal features at multiple scales to represent both fine-grained and long-range temporal patterns. In addition, the CNA integrates hierarchical features more effectively through novel dynamic attention mechanisms to capture deep spatio-temporal correlations. Experimental results on three public datasets demonstrate the superiority of ListenNet over state-of-the-art methods in both subject-dependent and challenging subject-independent settings, while reducing the trainable parameter count by a factor of approximately 7. Code is available at: https://github.com/fchest/ListenNet.
5328: NS4S: Neighborhood Search for Scheduling Problems Via Large Language Models
Authors: Junjie Zhang, Canhui Luo, Zhouxing Su, Qingyun Zhang, Zhipeng Lü, Junwen Ding, Yan Jin
Location: Guangzhou | Day: TBD
Show Abstract
Large Language Models (LLMs) have emerged as a promising technology for solving combinatorial optimization problems. However, their direct application to scheduling problems remains limited due to the inherent complexity of these problems. This paper proposes an LLM-based neighborhood search method that leverages LLMs to tackle the job shop scheduling problem (JSP) and its variants.
The main contributions of this work are threefold. First, we introduce a novel LLM-guided neighborhood evaluation strategy that guides local search by dynamically adjusting operation weights. Second, we develop a verification evolution (VeEvo) framework to mitigate the hallucination effects of LLMs, enabling the generation of high-quality heuristics for weight updates. Third, we integrate this framework with the weighted neighborhood evaluation strategy to effectively guide the search towards promising regions.
Extensive experiments are conducted on 349 benchmark instances across three classical scheduling problems. The results demonstrate that our algorithm significantly outperforms existing state-of-the-art methods. For JSP, our algorithm reduces the average optimality gap from 10.46% to 1.35% on Taillard’s instances compared to reinforced adaptive staircase curriculum learning. For flexible JSP (FJSP), it reduces the gap from 13.24% to 0.05% on Brandimarte’s instances compared to deep reinforcement learning methods. Furthermore, for FJSP with sequence dependent setup time, our algorithm updates 9 upper bounds for benchmark instances.
5355: Explainable Graph Neural Networks via Structural Externalities
Authors: Lijun Wu, Dong Hao, Zhiyi Fan
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) have achieved outstanding performance across a wide range of graph-related tasks. However, their "black-box" nature poses significant challenges to their explainability, and existing methods often fail to effectively capture the intricate interaction patterns among nodes within the network. In this work, we propose a novel explainability framework, GraphEXT, which leverages cooperative game theory and the concept of social externalities. GraphEXT partitions graph nodes into coalitions, decomposing the original graph into independent subgraphs. By integrating graph structure as an externality and incorporating the Shapley value under externalities, GraphEXT quantifies node importance through their marginal contributions to GNN predictions as the nodes transition between coalitions. Unlike traditional Shapley value-based methods that primarily focus on node attributes, our GraphEXT places greater emphasis on the interactions among nodes and the impact of structural changes on GNN predictions. Experimental studies on both synthetic and real-world datasets show that GraphEXT outperforms existing baseline methods in terms of fidelity across diverse GNN architectures, significantly enhancing the explainability of GNN models.
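The marginal-contribution idea underlying Shapley-based attribution can be made concrete with a plain Monte Carlo estimator, sketched below. Note that this generic sketch ignores the structural externalities that distinguish GraphEXT; value_fn and the toy value function are our own stand-ins for a GNN prediction score on an induced subgraph.

```python
import random

def shapley_node_importance(nodes, value_fn, n_samples=200):
    """value_fn(subset) -> prediction score on the subgraph induced by `subset`."""
    phi = {v: 0.0 for v in nodes}
    for _ in range(n_samples):
        perm = nodes[:]
        random.shuffle(perm)
        coalition, prev = [], value_fn(frozenset())
        for v in perm:
            coalition.append(v)
            cur = value_fn(frozenset(coalition))
            phi[v] += cur - prev   # marginal contribution of v in this permutation
            prev = cur
    return {v: s / n_samples for v, s in phi.items()}

# toy value function: score grows with coalition size, plus a pairwise interaction bonus
toy_nodes = [0, 1, 2, 3]
bonus = {(0, 1): 0.5}
def toy_value(S):
    return 0.1 * len(S) + sum(b for (u, w), b in bonus.items() if u in S and w in S)

print(shapley_node_importance(toy_nodes, toy_value))  # nodes 0 and 1 score highest
```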
5383: App2Exa: Accelerating Exact kNN Search via Dynamic Cache-Guided Approximation
Authors: Ke Li, Leong Hou U, Shuo Shang
Location: Guangzhou | Day: TBD
Show Abstract
The k-nearest neighbor (kNN) query is a cornerstone of similarity-based applications across various domains. While prior work has enhanced kNN search efficiency, it typically focuses on approximate methods for high-dimensional data or exact methods for low-dimensional data, often assuming static query and data distributions. This creates a significant gap in accelerating exact kNN search for low-to-medium dimensional data with dynamic query distributions. To fill this gap, we propose App2Exa, a cache-guided framework that integrates approximate and exact kNN search. App2Exa utilizes a dynamically maintained cache graph index to retrieve approximate results, which subsequently guide exact search using a VP-Tree with a best-first strategy. A benefit-driven caching mechanism further optimizes performance by prioritizing vectors based on frequency, recency, and computational cost. Experimental results demonstrate that App2Exa significantly boosts efficiency, providing a robust and scalable solution for evolving query patterns and enabling exact kNN search to support higher dimensionality more effectively.
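The approximate-then-exact pattern can be illustrated in a few lines: an approximate candidate set seeds a pruning radius, which lets the exact pass discard points via a cheap lower bound before computing full distances. This sketch assumes at least k seed candidates and uses a simple one-coordinate bound in place of the paper's VP-Tree with best-first traversal; all details are our own, not App2Exa's.

```python
import heapq, math, random

def exact_knn_seeded(query, data, k, approx_candidates):
    # assumes len(approx_candidates) >= k
    cand = sorted((math.dist(query, data[i]), i) for i in approx_candidates)[:k]
    heap = [(-d, i) for d, i in cand]      # max-heap on distance via negation
    heapq.heapify(heap)
    radius = cand[-1][0]                   # k-th best approximate distance seeds the bound
    seen = {i for _, i in cand}
    for i, p in enumerate(data):
        if i in seen:
            continue
        if abs(p[0] - query[0]) > radius:  # cheap 1-D lower bound on Euclidean distance
            continue
        d = math.dist(query, p)
        if d < radius:
            heapq.heappushpop(heap, (-d, i))  # evict the current farthest neighbor
            radius = -heap[0][0]              # tighten the pruning radius
    return sorted((-nd, i) for nd, i in heap)

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(5000)]
print(exact_knn_seeded((0.5, 0.5), pts, k=3, approx_candidates=list(range(100))))
```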
5401: HPDM: A Hierarchical Popularity-aware Debiased Modeling Approach for Personalized News Recommender
Authors: Xiangfu He, Qiyao Peng, Minglai Shao, Hongtao Liu
Location: Guangzhou | Day: TBD
Show Abstract
News recommender systems face inherent challenges from popularity bias, where user interactions concentrate heavily on a small subset of popular news. While existing debiasing methods have made progress in recommendation, they often overlook two critical aspects: the different granularity of news popularity (across titles, categories, etc.) and how hierarchical popularity levels distinctly influence user interest modeling. Hence, in this paper, we propose a hierarchical causal debiasing framework that effectively captures genuine user interests while mitigating popularity bias at different granularity levels. Our framework incorporates two key components during training: (1) a hierarchical popularity-aware user modeling module to capture user interests by distinguishing popular and unpopular interactions at different granularity news content; and (2) a dual-view structure combining counterfactual reasoning for popular-view news with inverse propensity weighting for unpopular-view news to model user genuine interests. During inference, our framework removes popularity-induced effects to predict relatedness between user and candidate news. Extensive experiments on two widely-used datasets, MIND and Adressa, demonstrate that our framework significantly outperforms existing baseline approaches in addressing both the long-tail distribution challenge. Our code is available at \url{https://github.com/hexiangfu123/HPDM}.
5402: Community-Aware Graph Transformer for Brain Disorder Identification
Authors: Shengbing Pei, Jiajun Ma, Zhao Lv, Chao Zhang, Jihong Guan
Location: Guangzhou | Day: TBD
Show Abstract
Abnormal brain functional network is an effective biomarker for brain disease diagnosis. Most existing methods focus on mining discriminative information from whole-brain connectivity patterns. However, multi-level collaboration is the foundation of efficient brain function, in addition to the whole-brain network, there are multiple sub-networks that can quickly integrate and process specific cognitive functions, forming the modular community structure of the brain. To address this gap, we propose a novel method, community-aware graph Transformer (CAGT), that integrates the community information of sub-networks and the topological information of brain graph into the Transformer architecture for better brain disorder identification. CAGT enhances information exchange within and between functional communities through dual-scale feature fusion, capturing interactive information across various scales. Additionally, it incorporates prior knowledge to design brain region position encoding and guide the self-attention, thereby enhancing the spatial awareness of the Transformer and aligning it with the brain’s natural information transfer process. Experimental results indicate that our proposed method significantly improves performance on both large and small datasets, and can reliably capture the interactions between sub-networks, demonstrating its generalization and interpretability.
5405: Feint and Attack: Jailbreaking and Protecting LLMs via Attention Distribution Modeling
Authors: Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Zheng Liu, Lirong Qiu, Zaisheng Ye
Location: Guangzhou | Day: TBD
Show Abstract
Most jailbreak methods for large language models (LLMs) focus on superficially improving attack success through manually defined rules. However, they fail to uncover the underlying mechanisms within target LLMs that explain why an attack succeeds or fails. In this paper, we propose investigating the phenomenon of jailbreaks and defenses for LLMs from the perspective of attention distributions within the models. A preliminary experiment reveals that the success of a jailbreak is closely linked to the LLM’s attention on sensitive words.Inspired by this interesting finding, we propose incorporating critical signals derived from internal attention distributions within LLMs, namely Attention Intensity on Sensitive Words and Attention Dispersion Entropy, to guide both attacks and defenses. Drawing inspiration from the concept of "Feint and Attack", we introduce an attention-guided jailbreak model, ABA, which redirects the model’s attention to benign contexts, and an attention-based defense model, ABD, designed to detect attacks by analyzing internal attention entropy. Experimental results demonstrate the superiority of our proposal when compared to SOTA baselines.
5413: MGCA-Net: Multi-Graph Contextual Attention Network for Two-View Correspondence Learning
Authors: Shuyuan Lin, Mengtin Lo, Haosheng Chen, Yanjie Liang, Qiangqiang Wu
Location: Guangzhou | Day: TBD
Show Abstract
Two-view correspondence learning is a key task in computer vision, which aims to establish reliable matching relationships for applications such as camera pose estimation and 3D reconstruction. However, existing methods have limitations in local geometric modeling and cross-stage information optimization, which make it difficult to accurately capture the geometric constraints of matched pairs and thus reduce the robustness of the model. To address these challenges, we propose a Multi-Graph Contextual Attention Network (MGCA-Net), which consists of a Contextual Geometric Attention (CGA) module and a Cross-Stage Multi-Graph Consensus (CSMGC) module. Specifically, CGA dynamically integrates spatial position and feature information via an adaptive attention mechanism and enhances the capability to capture both local and global geometric relationships. Meanwhile, CSMGC establishes geometric consensus via a cross-stage sparse graph network, ensuring the consistency of geometric information across different stages. Experimental results on two representative YFCC100M and SUN3D datasets show that MGCA-Net significantly outperforms existing SOTA methods in the outlier rejection and camera pose estimation tasks. Source code is available at http://www.linshuyuan.com.
5414: Optimal Planning to Coordinate Science Data Collection and Downlink for a Constellation of Agile Satellites with Limited Storage
Authors: Richard Levinson, Vinay Ravindra, Sreeja Roy-Singh
Location: Guangzhou | Day: TBD
Show Abstract
We present a novel Mixed Integer Linear Program formulation that produces optimal plans for a constellation of remote sensing satellites. The generalized formulation is applied to an operational NASA constellation to improve wildfire danger prediction. The planner generates integrated data collection and downlink plans for multiple agile satellites with limited storage capacity, minimum energy requirements, and temporal constraints. Observation targets and modes are associated with science rewards. The planner maximizes the aggregate rewards collected for all observations on all satellites.

Our generalized model for integrated data collection and downlink uses a novel interval-based abstraction called Data Cycles, without time-indexed variables. Data Cycles organize the multitude of observation and downlink opportunities from 1-second granularity into sequences of data collection and downlink intervals. Experiments using large-scale real-world data yield optimal 24-hour plans for an eight-satellite constellation, which capture 99% of the ~23,000 available targets and 99.9% of the available science rewards.
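As a toy illustration of the kind of trade-off such a formulation encodes, the PuLP sketch below maximizes science reward subject to onboard storage, with a binary downlink decision freeing capacity. The variables, constants, and single constraint are our own drastic simplification, not the paper's Data Cycle model.

```python
import pulp

rewards  = {"obs1": 5, "obs2": 3, "obs3": 4}   # science reward per candidate observation
datasize = {"obs1": 2, "obs2": 1, "obs3": 2}   # GB generated per observation
storage_cap = 3                                 # GB of onboard storage
downlink_gb = 2                                 # GB one downlink pass can drain

m = pulp.LpProblem("toy_collect_and_downlink", pulp.LpMaximize)
x = {o: pulp.LpVariable(f"take_{o}", cat="Binary") for o in rewards}
d = pulp.LpVariable("use_downlink", cat="Binary")

m += pulp.lpSum(rewards[o] * x[o] for o in rewards)   # objective: total science reward
m += pulp.lpSum(datasize[o] * x[o] for o in rewards) <= storage_cap + downlink_gb * d

m.solve(pulp.PULP_CBC_CMD(msg=False))
print({o: int(x[o].value()) for o in rewards}, "downlink:", int(d.value()))
```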
5416: Consistency-Aware Padding for Incomplete Multi-Modal Alignment Clustering Based on Self-Repellent Greedy Anchor Search
Authors: Shubin Ma, Liang Zhao, Mingdong Lu, Yifan Guo, Bo Xu
Location: Guangzhou | Day: TBD
Show Abstract
Multi-modal representation is faithful and highly effective in describing real-world data samples’ characteristics by capturing their complementary information. However, collected data often exhibit incompleteness and misalignment due to factors such as inconsistent sensor frequencies and device malfunctions. Existing research has not effectively addressed the problem of filling in missing data when multi-view data are both imbalanced and misaligned; instead, it relies on class-level alignment of the available data, leaving some data samples poorly matched and thereby degrading the quality of data fusion. In this paper, we propose Consistency-Aware Padding for Incomplete Multi-Modal Alignment Clustering Based on Self-Repellent Greedy Anchor Search (CAPIMAC) to tackle the problem of filling imbalanced and misaligned data in multi-modal datasets. Specifically, we propose a self-repellent greedy anchor search module (SRGASM), which employs a self-repellent random walk combined with a greedy algorithm to identify anchor points for re-representing incomplete and misaligned multi-modal data. Subsequently, based on noise-contrastive learning, we design a consistency-aware padding module (CAPM) to effectively interpolate and align imbalanced and misaligned data, thereby improving the quality of multi-modal data fusion. Experimental results demonstrate the superiority of our method on benchmark datasets. The code will be publicly released at https://github.com/bestow09090/-CAPIMAC.git.
5429: Electron Density-enhanced Molecular Geometry Learning
Authors: Hongxin Xiang, Jun Xia, Xin Jin, Wenjie Du, Li Zeng, Xiangxiang Zeng
Location: Guangzhou | Day: TBD
Show Abstract
Electron density (ED), which describes the probability distribution of electrons in space, is crucial for accurately understanding the energy and force distribution in molecular force fields (MFF).
Existing machine learning force fields (MLFF) focus on mining appropriate physical quantities from the atom-level conformation to enhance the molecular geometry representation while ignoring the unique information from microscopic electrons. In this work, we propose an efficient Electronic Density representation framework to enhance molecular Geometric learning (called EDG), which leverages images rendered from ED to boost molecular geometric representations in MLFF. Specifically, we construct a novel image-based ED representation, which consists of 2 million 6-view images with RGB-D channels, and design an ED representation learning model, called ImageED, to learn ED-related knowledge from these images. We further propose an efficient ED-aware teacher and introduce a cross-modal distillation strategy to transfer knowledge from the image-based teacher to the geometry-based students. Extensive experiments on QM9 and rMD17 demonstrate that EDG can be directly integrated into existing geometry-based models and significantly improves the capabilities of these models (e.g., SchNet, EGNN, SphereNet, ViSNet) for geometry representation learning in MLFF with a maximum average performance increase of 33.7%. Code and appendix are available at https://github.com/HongxinXiang/EDG.
5440: Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval
Authors: Weijia Liu, Jiuxin Cao, Bo Miao, Zhiheng Fu, Xuelin Zhu, Jiawei Ge, Bo Liu, Mehwish Nasim, Ajmal Mian
Location: Guangzhou | Day: TBD
Show Abstract
Current text-driven Video Moment Retrieval (VMR) methods encode all video clips, including irrelevant ones, disrupting multimodal alignment and hindering optimization. To this end, we propose a denoise-then-retrieve paradigm that explicitly filters text-irrelevant clips from videos and then retrieves the target moment using purified multimodal representations. Following this paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD integrates cross-attention and structured state space blocks to dynamically identify noisy clips and produce a noise mask to purify multimodal video representations. TRF further distills a single query embedding from purified video representations and aligns it with the text embedding, serving as auxiliary supervision for denoising during training. Finally, we perform conditional retrieval using text embeddings on purified video representations for accurate VMR. Experiments on Charades-STA and QVHighlights demonstrate that our approach surpasses state-of-the-art methods on all metrics. Furthermore, our denoise-then-retrieve paradigm is adaptable and can be seamlessly integrated into advanced VMR models to boost performance.
5449: BridgeVoC: Neural Vocoder with Schrödinger Bridge
Authors: Tong Lei, Zhiyu Zhang, Rilin Chen, Meng Yu, Jing Lu, Chengshi Zheng, Dong Yu, Andong Li
Location: Guangzhou | Day: TBD
Show Abstract
While previous diffusion-based neural vocoders typically follow a noise-to-data generation pipeline, the linear-degradation prior of the mel-spectrogram is often neglected, resulting in limited generation quality. By revisiting the vocoding task and excavating its connection with the signal restoration task, this paper proposes a time-frequency (T-F) domain-based neural vocoder with the Schrödinger Bridge, called BridgeVoC, which is the first to follow the data-to-data generation paradigm. Specifically, the mel-spectrogram can be projected into the target linear-scale domain and regarded as a degraded spectral representation with a deficient rank distribution. Based on this, the Schrödinger Bridge is leveraged to establish a connection between the degraded and target data distributions. During the inference stage, starting from the degraded representation, the target spectrum can be gradually restored rather than generated from a Gaussian noise process. Quantitative experiments on LJSpeech and LibriTTS show that BridgeVoC achieves faster inference and surpasses existing diffusion-based vocoder baselines, while also matching or exceeding non-diffusion state-of-the-art methods across evaluation metrics.
5463: Test-Time Adaptation on Recommender System with Data-Centric Graph Transformation
Authors: Yating Liu, Xin Zheng, Yi Li, Yanqing Guo
Location: Guangzhou | Day: TBD
Show Abstract
Distribution shifts in recommender systems between training and testing in user-item interactions lead to inaccurate recommendations. Despite the promising performance of test-time adaptation technology in various domains, it still faces challenges in recommender systems due to the impracticality of fine-tuning models and the infeasibility of obtaining test-time labels. To address these challenges, we first propose a Test-Time Adaptation framework for Graph-based Recommender system, named TTA-GREC, to dynamically adapt user-item graphs at test time in a data-centric way, handling distribution shifts effectively. Specifically, our TTA-GREC targets KG-enhanced GNN-based recommender systems with three core components: (1) Pseudo-label guided UI graph transformation for adaptive improvement; (2) Rationale score guided KG graph revision for semantic enhancement; and (3) Sampling-based self-supervised adaptation for contrastive learning. Experiments demonstrate TTA-GREC’s superiority at test time and provide new data-centric insights on test-time adaptation for better recommender system inference.
5470: Predicting Spectral Information for Self-Supervised Signal Classification
Authors: Yi Xu, Shuang Wang, Hantong Xing, Chenxu Wang, Dou Quan, Rui Yang, Dong Zhao, Luyang Mei
Location: Guangzhou | Day: TBD
Show Abstract
Deep learning methods have demonstrated remarkable performance across various communication signal processing tasks. However, most signal classification methods require a substantial amount of labeled samples for training, posing significant challenges in the field of communication signals, as labeling necessitates expert knowledge. This paper proposes a novel self-supervised signal classification method called Spectral-Guided Self-Supervised Signal Classification (SGSSC). Specifically, to leverage frequency-domain information with modulation semantics as prior knowledge for the model, we design a previously unexplored pretext task tailored to the format of signal data. This task involves predicting spectral information from masked time-domain signals, enabling the model to learn implicit signal features through cross-domain pattern transformation. Furthermore, the pretext task in the SGSSC method is relevant to the downstream classification task, and using traditional fine-tuning strategies on the downstream task may lead to the loss of certain features associated with the pretext task. Therefore, we propose an attention mechanism-based fine-tuning strategy that adaptively integrates pre-trained features from different levels. Extensive experimental results validate the superiority of the SGSSC method. For instance, when the proportion of labeled samples is only 0.5%, our method achieves an average improvement of 2.3% in downstream classification tasks compared to the best-performing self-supervised training strategies.
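A compact version of such a spectral-prediction pretext task might look as follows: mask part of a time-domain signal and train an encoder to predict the spectral magnitude of the unmasked original. This is our own sketch of the general idea, not the SGSSC architecture; the encoder, masking ratio, and loss are assumptions.

```python
import torch
import torch.nn as nn

class SpectralPretext(nn.Module):
    def __init__(self, sig_len=256, hidden=128):
        super().__init__()
        # toy encoder: maps a masked time-domain signal to spectral magnitude bins
        self.encoder = nn.Sequential(nn.Linear(sig_len, hidden), nn.ReLU(),
                                     nn.Linear(hidden, sig_len // 2 + 1))

    def forward(self, signal, mask):
        target = torch.fft.rfft(signal).abs()   # spectral magnitude of the clean signal
        pred = self.encoder(signal * mask)      # encoder only sees the masked signal
        return nn.functional.mse_loss(pred, target)

model = SpectralPretext()
sig = torch.randn(4, 256)
mask = (torch.rand(4, 256) > 0.25).float()      # hide 25% of the time samples
loss = model(sig, mask)
loss.backward()
print(float(loss))
```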
5476: HLMTrans: A Sim-to-Real Transfer Framework for Spatial Crowdsourcing with Human-Guided Language Models
Authors: Qingshun Wu, Yafei Li, Lulu Li, Yuanyuan Jin, Shuo He, Mingliang Xu
Location: Guangzhou | Day: TBD
Show Abstract
Reinforcement Learning (RL), trained via trial and error in simulators, has been proven to be an effective approach for addressing task assignment problems in spatial crowdsourcing. However, a performance gap still exists when transferring simulator-trained RL Models (RLMs) to real-world settings due to the misalignment of travel time. Existing works mostly focus on using data-driven and learning-based methods to predict travel time; unfortunately, these approaches are limited in achieving accurate predictions, as they require a large amount of real-world data covering the entire state distribution. In this paper, we propose a Sim-to-Real Transfer with Human-guided Language Models framework called HLMTrans, which comprises three core modules: RLMs decision for task assignment, sim-to-real transfer with Large Language Models (LLMs), and preference learning from human feedback. HLMTrans first leverages the zero-shot chain-of-thought reasoning capability of LLMs to estimate travel time by capturing real-world dynamics. This estimation is then input as domain knowledge into the forward model of Grounded Action Transformation (GAT) to enhance the action transformation of RLMs. Further, we design a human preference learning mechanism to fine-tune LLMs, improving their generation quality and enabling RLMs to learn a more realistic policy. We evaluate the proposed HLMTrans on two real-world datasets, and the experimental results demonstrate that HLMTrans outperforms the SOTA methods in terms of effectiveness and efficiency.
5477: DIIN: Diffusion Iterative Implicit Networks for Arbitrary-scale Super-resolution
Authors: Tao Dai, Song Wang, Hang Guo, Jianping Wang, Zexuan Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Implicit neural representation (INR) aims to represent continuous domain signals via implicit neural functions and has achieved great success in arbitrary-scale image super-resolution (SR). However, most existing INR-based SR methods focus on learning implicit features from independent coordinates while neglecting interactions among neighboring coordinates, resulting in limited contextual awareness. In this paper, rethinking the forward process of implicit neural functions as a signal diffusion process, we propose a novel Diffusion Iterative Implicit Network (DIIN) for arbitrary-scale SR that promotes global signal flow with neighborhood interactions. The DIIN framework mainly consists of stacked Diffusion Iteration Layers with a dictionary cross-attention block to enrich the iterative update process with supplementary information. Besides, we develop a Position-Aware Embedding Block to strengthen spatial dependencies between consecutive input samples. Extensive experiments on public datasets demonstrate that our method achieves state-of-the-art or competitive performance, highlighting its effectiveness and efficiency for arbitrary-scale SR. Our code is available at https://github.com/Song-1205/DIIN.
5482: Preference Identification by Interaction Overlap for Bundle Recommendation
Authors: Fei-Yao Liang, Wu-Dong Xi, Xing-Xing Xing, Wei Wan, Chang-Dong Wang, Hui-Yu Zhou
Location: Guangzhou | Day: TBD
Show Abstract
In the digital age, recommendation systems are crucial for enhancing user experiences, with bundle recommendations playing a key role by integrating complementary products. However, existing methods fail to accurately identify user preferences for specific items within bundles, making it difficult to design bundles containing more items of interest to users. Additionally, these methods do not leverage similar preferences among users of the same category, resulting in unstable and incomplete preference expressions. To address these issues, we propose Preference Identification by Interaction Overlap for Bundle Recommendation (PIIO). The data augmentation module analyzes the overlap between bundle-item inclusions and user-item interactions to calculate the interaction probability of non-interacted bundles, selecting the bundle with the highest probability as a positive sample to enrich user-bundle interactions and uncover user preferences for items within bundles. The preference aggregation module utilizes the overlap in user-item interactions to select similar users, aggregates preferences using an autoencoder, and constructs comprehensive preference profiles. The optimization module predicts user-bundle matching scores based on a user interest boundary loss function. The proposed PIIO model is applied to two bundle recommendation datasets, and experiments demonstrate the effectiveness of the PIIO model, surpassing state-of-the-art models.
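One simple way to realize the overlap idea is sketched below: score each non-interacted bundle by the fraction of its items the user has already interacted with, and take the top scorer as a positive sample. The scoring formula is our reading of the general mechanism, not PIIO's exact computation, and all names are hypothetical.

```python
def bundle_overlap_scores(user_items, bundles, interacted_bundles):
    # user_items: set of item ids the user has interacted with
    # bundles: {bundle_id: set of item ids included in the bundle}
    scores = {}
    for b, items in bundles.items():
        if b in interacted_bundles or not items:
            continue  # only score bundles the user has not interacted with
        scores[b] = len(user_items & items) / len(items)
    return scores

user_items = {1, 2, 5}
bundles = {"A": {1, 2, 3}, "B": {4, 6}, "C": {2, 5}}
scores = bundle_overlap_scores(user_items, bundles, interacted_bundles={"A"})
print(max(scores, key=scores.get), scores)  # "C": full overlap, chosen as positive sample
```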
5484: Adaptive Graph Unlearning
Authors: Pengfei Ding, Yan Wang, Guanfeng Liu, Jiajie Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Graph unlearning, which deletes graph elements such as nodes and edges from trained graph neural networks (GNNs), is crucial for real-world applications where graph data may contain outdated, inaccurate, or privacy-sensitive information. However, existing methods often suffer from (1) incomplete or excessive unlearning due to neglecting the distinct objectives of different unlearning tasks, and (2) inaccurate identification of neighbors affected by deleted elements across various GNN architectures. To address these limitations, we propose AGU, a novel Adaptive Graph Unlearning framework that flexibly adapts to diverse unlearning tasks and GNN architectures. AGU ensures the complete forgetting of deleted elements while preserving the integrity of the remaining graph. It also accurately identifies affected neighbors for each GNN architecture and prioritizes important ones to enhance unlearning performance. Extensive experiments on seven real-world graphs demonstrate that AGU outperforms existing methods in terms of effectiveness, efficiency, and unlearning capability.
5494: Fully Test-Time Adaptation for Feature Decrement in Tabular Data
Authors: Zi-Jian Cheng, Zi-Yi Jia, Kun-Yang Yu, Zhi Zhou, Lan-Zhe Guo
Location: Guangzhou | Day: TBD
Show Abstract
Tabular data is widely adopted in various machine learning tasks. Current tabular data learning mainly focuses on closed environments, while in real-world applications, open environments are often encountered, where distribution shifts and feature decrements occur, leading to severe performance degradation. Previous studies have primarily focused on addressing distribution shifts, while feature decrements, a unique challenge in tabular data learning, have received relatively little attention. In this paper, we present the first comprehensive study on the problem of Fully Test-Time Adaptation for Feature Decrement in Tabular Data. Through empirical analysis, we identify the suboptimality of existing missing-feature imputation methods and the limited applicability of missing-feature adaptation approaches. To address these challenges, we propose a novel method, LLM-IMPUTE, which leverages Large Language Models (LLMs) to impute missing features without relying on training data. Furthermore, we introduce Augmented-Training LLM (ATLLM), a method designed to enhance robustness to feature decrements by simulating feature-decrement scenarios during the training phase, addressing tasks that cannot be imputed by LLM-IMPUTE. Extensive experimental results demonstrate that our proposal significantly improves both performance and robustness in missing feature imputation and adaptation scenarios.
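A prompt-construction sketch in the spirit of training-data-free LLM imputation is shown below; the prompt format and record fields are our own assumptions, and query_llm is a stand-in for whatever LLM client is available, not part of the paper's code.

```python
def build_imputation_prompt(row: dict, missing_feature: str, schema: dict) -> str:
    # serialize the observed features of one record into a natural-language query
    observed = "\n".join(f"- {k}: {v}" for k, v in row.items() if k != missing_feature)
    return (
        "You are given one record from a tabular dataset.\n"
        f"Feature descriptions: {schema}\n"
        f"Observed values:\n{observed}\n"
        f"Infer the most plausible value of '{missing_feature}'. "
        "Answer with the value only."
    )

row = {"age": 52, "resting_bp": 140, "cholesterol": None}
schema = {"age": "years", "resting_bp": "mm Hg", "cholesterol": "mg/dl"}
prompt = build_imputation_prompt(row, "cholesterol", schema)
print(prompt)
# answer = query_llm(prompt)  # plug in any LLM client here (hypothetical call)
```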
5495: HGMP: Heterogeneous Graph Multi-Task Prompt Learning
Authors: Pengfei Jiao, Jialong Ni, Di Jin, Xuan Guo, Huan Liu, Hongjiang Chen, Yanxian Bi
Location: Guangzhou | Day: TBD
Show Abstract
The pre-training and fine-tuning methods have gained widespread attention in the field of heterogeneous graph neural networks due to their ability to leverage large amounts of unlabeled data during the pre-training phase, allowing the model to learn rich structural features. However, these methods face the issue of a mismatch between the pre-trained model and downstream tasks, leading to suboptimal performance in certain application scenarios. Prompt learning methods have emerged as a new direction in heterogeneous graph tasks, as they allow flexible adaptation of task representations to address target inconsistency. Building on this idea, this paper proposes a novel multi-task prompt framework for the heterogeneous graph domain, named HGMP. First, to bridge the gap between the pre-trained model and downstream tasks, we reformulate all downstream tasks into a unified graph-level task format. Next, we address the limitations of existing graph prompt learning methods, which struggle to integrate contrastive pre-training strategies in the heterogeneous graph domain. We design a graph-level contrastive pre-training strategy to better leverage heterogeneous information and enhance performance in multi-task scenarios. Finally, we introduce heterogeneous feature prompts, which enhance model performance by refining the representation of input graph features. Experimental results on public datasets show that our proposed method adapts well to various tasks and significantly outperforms baseline methods.
5522: Integration of Old and New Knowledge for Generalized Intent Discovery: A Consistency-driven Prototype-Prompting Framework
Authors: Xiao Wei, Xiaobao Wang, Ning Zhuang, Chenyang Wang, Longbiao Wang, Jianwu Dang
Location: Guangzhou | Day: TBD
Show Abstract
Intent detection aims to identify user intents from natural language inputs, where supervised methods rely heavily on labeled in-domain (IND) data and struggle with out-of-domain (OOD) intents, limiting their practical applicability. Generalized Intent Discovery (GID) addresses this by leveraging unlabeled OOD data to discover new intents without additional annotation. However, existing methods focus solely on clustering unsupervised data while neglecting domain adaptation. Therefore, we propose a consistency-driven prototype-prompting framework for GID from the perspective of integrating old and new knowledge, which includes a prototype-prompting framework for transferring old knowledge from external sources, and a hierarchical consistency constraint for learning new knowledge from target domains. We conducted extensive experiments, and the results show that our method significantly outperforms all baseline methods, achieving state-of-the-art results, which strongly demonstrates the effectiveness and generalization of our method. Our source code is publicly available at https://github.com/smileix/cpp.
5529: Odyssey: Empowering Minecraft Agents with Open-World Skills
Authors: Shunyu Liu, Yaoru Li, Kongcheng Zhang, Zhenyu Cui, Wenkai Fang, Yuxuan Zheng, Tongya Zheng, Mingli Song
Location: Guangzhou | Day: TBD
Show Abstract
Recent studies have delved into constructing generalist agents for open-world environments like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic programmatic tasks, e.g., material collection and tool-crafting following the Minecraft tech-tree, treating the ObtainDiamond task as the ultimate goal. This limitation stems from the narrowly defined set of actions available to agents, requiring them to learn effective long-horizon strategies from scratch. Consequently, discovering diverse gameplay opportunities in the open world becomes challenging. In this work, we introduce Odyssey, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world. Odyssey comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills. (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki. (3) A new agent capability benchmark includes the long-term planning task, the dynamic-immediate planning task, and the autonomous exploration task. Extensive experiments demonstrate that the proposed Odyssey framework can effectively evaluate different capabilities of LLM-based agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions.
5544: Preference-based Deep Reinforcement Learning for Historical Route Estimation
Authors: Boshen Pan, Yaoxin Wu, Zhiguang Cao, Yaqing Hou, Guangyu Zou, Qiang Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Recent Deep Reinforcement Learning (DRL) techniques have advanced solutions to Vehicle Routing Problems (VRPs). However, many of these methods focus exclusively on optimizing distance-oriented objectives (i.e., minimizing route length), often overlooking the implicit drivers’ preferences for routes. These preferences, which are crucial in practice, are challenging to model using traditional DRL approaches. To address this gap, we propose a preference-based DRL method characterized by its reward design and optimization objective, which is specialized to learn historical route preferences. Our experiments demonstrate that the method aligns generated solutions more closely with human preferences. Moreover, it exhibits strong generalization performance across a variety of instances, offering a robust solution for different VRP scenarios.
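One plausible reward design for such preference learning, sketched below, blends the usual negative route length with an overlap bonus for reusing edges observed in drivers' historical routes. This is an illustrative assumption on our part, not the paper's exact reward; the blending weight alpha and all names are ours.

```python
import math

def route_length(route, coords):
    # total Euclidean length of a route given as a node sequence
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(route, route[1:]))

def preference_reward(route, coords, historical_edges, alpha=0.5):
    # alpha trades off distance optimality against agreement with historical behavior
    edges = set(zip(route, route[1:]))
    overlap = len(edges & historical_edges) / max(len(edges), 1)
    return -(1 - alpha) * route_length(route, coords) + alpha * overlap

coords = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (0, 1)}
hist = {(0, 1), (1, 2)}          # edges drivers historically preferred
print(preference_reward([0, 1, 2, 3, 0], coords, hist, alpha=0.3))
```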
5554: Wrapped Partial Label Dimensionality Reduction via Dependence Maximization
Authors: Xiang-Ru Yu, Deng-Bao Wang, Min-Ling Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Partial label learning induces a classifier from data with ambiguous supervision, where each instance is associated with a set of candidate labels, only one of which is valid. As a classic data preprocessing strategy, dimensionality reduction helps enhance the generalization capabilities of learning algorithms. Due to the ambiguity of supervision, existing works on partial label dimensionality reduction are confined to two separate stages: dimensionality reduction and partial label disambiguation. However, decoupling dimensionality reduction from partial label disambiguation can lead to severe performance degradation. In this paper, we present a novel approach called Wrapped Partial Label Dimensionality Reduction (WPLDR) to address this challenge. Specifically, WPLDR integrates dimensionality reduction and partial label disambiguation within a unified framework, employing alternating optimization to perform both concurrently. WPLDR maximizes the interdependence between features in the embedded space and confidence-based label information, while simultaneously ensuring manifold consistency between the embedded feature space and the label space. Extensive experiments over a broad range of synthetic and real-world partial label data sets validate that the performance of well-established partial label learning algorithms can be significantly improved by the proposed WPLDR.
5584: Instance Relation Learning Network with Label Knowledge Propagation for Few-shot Multi-label Intent Detection
Authors: Shiman Zhao, Shangyuan Li, Wei Chen, Tengjiao Wang, Jiahui Yao, Jiabin Zheng, Kam-Fai Wong
Location: Guangzhou | Day: TBD
Show Abstract
Few-shot Multi-label Intent Detection (MID) is crucial for dialogue systems, aiming to detect multiple intents of utterances in low-resource dialogue domains. Previous studies focus on a two-stage pipeline. They first learn representations of utterances with multiple labels and then use a threshold-based strategy to identify multi-label results. However, these methods rely on representation classification and ignore instance relations, leading to error propagation. To solve the above issues, we propose a multi-label joint learning method for few-shot MID in an end-to-end manner, which constructs an instance relation learning network with label knowledge propagation to eliminate error propagation. Concretely, we learn the interaction relations between instances with class information to propagate label knowledge between a few labeled (support set) and unlabeled (query set) instances. With label knowledge propagation, the relation strength between instances directly indicates whether two utterances belong to the same intent for multi-label prediction. Besides, a dual relation-enhanced loss is developed to optimize support- and query-level relation strength to improve performance. Experiments show that we outperform strong baselines by an average of 9.54% AUC and 11.19% Macro-F1 in 1-shot scenarios.
5589: Guiding Large Language Models in Modeling Optimization Problems via Question Partitioning
Authors: Xiaotian Pan, Junhao Fang, Feng Wu, Sijia Zhang, Yi-Xiang Hu, Shaoang Li, Xiang-Yang Li
Location: Guangzhou | Day: TBD
Show Abstract
Optimization problems are ubiquitous across various domains, such as resource scheduling, production planning, and sales management. Traditionally, they are modeled manually, leading to inefficiencies due to difficulties in communication and collaboration between modeling and domain experts. The emergence of Large Language Models (LLMs) has made automated modeling possible. However, real-world applications are often large-scale and have numerous variables and constraints, limiting the applicability of existing methods. To address this, we propose PaMOP, a novel LLM-based modeling framework that models optimization problems automatically, given only natural language descriptions. Specifically, we extract and partition the problems using a tree structure, guiding the LLMs to model each set of constraints with self-augmented prompts, thus reducing the demands on the LLM’s capacity to process large content. The mathematical model is then iteratively corrected and validated through our correction procedures. The experiments demonstrate that our method improves performance on the common benchmark dataset NLP4LP, achieving an accuracy of 62.3% and a code executability rate of 86.8% when tested on GPT-4. Additionally, we demonstrate the effectiveness of PaMOP in handling large real-world problems.
5600: A Weighted-Based Fast Local Search for α-Neighbor p-Center Problem
Authors: Qingyun Zhang, Zhipeng Lü, Junwen Ding, Zhouxing Su
Location: Guangzhou | Day: TBD
Show Abstract
The α-neighbor p-center problem (α-pCP) is an extension of the classical p-center problem. It aims to select p centers from a set of candidate centers to minimize the maximum distance between any client and its α service centers. In this paper, we propose a weighting-based fast local search algorithm called WFLS for solving α-pCP. First, WFLS converts the complex α-pCP into a series of decision subproblems by specifying the service radius, effectively mitigating the gradient vanishing issue during the search process, and introduces a new MIP model. Then, it addresses the simplified subproblems using a fast local search procedure with a swap-based neighborhood structure. WFLS adopts an efficient weighting strategy, an incremental evaluation technique, a fine-grained penalty-based neighborhood evaluation, and two scoring functions for neighborhood evaluation to accelerate and guide the search process. Computational experiments on 154 widely used public benchmark instances demonstrate that WFLS outperforms the state-of-the-art methods in the literature. Specifically, WFLS improves 69 previous best known results and matches the best known results for all the remaining instances in less time than other competitors.
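To fix ideas about the swap neighborhood, here is a stripped-down local search for the α-neighbor p-center objective. It deliberately omits the weighting strategy, incremental evaluation, and radius-fixed decision subproblems that make WFLS fast; all code and names are our own toy construction.

```python
import math, random

def pcenter_cost(centers, clients, alpha=1):
    # max over clients of the distance to their alpha-th nearest open center
    return max(sorted(math.dist(c, s) for s in centers)[alpha - 1] for c in clients)

def swap_local_search(candidates, clients, p, alpha=1, iters=2000, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(candidates, p)
    best = pcenter_cost(centers, clients, alpha)
    for _ in range(iters):
        out_c = rng.choice(centers)                               # center to close
        in_c = rng.choice([c for c in candidates if c not in centers])  # center to open
        trial = [in_c if c == out_c else c for c in centers]
        cost = pcenter_cost(trial, clients, alpha)
        if cost < best:                                           # accept improving swaps only
            centers, best = trial, cost
    return centers, best

r = random.Random(42)
pts = [(r.random(), r.random()) for _ in range(60)]
print(round(swap_local_search(pts, pts, p=4, alpha=2)[1], 4))
```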
5619: Negative Metric Learning for Graphs
Authors: Yiyang Zhao, Chengpei Wu, Lilin Zhang, Ning Yang
Location: Guangzhou | Day: TBD
Show Abstract
Graph contrastive learning (GCL) often suffers from false negatives, which degrades performance on downstream tasks. Existing methods addressing the false negative issue usually rely on human prior knowledge, still leading GCL to suboptimal results. In this paper, we propose a novel Negative Metric Learning (NML) enhanced GCL (NML-GCL). NML-GCL employs a learnable Negative Metric Network (NMN) to build a negative metric space, in which false negatives can be better distinguished from true negatives based on their distances to the anchor node. To overcome the lack of explicit supervision signals for NML, we propose a joint training scheme with a bi-level optimization objective, which implicitly utilizes self-supervision signals to iteratively optimize the encoder and the negative metric network. Solid theoretical analysis and extensive experiments conducted on widely used benchmarks verify the superiority of the proposed method.
5641: FedAPA: Server-side Gradient-Based Adaptive Personalized Aggregation for Federated Learning on Heterogeneous Data
Authors: Yuxia Sun, Aoxiang Sun, Siyi Pan, Zhixiao Fu, Jingcai Guo
Location: Guangzhou | Day: TBD
Show Abstract
Personalized federated learning (PFL) tailors models to clients’ unique data distributions while preserving privacy. However, existing aggregation-weight-based PFL methods often struggle with heterogeneous data, facing challenges in accuracy, computational efficiency, and communication overhead. We propose FedAPA, a novel PFL method featuring a server-side, gradient-based adaptive aggregation strategy that generates personalized models by updating aggregation weights in a centralized manner, based on gradients of client-parameter changes with respect to the aggregation weights. FedAPA guarantees theoretical convergence and achieves superior accuracy and computational efficiency compared to 10 PFL competitors across three datasets, with competitive communication overhead. The code and full proofs are available at: https://github.com/Yuxia-Sun/FL_FedAPA.
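A schematic of server-side gradient-based weight adaptation is given below: each client's personalized model is a weighted average of client parameters, and the weights follow the exact gradient of one plausible surrogate objective (the distance between the personalized model and the client's own local update). This surrogate, the learning rate, and all names are our assumptions, not FedAPA's actual update rule.

```python
import numpy as np

def update_weights(W, client_params, local_params, lr=0.05):
    # W: (n, n) aggregation weights; row i builds client i's personalized model
    # client_params: (n, d) parameters clients currently share with the server
    # local_params:  (n, d) parameters after each client's latest local training
    personalized = W @ client_params        # theta_i = sum_j W[i, j] * phi_j
    resid = personalized - local_params     # pull theta_i toward client i's local update
    grad = 2 * resid @ client_params.T      # gradient of ||theta_i - psi_i||^2 w.r.t. row i
    W = np.clip(W - lr * grad, 0.0, None)   # keep weights non-negative
    return W / (W.sum(axis=1, keepdims=True) + 1e-12)   # renormalize rows to sum to one

rng = np.random.default_rng(0)
n, d = 4, 8
W = np.full((n, n), 1 / n)
shared, local = rng.normal(size=(n, d)), rng.normal(size=(n, d))
print(update_weights(W, shared, local).round(3))
```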
5642: Self-supervised End-to-end ToF Imaging Based on RGB-D Cross-modal Dependency
Authors: Weihang Wang, Jun Wang, Fei Wen
Location: Guangzhou | Day: TBD
Show Abstract
Time-of-Flight (ToF) imaging systems are susceptible to various types of noise and degradation, which can severely affect image quality. Traditional sequential imaging pipelines often suffer from error accumulation due to separate multi-stage processing, while existing end-to-end methods typically rely on noisy-clean depth image pairs for supervised learning. However, acquiring ground truth is challenging in real-world scenarios due to factors such as Multi-Path Interference (MPI), phase wrapping, and complex noise patterns. In this paper, we propose a self-supervised learning framework for end-to-end ToF imaging, which does not require any noisy-clean pairs yet generalizes well across various off-the-shelf cameras. Our framework leverages the cross-modal dependencies between RGB and depth data as implicit supervision to effectively suppress noise and maintain image fidelity. Additionally, the loss function integrates the statistical characteristics of the raw measurement data, enhancing robustness against noise and artifacts. Extensive experiments on both synthetic and real-world data demonstrate that our approach achieves performance comparable to supervised methods, without requiring paired noisy-clean data for training. Furthermore, our method consistently delivers strong performance across all evaluated cameras, highlighting its generalization capability. The code is available at https://github.com/WeihangWANG/RGBD_imaging.
5644: D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning
Authors: Jia Zhang, Chen-Xi Zhang, Yao Liu, Yi-Xuan Jin, Xiao-Wen Yang, Bo Zheng, Yi Liu, Lan-Zhe Guo
Location: Guangzhou | Day: TBD
Show Abstract
Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can significantly equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying valuable subsets from large datasets to boost both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability, and then propose the D3 method comprising two key steps of scoring and selection. Specifically, in the scoring step, we define the diversity function to measure sample distinctiveness and introduce the uncertainty-based prediction difficulty to evaluate sample difficulty by mitigating the interference of context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate multiple rounds, incorporating feedback to refine the selection focus adaptively. Experiments on both public datasets and the real-world Taobao Live application demonstrate the effectiveness of D3 in endowing LLMs with competitive or even superior instruction-following capabilities using less than 10% of the entire dataset.
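A toy greedy selector combining the three criteria might look like the following; the feature choices, the linear weighting, and the nearest-neighbor diversity term are our own simplification of the D3 weighted coreset objective, not the paper's formulation.

```python
import numpy as np

def d3_select(emb, difficulty, dependability, k, w=(1.0, 1.0, 1.0)):
    # emb: (n, d) sample embeddings; difficulty, dependability: (n,) scores in [0, 1]
    n = len(emb)
    selected = []
    for _ in range(k):
        if selected:
            # diversity: distance to the nearest already-selected sample
            div = np.min(np.linalg.norm(emb[:, None] - emb[selected][None], axis=-1), axis=1)
        else:
            div = np.ones(n)
        score = w[0] * div + w[1] * difficulty + w[2] * dependability
        score[selected] = -np.inf            # never re-pick a sample
        selected.append(int(np.argmax(score)))
    return selected

rng = np.random.default_rng(1)
emb = rng.normal(size=(100, 16))
print(d3_select(emb, rng.random(100), rng.random(100), k=5))
```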
5647: LiBOG: Lifelong Learning for Black-Box Optimizer Generation
Authors: Jiyuan Pei, Yi Mei, Jialin Liu, Mengjie Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Meta-Black-Box Optimization (MetaBBO) garners attention due to its success in automating the configuration and generation of black-box optimizers, significantly reducing the human effort required for optimizer design and discovering optimizers with higher performance than classic human-designed optimizers. However, existing MetaBBO methods conduct one-off training under the assumption that a stationary problem distribution with extensive and representative training problem samples is pre-available. This assumption is often impractical in real-world scenarios, where diverse problems following shifting distributions continually arise. Consequently, there is a pressing need for methods that can continuously learn from new problems encountered on-the-fly and progressively enhance their capabilities. In this work, we explore a novel paradigm of lifelong learning in MetaBBO and introduce LiBOG, a novel approach designed to learn from sequentially encountered problems and generate high-performance optimizers for Black-Box Optimization (BBO). LiBOG consolidates knowledge both across tasks and within tasks to mitigate catastrophic forgetting. Extensive experiments demonstrate LiBOG’s effectiveness in learning to generate high-performance optimizers in a lifelong learning manner, addressing catastrophic forgetting while maintaining plasticity to learn new tasks.
5650: SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation
Authors: Bin Xu, Yiguan Lin, Yinghao Li, Yang Gao
Location: Guangzhou | Day: TBD
Show Abstract
Large language models exhibit remarkable performance in simple code generation tasks. However, they encounter significant challenges when addressing complex problems that require reasoning and question decomposition. To tackle this, we propose a self-driven reasoning augmentation process, SRA-MCTS, which incorporates Monte Carlo Tree Search (MCTS) for reasoning data generation.
SRA-MCTS enables LLMs to self-generate intermediate reasoning steps and perform iterative self-evaluation, facilitating self-improvement. Specifically, it utilizes MCTS to produce diverse intermediate reasoning steps. During each iteration, MCTS generates a step and employs self-evaluation to guide the selection of subsequent branches, ultimately forming a sufficiently diverse reasoning path referred to as “thinking”. This thinking guides the model in generating corresponding code, and both are combined as training data for supervised fine-tuning.
Experimental results demonstrate that SRA-MCTS achieves consistent performance improvements across three model scales without additional supervisory assistance. Applied to the Meta-Llama-3.1-8B-Instruct model, it delivers an 11-point improvement on the MBPP-Complex dataset, underscoring the significant potential for model self-improvement. The code and data are available at https://github.com/DIRECT-BIT/SRA-MCTS.
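A minimal sketch of the search loop is below; `propose_step` and `self_evaluate` are placeholders for the two LLM calls (step generation and self-evaluation), and the hyperparameters are illustrative rather than the released SRA-MCTS implementation:

```python
import math
import random

class Node:
    def __init__(self, steps, parent=None):
        self.steps, self.parent = steps, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    # standard UCT score: exploit high-value branches, explore rare ones
    return node.value / (node.visits + 1e-9) + c * math.sqrt(
        math.log(node.parent.visits + 1) / (node.visits + 1e-9))

def sra_mcts(problem, propose_step, self_evaluate, iters=50, max_depth=6):
    root = Node([])
    for _ in range(iters):
        node = root
        while node.children:                          # selection
            node = max(node.children, key=uct)
        if len(node.steps) < max_depth:               # expansion via the LLM
            for step in propose_step(problem, node.steps):
                node.children.append(Node(node.steps + [step], node))
            node = random.choice(node.children)
        score = self_evaluate(problem, node.steps)    # LLM self-evaluation
        while node is not None:                       # backpropagation
            node.visits += 1
            node.value += score
            node = node.parent
    return max(root.children, key=lambda n: n.visits).steps  # the "thinking"
```

The returned reasoning path would then be paired with the generated code to form a supervised fine-tuning example.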
5661: VidEvo: Evolving Video Editing through Exhaustive Temporal Modeling
Authors: Sizhe Dang, Huan Liu, Mengmeng Wang, Xin Lai, Guang Dai, Jingdong Wang
Location: Guangzhou | Day: TBD
Show Abstract
Text-guided video editing (TGVE) has become a recent hotspot due to its entertainment value and practical applications. To reduce overhead, existing methods primarily extend from text-to-image diffusion models and typically involve reconstruction and editing phases. However, challenges persist, particularly in enhancing temporal consistency of a video while adhering to textual alignment requirements. A crucial factor leading to the aforementioned issue is the inadequate and implicit tuning of the attention module within existing methods, which is specifically designed to capture temporal information. In light of this, we introduce VidEvo, a novel one-shot video editing method that leverages explicit cues derived from the original video to enhance temporal modeling. By integrating null-video embedding (NVE) and window-frame attention (WFA) components, VidEvo facilitates the smooth and coherent generation of videos from global and local perspectives simultaneously. To be specific, NVE learns a set of multi-scale temporal embeddings within the visual space during the reconstruction phase. These embeddings are subsequently directly injected into the attention module of the editing phase, explicitly augmenting the temporal consistency of the entire video. On the other hand, WFA enhances local temporal modeling by dynamically optimizing attention mechanisms between adjacent frames, which improves temporal coherence with reduced computational costs. Experimental evaluations show that VidEvo enhances frame-to-frame temporal consistency. Ablation studies confirm NVE and WFA’s effectiveness and their plug-and-play capability with other methods.
5670: New Algorithms for #2-SAT and #3-SAT
Authors: Junqiang Peng, Zimo Sheng, Mingyu Xiao
Location: Guangzhou | Day: TBD
Show Abstract
The #2-SAT and #3-SAT problems involve counting the number of satisfying assignments (also called models) for instances of 2-SAT and 3-SAT, respectively. In 2010, Zhou et al. (https://doi.org/10.1609/aaai.v24i1.7537) proposed an O*(1.1892^m)-time algorithm for #2-SAT and an efficient approach for #3-SAT, where m denotes the number of clauses. In this paper, we show that the weighted versions of #2-SAT and #3-SAT can be solved in O*(1.1082^m) and O*(1.4423^m) time, respectively. These results directly apply to the unweighted cases and achieve substantial improvements over the previous results. These advancements are enabled by the introduction of novel reduction rules, a refined analysis of branching operations, and the application of path decompositions on the primal and dual graphs of the formula.
5696: HeTa: Relation-wise Heterogeneous Graph Foundation Attack Model
Authors: Yuling Wang, Zihui Chen, Pengfei Jiao, Xiao Wang
Location: Guangzhou | Day: TBD
Show Abstract
Heterogeneous Graph Neural Networks (HGNNs) are vulnerable, highlighting the need for tailored attacks to assess their robustness and ensure security. However, existing HGNN attacks often require complex retraining of parameters to generate specific perturbations for new scenarios. Recently, foundation models have opened new horizons for the generalization of graph neural networks by capturing shared semantics across various graph distributions. This leads us to ask: Can we design a foundation attack model for HGNNs that enables generalizable perturbations across different HGNNs, and quickly adapts to new heterogeneous graphs (HGs)? Empirical findings reveal that, despite significant differences in model design and parameter space, different HGNNs surprisingly share common vulnerability patterns from a relation-aware perspective. Therefore, we explore how to design foundation HGNN attack criteria by mining shared attack units. In this paper, we propose a novel relation-wise heterogeneous graph foundation attack model, HeTa. We introduce a foundation surrogate model to align heterogeneity and identify the importance of shared relation-aware attack units. Building on this, we implement a serialized relation-by-relation attack based on the identified relational weights. In this way, the perturbation can be transferred to various target HGNNs and easily fine-tuned for new HGs. Extensive experiments demonstrate the strong attack performance and generalizability of our method.
5704: DiffECG: Diffusion Model-Powered Label-Efficient and Personalized Arrhythmia Diagnosis
Authors: Tianren Zhou, Zhenge Jia, Dongxiao Yu, Zhaoyan Shen
Location: Guangzhou | Day: TBD
Show Abstract
Arrhythmia diagnosis using electrocardiogram (ECG) is critical for preventing cardiovascular risks. However, existing deep learning-based methods struggle with label scarcity, and contrastive learning-based methods suffer from false-negative samples, both of which lead to poor model generalization. In addition, due to inter-subject variability, pre-trained models cannot achieve uniform performance across individuals. Conducting model fine-tuning for each individual is computationally expensive and does not guarantee improvement. We propose DiffECG, a diffusion-based self-supervised learning framework for label-efficient and personalized arrhythmia detection. Our method utilizes a diffusion model to extract robust ECG representations, coupled with a novel feature extractor and a multi-modal feature fusion strategy to obtain a well-generalized model. Moreover, we propose an efficient model personalization mechanism based on zeroth-order optimization. It personalizes the model by tuning the noise-adding step t in the diffusion process, significantly reducing computational costs compared to model fine-tuning. Experimental results show that our proposed method outperforms the SOTA method by 37.9% and 23.9% in generalization and personalization performance, respectively. The source code is available at: https://github.com/Auguuust/DiffEC
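Because only the scalar step t is adapted, personalization reduces to a one-dimensional gradient-free search. A hedged sketch (the finite-difference scheme and the `loss_fn` interface are assumptions, not the paper's exact procedure):

```python
import random

def personalize_t(t, loss_fn, steps=20, mu=2, lr=1.0, t_max=1000):
    """Zeroth-order sketch: adapt the diffusion noise-adding step t for one
    subject using loss evaluations only. `loss_fn(t)` stands in for the
    personalization objective computed on the user's own ECG recordings."""
    for _ in range(steps):
        d = random.choice([-1, 1])                      # random direction
        grad_est = (loss_fn(t + d * mu) - loss_fn(t)) / (d * mu)
        t = int(min(max(t - lr * grad_est, 1), t_max))  # projected update
    return t
```

Since only t changes, the pre-trained network weights stay frozen, which is what makes the per-user adaptation cheap.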
5720: Gradient-based Causal Feature Selection
Authors: Zhaolong Ling, Mengxiang Guo, Xingyu Wu, Debo Cheng, Peng Zhou, Tianci Li, Zhangling Duan
Location: Guangzhou | Day: TBD
Show Abstract
Causal feature selection leverages causal discovery techniques to identify critical features associated with a target variable using observational data. Traditional methodologies primarily rely on constraint-based or score-based techniques, which are fraught with limitations. For example, conditional independence tests often yield unreliable results in the presence of noise and complex data generation processes, while the computational complexity of learning directed acyclic graphs increases exponentially with the number of variables involved. In light of recent advancements in deep learning, gradient-based methods have shown promise for global causal discovery. However, significant challenges arise when focusing on the identification of local causal features, particularly in defining the local causal constraint space to achieve both minimality and completeness. To address these issues, we introduce a novel gradient-based causal feature selection method (GCFS) that leverages an AutoEncoder to simultaneously model the target variable alongside other variables, thereby capturing causal associations within a divide-and-conquer framework. Additionally, our approach incorporates a mask pruning strategy that transforms the search process into the minimization of a non-cyclic local reconstruction loss objective function. This function is then effectively optimized using a gradient-based method to accurately identify the causal features related to the target variable. Experimental results substantiate that GCFS surpasses existing methodologies across both synthetic and real datasets.
5745: SketchAgent: Generating Structured Diagrams from Hand-Drawn Sketches
Authors: Cheng Tan, Qi Chen, Jingxuan Wei, Gaowei Wu, Zhangyang Gao, Siyuan Li, Bihui Yu, Ruifeng Guo, Stan Z. Li
Location: Guangzhou | Day: TBD
Show Abstract
Hand-drawn sketches are a natural and efficient medium for capturing and conveying ideas. Despite significant advancements in controllable natural image generation, translating freehand sketches into structured, machine-readable diagrams remains a labor-intensive and predominantly manual task. The primary challenge stems from the inherent ambiguity of sketches, which lack the structural constraints and semantic precision required for automated diagram generation. To address this challenge, we introduce SketchAgent, a multi-agent system designed to automate the transformation of hand-drawn sketches into structured diagrams. SketchAgent integrates sketch recognition, symbolic reasoning, and iterative validation to produce semantically coherent and structurally accurate diagrams, significantly reducing the need for manual effort. To evaluate the effectiveness of our approach, we propose the Sketch2Diagram Benchmark, a comprehensive dataset and evaluation framework encompassing eight diverse diagram categories, such as flowcharts, directed graphs, and model architectures. The dataset comprises over 6,000 high-quality examples with token-level annotations, standardized preprocessing, and rigorous quality control. By streamlining the diagram generation process, SketchAgent holds great promise for applications in design, education, and engineering, while offering a significant step toward bridging the gap between intuitive sketching and machine-readable diagram generation.
5749: Categorical Attention: Fine-grained Language-guided Noise Filtering Network for Occluded Person Re-Identification
Authors: Minghui Chen, Dayan Wu, Chenxu Yang, Qinghang Su, Zheng Lin
Location: Guangzhou | Day: TBD
Show Abstract
Person Re-Identification (ReID) aims to match individuals across different camera views, but occlusions in real-world scenarios, such as vehicles or crowds, hinder feature extraction and matching. Current occluded ReID methodologies typically leverage visual augmentation techniques in an attempt to mitigate the disruptive effects of occlusion-induced noise. However, relying solely on visual data fails to effectively filter out occlusion noise. In this paper, we introduce the Fine-grained Language-guided Noise Filtering Network (FLaN-Net) for occluded ReID. FLaN-Net innovatively employs a categorical attention mechanism to generate adaptive tokens that capture the following three distinct types of visual information: comprehensive descriptions of individuals, detailed visible attributes, and characteristics of occluding objects. Subsequently, a cross-attention mechanism aligns these prompts with the image, guiding the model to focus on relevant regions. To generate robust and discriminative features for occluded pedestrians, we further introduce a dynamic weighting fusion module that integrates visual, textual, and cross-attention features based on their reliability. Experimental results demonstrate that FLaN-Net outperforms existing methods on occluded ReID benchmarks, offering a robust solution for challenging real-world conditions.
5761: Conditional Denoising Meets Polynomial Modeling: A Flexible Decoupled Framework for Time Series Forecasting
Authors: Jintao Zhang, Mingyue Cheng, Xiaoyu Tao, Zhiding Liu, Daoyu Wang
Location: Guangzhou | Day: TBD
Show Abstract
Time series forecasting models are becoming increasingly prevalent due to their critical role in decision-making across various domains. However, most existing approaches represent the coupled temporal patterns, often neglecting the distinction between their specific components. In particular, fluctuating patterns and smooth trends within time series exhibit distinct characteristics. In this work, to model complicated temporal patterns, we propose a Conditional Denoising Polynomial Modeling (CDPM) framework, where probabilistic diffusion models and deterministic linear models are trained end-to-end. Instead of modeling the coupled time series, CDPM decomposes it into trend and seasonal components for modeling them separately. To capture the fluctuating seasonal component, we employ a probabilistic diffusion model based on statistical properties from the historical window. For the smooth trend component, a module is proposed to enhance linear models by incorporating historical dependencies, thereby preserving underlying trends and mitigating noise distortion. Extensive experiments conducted on six benchmarks demonstrate the effectiveness of our framework, highlighting the potential of combining probabilistic and deterministic models. Our code is available at https://github.com/zjt-gpu/CDPM.
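The decoupling can be pictured as a classical moving-average split; the sketch below is a simplified assumption of the decomposition (CDPM's actual decomposition and conditioning details may differ), with the smooth output going to the linear trend module and the residual to the diffusion model:

```python
import numpy as np

def decompose(x, window=25):
    """Moving-average decomposition sketch (window should be odd): the
    smooth trend is handled by an enhanced linear model, the fluctuating
    seasonal residual by a conditional diffusion model."""
    pad = window // 2
    padded = np.pad(x, (pad, pad), mode="edge")
    trend = np.convolve(padded, np.ones(window) / window, mode="valid")
    seasonal = x - trend
    return trend, seasonal
```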
5804: Beyond Fixed Length: Bucket Pre-training is All You Need
Authors: Qing Yang, Qiyao Peng, Hongtao Liu, Kai Liu, Bing Qin, Ting Liu
Location: Guangzhou | Day: TBD
Show Abstract
Large Language Models (LLMs) have demonstrated exceptional performance across various tasks, with the pre-training stage serving as the cornerstone of their capabilities. However, the conventional fixed-length data composition strategy for pre-training presents several practical challenges. When using shorter sequences, documents are often truncated, potentially leading to information loss and affecting the model’s ability to capture long-range dependencies. Conversely, longer sequences require concatenation of multiple documents, which can introduce noise, disrupt natural document boundaries and semantic coherence, and incur substantial computational overhead. To address these challenges, we first establish three quantitative metrics for evaluating data composition quality: padding ratio, truncation ratio, and concatenation ratio. Building upon these metrics, we propose a novel multi-bucket data composition method that transcends the fixed-length paradigm. Our approach adaptively organizes training data to achieve optimal composition quality as measured by the proposed metrics, offering a more flexible and efficient paradigm for pre-training. We conduct extensive experiments and the results demonstrate that our proposed method significantly enhances both the efficiency and effectiveness of LLM pre-training. Our proposed method has been adopted in the Du Xiaoman–XuanYuan series of financial large language models at https://github.com/Duxiaoman-DI/XuanYuan.
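The metrics and the bucket assignment admit a compact illustration. The definitions below are plausible assumptions consistent with the abstract, not the paper's exact formulas:

```python
def composition_metrics(doc_lens, seq_len):
    # Fixed-length composition with one document per sequence: shorter
    # documents are padded up to seq_len, longer ones are truncated.
    truncated = sum(max(n - seq_len, 0) for n in doc_lens)  # tokens lost
    padded = sum(max(seq_len - n, 0) for n in doc_lens)     # filler tokens
    return {"padding_ratio": padded / (len(doc_lens) * seq_len),
            "truncation_ratio": truncated / sum(doc_lens)}

def assign_buckets(doc_lens, buckets=(512, 1024, 2048, 4096)):
    # Multi-bucket composition sketch: route each document to the smallest
    # bucket that holds it, so long documents escape truncation and short
    # ones incur little padding.
    return [min((b for b in buckets if b >= n), default=buckets[-1])
            for n in doc_lens]
```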
5811: Learning Robust Multi-view Representation Using Dual-masked VAEs
Authors: Jiedong Wang, Kai Guo, Peng Hu, Xi Peng, Hao Wang
Location: Guangzhou | Day: TBD
Show Abstract
Most existing multi-view representation learning methods assume view-completeness and noise-free data. However, such assumptions are often violated in real-world applications. Despite advances in methods tailored to view-missing or noise problems individually, a one-size-fits-all approach that concurrently addresses both remains unavailable. To this end, we propose a holistic method, called Dual-masked Variational Autoencoders (DualVAE), which aims at learning robust multi-view representation. DualVAE is an innovative amalgamation of dual-masked prediction, mixture-of-experts learning, representation disentangling, and a joint loss function that ties all components together. The key novelty lies in the dual-masked (view-mask and patch-mask) mechanism to mimic missing views and noisy data. Extensive experiments on four multi-view datasets show the effectiveness of the proposed method and its superior performance in comparison to baselines. The code is available at https://github.com/XLearning-SCU/2025-IJCAI-DualVAE.
5828: EchoGPT: An Interactive Cardiac Function Assessment Model for Echocardiogram Videos
Authors: Bo Xu, Quanhao Zhu, Qingchen Zhang, Mengmeng Wang, Liang Zhao, Hongfei Lin, Jing Ren, Feng Xia
Location: Guangzhou | Day: TBD
Show Abstract
With the development of wearable cardiac ultrasound devices, it is no longer sufficient to rely solely on doctors for diagnosing long-term echocardiogram videos. Automated diagnosis of echocardiogram videos has now become a research hotspot. Existing studies analyze echocardiogram videos only through discriminative models, which have limited question-answering capabilities. Therefore, this study innovatively proposes a large language model with cardiac ultrasound diagnostic capabilities—EchoGPT. EchoGPT integrates the robust communication and comprehension capabilities of large language models (LLMs) with the diagnostic prowess of traditional medical models, empowering patients to obtain accurate medical indicator data and comprehend their health conditions through interactive questioning with the model. The model is capable of local deployment on personal computers, effectively safeguarding user privacy. EchoGPT operates through three main components: left ventricle segmentation, left ventricular ejection fraction (LVEF) prediction, and fine-tuning of video-text LLMs. Experimental results demonstrate EchoGPT’s superior accuracy in predicting LVEF compared to other models, as well as positive feedback from professional physicians in questionnaire surveys, validating its potential in practical applications. The demo is available at https://github.com/zhuqh19/EchoGPT.
5835: Explainable Graph Representation Learning via Graph Pattern Analysis
Authors: Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan
Location: Guangzhou | Day: TBD
Show Abstract
Explainable artificial intelligence (XAI) is an important area in the AI community, and interpretability is crucial for building robust and trustworthy AI models. While previous work has explored model-level and instance-level explainable graph learning, there has been limited investigation into explainable graph representation learning. In this paper, we focus on representation-level explainable graph learning and ask a fundamental question: What specific information about a graph is captured in graph representations? Our approach is inspired by graph kernels, which evaluate graph similarities by counting substructures within specific graph patterns. Although the pattern counting vector can serve as an explainable representation, it has limitations such as ignoring node features and being high-dimensional. To address these limitations, we introduce a framework (PXGL-GNN) for learning and explaining graph representations through graph pattern analysis. We start by sampling graph substructures of various patterns. Then, we learn the representations of these patterns and combine them using a weighted sum, where the weights indicate the importance of each graph pattern’s contribution. We also provide theoretical analyses of our methods, including robustness and generalization. In our experiments, we show how to learn and explain graph representations for real-world data using pattern analysis. Additionally, we compare our method against multiple baselines in both supervised and unsupervised learning tasks to demonstrate its effectiveness.
5859: Code-BT: A Code-Driven Approach to Behavior Tree Generation for Robot Tasks Planning with Large Language Models
Authors: Siyang Zhang, Bin Li, Jingtao Qi, Xueying Wang, Fu Li, Jianan Wang, En Zhu, Jinjing Sun
Location: Guangzhou | Day: TBD
Show Abstract
Behavior trees (BTs) provide a systematic and structured control architecture extensively employed in game AI and robotic behavior control, owing to their modularity, reactivity, and reusability. Nonetheless, manual BT design requires significant expertise and becomes inefficient as task complexity increases. Recent automation technologies have reduced manual work, but often have high application barriers and face challenges in adapting to new tasks, making it difficult to configure them for specific requirements. Code-BT introduces a novel approach that utilizes large language models (LLMs) to automatically generate BTs, representing the task planning process as one of coding and organizing sequences. By retrieving control-flow information from the generated code, BTs can be efficiently constructed to address the complexity and diversity of task planning challenges. Rather than relying on manual design, Code-BT uses task instructions to guide the selection of relevant APIs, and then systematically assembles these APIs into modular code aligned with the BT structure. Finally, action sequences and control logic are extracted from the generated code to construct the BTs. Our approach not only ensures the automation of BT generation but also guarantees scalability and adaptability for long-term tasks. Experimental results demonstrate that Code-BT substantially improves LLM performance in BT generation, achieving improvements ranging from 16.67% to 29.17%.
5882: Token-Level Accept or Reject: A Micro Alignment Approach for Large Language Models
Authors: Yang Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung
Location: Guangzhou | Day: TBD
Show Abstract
With the rapid development of Large Language Models (LLMs), aligning these models with human preferences and values is critical to ensuring ethical and safe applications. However, existing alignment techniques such as RLHF or DPO often require direct fine-tuning on LLMs with billions of parameters, resulting in substantial computational costs and inefficiencies. To address this, we propose the Micro token-level Accept-Reject Aligning (MARA) approach, designed to operate independently of the language model. MARA simplifies the alignment process by decomposing sentence-level preference learning into token-level binary classification, where a compact three-layer fully-connected network determines whether candidate tokens are “Accepted” or “Rejected” as part of the response. Extensive experiments across seven different LLMs and three open-source datasets show that MARA achieves significant improvements in alignment performance while reducing computational costs. The source code and implementation details are publicly available at https://github.com/IAAR-Shanghai/MARA, and the trained models are released at https://huggingface.co/IAAR-Shanghai/MARA_AGENTS.
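The core of the approach is deliberately small. The sketch below assumes the classifier sees the candidate token's embedding concatenated with a context summary; the input layout and dimensions are illustrative, not the released architecture:

```python
import torch
import torch.nn as nn

class AcceptRejectHead(nn.Module):
    """Compact three-layer classifier scoring whether a candidate token
    should be accepted into the aligned response."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, token_emb, context_emb):
        logit = self.net(torch.cat([token_emb, context_emb], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)  # P(accept) per candidate
```

At decoding time, rejected candidates would be masked out and the frozen LLM re-samples, so alignment never touches the base model's parameters.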
5898: Leveraging Personalized PageRank and Higher-Order Topological Structures for Heterophily Mitigation in Graph Neural Networks
Authors: Yumeng Wang, Zengyi Wo, Wenjun Wang, Xingcheng Fu, Minglai Shao
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) excel in node classification tasks but often assume homophily, where connected nodes share similar labels. This assumption does not hold in many real-world heterophilic graphs. Existing models for heterophilic graphs primarily rely on pairwise relationships, overlooking multi-scale information from higher-order structures. This leads to suboptimal performance, particularly under noise from conflicting class information across nodes. To address these challenges, we propose HPGNN, a novel model integrating Higher-order Personalized PageRank with Graph Neural Networks. HPGNN introduces an efficient high-order approximation of Personalized PageRank (PPR) to capture long-range and multi-scale node interactions. This approach reduces computational complexity and mitigates noise from surrounding information. By embedding higher-order structural information into convolutional networks, HPGNN effectively models key interactions across diverse graph dimensions. Extensive experiments on benchmark datasets demonstrate HPGNN’s effectiveness. The model achieves better performance than five out of seven state-of-the-art methods on heterophilic graphs in downstream tasks while maintaining competitive performance on homophilic graphs. HPGNN’s ability to balance multi-scale information and robustness to noise makes it a versatile solution for real-world graph learning challenges. Codes are available at https://github.com/streetcorner/HPGNN.
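High-order PPR propagation can be approximated by truncating the geometric series of walk lengths. A dense sketch for intuition (real implementations would use sparse operations and avoid materializing n-by-n matrices):

```python
import numpy as np

def approx_ppr(adj, alpha=0.15, iters=10):
    """Truncated power-series sketch of Personalized PageRank:
    alpha * sum_k (1 - alpha)^k * T^k, capturing multi-scale,
    long-range node influence."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    trans = adj / deg                       # row-stochastic transition matrix
    n = adj.shape[0]
    ppr = np.eye(n)                         # k = 0 term
    power = np.eye(n)
    for k in range(1, iters + 1):
        power = power @ trans
        ppr += (1 - alpha) ** k * power     # longer walks, smaller weight
    return alpha * ppr
```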
5901: MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection
Authors: Jiayi Cheng, Can Gao, Jie Zhou, Jiajun Wen, Tao Dai, Jinbao Wang
Location: Guangzhou | Day: TBD
Show Abstract
3D Anomaly Detection (AD) is a promising means of controlling the quality of manufactured products. However, existing methods typically require carefully training a task-specific model for each category independently, leading to high cost, low efficiency, and weak generalization. This study presents a novel unified model for Multi-Category 3D Anomaly Detection (MC3D-AD) that aims to utilize both local and global geometry-aware information to reconstruct normal representations of all categories. First, to learn robust and generalized features of different categories, we propose an adaptive geometry-aware masked attention module that extracts geometry variation information to guide mask attention. Then, we introduce a local geometry-aware encoder reinforced by the improved mask attention to encode group-level feature tokens. Finally, we design a global query decoder that utilizes point cloud position embeddings to improve the decoding process and reconstruction ability. This leads to local and global geometry-aware reconstructed feature tokens for the 3D AD task. MC3D-AD is evaluated on the two publicly available Real3D-AD and Anomaly-ShapeNet datasets, and exhibits significant superiority over current state-of-the-art single-category methods, achieving 3.1% and 9.3% improvements in object-level AUROC on Real3D-AD and Anomaly-ShapeNet, respectively. The code is available at https://github.com/iCAN-SZU/MC3D-AD.
5903: Heterogeneous Temporal Hypergraph Neural Network
Authors: Huan Liu, Pengfei Jiao, Mengzhou Gao, Chaochao Chen, Di Jin
Location: Guangzhou | Day: TBD
Show Abstract
Graph representation learning (GRL) has emerged as an effective technique for modeling graph-structured data. When modeling heterogeneity and dynamics in real-world complex networks, GRL methods designed for complex heterogeneous temporal graphs (HTGs) have been proposed and have achieved successful applications in various fields. However, most existing GRL methods mainly focus on preserving low-order topology information while ignoring higher-order group interaction relationships, which are more consistent with real-world networks. In addition, most existing hypergraph methods can only model static homogeneous graphs, limiting their ability to model high-order interactions in HTGs. Therefore, to enable the GRL model to capture high-order interaction relationships in HTGs, we first propose a formal definition of heterogeneous temporal hypergraphs and a P-uniform heterogeneous hyperedge construction algorithm that does not rely on additional information. Then, a novel Heterogeneous Temporal HyperGraph Neural network (HTHGN) is proposed to fully capture higher-order interactions in HTGs. HTHGN contains a hierarchical attention mechanism module that simultaneously performs temporal message-passing between heterogeneous nodes and hyperedges to capture rich semantics in the wider receptive field brought by hyperedges. Furthermore, HTHGN performs contrastive learning by maximizing the consistency between low-order correlated heterogeneous node pairs on the HTG to avoid the low-order structural ambiguity issue. Detailed experimental results on three real-world HTG datasets verify the effectiveness of the proposed HTHGN for modeling high-order interactions in HTGs and demonstrate significant performance improvements.
5914: No Regret Reinforcement Learning Algorithms for Online Scheduling with Multi-Stage Tasks
Authors: Yongxin Xu, Hengquan Guo, Ziyu Shao, Xin Liu
Location: Guangzhou | Day: TBD
Show Abstract
We study online task scheduling problems where tasks arrive sequentially and are processed by the platform or server. The service processes for tasks are multi-stage and are modeled as episodic Markov Decision Processes (MDPs). While processing a task, the system acquires rewards by consuming resources. The goal of the platform is to maximize the reward-to-cost ratio over a sequence of K tasks.
Online scheduling with multi-stage tasks faces two major challenges: intra-dependence among the different stages within a task and inter-dependence among different tasks. These challenges are further exacerbated by the unknown rewards, costs, and task arrival distribution. To address these challenges, we propose the Robbins-Monro-based Value Iteration for Ratio Maximization (RM^2VI) algorithm. Specifically, RM^2VI addresses “intra-dependence” through optimistic value iteration and handles “inter-dependence” using the Robbins-Monro method. The algorithm has a greedy structure and achieves a sub-linear regret of O(K^(3/4)), establishing the per-task no-regret property.
We test RM^2VI in two synthetic experiments: sales promotion in e-commerce and machine learning job training in cloud computing. The results show that RM^2VI achieves the best reward-to-cost ratio compared with the baselines.
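The inter-task coupling handled by the Robbins-Monro step has a clean one-line form. A sketch, under the assumption that the ratio estimate is updated once per completed task with a diminishing step size:

```python
def robbins_monro_ratio(episodes, rho0=0.0):
    """Track rho = E[reward] / E[cost] from per-task observations: rho is
    the root of E[R - rho * C] = 0, which stochastic approximation finds
    without knowing the reward, cost, or arrival distributions."""
    rho = rho0
    for k, (reward, cost) in enumerate(episodes, start=1):
        step = 1.0 / k                       # diminishing Robbins-Monro step
        rho += step * (reward - rho * cost)
    return rho
```

Within each task, the optimistic value iteration would then treat R - rho * C as its per-stage surrogate objective.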
5920: Logarithmic Approximations for Fair k-Set Selection
Authors: Shi Li, Chenyang Xu, Ruilong Zhang
Location: Guangzhou | Day: TBD
Show Abstract
We study the fair k-set selection problem, where we aim to select k sets from a given set system such that the (weighted) number of times each element appears in the k selected sets is balanced, i.e., the maximum (weighted) occurrence count is minimized. By observing that a set system can be formulated as a bipartite graph G := (L ∪ R, E), our problem is equivalent to selecting k vertices from R such that the maximum (weighted) number of selected neighbors over the vertices in L is minimized. The problem arises in a wide range of applications in various fields, such as machine learning, artificial intelligence, and operations research.

We first prove that the problem is NP-hard even if the maximum degree Delta of the input bipartite graph is 3, and that the problem is in P when Delta = 2. We then show that the problem is also in P when the input set system forms a laminar family. Based on a natural linear programming relaxation, we show that two rounding algorithms achieve an O(log n / log log n)-approximation on general bipartite graphs, and that an independent rounding algorithm achieves an O(log Delta)-approximation on bipartite graphs with maximum degree Delta. We demonstrate that our analysis is almost tight by providing a hard instance for this linear program.
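For intuition, independent rounding of an LP solution looks as follows; the cardinality-repair rule is an assumption added for illustration and is not taken from the paper:

```python
import random

def independent_rounding(x, k):
    """Sketch: include each set j independently with probability x_j (the
    LP's fractional value, with sum(x) = k), then repair to exactly k sets."""
    n = len(x)
    chosen = [j for j in range(n) if random.random() < x[j]]
    if len(chosen) < k:  # top up with the largest remaining fractional values
        rest = sorted(set(range(n)) - set(chosen), key=lambda j: -x[j])
        chosen += rest[:k - len(chosen)]
    return sorted(chosen, key=lambda j: -x[j])[:k]  # trim if we overshot
```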
5923: AttentionDrag: Exploiting Latent Correlation Knowledge in Pre-trained Diffusion Models for Image Editing
Authors: Biao Yang, Muqi Huang, Yuhui Zhang, Yun Xiong, Kun Zhou, Xi Chen, Shiyang Zhou, Huishuai Bao, Chuan Li, Feng Shi, Hualei Liu
Location: Guangzhou | Day: TBD
Show Abstract
Traditional point-based image editing methods rely on iterative latent optimization or geometric transformations, which are either inefficient in their processing or fail to capture the semantic relationships within the image. These methods often overlook the powerful yet underutilized image editing capabilities inherent in pre-trained diffusion models. In this work, we propose a novel one-step point-based image editing method, named AttentionDrag, which leverages the inherent latent knowledge and feature correlations within pre-trained diffusion models for image editing tasks. This framework enables semantic consistency and high-quality manipulation without the need for extensive re-optimization or retraining. Specifically, we reuse the latent correlation knowledge learned by the self-attention mechanism in the U-Net module during the DDIM inversion process to automatically identify and adjust relevant image regions, ensuring semantic validity and consistency. Additionally, AttentionDrag adaptively generates masks to guide the editing process, enabling precise and context-aware modifications with user-friendly interaction. Our results demonstrate performance that surpasses most state-of-the-art methods with significantly faster speeds, offering a more efficient and semantically coherent solution for point-based image editing tasks. Code is released at: https://github.com/GPlaying/AttentionDrag.
5931: Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation
Authors: Feizhen Huang, Yu Wu, Yutian Lin, Bo Du
Location: Guangzhou | Day: TBD
Show Abstract
Video-to-Audio (V2A) generation has achieved significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.
5933: Active Multimodal Distillation for Few-shot Action Recognition
Authors: Weijia Feng, Yichen Zhu, Ruojia Zhang, Chenyang Wang, Fei Ma, Xiaobao Wang, Xiaobai Li
Location: Guangzhou | Day: TBD
Show Abstract
Owing to its rapid progress and broad application prospects, few-shot action recognition has attracted considerable interest. However, current methods are predominantly based on limited single-modal data, which does not fully exploit the potential of multimodal information. This paper presents a novel framework that actively identifies reliable modalities for each sample using task-specific contextual cues, thus significantly improving recognition performance. Our framework integrates an Active Sample Inference (ASI) module, which utilizes active inference to predict reliable modalities based on posterior distributions and subsequently organizes them accordingly. Unlike reinforcement learning, active inference replaces rewards with evidence-based preferences, making more stable predictions.
Additionally, we introduce an active mutual distillation module that enhances the representation learning of less reliable modalities by transferring knowledge from more reliable ones. Adaptive multimodal inference is employed during meta-testing to assign higher weights to reliable modalities. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing approaches.
5956: MaskDGNN: Self-Supervised Dynamic Graph Neural Networks with Activeness-aware Temporal Masking
Authors: Yiming He, Xiang Li, Zhongying Zhao, Haobing Liu, Peilan He, Yanwei Yu
Location: Guangzhou | Day: TBD
Show Abstract
Integrating dynamics into graph neural networks (GNNs) provides deeper insights into the evolution of dynamic graphs, thereby enhancing temporal representation in real-world dynamic network problems. Existing methods for extracting critical information from dynamic graphs face two key challenges: they either overlook the negative impact of redundant information or struggle to address the distribution-shift issue in dynamic graphs. To address these challenges, we propose MaskDGNN, a novel dynamic GNN architecture that consists of two modules. First, a self-supervised activeness-aware temporal masking mechanism selectively retains edges between highly active nodes while masking those with low activeness, effectively reducing redundancy. Second, an adaptive frequency-enhancing graph representation learner amplifies the frequency-domain features of nodes to capture intrinsic features under distribution shift. Experiments on five real-world dynamic graph datasets demonstrate that MaskDGNN outperforms state-of-the-art methods, achieving an average improvement of 7.07% in accuracy and 13.87% in MRR for link prediction tasks.
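The masking module can be pictured as a recency-weighted activity score per node. The exponential-decay scoring and quantile threshold below are illustrative assumptions, not the paper's exact mechanism:

```python
import numpy as np

def activeness_mask(edges, timestamps, t_now, tau=1.0, keep_quantile=0.3):
    """Score each node by a recency-weighted count of its interactions,
    then mask edges whose endpoints are both below the activeness threshold.
    edges: array of shape (E, 2); timestamps: array of shape (E,)."""
    edges = np.asarray(edges)
    n = int(edges.max()) + 1
    act = np.zeros(n)
    for (u, v), t in zip(edges, timestamps):
        w = np.exp(-(t_now - t) / tau)   # recent interactions count more
        act[u] += w
        act[v] += w
    thresh = np.quantile(act, keep_quantile)
    keep = np.array([(act[u] >= thresh) or (act[v] >= thresh)
                     for u, v in edges])
    return keep                           # True = edge survives masking
```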
5957: Enhanced Unsupervised Discriminant Dimensionality Reduction for Nonlinear Data
Authors: Qianqian Wang, Mengping Jiang, Wei Feng, Zhengming Ding
Location: Guangzhou | Day: TBD
Show Abstract
Linear Discriminant Analysis (LDA) is a classical supervised dimensionality reduction algorithm. However, LDA focuses more on global structure and overly depends on reliable data labels. For data with outliers and nonlinear structures, LDA cannot effectively capture the true structure of the data. Moreover, the subspace dimension learned by LDA must be smaller than the number of clusters, which limits its practical applications. To address these issues, we propose a novel unsupervised LDA method that combines centerless K-means and LDA. This method eliminates the need to calculate cluster centroids and improves model robustness. By fusing centerless K-means and LDA into a unified framework and deducing the connection between K-means and manifold learning, this method captures both the local manifold structure and the discriminative structure. Additionally, the dimensionality of the subspace is not restricted. This method not only overcomes the limitations of traditional LDA but also improves the model’s adaptability to complex data. Extensive experiments on seven datasets demonstrate the effectiveness of the proposed method.
5979: Learning to Explain: Towards Human-Aligned Explainability in Deep Reinforcement Learning via Attention Guidance
Authors: Bokai Ji, Guangxia Li, Yulong Shen, Gang Xiao
Location: Guangzhou | Day: TBD
Show Abstract
Recent advances in explainable deep reinforcement learning (DRL) have provided insights into the reasoning behind decisions made by DRL agents. However, existing methods often overlook the subjective nature of explanations and fail to consider human cognitive styles and preferences. This omission tends to reduce the interpretability and relevance of the generated explanations from a human evaluator’s perspective. To address this issue, we introduce human cognition into the explaining procedure by integrating DRL with attention guidance in a novel manner. The proposed concept proximal policy optimization (Concept-PPO) learns to generate human-aligned explanations by jointly optimizing the DRL performance and the discrepancy between generated explanations and human annotations. Its key component is a specially designed spatial concept transformer that enhances explanation efficiency by pre-masking decision-irrelevant information. Experiments on the ATARI benchmark demonstrate that Concept-PPO achieves better policies than its black-box counterparts, and user studies confirm its superiority in generating human-aligned explanations compared to existing explainable DRL methods.
5992: Fair Submodular Maximization over a Knapsack Constraint
Authors: Lijun Li, Chenyang Xu, Liuyi Yang, Ruilong Zhang
Location: Guangzhou | Day: TBD
Show Abstract
We consider fairness in submodular maximization subject to a knapsack constraint, a fundamental problem with various applications in economics, machine learning, and data mining. In the model, we are given a set of ground elements, each associated with a cost and a color, and a monotone submodular function defined over them. The goal is to maximize the submodular function while guaranteeing that the total cost does not exceed a specified budget (the knapsack constraint) and that the number of elements selected for each color falls within a designated range (the fairness constraint).

While there exists some recent literature on this topic, the existence of a non-trivial approximation for the problem, without relaxing either the knapsack or fairness constraints, remains a challenging open question. This paper makes progress in this direction. We demonstrate that when the number of colors is constant, there exists a polynomial-time algorithm that achieves a constant approximation with high probability. Additionally, we show that if either the knapsack or fairness constraint is relaxed to require only expected satisfaction, a tight approximation ratio of (1 - 1/e - epsilon) can be obtained in expectation for any epsilon > 0.
5994: MSMAR-RL: Multi-Step Masked-Attention Recovery Reinforcement Learning for Safe Maneuver Decision in High-Speed Pursuit-Evasion Game
Authors: Yang Zhao, Wenzhe Zhao, Xuelong Li
Location: Guangzhou | Day: TBD
Show Abstract
Ensuring the safety of high-speed agents in dynamic adversarial environments, such as pursuit-evasion games with target pursuit and obstacle avoidance, is a significant challenge. Existing reinforcement learning methods often fail to balance safety and reward under strict safety constraints and diverse environmental conditions. To address these limitations, this paper proposes a novel zero-constraint-violation recovery RL framework tailored for high-speed UAV pursuit-evasion combat games. The framework includes three key innovations. (1) An extendable multi-step reach-avoid theory: we provide a zero-constraint-violation safety guarantee for multi-strategy reinforcement learning and enable early danger detection in high-speed games. (2) A masked-attention recovery strategy: we introduce a padding-mask attention architecture to handle spatiotemporal variations in dynamic obstacles with varying threat levels. (3) Experimental validation: we validate the framework in obstacle-rich pursuit-evasion scenarios, demonstrating its superiority through comparisons with other algorithms and ablation studies. Our approach also shows potential for extension to other rapid-motion tasks and more complex hazardous scenarios. Details and code can be found at https://msmar-rl.github.io.
6013: Contrastive Cross-Course Knowledge Tracing via Concept Graph Guided Knowledge Transfer
Authors: Wenkang Han, Wang Lin, Liya Hu, Zhenlong Dai, Yiyun Zhou, Mengze Li, Zemin Liu, Chang Yao, Jingyuan Chen
Location: Guangzhou | Day: TBD
Show Abstract
Knowledge tracing (KT) aims to predict learners’ future performance based on historical learning interactions. However, existing KT models predominantly focus on data from a single course, limiting their ability to capture a comprehensive understanding of learners’ knowledge states. In this paper, we propose TransKT, a contrastive cross-course knowledge tracing method that leverages concept graph guided knowledge transfer to model the relationships between learning behaviors across different courses, thereby enhancing knowledge state estimation. Specifically, TransKT constructs a cross-course concept graph by leveraging zero-shot Large Language Model (LLM) prompts to establish implicit links between related concepts across different courses. This graph serves as the foundation for knowledge transfer, enabling the model to integrate and enhance the semantic features of learners’ interactions across courses. Furthermore, TransKT includes an LLM-to-LM pipeline for incorporating summarized semantic features, which significantly improves the performance of Graph Convolutional Networks (GCNs) used for knowledge transfer. Additionally, TransKT employs a contrastive objective that aligns single-course and cross-course knowledge states, thereby refining the model’s ability to provide a more robust and accurate representation of learners’ overall knowledge states. Our code and datasets are available at https://github.com/DQYZHWK/TransKT/.
6021: Dual Robust Unbiased Multi-View Clustering for Incomplete and Unpaired Information
Authors: Liang Zhao, Ziyue Wang, Chuanye He, Qingchen Zhang, Bo Xu
Location: Guangzhou | Day: TBD
Show Abstract
Recently, multi-view data has gradually attracted attention. However, real-world applications often face the Partial View-aligned Problem (PVP) and the Partially Sample-missing Problem (PSP) due to data loss or corruption. Existing methods addressing PVP typically focus only on learning from aligned data, while ignoring unaligned data where samples exist but lack alignment relationships. This introduces PSP, which does not inherently exist in the data, leading to biased learning of the data’s information. For PSP, varying degrees of missing data yield incomplete spatial structures that can cause a clustering-center shift problem, resulting in the model learning incorrect correspondences and biased spatial structures. To tackle these problems, we propose a novel method called Dual Robust Unbiased Multi-View Clustering for Incomplete and Unpaired Information (DRUMVC). To our knowledge, this is the first noise-robust and unbiased multi-view clustering method capable of simultaneously addressing both PVP and PSP. Specifically, DRUMVC leverages aligned and complete samples as a bridge to construct high-quality correspondences for samples lacking cross-view relationship information due to PVP or PSP. Additionally, we employ a dual noise-robust contrastive learning loss to mitigate the impact of noise potentially introduced during pair construction. Experiments on several challenging datasets demonstrate the superiority of our proposed method.
6026: SAP: Privacy-Preserving Fine-Tuning on Language Models with Split-and-Privatize Framework
Authors: Xicong Shen, Yang Liu, Yi Liu, Peiran Wang, Huiqi Liu, Jue Hong, Bing Duan, Zirui Huang, Yunlong Mao, Ye Wu, Sheng Zhong
Location: Guangzhou | Day: TBD
Show Abstract
Pre-trained Language Models (PLMs) have enabled a cost-effective approach to handling various downstream applications via Parameter-Efficient Fine-Tuning (PEFT) techniques. In this context, service providers have introduced a popular fine-tuning-based product service known as Model-as-a-Service (MaaS). This service offers users access to extensive PLMs and training resources. With MaaS, users can fine-tune, deploy, and utilize their customized models seamlessly, leveraging a one-stop platform that allows them to work with their private datasets efficiently. However, this service paradigm has recently been exposed to the possibility of leaking user private data. To this end, we identify the data privacy leakage risks in MaaS-based PEFT and propose a Split-and-Privatize (SAP) framework, mitigating the privacy leakage by integrating split learning and differential privacy into MaaS PEFT. Furthermore, we propose Contributing-Token-Identification (CTI), a novel method to balance model utility degradation and privacy leakage. As a result, the proposed framework is comprehensively evaluated, demonstrating a 65% improvement in empirical privacy with only a 1% degradation in model performance on the Stanford Sentiment Treebank dataset, outperforming existing state-of-the-art baselines.
6044: Learning Neural Jump Stochastic Differential Equations with Latent Graph for Multivariate Temporal Point Processes
Authors: Yuchen Wang, Dongpeng Hou, Chao Gao, Xianghua Li
Location: Guangzhou | Day: TBD
Show Abstract
Multivariate Temporal Point Processes (MTPPs) play an important role in diverse domains such as social networks and finance for predicting event sequence data. In recent years, MTPPs based on Ordinary Differential Equations (ODEs) and Stochastic Differential Equations (SDEs) have demonstrated their strong modeling capabilities. However, these models have yet to thoroughly consider the underlying relationships among different event types to enhance their modeling capacity. Therefore, this paper introduces a method that uses neural SDEs with a jump process guided by a latent graph. Firstly, our proposed method employs multi-dimensional SDEs to capture the dynamics of the intensity function for each event type. Subsequently, a latent graph structure is integrated into the jump process without any encoder, aiming to enhance the modeling and predictive capabilities for MTPPs. Theoretical analysis guarantees the existence and uniqueness of the solution for our proposed method. The experiments conducted on multiple real-world datasets show that our approach demonstrates significant competitiveness when compared to state-of-the-art neural point processes. Meanwhile, the trainable parameters of the latent graph also improve model interpretability without any prior knowledge. Our code is available at https://github.com/cgao-comp/LNJSDE.
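Between events the intensity follows the SDE; at event times a jump term (graph-modulated in the paper) kicks it. A plain Euler-Maruyama simulation sketch with thinning, where the `drift`, `diffusion`, and `jump` interfaces are assumptions:

```python
import numpy as np

def simulate_intensity(drift, diffusion, jump, lam0, T, dt=0.01, rng=None):
    """Simulate one event type's intensity lambda driven by
    d(lambda) = drift dt + diffusion dW + jump dN."""
    rng = rng or np.random.default_rng()
    lam, t, events = lam0, 0.0, []
    while t < T:
        dw = rng.normal(0.0, np.sqrt(dt))
        lam += drift(lam) * dt + diffusion(lam) * dw   # continuous flow
        lam = max(lam, 1e-6)                           # keep intensity valid
        if rng.random() < lam * dt:                    # event via thinning
            events.append(t)
            lam += jump(lam)                           # jump at event time
        t += dt
    return events
```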
6053: Towards Automatic Sampling of User Behaviors for Sequential Recommender Systems
Authors: Hao Zhang, Mingyue Cheng, Zhiding Liu, Junzhe Jiang
Location: Guangzhou | Day: TBD
Show Abstract
Sequential recommender systems (SRS) have gained increasing popularity due to their remarkable proficiency in capturing dynamic user preferences. In the current setup of SRS, a common configuration is to uniformly consider each historical behavior as a positive interaction. However, this setting has the potential to yield sub-optimal performance, as each individual item often has a different impact on shaping the user’s interests. Hence, in this paper, we propose a novel automatic sampling framework for sequential recommendation, named AutoSAM, to non-uniformly treat historical behaviors. Specifically, AutoSAM extends the conventional SRS framework by integrating an extra sampler to intelligently discern the skew distribution of the raw input, and then sample informative sub-sets to build a more generalizable SRS. To tackle the challenges posed by non-differentiable sampling actions and to introduce multiple decision factors for sampling, we further design a novel reinforcement learning based method to guide the training of the sampler. Furthermore, we theoretically devise multi-objective sampling rewards including Future Prediction and Sequence Perplexity, and then optimize the whole framework in an end-to-end manner by combining the policy gradient. We conduct extensive experiments on benchmark recommendation models and four real-world datasets. The experimental results demonstrate the effectiveness of the proposed AutoSAM.
6109: Enhancing Mixture of Experts with Independent and Collaborative Learning for Long-Tail Visual Recognition
Authors: Yanhao Chen, Zhongquan Jian, Nianxin Ke, Shuhao Hu, Junjie Jiao, Qingqi Hong, Qingqiang Wu
Location: Guangzhou | Day: TBD
Show Abstract
Deep neural networks (DNNs) face substantial challenges in Long-Tail Visual Recognition (LTVR) due to the inherent class imbalances in real-world data distributions.
The Mixture of Experts (MoE) framework has emerged as a promising approach to addressing these issues.
However, in MoE systems, experts are typically trained to optimize a collective objective, often neglecting the individual optimality of each expert. This individual optimality usually contributes to the overall performance, as the goals of different experts are not mutually exclusive.
We propose the Independent and Collaborative Learning (ICL) framework to optimize each expert independently while ensuring global optimality.
First, Diverse Optimization Learning (DOL) is introduced to enhance expert diversity and individual performance.
Then, we conceptualize experts as parallel circuit branches and introduce Competition and Collaboration Learning (CoL). Competition Learning amplifies the gradients of better-performing experts to preserve individual optimality, and Collaboration Learning encourages collaboration through mutual distillation to enhance optimal knowledge sharing.
ICL achieves state-of-the-art accuracy in experiments on CIFAR-100/10-LT, ImageNet-LT, and iNaturalist 2018. Our code is available at https://github.com/PolarisLight/ICL.
6116: Language-Guided Hybrid Representation Learning for Visual Grounding on Remote Sensing Images
Authors: Biao Liu, Xu Liu, Lingling Li, Licheng Jiao, Fang Liu, Xinyu Sun, Youlin Huang
Location: Guangzhou | Day: TBD
Show Abstract
Visual grounding (VG) refers to detecting specific objects in images based on linguistic expressions, and it has profound significance in the advanced interpretation of natural images. In remote sensing image interpretation, visual grounding is limited by characteristics such as complex scenes and diverse object sizes. To solve this problem, we propose a novel remote sensing visual grounding (RSVG) framework, named the language-guided hybrid representation learning Transformer (LGFormer). Specifically, we design a multimodal dual-encoder Transformer structure called the adaptive multimodal feature fusion module. This structure innovatively integrates text and visual features as hybrid queries, enabling early-stage decoding queries to perceive the target position accurately. Then, the different modal information from the dual encoders is aggregated by hybrid queries to obtain the final object embedding for coordinate regression. Besides, a multi-scale cross-modal feature enhancement module (MSCM) is designed to enhance the self-representation of the extracted text and visual features and align them semantically. As for the hybrid queries, we use linguistic guidance to select visual features as the visual part and sentence-level features as the textual part. Finally, LGFormer achieves the best results compared to existing models on the DIOR-RSVG and OPT-RSVG datasets.
6148: DUQ: Dual Uncertainty Quantification for Text-Video Retrieval
Authors: Xin Liu, Shibai Yin, Jun Wang, Jiaxin Zhu, Xingyang Wang, Yee-Hong Yang
Location: Guangzhou | Day: TBD
Show Abstract
Text-video retrieval establishes accurate similarity relationships between text and video through feature enhancement and granularity alignment. However, relying solely on similarity to associate intra-pair features and distinguish inter-pair features is insufficient, e.g., when querying a multi-scene video with sparse text or selecting the most relevant video from many similar candidates. In this paper, we propose a novel Dual Uncertainty Quantification (DUQ) model that separately handles uncertainties in intra-pair interaction and inter-pair exclusion. Specifically, to enhance intra-pair interaction, we propose an intra-pair similarity uncertainty module to provide similarity-based trustworthy predictions and explicitly model this uncertainty. To increase inter-pair exclusion, we propose an inter-pair distance uncertainty module to construct a distance-based diversity probability embedding, thereby widening the gap between similar features. The two components work synergistically, jointly improving the calculation of similarity between features. We evaluate our model on six benchmark datasets: MSRVTT (51.2%), DiDeMo, MSVD, LSMDC, Charades, and VATEX, achieving state-of-the-art retrieval performance.
6149: A Fast and Accurate ANN-SNN Conversion Algorithm with Negative Spikes
Authors: Xu Wang, Dongchen Zhu, Jiamao Li
Location: Guangzhou | Day: TBD
Show Abstract
Spiking neural networks (SNNs) are event-driven neural networks that can greatly reduce the power consumption of conventional artificial neural networks (ANNs). Many ANN models can be converted to SNN models when the activation function is ReLU. For ANN models with other activation functions, such as Leaky ReLU, the converted SNN models either suffer from serious accuracy degradation or require a long time step. In this paper, we propose a fast and accurate ANN-SNN conversion algorithm for models with the Leaky ReLU activation. We design a novel neuron model that supports negative spikes. To address the long-tail distribution of activation values, we propose a threshold optimization algorithm based on the variance of the activation values. To avoid error accumulation, we jointly calibrate all layers in the SNN model with adaptive weighting. Experimental results verify the effectiveness of the proposed algorithm.
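The negative-spike idea follows directly from rate coding: for a constant input x, positive spikes fire at rate x / v_th, while a wider negative threshold makes negative spikes fire at rate alpha * |x| / v_th, so the signed spike count approximates Leaky ReLU. The sketch below is a simplified assumption of such a neuron, not the paper's exact dynamics:

```python
import numpy as np

def signed_spike_neuron(inputs, v_th, alpha=0.1):
    """Integrate-and-fire neuron with signed spikes whose firing rate
    approximates LeakyReLU(x) = x if x > 0 else alpha * x."""
    v, out = 0.0, []
    for x in inputs:                  # one weighted input per time step
        v += x
        if v >= v_th:
            out.append(1.0)           # positive spike, subtract threshold
            v -= v_th
        elif v <= -v_th / alpha:      # wider negative threshold => rate alpha*|x|
            out.append(-1.0)
            v += v_th / alpha
        else:
            out.append(0.0)
    return np.array(out)
```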
6153: RobustHAR: Multi-scale Spatial-temporal Masked Self-supervised Pre-training for Robust Human Activity Recognition
Authors: Xiao Liu, Guan Yuan, Yanmei Zhang, Shang Liu, Qiuyan Yan
Location: Guangzhou | Day: TBD
Show Abstract
Human activity recognition (HAR) is prone to performance degradation in real-world applications due to missing data within and across sensor channels. Masked modeling, a mainstream paradigm of self-supervised pre-training, can learn robust cross-sensor representations in data-missing scenarios by reconstructing masked content from the unmasked part. However, existing methods predominantly emphasize the temporal dynamics of human activities, which limits their ability to effectively capture the spatial interdependencies among multiple sensors. Besides, different human activities often span various spatial-temporal scales, causing activity recognizers to miss intricate spatial-temporal semantic information. To address these issues, we propose RobustHAR, a new HAR model with multi-scale spatial-temporal masked self-supervised pre-training designed to improve model performance in data-missing contexts. RobustHAR involves three main steps: (1) RobustHAR constructs location-inspired spatial-temporal 3D-variation modeling to capture spatial-temporal correlated information in human activity data. (2) RobustHAR then designs multi-scale spatial-temporal masked self-supervised pre-training with semantic-consistent multi-scale feature co-learning for learning robust features at different scales. (3) Finally, RobustHAR fine-tunes the pre-trained model with adaptive multi-scale feature fusion for human activity recognition. Extensive experiments on three public multi-sensor datasets demonstrate that RobustHAR outperforms existing state-of-the-art methods.
6154: Stackelberg vs. Nash in the Lottery Colonel Blotto Game
Authors: Yan Liu, Bonan Ni, Weiran Shen, Zihe Wang, Jie Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Resource competition problems are often modeled using Colonel Blotto games, where players take simultaneous actions. However, many real-world scenarios involve sequential decision-making rather than simultaneous moves.

To model these dynamics, we represent the Lottery Colonel Blotto game as a Stackelberg game, in which one player, the leader, commits to a strategy first, and the other player, the follower, responds. We derive the Stackelberg equilibrium for this game, formulating the leader’s strategy as a bi-level optimization problem.

To solve this, we develop a constructive method based on iterative game reductions, which allows us to efficiently compute the leader’s optimal commitment strategy in polynomial time. Additionally, we identify the conditions under which the Stackelberg equilibrium coincides with the Nash equilibrium. Specifically, this occurs when the budget ratio between the leader and the follower equals a certain threshold, which we can calculate in closed form. In some instances, we observe that when the leader’s budget exceeds this threshold, both players achieve higher utilities in the Stackelberg equilibrium compared to the Nash equilibrium.
Lastly, we show that, in the best case, committing to an optimal first move gives the leader an infinite utility improvement over the Nash equilibrium.
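To make the setting concrete, the toy grid search below contrasts the leader's commitment payoff under the lottery contest success function x/(x+y); the budgets, battlefield values, and grid resolution are our assumptions, and the paper derives the equilibrium in closed form rather than numerically.

    import numpy as np

    values = np.array([1.0, 2.0])            # battlefield values
    B_leader, B_follower = 2.0, 1.0          # budgets
    grid = np.linspace(0.01, 0.99, 99)       # fraction placed on battlefield 1

    def alloc(f, budget):
        return np.array([f, 1.0 - f]) * budget

    def utility(mine, theirs):
        """Expected value under the lottery contest success function."""
        return values @ (mine / (mine + theirs))

    def best_response(x):
        g = max(grid, key=lambda g: utility(alloc(g, B_follower), x))
        return alloc(g, B_follower)

    # Stackelberg: the leader anticipates the follower's best response.
    f_star = max(grid, key=lambda f: utility(alloc(f, B_leader),
                                             best_response(alloc(f, B_leader))))
    x_star = alloc(f_star, B_leader)
    print("commitment:", x_star, "leader utility:",
          utility(x_star, best_response(x_star)))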
6162: DToMA: Training-free Dynamic Token MAnipulation for Long Video Understanding
Authors: Bowen Yuan, Sisi You, Bing-Kun Bao
Location: Guangzhou | Day: TBD
Show Abstract
Video Large Language Models (VideoLLMs) often require thousands of visual tokens to process long videos, leading to substantial computational costs that are further exacerbated by visual token inefficiency. Existing token reduction and alternative video representation methods improve efficiency but often compromise comprehension abilities. In this work, we analyze the reasoning processes of VideoLLMs in the multiple-choice VideoQA task, identifying three reasoning stages (shallow, intermediate, and deep) that closely mimic human cognitive processing. Our analysis reveals specific inefficiencies at each stage: in shallow layers, VideoLLMs attempt to memorize all video details without prioritizing relevant content; in intermediate layers, models fail to re-examine uncertain content dynamically; and in deep layers, they continue processing video even when sufficiently confident. To bridge this gap, we propose DToMA, a training-free Dynamic Token MAnipulation method inspired by human adjustment mechanisms in three aspects: 1) Text-guided keyframe-aware reorganization to prioritize keyframes and reduce redundancy, 2) Uncertainty-based visual injection to revisit content dynamically, and 3) Early-exit pruning to halt visual token processing when confident. Experiments on 6 long video understanding benchmarks show that DToMA enhances both efficiency and comprehension, outperforming state-of-the-art methods and generalizing well across 3 VideoLLM architectures and sizes. Code is available at https://github.com/yuanrr/DToMA.
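The early-exit idea in 3) can be sketched in a few lines (a schematic with dummy layers and an assumed entropy threshold, not DToMA itself): once the answer distribution probed at some layer is confident enough, visual tokens are dropped from the remaining layers.

    import numpy as np

    def entropy(p):
        return float(-np.sum(p * np.log(p + 1e-12)))

    def forward_with_early_exit(layers, tokens, n_visual, probe, tau=0.5):
        """layers: callables; probe maps hidden states to answer probabilities."""
        for depth, layer in enumerate(layers):
            tokens = layer(tokens)
            if entropy(probe(tokens)) < tau:   # confident: prune visual tokens
                tokens = tokens[n_visual:]     # visual tokens assumed to come first
                for rest in layers[depth + 1:]:
                    tokens = rest(tokens)
                break
        return tokens

    rng = np.random.default_rng(0)
    layers = [lambda t: t + 0.01 * rng.normal(size=t.shape) for _ in range(4)]
    probe = lambda t: np.array([0.97, 0.01, 0.01, 0.01])   # dummy, always confident
    out = forward_with_early_exit(layers, rng.normal(size=(16, 8)),
                                  n_visual=10, probe=probe)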
6170: Conditional Independent Test in the Presence of Measurement Error with Causal Structure Learning
Authors: Hongbin Zhang, Kezhou Chen, Nankai Lin, Aimin Yang, Zhifeng Hao, Zhengming Chen
Location: Guangzhou | Day: TBD
Show Abstract
Testing conditional independence is a critical task, particularly in causal discovery and learning in Bayesian networks. However, in many real-world scenarios, variables are often measured with errors, such as those introduced by insufficient measurement accuracy, complicating the testing process. This paper focuses on testing conditional independence in the linear non-Gaussian measurement error model, under the condition that the measurement error noise follows a Gaussian distribution. By leveraging high-order cumulants, we derive rank constraints on the cumulant matrix and establish their role in effectively assessing conditional independence, even in the presence of measurement errors. Based on these theoretical results, we use the rank constraints of the cumulant matrix as a tool for conditional independence testing and incorporate it into the PC algorithm, resulting in the PC-ME algorithm, a method designed to learn causal structures from observed data while accounting for measurement errors. Experimental results demonstrate that the proposed method outperforms existing approaches, particularly in cases where other methods encounter difficulties.
6174: Empowering Multimodal Road Traffic Profiling with Vision Language Models and Frequency Spectrum Fusion
Authors: Haolong Xiang, Xiaolong Xu, Guangdong Wang, Xuyun Zhang, Xiaoyong Li, Qi Zhang, Amin Beheshti, Wei Fan
Location: Guangzhou | Day: TBD
Show Abstract
With rapid urbanization in the modern era, smart traffic profiling based on multimodal data sources has been playing a significant role in ensuring safe travel, reducing traffic congestion, and optimizing urban mobility. Most existing methods for traffic profiling at the road level utilize single-modality data, i.e., they mainly focus on image processing with deep vision models or on auxiliary analysis of textual data. However, the joint modeling and multimodal fusion of the textual and visual modalities have rarely been studied in road traffic profiling, which largely hinders the accurate prediction or classification of traffic conditions. To address this issue, we propose a novel multimodal learning and fusion framework for road traffic profiling, named TraffiCFUS. Specifically, given the traffic images, our TraffiCFUS framework first introduces Vision Language Models (VLMs) to generate text and then creates tailored prompt instructions for refining this text according to the specific scene requirements of road traffic profiling. Next, we apply the discrete Fourier transform to convert multimodal data from the spatial domain to the frequency domain and perform a cross-modal spectrum transform to filter out information irrelevant to traffic profiling. Furthermore, the processed spatial multimodal data is combined to generate a fusion loss and an interaction loss with contrastive learning. Finally, extensive experiments on four real-world datasets demonstrate superior performance compared with state-of-the-art approaches.
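A minimal sketch of the frequency-domain step (our simplification: 1-D features, a fixed low-pass mask, and equal mixing weights) looks like this:

    import numpy as np

    def spectrum_fuse(img_feat, txt_feat, keep=0.25):
        """DFT both modal features, keep low frequencies, mix, and invert."""
        fi, ft = np.fft.fft(img_feat), np.fft.fft(txt_feat)
        k = int(keep * len(fi))
        mask = np.zeros(len(fi))
        mask[:k] = 1.0                           # low-pass: drop high-frequency
        mask[-k:] = 1.0                          # noise on both ends of the spectrum
        fused = 0.5 * (fi + ft) * mask           # cross-modal spectrum mixing
        return np.fft.ifft(fused).real

    rng = np.random.default_rng(0)
    fused = spectrum_fuse(rng.normal(size=256), rng.normal(size=256))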
6181: DGCPL: Dual Graph Distillation for Concept Prerequisite Relation Learning
Authors: Miao Zhang, Jiawei Wang, Jinying Han, Kui Xiao, Zhifei Li, Yan Zhang, Hao Chen, Shihui Wang
Location: Guangzhou | Day: TBD
Show Abstract
Concept prerequisite relations determine the learning order of knowledge concepts in a domain, which has an important impact on teachers’ course design and students’ personalized learning. Current research usually predicts concept prerequisite relations from the perspective of knowledge and rarely pays attention to the role of learners’ learning behavior. We propose a Dual Graph Distillation Method for Concept Prerequisite Relation Learning (DGCPL). Specifically, DGCPL constructs a dual graph structure from both the knowledge and learning behavior perspectives, and captures high-order knowledge features and learning behavior features through a concept-resource hypergraph and a learning behavior graph, respectively. In addition, we introduce gated knowledge distillation to fuse the structural information of concept nodes in the two graphs, so as to obtain a more comprehensive concept embedding representation and achieve accurate prediction of prerequisite relations. On three public benchmark datasets, we compare DGCPL with eight graph-based baseline methods and five traditional classification baseline methods. The experimental results show that DGCPL achieves state-of-the-art performance in learning concept prerequisite relations. Our code is available at https://github.com/wisejw/DGCPL.
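A gated fusion of the two graph views might look like the following sketch (the dimension and the sigmoid-gate form are assumptions; in the paper the gate is part of a distillation objective):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64
    W = rng.normal(scale=0.1, size=(2 * d, d))   # gate projection (trained in practice)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gated_fuse(h_knowledge, h_behavior):
        """Blend concept embeddings from the hypergraph and the behavior graph."""
        gate = sigmoid(np.concatenate([h_knowledge, h_behavior]) @ W)  # per-dimension
        return gate * h_knowledge + (1.0 - gate) * h_behavior

    h = gated_fuse(rng.normal(size=d), rng.normal(size=d))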
6184: FedCCH: Automatic Personalized Graph Federated Learning for Inter-Client and Intra-Client Heterogeneity
Authors: Pengfei Jiao, Zian Zhou, Meiting Xue, Huijun Tang, Zhidong Zhao, HuaMing Wu
Location: Guangzhou | Day: TBD
Show Abstract
Graph federated learning (GFL) is increasingly utilized in domains such as social network analysis and recommendation systems, where non-IID data exist extensively and necessitate a strong emphasis on personalized learning. However, existing methods focus only on personalization across different clients and overlook personalization within a client, which widely exists in real social networks: intra-client personalization addresses the heterogeneity of a client's own data, while inter-client personalization tackles heterogeneity across clients under privacy constraints. In this paper, we propose a novel automatic personalized graph federated learning (PGFL) scheme named FedCCH to capture both inter-client and intra-client heterogeneity. For intra-client heterogeneity, we innovatively propose a learnable Personalized Factor (PF) to automatically normalize each graph representation within clients via learnable parameters, which weakens the impact of non-IID data distributions. For inter-client heterogeneity, we propose a novel hash-based similarity clustering method that generates a hash signature for each client and then groups similar clients for joint training. Ultimately, we collaboratively train the intra-client and inter-client modules to better capture the heterogeneity of clients' graph data. Experimental results demonstrate that FedCCH outperforms other state-of-the-art baseline methods.
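One simple way to realize hash signatures for client grouping is random-projection (SimHash-style) signing, sketched below under our own assumptions about dimensions and thresholds:

    import numpy as np

    rng = np.random.default_rng(0)
    d, bits = 128, 32
    planes = rng.normal(size=(bits, d))              # shared random hyperplanes

    def signature(client_repr):
        """Sign a client's mean graph representation into a compact bit vector."""
        return (planes @ client_repr > 0).astype(np.uint8)

    def hamming(a, b):
        return int(np.sum(a != b))

    clients = [rng.normal(size=d) + (i % 2) * 2.0 for i in range(6)]  # two latent groups
    sigs = [signature(c) for c in clients]
    pairs = [(i, j) for i in range(6) for j in range(i + 1, 6)
             if hamming(sigs[i], sigs[j]) < bits // 4]   # candidates for joint training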
6195: DGraFormer: Dynamic Graph Learning Guided Multi-Scale Transformer for Multivariate Time Series Forecasting
Authors: Han Yan, Dongliang Chen, Guiyuan Jiang, Bin Wang, Lei Cao, Junyu Dong, Yanwei Yu
Location: Guangzhou | Day: TBD
Show Abstract
Multivariate time series forecasting is a critical focus across many fields. Existing transformer-based models have overlooked the explicit modeling of inter-variable correlations. Similarly, graph-based methods have failed to address the dynamic nature of multivariate correlations and the noise in correlation modeling. To overcome these challenges, we propose a novel Dynamic Graph Learning Guided Multi-Scale Transformer (DGraFormer) for multivariate time series forecasting. Specifically, our method consists of two main components: dynamic correlation-aware graph learning (DCGL) and a multi-scale temporal transformer (MTT). The former captures dynamic correlations across different time windows, filters out noise, and selects key weights to guide the aggregation of relevant feature representations. The latter effectively extracts temporal patterns from patch data at varying scales. Together, they capture rich local correlation graph structures and multi-scale global temporal features. Experimental results demonstrate that DGraFormer significantly outperforms existing state-of-the-art models on ten real-world datasets, achieving the best performance across multiple evaluation metrics. The source code of our model is available at https://anonymous.4open.science/r/DGraFormer.
6224: Revisiting Proportional Allocation with Subsidy: Simplification and Improvements
Authors: Xiaowei Wu, Quan Xue, Shengwei Zhou
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we revisit the problem of fair allocation with subsidy. We first consider the allocation of m indivisible chores to n agents with additive (dis)utility functions. Under the assumption that the maximum (dis)utility of an item can be compensated by one dollar, Wu et al. (WINE 2023) showed that a total of n/4 dollars suffices to guarantee a proportional allocation by rounding fractional allocations. Their subsidy guarantee is optimal when n is even; for odd n, a small gap remains between the upper and lower bounds on the total subsidy. We propose a much simpler algorithm that achieves an optimal subsidy guarantee for all values of n. Different from existing works, our algorithm does not require the computation and rounding of fractional allocations and admits a much simpler analysis. We further show that our algorithm and analysis framework can be extended to mixtures of (subjective) goods and chores, again achieving the optimal subsidy guarantee.
6226: Causality-Inspired Disentanglement for Fair Graph Neural Networks
Authors: Guixian Zhang, Debo Cheng, Guan Yuan, Shang Liu, Yanmei Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Fair graph neural networks aim to eliminate discriminatory biases in predictions. Existing approaches often rely on adversarial learning to mitigate dependencies between sensitive attributes and labels but face challenges due to optimisation difficulties. A key limitation lies in neglecting intrinsic causality, which may lead to the entanglement of sensitive and causal factors, discarding causal factors or retaining sensitive factors in the final prediction, especially on unbalanced datasets.
To address this issue, we propose a Causality-inspired Disentangled framework for Fair Graph neural networks (CDFG). In CDFG, node representations are conceptualised as a combination of causal and sensitive factors, enabling fair representation learning by only utilising the causal factors. We first use a counterfactual data generation mechanism to generate counterfactual data with similar causal factors but completely different sensitive factors. Then, we input real-world data and counterfactual data into the factor disentanglement module to achieve independence and disentanglement between the causal factors and sensitive factors. Finally, an adaptive mask module extracts the causal representation for fair and accurate graph-based predictions.
Extensive experiments on three widely used datasets demonstrate that CDFG consistently outperforms existing methods, achieving competitive utility and significantly improved fairness.
6244: Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation
Authors: Ning Wang, Zihan Yan, Weiyang Li, Chuan Ma, He Chen, Tao Xiang
Location: Guangzhou | Day: TBD
Show Abstract
Embodied agents exhibit immense potential across a multitude of domains, making the assurance of their behavioral safety a fundamental prerequisite for their widespread deployment. However, existing research predominantly concentrates on the security of general large language models, lacking specialized methodologies for establishing safety benchmarks and input moderation tailored to embodied agents. To bridge this gap, this paper introduces a novel input moderation framework designed to safeguard embodied agents. The framework covers the entire pipeline, including taxonomy definition, dataset curation, moderator architecture, model training, and rigorous evaluation. Notably, we introduce EAsafetyBench, a safety benchmark engineered to facilitate both the training and stringent assessment of moderators specifically designed for embodied agents. Furthermore, we propose Pinpoint, an innovative prompt-decoupled input moderation scheme that harnesses a masked attention mechanism to effectively isolate and mitigate the influence of functional prompts on moderation tasks. Extensive experiments conducted on diverse benchmark datasets and models validate the feasibility and efficacy of the proposed approach. The results demonstrate that our methodologies achieve an average detection accuracy of 94.58%, surpassing existing state-of-the-art techniques, alongside a moderation processing time of merely 0.002 seconds per instance. The source code and datasets can be found at https://github.com/ZihanYan-CQU/EAsafetyBench.
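The masked-attention idea behind Pinpoint can be illustrated as follows (a single-head sketch with assumed shapes, not the released implementation): moderation features are computed while content tokens are prevented from attending to the functional-prompt tokens.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def masked_attention(Q, K, V, is_prompt):
        """is_prompt: boolean vector marking functional-prompt positions."""
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        scores[:, is_prompt] = -1e9        # block attention onto prompt tokens
        return softmax(scores) @ V

    rng = np.random.default_rng(0)
    Q = K = V = rng.normal(size=(12, 16))
    is_prompt = np.array([True] * 4 + [False] * 8)   # first 4 tokens are the prompt
    out = masked_attention(Q, K, V, is_prompt)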
6252: Enhancing Multimodal Model Robustness Under Missing Modalities via Memory-Driven Prompt Learning
Authors: Yihan Zhao, Wei Xi, Xiao Fu, Jizhong Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Existing multimodal models typically assume the availability of all modalities, leading to significant performance degradation when certain modalities are missing. Recent methods have introduced prompt learning to adapt pretrained models to incomplete data, achieving remarkable performance when the missing cases are consistent during training and inference. However, these methods rely heavily on distribution consistency and fail to compensate for missing modalities, limiting their ability to generalize to unseen missing cases. To address this issue, we propose Memory-Driven Prompt Learning, a framework that adaptively compensates for missing modalities through prompt learning. The compensation strategies are achieved by two types of prompts: generative prompts and shared prompts. Generative prompts retrieve semantically similar samples from a predefined prompt memory that stores modality-specific semantic information, while shared prompts leverage available modalities to provide cross-modal compensation. Extensive experiments demonstrate the effectiveness of the proposed model, achieving significant improvements across diverse missing-modality scenarios, with average performance increasing from 34.76% to 40.40% on MM-IMDb, 62.71% to 77.06% on Food101, and 60.40% to 62.77% on Hateful Memes. The code is available at https://github.com/zhao-yh20/MemPrompt.
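The generative-prompt retrieval can be pictured with a small nearest-neighbor sketch (memory size, key/value split, and top-k averaging are our assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    memory_keys = rng.normal(size=(100, 64))   # semantic keys of stored prompts
    memory_vals = rng.normal(size=(100, 64))   # the prompt vectors themselves

    def retrieve_prompt(query, k=5):
        """Average the top-k memory prompts whose keys best match the query."""
        sims = memory_keys @ query / (np.linalg.norm(memory_keys, axis=1)
                                      * np.linalg.norm(query) + 1e-12)
        top = np.argsort(sims)[-k:]
        return memory_vals[top].mean(axis=0)

    prompt = retrieve_prompt(rng.normal(size=64))  # stands in for a missing modality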
6254: A Little Subsidy Ensures MMS Allocation for Three Agents
Authors: Xiaowei Wu, Quan Xue, Shengwei Zhou
Location: Guangzhou | Day: TBD
Show Abstract
We consider the problem of fair allocation of m indivisible items to a group of n agents with subsidies (money). We address scenarios where agents have general additive cost/utility functions. Our work primarily focuses on the special case of three agents. Assuming that the maximum cost/utility of an item to an agent can be compensated by one dollar, we demonstrate that a total subsidy of 1/6 dollars is sufficient to ensure the existence of Maximin Share (MMS) allocations for both goods and chores. Additionally, we provide examples to establish the lower bounds of the required subsidies.
6260: Efficient Hi-Fi Style Transfer via Statistical Attention and Modulation
Authors: Zhirui Fang, Yi Li, Xin Xie, Chengyan Li, Yanqing Guo
Location: Guangzhou | Day: TBD
Show Abstract
Style transfer is a challenging task in computer vision, aiming to blend the stylistic features of one image with the content of another while preserving the content details. Traditional methods often face challenges in terms of computational efficiency and fine-grained content preservation. In this paper, we propose a novel feature modulation mechanism based on parameterized normalization, where the modulation parameters for content and style features are learned using a dual convolution network (BiConv). These parameters adjust the mean and standard deviation of the features, improving both the stability and quality of the style transfer process. To achieve fast inference, we introduce an efficient acceleration technique by leveraging a row and column weighted attention matrix. In addition, we incorporate a contrastive learning scheme to align the local features of the content and the stylized images, improving the fidelity of the generated output. Experimental results demonstrate that our method significantly improves the inference speed and the quality of style transfer while preserving content details, outperforming existing approaches based on both convolution and diffusion.
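Statistics-based modulation of this kind is in the spirit of adaptive instance normalization; the sketch below (with scalar learned offsets a and b standing in for the BiConv-predicted parameters, which we do not reproduce) shows the mean/standard-deviation adjustment:

    import numpy as np

    def modulate(content, style, a=1.0, b=0.0):
        """Shift content feature statistics toward (adjusted) style statistics."""
        mu_c, std_c = content.mean(), content.std() + 1e-6
        mu_s, std_s = style.mean(), style.std() + 1e-6
        normalized = (content - mu_c) / std_c          # whiten content features
        return normalized * (a * std_s) + (mu_s + b)   # learned scale and shift

    rng = np.random.default_rng(0)
    stylized = modulate(rng.normal(size=(64, 64)),
                        2.0 * rng.normal(size=(64, 64)) + 1.0)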
6264: ADFormer: Aggregation Differential Transformer for Passenger Demand Forecasting
Authors: Haichen Wang, Liu Yang, Xinyuan Zhang, Haomin Yu, Ming Li, Jilin Hu
Location: Guangzhou | Day: TBD
Show Abstract
Passenger demand forecasting helps optimize vehicle scheduling, thereby improving urban efficiency. Recently, attention-based methods have been used to capture the dynamic nature of spatio-temporal data. However, existing methods that rely on heuristic masking strategies cannot fully adapt to complex spatio-temporal correlations, hindering the model from focusing on the right context. These works also overlook the high-level correlations that exist in the real world, and effectively integrating these high-level correlations with the original correlations is crucial. To fill this gap, we propose the Aggregation Differential Transformer (ADFormer), which offers new insights for improving demand forecasting. Specifically, we utilize Differential Attention to capture the original spatial correlations and achieve attention denoising. Meanwhile, we design distinct aggregation strategies based on the nature of space and time. The original correlations are then unified with the high-level correlations, enabling the model to capture holistic spatio-temporal relations. Experiments conducted on taxi and bike datasets confirm the effectiveness and efficiency of our model, demonstrating its practical value. The code is available at https://github.com/decisionintelligence/ADFormer.
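Differential Attention follows the published formulation of subtracting two attention maps to cancel common-mode noise; here is a single-head sketch (with an assumed constant lambda that is learned in practice):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def diff_attention(Q1, K1, Q2, K2, V, lam=0.5):
        """Subtract two attention maps to cancel shared attention noise."""
        d = Q1.shape[-1]
        A1 = softmax(Q1 @ K1.T / np.sqrt(d))
        A2 = softmax(Q2 @ K2.T / np.sqrt(d))
        return (A1 - lam * A2) @ V

    rng = np.random.default_rng(0)
    Q1, K1, Q2, K2 = (rng.normal(size=(32, 16)) for _ in range(4))
    out = diff_attention(Q1, K1, Q2, K2, rng.normal(size=(32, 16)))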
6276: ID-RemovalNet: Identity Removal Network for EEG Privacy Protection with Enhancing Decoding Tasks
Authors: Huabin Wang, Jie Ruan, Cunhang Fan, Yingfan Cheng, Zhao Lv
Location: Guangzhou | Day: TBD
Show Abstract
Electroencephalogram (EEG) signals contain not only decoding-task information but also personal identity information. If they are stolen or attacked, the user’s brain-computer interaction behavior may be maliciously manipulated. Existing EEG identity privacy protection generally adopts generative methods or adds tiny perturbations, which can protect the identity privacy in EEG signals to some extent; however, these methods also damage decoding-task performance. To solve these problems, this paper proposes an identity removal network (ID-RemovalNet) that achieves EEG privacy protection while improving the classification accuracy of the decoding task. Firstly, an identity decorrelation separation module is constructed to accurately remove identity features for privacy protection while reducing interference with the task decoding features. Secondly, a multi-domain multi-level fusion feature extraction module is designed to extract high-quality EEG time-frequency features. Finally, a feature enhancement module compensates for the loss of task decoding features and excites dominant feature selection during identity feature removal. The experimental results show that ID-RemovalNet reduces identity information to 0.43% on four EEG datasets with two different paradigms, significantly improves EEG task decoding accuracy by 3.28%, and achieves state-of-the-art performance in cross-subject EEG experiments.
6286: GPL4SRec: Graph Multi-Level Aware Prompt Learning for Streaming Recommendation
Authors: Hao Cang, Huanhuan Yuan, Jiaqing Fan, Lei Zhao, Guanfeng Liu, Pengpeng Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Streaming Recommendation (SRec) aims to capture evolving user preferences in streaming scenarios. Recently, Graph Prompt Learning (GPL) methods have demonstrated their effectiveness and adaptability within SRec. However, existing graph prompt solutions rarely consider the evolution of multi-hop cascading relationships between users and items, which is crucial for modeling shifts in user preferences. To address this problem, we propose a novel Graph Multi-Level Aware Prompt Learning method for Streaming Recommendation, named GPL4SRec. Specifically, a graph encoder is first pre-trained on extensive historical data to capture users' long-term preferences. Then, we design three types of prompts, namely node-aware, structure-aware, and layer-aware prompts, to guide the pre-trained encoder to better capture users' short-term preferences. This is accomplished by accounting for both the incremental changes in users and items and the cascading evolution of multi-hop relationships. Furthermore, we provide a theoretical analysis showing that our prompt templates are critical to achieving superior performance. Finally, experimental results show that our model significantly outperforms state-of-the-art approaches in SRec.
6301: DualCast: A Model to Disentangle Aperiodic Events from Traffic Series
Authors: Xinyu Su, Feng Liu, Yanchuan Chang, Egemen Tanin, Majid Sarvi, Jianzhong Qi
Location: Guangzhou | Day: TBD
Show Abstract
Traffic forecasting is crucial for transportation systems optimisation. Current models minimise the mean forecasting errors, often favouring periodic events prevalent in the training data, while overlooking critical aperiodic ones like traffic incidents. To address this, we propose DualCast, a dual-branch framework that disentangles traffic signals into intrinsic spatial-temporal patterns and external environmental contexts, including aperiodic events. DualCast also employs a cross-time attention mechanism to capture high-order spatial-temporal relationships from both periodic and aperiodic patterns. DualCast is versatile. We integrate it with recent traffic forecasting models, consistently reducing their forecasting errors by up to 9.6% on multiple real datasets.
6311: Omni-Dimensional State Space Model-driven SAM for Pixel-level Anomaly Detection
Authors: Chao Huang, Qianyi Li, Jie Wen, Bob Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Pixel-level anomaly detection is indispensable in industrial defect detection and medical diagnosis. Recently, the Segment Anything Model (SAM) has achieved promising results in many vision tasks. However, directly applying SAM to pixel-level anomaly detection yields unsatisfactory performance, and SAM requires manual prompts. Although some automatic prompting approaches for SAM have been proposed, they merely utilize partial image features as prompts and fail to incorporate crucial features, such as multi-scale image features, to generate more suitable prompts. In this paper, we propose a novel Omni-Dimensional State Space Model-driven SAM (ODS-SAM) for pixel-level anomaly detection. Specifically, the proposed method adopts the SAM architecture, ensuring easy implementation and avoiding the need for fine-tuning. A state-space-model-based residual omni-dimensional module is designed to automatically generate suitable prompts. This module effectively leverages multi-scale and global information, facilitating an iterative search for optimal prompts in the prompt space. The identified optimal prompts are then fed into SAM as high-dimensional tensors. Experimental results demonstrate that the proposed ODS-SAM outperforms state-of-the-art models on both industrial and medical image datasets.
6338: A Generalized Diffusion Framework with Learnable Propagation Dynamics for Source Localization
Authors: Dongpeng Hou, Yuchen Wang, Chao Gao, Xianghua Li
Location: Guangzhou | Day: TBD
Show Abstract
Source localization has been widely studied in recent years due to its crucial role in controlling the spread of harmful information. Existing methods only achieve satisfactory performance within a specific propagation model, which restricts their applicability and generalizability across different scenarios. To address this, we propose a Generalized Diffusion Framework for Source Localization (GDFSL), which enhances probabilistic diffusion models to flexibly capture the underlying dynamics of various propagation scenarios. By redefining the forward diffusion process, GDFSL ensures convergence to a real distribution of infected states that accurately represents the targeted dynamics, enabling the model to learn unbiased noise in a self-supervised manner that encodes fine-grained propagation characteristics. A closed-form reverse diffusion process is then derived to trace the propagation back to the source. The process does not rely on an explicit source label term, facilitating direct inference of sources from observed data. Experimental results show that GDFSL outperforms SOTA methods in various propagation models, particularly in scenarios where historical training data is limited or unavailable. The code is available at https://github.com/cgao-comp/GDFSL.
6341: Object-Level Backdoor Attacks in RGB-T Semantic Segmentation with Cross-Modality Trigger Optimization
Authors: Xianghao Jiao, Di Wang, Jiawei Liang, Jianjie Huang, Wei Wang, Xiaochun Cao
Location: Guangzhou | Day: TBD
Show Abstract
The escalating threat of backdoor risks in deep vision models is a pressing concern. Existing research on backdoor attacks is often confined to a single modality, neglecting the challenges posed by multi-modality scene perception. This work pioneers backdoor attacks in RGB-Thermal (RGB-T) semantic segmentation. We overcome a critical limitation of current segmentation backdoor attacks, which indiscriminately compromise all objects of a victim class and fail to provide fine-grained control for selectively targeting specific objects as required by adversaries. To address this, we introduce a novel Object-level Backdoor Attack pipeline, termed OBA. OBA first employs precise data poisoning (PDP) to lock onto a specific victim object. Specifically, PDP embeds the trigger into only the victim object and modifies its label’s pixels at the corresponding positions, thus enabling object-level attacks. In addition, the domain gap between static single-modality triggers and multi-modality scenarios limits PDP. We therefore introduce a Cross-Modality Trigger Generation (CMTG) method. Through style designs of triggers and cross-modality trigger co-optimization, the target domain semantics and multi-modality model perception patterns are encoded into the triggers, achieving high effectiveness, stealth, and physical feasibility. Extensive experiments show that the proposed OBA enables precise manipulation of the designated object within the specific class.
6354: CoLA-Former: Graph Transformer Using Communal Linear Attention for Lightweight Sequential Recommendation
Authors: Zhongying Zhao, Jinyu Zhang, Chuanxu Jia, Chao Li, Yanwei Yu, Qingtian Zeng
Location: Guangzhou | Day: TBD
Show Abstract
Graph Transformers have shown great promise in capturing the dynamics of user preferences for sequential recommendation. However, the self-attention mechanism within their structure has quadratic complexity, posing challenges for deployment on devices with limited resources. To this end, we propose a Communal Linear Attention-enhanced Graph TransFormer for lightweight sequential recommendation, namely CoLA-Former. Specifically, we introduce a Communal Linear Attention (CoLAttention) mechanism. It utilizes low-rank yet reusable communal units to calculate global correlations on sequential graphs. The weights from these units are also shared across different training batches, enabling inter-batch global weighting. Moreover, we devise a low-rank approximation component that utilizes weight distillation to reduce the scale of the trainable parameters in the Graph Transformer network. Extensive experimental results on three real-world datasets demonstrate that the proposed CoLA-Former significantly outperforms twelve state-of-the-art methods in accuracy and efficiency. The datasets and codes are available at https://github.com/ZZY-GraphMiningLab/CoLA_Former.
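The complexity argument is easiest to see in generic kernelized linear attention (shown below with an elu+1 feature map; CoLAttention's communal units add reuse across batches on top of this idea): the d-by-d summary phi(K)^T V is computed once, so no N-by-N attention matrix is ever materialized.

    import numpy as np

    def elu_plus_one(x):
        return np.where(x > 0, x + 1.0, np.exp(x))     # positive feature map

    def linear_attention(Q, K, V):
        """O(N d^2) attention: out_i = phi(q_i) @ (phi(K)^T V) / normalizer."""
        q, k = elu_plus_one(Q), elu_plus_one(K)
        kv = k.T @ V                    # (d, d) reusable summary of keys and values
        z = q @ k.sum(axis=0)           # per-query normalizer
        return (q @ kv) / z[:, None]

    rng = np.random.default_rng(0)
    N, d = 1024, 32
    out = linear_attention(rng.normal(size=(N, d)), rng.normal(size=(N, d)),
                           rng.normal(size=(N, d)))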
6362: Automated Detection of Pre-training Text in Black-box LLMs
Authors: Ruihan Hu, Yu-Ming Shang, Jiankun Peng, Wei Luo, Yazhe Wang, Xi Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Detecting whether a given text is a member of the pre-training data of Large Language Models (LLMs) is crucial for ensuring data privacy and copyright protection. Most existing methods rely on the LLM’s hidden information (e.g., model parameters or token probabilities), making them ineffective in the black-box setting, where only input and output texts are accessible. Although some methods have been proposed for the black-box setting, they rely on massive manual effort, such as designing complicated questions or instructions. To address these issues, we propose VeilProbe, the first framework for automatically detecting LLMs’ pre-training texts in a black-box setting without human intervention. VeilProbe utilizes a sequence-to-sequence mapping model to infer the latent mapping feature between the input text and the corresponding output suffix generated by the LLM. Then it performs key token perturbations to obtain more distinguishable membership features. Additionally, considering real-world scenarios where ground-truth training text samples are limited, a prototype-based membership classifier is introduced to alleviate the overfitting issue. Extensive evaluations on three widely used datasets demonstrate that our framework is effective and superior in the black-box setting.
6367: Good Advisor for Source Localization: Using Large Language Model to Guide the Source Inference Process
Authors: Dongpeng Hou, Wenfei Wei, Chao Gao, Xianghua Li, Zhen Wang
Location: Guangzhou | Day: TBD
Show Abstract
With the rapid development of large AI model technology, large language models (LLMs) provide a new solution for source localization tasks thanks to their deep linguistic understanding and generation capabilities. However, when LLMs are directly applied to source localization, they struggle to understand complex propagation patterns and network structures, resulting in limited localization accuracy. Meanwhile, the high-dimensional embedding of the textual representation introduces significant amounts of redundant features, which also reduces efficiency in the source localization task. To solve these problems, this paper proposes a multi-modal fusion framework for rumor source localization, namely Contrastive Rumor Source Localization via LLM (CRSLL), based on the idea of contrastive learning. Specifically, the framework constructs propagation embeddings by comprehensively capturing both propagation dynamics and user profile features. It adopts a contrastive learning approach to enhance the representation of rumor-cascade comment embeddings by differentiating them from non-rumor cascade comments, filters out invalid features through a differentiable masking strategy, and fuses comment-modality embeddings with propagation embeddings through an attention mechanism, so as to better capture multi-modal data interactions. Notably, the framework uses the LLM as a good “advisor” that provides rich, deep semantic representations, improving the accuracy of rumor source localization. The code is available at https://github.com/cgao-comp/CRSLL.
6368: High-Confident Local Structure Guided Consensus Graph Learning For Incomplete Multi-view Clustering
Authors: Shuping Zhao, Lunke Fei, Qi Lai, Jie Wen, Jinrong Cui, Tingting Chai
Location: Guangzhou | Day: TBD
Show Abstract
Existing clustering methods for handling incomplete multi-view data primarily concentrate on learning a common representation or graph from the available views, while overlooking the latent information contained in the missing views and the imbalance of information among different views. Furthermore, instances with weak discriminative features usually degrade the precision of the consistent representation or graph across all views. To address these problems, in this paper, we propose a simple but efficient method, called high-confident local structure guided consensus graph learning for incomplete multi-view clustering (HLSCG_IMC). Specifically, the method adaptively learns a strict block diagonal structure from the available samples using a block diagonal representation regularizer. Different from existing methods that use a simple pairwise affinity graph for structure construction, we consider the influence of instances located at the edge of two clusters on the construction of the graph for each view. By harnessing the proposed high-confident strict block diagonal structures, the approach directly guides the learning of a robust consensus graph. A number of experiments have been conducted to verify the efficacy of our approach.
6372: RegionMatch: Pixel-Region Collaboration for Semi-Supervised Semantic Segmentation in Remote Sensing Images
Authors: Xiaoqian Zhu, Xiangrong Zhang, Tianyang Zhang, Chaowei Fang, Xu Tang, Licheng Jiao
Location: Guangzhou | Day: TBD
Show Abstract
Semi-supervised semantic segmentation (S4) has shown significant promise in reducing the burden of labor-intensive data annotation. However, existing methods mainly rely on pixel-level information, neglecting the strong region consistency inherent in remote sensing images (RSIs), which limits their effectiveness in handling the complex and diverse backgrounds of RSIs. To address this, we propose RegionMatch, a novel approach that leverages unlabeled data from a fresh object-level perspective, which is better tailored to the nature of semantic segmentation. We design the Pixel-Region Synergy Pseudo-Labeling strategy, which explicitly injects object-level contextual information into the S4 pipeline and promotes knowledge collaboration between the pixel and region perspectives for generating high-quality pseudo-labels. In addition, we propose the Region Structure-Aware Correlation Consistency, which models object-level relationships by establishing inter-region correlations across images and pixel correlations within regions, providing more effective supervision signals for unlabeled data. Experimental results demonstrate that RegionMatch outperforms state-of-the-art methods on multiple authoritative remote sensing datasets, highlighting its superiority on RSIs.
6376: MHANet: Multi-scale Hybrid Attention Network for Auditory Attention Detection
Authors: Lu Li, Cunhang Fan, Hongyu Zhang, Jingjing Zhang, Xiaoke Yang, Jian Zhou, Zhao Lv
Location: Guangzhou | Day: TBD
Show Abstract
Auditory attention detection (AAD) aims to detect the target speaker in a multi-talker environment from brain signals, such as electroencephalography (EEG), and has made great progress. However, most AAD methods apply attention mechanisms only sequentially and overlook valuable multi-scale contextual information within EEG signals, limiting their ability to capture long- and short-range spatiotemporal dependencies simultaneously. To address these issues, this paper proposes a multi-scale hybrid attention network (MHANet) for AAD, which consists of a multi-scale hybrid attention (MHA) module and a spatiotemporal convolution (STC) module. Specifically, MHA combines channel attention with multi-scale temporal and global attention mechanisms, effectively extracting multi-scale temporal patterns within EEG signals and capturing long- and short-range spatiotemporal dependencies simultaneously. To further improve AAD performance, STC utilizes temporal and spatial convolutions to aggregate expressive spatiotemporal representations. Experimental results show that the proposed MHANet achieves state-of-the-art performance across three datasets with 3 times fewer trainable parameters than the most advanced model. Code is available at: https://github.com/fchest/MHANet.
6378: Generate or Re-Weight? A Mutual-Guidance Method for Class-Imbalanced Graphs
Authors: Zhongying Zhao, Gen Liu, Qi Meng, Chao Li, Qingtian Zeng
Location: Guangzhou | Day: TBD
Show Abstract
Class imbalance is a widespread problem in graph-structured data. Existing studies tailored for class-imbalanced graphs are typically categorized into generative and re-weighting methods. However, the former merely focuses on quantity balance rather than learning balance, while the latter performs fine-tuning in a majority-minority paradigm, overlooking the authentic-generative one. In fact, collaboration between them can relieve their respective limitations. To this end, we propose a Mutual-Guidance method for class-imbalanced graphs, namely GraphMuGu. Specifically, we first design an uncertainty-aware method to quantify the number of synthesized samples for each category. Furthermore, we devise a similarity-aware method to re-weight the importance of authentic and generative samples. To the best of our knowledge, the proposed GraphMuGu is the first attempt to incorporate generative and re-weighting methods into a unified framework. The experimental results on five class-imbalanced datasets demonstrate the superiority of the proposed method. The source codes are available at https://github.com/ZZY-GraphMiningLab/GraphMuGu.
6381: Efficient Dynamic Graphs Learning with Refined Batch Parallel Training
Authors: Zhengzhao Feng, Rui Wang, Longjiao Zhang, Tongya Zheng, Ziqi Huang, Mingli Song
Location: Guangzhou | Day: TBD
Show Abstract
Memory-based temporal graph neural networks (MTGNN) use node memory to store historical information, enabling efficient processing of large dynamic graphs through batch parallel training, with larger batch sizes leading to increased training efficiency. However, this approach overlooks the interdependency among edges within the same batch, leading to outdated memory states and reduced training accuracy. Previous studies have attempted to mitigate this issue through methods such as measuring memory loss, overlap training, and additional compensation modules. Despite these efforts, challenges persist, including imprecise coarse-grained memory loss measurement and ineffective compensation modules. To address these challenges, we propose the Refined Batch parallel Training (RBT) framework, which accurately evaluates intra-batch information loss and optimizes batch partitioning to minimize loss, enhancing the training process’s effectiveness and efficiency. RBT also includes a precise and efficient memory compensation algorithm. Experimental results demonstrate RBT’s superior performance compared to existing MTGNN frameworks like TGL, ETC, and PRES in terms of training efficiency and accuracy across various dynamic graph datasets. Our code is made publicly available at https://github.com/fengwudi/RBT.
6402: A Prior-based Discrete Diffusion Model for Social Graph Generation
Authors: Shu Yin, Dongpeng Hou, Lianwei Wu, Xianghua Li, Chao Gao
Location: Guangzhou | Day: TBD
Show Abstract
Graph generation is essential in social network analysis, particularly for modeling information flow and user interactions. However, existing probabilistic diffusion models face challenges when applied to social propagation graphs. Continuous noise is ill-suited to the discrete nature of graph generation tasks, and the random Gaussian initialization in the reverse process can introduce biases that deviate from real-world propagation patterns. To address these issues, this paper introduces a Prior-based Discrete Diffusion Model (PDDM) for social graph generation. PDDM redefines the forward process as a discrete process of node denoising and edge generation, and the denoising module's task becomes learning node-level connection probabilities. Further, PDDM employs a new starting point for the reverse process by incorporating user similarity as the probability matrix, which better leverages the social context. These developments mitigate reverse-starting bias and enhance model robustness. Moreover, PDDM integrates lightweight deep graph networks such as GAT, demonstrating both scalability and applicability to graph generation scenarios. Comprehensive experiments on real-world social network datasets demonstrate PDDM’s superiority in terms of the MMD metric and downstream tasks. The code is available at https://github.com/cgao-comp/PDDM.
6413: Universal Graph Self-Contrastive Learning
Authors: Liang Yang, Yukun Cai, Hui Ning, Jiaming Zhuo, Di Jin, Ziyi Ma, Yuanfang Guo, Chuan Wang, Zhen Wang
Location: Guangzhou | Day: TBD
Show Abstract
As a pivotal architecture in Self-Supervised Learning (SSL), Graph Contrastive Learning (GCL) has demonstrated substantial application value in scenarios with limited labeled nodes (samples). However, existing GCLs encounter critical issues in graph augmentation and in positive and negative sampling, stemming from the lack of explicit supervision, which collectively restrict their efficiency and universality. On the one hand, the reliance on graph augmentations in existing GCLs can increase training time and memory usage while potentially compromising semantic integrity. On the other hand, the difficulty of selecting true positive and negative samples limits their universality across homophilic and heterophilic graphs. To address these drawbacks, this paper introduces a novel GCL framework called GRAph learning via Self-contraSt (GRASS). The core mechanism is node-attribute self-contrast, which increases the feature similarities between nodes and their included attributes while decreasing the similarities between nodes and their non-included attributes. Theoretically, the self-contrast mechanism implicitly ensures accurate node-node contrast by capturing high-hop co-inclusion relationships, thereby enabling GRASS to be universally applicable to graphs with varying degrees of homophily. Evaluations on diverse benchmark datasets demonstrate the universality and efficiency of GRASS. The dataset and code are available at https://github.com/YukunCai/GRASS.
6421: Image-Enhanced Hybrid Encoding with Reinforced Contrastive Learning for Spatial Domain Identification in Spatial Transcriptomics
Authors: Daoyuan Wang, Lu Gao, Wenlan Chen, Cheng Liang, Fei Guo
Location: Guangzhou | Day: TBD
Show Abstract
Spatial transcriptomics integrates spatial, gene expression, and multichannel immunohistochemistry image data, enabling advanced insights into cellular organization. However, existing methods often struggle to effectively fuse these multimodal data, limiting their potential for accurate spatial domain identification. Here, we propose IE-HERCL (Image-Enhanced Hybrid Encoding with Reinforced Contrastive Learning), a novel framework designed to address this challenge. Specifically, IE-HERCL employs hybrid encoding to capture both non-spatial features and spatial dependencies for the gene and image modalities via autoencoders and GraphSAGE, respectively. These features are then fused using cross-view attention mechanisms to generate a unified, informative embedding. To enhance the representation learning capability, we introduce a reinforced contrastive learning strategy that mitigates the influence of false negative samples by detecting potential positive counterparts with high-order random walks. In addition, the cluster alignment is dynamically refined through optimal transport, which ensures that the fused consensus representation is coherent and robust, enabling accurate spatial domain identification. Our approach achieves state-of-the-art performance on five image-enhanced spatial transcriptomics datasets, demonstrating its robustness and effectiveness in multimodal integration and spatial domain identification. IE-HERCL offers a powerful solution for advancing spatial transcriptomics analysis. The code is released at https://github.com/wdyi701/IE-HERCL.
6447: A Fast-Adaptive Cognitive Diagnosis Framework for Computerized Adaptive Testing Systems
Authors: Yuanhao Liu, Yiya You, Shuo Liu, Hong Qian, Ying Qian, Aimin Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Computerized Adaptive Testing (CAT) measures student ability by iteratively selecting informative questions, with its core components being the Cognitive Diagnosis Model (CDM) and the selection strategy. Current research focuses on optimizing the selection strategy, assuming relatively accurate CDM results. However, existing static CDMs struggle with rapid and accurate diagnosis in the early stage of CAT. To this end, this paper proposes a Fast Adaptive Cognitive Diagnosis (FACD) framework, which incorporates dynamic collaborative and personalized diagnosis modules. Specifically, the collaborative module in FACD uses a dynamic response graph to quickly build student cognitive profiles, while the personalized module leverages each student’s response sequence for robust and individualized diagnosis. Extensive experiments on real-world datasets show that, compared with existing static CDMs, FACD not only achieves superior prediction performance across various selection strategies, with an improvement of roughly 5%-10% in the early stage of CAT, but also maintains a commendable inference speed.
6451: MA-RAG: Automating Role Engineering for RESTful APIs with Multi-Head Attention and Retrieval-Augmented Generation
Authors: Yang Luo, Qingni Shen, Zhonghai Wu
Location: Guangzhou | Day: TBD
Show Abstract
This paper addresses the role engineering problem for RESTful applications and proposes a role engineering method based on multi-head attention and retrieval-augmented generation, called MA-RAG. The method first performs fine-grained control flow analysis on the system source code to extract permission information of API handlers. Then, using basic blocks as units, it employs pre-trained code models to convert the source code into semantic vectors, which are stored in the retrieval-augmented generation model. On this basis, a call-chain structure tree is constructed with permissions at its center, utilizing the multi-head attention mechanism to aggregate semantic information of different code granularities from the bottom up, with each attention head corresponding to a role engineering objective. Finally, the root vectors of each permission tree are subjected to self-supervised clustering to adaptively determine the number of roles and perform the division. We evaluated MA-RAG on 284 real-world software systems, and the results show that compared with other methods, MA-RAG significantly saves time overhead, reduces the number of generated roles, lowers the role-permission overlap rate, and improves the interpretability score.
6457: Mamba-Based Graph Convolutional Networks: Tackling Over-smoothing with Selective State Space
Authors: Xin He, Yili Wang, Wenqi Fan, Xu Shen, Xin Juan, Rui Miao, Xin Wang
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) have shown great success in various graph-based learning tasks. However, they often face the issue of over-smoothing as model depth increases, which causes all node representations to converge to a single value and become indistinguishable. This issue stems from the inherent limitations of GNNs, which struggle to distinguish the importance of information from different neighborhoods. In this paper, we introduce MbaGCN, a novel graph convolutional architecture that draws inspiration from the Mamba paradigm, originally designed for sequence modeling. MbaGCN presents a new backbone for GNNs, consisting of three key components: the Message Aggregation Layer, the Selective State Space Transition Layer, and the Node State Prediction Layer. These components work in tandem to adaptively aggregate neighborhood information, providing greater flexibility and scalability for deep GNN models. While MbaGCN may not consistently outperform all existing methods on every dataset, it provides a foundational framework that demonstrates the effective integration of the Mamba paradigm into graph representation learning. Through extensive experiments on benchmark datasets, we demonstrate that MbaGCN paves the way for future advancements in graph neural network research. Our code is available at https://github.com/hexin5515/MbaGCN.
6507: Multimodal Fake News Detection: MFND Dataset and Shallow-Deep Multitask Learning
Authors: Ye Zhu, Yunan Wang, Zitong Yu
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal news contains a wealth of information and is easily affected by deepfake modeling attacks. To combat the latest image and text generation methods, we present a new Multimodal Fake News Detection dataset (MFND) containing 11 manipulation types, designed to detect and localize highly authentic-looking fake news. Furthermore, we propose a Shallow-Deep Multitask Learning (SDML) model for fake news, which fully uses unimodal and mutual-modality features to mine the intrinsic semantics of news. Under shallow inference, we propose momentum distillation-based light-punishment contrastive learning for fine-grained uniform spatial image and text semantic alignment, and an adaptive cross-modal fusion module to enhance the mutual-modality features. Under deep inference, we design a two-branch framework to augment the image and text unimodal features, each merged with the mutual-modality features, yielding four predictions via dedicated detection and localization projections. Experiments on both mainstream datasets and our proposed dataset demonstrate the superiority of the model. Codes and dataset are released at https://github.com/yunan-wang33/sdml.
6519: Can Large Models Teach Student Models to Solve Mathematical Problems Like Human Beings? A Reasoning Distillation Method via Multi-LoRA Interaction
Authors: Xinhe Li, Jiajun Liu, Peng Wang
Location: Guangzhou | Day: TBD
Show Abstract
Recent studies have demonstrated that Large Language Models (LLMs) have strong mathematical reasoning abilities but rely on hundreds of billions of parameters. To tackle the challenge of poor reasoning in Small Language Models (SLMs), existing methods typically leverage LLMs to generate massive amounts of data for cramming training. In psychology, this is akin to System 1 thinking, which resolves reasoning problems rapidly based on experience and intuition. However, human learning also requires System 2 thinking, where knowledge is first acquired and then reinforced through practice. Inspired by these two distinct modes of thinking, we propose a novel method based on multi-LoRA Interaction for mathematical reasoning Distillation (LoRID). First, we input the question and reasoning of each sample into an LLM to create knowledge-enhanced datasets. Subsequently, we train a LoRA block on the student model as an Intuitive Reasoner (IR), which directly generates Chain-of-Thoughts for problem-solving. Then, to imitate System 2 thinking, we train a Knowledge Generator (KG) and a Deep Reasoner (DR): the former outputs only knowledge after receiving a problem, while the latter uses that knowledge to perform reasoning. Finally, to address the randomness in the generation of the IR and DR, we evaluate whether their outputs are consistent and iterate the inference process if they are not. This step enhances the mathematical reasoning ability of SLMs through mutual feedback. Experimental results show that LoRID achieves state-of-the-art performance, especially on the GSM8K dataset, where it outperforms the second-best method by 2.3%, 16.1%, 2.4%, 12.3%, and 1.8% accuracy across the five base models, respectively. Meanwhile, we select four strong baselines as System 1, and after integrating them with our method, the reasoning ability of student models is consistently and significantly improved. The datasets and codes are available at https://github.com/Xinhe-Li/LoRID.
6523: Disentangling Multi-view Representations via Curriculum Learning with Learnable Prior
Authors: Kai Guo, Jiedong Wang, Xi Peng, Peng Hu, Hao Wang
Location: Guangzhou | Day: TBD
Show Abstract
Multi-view representation learning methods typically follow a consistent-and-specific pipeline that extracts latent representations for an entity from its multiple observable views to facilitate downstream tasks. However, most of them overlook the complex underlying correlation between different views. To solve this issue, we exploit a well-known property of neural networks (NNs): they tend to learn simple patterns first and hard ones later. In our case, view-consistent representations are the simple patterns and view-specific representations are the hard ones. To this end, we propose to disentangle view-consistency and view-specificity and learn them gradually. Specifically, we devise a novel curriculum learning approach that adjusts the whole model to learn view-consistent representations first and then progressively view-specific representations. Besides, we equip each view with a learnable prior that allows each view-specific representation to fit its own distribution. Moreover, we incorporate a mixture-of-experts layer and a disentangling module to further enhance the quality of the learned representations. Extensive experiments on five real-world datasets show that the proposed model outperforms its counterparts markedly. The code is available at https://github.com/XLearning-SCU/2025-IJCAI-CL2P.
6526: FADE: Towards Fairness-aware Data Generation for Domain Generalization via Classifier-Guided Score-based Diffusion Models
Authors: Yujie Lin, Dong Li, Minglai Shao, Guihong Wan, Chen Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Fairness-aware domain generalization (FairDG) has emerged as a critical challenge for deploying trustworthy AI systems, particularly in scenarios involving distribution shifts. Traditional methods for addressing fairness have failed in domain generalization due to their lack of consideration for distribution shifts. Although disentanglement has been used to tackle FairDG, it is limited by its strong assumptions. To overcome these limitations, we propose Fairness-aware Classifier-Guided Score-based Diffusion Models (FADE) as a novel approach to effectively address the FairDG issue. Specifically, we first pre-train a score-based diffusion model (SDM) and two classifiers to equip the model with strong generalization capabilities across different domains. Then, we guide the SDM using these pre-trained classifiers to effectively eliminate sensitive information from the generated data. Finally, the generated fair data is used to train downstream classifiers, ensuring robust performance under new data distributions. Extensive experiments on three real-world datasets demonstrate that FADE not only enhances fairness but also improves accuracy in the presence of distribution shifts. Additionally, FADE outperforms existing methods in achieving the best accuracy-fairness trade-offs.
6528: Cross-modal Collaborative Representation Learning for Text-to-Image Person Retrieval
Authors: Shuanglin Yan, Jun Liu, Neng Dong, Liyan Zhang, Jinhui Tang
Location: Guangzhou | Day: TBD
Show Abstract
Text-to-image person retrieval (TIPR) aims to find images of the same identity that match a given text description. Current TIPR methods mainly focus on mining the association between images and texts, ignoring their potential complementarity. Besides, existing matching losses treat all positive pairs from the same identity equally, leading to noisy correspondences. In this paper, we propose CoRL: a cross-modal Collaborative Representation Learning framework designed to improve TIPR by effectively leveraging the complementarity between modalities. The text typically contains identity details with less noise, which helps distinguish visually similar pedestrians. This inspires us to integrate it into the corresponding image to emphasize identity-related and modality-shared visual information. However, corresponding text for each image is not always available, especially during inference. Accordingly, we introduce a Virtual-text Embedding Synthesizer that generates high-quality virtual-text features for cross-modal collaboration, eliminating the need for actual texts. We then design a Cross-Modal Collaboration learning process, incorporating a Cross-modal Relation Consistency loss to promote interaction and fusion between image and virtual-text features for mutual enhancement. Additionally, an Identity-bounded Matching loss is proposed to handle different types of image-text pairs distinctly, leading to more accurate cross-modal correspondences. Extensive experiments on multiple benchmarks demonstrate the superiority of CoRL over existing TIPR methods.
6553: Federated Domain Generalization with Decision Insight Matrix
Authors: Tianchi Liao, Binghui Xie, Lele Fu, Sheng Huang, Bowen Deng, Chuan Chen, Zibin Zheng
Location: Guangzhou | Day: TBD
Show Abstract
Federated domain generalization addresses the crucial challenge of developing models that can generalize across diverse domains while maintaining data privacy in federated learning settings. Current approaches either compromise privacy constraints or focus narrowly on specific aspects of model invariance, often incurring significant computational overhead. We propose FedDIM, a novel approach that leverages the concept of an “insight matrix” – a fine-grained representation of the model’s decision-making process derived from element-wise products between feature vectors and classifier weights. By introducing a regularization term that promotes consistency between individual sample insight matrices and their class-wise mean representations, our method effectively captures both feature and classifier invariance. This approach not only maintains strict privacy requirements but also introduces minimal computational overhead as it utilizes intermediate computations already present in the forward pass. Extensive experiments demonstrate that our method achieves superior out-of-distribution generalization compared to existing federated learning approaches while being simple to implement. Our work provides a new perspective on achieving robust generalization in federated learning settings through the lens of decision-making processes.
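Read literally, a sample's insight matrix is the element-wise product of its feature vector with its class's classifier weights, and the regularizer pulls each sample's insight matrix toward its class-wise mean. A minimal PyTorch sketch under that reading follows; the function name and shapes are our illustrative assumptions, not the FedDIM code.

```python
# Illustrative insight-matrix consistency regularizer (assumed reading of
# the abstract; not the authors' implementation).
import torch

def insight_regularizer(features, labels, classifier_weight):
    # features: (B, D) penultimate-layer features
    # labels: (B,) integer class labels
    # classifier_weight: (C, D) final linear-layer weight matrix
    insight = features * classifier_weight[labels]   # (B, D) per-sample insight matrices
    reg = features.new_tensor(0.0)
    for c in labels.unique():
        class_insight = insight[labels == c]
        # penalize deviation from the class-wise mean insight representation
        reg = reg + ((class_insight - class_insight.mean(dim=0)) ** 2).mean()
    return reg
```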
6558: Flexible Generalized Low-Rank Regularizer for Tensor RPCA
Authors: Zhiyang Gong, Jie Yu, Yutao Hu, Yulong Wang
Location: Guangzhou | Day: TBD
Show Abstract
Tensor Robust Principal Component Analysis (TRPCA) has emerged as a powerful technique for low-rank tensor recovery. To achieve better recovery performance, a variety of TNN (Tensor Nuclear Norm) based low-rank regularizers have been proposed case by case, lacking a general and flexible framework. In this paper, we design a novel tensor low-rank regularization framework coined FGTNN (Flexible Generalized Tensor Nuclear Norm). Equipped with FGTNN, we develop the FGTRPCA (Flexible Generalized TRPCA) framework, which has two desirable properties. 1) Generalizability: Many existing TRPCA methods can be viewed as special cases of our framework; 2) Flexibility: Using FGTRPCA as a general platform, we derive a series of new TRPCA methods by tuning a continuous parameter to improve performance. In addition, we develop another novel smooth and low-rank regularizer coined t-FGJP and the resulting SFGTRPCA (Smooth FGTRPCA) method by leveraging the low-rankness and smoothness priors simultaneously. Experimental results on various tensor denoising and recovery tasks demonstrate the superiority of our methods.
6581: Bridging Generative and Discriminative Learning: Few-Shot Relation Extraction via Two-Stage Knowledge-Guided Pre-training
Authors: Quanjiang Guo, Jinchuan Zhang, Sijie Wang, Ling Tian, Zhao Kang, Bin Yan, Weidong Xiao
Location: Guangzhou | Day: TBD
Show Abstract
Few-Shot Relation Extraction (FSRE) remains a challenging task due to the scarcity of annotated data and the limited generalization capabilities of existing models. Although large language models (LLMs) have shown potential in FSRE through in-context learning, their general-purpose training objectives often result in suboptimal performance for task-specific relation extraction. To overcome these challenges, we propose TKRE (Two-Stage Knowledge-Guided Pre-training for Relation Extraction), a novel framework that synergistically integrates LLMs with traditional relation extraction models, bridging generative and discriminative learning paradigms. TKRE introduces two key innovations: (1) leveraging LLMs to generate explanation-driven knowledge and schema-constrained synthetic data, addressing the issue of data scarcity; and (2) a two-stage pre-training strategy combining Masked Span Language Modeling (MSLM) and Span-Level Contrastive Learning (SCL) to enhance relational reasoning and generalization. Together, these components enable TKRE to effectively handle FSRE tasks. Comprehensive experiments on benchmark datasets demonstrate the efficacy of TKRE, achieving new state-of-the-art performance in FSRE and underscoring its potential for broader application in low-resource scenarios. The code and data are released at https://github.com/UESTC-GQJ/TKRE.
6586: Relation-Augmented Dueling Bayesian Optimization via Preference Propagation
Authors: Xiang Xia, Xiang Shu, Shuo Liu, Yiyi Zhu, Yijie Zhou, Weiye Wang, Bingdong Li, Hong Qian
Location: Guangzhou | Day: TBD
Show Abstract
In black-box optimization, when directly evaluating the function values of solutions is very costly or infeasible, access to the objective function is often limited to comparing pairs of solutions, which yields dueling black-box optimization. Dueling optimization is based solely on pairwise preferences, and thus notably reduces cost compared with function value based methods. However, the optimization performance of dueling optimization is often limited because most existing dueling optimization methods do not make full use of the collected pairwise preferences. To better utilize these preferences, this paper proposes relation-augmented dueling Bayesian optimization (RADBO) via preference propagation. By considering solution similarity, RADBO aims to uncover the potential dueling relations between solutions within different preferences through the proposed preference propagation technique. Specifically, RADBO first clusters solutions using a Gaussian mixture model. After obtaining the solution set with the highest intra-cluster similarity, RADBO utilizes a directed hypergraph to model the potential dueling relations between solutions, thereby realizing relation augmentation. Extensive experiments are conducted on both synthetic functions and real-world tasks such as motion control, car cab design and spacecraft trajectory optimization. The experimental results demonstrate the satisfactory accuracy of augmented preferences in RADBO, and show the superiority of RADBO compared with existing dueling optimization methods. Notably, it is verified that, under the same evaluation cost budget, RADBO can be competitive with or even surpass the function value based Bayesian optimization methods with respect to optimization performance.
6600: PerfSeer: An Efficient and Accurate Deep Learning Models Performance Predictor
Authors: Xinlong Zhao, Jiande Sun, Jia Zhang, Tong Liu, Ke Liu
Location: Guangzhou | Day: TBD
Show Abstract
Predicting the performance of deep learning (DL) models, such as execution time and resource utilization, is crucial for Neural Architecture Search (NAS), DL cluster schedulers, and other technologies that advance deep learning. The representation of a model is the foundation for its performance prediction. However, existing methods cannot comprehensively represent diverse model configurations, resulting in unsatisfactory accuracy. To address this, we represent a model as a graph that includes the topology, along with node, edge, and global features, all of which are crucial for effectively capturing the performance of the model. Based on this representation, we propose PerfSeer, a novel predictor that uses a Graph Neural Network (GNN)-based performance prediction model, SeerNet. SeerNet fully leverages the topology and various features, while incorporating optimizations such as Synergistic Max-Mean aggregation (SynMM) and Global-Node Perspective Boost (GNPB) to more effectively capture the critical performance information, enabling it to predict the performance of models accurately. Furthermore, SeerNet can be extended to SeerNet-Multi by using Project Conflicting Gradients (PCGrad), enabling efficient simultaneous prediction of multiple performance metrics without significantly affecting accuracy. We constructed a dataset containing performance metrics for 53k+ model configurations, including execution time, memory usage, and Streaming Multiprocessor (SM) utilization during both training and inference. The evaluation results show that PerfSeer outperforms nn-Meter, Brp-NAS, and DIPPM.
6602: RPMIL: Rethinking Uncertainty-Aware Probabilistic Multiple Instance Learning for Whole Slide Pathology Diagnosis
Authors: Zhikang Zhao, Kaitao Chen, Jing Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Whole slide images (WSIs) are gigapixel digital scans of traditional pathology slides, offering substantial support for cancer diagnosis. Current multiple instance learning (MIL) methods for WSIs typically extract instance features and aggregate these into a single bag feature for prediction. We observe that these MIL methods rely on point estimation, where each bag is mapped to a deterministic embedding. Such MIL methods based on point estimation fail to capture the full spectrum of data variability due to the reliance on fixed embedding, especially when the number of trainable bags is limited. In this paper, we rethink probabilistic modeling in MIL and propose RPMIL, an uncertainty-aware probabilistic MIL method for whole slide pathology diagnosis. RPMIL learns a probabilistic aggregator to consolidate instance features into dynamic bag feature distributions instead of a deterministic bag feature. Specifically, we employ a variational autoencoder approach to compress multiple instance features into a low-dimensional space with probabilistic representation and obtain the bag feature distribution formulated by the mean and variance. Furthermore, we derive the prediction by jointly leveraging the instance feature distribution and bag feature distribution. We evaluate the WSI classification performance on two public datasets: Camelyon16 and TCGA-NSCLC. Extensive experiments demonstrate that our method surpasses point estimation methods in MIL, achieving state-of-the-art performance.
6606: SourceDetMamba: A Graph-aware State Space Model for Source Detection in Sequential Hypergraphs
Authors: Le Cheng, Peican Zhu, Yangming Guo, Chao Gao, Zhen Wang, Keke Tang
Location: Guangzhou | Day: TBD
Show Abstract
Source detection on graphs has demonstrated high efficacy in identifying rumor origins. Despite advances in machine learning-based methods, many fail to capture the intrinsic dynamics of rumor propagation. In this work, we present SourceDetMamba: A Graph-aware State Space Model for Source Detection in Sequential Hypergraphs, which harnesses the recent success of the state space model Mamba, known for its superior global modeling capabilities and computational efficiency, to address this challenge. Specifically, we first employ hypergraphs to model high-order interactions within social networks. Subsequently, temporal network snapshots generated during the propagation process are sequentially fed in reverse order into Mamba to infer underlying propagation dynamics. Finally, to empower the sequential model to effectively capture propagation patterns while integrating structural information, we propose a novel graph-aware state update mechanism, wherein the state of each node is propagated and refined by both temporal dependencies and topological context. Extensive evaluations on eight datasets demonstrate that SourceDetMamba consistently outperforms state-of-the-art approaches.
6627: HyperDet: Source Detection in Hypergraphs via Interactive Relationship Construction and Feature-rich Attention Fusion
Authors: Le Cheng, Peican Zhu, Yangming Guo, Keke Tang, Chao Gao, Zhen Wang
Location: Guangzhou | Day: TBD
Show Abstract
Hypergraphs offer superior modeling capabilities for social networks, particularly in capturing group phenomena that extend beyond pairwise interactions in rumor propagation. Existing approaches in rumor source detection predominantly focus on dyadic interactions, which inadequately address the complexity of more intricate relational structures. In this study, we present a novel approach for Source Detection in Hypergraphs (HyperDet) via Interactive Relationship Construction and Feature-rich Attention Fusion. Specifically, our methodology employs an Interactive Relationship Construction module to accurately model both the static topology and dynamic interactions among users, followed by the Feature-rich Attention Fusion module, which autonomously learns node features and discriminates between nodes using a self-attention mechanism, thereby effectively learning node representations under the framework of accurately modeled higher-order relationships. Extensive experimental validation confirms the efficacy of our HyperDet approach, showcasing its superiority relative to current state-of-the-art methods.
6634: A Timestep-Adaptive Frequency-Enhancement Framework for Diffusion-based Image Super-Resolution
Authors: Yueying Li, Hanbin Zhao, Jiaqing Zhou, Guozhi Xu, Tianlei Hu, Gang Chen, Haobo Wang
Location: Guangzhou | Day: TBD
Show Abstract
Image super-resolution (ISR) is a classic and challenging problem in computer vision because of complex and unknown degradation patterns in the data collection process. Leveraging powerful generative priors, diffusion-based methods have recently established new state-of-the-art ISR performance, but their characteristics in the frequency domain are still underexplored. In this paper, we innovatively investigate their frequency-domain behaviors from a sampling timestep perspective. Experimentally, we find that current diffusion-based ISR algorithms exhibit insufficiency in different frequency components in distinct groups of timesteps during the sampling. To address this, we first propose a Timestep Division Controller that is able to adaptively divide the timesteps into groups based on the performance gradient across different components. Next, we design two dedicated modules — the Amplitude and Phase Enhancement Module (APEM) and the High- and Low-Frequency Enhancement Module (HLEM), to regulate the information flow of distinct frequency-domain features. By adaptively enhancing specific frequency components at different stages of the sampling process, the two modules effectively compensate for the insufficient frequency-domain perception of diffusion-based ISR models. Extensive experiments on three benchmark datasets verify the superior ISR performance of our method, e.g., achieving an average 5.40% improvement on CLIP-IQA compared to the best diffusion-based ISR baseline.
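For reference, the frequency-domain quantities the two modules act on can be computed with a 2-D FFT: amplitude and phase for APEM, and a radial low/high-frequency split for HLEM. The sketch below is illustrative plumbing only (the cutoff radius is an arbitrary assumption); the modules themselves are learned.

```python
# Sketch of the frequency decompositions referenced above; not the paper's
# learned APEM/HLEM modules.
import torch

def amplitude_phase(x):
    # x: (B, C, H, W) image tensor
    spec = torch.fft.fft2(x)
    return spec.abs(), spec.angle()        # amplitude, phase

def low_high_split(x, radius=0.25):
    # radius: assumed normalized cutoff separating low and high frequencies
    spec = torch.fft.fftshift(torch.fft.fft2(x))
    B, C, H, W = x.shape
    yy, xx = torch.meshgrid(torch.linspace(-0.5, 0.5, H),
                            torch.linspace(-0.5, 0.5, W), indexing="ij")
    mask = (yy ** 2 + xx ** 2).sqrt() <= radius     # low-frequency disk
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * mask)).real
    high = x - low                                  # residual = high frequencies
    return low, high
```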
6638: Towards Robust Incremental Learning Under Ambiguous Supervision
Authors: Rui Wang, Mingxuan Xia, Haobo Wang, Lei Feng, Junbo Zhao, Gang Chen, Chang Yao
Location: Guangzhou | Day: TBD
Show Abstract
Traditional Incremental Learning (IL) aims to handle sequential fully-supervised learning problems where novel classes emerge from time to time. However, due to inherent annotation uncertainty and ambiguity, collecting high-quality annotated data in a dynamic learning system can be extremely expensive. To mitigate this problem, we propose a novel weakly-supervised learning paradigm called Incremental Partial Label Learning (IPLL), where sequentially arriving data relate to a set of candidate labels rather than the ground truth. Technically, we develop the Prototype-Guided Disambiguation and Replay Algorithm (PGDR) which leverages the class prototypes as a proxy to mitigate two intertwined challenges in IPLL, i.e., label ambiguity and catastrophic forgetting. To handle the former, PGDR encapsulates a momentum-based pseudo-labeling algorithm along with prototype-guided initialization, resulting in a balanced perception of classes. To alleviate forgetting, we develop a memory replay technique that collects well-disambiguated samples while maintaining representativeness and diversity. By jointly distilling knowledge from curated memory data, our framework exhibits a strong disambiguation ability for samples of new tasks and achieves less forgetting of knowledge. Extensive experiments demonstrate that PGDR achieves superior performance over the baselines in the IPLL task.
6669: NeuBM: Mitigating Model Bias in Graph Neural Networks Through Neutral Input Calibration
Authors: Jiawei Gu, Ziyue Qiao, Xiao Luo
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) have shown remarkable performance across various domains, yet they often struggle with model bias, particularly in the presence of class imbalance. This bias can lead to suboptimal performance and unfair predictions, especially for underrepresented classes. We introduce NeuBM (Neutral Bias Mitigation), a novel approach to mitigate model bias in GNNs through neutral input calibration. NeuBM leverages a dynamically updated neutral graph to estimate and correct the inherent biases of the model. By subtracting the logits obtained from the neutral graph from those of the input graph, NeuBM effectively recalibrates the model’s predictions, reducing bias across different classes. Our method integrates seamlessly into existing GNN architectures and training procedures, requiring minimal computational overhead. Extensive experiments on multiple benchmark datasets demonstrate that NeuBM significantly improves the balanced accuracy and recall of minority classes, while maintaining strong overall performance. The effectiveness of NeuBM is particularly pronounced in scenarios with severe class imbalance and limited labeled data, where traditional methods often struggle. We provide theoretical insights into how NeuBM achieves bias mitigation, relating it to the concept of representation balancing. Our analysis reveals that NeuBM not only adjusts the final predictions but also influences the learning of balanced feature representations throughout the network.
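The core calibration step is easy to state: run the model on a neutral graph to estimate its inherent bias, then subtract those logits from the input graph's logits. A minimal inference-time sketch, with the model and graph objects as hypothetical placeholders:

```python
# Sketch of neutral-input logit calibration in the spirit of NeuBM;
# `model`, `input_graph`, and `neutral_graph` are assumed stand-ins.
import torch

@torch.no_grad()
def calibrated_logits(model, input_graph, neutral_graph):
    # Bias estimate: what the model predicts on a content-free neutral graph.
    bias_logits = model(neutral_graph)
    # Debiased prediction: subtract the neutral logits from the input logits.
    return model(input_graph) - bias_logits
```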
6674: Disconfounding Fake News Video Explanation with Causal Inference
Authors: Lizhi Chen, Zhong Qian, Peifeng Li, Qiaoming Zhu
Location: Guangzhou | Day: TBD
Show Abstract
The proliferation of fake news videos on social media has heightened the demand for credible verification systems. While existing methods focus on detecting false content, generating human-readable explanations for such predictions remains a critical challenge. Current approaches suffer from spurious correlations caused by two key confounders: 1) video object bias, where co-occurring objects entangle features leading to incorrect semantic associations; and 2) explanation aspect bias, where models over-rely on frequent aspects while neglecting rare ones. To address these issues, we propose CIFE, a causal inference framework that disentangles confounding factors to generate unbiased explanations. First, we formalize the problem through a Structural Causal Model (SCM) to identify confounding factors. We then introduce two novel modules: 1) the Interventional Video-Object Detector (IVOD), which employs backdoor adjustment to decouple object-level visual semantics; and 2) the Interventional Explanation Aspect Module (IEAM), which balances aspect selection during multimodal fusion. Extensive experiments on the FakeVE dataset demonstrate the effectiveness of CIFE, which generates more faithful explanations by mitigating object entanglement and aspect imbalance. Our code is available at https://github.com/Lieberk/CIFE.
6688: More Efforts Towards Fixed-Parameter Approximability of Multiwinner Rules
Authors: Sushmita Gupta, Pallavi Jain, Souvik Saha, Saket Saurabh, Anannya Upasana
Location: Guangzhou | Day: TBD
Show Abstract
Multiwinner Elections have emerged as a prominent area of research with numerous practical applications. Given a set of candidates C, a set of voters V, each approving a subset of candidates (called the voter's approval set), and an integer k, we consider the problem of selecting a “good” committee using Thiele rules. This problem is computationally challenging for most Thiele rules with monotone submodular satisfaction functions, as there is no (1 − 1/e − ε)-approximation algorithm running in f(k)·(|C| + |V|)^o(k) time for any fixed ε > 0 and any computable function f, and no PTAS even when the length of the approval set is two. Skowron designed an approximation scheme running in FPT time parameterized by the combination of the approval-set size and k. In this paper, we consider the parameter d+k (no d voters approve the same set of d candidates), where d is upper bounded by the size of the approval set (and thus can be much smaller). With respect to this parameter, we design parameterized approximation schemes and a lossy polynomial-time preprocessing method, and show that an extra committee member suffices to achieve the desired score (i.e., a 1-additive approximation). Additionally, we resolve an open question by Yang and Wang regarding the fixed-parameter tractability of the problem under the PAV rule with the total score as the parameter, demonstrating that it admits an FPT algorithm.
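As a concrete example of a Thiele rule, the PAV rule referenced in the open question gives each voter harmonic satisfaction 1 + 1/2 + ... + 1/t for t approved committee members. A small self-contained sketch of the score computation (ours, for illustration):

```python
# PAV (proportional approval voting) score of a committee under harmonic
# satisfaction; an illustrative sketch of the Thiele rule named above.
from fractions import Fraction

def pav_score(committee, approval_sets):
    # committee: iterable of candidates; approval_sets: one set per voter
    committee = set(committee)
    score = Fraction(0)
    for approved in approval_sets:
        t = len(approved & committee)              # approved committee members
        score += sum(Fraction(1, i) for i in range(1, t + 1))
    return score

# Example: pav_score({"a", "b"}, [{"a", "b"}, {"b", "c"}])
# voter 1 contributes 1 + 1/2, voter 2 contributes 1, total 5/2.
```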
6729: CGI: Identifying Conditional Generative Models with Example Images
Authors: Zhi Zhou, Hao-Zhe Tan, Peng-Xiao Song, Lan-Zhe Guo
Location: Guangzhou | Day: TBD
Show Abstract
Generative models have achieved remarkable performance recently, and thus model hubs have emerged. Existing model hubs typically assume basic text matching is sufficient to search for models. However, in reality, due to differing levels of abstraction and the large number of models in model hubs, it is not easy for users to review model descriptions and example images to choose the model that best meets their needs. Therefore, it is necessary to describe model functionality wisely so that future users can efficiently search for the most suitable model for their needs. Efforts to address this issue remain limited. In this paper, we propose Conditional Generative Model Identification (CGI), which aims to provide an effective way to identify the most suitable model using user-provided example images rather than requiring users to manually review a large number of models with example images. To address this problem, we propose Prompt-Based Model Identification (PMI), which can adequately describe model functionality and precisely match requirements with specifications. To evaluate the PMI approach and promote related research, we provide a benchmark comprising 65 models and 9100 identification tasks. Extensive experimental and human evaluation results demonstrate that PMI is effective. For instance, 92% of models are correctly identified with significantly better FID scores when four example images are provided.
6742: Synthesis of Communication Policies for Multi-Agent Systems Robust to Communication Restrictions
Authors: Saleh Soudijani, Rayna Dimitrova
Location: Guangzhou | Day: TBD
Show Abstract
We study stochastic multi-agent systems in which agents must cooperate to maximize the probability of achieving a common reach-avoid objective.
In many applications, during the execution of the system, the communication between the agents can be constrained by restrictions on the bandwidth currently available for exchanging local-state information between the agents.
In this paper, we propose a method for computing joint action and communication policies for the group of agents that aim to satisfy the communication restrictions as much as possible while achieving the optimal reach-avoid probability when communication is unconstrained. Our method synthesizes a pair of action and communication policies robust to restrictions on the number of agents allowed to communicate. To this end, we introduce a novel cost function that measures the amount of information exchanged beyond what the communication policy allows. We evaluate our approach experimentally on a range of benchmarks and demonstrate that it is capable of computing pairs of action and communication policies that satisfy the communication restrictions, if such policies exist.
6755: VideoHumanMIB: Unlocking Appearance Decoupling for Video Human Motion In-betweening
Authors: Haiwei Xue, Zhensong Zhang, Minglei Li, Zonghong Dai, Fei Yu, Fei Ma, Zhiyong Wu
Location: Guangzhou | Day: TBD
Show Abstract
We propose VideoHumanMIB, a novel framework for Video Human Motion In-betweening that enables seamless transitions between different motion video clips, facilitating the generation of longer and more natural digital human videos. While existing video frame interpolation methods work well for similar motions in adjacent frames, they often struggle with complex human movements, resulting in artifacts and unrealistic transitions. To address these challenges, we introduce a two-stage approach: First, we design an Appearance Reconstruction AutoEncoder to decouple appearance and motion information, extracting robust appearance-invariant features. Second, we develop an enhanced diffusion pretrained network that leverages both motion optical flow and human pose as guidance conditions, enabling the model to learn comprehensive latent distributions of possible motions. Rather than operating directly in pixel space, our model works in a learned latent space, allowing it to better capture the underlying motion dynamics. The framework is optimized with a dual-frame constraint loss and a motion flow loss to ensure temporal consistency and natural movement transitions. Extensive experiments demonstrate that our approach generates highly realistic transition sequences that significantly outperform existing methods, particularly in challenging scenarios with large motion variations. The proposed VideoHumanMIB establishes a new baseline for human motion synthesis and enables more natural and controllable digital human animation.
6769: FSDFormer: Progressive Rain Removal Network Based on Fourier-Spatial Dual Transformer
Authors: Shuying Huang, Jiaxuan Yang, Yong Yang, Weiguo Wan
Location: Guangzhou | Day: TBD
Show Abstract
Most rain removal methods based on deep learning typically adopt a single-stage network architecture to remove the rain streaks in rainy images by increasing the depth of the network. The increase in network depth will increase the computational complexity of the model, and the lack of guidance for intermediate features will lead to inaccurate feature learning. To address this issue, we propose a progressive rain removal network based on a Fourier-spatial dual Transformer, called FSDFormer. The network consists of multiple rain removal stages, each with the same structure, which can utilize background prior features to guide the network to reconstruct rainless images with more texture information. Each stage consists of a prior extraction module (PEM), a prior attention fusion module (PAFM), and a U-Net including multiple Fourier-spatial dual Transformers (FSD-Transformers). Firstly, PEM is constructed to extract the background prior features from the input rainy image or the output of each stage. Then, a PAFM is designed to reconstruct accurate image background features by utilizing background prior features to guide the network. Finally, U-Net extracts and reconstructs features at different scales by constructing multiple FSD-Transformers to obtain rainless features at each stage. Extensive experimental results on synthetic and real datasets show that the proposed method outperforms some state-of-the-art (SOTA) rain removal methods in terms of visual quality and quantitative indicators. The source code is available at https://github.com/yangjiaxuan6250/FSDFormer.
6771: Strategy-Architecture Synergy: A Multi-View Graph Contrastive Paradigm for Consistent Representations
Authors: Shuman Zhuang, Zhihao Wu, Yuhong Chen, Zihan Fang, Jiali Yin, Ximeng Liu
Location: Guangzhou | Day: TBD
Show Abstract
Facing the growing diversity of multi-view data, multi-view graph-based models have made encouraging progress in handling multi-view data modeled as graphs. Graph Contrastive Learning (GCL) naturally fits multi-view graph data by treating their inherent views as augmentations. However, the development of GCL on multi-view graph data is still in its infancy. Challenges remain in designing strategies that coordinate preprocessing and contrastive learning, and in developing model architectures that automatically meet the needs of diverse views. To tackle these challenges, we propose a framework named CAMEL, which refines consistency learning by introducing a tailored contrastive paradigm for multi-view graphs. Initially, we theoretically analyze the positive effect of edge-dropping preprocessing on consistency and quantify the factors that influence it. Paired with a learnable model architecture, the proposed adaptive edge-dropping preprocessing strategy is guided by dynamic topology, making the heterogeneity of views more controllable and better aligned with contrastive learning. Finally, we design a neighborhood consistency multi-view contrastive objective that enhances consistency information interaction by extending positive samples. Extensive experiments on downstream tasks, including node classification and clustering, validate the superiority of our proposed model.
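For orientation, the static baseline that CAMEL's adaptive strategy refines is plain random edge dropping, a one-liner over a COO edge list (sketch below; the edge-list shape is an assumption):

```python
# Plain random edge dropping, the static GCL augmentation that adaptive
# edge-dropping strategies build on; illustrative only.
import torch

def drop_edges(edge_index, p=0.2):
    # edge_index: (2, E) COO edge list; keep each edge with probability 1 - p
    keep = torch.rand(edge_index.size(1)) >= p
    return edge_index[:, keep]
```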
6773: Finite-Time Analysis of Heterogeneous Federated Temporal Difference Learning
Authors: Ye Zhu, Xiaowen Gong, Shiwen Mao
Location: Guangzhou | Day: TBD
Show Abstract
Federated Temporal Difference (FTD) learning has emerged as a promising framework for collaboratively evaluating policies without sharing raw data. Despite its potential, existing approaches often yield biased convergence results due to the inherent challenges of federated reinforcement learning, such as multiple local updates and environment heterogeneity. In response, we investigate federated temporal difference (TD) learning, focusing on collaborative policy evaluation with linear function approximation among agents operating in heterogeneous environments. We devise a heterogeneous federated temporal difference (HFTD) algorithm which iteratively aggregates agents’ local stochastic gradients for TD learning. The HFTD algorithm involves two major novel contributions: 1) it aims to find the optimal value function model for the mixture environment, i.e., an environment drawn at random from the agents’ heterogeneous environments, using the local gradients of agents’ mean squared Bellman errors (MSBEs) for their respective environments; 2) it allows agents to perform different numbers of local iterations for TD learning based on their heterogeneous computational capabilities. We analyze the finite-time convergence of the HFTD algorithm for the scenarios of IID sampling and Markovian sampling respectively. By characterizing bounds on the convergence error, we show that the HFTD algorithm can exactly converge to the optimal model and also achieves linear speedups as the number of agents increases.
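The building blocks here are standard: semi-gradient TD(0) with linear value approximation run locally, followed by server-side aggregation. The sketch below illustrates that skeleton with simple model averaging; the precise update and aggregation rules of HFTD are specified in the paper, so every name here is an assumption.

```python
# Skeleton of federated TD(0) with linear value approximation; simple
# averaging is an illustrative assumption, not the HFTD aggregation rule.
import numpy as np

def local_td_updates(theta, transitions, alpha=0.05, gamma=0.99):
    # transitions: iterable of (phi_s, reward, phi_s_next) feature tuples
    theta = theta.astype(float).copy()
    for phi_s, r, phi_next in transitions:
        td_error = r + gamma * phi_next @ theta - phi_s @ theta
        theta += alpha * td_error * phi_s      # semi-gradient TD(0) step
    return theta

def server_round(theta, agents_transitions):
    # Each agent may run a different number of local steps on its own data.
    local_models = [local_td_updates(theta, t) for t in agents_transitions]
    return np.mean(local_models, axis=0)       # aggregate into a global model
```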
6785: Avoiding Undesired Future with Sequential Decisions
Authors: Lue Tao, Tian-Zuo Wang, Yuan Jiang, Zhi-Hua Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Machine learning has advanced in predictive tasks, but practitioners often need to proactively avoid undesired outcomes rather than just predicting them. To this end, a framework called rehearsal has been introduced, which tackles the avoiding undesired future (AUF) problem by modeling how variables influence each other and searching for a decision that leads to desired results. In this paper, we propose a novel rehearsal approach for addressing the AUF problem by making a sequence of decisions, where each decision is dynamically informed by the latest observations via retrospective inference. Theoretically, we show that sequential decisions in our approach tend to achieve a higher success rate in avoiding undesired outcomes by more reliably inferring the outcome of actions compared with existing solutions. Perhaps surprisingly, our approach remains advantageous even under imprecise modeling of relations between variables, and we provide a sufficient condition under which the advantage holds. Finally, experimental results confirm the practical effectiveness of the proposed approach in both simulated and real-world tasks.
6787: Most General Explanations of Tree Ensembles
Authors: Yacine Izza, Alexey Ignatiev, Sasha Rubin, Joao Marques-Silva, Peter J. Stuckey
Location: Guangzhou | Day: TBD
Show Abstract
Explainable Artificial Intelligence (XAI) is critical for attaining trust in the operation of AI systems. A key question of an AI system is “why was this decision made this way”. Formal approaches to XAI use a formal model of the AI system to identify abductive explanations. While abductive explanations may be applicable to a large number of inputs sharing the same concrete values, more general explanations may be preferred for numeric inputs.
So-called inflated abductive explanations give intervals for each feature, ensuring that any input whose values fall within these intervals is still guaranteed to yield the same prediction. Inflated explanations cover a larger portion of the input space, and hence are deemed more general explanations. But there can be many (inflated) abductive explanations for an instance. Which is the best? In this paper, we show how to find a most general abductive explanation for an AI decision. This explanation covers as much of the input space as possible, while still being a correct formal explanation of the model’s behaviour. Given that we only want to give a human one explanation for a decision, the most general explanation gives us the explanation with the broadest applicability, and hence the one most likely to seem sensible.
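To make the notion concrete, an inflated explanation can be represented as one interval per feature: an input is covered when every feature value falls inside its interval, and the box volume gives a crude measure of generality for comparing candidates. A toy sketch (our illustration, not the paper's search algorithm):

```python
# Toy representation of an inflated explanation as per-feature intervals.
def covers(intervals, x):
    # True iff every feature value lies within its explanation interval
    return all(lo <= xi <= hi for (lo, hi), xi in zip(intervals, x))

def coverage_volume(intervals):
    # Crude generality measure: volume of the covered box
    volume = 1.0
    for lo, hi in intervals:
        volume *= (hi - lo)
    return volume  # larger volume = more general explanation
```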
6788: Attention-based Conditional Random Field for Financial Fraud Detection
Authors: Xiaoguang Wang, Chenxu Wang, Luyue Zhang, Xiaole Wang, Mengqin Wang, Huanlong Liu, Tao Qin
Location: Guangzhou | Day: TBD
Show Abstract
Financial fraud detection is critical for market transparency and regulatory compliance. Existing methods often ignore the temporal patterns in financial data, which are essential for understanding dynamic financial behaviors and detecting fraud. Moreover, they also treat companies as independent entities, overlooking the valuable interrelationships. To address these issues, we propose ACRF-RNN, a Recurrent Neural Network (RNN) with Attention-based Conditional Random Field (CRF) for fraud detection. Specifically, we use an RNN with a sliding window to capture temporal dependencies from historical data, and an attention-based CRF feature transformer to model inter-company relationships. This transforms raw financial data into optimized features, which are fed into a multi-layer perceptron for classification. In addition, we use the focal loss to alleviate the class imbalance caused by rare fraudulent cases. This work presents a novel real-world dataset to evaluate the performance of ACRF-RNN. Extensive experiments show that ACRF-RNN outperforms the state-of-the-art methods by 15.28% in KS and 4.04% in Recall.
Data and code are available at: https://github.com/XNetLab/ACRF-RNN.git.
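The focal loss mentioned above is the standard class-imbalance loss of Lin et al. (2017), which down-weights easy examples via a (1 − p_t)^γ modulating factor. A generic binary sketch (not code from the ACRF-RNN repository):

```python
# Standard binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
import torch

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)          # prob of the true class
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    # (1 - p_t)^gamma down-weights well-classified (easy) examples
    return (-alpha_t * (1 - p_t).pow(gamma) * p_t.clamp(min=1e-8).log()).mean()
```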
6819: A Novel Sparse Active Online Learning Framework for Fast and Accurate Streaming Anomaly Detection Over Data Streams
Authors: Zhong Chen, Yi He, Di Wu, Chen Zhao, Meikang Qiu
Location: Guangzhou | Day: TBD
Show Abstract
Online Anomaly Detection (OAD) is critical for identifying rare yet important data points in large, dynamic, and complex data streams. A key challenge lies in achieving accurate and consistent detection of anomalies while maintaining computational and memory efficiency. Conventional OAD approaches, which depend on distributional deviations and static thresholds, struggle with model update delays and catastrophic forgetting, leading to missed detections and high false positive rates. To address these limitations, we propose a novel Streaming Anomaly Detection (SAD) method, grounded in a sparse active online learning framework. Our approach uniquely integrates ℓ1,2-norm sparse online learning with CUR decomposition-based active learning, enabling simultaneous fast feature selection and dynamic instance selection. The efficient CUR decomposition further supports real-time residual analysis for anomaly scoring, eliminating the need to manually set thresholds over temporal data distributions. Extensive experiments on diverse streaming datasets demonstrate SAD’s superiority, achieving a 14.06% reduction in detection error rates compared to five state-of-the-art competitors.
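As a rough illustration of CUR-based residual scoring, one can approximate a data window from sampled columns and rows and score each instance by its reconstruction residual. In the sketch below, norm-proportional sampling and the per-row residual norm are our assumptions; the paper's selection and scoring rules may differ.

```python
# Illustrative CUR residual scoring: X ~ C @ pinv(W) @ R with W = C[rows, :].
import numpy as np

def cur_residual_scores(X, n_cols=10, n_rows=10, seed=0):
    # X: (n_instances, n_features); n_cols/n_rows must not exceed the dims
    rng = np.random.default_rng(seed)
    col_p = (X ** 2).sum(0); col_p = col_p / col_p.sum()
    row_p = (X ** 2).sum(1); row_p = row_p / row_p.sum()
    cols = rng.choice(X.shape[1], n_cols, replace=False, p=col_p)
    rows = rng.choice(X.shape[0], n_rows, replace=False, p=row_p)
    C, R = X[:, cols], X[rows, :]
    U = np.linalg.pinv(C[rows, :])             # standard CUR middle factor
    residual = X - C @ U @ R                   # reconstruction residual
    return np.linalg.norm(residual, axis=1)    # per-instance anomaly score
```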
7027: Dynamic Seed-GrowthCM: A Dynamic Benefit-Oriented Algorithm for Core Maximization on Large Graphs
Authors: Dongyuan Ma, Dongxiao He, Xin Huang
Location: Guangzhou | Day: TBD
Show Abstract
The k-core has garnered significant attention in recent research as an effective measure of node importance within a graph. A k-core is defined as the maximal induced subgraph where each node has a degree of at least k. This paper addresses the core maximization problem: given a graph G, an integer k, and a budget b, the objective is to insert b new distinct edges into G to maximize the size of its k-core. This problem is theoretically proven to be NP-hard and APX-hard. However, existing heuristic methods often struggle to achieve a good balance between efficiency and answer quality. In this paper, we propose a novel dynamic approach that, for the first time, uncovers the dynamic changes in node degrees. We introduce a new concept using the contribution of edges across different λ-shell components to the final solution. Based on these findings, we present the Dynamic Seed-GrowthCM method. This method selects the λ-shell component with the largest estimated benefit as the initial seed. In each iteration, depending on complete/partial growth, either a new seed is incorporated into the solution, or an existing seed undergoes growth, becoming a larger seed by adding connected components of the λ-shell component to the solution. Experimental results on ten datasets demonstrate that our algorithm significantly outperforms state-of-the-art methods in terms of solution quality on large graphs, while achieving high computational efficiency.
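The underlying k-core itself is computed by the classic peeling procedure: repeatedly delete nodes of degree below k until none remain. A plain-Python sketch of that primitive (not the paper's Dynamic Seed-GrowthCM algorithm):

```python
# k-core by peeling: repeatedly remove nodes of degree < k.
from collections import defaultdict

def k_core(edges, k):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for node in list(adj):
            if len(adj[node]) < k:
                for nb in adj.pop(node):
                    if nb in adj:              # avoid resurrecting popped nodes
                        adj[nb].discard(node)
                changed = True
    return set(adj)  # node set of the maximal subgraph with min degree >= k

# Example: k_core([(1, 2), (2, 3), (1, 3), (3, 4)], 2) == {1, 2, 3}
```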
7029: SpectralGap: Graph-Level Out-of-Distribution Detection via Laplacian Eigenvalue Gaps
Authors: Jiawei Gu, Ziyue Qiao, Zechao Li
Location: Guangzhou | Day: TBD
Show Abstract
The task of graph-level out-of-distribution (OOD) detection is crucial for deploying graph neural networks in real-world settings. In this paper, we observe a significant difference in the relationship between the largest and second-largest eigenvalues of the Laplacian matrix for in-distribution (ID) and OOD graph samples: OOD samples often exhibit anomalous spectral gaps (the difference between the largest and second-largest eigenvalues). This observation motivates us to propose SpecGap, an effective post-hoc approach for OOD detection on graphs. SpecGap adjusts features by subtracting the component associated with the second-largest eigenvalue, scaled by the spectral gap, from the high-level features (i.e., X − (λ_n − λ_{n−1}) u_{n−1} v_{n−1}^T). SpecGap achieves state-of-the-art performance across multiple benchmark datasets. We present extensive ablation studies and comprehensive theoretical analyses to support our empirical results. As a parameter-free post-hoc method, SpecGap can be easily integrated into existing graph neural network models without requiring any additional training or model modification.
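One plausible reading of this adjustment, taking u_{n−1} as the eigenvector of the second-largest Laplacian eigenvalue, removes the corresponding eigencomponent of the feature matrix scaled by the spectral gap. A NumPy sketch under that reading (the projection form is our assumption; the abstract does not fully specify u and v):

```python
# Assumed reading of the SpecGap adjustment: subtract the lambda_{n-1}
# eigencomponent of X, scaled by the spectral gap. Illustrative only.
import numpy as np

def specgap_adjust(X, L):
    # X: (n, d) node features; L: (n, n) symmetric graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    lam_n, lam_n1 = eigvals[-1], eigvals[-2]   # largest, second-largest
    u = eigvecs[:, -2:-1]                      # eigenvector of lambda_{n-1}, (n, 1)
    # Remove the lambda_{n-1} component of X, scaled by the spectral gap.
    return X - (lam_n - lam_n1) * (u @ (u.T @ X))
```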
7204: Dynamic Replanning for Improved Public Transport Routing
Authors: Abdallah Abuaisha, Bojie Shen, Daniel D. Harabor, Peter J. Stuckey, Mark Wallace
Location: Guangzhou | Day: TBD
Show Abstract
Delays in public transport are common, often impacting users through prolonged travel times and missed transfers. Existing solutions for handling delays remain limited; backup plans based on historical data miss opportunities for earlier arrivals, while snapshot planning accounts for current delays but not future ones. With the growing availability of live delay data, users can adjust their journeys in real-time. However, the literature lacks a framework that fully exploits this advantage for system-scale dynamic replanning. To address this, we formalise the dynamic replanning problem in public transport routing and propose two solutions: a "pull" approach, where users manually request replanning, and a novel "push" approach, where the server proactively monitors and adjusts journeys. Our experiments show that the push approach outperforms the pull approach, achieving significant speedups. The results also reveal substantial arrival time savings enabled by dynamic replanning.
7279: Latte: Transferring LLMs’ Latent-level Knowledge for Few-shot Tabular Learning
Authors: Ruxue Shi, Hengrui Gu, Hangting Ye, Yiwei Dai, Xu Shen, Xin Wang
Location: Guangzhou | Day: TBD
Show Abstract
Few-shot tabular learning, in which machine learning models are trained with a limited amount of labeled data, provides a cost-effective approach to addressing real-world challenges. The advent of Large Language Models (LLMs) has sparked interest in leveraging their pre-trained knowledge for few-shot tabular learning. Despite promising results, existing approaches either rely on test-time knowledge extraction, which introduces undesirable latency, or text-level knowledge, which leads to unreliable feature engineering. To overcome these limitations, we propose Latte, a training-time knowledge extraction framework that transfers the latent prior knowledge within LLMs to optimize a more generalized downstream model. Latte enables general knowledge-guided downstream tabular learning, facilitating the weighted fusion of information across different feature values while reducing the risk of overfitting to limited labeled data. Furthermore, Latte is compatible with existing unsupervised pre-training paradigms and effectively utilizes available unlabeled samples to overcome the performance limitations imposed by an extremely small labeled dataset. Extensive experiments on various few-shot tabular learning benchmarks demonstrate the superior performance of Latte, establishing it as a state-of-the-art approach in this domain. Our code is available at https://github.com/ruxueshi/Latte.git.
7315: Decoupling and Reconstructing: A Multimodal Sentiment Analysis Framework Towards Robustness
Authors: Mingzheng Yang, Kai Zhang, Yuyang Ye, Yanghai Zhang, Runlong Yu, Min Hou
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal sentiment analysis (MSA) has shown promising results but often poses significant challenges in real-world applications due to its dependence on complete and aligned multimodal sequences. While existing approaches attempt to address missing modalities through feature reconstruction, they often neglect the complex interplay between homogeneous and heterogeneous relationships in multimodal features. To address this problem, we propose Decoupled-Adaptive Reconstruction (DAR), a novel framework that explicitly addresses these limitations through two key components: (1) a mutual information-based decoupling module that decomposes features into common and independent representations, and (2) a reconstruction module that independently processes these decoupled features before fusion for downstream tasks. Extensive experiments on two benchmark datasets demonstrate that DAR significantly outperforms existing methods in both modality reconstruction and sentiment analysis tasks, particularly in scenarios with missing or unaligned modalities. Our results show improvements of 2.21% in binary classification accuracy and 3.9% in regression error compared to state-of-the-art baselines on the MOSEI dataset.
7323: Stochasticity-aware No-Reference Point Cloud Quality Assessment
Authors: Songlin Fan, Wei Gao, Zhineng Chen, Ge Li, Guoqing Liu, Qicheng Wang
Location: Guangzhou | Day: TBD
Show Abstract
The evolution of point cloud processing algorithms necessitates an accurate assessment of their quality. Previous works consistently regard point cloud quality assessment (PCQA) as a MOS regression problem and devise a deterministic mapping, ignoring the stochasticity in generating MOS from subjective tests. This work presents the first probabilistic architecture for no-reference PCQA, motivated by the labeling process of existing datasets. The proposed method can model the quality judging stochasticity of subjects through a tailored conditional variational autoencoder (CVAE) and produces multiple intermediate quality ratings. These intermediate ratings simulate the judgments from different subjects and are then integrated into an accurate quality prediction, mimicking the generation process of a ground truth MOS. Specifically, our method incorporates a Prior Module, a Posterior Module, and a Quality Rating Generator, where the former two modules are introduced to model the judging stochasticity in subjective tests, while the latter is developed to generate diverse quality ratings. Extensive experiments indicate that our approach outperforms previous cutting-edge methods by a large margin and exhibits gratifying cross-dataset robustness. Code is available at https://git.openi.org.cn/OpenPointCloud/nrpcqa.
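The aggregation idea is simple to illustrate: draw several stochastic quality ratings, mimicking different subjects, and average them into a MOS-like prediction. In the sketch below, rating_sampler is a hypothetical stand-in for the paper's CVAE-based Quality Rating Generator:

```python
# Sketch of rating aggregation; `rating_sampler` is a placeholder for a
# stochastic quality-rating generator, not the paper's CVAE modules.
import torch

@torch.no_grad()
def predict_mos(rating_sampler, features, n_subjects=15):
    # Draw several stochastic ratings, one per simulated subject...
    ratings = torch.stack([rating_sampler(features) for _ in range(n_subjects)])
    # ...and integrate them into a single MOS-like prediction.
    return ratings.mean(dim=0)
```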
7330: Solving QNP and FOND+ with Generating, Testing and Forbidding
Authors: Zheyuan Shi, Hao Dong, Yongmei Liu
Location: Guangzhou | Day: TBD
Show Abstract
Qualitative Numerical Planning (QNP) extends classical planning with numerical variables that can be changed by arbitrary amounts. FOND+ extends Fully Observable Non-Deterministic (FOND) planning by introducing explicit fairness assumptions, resulting in a more expressive model that can also capture QNP as a special case. However, existing QNP and FOND+ solvers still face significant scalability challenges. To address this, we propose a novel framework for solving QNP and FOND+ by generating strong cyclic solutions of the associated FOND problem, testing their validity, and forbidding non-solutions in subsequent searches. For this, we propose a procedure called SIEVE*, which generalizes the QNP termination testing algorithm SIEVE to determine whether a strong cyclic solution is a FOND+ solution. Additionally, we propose several optimization techniques to further improve the performance of our basic framework. We implemented our approach based on the advanced FOND solver PRP; experimental results show that our solver achieves superior scalability over existing QNP and FOND+ solvers.
7350: Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner
Authors: Yitong Zhou, Mingyue Cheng, Qingyang Mao, Jiahao Wang, Feiyang Xu, Xin Li
Location: Guangzhou | Day: TBD
Show Abstract
Pre-trained foundation models have recently made significant progress in table-related tasks such as table understanding and reasoning. However, recognizing the structure and content of unstructured tables using Vision Large Language Models (VLLMs) remains under-explored. To bridge this gap, we propose a benchmark based on a hierarchical design philosophy to evaluate the recognition capabilities of VLLMs in training-free scenarios. Through in-depth evaluations, we find that low-quality image input is a significant bottleneck in the recognition process. Drawing inspiration from this, we propose the Neighbor-Guided Toolchain Reasoner (NGTR) framework, which is characterized by integrating diverse lightweight tools for visual operations aimed at mitigating issues with low-quality images. Specifically, we transfer a tool selection experience from a similar neighbor to the input and design a reflection module to supervise the tool invocation process. Extensive experiments on public datasets demonstrate that our approach significantly enhances the recognition capabilities of the vanilla VLLMs. We believe that the benchmark and framework could provide an alternative solution to table recognition.
7378: MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
Authors: Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu
Location: Guangzhou | Day: TBD
Show Abstract
The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention due to the great potential for energy-efficient and high-performance computing paradigms. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. While existing methods propose spiking self-attention mechanisms that are successfully combined with SNNs, the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting features from different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture, which first uses multi-scale spiking attention (MSSA) to enrich the capability of spiking attention blocks. We validate our approach across various mainstream datasets. The experimental results indicate that our MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-based Transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.
7387: Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning
Authors: Hao Tang, Shengfeng He, Jing Qin
Location: Guangzhou | Day: TBD
Show Abstract
Few-shot learning (FSL) addresses the challenge of classifying novel classes with limited training samples. While some methods leverage semantic knowledge from smaller-scale models to mitigate data scarcity, these approaches often introduce noise and bias due to the data’s inherent simplicity. In this paper, we propose a novel framework, Synergistic Knowledge Transfer (SynTrans), which effectively transfers diverse and complementary knowledge from large multimodal models to empower the off-the-shelf few-shot learner. Specifically, SynTrans employs CLIP as a robust teacher and uses a few-shot vision encoder as a weak student, distilling semantic-aligned visual knowledge via an unsupervised proxy task. Subsequently, a training-free synergistic knowledge mining module facilitates collaboration among large multimodal models to extract high-quality semantic knowledge. Building upon this, a visual-semantic bridging module enables bi-directional knowledge transfer between visual and semantic spaces, transforming explicit visual and implicit semantic knowledge into category-specific classifier weights. Finally, SynTrans introduces a visual weight generator and a semantic weight reconstructor to adaptively construct optimal multimodal FSL classifiers. Experimental results on four FSL datasets demonstrate that SynTrans, even when paired with a simple few-shot vision encoder, significantly outperforms current state-of-the-art methods.
7398: RTdetector: Deep Transformer Networks for Time Series Anomaly Detection Based on Reconstruction Trend
Authors: Xinhong Liu, Xiaoliang Li, Yangfan Li, Fengxiao Tang, Ming Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Anomaly detection in multivariate time series data is critical across a variety of real-life applications. The predominant anomaly detection techniques currently rely on reconstruction-based methods. However, these methods often overfit abnormal patterns and fail to diagnose anomalies. Although some studies have attempted to prevent the incorrect fitting of anomalous data by enabling models to learn the trend of data variations, they fail to account for the dynamic nature of data distribution. This oversight can lead to the erroneous reconstruction of anomalies that do not exist. To address these challenges, we propose RTdetector, a Transformer-based time series anomaly detection model leveraging reconstruction trends. RTdetector employs a novel global attention mechanism based on reconstruction trends to learn distinguishable attention from the original sequence, thereby preserving the global trend information intrinsic to the time series. Additionally, it incorporates a self-conditioning Transformer based on reconstruction-trend enhancement to achieve superior predictive performance. Extensive experiments on four datasets demonstrate that RTdetector achieves state-of-the-art results in multivariate time series data anomaly detection. Our code is available at https://github.com/CSUFUNLAB/RTdetector.
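For context, the reconstruction-based paradigm these models build on scores each window by its reconstruction error and flags points above a threshold. A generic sketch, where model is any reconstruction network (not RTdetector itself):

```python
# Generic reconstruction-based anomaly scoring; illustrative paradigm only.
import torch

@torch.no_grad()
def anomaly_scores(model, windows):
    # windows: (B, T, D) multivariate time-series windows
    recon = model(windows)                        # model reconstructs its input
    return ((windows - recon) ** 2).mean(dim=-1)  # per-timestep error, (B, T)

def flag_anomalies(scores, threshold):
    return scores > threshold                     # boolean anomaly mask
```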
7408: Neuromorphic Sequential Arena: A Benchmark for Neuromorphic Temporal Processing
Authors: Xinyi Chen, Chenxiang Ma, Yujie Wu, Kay Chen Tan, Jibin Wu
Location: Guangzhou | Day: TBD
Show Abstract
Temporal processing is vital for extracting meaningful information from time-varying signals. Recent advancements in Spiking Neural Networks (SNNs) have shown immense promise in efficiently processing these signals. However, progress in this field has been impeded by the lack of effective and standardized benchmarks, which complicates the consistent measurement of technological advancements and limits the practical applicability of SNNs. To bridge this gap, we introduce the Neuromorphic Sequential Arena (NSA), a comprehensive benchmark that offers an effective, versatile, and application-oriented evaluation framework for neuromorphic temporal processing. The NSA includes seven real-world temporal processing tasks from a diverse range of application scenarios, each capturing rich temporal dynamics across multiple timescales. Utilizing NSA, we conduct extensive comparisons of recently introduced spiking neuron models and neural architectures, presenting comprehensive baselines in terms of task performance, training speed, memory usage, and energy efficiency. Our findings emphasize an urgent need for efficient SNN designs that can consistently deliver high performance across tasks with varying temporal complexities while maintaining low computational costs. NSA enables systematic tracking of advancements in neuromorphic algorithm research and paves the way for developing effective and efficient neuromorphic temporal processing systems.
7431: DASS: A Dual-Branch Attention-based Framework for Trajectory Similarity Learning with Spatial and Semantic Fusion
Authors: Jiayi Li, Junhua Fang, Pingfu Chao, Jiajie Xu, Pengpeng Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Trajectory similarity learning aims to identify pairs of similar trajectories, serving as a crucial operation in spatial-temporal data mining. Although several approaches have been proposed, they encounter the following two issues: 1) An overemphasis on spatial similarity in road networks, while the rich semantic information embedded in trajectories is not fully exploited; 2) A dependence on Recurrent Neural Network (RNN) architectures, which struggle to capture long-term dependencies. To address these limitations, we propose a Dual-branch Attention-based framework with Spatial and Semantic information (DASS) based on self-supervised learning. Specifically, DASS comprises two core components: 1) A trajectory representation module that models spatial-temporal adjacency relationships in the form of a graph and converts semantics into numerical embeddings. 2) A backbone encoder with a co-attention module that processes the two features independently before they are integrated. Extensive experiments on real-world datasets demonstrate that DASS outperforms state-of-the-art methods, establishing itself as a novel paradigm.
7459: Attribute Association Driven Multi-Task Learning for Session-based Recommendation
Authors: Xinyao Wang, Zhizhi Yu, Dongxiao He, Liang Yang, Jianguo Wei, Di Jin
Location: Guangzhou | Day: TBD
Show Abstract
Session-based Recommendation (SBR) aims to predict users’ next interaction based on their current session without relying on long-term profiles. Despite its effectiveness in privacy-preserving and real-time scenarios, SBR remains challenging due to limited behavioral signals. Prior methods often overfit co-occurrence patterns, neglecting semantic priors like item attributes. Recent studies have attempted to incorporate item attributes (e.g., category) by assigning fixed embeddings shared across all sessions. However, such approaches suffer from two key limitations: 1) Static attribute encoding fails to reflect semantic shifts under different session contexts. 2) Semantic misalignment between attribute and item ID embeddings. To address these issues, we propose attribute association driven multi-task learning for SBR, dubbed A²D-MTL. It explicitly models item categories using cross-session context to capture users’ potential interests and designs an adaptive sparse attention mechanism to suppress noise. Experimental results on three public datasets demonstrate the superiority of our method in recommendation accuracy (P@20) and ranking quality (MRR@20), validating the model’s effectiveness.
7487: CSAHFL: Clustered Semi-Asynchronous Hierarchical Federated Learning for Dual-layer Non-IID in Heterogeneous Edge Computing Networks
Authors: Aijing Li, Junping Du, Dandan Liu, Yingxia Shao, Tong Zhao, Guanhua Ye
Location: Guangzhou | Day: TBD
Show Abstract
Federated Learning (FL) enables collaborative model training across distributed devices without sharing raw data. Hierarchical Federated Learning (HFL) is a new paradigm of FL that leverages the Edge Servers (ESs) layer as an intermediary to perform partial local model aggregation in proximity, reducing core network transmission overhead. However, HFL faces new challenges: (1) The two-stage aggregation process between client-edge and edge-cloud results in a dual-layer non-IID issue, which may significantly compromise model training accuracy. (2) The heterogeneity and mobility of clients further impact model training efficiency. To address these challenges, we propose a novel Clustered Semi-Asynchronous Hierarchical Federated Learning (CSAHFL) framework that integrates adaptive semi-asynchronous intra-cluster aggregation at the client-edge layer and dynamic distribution-aware inter-cluster aggregation at the edge-cloud layer, collaboratively enhancing model performance and scalability in heterogeneous and mobile environments. We conduct experiments under varying degrees of dual-layer non-IID in both static and high-mobility scenarios. The results demonstrate significant advantages of CSAHFL over representative state-of-the-art methods.
7502: From General Relation Patterns to Task-Specific Decision-Making in Continual Multi-Agent Coordination
Authors: Chang Yao, Youfang Lin, Shoucheng Song, Hao Wu, Yuqing Ma, Sheng Han, Kai Lv
Location: Guangzhou | Day: TBD
Show Abstract
Continual Multi-Agent Reinforcement Learning (Co-MARL) requires agents to address catastrophic forgetting while learning new coordination policies with dynamic teams. In this paper, we delve into the core of Co-MARL, namely Relation Patterns, which refer to agents’ general understanding of interactions. In addition to generality, relation patterns exhibit task-specificity when mapped to different action spaces. To this end, we propose a novel method called General Relation Patterns-Guided Task-specific Decision-Maker (RPG). In RPG, agents extract relation patterns from dynamic observation spaces using a relation capturer. These task-agnostic relation patterns are then mapped to different action spaces via a task-specific decision-maker generated by a conditional hypernetwork. To combat forgetting, we further introduce regularization terms on both the relation capturer and the conditional hypernetwork. Results on SMAC and LBF demonstrate that RPG effectively prevents catastrophic forgetting when learning new tasks and achieves zero-shot generalization to unseen tasks.
7524: CLLMRec: Contrastive Learning with LLMs-based View Augmentation for Sequential Recommendation
Authors: Fan Lu, Xiaolong Xu, Haolong Xiang, Lianyong Qi, Xiaokang Zhou, Fei Dai, Wanchun Dou
Location: Guangzhou | Day: TBD
Show Abstract
Sequential recommendation generates embedding representations from historical user-item interactions to recommend the next potential interaction item. Due to the complexity and variability of historical user-item interactions, extracting effective user features is quite challenging. Recent studies have employed sequential networks such as time series networks and Transformers to capture the intricate dependencies and temporal patterns in historical user-item interactions, extracting more effective user features. However, limited by the scarcity and suboptimal quality of data, these methods struggle to capture subtle differences in user sequences, which diminishes recommendation accuracy. To address this issue, we propose a contrastive learning framework with LLMs-based view augmentation (CLLMRec), which effectively mines differences in behavioral sequences through sample generation. Specifically, CLLMRec utilizes Large Language Models (LLMs) to augment views and expand user behavior sequence representations, providing high-quality positive and negative samples. Subsequently, CLLMRec employs the augmented views for effective contrastive learning, capturing subtle differences in behavioral sequences to suppress interference from irrelevant noise. Experimental results on three public datasets demonstrate that the proposed method outperforms state-of-the-art baseline models and significantly enhances recommendation performance.
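The contrastive step in such view-augmentation frameworks is typically an InfoNCE-style objective over paired views. Below is a minimal sketch of that generic loss, not CLLMRec's exact formulation (the abstract does not give one); z1 and z2 would be encoder embeddings of an original sequence and its LLM-augmented counterpart:

    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, temperature=0.1):
        # z1, z2: (B, d) embeddings of paired views; row i of z1 and
        # row i of z2 come from the same user sequence (positives),
        # while all other rows serve as in-batch negatives.
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature   # (B, B) similarity matrix
        labels = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, labels)  # positives on the diagonal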
7578: MEGAD: A Memory-Efficient Framework for Large-Scale Attributed Graph Anomaly Detection
Authors: Yifan Zhang, Haolong Xiang, Xiaolong Xu, Zishun Rui, Xiaoyong Li, Lianyong Qi, Fei Dai
Location: Guangzhou | Day: TBD
Show Abstract
Graph anomaly detection (GAD), with its ability to accurately identify anomalous patterns in graph data, plays a vital role in areas such as network security, social media platforms, and fraud detection. Graph autoencoder-based methods are widely used for GAD due to their efficiency and effectiveness in capturing complex patterns and learning meaningful representations. However, these methods are constrained by hardware memory, hindering detection on large-scale graph data. In this paper, we propose a Memory-Efficient framework for large-scale attributed Graph Anomaly Detection (MEGAD). Specifically, MEGAD first generates node embeddings and then refines them through a lightweight joint optimization model, ensuring minimal memory overhead. The optimized embeddings are subsequently fed into a detector to compute anomaly scores. Extensive experiments demonstrate that our framework achieves comparable accuracy to state-of-the-art methods across multiple datasets while significantly reducing memory consumption on large-scale graphs.
7588: BinMetric: A Comprehensive Binary Code Analysis Benchmark for Large Language Models
Authors: Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, Nenghai Yu
Location: Guangzhou | Day: TBD
Show Abstract
Binary analysis is crucial for software security, offering insights into compiled programs without source code. As large language models (LLMs) excel in language tasks, their potential for decoding complex binary data structures is growing. However, the lack of standardized benchmarks hinders their evaluation and progress in this domain.
To bridge this gap, we introduce BinMetric, the first comprehensive benchmark designed specifically to evaluate LLM performance on binary analysis tasks. BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks, including decompilation, code summarization, etc., which reflect actual reverse engineering scenarios. Our empirical study on this benchmark investigates various state-of-the-art LLMs, revealing their strengths and limitations. The findings indicate that while LLMs show strong potential, challenges still exist, particularly in the areas of precise binary lifting and assembly synthesis. In summary, BinMetric takes a significant step forward in measuring the binary analysis capabilities of LLMs, establishing a new benchmark leaderboard, and our study offers valuable insights for advancing LLMs in software security.
7592: GPI-Net: Gestalt-Guided Parallel Interaction Network via Orthogonal Geometric Consistency for Robust Point Cloud Registration
Authors: Weikang Gu, Mingyue Han, Li Xue, Heng Dong, Changcai Yang, Riqing Chen, Lifang Wei
Location: Guangzhou | Day: TBD
Show Abstract
The accurate identification of high-quality correspondences is a prerequisite task in feature-based point cloud registration. However, it is extremely challenging to handle the fusion of local and global features due to feature redundancy and complex spatial relationships. Given that Gestalt principles provide key advantages in analyzing local and global relationships, we propose a novel Gestalt-guided Parallel Interaction Network via orthogonal geometric consistency (GPI-Net) in this paper. It utilizes Gestalt principles to facilitate complementary communication between local and global information. Specifically, we introduce an orthogonal integration strategy to optimally reduce redundant information and generate a more compact global structure for high-quality correspondences. To capture geometric features in correspondences, we leverage a Gestalt Feature Attention (GFA) block through a hybrid utilization of self-attention and cross-attention mechanisms. Furthermore, to facilitate the integration of local detail information into the global structure, we design an innovative Dual-path Multi-Granularity parallel interaction aggregation (DMG) block to promote information exchange across different granularities. Extensive experiments on various challenging tasks demonstrate the superior performance of our proposed GPI-Net in comparison to existing methods. The code will be released at https://github.com/XXX/GPI-Net.
7616: Temporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition
Authors: Ping Li, Jianan Ni, Bo Pang
Location: Guangzhou | Day: TBD
Show Abstract
Action recognition models using deep learning are vulnerable to adversarial examples, which are transferable across other models trained on the same data modality. Existing transferable attack methods face two major challenges: 1) they heavily rely on the assumption that the decision boundaries of the surrogate (a.k.a., source) model and the target model are similar, which limits the adversarial transferability; and 2) the decision boundary difference makes the attack direction uncertain, which may result in gradient oscillation, weakening the adversarial attack. This motivates us to propose a Background Mixup-induced Temporal Consistency (BMTC) attack method for action recognition. From the input transformation perspective, we design a model-agnostic background adversarial mixup module to reduce the surrogate-target model dependency. In particular, we randomly sample one video from each category and extract its background frame, then select the background frame with the strongest attack ability, via reinforcement learning, for mixup with the clean frames. Moreover, to ensure an explicit attack direction, we leverage the background category as guidance for updating the gradient of the adversarial example, and design a temporal gradient consistency loss, which strengthens the stability of the attack direction on subsequent frames. Empirical studies on two video datasets, i.e., UCF101 and Kinetics-400, and one image dataset, i.e., ImageNet, demonstrate that our method significantly boosts the transferability of adversarial examples across several action/image recognition models.
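The mixup operation itself is a simple convex combination; a minimal sketch under the assumption that one background frame is blended into every frame of a clean clip (BMTC additionally learns which background to pick via reinforcement learning, which is omitted here):

    import torch

    def background_mixup(clean_frames, bg_frame, lam=0.7):
        # clean_frames: (T, C, H, W) video clip; bg_frame: (C, H, W).
        # lam weighs the clean content against the injected background.
        return lam * clean_frames + (1.0 - lam) * bg_frame.unsqueeze(0)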
7622: External Memory Matters: Generalizable Object-Action Memory for Retrieval-Augmented Long-Term Video Understanding
Authors: Jisheng Dang, Huicheng Zheng, Xudong Wu, Jingmei Jiao, Bimei Wang, Jun Yang, Bin Hu, Jianhuang Lai, Tat Seng Chua
Location: Guangzhou | Day: TBD
Show Abstract
Long video understanding with Large Language Models (LLMs) enables the description of objects that are not explicitly present in the training data. However, continuous changes in known objects and the emergence of new ones require up-to-date knowledge of objects and their dynamics for effective understanding of the open world. To alleviate this, we propose an efficient Retrieval-Enhanced Video Understanding method, dubbed REVU, which leverages external knowledge to enhance the performance of open-world learning. First, REVU introduces an extensible external text-object memory with minimal text-visual mapping, involving static and dynamic multimodal information to help LLM-based models align text and vision features. Second, REVU retrieves object information from external databases and dynamically integrates frame-specific data from videos, enabling effective knowledge aggregation to comprehend the open world. Experiments on multiple benchmark video understanding datasets reveal that our model achieves state-of-the-art performance and robust generalization, demonstrating strong adaptability to out-of-domain data without requiring additional fine-tuning or re-training.
7628: Multi-Agent Communication with Information Preserving Graph Contrastive Learning
Authors: Wei Du, Shifei Ding, Wei Guo, Yuqing Sun, Guoxian Yu, Lizhen Cui
Location: Guangzhou | Day: TBD
Show Abstract
Recent research in cooperative Multi-Agent Reinforcement Learning (MARL) has shown significant interest in utilizing Graph Neural Networks (GNNs) for communication learning due to their strong ability to process feature and topological information of agents into message representations for downstream action selection and coordination. However, GNNs generally assume network homophily, i.e., that nodes of the same class tend to be interconnected. In real-world multi-agent systems, such assumptions are often unrealistic, as agents within the same class can be distant from each other. Furthermore, GNN-based MARL methods overlook the crucial role of the feature similarity of agents in action coordination, which also restricts their performance. To overcome these limitations, we propose a Multi-Agent communication mechanism with Information preserving graph contrastive Learning (MAIL), which enhances message representations by preserving the comprehensive features of adjacent agents while integrating topological information. Specifically, MAIL considers three distinct graph views, an original view, an agent feature view, and a global topological view, and performs contrastive learning across the three views to extract comprehensive information, effectively learning robust and expressive message representations for downstream tasks. Extensive experiments across various environments demonstrate that MAIL outperforms existing GNN-based MARL methods.
7631: Dirichlet Process-Based Robust Clustering Using the Median-of-Means Estimator
Authors: Supratik Basu, Jyotishka Ray Choudhury, Debolina Paul, Swagatam Das
Location: Guangzhou | Day: TBD
Show Abstract
Clustering stands as one of the most prominent challenges in unsupervised machine learning. Among centroid-based methods, the classic $k$-means algorithm, based on Lloyd’s heuristic, is widely used. Nonetheless, it is a well-known fact that $k$-means and its variants face several challenges, including heavy reliance on initial cluster centroids, susceptibility to converging into local minima of the objective function, and sensitivity to outliers and noise in the data. When data contains noise or outliers, the Median-of-Means (MoM) estimator offers a robust alternative for stabilizing centroid-based methods. On a different note, another limitation in many commonly used clustering methods is the need to specify the number of clusters beforehand. Model-based approaches, such as Bayesian nonparametric models, address this issue by incorporating infinite mixture models, eliminating the predefined cluster count requirement. Motivated by these facts, we propose an efficient and automatic clustering technique in this article by integrating the strengths of model-based and centroid-based methodologies. Our method mitigates the effect of noise on the quality of clustering while simultaneously estimating the number of clusters. Statistical guarantees on an upper bound of clustering error and rigorous assessment through simulated and real datasets suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
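For readers unfamiliar with the Median-of-Means estimator at the heart of this method, here is a minimal sketch of the generic MoM mean estimate (the paper's integration into Dirichlet-process clustering is not reproduced here):

    import numpy as np

    def median_of_means(x, n_blocks=10, seed=0):
        # x: (n, d) sample, possibly contaminated by outliers.
        # Split into blocks, average each block, then take the
        # coordinate-wise median of the block means.
        rng = np.random.default_rng(seed)
        blocks = np.array_split(x[rng.permutation(len(x))], n_blocks)
        block_means = np.stack([b.mean(axis=0) for b in blocks])
        return np.median(block_means, axis=0)

Substituting this estimator for the plain mean in centroid updates is the standard way MoM robustifies k-means-style procedures against outliers.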
7635: Diff-LMM: Diffusion Teacher-Guided Spatio-Temporal Perception for Video Large Multimodal Models
Authors: Jisheng Dang, Ligen Chen, Jingze Wu, Ronghao Lin, Bimei Wang, Yun Wang, Liting Wang, Nannan Zhu, Teng Wang
Location: Guangzhou | Day: TBD
Show Abstract
Dynamic spatio-temporal understanding is essential for video-based multimodal tasks, yet existing methods often struggle to capture fine-grained temporal and spatial relationships in long videos. Current approaches primarily rely on pre-trained CLIP encoders, which excel in semantic understanding but lack spatially-aware visual context. This leads to hallucinated results when interpreting fine-grained objects or scenes. To address these limitations, we propose a novel framework that integrates diffusion models into multimodal video models. By employing diffusion encoders at intermediate layers, we enhance visual representations through feature alignment and knowledge distillation losses, significantly improving the model’s ability to capture spatial patterns over time. Additionally, we introduce a multi-level alignment strategy to learn robust feature correspondence from pre-trained diffusion models. Extensive experiments on benchmark datasets demonstrate our approach’s state-of-the-art performance across multiple video understanding tasks. These results establish diffusion models as a powerful tool for enhancing multimodal video models in complex, dynamic scenarios.
7647: ARMR: Adaptively Responsive Network for Medication Recommendation
Authors: Feiyue Wu, Tianxing Wu, Shenqi Jing
Location: Guangzhou | Day: TBD
Show Abstract
Medication recommendation is a crucial task in healthcare, especially for patients with complex medical conditions. However, existing methods often struggle to effectively balance the reuse of historical medications with the introduction of new drugs in response to changing patient conditions. To address this challenge, we propose an Adaptively Responsive network for Medication Recommendation (ARMR), a new method which incorporates 1) a piecewise temporal learning component that distinguishes between recent and distant patient history, enabling more nuanced temporal understanding, and 2) an adaptively responsive mechanism that dynamically adjusts attention to new and existing drugs based on the patient’s current health state and medication history. Experiments on the MIMIC-III and MIMIC-IV datasets indicate that ARMR outperforms state-of-the-art baselines on different evaluation metrics, contributing to more personalized and accurate medication recommendations. The source code is publicly available at: https://github.com/seucoin/armr2.
7661: M3ANet: Multi-scale and Multi-Modal Alignment Network for Brain-Assisted Target Speaker Extraction
Authors: Cunhang Fan, Ying Chen, Jian Zhou, Zexu Pan, Jingjing Zhang, Youdian Gao, Xiaoke Yang, Zhengqi Wen, Zhao Lv
Location: Guangzhou | Day: TBD
Show Abstract
Brain-assisted target speaker extraction (TSE) aims to extract the attended speech from mixed speech by utilizing brain neural activities, for example, electroencephalography (EEG). However, existing models overlook the issue of temporal misalignment between the speech and EEG modalities, which hampers TSE performance. In addition, the speech encoder in current models typically uses basic temporal operations (e.g., one-dimensional convolution), which are unable to effectively extract target speaker information. To address these issues, this paper proposes a multi-scale and multi-modal alignment network (M3ANet) for brain-assisted TSE. Specifically, to eliminate the temporal inconsistency between the EEG and speech modalities, a modal alignment module that uses a contrastive learning strategy is applied to align the temporal features of both modalities. Additionally, to fully extract speech information, multi-scale convolutions with GroupMamba modules are used as the speech encoder, which scans speech features at each scale from different directions, enabling the model to capture deep sequence information. Experimental results on three publicly available datasets show that the proposed model outperforms current state-of-the-art methods across various evaluation metrics, highlighting the effectiveness of our proposed method. The source code is available at: https://github.com/fchest/M3ANet.
7669: WDMIR: Wavelet-Driven Multimodal Intent Recognition
Authors: Weiyin Gong, Kai Zhang, Yanghai Zhang, Qi Liu, Xinjie Sun, Junyu Lu, Linbo Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal intent recognition (MIR) seeks to accurately interpret user intentions by integrating verbal and non-verbal information across video, audio, and text modalities. While existing approaches prioritize text analysis, they often overlook the rich semantic content embedded in non-verbal cues. This paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal information. More specifically, we propose: (1) a wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain, enabling fine-grained analysis of temporal dynamics; (2) a cross-modal interaction mechanism that facilitates progressive feature enhancement from bimodal to trimodal integration, effectively bridging the semantic gap between verbal and non-verbal information. Extensive experiments on MIntRec demonstrate that our approach achieves state-of-the-art performance, surpassing previous methods by 1.13% in accuracy. Ablation studies further verify that the wavelet-driven fusion module significantly improves the extraction of semantic information from non-verbal sources, with a 0.41% increase in recognition accuracy when analyzing subtle emotional cues.
7683: Where Does This Data Come From? Enhanced Source Inference Attacks in Federated Learning
Authors: Haiyang Chen, Xiaolong Xu, Xiang Zhu, Xiaokang Zhou, Fei Dai, Yansong Gao, Xiao Chen, Shuo Wang, Hongsheng Hu
Location: Guangzhou | Day: TBD
Show Abstract
Federated learning (FL) enables collaborative model training without exposing raw data, offering a privacy-aware alternative to centralized learning. However, FL remains vulnerable to various privacy attacks that exploit shared model updates, including membership inference, property inference, and gradient inversion. Source inference attacks further threaten FL by identifying which client contributed a specific training sample, posing severe risks to user and institutional privacy. Existing source inference attacks mainly assume passive adversaries and overlook more realistic scenarios where the server actively manipulates the training process. In this paper, we present an enhanced source inference attack that demonstrates how a malicious server can amplify behavioral differences between clients to more accurately infer data origin. Our approach introduces active training manipulation and data augmentation to expose client-specific patterns. Experimental results across five representative FL algorithms and multiple datasets show that our method significantly outperforms prior passive attacks. These findings reveal a deeper level of privacy vulnerability in FL and call for stronger defense mechanisms under active threat models.
7685: Inferring Causal Protein Signaling Networks with Reinforcement Learning via Artificial Bee Colony Neural Architecture Search
Authors: Jihao Zhai, Junzhong Ji, Jinduo Liu
Location: Guangzhou | Day: TBD
Show Abstract
Inferring causal protein signaling networks from human immune system cellular data is an important approach to revealing underlying tissue signaling biology and dysfunction in diseased cells. In recent years, reinforcement learning (RL) methods have shown excellent performance in the field of causal protein signaling network inference. However, the complexity of RL models and the need for manual hyperparameter tuning can hinder performance. In this paper, we propose an actor-critic RL model via artificial bee colony (ABC) neural architecture search, called ABCNAS-RL. Specifically, the method is divided into two phases: ABC neural architecture search and actor-critic RL search. In phase one, we represent each bee as a set of hyperparameters and use the ABC algorithm to search for the optimal hyperparameters of the actor-critic RL model on the training set. In phase two, we use the actor-critic RL model to infer the causal protein signaling network on the test set. The actor network consists of an encoder-decoder architecture, composed of a transformer and a bidirectional gated recurrent unit (BiGRU) with an integrated attention mechanism. The critic network consists of a fully connected neural network that estimates the output state of the actor network. By maximizing cumulative rewards, we ultimately derive the causal protein signaling network. Extensive experimental results on simulated and real datasets verify that ABCNAS-RL outperforms the comparison methods.
7700: Always Clear Depth: Robust Monocular Depth Estimation Under Adverse Weather
Authors: Kui Jiang, Jing Cao, Zhaocheng Yu, Junjun Jiang, Jingchun Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Monocular depth estimation is critical for applications such as autonomous driving and scene reconstruction. While existing methods perform well under normal scenarios, their performance declines in adverse weather due to challenging domain shifts and difficulties in extracting scene information. To address this issue, we present a robust monocular depth estimation method called ACDepth from the perspective of high-quality training data generation and domain adaptation. Specifically, we introduce a one-step diffusion model for generating samples that simulate adverse weather conditions, constructing a multi-tuple degradation dataset during training. To ensure the quality of the generated degradation samples, we employ LoRA adapters to fine-tune the generation weights of the diffusion model. Additionally, we integrate a cycle consistency loss and adversarial training to guarantee the fidelity and naturalness of the scene contents. Furthermore, we develop a multi-granularity knowledge distillation strategy (MKD) that encourages the student network to absorb knowledge from both the teacher model and pretrained Depth Anything V2. This strategy guides the student model in learning degradation-agnostic scene information from various degradation inputs. In particular, we introduce an ordinal guidance distillation mechanism (OGD) that encourages the network to focus on uncertain regions through differential ranking, leading to more precise depth estimation. Experimental results demonstrate that our ACDepth surpasses md4all-DD by 2.50% for night scenes and 2.61% for rainy scenes on the nuScenes dataset in terms of the absRel metric.
7709: Imputation-free Incomplete Multi-view Clustering via Knowledge Distillation
Authors: Benyu Wu, Wei Du, Jun Wang, Guoxian Yu
Location: Guangzhou | Day: TBD
Show Abstract
Incomplete multi-view data presents a significant challenge for multi-view clustering (MVC). Existing incomplete MVC solutions commonly rely on data imputation to convert incomplete data into complete data. However, this paradigm suffers from the risk of error accumulation when clustering unreliable imputed data, causing suboptimal clustering performance. Moreover, using imputation to fill in missing data is inefficient, while inferring data categories based solely on the existing views is extremely challenging. To this end, we propose an Imputation-free Incomplete MVC method (I2MVC) via pseudo-supervised knowledge distillation. Specifically, I2MVC decomposes the incomplete MVC problem into two tasks: an MVC task for complete data and a pseudo-supervised classification task for fully incomplete data. A self-supervised simple contrastive Teacher network is trained for clustering complete data, and its knowledge is distilled into a lightweight pseudo-supervised Student network. The Student network, unrestricted by view completeness, further guides the clustering of fully incomplete data. Finally, the clustering results from both tasks are merged to generate the final clustering outcome. Experimental results on benchmark datasets demonstrate the effectiveness of I2MVC.
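The Teacher-to-Student transfer described above is, in its generic form, soft-label distillation; a minimal sketch of that standard loss (I2MVC's pseudo-supervised variant and its contrastive Teacher are not specified at this level of detail in the abstract):

    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, tau=2.0):
        # Match the Student's softened predictions to the Teacher's;
        # tau controls how soft the target distribution is.
        t = F.softmax(teacher_logits / tau, dim=1).detach()
        s = F.log_softmax(student_logits / tau, dim=1)
        return F.kl_div(s, t, reduction="batchmean") * tau * tau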
7711: Aligning Contrastive Multiple Clusterings with User Interests
Authors: Shan Zhang, Liangrui Ren, Jun Wang, Yanyu Xu, Carlotta Domeniconi, Guoxian Yu
Location: Guangzhou | Day: TBD
Show Abstract
Multiple clustering approaches aim to partition complex data in different ways. These methods often exhibit a one-to-many relationship in their results, and relying solely on the data context may be insufficient to capture the patterns relevant to the user. The user’s expectations are key to the multiple clustering task. Two main challenges exist: identifying the significant features that represent user interests, and aligning those interests with the clustering results. To address this issue, we propose Contrastive Multiple Clusterings (CMClusts), which extends contrastive learning to multiple clustering by elevating traditional instance-level contrast to clustering-level contrast. Furthermore, CMClusts integrates user expectations or interests by extracting desired features through tailored data augmentations, enabling the model to effectively capture user-relevant clustering features. Experimental results on benchmark datasets show that CMClusts can generate interpretable and high-quality clusterings that reflect different user interests.
7736: Diffuse&Refine: Intrinsic Knowledge Generation and Aggregation for Incremental Object Detection
Authors: Jianzhou Wang, Yirui Wu, Lixin Yuan, Wenxiao Zhang, Jun Liu, Junyang Chen, Huan Wang, Wenhai Wang
Location: Guangzhou | Day: TBD
Show Abstract
Incremental Object Detection (IOD) aims to progressively extend the capability of object detectors to recognize new classes. However, representation confusion between old and new classes leads to catastrophic forgetting. To alleviate this problem, we propose DiffKA, with intrinsic knowledge generated and aggregated by forward and backward diffusion, gradually establishing rigid class boundaries. With incremental streaming data, forward diffusion spreads information to generate potential inter-class associations among new- and old-class prototypes within a hierarchical tree, named the Intrinsic Correlation Tree (ICTree), to store intrinsic knowledge. Afterwards, backward diffusion refines and aggregates the generated knowledge in the ICTree, explicitly establishing rigid class boundaries to mitigate representation confusion. To keep semantic consistency under extreme IOD settings, we reorganize the semantic relevance of old- and new-class prototypes in paradigms to adaptively and effectively update DiffKA. Experiments on the MS COCO dataset show DiffKA achieves state-of-the-art performance on IOD tasks with significant advantages.
7737: ExpTalk: Diverse Emotional Expression via Adaptive Disentanglement and Refined Alignment for Speech-Driven 3D Facial Animation
Authors: Zhan Qu, Shengyu Zhang, Mengze Li, Zhuo Chen, Chengfei Lv, Zhou Zhao, Fei Wu
Location: Guangzhou | Day: TBD
Show Abstract
Speech-driven 3D facial animation aims to create lifelike facial expressions that synchronize accurately with speech. Despite significant progress, many existing methods focus on generating facial animation with a fixed emotional state, neglecting the diverse transformations of facial emotions under a given speech input. To solve this issue, we focus on exploring the refined alignment between speech representations and multiple domains of facial expression information. We aim to disentangle spoken-language and emotional facial priors from speech, to guide the refinement of the facial vertices based on speech. To accomplish this objective, we propose ExpTalk, which first applies an Adaptive Disentanglement Variational Autoencoder (AD-VAE) to decouple facial expression aligned with the spoken language and emotions of speech through contrastive learning. Then a Refined Alignment Diffusion (RAD) is employed to iteratively refine the decoupled facial expression priors through diffusion-based perturbations, producing facial animations that align with the emotional variations of the given speech. Extensive experiments prove the effectiveness of our ExpTalk, which surpasses state-of-the-art methods by a large margin.
7753: Cause-Effect Driven Optimization for Robust Medical Visual Question Answering with Language Biases
Authors: Huanjia Zhu, Yishu Liu, Xiaozhao Fang, Guangming Lu, Bingzhi Chen
Location: Guangzhou | Day: TBD
Show Abstract
Existing Medical Visual Question Answering (Med-VQA) models often suffer from language biases, where spurious correlations between question types and answer categories are inadvertently established. To address these issues, we propose a novel Cause-Effect Driven Optimization framework called CEDO, which incorporates three mechanisms, i.e., Modality-driven Heterogeneous Optimization (MHO), Gradient-guided Modality Synergy (GMS), and Distribution-adapted Loss Rescaling (DLR), to comprehensively mitigate language biases from both the causal and the effectual perspective. Specifically, MHO employs adaptive learning rates for specific modalities to achieve heterogeneous optimization, thus enhancing robust reasoning capabilities. Additionally, GMS leverages the Pareto optimization method to foster synergistic interactions between modalities and enforces gradient orthogonality to eliminate bias updates, thereby mitigating language biases from the effect side, i.e., shortcut bias. Furthermore, DLR is designed to assign adaptive weights to individual losses to ensure balanced learning across all answer categories, effectively alleviating language biases from the cause side, i.e., imbalance biases within datasets. Extensive experiments on multiple traditional and bias-sensitive benchmarks consistently demonstrate the robustness of CEDO over state-of-the-art competitors.
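Loss rescaling of the DLR kind is, in its simplest fixed form, inverse-frequency weighting of the per-class loss; a minimal baseline sketch (CEDO's DLR adapts its weights during training, which this fixed scheme does not capture):

    import torch
    import torch.nn.functional as F

    def rescaled_ce(logits, targets, class_counts):
        # Down-weight frequent answer categories, up-weight rare ones.
        w = 1.0 / class_counts.float().sqrt()
        w = w * (len(w) / w.sum())  # keep the average weight at 1
        return F.cross_entropy(logits, targets, weight=w)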
7755: Balancing Imbalance: Data-Scarce Urban Flow Prediction via Spatio-Temporal Balanced Transfer Learning
Authors: Xinyan Hao, Huaiyu Wan, Shengnan Guo, Youfang Lin
Location: Guangzhou | Day: TBD
Show Abstract
Advanced deep spatio-temporal networks have become the mainstream for traffic prediction, but the widespread adoption of these models is impeded by the prevalent scarcity of available data. Despite cross-city transfer learning emerging as a common strategy to address this issue, it overlooks the inherent distribution imbalances within each city, which could potentially hinder the generalization capabilities of pre-trained models. To overcome this limitation, we propose a Spatio-Temporal Balanced Transfer Learning (STBaT) framework to enhance existing spatio-temporal prediction networks, ensuring both universality and precision in predictions for new urban environments. A Regional Imbalance Acquisition Module is designed to model the regional imbalances of source cities. Besides, to promote generalizable knowledge acquisition, a Spatio-Temporal Balanced Learning Module is devised to balance the predictive learning process. Extensive experiments on real-world datasets validate the efficacy of our proposed approach compared with several state-of-the-art methods.
7756: Enhancing Nighttime Semantic Segmentation with Visual-Linguistic Priors and Wavelet Transform
Authors: Jianhou Zhou, Xiaolong Zhou, Sixian Chan, Zhaomin Chen, Xiaoqin Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Nighttime semantic segmentation is a critical yet challenging task in autonomous driving. Most existing methods are designed for daytime scenarios, resulting in poor nighttime performance due to texture loss and decreased object visibility. Low-light enhancement has been applied before segmentation, but it often fails to recover nighttime-specific details, introducing noise or losing delicate structures. Recent work shows that large-scale image-text pairs can effectively leverage natural language priors to guide visual representation, achieving remarkable performance across various downstream visual tasks. However, effectively employing visual-linguistic priors for nighttime semantic segmentation remains underexplored. To address these issues, we propose Text-WaveletFormer, a novel end-to-end framework that integrates text prompts and wavelet-based texture enhancement. Specifically, to compensate for the low recognizability of objects in nighttime scenes, we design a Text-Image Fusion Module (TIFM) that incorporates textual priors to improve nighttime object recognition. In addition, to alleviate the lack of texture details in nighttime conditions, we introduce a Wavelet Guided Texture Amplifier Module (WTAM) to fuse wavelet and raw image features via cross-attention, restoring low-light details. Finally, extensive experiments on benchmarks including NightCity, NightCity-fine, BDD100K, and CityScapes demonstrate our method’s superior performance over existing approaches.
7759: Representation Learning with Mutual Influence of Modalities for Node Classification in Multi-Modal Heterogeneous Networks
Authors: Jiafan Li, Jiaqi Zhu, Liang Chang, Yilin Li, Miaomiao Li, Yang Wang, Yi Yang, Hongan Wang
Location: Guangzhou | Day: TBD
Show Abstract
Nowadays, numerous online platforms can be described as multi-modal heterogeneous networks (MMHNs), such as Douban’s movie networks and Amazon’s product review networks. Accurately categorizing nodes within these networks is crucial for analyzing the corresponding entities, which requires effective representation learning on nodes. However, existing multi-modal fusion methods often adopt either early fusion strategies, which may lose the unique characteristics of individual modalities, or late fusion approaches, which overlook the cross-modal guidance in GNN-based information propagation. In this paper, we propose a novel model for node classification in MMHNs, named Heterogeneous Graph Neural Network with Inter-Modal Attention (HGNN-IMA). It learns node representations by capturing the mutual influence of multiple modalities during the information propagation process, within the framework of the heterogeneous graph transformer. Specifically, a nested inter-modal attention mechanism is integrated into the inter-node attention to achieve adaptive multi-modal fusion, and modality alignment is also taken into account to encourage propagation among nodes with consistent similarities across all modalities. Moreover, an attention loss is augmented to mitigate the impact of missing modalities. Extensive experiments validate the superiority of the model in the node classification task, providing an innovative view to handle multi-modal data, especially when accompanied by network structures. The full version including the Appendix is available at http://arxiv.org/abs/2505.07895.
7768: Mitigating Over-Smoothing in Graph Neural Networks via Separation Coefficient-Guided Adaptive Graph Structure Adjustment
Authors: Hanyang Meng, Jielong Yang, Li Peng
Location: Guangzhou | Day: TBD
Show Abstract
As the number of layers in Graph Neural Networks (GNNs) increases, over-smoothing becomes more severe, causing intra-class feature distances to shrink, while heterogeneous representations tend to converge. Most existing methods attempt to address this issue by employing heuristic shortcut mechanisms or optimizing objectives to constrain inter-class feature differences. However, these approaches fail to establish a theoretical connection between message passing and the variation in inter-class feature differences, making it challenging to design methods that target the key influencing factors. To address this gap, this paper first introduces the concept of the separation coefficient, which quantifies the contraction of feature distances between classes during multi-layer message passing. Based on this theory, we propose a low-complexity, pluggable, pseudo-label-based adaptive graph structure adjustment method. This approach effectively enhances the separation coefficient of inter-class features while maintaining intra-class compactness, thereby alleviating the convergence of heterogeneous representations caused by multi-layer aggregation. Experimental results demonstrate that the proposed method significantly improves the discriminability of node representations and enhances node classification performance across various datasets and foundational models.
7793: Semantic-Space-Intervened Diffusive Alignment for Visual Classification
Authors: Zixuan Li, Lei Meng, Guoqing Chao, Wei Wu, Yimeng Yang, Xiaoshuo Yan, Zhuang Qi, Xiangxu Meng
Location: Guangzhou | Day: TBD
Show Abstract
Cross-modal alignment is an effective approach to improving visual classification. Existing studies typically enforce a one-step mapping that uses deep neural networks to project the visual features to mimic the distribution of textual features. However, they typically face difficulties in finding such a projection due to the differences between the two modalities in both the distribution of class-wise samples and the range of their feature values. To address this issue, this paper proposes a novel Semantic-Space-Intervened Diffusive Alignment method, termed SeDA, which models a semantic space as a bridge in the visual-to-textual projection, considering that both types of features share the same class-level information in classification. More importantly, a bi-stage diffusion framework is developed to enable the progressive alignment between the two modalities. Specifically, SeDA first employs a Diffusion-Controlled Semantic Learner to model the semantic feature space of visual features by constraining the interactive features of the diffusion model and the category centers of visual features. In the later stage, the Diffusion-Controlled Semantic Translator focuses on learning the distribution of textual features from the semantic space. Meanwhile, the Progressive Feature Interaction Network introduces stepwise feature interactions at each alignment step, progressively integrating textual information into the mapped features. Experimental results show that SeDA achieves stronger cross-modal feature alignment, leading to superior performance over existing methods across multiple scenarios.
7815: General Incomplete Time Series Analysis via Patch Dropping Without Imputation
Authors: Yangyang Wu, Yi Yuan, Mengying Zhu, Xiaoye Miao, Meng Xi
Location: Guangzhou | Day: TBD
Show Abstract
Missing values in multivariate time series data present significant challenges to effective analysis. Existing methods for multivariate time series analysis either ignore missing data, sacrificing performance, or follow the impute-then-analyze paradigm, which suffers from redundant training and error accumulation, leading to biased results and suboptimal performance. In this paper, we propose INTER, a novel end-to-end framework for incomplete multivariate time series analysis, which bypasses imputation by leveraging pre-trained language models to learn the distribution of incomplete time series data. INTER incorporates two novel components: the missing-rate-aware time series patch-dropping (MPD) strategy and the missing-aware Transformer block, both of which we propose to enhance model generalization, robustness, and the ability to capture underlying patterns in the observed incomplete time series. Moreover, we theoretically prove that the MPD strategy exhibits lower sample variance for time series with the same dropout rate compared to other dropping strategies. Extensive experiments on 11 public real-world time series datasets demonstrate that INTER improves accuracy by over 20% compared to state-of-the-art methods, while maintaining competitive computational efficiency.
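One plausible reading of a missing-rate-aware patch-dropping rule is to drop patches with probability increasing in their fraction of missing values, so mostly-observed patches are kept more often. The sketch below illustrates that reading only; the exact MPD rule is defined in the paper, not the abstract:

    import numpy as np

    def mpd_keep_mask(patch_missing_rates, base_drop=0.3, seed=0):
        # patch_missing_rates: (P,) fraction of missing values in
        # each time series patch. Returns a boolean keep-mask.
        rng = np.random.default_rng(seed)
        drop_prob = np.clip(base_drop + patch_missing_rates, 0.0, 0.95)
        return rng.random(len(patch_missing_rates)) >= drop_prob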
7823: Do Mentioned Items Truly Matter? Enhancing Conversational Recommender Systems with Causal Intervention and Large Language Models
Authors: Lingzhi Wang, Xingshan Zeng, Kam-Fai Wong
Location: Guangzhou | Day: TBD
Show Abstract
Conversational Recommender Systems (CRS) have become increasingly important due to their ability to recommend items through interactive dialogue, adapting to user preferences in real time. Traditional CRS approaches face challenges in generating high-quality, diverse responses due to the limited availability of training data and the inherited biases from domain-specific fine-tuning. Furthermore, existing systems often overlook the impact of confounding variables during user interactions, leading to suboptimal recommendations. In this work, we propose a novel hybrid framework that integrates large language models (LLMs) with traditional recommendation techniques to address these limitations. Our approach leverages the strengths of LLMs in generating fluent, contextually appropriate responses while employing a traditional recommendation module to capture complex interaction structures. To ensure unbiased recommendations, we introduce causal interventions that disentangle confounding variables, improving recommendation accuracy. We evaluate our framework on established CRS datasets, demonstrating significant improvements in recommendation quality and response generation. Our results highlight the effectiveness of the causal intervention mechanism in producing more reliable and personalized recommendations, while the LLM-based response generation offers scalability across multiple domains.
7827: Empowering Vision Transformers with Multi-Scale Causal Intervention for Long-Tailed Image Classification
Authors: Xiaoshuo Yan, Zhaochuan Li, Lei Meng, Zhuang Qi, Wei Wu, Zixuan Li, Xiangxu Meng
Location: Guangzhou | Day: TBD
Show Abstract
Causal inference has emerged as a promising approach to mitigating long-tail classification by handling the biases introduced by class imbalance. However, along with the shift of advanced backbone models from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs), existing causal models may not achieve the expected performance gain. This paper investigates the influence of existing causal models on CNN and ViT variants, highlighting that ViT’s global feature representation makes it hard for causal methods to model associations between fine-grained features and predictions, which leads to difficulties in classifying tail classes with similar visual appearance. To address these issues, this paper proposes TSCNet, a two-stage causal modeling method that discovers fine-grained causal associations through multi-scale causal interventions. Specifically, in the hierarchical causal representation learning stage (HCRL), it decouples the background and objects, applying backdoor interventions at both the patch and feature levels to prevent the model from using class-irrelevant areas to infer labels, which enhances fine-grained causal representation. In the counterfactual logits bias calibration stage (CLBC), it refines the optimization of the model’s decision boundary by adaptively constructing a counterfactual balanced data distribution to remove the spurious associations in the logits caused by the data distribution. Extensive experiments conducted on various long-tail benchmarks demonstrate that the proposed TSCNet can eliminate the multiple biases introduced by data imbalance and outperforms existing methods.
7833: AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents
Authors: Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, Evgeny Burnaev
Location: Guangzhou | Day: TBD
Show Abstract
Advancements in the capabilities of Large Language Models (LLMs) have created a promising foundation for developing autonomous agents. With the right tools, these agents could learn to solve tasks in new environments by accumulating and updating their knowledge. Current LLM-based agents process past experiences using a full history of observations, summarization, or retrieval augmentation. However, these unstructured memory representations do not facilitate the reasoning and planning essential for complex decision-making. In our study, we introduce AriGraph, a novel method wherein the agent constructs and updates a memory graph that integrates semantic and episodic memories while exploring the environment. We demonstrate that our Ariadne LLM agent, consisting of the proposed memory architecture augmented with planning and decision-making, effectively handles complex tasks within interactive text game environments difficult even for human players. Results show that our approach markedly outperforms other established memory methods and strong RL baselines in a range of problems of varying complexity. Additionally, AriGraph demonstrates competitive performance compared to dedicated knowledge graph-based methods in static multi-hop question-answering.
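To make the semantic/episodic split concrete, here is a minimal, illustrative data structure (the names and structure are assumptions, not AriGraph's API): semantic memory holds (subject, relation, object) triples, while episodic memory records when each triple was observed.

    from collections import defaultdict

    class MemoryGraph:
        def __init__(self):
            self.triples = set()               # semantic memory
            self.episodes = defaultdict(list)  # triple -> observation steps

        def observe(self, subj, rel, obj, step):
            # Update both memories from a new observation.
            triple = (subj, rel, obj)
            self.triples.add(triple)
            self.episodes[triple].append(step)

        def about(self, entity):
            # Retrieve triples mentioning an entity, e.g., for planning.
            return [t for t in self.triples if entity in (t[0], t[2])]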
7842: Indirect Alignment and Relationship Preservation for Domain Generalization
Authors: Wei Wei, Zixiong Li, Jing Yan, Mingwen Shao, Lin Li
Location: Guangzhou | Day: TBD
Show Abstract
Domain generalization (DG) aims to train models on multiple source domains to generalize effectively to unseen target domains, addressing performance degradation caused by domain shifts. Many existing methods rely on direct feature alignment, which disrupts natural sequence relationships, causes misalignment and feature distortion, and leads to overfitting, especially with significant domain gaps. To tackle these issues, we propose a novel DG approach with two key modules: the Sample Difference Keeping (SDK) module, which preserves natural sequence relationships to enhance feature diversity and separability, and the Sample Consistency Alignment (SCA) module, which achieves indirect alignment by modeling inter-class and inter-domain relationship consistencies. This approach mitigates overfitting and misalignment, ensuring adaptability to significant domain gaps. Extensive experiments demonstrate that our framework consistently outperforms state-of-the-art methods.
7876: Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization
Authors: Zhuang Qi, Sijin Zhou, Lei Meng, Han Hu, Han Yu, Xiangxu Meng
Location: Guangzhou | Day: TBD
Show Abstract
Attribute bias in federated learning (FL) typically leads local models to optimize inconsistently due to the learning of non-causal associations, resulting in degraded performance. Existing methods either use data augmentation to increase sample diversity or knowledge distillation to learn invariant representations. However, they lack a comprehensive analysis of the inference paths, and interference from confounding factors limits their performance. To address these limitations, we propose the Federated Deconfounding and Debiasing Learning (FedDDL) method. It constructs a structured causal graph to analyze the model inference process and performs backdoor adjustment to eliminate confounding paths. Specifically, we design an intra-client deconfounding learning module for computer vision tasks to decouple background and objects, generating counterfactual samples that establish a connection between the background and any label, which stops the model from using the background to infer the label. Moreover, we design an inter-client debiasing learning module that constructs causal prototypes to reduce the proportion of the background in prototype components. Notably, it bridges the gap between heterogeneous representations via causal prototypical regularization. Extensive experiments on two benchmark datasets demonstrate that FedDDL significantly enhances the model’s capability to focus on main objects in unseen data, leading to 4.5% higher Top-1 Accuracy on average over 9 state-of-the-art methods.
7881: Can We Translate Code Better with LLMs and Call Graph Analysis?
Authors: Yang Luo
Location: Guangzhou | Day: TBD
Show Abstract
This paper proposes an innovative code translation method aimed at addressing the accuracy issues encountered by large language models (LLMs) when translating code in complex, large-scale software projects. The method utilizes the Language Server Protocol to obtain the call graph of the entire codebase and optimizes the input prompt of the LLM accordingly, significantly improving the correctness of translation at the compilation stage. Moreover, the method introduces a bridged-debuggers technique based on the Debug Adapter Protocol and dynamic test case generation, effectively fixing runtime errors. Experiments on multiple mainstream datasets demonstrate that, compared to existing code translation methods and LLMs, this method achieves a significant improvement in translation accuracy.
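As an illustration of call-graph-guided prompting (the template and helper names below are hypothetical, and the paper's actual prompt design may differ), one can translate functions in reverse topological order so each caller's prompt carries its callees' definitions, assuming an acyclic call graph:

    import networkx as nx

    def translation_prompts(call_graph, sources, target_lang="Rust"):
        # call_graph: nx.DiGraph with caller -> callee edges;
        # sources: dict mapping function name -> source text.
        prompts = []
        for fn in reversed(list(nx.topological_sort(call_graph))):
            ctx = "\n".join(sources[c] for c in call_graph.successors(fn))
            prompts.append(
                f"Translate `{fn}` to {target_lang}.\n"
                f"Callee definitions for context:\n{ctx}\n"
                f"Source:\n{sources[fn]}"
            )
        return prompts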
7893: DepthART: Monocular Depth Estimation as Autoregressive Refinement Task
Authors: Bulat Gabdullin, Nina Konovalova, Nikolay Patakin, Dmitry Senushkin, Anton Konushin
Location: Guangzhou | Day: TBD
Show Abstract
Monocular depth estimation has seen significant advances through discriminative approaches, yet their performance remains constrained by the limitations of training datasets. While generative approaches have addressed this challenge by leveraging priors from internet-scale datasets, with recent studies showing state-of-the-art results using fine-tuned text-to-image diffusion models, there is still room for improvement. Notably, autoregressive generative approaches, particularly Visual AutoRegressive modeling, have demonstrated superior results compared to diffusion models in conditioned image synthesis, while offering faster inference times.
In this work, we apply Visual Autoregressive Transformer (VAR) to the monocular depth estimation problem. However, the conventional GPT-2-style training procedure (teacher forcing) inherited by VAR yields suboptimal results for depth estimation. To address this limitation, we introduce DepthART – a novel training method formulated as a Depth Autoregressive Refinement Task. Unlike traditional VAR training with static inputs and targets, our method implements a dynamic target formulation based on model outputs, enabling self-refinement. By utilizing the model’s own predictions as inputs instead of ground truth token maps during training, we frame the objective as residual minimization, effectively reducing the discrepancy between training and inference procedures.
Our experimental results demonstrate that the proposed training approach significantly enhances the performance of VAR on depth estimation tasks. When trained on the Hypersim dataset using our approach, the model achieves superior results across multiple unseen benchmarks compared to existing generative and discriminative baselines.
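The self-refinement idea can be illustrated with a toy loop in which each stage conditions on the model's own previous prediction and is trained to minimize the remaining residual; the stand-in convolutional "model" and the fixed number of stages are placeholders for the VAR used in the paper.

```python
# Toy sketch of the self-refinement idea (a stand-in model, not the VAR from
# the paper): each stage conditions on the model's own previous prediction
# and is trained to predict the remaining residual toward the ground truth.
import torch
import torch.nn as nn

model = nn.Conv2d(2, 1, 3, padding=1)            # placeholder for the VAR
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

image = torch.randn(4, 1, 32, 32)                # conditioning input
gt_depth = torch.randn(4, 1, 32, 32)             # ground-truth depth

pred = torch.zeros_like(gt_depth)                # coarse initial estimate
for step in range(3):                            # refinement stages
    inp = torch.cat([image, pred.detach()], 1)   # own prediction as input,
    residual = model(inp)                        # not the ground-truth map
    pred = pred.detach() + residual
    loss = (pred - gt_depth).pow(2).mean()       # residual minimization
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"stage {step}: loss {loss.item():.4f}")
```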
7903: Wave-driven Graph Neural Networks with Energy Dynamics for Over-smoothing Mitigation
Authors: Peihan Wu, Hongda Qi, Sirong Huang, Dongdong An, Jie Lian, Qin Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Over-smoothing is a persistent challenge in Graph Neural Networks (GNNs), where node embeddings become indistinguishable as network depth increases, fundamentally limiting their effectiveness on tasks requiring fine-grained distinctions. This issue arises from the reliance on diffusion-based propagation mechanisms, which suppress high-frequency information essential for preserving feature diversity. To mitigate this, we propose a wave-driven GNN framework that redefines feature propagation through the wave equation. Unlike diffusion, the wave equation incorporates second-order dynamics, balancing smoothing with oscillatory behavior to retain high-frequency components while ensuring effective information flow. To enhance the stability and convergence of wave equation discretization on graphs, an energy-based mechanism inspired by kinetic and potential energy dynamics is introduced, balancing temporal evolution and structural alignment to stabilize propagation. Extensive experiments on benchmark datasets, including Cora, Citeseer, and PubMed, as well as real-world graphs, demonstrate that the proposed framework achieves state-of-the-art performance, effectively mitigating over-smoothing and enabling deeper, more expressive architectures. The code is available at https://github.com/rene0329/EWGNN/.
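The contrast with diffusion can be seen in a minimal numerical sketch: the explicit second-order update X_{t+1} = 2 X_t - X_{t-1} - c^2 L X_t keeps oscillatory modes that the first-order diffusion update X_{t+1} = X_t - c L X_t damps away. The toy graph and step count below are illustrative, and the paper's energy-based stabilizer is omitted.

```python
# Minimal sketch of wave-style propagation on a graph versus diffusion; the
# graph, wave speed, and step count are illustrative only.
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)              # path graph on 4 nodes
L = np.diag(A.sum(1)) - A                        # graph Laplacian
c2 = 0.3                                         # squared wave speed

x_prev = np.random.randn(4, 2)                   # node features at t-1
x_curr = x_prev.copy()                           # start at rest
for t in range(10):
    x_next = 2 * x_curr - x_prev - c2 * (L @ x_curr)   # second-order update
    x_prev, x_curr = x_curr, x_next

print(x_curr.std(0))   # feature variance survives, unlike pure diffusion
```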
7930: A Fast Neural Architecture Search Method for Multi-Modal Classification via Knowledge Sharing
Authors: Zhihua Cui, Shiwu Sun, Qian Guo, Xinyan Liang, Yuhua Qian, Zhixia Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Neural architecture search-based multi-modal classification (NAS-MMC) aims to automatically find optimal network structures that improve multi-modal classification performance. However, most current NAS-MMC methods are quite time-consuming during training. In this paper, we propose a knowledge sharing-based neural architecture search (KS-NAS) method for multi-modal classification. KS-NAS optimizes the search process by introducing a dynamically updated knowledge base that reduces computational resource consumption. Specifically, during the evolutionary search, individuals in the population acquire their initial parameters from the knowledge base and then undergo training and optimization until convergence, avoiding the need for training from scratch. The knowledge base is dynamically updated by aggregating the parameters of high-quality trained individuals in the population, progressively improving its quality. As the population evolves, the knowledge base continues to improve, ensuring that subsequent individuals obtain higher-quality initialization parameters, which significantly accelerates population training. Experimental results show that KS-NAS achieves state-of-the-art classification performance and training efficiency across multiple popular multi-modal tasks.
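A toy sketch of the knowledge-sharing loop follows, with a placeholder objective standing in for multi-modal classifier training: individuals are initialized from the knowledge base rather than from scratch, and the base is refreshed by averaging the parameters of the fittest individuals.

```python
# Toy sketch of the knowledge-sharing loop; the fitness and "training"
# functions are stand-ins, not a real multi-modal classifier.
import numpy as np

rng = np.random.default_rng(0)
dim, pop_size, elite = 16, 8, 3
knowledge_base = rng.normal(size=dim)            # shared initial parameters

def fitness(params):                             # placeholder objective
    return -np.sum((params - 1.0) ** 2)

def train(params):                               # placeholder "training"
    return params + 0.1 * (1.0 - params)         # drift toward the optimum

for generation in range(5):
    # Initialize from the knowledge base (plus mutation), not from scratch.
    population = [knowledge_base + 0.05 * rng.normal(size=dim)
                  for _ in range(pop_size)]
    population = [train(p) for p in population]
    ranked = sorted(population, key=fitness, reverse=True)
    # Aggregate the elites back into the knowledge base.
    knowledge_base = np.mean(ranked[:elite], axis=0)
    print(f"gen {generation}: best fitness {fitness(ranked[0]):.3f}")
```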
7957: DGL: Dynamic Global-Local Information Aggregation for Scalable VRP Generalization with Self-Improvement Learning
Authors: Yubin Xiao, Yuesong Wu, Rui Cao, Di Wang, Zhiguang Cao, Xuan Wu, Peng Zhao, Yuanshu Li, You Zhou, Yuan Jiang
Location: Guangzhou | Day: TBD
Show Abstract
The Vehicle Routing Problem (VRP) is a critical combinatorial optimization problem with wide-reaching real-world applications, particularly in logistics and transportation. While neural network-based VRP solvers have shown impressive results on test instances similar to the training data, their performance often degrades on varying scales and unseen distributions, limiting their practical applicability. To overcome these limitations, we introduce DGL (Dynamic Global-Local Information Aggregation), a novel model that combines global and local information to effectively solve VRPs. DGL dynamically adjusts local node selections within a localized range, capturing local invariance across problems of different scales and distributions and thereby enhancing generalization. At the same time, DGL integrates global context into the decision-making process, providing richer information for more informed decisions. Additionally, we propose a replacement-based self-improvement learning framework that leverages data augmentation and random replacement techniques, further enhancing DGL's robustness. Extensive experiments on synthetic datasets, benchmark datasets, and real-world country map instances demonstrate that DGL achieves state-of-the-art performance, particularly in generalizing to large-scale VRPs and real-world scenarios. These results showcase DGL's effectiveness on complex, realistic optimization challenges and highlight its potential for practical applications.
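As a loose illustration of combining local selection with global context (not DGL's learned model), the greedy decoder below restricts each step to the k nearest unvisited nodes and nudges the choice with a global summary of the remaining nodes; the scoring rule is a placeholder.

```python
# Toy sketch of local candidate selection with a global summary in a routing
# decoder; the hand-written score stands in for DGL's learned model.
import numpy as np

rng = np.random.default_rng(1)
coords = rng.random((20, 2))                     # customer coordinates
unvisited = set(range(1, 20))
tour, current, k = [0], 0, 5

while unvisited:
    # Local range: only the k nearest unvisited nodes are candidates.
    cand = sorted(unvisited,
                  key=lambda j: np.linalg.norm(coords[j] - coords[current]))[:k]
    global_ctx = coords[list(unvisited)].mean(0)       # global summary
    # Placeholder score: prefer near nodes, nudged toward the global centroid.
    scores = [-np.linalg.norm(coords[j] - coords[current])
              - 0.1 * np.linalg.norm(coords[j] - global_ctx) for j in cand]
    nxt = cand[int(np.argmax(scores))]
    tour.append(nxt)
    unvisited.remove(nxt)
    current = nxt

print(tour)
```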
7964: Improving Generalization in Meta-Learning via Meta-Gradient Augmentation
Authors: Ren Wang, Haoliang Sun, Yuxiu Lin, Xinxin Zhang, Yilong Yin
Location: Guangzhou | Day: TBD
Show Abstract
Meta-learning methods typically follow a two-loop framework, where each loop potentially suffers from notorious overfitting, hindering rapid adaptation and generalization to new tasks. Existing methods address this by enhancing the mutual exclusivity or diversity of training samples, but these data manipulation strategies are data-dependent and insufficiently flexible. This work proposes a data-independent Meta-Gradient Augmentation (MGAug) method from the perspective of gradient regularization. The key idea is first to break rote memorization by network pruning, addressing memorization overfitting in the inner loop, and then to use the gradients of the pruned sub-networks to augment the meta-gradients, alleviating overfitting in the outer loop. Specifically, we explore three pruning strategies: random width pruning, random parameter pruning, and a newly proposed catfish pruning, which measures a Meta-Memorization Carrying Amount (MMCA) score for each parameter and prunes high-scoring ones to break rote memorization. The proposed MGAug is theoretically supported by a generalization bound from the PAC-Bayes framework. Extensive experiments on multiple few-shot learning benchmarks validate MGAug's effectiveness and significant improvements over various meta-baselines.
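A toy sketch of the gradient-augmentation step: gradients computed through a randomly pruned copy of the network are mixed into the full gradient. Random parameter pruning is used here for simplicity; catfish pruning would replace the random mask with an MMCA-based one. The model, data, and mixing coefficient are illustrative.

```python
# Toy sketch of meta-gradient augmentation (not the authors' code): gradients
# through a randomly pruned copy of the network are mixed into the full
# gradient; model, data, and the 0.5 mixing weight are illustrative.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
x, y = torch.randn(16, 10), torch.randint(0, 2, (16,))
loss_fn = nn.CrossEntropyLoss()

def gradients(prune_ratio):
    """Loss gradients after temporarily zeroing a fraction of the weights."""
    backup = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        for p in model.parameters():
            p.mul_((torch.rand_like(p) > prune_ratio).float())
    model.zero_grad()
    loss_fn(model(x), y).backward()
    grads = [p.grad.clone() for p in model.parameters()]
    with torch.no_grad():                     # restore the unpruned weights
        for p, b in zip(model.parameters(), backup):
            p.copy_(b)
    return grads

full = gradients(0.0)                         # gradient of the full network
sub = gradients(0.3)                          # gradient of a pruned sub-network
augmented = [f + 0.5 * s for f, s in zip(full, sub)]
print([g.norm().item() for g in augmented])
```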
8000: DM-POSA: Enhancing Open-World Test-Time Adaptation with Dual-Mode Matching and Prompt-Based Open Set Adaptation
Authors: Shiji Zhao, Shao-Yuan Li, Chuanxing Geng, Sheng-Jun Huang, Songcan Chen
Location: Guangzhou | Day: TBD
Show Abstract
The need to generalize pre-trained deep learning models to unknown test-time data distributions has spurred research into test-time adaptation (TTA). Existing studies have mainly focused on closed-set TTA with only covariate shifts, while largely overlooking open-set TTA that involves semantic shifts, i.e., unknown open-set classes. However, addressing adaptation to unknown classes is crucial for open-world safety-critical applications such as autonomous driving. In this paper, we emphasize that accurately identifying open-set samples is rather challenging in TTA. The entanglement of semantic shift and covariate shift confounds the network's discriminative capability, and this co-interference is further exacerbated by the single-pass nature of the data and low-latency requirements. With this understanding, we propose Dual-mode Matching and Prompt-based Open Set Adaptation (DM-POSA) for open-set TTA to enhance discriminative feature learning and the distinguishing of unknown classes with minimal time cost. DM-POSA identifies open-set samples via dual-mode matching strategies, including model-parameter-based and feature-space-based matching. It also optimizes the model with a random pairing discrepancy loss, enhancing the distributional difference between open-set and closed-set samples and thus improving the model's ability to recognize unknown categories. Extensive experiments show the superiority of DM-POSA over state-of-the-art baselines on both closed-set class adaptation and open-set class detection.
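As a minimal illustration of the feature-space branch of dual-mode matching (the threshold, prototypes, and score are assumptions, not the paper's exact procedure): a test feature whose best prototype similarity falls below a threshold is flagged as open-set.

```python
# Minimal sketch of feature-space matching for open-set flagging; prototypes,
# threshold, and the cosine score are illustrative assumptions.
import torch
import torch.nn.functional as F

prototypes = F.normalize(torch.randn(10, 64), dim=1)   # closed-set classes

def classify(feature, threshold=0.3):
    sims = F.normalize(feature, dim=0) @ prototypes.T  # cosine similarities
    best, cls = sims.max(0)
    if best < threshold:
        return "open-set", None                        # route to detection
    return "closed-set", int(cls)                      # adapt as usual

print(classify(torch.randn(64)))
```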
8018: Parameterized Approximation Algorithm for Doubly Constrained Fair Clustering
Authors: Xiaoliang Wu, Qilong Feng, Junyu Huang, Jianxin Wang
Location: Guangzhou | Day: TBD
Show Abstract
Fair clustering has recently received considerable attention, and numerous distinct fairness notions have been developed. Despite being well-justified, these fairness notions are frequently studied in isolation, leaving open how they can be combined. Building on prior work, we focus on doubly constrained fair clustering, which incorporates two widely adopted demographic representation fairness notions in clustering: group fairness and data summarization fairness. Both notions extend the classical clustering formulation by associating each data point with a demographic label; group fairness requires each cluster to proportionally reflect the population-level distribution of demographic groups, while data summarization fairness ensures that the chosen facilities maintain the population-level demographic representation of each group. In this paper, we study Fixed-Parameter Tractable (FPT) approximation algorithms for doubly constrained fair clustering under the k-median objective, referred to as Df-k-Med. Previous algorithms typically enumerate different demographic groups or construct a fairness coreset, parameterized by both the number of opened facilities and the number of demographic labels. By further leveraging local fairness information, we propose a color-agnostic structural method whose parameterized result is independent of the number of demographic labels while effectively handling the combination of both fairness constraints. Specifically, we design a constant-factor approximation for the Df-k-Med problem with a fairness violation of at most one, which runs in FPT(k) time, where k is the number of opened facilities.
8041: Filling the Missings: Spatiotemporal Data Imputation by Conditional Diffusion
Authors: Wenying He, Jieling Huang, Junhua Gu, Ji Zhang, Yude Bai
Location: Guangzhou | Day: TBD
Show Abstract
Missing data in spatiotemporal systems presents a significant challenge for modern applications, ranging from environmental monitoring to urban traffic management. The integrity of spatiotemporal data often deteriorates due to hardware malfunctions and software failures in real-world deployments. Current approaches based on machine learning and deep learning struggle to model the intricate interdependencies between spatial and temporal dimensions effectively and, more importantly, suffer from cumulative errors during imputation, which propagate and amplify through iterations. To address these limitations, we propose CoFILL, a novel Conditional Diffusion Model for spatiotemporal data imputation. CoFILL builds on the inherent advantages of diffusion models to generate high-quality imputations without relying on potentially error-prone prior estimates. It incorporates an innovative dual-stream architecture that processes temporal and frequency-domain features in parallel. By fusing these complementary features, CoFILL captures both rapid fluctuations and underlying patterns in the data, enabling more robust imputation. Extensive experiments demonstrate that CoFILL's noise prediction network successfully transforms random noise into meaningful values that align with the true data distribution, and that CoFILL outperforms state-of-the-art methods in imputation accuracy. The source code is publicly available at https://github.com/joyHJL/CoFILL.
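A minimal sketch of a dual-stream block in this spirit: one stream convolves along the time axis while the other mixes channels in the frequency domain via an FFT, and the two are fused by addition. Layer sizes and the fusion rule are illustrative, not CoFILL's architecture.

```python
# Minimal sketch of a dual-stream block (sizes and fusion are illustrative,
# not CoFILL's architecture): a temporal convolution stream in parallel with
# an FFT-based frequency stream, fused by addition.
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.time_conv = nn.Conv1d(channels, channels, 3, padding=1)
        self.freq_mix = nn.Linear(channels, channels)  # channel mix per frequency

    def forward(self, x):                              # x: (batch, channels, time)
        t_feat = self.time_conv(x)                     # temporal stream
        spec = torch.fft.rfft(x, dim=-1)               # frequency stream
        mixed = torch.complex(
            self.freq_mix(spec.real.transpose(1, 2)).transpose(1, 2),
            self.freq_mix(spec.imag.transpose(1, 2)).transpose(1, 2))
        f_feat = torch.fft.irfft(mixed, n=x.shape[-1], dim=-1)
        return t_feat + f_feat                         # fuse the two streams

block = DualStreamBlock(8)
print(block(torch.randn(2, 8, 24)).shape)              # torch.Size([2, 8, 24])
```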
8054: Towards VLM-based Hybrid Explainable Prompt Enhancement for Zero-Shot Industrial Anomaly Detection
Authors: Weichao Cai, Weiliang Huang, Yunkang Cao, Chao Huang, Fei Yuan, Bob Zhang, Jie Wen
Location: Guangzhou | Day: TBD
Show Abstract
Zero-Shot Industrial Anomaly Detection (ZSIAD) aims to identify and localize anomalies in industrial images from unseen categories. Owing to their powerful generalization capabilities, Vision-Language Models (VLMs) have attracted growing interest in ZSIAD. To guide the model toward understanding and localizing semantically complex industrial anomalies, existing VLM-based methods provide additional prompts to the model through learnable text prompt templates. However, these zero-shot methods lack detailed descriptions of specific anomalies, making it difficult to accurately classify and segment the diverse range of industrial anomalies. To address this issue, we first propose a multi-stage prompt generation agent for ZSIAD. Specifically, we leverage a Multi-modal Large Language Model (MLLM) to articulate the detailed differential information between normal and test samples, which provides detailed text prompts to the model through further refinement and an anti-false-alarm constraint. Moreover, we introduce a Visual Foundation Model (VFM) to generate anomaly-related attention prompts for more accurate localization of anomalies with varying sizes and shapes. Extensive experiments on seven real-world industrial anomaly detection datasets show that the proposed method not only outperforms recent SOTA methods but also, through its explainable prompts, provides the model with a more intuitive basis for anomaly identification.
8098: Formal Synthesis of Safe Kolmogorov-Arnold Network Controllers with Barrier Certificates
Authors: Xiongqi Zhang, Ning Lv, Wang Lin, Zuohua Ding
Location: Guangzhou | Day: TBD
Show Abstract
Control barrier certificate generation is an efficient and powerful technique for the safe control of cyber-physical systems. Feed-forward neural networks (FNNs) are commonly used to synthesize control barrier certificates and safe controllers, but they struggle to address the challenges posed by high-dimensional complex systems. In this paper, we propose a novel method for generating control barrier certificates and controllers using Kolmogorov-Arnold Networks (KANs). Specifically, it uses KANs in place of FNNs as the template for control barrier certificates and controllers. Since KANs have learnable activation functions, they can efficiently improve representational power. The method then leverages the pruning and symbolization properties of KANs, which significantly simplify the network structure, allowing more efficient formal verification of the simplified candidate KAN control barrier certificates and controllers using Satisfiability Modulo Theories. We implement the tool KAN4CBC and evaluate its performance over a set of benchmarks. The experimental results demonstrate that our method handles systems of higher dimension and improves solution efficiency.
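For readers unfamiliar with barrier certificates, one common formulation for a controlled system with dynamics dx/dt = f(x) + g(x)u(x) is shown below; the exact side conditions verified by KAN4CBC may differ in detail.

```latex
% One common barrier-certificate formulation; KAN4CBC's exact conditions
% may differ in detail.
B(x) \le 0 \;\; \forall x \in X_{\mathrm{init}}, \qquad
B(x) > 0 \;\; \forall x \in X_{\mathrm{unsafe}}, \qquad
\nabla B(x) \cdot \bigl(f(x) + g(x)\,u(x)\bigr) \le 0 \;\; \forall x : B(x) = 0.
```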
8107: LPDetective: Dusting the LLM Chats for Prompt Template Abusers
Authors: Yang Luo, Qingni Shen, Zhonghai Wu
Location: Guangzhou | Day: TBD
Show Abstract
The abuse of LLM chatbot interfaces by web robots leads to significant waste of GPU and server resources, posing a serious security challenge. To address this issue, we propose LPDetective, an unsupervised method for detecting robot prompt templates. The method rests on the assumption that robot-generated text reuses the same or highly similar phrases and sentence structures across multiple sessions, unlike natural human conversations. We design a multi-stage workflow, including message grouping, text similarity measurement, hierarchical clustering analysis, and regular expression extraction, to automatically extract potential robot behavior patterns from chat logs. LPDetective requires neither predefined templates nor training data, enabling it to adaptively discover new, unknown patterns. We conduct systematic experiments on three large-scale real-world datasets: Bing Copilot, Wildchat, and ChatLog. The results show that LPDetective efficiently and accurately detects robot prompt templates across scenarios, achieving a 7.5% improvement in F1 score over the state-of-the-art XLNet method and reducing detection latency by a factor of 178 on the Bing Copilot dataset.
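The overall shape of such a pipeline can be sketched as follows; the vectorizer, distance threshold, and cluster-size cutoff are stand-ins for the paper's components, not its actual parameters.

```python
# Schematic sketch of a template-detection pipeline: embed messages, cluster
# them hierarchically, and flag dense clusters as candidate templates. All
# components and thresholds here are illustrative stand-ins.
from collections import Counter
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

messages = [
    "Write a poem about cats in the style of Shakespeare",
    "Write a poem about dogs in the style of Shakespeare",
    "Write a poem about rain in the style of Shakespeare",
    "How do I bake sourdough bread?",
]

X = TfidfVectorizer().fit_transform(messages).toarray()
Z = linkage(X, method="average", metric="cosine")      # hierarchical clustering
labels = fcluster(Z, t=0.4, criterion="distance")

for cluster_id, count in Counter(labels).items():
    if count >= 3:                        # repeated structure -> likely a bot
        members = [m for m, l in zip(messages, labels) if l == cluster_id]
        print("candidate template cluster:", members)
```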
8147: Understanding Matters: Semantic-Structural Determined Visual Relocalization for Large Scenes
Authors: Jingyi Nie, Liangliang Cai, Qichuan Geng, Zhong Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Scene Coordinate Regression (SCR) estimates 3D scene coordinates from 2D images and has become an important approach to visual relocalization. Existing methods achieve high localization accuracy in small scenes but still face substantial challenges in large-scale scenes, which usually exhibit significant variations in depth, scale, and occlusion. Although structure-guided scene partitioning is commonly adopted, over-partitioned elements and large feature variances within subscenes impede the estimation of 3D coordinates, introducing misleading information for subsequent processing. To address these issues, we propose the Semantic-Structural Determined Visual Relocalization method for SCR, which leverages semantic-structural partition learning and partition-determined pose refinement to better understand the semantic and structural information of large scenes. First, we partition the scene into small subscenes with label assignments, ensuring semantic consistency and structural continuity within each subscene; a classifier is then trained with sampling-based learning to predict these labels. Second, the partition predictions are encoded into embeddings and integrated with local features for intra-class compactness and inter-class separation, producing partition-aware features. To further decrease feature variances, we employ a discriminability metric and suppress ambiguous points, improving subsequent computations. Experimental results on the Cambridge Landmarks dataset demonstrate that the proposed method achieves significant improvements at lower training cost on large-scale scenes, reducing the median error by 38% compared to the state-of-the-art SCR method DSAC*. Code is available: https://gitee.com/VR_NAVE/ss-dvr.
8176: Q-Detection: A Quantum-Classical Hybrid Poisoning Attack Detection Method
Authors: Haoqi He, Xiaokai Lin, Jiancai Chen, Yan Xiao
Location: Guangzhou | Day: TBD
Show Abstract
Data poisoning attacks pose significant threats to machine learning models by introducing malicious data into the training process, thereby degrading model performance or manipulating predictions. Detecting and sifting out poisoned data is an important way to prevent such attacks, but classical computation frameworks may struggle as datasets grow larger and more complex. We introduce the speedup of quantum computing to the task of detecting data poisoning for the first time, presenting Q-Detection, a quantum-classical hybrid defense method for detecting poisoning attacks. Q-Detection also introduces the Quantum Weight-Assigning Network, which is optimized using quantum computing devices. Experimental results using multiple quantum simulation libraries show that Q-Detection effectively defends against label manipulation and backdoor attacks. The metrics demonstrate that Q-Detection consistently outperforms baseline methods and is comparable to the state of the art. Theoretical analysis shows that Q-Detection is expected to achieve more than a 20% speedup using quantum computing power.