Main track accepted papers

56: Multi-View Learning with Context-Guided Receptance for Image Denoising
Authors: Binghong Chen, Tingting Chai, Wei Jiang, Yuanrong Xu, Guanglu Zhou, Xiangqian Wu
Location: Guangzhou | Day: TBD
Image denoising is essential in low-level vision applications such as photography and automated driving. Existing methods struggle with distinguishing complex noise patterns in real-world scenes and consume significant computational resources due to reliance on Transformer-based models. In this work, the Context-guided Receptance Weighted Key-Value (CRWKV) model is proposed, combining enhanced multi-view feature integration with efficient sequence modeling. The Context-guided Token Shift (CTS) mechanism is introduced to effectively capture local spatial dependencies and enhance the model’s ability to model real-world noise distributions. Also, the Frequency Mix (FMix) module extracting frequency-domain features is designed to isolate noise in high-frequency spectra, and is integrated with spatial representations through a multi-view learning process. To improve computational efficiency, the Bidirectional WKV (BiWKV) mechanism is adopted, enabling full pixel-sequence interaction with linear complexity while overcoming the causal selection constraints. The model is validated on multiple real-world image denoising datasets, outperforming the state-of-the-art methods quantitatively and reducing inference time up to 40%. Qualitative results further demonstrate the ability of our model to restore fine details in various scenes. The code is publicly available at https://github.com/Seeker98/CRWKV.
57: Time-Frequency Disentanglement Boosted Pre-Training: A Universal Spatio-Temporal Modeling Framework
Authors: Yudong Zhang, Zhaoyang Sun, Xu Wang, Xuan Yu, Kai Wang, Yang Wang
Location: Guangzhou | Day: TBD
Current spatio-temporal modeling techniques largely rely on abundant data and the design of task-specific models. However, many cities lack well-established digital infrastructures, making data scarcity and the high cost of model development significant barriers to application deployment. Therefore, this work aims to enable spatio-temporal learning to cope with the problems of few-shot data modeling and model generalizability. To this end, we propose a Universal Spatio-Temporal Correlationship pre-training framework (USTC) for spatio-temporal modeling across different cities and tasks. To enhance the spatio-temporal representations during pre-training, we propose to decouple the time-frequency patterns within data, and leverage contrastive learning to maintain the time-frequency consistency. To further improve the adaptability to downstream tasks, we design a prompt generation module to mine personalized spatio-temporal patterns in the target city, which can be integrated with the learned common spatio-temporal representations to collaboratively serve downstream tasks. Extensive experiments conducted on real-world datasets demonstrate that USTC significantly outperforms the advanced baselines in forecasting, imputation, and extrapolation across cities.
62: Robustness to Spurious Correlations via Dynamic Knowledge Transfer
Authors: Xiaoling Zhou, Wei Ye, Zhemg Lee, Shikun Zhang
Location: Guangzhou | Day: TBD
Spurious correlations pose a significant challenge to the robustness of statistical models, often resulting in unsatisfactory performance when distributional shifts occur between training and testing data. To address this, we propose to transfer knowledge across spuriously correlated categories within the deep feature space. Specifically, samples’ deep features are enriched using semantic vectors extracted from both their respective category distributions and those of their spuriously correlated counterparts, enabling the generation of diverse class-specific factual and counterfactual augmented deep features. We then demonstrate the feasibility of optimizing a surrogate robust loss instead of conducting explicit augmentations by considering an infinite number of augmentations. As spurious correlations between samples and classes evolve during training, we develop a reinforcement learning-based training framework called Dynamic Knowledge Transfer (DKT) to facilitate dynamic adjustments in the direction and intensity of knowledge transfer. Within this framework, a target network is trained using the derived robust loss to enhance robustness, while a strategy network generates sample-wise augmentation strategies in a dynamic and automatic way. Extensive experiments validate the effectiveness of the DKT framework in mitigating spurious correlations, achieving state-of-the-art performance across three typical learning scenarios susceptible to such correlations.
63: MVP-CBM: Multi-layer Visual Preference-enhanced Concept Bottleneck Model for Explainable Medical Image Classification
Authors: Chunjiang Wang, Kun Zhang, Yandong Liu, Zhiyang He, Xiaodong Tao, S. Kevin Zhou
Location: Guangzhou | Day: TBD
The concept bottleneck model (CBM), a technique that improves interpretability by linking predictions to human-understandable concepts, makes high-risk and life-critical medical image classification credible. Typically, existing CBM methods associate the final layer of visual encoders with concepts to explain the model’s predictions. However, we empirically discover the phenomenon of concept preference variation, that is, concepts are preferentially associated with features at different layers rather than only with those at the final layer; yet a blind last-layer-based association neglects such a preference variation and thus weakens the accurate correspondences between features and concepts, impairing model interpretability. To address this issue, we propose a novel Multi-layer Visual Preference-enhanced Concept Bottleneck Model (MVP-CBM), which comprises two key novel modules: (1) intra-layer concept preference modeling, which captures the preferred association of different concepts with features at various visual layers, and (2) multi-layer concept sparse activation fusion, which sparsely aggregates concept activations from multiple layers to enhance performance. Thus, by explicitly modeling concept preferences, MVP-CBM can comprehensively leverage multi-layer visual information to provide a more nuanced and accurate explanation of model decisions. Extensive experiments on several public medical classification benchmarks demonstrate that MVP-CBM achieves state-of-the-art accuracy and interpretability, verifying its superiority. Code is available at https://github.com/wcj6/MVP-CBM.
66: Variational Graph Auto-Encoder Driven Graph Enhancement for Sequential Recommendation
Authors: Yuwen Liu, Lianyong Qi, Xingyuan Mao, Weiming Liu, Shichao Pei, Fan Wang, Xuyun Zhang, Amin Beheshti, Xiaokang Zhou
Location: Guangzhou | Day: TBD
Recommender systems play a critical role in many applications by providing personalized recommendations based on user interactions. However, it remains a major challenge to capture complex sequential patterns and address noise in user interaction data. While advanced neural networks have enhanced sequential recommendation by modeling high-order item dependencies, they typically treat noisy interaction data as the user’s true preferences. This assumption can lead to suboptimal recommendation results. We propose a Variational Graph Auto-Encoder driven Graph Enhancement (VGAE-GE) method for robust augmentation in sequential recommendation. Specifically, our method first constructs an item transition graph to capture higher-order interactions and employs a Variational Graph Auto-Encoder (VGAE) to generate latent variable distributions. By utilizing these latent variable distributions for graph reconstruction, we can improve the item representation. Next, we use a Graph Convolutional Network (GCN) to transform these latent variables into embeddings and infer more robust user representations from the updated item embeddings. Finally, we obtain the reconstructed user check-in data, and then use a Mamba-based recommender to make the recommendation process more efficient and the recommendation results more accurate. Extensive experiments on five public datasets demonstrate that our VGAE-GE model improves recommendation performance and robustness.
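The first step named in the abstract, building an item transition graph from interaction sequences, can be illustrated with a small sketch. This is not the VGAE-GE implementation: the function names, the toy sequences, and the choice to count only consecutive-item transitions are assumptions made for illustration.

```python
from collections import defaultdict


def build_item_transition_graph(sequences):
    """Count directed transitions between consecutive items in each sequence."""
    graph = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            graph[src][dst] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {src: dict(dsts) for src, dsts in graph.items()}


def normalize_rows(graph):
    """Turn raw transition counts into per-source transition probabilities."""
    normalized = {}
    for src, dsts in graph.items():
        total = sum(dsts.values())
        normalized[src] = {dst: count / total for dst, count in dsts.items()}
    return normalized


# Hypothetical toy interaction sequences (one list per user).
sequences = [["a", "b", "c"], ["a", "b", "d"], ["b", "c"]]
graph = build_item_transition_graph(sequences)
probs = normalize_rows(graph)
```

A graph of this kind would then serve as input to whatever encoder is used downstream; the abstract names a VGAE for that role, which is not reproduced here.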
70: Adversarial Propensity Weighting for Debiasing in Collaborative Filtering
Authors: Kuiyu Zhu, Tao Qin, Pinghui Wang, Xin Wang
Location: Guangzhou | Day: TBD
Debiased recommendation focuses on alleviating the negative impact of various biases on recommendation quality to achieve fairer personalized recommendations. Current research mainly relies on propensity score estimation or causal inference methods to alleviate selection bias; at the same time, research on prevalence bias has proposed a variety of methods based on causal graphs and contrastive learning. However, these methods have shortcomings in dealing with unstable propensity score estimates, bias interactions, and decoupling of interest and bias signals, which limits the performance improvement of recommender systems. To this end, this paper proposes APWCF, a debiasing method for collaborative filtering that combines dynamic propensity modeling and adversarial learning. APWCF addresses the high variance of propensity scores through a dynamic propensity factor, and decouples user interests from bias signals through adversarial learning to effectively remove multiple biases. Experiments show that APWCF significantly outperforms existing methods across various benchmark datasets from different domains. Compared with the current optimal baseline PDA, Recall@10 and NDCG@10 improve by 0.10%-5.42% and 1.01%-8.60% respectively.
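For readers unfamiliar with the two reported metrics, the sketch below computes Recall@k and NDCG@k for a single user with binary relevance. It follows one common convention (recall divided by the number of relevant items, logarithmic discount in base 2); the paper's exact evaluation protocol is not stated in the abstract, so treat these definitions and the toy data as assumptions.

```python
import math


def recall_at_k(ranked_items, relevant_items, k=10):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant_items:
        return 0.0
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    return hits / len(relevant_items)


def ndcg_at_k(ranked_items, relevant_items, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant_items
    )
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking (assumed to contain no duplicates) and held-out items.
ranked = ["i3", "i1", "i7", "i2"]
relevant = {"i1", "i2", "i9"}
recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
```

In practice both values are averaged over all test users before being compared across methods.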
71: GSDNet: Revisiting Incomplete Multimodality-Diffusion Emotion Recognition from the Perspective of Graph Spectrum
Authors: Yuntao Shou, Jun Yao, Tao Meng, Wei Ai, Cen Chen, Keqin Li
Location: Guangzhou | Day: TBD
Multimodal Emotion Recognition (MER) combines technologies from multiple fields (e.g., computer vision, natural language processing, and audio signal processing), aiming to infer an individual’s emotional state by analyzing information from different sources (i.e., video, audio, and text). Compared with a single modality, fusing complementary semantic information from different modalities lets the model obtain a more robust knowledge representation. However, the missing-modality problem limits the performance of MER in practical scenarios. Recent work has achieved impressive performance on modality completion using graph neural networks and diffusion models, respectively. This inspires us to combine these two dimensions in the completion network to obtain more powerful representation capabilities. However, we argue that directly running a full-rank score-based diffusion model on the entire graph adjacency matrix space may adversely affect the learning process of the diffusion model. This is because the model assumes a direct relationship between each pair of nodes and ignores local structural features and sparse connections between nodes, thereby significantly reducing the quality of the generated data. Based on the above ideas, we propose a novel Graph Spectral Diffusion Network (GSDNet), which utilizes a low-rank score-based diffusion model to map Gaussian noise to the graph spectral distribution space of missing modalities and recover the missing data according to its original distribution. Extensive experiments have demonstrated that GSDNet achieves state-of-the-art emotion recognition performance in various modality loss scenarios.
73: A Methodological Framework for Measuring Spatial Labeling Similarity
Authors: Yihang Du, Jiaying Hu, Suyang Hou, Yueyang Ding, Xiaobo Sun
Location: Guangzhou | Day: TBD
Spatial labeling assigns labels to specific spatial locations to characterize their spatial properties and relationships, with broad applications in scientific research and practice. Measuring the similarity between two spatial labelings is essential for understanding their differences and the contributing factors, such as changes in location properties or labeling methods. An adequate and unbiased measurement of spatial labeling similarity should consider the number of matched labels (label agreement), the topology of spatial label distribution, and the heterogeneous impacts of mismatched labels. However, existing methods often fail to account for all these aspects. To address this gap, we propose a methodological framework to guide the development of methods that meet these requirements.
Given two spatial labelings, the framework transforms them into graphs based on location organization, labels, and attributes (e.g., location significance). The distributions of their graph attributes are then extracted, enabling an efficient computation of distributional discrepancy to reflect the dissimilarity level between the two labelings. We further provide a concrete implementation of this framework, termed Spatial Labeling Analogy Metric (SLAM), along with an analysis of its theoretical foundation, for evaluating spatial labeling results in spatial transcriptomics (ST) as per their similarity with ground truth labeling. Through a series of carefully designed experimental cases involving both simulated and real ST data, we demonstrate that SLAM provides a comprehensive and accurate reflection of labeling quality compared to other well-established evaluation metrics. Our code is available at https://github.com/YihDu/SLAM.
79: Dual-Perspective United Transformer for Object Segmentation in Optical Remote Sensing Images
Authors: Yanguang Sun, Jiexi Yan, Jianjun Qian, Chunyan Xu, Jian Yang, Lei Luo
Location: Guangzhou | Day: TBD
Automatically segmenting objects from optical remote sensing images (ORSIs) is an important task. Most existing models are primarily based on either convolutional or Transformer features, each offering distinct advantages. Exploiting both advantages is a valuable research direction, but it presents several challenges, including the heterogeneity between the two types of features, high complexity, and a large number of model parameters. However, these issues are often overlooked in existing ORSI methods, causing sub-optimal segmentation. To address this, we propose a novel Dual-Perspective United Transformer (DPU-Former) with a unique structure designed to simultaneously integrate long-range dependencies and spatial details. In particular, we design the global-local mixed attention, which captures diverse information through two perspectives and introduces a Fourier-space merging strategy to obviate deviations for efficient fusion. Furthermore, we present a gated linear feed-forward network to increase the expressive ability. Additionally, we construct a DPU-Former decoder to aggregate and strengthen features at different layers. Consequently, the DPU-Former model outperforms the state-of-the-art methods on multiple datasets. Code: https://github.com/CSYSI/DPU-Former.
80: Rethinking Contrastive Learning in Graph Anomaly Detection: A Clean-View Perspective
Authors: Di Jin, Jingyi Cao, Xiaobao Wang, Bingdao Feng, Dongxiao He, Longbiao Wang, Jianwu Dang
Location: Guangzhou | Day: TBD
Graph anomaly detection aims to identify unusual patterns in graph-based data, with wide applications in fields such as web security and financial fraud detection. Existing methods typically rely on contrastive learning, assuming that a lower similarity between a node and its local subgraph indicates abnormality. However, these approaches overlook a crucial limitation: the presence of interfering edges invalidates this assumption, since it introduces disruptive noise that compromises the contrastive learning process. Consequently, this limitation impairs the ability to effectively learn meaningful representations of normal patterns, leading to suboptimal detection performance. To address this issue, we propose a Clean-View Enhanced Graph Anomaly Detection framework (CVGAD), which includes a multi-scale anomaly awareness module to identify key sources of interference in the contrastive learning process. Moreover, to mitigate bias from the one-step edge removal process, we introduce a novel progressive purification module. This module incrementally refines the graph by iteratively identifying and removing interfering edges, thereby enhancing model performance. Extensive experiments on five benchmark datasets validate the effectiveness of our approach.
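The assumption the abstract questions (lower similarity between a node and its local subgraph indicates abnormality) can be made concrete with a minimal sketch. This is not CVGAD: raw feature vectors stand in for learned embeddings, the local subgraph is reduced to the mean of the one-hop neighbors, and the toy graph is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def anomaly_scores(features, adjacency):
    """Score each node as 1 - cosine(node, mean of its neighbors)."""
    scores = {}
    for node, neighbors in adjacency.items():
        if not neighbors:
            scores[node] = 0.0  # no local context to compare against
            continue
        dim = len(features[node])
        mean = [
            sum(features[n][d] for n in neighbors) / len(neighbors)
            for d in range(dim)
        ]
        scores[node] = 1.0 - cosine(features[node], mean)
    return scores


# Nodes 0-2 have similar features; node 3 differs and is attached to node 2.
features = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [1.0, 0.1], 3: [0.0, 1.0]}
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
scores = anomaly_scores(features, adjacency)
most_anomalous = max(scores, key=scores.get)
```

In this toy graph node 3 receives the highest score, but the edge between nodes 2 and 3 also raises the score of the otherwise normal node 2, which is the kind of interference the abstract attributes to such edges.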
86: A Dynamic Knowledge Update-Driven Model with Large Language Models for Fake News Detection
Authors: Di Jin, Jun Yang, Xiaobao Wang, Junwei Zhang, Shuqi Li, Dongxiao He
Location: Guangzhou | Day: TBD
As the Internet and social media evolve rapidly, distinguishing credible news from a vast amount of complex information poses a significant challenge. Due to the suddenness and instability of news events, the authenticity labels of news can potentially shift as events develop, making it crucial for fake news detection to obtain the latest event updates. Existing methods employ retrieval-augmented generation to fill knowledge gaps, but they suffer from issues such as insufficient credibility of retrieved content and interference from noisy information. We propose a dynamic knowledge update-driven model for fake news detection (DYNAMO), which leverages knowledge graphs to achieve continuous updating of new knowledge and integrates with large language models to fulfill dual functions: news authenticity detection and verification of new knowledge correctness, solving the two key problems of ensuring the authenticity of new knowledge and deeply mining news semantics. Specifically, we first construct a news-domain-specific knowledge graph. Then, we use Monte Carlo Tree Search to decompose complex news and verify them step by step. Finally, we extract and update new knowledge from verified real news texts and reasoning paths. Experimental results demonstrate that DYNAMO achieves the best performance on two real-world datasets.
99: Multi-Modal Point Cloud Completion with Interleaved Attention Enhanced Transformer
Authors: Chenghao Fang, Jianqing Liang, Jiye Liang, Hangkun Wang, Kaixuan Yao, Feilong Cao
Location: Guangzhou | Day: TBD
Multi-modal point cloud completion, which utilizes a complete image and a partial point cloud as input, is a crucial task in 3D computer vision. Previous methods commonly employ a cross-attention mechanism to fuse point clouds and images. However, these approaches often fail to fully leverage image information and overlook the intrinsic geometric details of point clouds that could complement the image modality. To address these challenges, we propose an interleaved attention enhanced Transformer (IAET) with three main components, i.e., token embedding, bidirectional token supplement, and coarse-to-fine decoding. IAET incorporates a novel interleaved attention mechanism to enable bidirectional information supplementation between the point cloud and image modalities. Additionally, to maximize the use of the supplemented image information, we introduce a view-guided upsampling module that leverages image tokens as queries to guide the generation of detailed point cloud structures. Extensive experiments demonstrate the effectiveness of IAET, highlighting its state-of-the-art performance on multi-modal point cloud completion benchmarks in various scenarios. The source code is freely accessible at https://github.com/doldolOuO/IAET.
101: ActiveHAI: Active Collection Based Human-AI Diagnosis with Limited Expert Predictions
Authors: Xuehan Zhao, Jiaqi Liu, Xin Zhang, Zhiwen Yu, Bin Guo
Location: Guangzhou | Day: TBD
Recent studies indicate that human-AI collaboration performs better than either alone, particularly in medical diagnosis. Beyond collaboration methods that focus on assigning tasks to humans or AI, like deferral, combining human and AI decisions with their confidence scores is emerging as a promising strategy. Due to high cognitive load, doctors often struggle to provide confidence assessments, necessitating explicit human uncertainty evaluation through a limited number of additional expert predictions. This raises two challenges: (1) how to actively collect limited yet representative expert predictions, and (2) how to accurately evaluate human uncertainty with limited expert predictions. To address these challenges, we propose ActiveHAI, an active human-AI diagnosis method that reduces expert costs through a median-window sampling strategy, which actively selects representative samples near the estimated median, and evaluates expert confidence through an evaluator module that integrates sample features and expert predictions, converting them into probability distributions. Experiments on three real-world datasets show that ActiveHAI surpasses doctors and other human-AI methods by 16.3% and 3.6% in accuracy, respectively. Furthermore, ActiveHAI reaches 97.2% relative accuracy, even with just eight expert predictions per class.
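The idea of selecting samples near an estimated median can be sketched as follows. The abstract does not say which quantity the median is taken over, so this example assumes a hypothetical per-sample scalar score and simply picks the samples whose scores lie closest to the median; it should not be read as the ActiveHAI procedure.

```python
import statistics


def median_window_select(scores, budget):
    """Return indices of the `budget` samples whose scores are closest to the median."""
    median = statistics.median(scores)
    # Sort by distance to the median; ties are broken by index for determinism.
    ranked = sorted(range(len(scores)), key=lambda i: (abs(scores[i] - median), i))
    return sorted(ranked[:budget])


# Hypothetical per-sample scores (e.g. a model's confidence for one class).
scores = [0.05, 0.40, 0.52, 0.48, 0.95, 0.60, 0.10]
selected = median_window_select(scores, budget=3)
```

Only the selected indices would then be sent to an expert for labeling, which is how a budget such as "eight expert predictions per class" could be enforced.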
113: INFP: INdustrial Video Anomaly Detection via Frequency Prioritization
Authors: Qianzi Yu, Kai Zhu, Yang Cao, Yu Kang
Location: Guangzhou | Day: TBD
Industrial video anomaly detection aims to perform real-time analysis of video streams from industrial production lines and provide anomaly alerts. Conventional video anomaly detection methods focus more on the overall image, as they aim to identify anomalies among multiple normal samples appearing simultaneously. However, industrial scenarios, where the primary focus is on a single type of product, require attention to local areas to capture fine-grained details and specific patterns. Directly applying conventional methods to industrial scenarios can result in an inability to focus on products moving along fixed trajectories, ineffective utilization of their equidistant periodicity, and greater susceptibility to lighting variations. To address these issues, we propose FreqNet, an encoder-decoder framework that learns frequency-domain features from videos to capture periodic and dynamic characteristics, enhancing the model’s robustness. Specifically, a trajectory filter is proposed that takes advantage of the significant difference between moving objects and static backgrounds in the frequency domain by assigning higher weights to fixed moving trajectories. Moreover, a multi-feature fusion module is proposed, in which the frequency domain features of the video are first extracted to leverage the unique equidistant periodicity information of videos from industrial production lines. The extracted frequency domain features are subsequently fused with spatio-temporal features and contextual information is further integrated from the fused representation, effectively mitigating the impact of lighting variations on production lines. Extensive experiments on the benchmark IPAD dataset demonstrate the superiority of our proposed method over the state-of-the-art.
125: Optimizing Personalized Federated Learning Through Adaptive Layer-Wise Learning
Authors: Weihang Chen, Cheng Yang, Jie Ren, Zhiqiang Li, Zheng Wang
Location: Guangzhou | Day: TBD
Real-life deployment of federated learning (FL) often faces non-IID data, which leads to poor accuracy and slow convergence. Personalized FL (pFL) tackles these issues by tailoring local models to individual data sources and using weighted aggregation methods for client-specific learning. However, existing pFL methods often fail to provide each local model with global knowledge on demand while maintaining low computational overhead. Additionally, local models tend to over-personalize to their data during the training process, potentially dropping previously acquired global information. We propose FLAYER, a novel layer-wise learning method for pFL that optimizes local model personalization performance. FLAYER considers the different roles and learning abilities of neural network layers of individual local models. It incorporates global information for each local model as needed to initialize the local model cost-effectively. It then dynamically adjusts learning rates for each layer during local training, optimizing the personalized learning process for each local model while preserving global knowledge. Additionally, to enhance global representation in pFL, FLAYER selectively uploads parameters for global aggregation in a layer-wise manner. We evaluate FLAYER on four representative datasets in computer vision and natural language processing domains. Compared to eight state-of-the-art pFL methods, FLAYER improves the inference accuracy, on average, by 5.20% (up to 14.29%). Code is available at https://github.com/lancasterJie/FLAYER/.
128: Pre-defined Keypoints Promote Category-level Articulation Pose Estimation via Multi-Modal Alignment
Authors: Wenbo Xu, Li Zhang, Liu Liu, Yan Zhong, Haonan Jiang, Xue Wang, Rujing Wang
Location: Guangzhou | Day: TBD
Articulations are essential in everyday interactions, yet traditional RGB-based pose estimation methods often struggle with issues such as lighting variations and shadows. To overcome these challenges, we propose a novel Pre-defined keypoint based framework for category-level articulation pose estimation via multi-modal Alignment, coined PAGE. Specifically, we first propose a customized keypoint estimation method, aiming to avoid the divergent distance pattern between heuristically generated keypoints and visible points. In addition, to reduce the mutual information redundancy between point clouds and RGB images, we design the geometry-color alignment, which fuses the features after aligning two modalities. This is followed by decoding the radius for each visible point, and applying our proposal integration scoring strategy to predict keypoints. Ultimately, the framework outputs the per-part 6D pose of the articulation. We conduct extensive experiments to evaluate PAGE across a variety of datasets, from synthetic to real-world scenarios, demonstrating its robustness and superior performance.
137: Responsibility Gap in Collective Decision Making
Authors: Pavel Naumov, Jia Tao
Location: Montreal | Day: August 21st | Time: 10:00 | Session: KRR: Learning and reasoning
The responsibility gap is a set of outcomes of a collective decision-making mechanism in which no single agent is individually responsible. In general, when designing a decision-making process, it is desirable to minimise the gap.

The paper studies the class of mechanisms for which the gap is empty and proposes a concept of an elected dictatorship. It shows that, in a perfect information setting, the gap is empty if and only if the mechanism is an elected dictatorship. It also proves that in an imperfect information setting, the class of gap-free mechanisms is positioned strictly between two variations of the class of elected dictatorships.
149: Robust Misinformation Detection by Visiting Potential Commonsense Conflict
Authors: Bing Wang, Ximing Li, Changchun Li, Bingrui Zhao, Bo Fu, Renchu Guan, Shengsheng Wang
Location: Guangzhou | Day: TBD
The development of Internet technology has led to an increased prevalence of misinformation, causing severe negative effects across diverse domains. To mitigate this challenge, Misinformation Detection (MD), which aims to detect online misinformation automatically, has emerged as a rapidly growing research topic in the community. In this paper, we propose a novel plug-and-play augmentation method for the MD task, namely Misinformation Detection with Potential Commonsense Conflict (MD-PCC). We take inspiration from prior studies indicating that fake articles are more likely to involve commonsense conflict. Accordingly, we construct commonsense expressions for articles, which express potential commonsense conflicts inferred from the difference between the extracted commonsense triplets and the golden ones produced by the well-established commonsense reasoning tool COMET. These expressions are then specified for each article as augmentation. Any specific MD method can then be trained on those commonsense-augmented articles. In addition, we collect a novel commonsense-oriented dataset named CoMis, in which all fake articles are caused by commonsense conflict. We integrate MD-PCC with various existing MD backbones and compare them on four public benchmark datasets and CoMis. Empirical results demonstrate that MD-PCC can consistently outperform the existing MD baselines.
164: An Out-Of-Distribution Membership Inference Attack Approach for Cross-Domain Graph Attacks
Authors: Jinyan Wang, Liu Yang, Yuecen Wei, Jiaxuan Si, Chenhao Guo, Qingyun Sun, Xianxian Li, Xingcheng Fu
Location: Guangzhou | Day: TBD
Graph Neural Network-based methods face privacy leakage risks due to the introduction of topological structures about the targets, which allows attackers to bypass the target’s prior knowledge of the sensitive attributes and realize membership inference attacks (MIA) by observing and analyzing the topology distribution. As privacy concerns grow, the assumption of MIA, which presumes that attackers can obtain an auxiliary dataset with the same distribution, is increasingly deviating from reality. In this paper, we categorize the distribution diversity issue in real-world MIA scenarios as an Out-Of-Distribution (OOD) problem, and propose a novel Graph OOD Membership Inference Attack (GOOD-MIA) to achieve cross-domain graph attacks. Specifically, we construct shadow subgraphs with distributions from different domains to model the diversity of real-world data. We then explore the stable node representations that remain unchanged under external influences and consider eliminating redundant information from confounding environments and extracting task-relevant key information to more clearly distinguish between the characteristics of training data and unseen data. This OOD-based design makes cross-domain graph attacks possible. Finally, we perform risk extrapolation to optimize the attack’s domain adaptability during attack inference to generalize the attack to other domains. Experimental results demonstrate that GOOD-MIA achieves superior attack performance in datasets designed for multiple domains.
173: Enhancing Sampling Protocol for Point Cloud Classification Against Corruptions
Authors: Chongshou Li, Pin Tang, Tianrui Li, Yuheng Liu, Xinke Li
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: AI Ethics, Trust, Fairness (3/3)
Established sampling protocols for 3D point cloud learning, such as Farthest Point Sampling (FPS) and Fixed Sample Size (FSS), have long been relied upon. However, real-world data often suffer from corruptions, such as sensor noise, which violates the benign data assumption in current protocols. As a result, these protocols are highly vulnerable to noise, posing significant safety risks in critical applications like autonomous driving. To address these issues, we propose an enhanced point cloud sampling protocol, PointSP, designed to improve robustness against point cloud corruptions. PointSP incorporates key point reweighting to mitigate outlier sensitivity and ensure the selection of representative points. It also introduces a local-global balanced downsampling strategy, which allows for scalable and adaptive sampling while maintaining geometric consistency. Additionally, a lightweight tangent plane interpolation method is used to preserve local geometry while enhancing the density of the point cloud. Unlike learning-based approaches that require additional model training, PointSP is architecture-agnostic, requiring no extra learning or modification to the network. This enables seamless integration into existing pipelines. Extensive experiments on synthetic and real-world corrupted datasets show that PointSP significantly improves the robustness and accuracy of point cloud classification, outperforming state-of-the-art methods across multiple benchmarks.
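As background for the baseline protocol named in the abstract, here is a minimal sketch of standard Farthest Point Sampling (FPS), not of PointSP. The greedy rule is the usual one: repeatedly pick the point farthest from the set already selected. The toy point cloud, including the single far-away outlier, is hypothetical.

```python
def farthest_point_sampling(points, num_samples, start_index=0):
    """Greedy FPS: return indices of `num_samples` mutually distant points."""
    if not 0 < num_samples <= len(points):
        raise ValueError("num_samples must be between 1 and len(points)")

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    selected = [start_index]
    # min_dist[i] = squared distance from point i to its nearest selected point.
    min_dist = [sq_dist(p, points[start_index]) for p in points]
    while len(selected) < num_samples:
        next_index = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(next_index)
        for i, p in enumerate(points):
            d = sq_dist(p, points[next_index])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Four points near the origin plus one outlier, standing in for sensor noise.
points = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (50.0, 50.0, 50.0),
]
indices = farthest_point_sampling(points, 3)
```

Here the outlier at index 4 is chosen as the second sample because it is farthest from the start point, which illustrates the noise sensitivity the abstract attributes to established sampling protocols.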
186: SOTA: Spike-Navigated Optimal TrAnsport Saliency Region Detection in Composite-bias Videos
Authors: Wenxuan Liu, Yao Deng, Kang Chen, Xian Zhong, Zhaofei Yu, Tiejun Huang
Location: Guangzhou | Day: TBD
Existing saliency detection methods struggle in real-world scenarios due to motion blur and occlusions. In contrast, spike cameras, with their high temporal resolution, significantly enhance visual saliency maps. However, the composite noise inherent to spike camera imaging introduces discontinuities in saliency detection. Low-quality samples further distort model predictions, leading to saliency bias. To address these challenges, we propose Spike-navigated Optimal TrAnsport Saliency Region Detection (SOTA), a framework that leverages the strengths of spike cameras while mitigating biases in both spatial and temporal dimensions. Our method introduces Spike-based Micro-debias (SM) to capture subtle frame-to-frame variations and preserve critical details, even under minimal scene or lighting changes. Additionally, Spike-based Global-debias (SG) refines predictions by reducing inconsistencies across diverse conditions. Extensive experiments on real and synthetic datasets demonstrate that SOTA outperforms existing methods by eliminating composite noise bias. Our code and dataset will be released at https://github.com/lwxfight/sota.
190: Constructive Conflict-Driven Multi-Agent Reinforcement Learning for Strategic Diversity
Authors: Yuxiang Mai, Qiyue Yin, Wancheng Ni, Pei Xu, Kaiqi Huang
Location: Guangzhou | Day: TBD
In recent years, diversity has emerged as a useful mechanism to enhance the efficiency of multi-agent reinforcement learning (MARL). However, existing methods predominantly focus on designing policies based on individual agent characteristics, often neglecting the interplay and mutual influence among agents during policy formation. To address this gap, we propose Competitive Diversity through Constructive Conflict (CoDiCon), a novel approach that incorporates competitive incentives into cooperative scenarios to encourage policy exchange and foster strategic diversity among agents. Drawing inspiration from sociological research, which highlights the benefits of moderate competition and constructive conflict in group decision-making, we design an intrinsic reward mechanism using ranking features to introduce competitive motivations. A centralized intrinsic reward module generates and distributes varying reward values to agents, ensuring an effective balance between competition and cooperation. By optimizing the parameterized centralized reward module to maximize environmental rewards, we reformulate the constrained bilevel optimization problem to align with the original task objectives. We evaluate our algorithm against state-of-the-art methods in the SMAC and GRF environments. Experimental results demonstrate that CoDiCon achieves superior performance, with competitive intrinsic rewards effectively promoting diverse and adaptive strategies among cooperative agents.
196: Logic Distillation: Learning from Code Function by Function for Decision-making Tasks
Authors: Dong Chen, Shilin Zhang, Fei Gao, Yueting Zhuang, Siliang Tang, Qidong Liu, Mingliang Xu
Location: Guangzhou | Day: TBD
Large language models (LLMs) have garnered increasing attention owing to their powerful comprehension and generation capabilities. Generally, larger LLMs (L-LLMs) that require paid interfaces exhibit significantly superior performance compared to smaller LLMs (S-LLMs) that can be deployed on a variety of devices. Knowledge distillation (KD) aims to empower S-LLMs with the capabilities of L-LLMs, but S-LLMs merely mimic the outputs of L-LLMs and fail to acquire the decision-making capability needed for new situations. Consequently, S-LLMs are helpless when it comes to continuous decision-making tasks that require logical reasoning. To tackle these challenges, we propose a novel framework called Logic Distillation (LD). Initially, LD employs L-LLMs to instantiate complex instructions into discrete functions and illustrates their usage to establish a function base. Subsequently, LD fine-tunes S-LLMs based on the function base to learn the logic employed by L-LLMs in decision-making. During testing, S-LLMs will yield decision-making outcomes, function by function, based on current states. Experiments demonstrate that with the assistance of LD, S-LLMs can achieve outstanding results in continuous decision-making tasks, comparable to, or even surpassing, those of L-LLMs. The code and data for the proposed method are provided for research purposes at https://github.com/Anfeather/Logic-Distillation.
200: MTGIB-UNet: A Multi-Task Graph Information Bottleneck and Uncertainty Weighted Network for ADMET Prediction
Authors: Xuqiang Li, Wenjie Du, Jun Xia, Jianmin Wang, Xiaoqi Wang, Yang Yang, Yang Wang
Location: Guangzhou | Day: TBD
Accurate prediction of ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties is crucial in drug development, as these properties directly impact a drug’s efficacy and safety. However, existing multi-task learning models often face challenges related to noise interference and task conflicts when dealing with complex molecular structures. To address these issues, we propose a novel multi-task Graph Neural Network (GNN) model, MTGIB-UNet. The model begins by encoding molecular graphs to capture intricate molecular structure information. Subsequently, based on the Graph Information Bottleneck (GIB) principle, the model compresses the information flow by extracting subgraphs, retaining task-relevant features while removing noise for each task. These embeddings are then fused through a gated network that dynamically adjusts the contribution weights of auxiliary tasks to the primary task. Specifically, an uncertainty weighting (UW) strategy is applied, with additional emphasis placed on the primary task, allowing dynamic adjustment of task weights while strengthening the influence of the primary task on model training. Experiments on standard ADMET datasets demonstrate that our model outperforms existing methods. Additionally, the model shows good interpretability by identifying key molecular substructures related to specific ADMET endpoints.
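The uncertainty weighting (UW) strategy mentioned above is not detailed in the abstract. A common form of the idea, homoscedastic uncertainty weighting, gives each task a learnable log-variance s_i and combines losses as the sum of 0.5 * exp(-s_i) * L_i + 0.5 * s_i. The sketch below shows only that generic form; the paper's extra emphasis on the primary task is not reproduced, and all names and values are illustrative assumptions.

```python
import math


def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine task losses as sum_i 0.5 * exp(-s_i) * L_i + 0.5 * s_i.

    s_i = log(sigma_i^2) would be a learnable parameter per task in training;
    here it is passed in as a plain float for illustration.
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("one log-variance is required per task loss")
    total = 0.0
    for loss, s in zip(task_losses, log_variances):
        # Noisier tasks (large s) are down-weighted; the +0.5*s term stops
        # the optimizer from inflating s without bound.
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total


def task_weights(log_variances):
    """Effective multiplier applied to each task loss."""
    return [0.5 * math.exp(-s) for s in log_variances]


# Two hypothetical tasks with equal uncertainty, then one task made noisier.
equal = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])
weights = task_weights([0.0, math.log(4.0)])
```

With equal log-variances the two losses are simply averaged with weight 0.5 each; raising a task's variance to 4 cuts its weight to 0.125.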
206: MTPNet: Multi-Grained Target Perception for Unified Activity Cliff Prediction
Authors: Zishan Shu, Yufan Deng, Hongyu Zhang, Zhiwei Nie, Jie Chen
Location: Guangzhou | Day: TBD
Activity cliff prediction is a critical task in drug discovery and material design. Existing computational methods are limited to handling single binding targets, which restricts the applicability of these prediction models. In this paper, we present the Multi-Grained Target Perception network (MTPNet) to incorporate the prior knowledge of interactions between the molecules and their target proteins. Specifically, MTPNet is a unified framework for activity cliff prediction, which consists of two components: Macro-level Target Semantic (MTS) guidance and Micro-level Pocket Semantic (MPS) guidance. In this way, MTPNet dynamically optimizes molecular representations through multi-grained protein semantic conditions. To our knowledge, this is the first work to employ the receptor proteins as guiding information to effectively capture critical interaction details. Extensive experiments on 30 representative activity cliff datasets demonstrate that MTPNet significantly outperforms previous approaches, achieving an average RMSE improvement of 18.95% on top of several mainstream GNN architectures. Overall, MTPNet internalizes interaction patterns through conditional deep learning to achieve unified predictions of activity cliffs, helping to accelerate compound optimization and design. Code is available at: https://github.com/ZishanShu/MTPNet.
216: Point Cloud Mixture-of-Domain-Experts Model for 3D Self-supervised Learning
Authors: Yaohua Zha, Tao Dai, Hang Guo, Yanzi Wang, Bin Chen, Ke Chen, Shu-Tao Xia
Location: Guangzhou | Day: TBD
Point clouds, as a primary representation of 3D data, can be categorized into scene domain point clouds and object domain point clouds. Point cloud self-supervised learning (SSL) has become a mainstream paradigm for learning 3D representations. However, existing point cloud SSL primarily focuses on learning domain-specific 3D representations within a single domain, neglecting the complementary nature of cross-domain knowledge, which limits the learning of 3D representations. In this paper, we propose to learn a comprehensive Point cloud Mixture-of-Domain-Experts model (Point-MoDE) via a block-to-scene pre-training strategy. Specifically, we first propose a mixture-of-domain-expert model consisting of scene domain experts and multiple shared object domain experts. Furthermore, we propose a block-to-scene pretraining strategy, which leverages the features of point blocks in the object domain to regress their initial positions in the scene domain through object-level block mask reconstruction and scene-level block position regression. By integrating the complementary knowledge between object and scene, this strategy simultaneously facilitates the learning of both object-domain and scene-domain representations, leading to a more comprehensive 3D representation. Extensive experiments in downstream tasks demonstrate the superiority of our model.
220: Expanding the Category of Classifiers with LLM Supervision
Authors: Derui Lyu, Xiangyu Wang, Taiyu Ban, Lyuzhou Chen, Xiren Zhou, Huanhuan Chen
Location: Guangzhou | Day: TBD
Zero-shot learning has shown significant potential for creating cost-effective and flexible systems to expand classifiers to new categories. However, existing methods still rely on manually created attributes designed by domain experts. Motivated by the widespread success of large language models (LLMs), we introduce an LLM-driven framework for class-incremental learning that removes the need for human intervention, termed Classifier Expansion with Multi-vIew LLM knowledge (CEMIL). In CEMIL, an LLM agent autonomously generates detailed textual multi-view descriptions for unseen classes, offering richer and more flexible class representations than traditional expert-constructed vectorized attributes. These LLM-derived textual descriptions are integrated through a contextual filtering attention mechanism to produce discriminative class embeddings. Subsequently, a weight injection module maps the class embeddings to classifier weights, enabling seamless expansion to new classes. Experimental results show that CEMIL outperforms existing methods using expert-constructed attributes, demonstrating its effectiveness for fully automated classifier expansion without human participation.
221: A Structural Complexity Analysis of Hierarchical Task Network Planning
Authors: Cornelius Brand, Robert Ganian, Fionn Mc Inerney, Simon Wietheger
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Knowledge Representation and Reasoning (4/4)
We perform a refined complexity-theoretic analysis of three classical problems in the context of Hierarchical Task Network Planning: the verification of a provided plan, whether an executable plan exists, and whether a given state can be reached. Our focus lies on identifying structural properties which yield tractability. We obtain new polynomial algorithms for all three problems on a natural class of primitive networks, along with corresponding lower bounds. We also obtain an algorithmic meta-theorem for lifting polynomial-time solvability from primitive to general task networks, and prove that its preconditions are tight. Finally, we analyze the parameterized complexity of the three problems.
239: Towards Regularized Mixture of Predictions for Class-Imbalanced Semi-Supervised Facial Expression Recognition
Authors: Hangyu Li, Yixin Zhang, Jiangchao Yao, Nannan Wang, Bo Han
Location: Guangzhou | Day: TBD
Semi-supervised facial expression recognition (SSFER) effectively assigns pseudo-labels to confident unlabeled samples when only limited emotional annotations are available. Existing SSFER methods are typically built upon an assumption of the class-balanced distribution. However, they are far from real-world applications due to biased pseudo-labels caused by class imbalance. To alleviate this issue, we propose Regularized Mixture of Predictions (ReMoP), a simple yet effective method to generate high-quality pseudo-labels for imbalanced samples. Specifically, we first integrate feature similarity into the linear prediction to learn a mixture of predictions. Furthermore, we introduce a class regularization term that constrains the feature geometry to mitigate imbalance bias. Being practically simple, our method can be integrated with existing semi-supervised learning and SSFER methods to tackle the challenge associated with class-imbalanced SSFER effectively. Extensive experiments on four facial expression datasets demonstrate the effectiveness of the proposed method across various imbalanced conditions. The source code is made publicly available at https://github.com/hangyu94/ReMoP.
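To make the pseudo-label bias concrete, the sketch below shows plain confidence-thresholded pseudo-labeling, the generic semi-supervised step the abstract starts from. It is not ReMoP: the mixture of predictions and the class regularization term are not reproduced, and the threshold, function names, and toy probabilities are assumptions for illustration.

```python
def select_pseudo_labels(probabilities, threshold=0.95):
    """Return (sample_index, class_index) pairs for confident predictions."""
    selected = []
    for i, probs in enumerate(probabilities):
        # Index of the most probable class for this unlabeled sample.
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            selected.append((i, best))
    return selected


def class_counts(pseudo_labels, num_classes):
    """Count how many pseudo-labels each class received."""
    counts = [0] * num_classes
    for _, cls in pseudo_labels:
        counts[cls] += 1
    return counts


# Hypothetical softmax outputs for four unlabeled faces over three expressions.
# A model trained on imbalanced data tends to be confident only on class 0.
predictions = [
    [0.97, 0.02, 0.01],
    [0.55, 0.40, 0.05],
    [0.96, 0.03, 0.01],
    [0.10, 0.20, 0.70],
]
labels = select_pseudo_labels(predictions, threshold=0.95)
counts = class_counts(labels, num_classes=3)
```

Only the majority class passes the threshold in this toy batch, so the minority classes receive no pseudo-labels at all, which is the kind of bias the abstract attributes to class imbalance.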
240: Template3D-AD: Point Cloud Template Matching Method Based on Center Points for 3D Anomaly Detection
Authors: Yi Liu, Changsheng Zhang, Yufei Yang
Location: Guangzhou | Day: TBD
Existing 3D anomaly detection methods mainly include reconstruction-based methods and memory-based methods. However, reconstruction-based methods rely on anomaly simulation strategies, while the memory bank of memory-based methods cannot cover the features of all points. Different from existing methods, this paper proposes Template3D-AD, a 3D anomaly detection method based on template matching. Template3D-AD matches the test sample with the template based on center points, and extracts the global features and local features of the center point respectively. Considering that the appearance of anomalies is related to the change of surface shape, this paper proposes a curvature-based local feature representation method, which increases the feature difference between abnormal surfaces and normal surfaces. Then, this paper designs a global-local detection strategy, which combines global feature differences and local feature differences for anomaly detection. Extensive experiments show that Template3D-AD outperforms the state-of-the-art methods, achieving 84.4% (1.5% ↑) I-AUROC on the Real3D-AD dataset and 86.5% (11.6% ↑) I-AUROC on the Anomaly-ShapeNet dataset. Code at https://github.com/CaedmonLY/Template3D-AD.
245: Enhancing Semantic Clarity: Discriminative and Fine-grained Information Mining for Remote Sensing Image-Text Retrieval
Authors: Yu Liu, Haipeng Chen, Yuheng Liang, Yuheng Yang, Xun Yang, Yingda Lyu
Location: Guangzhou | Day: TBD
Remote sensing image-text retrieval is a fundamental task in remote sensing multimodal analysis, promoting the alignment of visual and language representations. The mainstream approaches commonly focus on capturing shared semantic representations between visual and textual modalities. However, the inherent characteristics of remote sensing image-text pairs lead to a semantic confusion problem, stemming from redundant visual representations and high inter-class similarity. To tackle this problem, we propose a novel Discriminative and Fine-grained Information Mining (DFIM) model, which aims to enhance semantic clarity by reducing visual redundancy and increasing the semantic gap between different classes. Specifically, the Dynamic Visual Enhancement (DVE) module adaptively enhances the visual discriminative features under the guidance of multimodal fusion information. Meanwhile, the Fine-grained Semantic Matching (FSM) module cleverly models the matching relationship between image regions and text words as an optimal transport problem, thereby refining intra-instance matching. Extensive experiments on two benchmark datasets justify the superiority of DFIM in terms of retrieval accuracy and visual interpretability over the leading methods.
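Casting region-word matching as an optimal transport problem, as the FSM module is described as doing, can be illustrated with entropy-regularized transport solved by Sinkhorn scaling. This is a generic sketch and not the DFIM implementation: the cost matrix, uniform masses, `epsilon`, and iteration count are all assumed values chosen for illustration.

```python
import math


def sinkhorn(cost, row_mass, col_mass, epsilon=0.5, iterations=200):
    """Entropy-regularized transport plan between rows and columns of `cost`."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low-cost pairs get large affinity.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical cost = 1 - similarity between 3 image regions and 2 text words.
cost = [
    [0.1, 0.9],  # region 0 resembles word 0
    [0.9, 0.1],  # region 1 resembles word 1
    [0.5, 0.5],  # region 2 is ambiguous
]
plan = sinkhorn(cost, row_mass=[1 / 3] * 3, col_mass=[0.5, 0.5])
row_sums = [sum(row) for row in plan]
col_sums = [sum(col) for col in zip(*plan)]
```

Each entry of `plan` is the amount of matching mass assigned to a region-word pair; the ambiguous region splits its mass between both words while the other two concentrate on their closest word.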
255: KnowRA: Knowledge Retrieval Augmented Method for Document-level Relation Extraction with Comprehensive Reasoning Abilities
Authors: Chengcheng Mai, Yuxiang Wang, Ziyu Gong, Hanxiang Wang, Yihua Huang
Location: Guangzhou | Day: TBD
Document-level relation extraction (Doc-RE) aims to extract relations between entities across multiple sentences. Therefore, Doc-RE requires more comprehensive reasoning abilities like humans, involving complex cross-sentence interactions between entities, contexts, and external general knowledge, compared to the sentence-level RE. However, most existing Doc-RE methods focus on optimizing single reasoning ability, but lack the ability to utilize external knowledge for comprehensive reasoning on long documents. To solve these problems, a knowledge retrieval augmented method, named KnowRA, was proposed with comprehensive reasoning to autonomously determine whether to accept external knowledge to assist Doc-RE. Firstly, we constructed a document graph for semantic encoding and integrated the co-reference resolution model to augment the co-reference reasoning ability. Then, we expanded the document graph into a document knowledge graph by retrieving the external knowledge base for common-sense reasoning and a novel knowledge filtration method was presented to filter out irrelevant knowledge. Finally, we proposed the axis attention mechanism to build direct and indirect associations with intermediary entities for achieving cross-sentence logical reasoning. Extensive experiments conducted on two datasets verified the effectiveness of our method compared to the state-of-the-art baselines. Our code is available at https://anonymous.4open.science/r/KnowRA.
257: Causal Learning Meet Covariates: Empowering Lightweight and Effective Nationwide Air Quality Forecasting
Authors: Jiaming Ma, Zhiqing Cui, Binwu Wang, Pengkun Wang, Zhengyang Zhou, Zhe Zhao, Yang Wang
Location: Guangzhou | Day: TBD
Air quality prediction plays a crucial role in the development of smart cities, garnering significant attention from both academia and industry. Current air quality prediction models encounter two major limitations: their high computational complexity limits scalability to nationwide datasets, and they often regard weather covariates as optional auxiliary information. In reality, weather covariates can have a substantial impact on air quality indices (AQI), exhibiting a significant causal association. In this paper, we first present a nationwide air quality dataset to address the lack of open-source, large-scale datasets in this field. Then we propose a causal learning model, CauAir, for air quality prediction that harnesses the powerful representation capabilities of the Transformer to explicitly model the causal association between weather covariates and AQI. To address the high complexity of traditional Transformers, we design CachLormer, which features two key innovations: a simplified architecture with redundant components removed, and a cache-attention mechanism that employs learnable embeddings for perceiving the causal association between AQI and weather covariates from a coarse-grained perspective. We use information theory to illustrate the superiority of the proposed model. Finally, experimental results on three datasets against 28 baselines demonstrate that our model achieves competitive performance, while maintaining high training efficiency and low memory consumption. The source code is available at CauAir Official Repository.
258: Deep Opinion-Unaware Blind Image Quality Assessment by Learning and Adapting from Multiple Annotators
Authors: Zhihua Wang, Xuelin Liu, Jiebin Yan, Jie Wen, Wei Wang, Chao Huang
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: Computer Vision (3/3)
Existing deep neural network (DNN)-based blind image quality assessment (BIQA) methods primarily rely on human-rated datasets for training. However, collecting human labels is extremely time-consuming and labor-intensive, posing a significant bottleneck for practical applications. To address this challenge, we propose a Deep opinion-Unaware BIQA model by learning and adapting from Multiple Annotators, termed DUBMA, thereby eliminating the need for human annotations. Specifically, we first generate a large-scale set of distorted image pairs and then assign relative quality rankings using existing full-reference IQA models. The resulting dataset is subsequently employed for training our DUBMA. Due to the inherent discrepancies between synthetic and real-world distortions, a domain shift may occur. To address this, we propose an outlier-robust unsupervised domain adaptation approach leveraging optimal transport. This strategy effectively reduces the gap between synthetic and real-world distortion domains, thereby boosting the model’s adaptability and overall performance. Extensive experiments show that DUBMA outperforms existing opinion-unaware BIQA methods in terms of prediction accuracy across multiple datasets.
274: Optimized View and Geometry Distillation from Multi-view Diffuser
Authors: Youjia Zhang, Zikai Song, Junqing Yu, Yawei Luo, Wei Yang
Location: Guangzhou | Day: TBD
Generating multi-view images from a single input view using image-conditioned diffusion models is a recent advancement and has shown considerable potential. However, issues such as the lack of consistency in synthesized views and over-smoothing in extracted geometry persist. Previous methods integrate multi-view consistency modules or impose additional supervision to enhance view consistency while compromising on the flexibility of camera positioning and limiting the versatility of view synthesis. In this study, we consider the radiance field optimized during geometry extraction as a more rigid consistency prior, compared to volume and ray aggregation used in previous works. We further identify and rectify a critical bias in the traditional radiance field optimization process through score distillation from a multi-view diffuser. We introduce an Unbiased Score Distillation (USD) that utilizes unconditioned noises from a 2D diffusion model, greatly refining the radiance field fidelity. We leverage the rendered views from the optimized radiance field as the basis and develop a two-step specialization process of a 2D diffusion model, which is adept at conducting object-specific denoising and generating high-quality multi-view images. Finally, we recover faithful geometry and texture directly from the refined multi-view images. Empirical evaluations demonstrate that our optimized geometry and view distillation technique generates comparable results to the state-of-the-art models trained on extensive datasets, all while maintaining freedom in camera positioning. Source code of our work is publicly available at: https://youjiazhang.github.io/USD/.
275: Mask Does Not Matter: A Unified Latent Diffusion-Enhanced Framework for Mask-Free Virtual Try-On
Authors: Chenghu Du, Junyin Wang, Kai Liu, Shengwu Xiong, Yi Rong
Location: Guangzhou | Day: TBD
A good virtual try-on model should introduce minimal redundant conditional information to avoid instability and increase inference efficiency. Existing methods rely on inpainting masks to guide the generation of the object, but the masks, generated by unstable human parsers, often produce unreliable results with fabric residues due to wrong segmentation. Moreover, large mask regions can lose spatial structure and identity information, requiring extra conditional inputs to compensate, which increases model instability and reduces efficiency. To tackle the problem, we present a novel Mask-Free virtual Try-ON (MFTON) framework. Specifically, we propose a mask-free strategy to eliminate all denoising conditions except for clothing and person images, thereby directly extracting spatial structure and identity information from the person image to improve efficiency and reduce instability. Additionally, to optimize the generated clothing regions, we propose a clothing texture-aware attention mechanism to enable the model to focus on texture generation with significant visual differences. We then introduce a geometric detail capture loss to further enable the model to capture more high-frequency information. Finally, we propose an appearance consistency inference method to reduce the initial randomness of the sampling process significantly. Extensive experiments on popular datasets demonstrate that our method outperforms state-of-the-art virtual try-on methods.
282: Poisoning-based Backdoor Attacks for Arbitrary Target Label with Positive Triggers
Authors: Binxiao Huang, Ngai Wong
Location: Montreal | Day: August 21st | Time: 10:00 | Session: CV: attacks
Poisoning-based backdoor attacks expose vulnerabilities during the data preparation phase of deep neural network (DNN) training. The DNNs trained on the poisoned dataset will be embedded with a backdoor, making them behave well on clean data while outputting malicious predictions whenever a trigger is applied. To exploit the abundant information contained in the input-to-label mapping, our scheme utilizes the network trained from the clean dataset as a trigger generator to produce poisons that significantly raise the success rate of backdoor attacks versus conventional approaches. Specifically, we introduce a new categorization of triggers inspired by adversarial techniques and propose a multi-label and multi-payload Poisoning-based backdoor attack with Positive Triggers (PPT), which strategically manipulates inputs to align them closer to the target label in the feature space of benign classifiers. Once the classifier is trained on the poisoned dataset, we can generate an input-label-aware trigger to make the infected classifier predict any given input to any target label with a high possibility. Through extensive experiments under both dirty-label and clean-label settings, we demonstrate empirically that the proposed attack achieves a high attack success rate without sacrificing accuracy across various datasets, including SVHN, CIFAR10, GTSRB, and Tiny ImageNet. Additionally, the PPT attack can elude a variety of classical backdoor defenses, proving its effectiveness.
291: Prompt-Free Conditional Diffusion for Multi-object Image Augmentation
Authors: Haoyu Wang, Lei Zhang, Wei Wei, Chen Ding, Yanning Zhang
Location: Guangzhou | Day: TBD
Diffusion models have underpinned many recent advances in dataset augmentation across various computer vision tasks. However, when generating multi-object images as in real scenarios, most existing methods either rely entirely on the text condition, resulting in a deviation between the generated objects and the original data, or rely too much on the original images, resulting in a lack of diversity in the generated images, which is of limited help to downstream tasks. To mitigate both problems at once, we propose a prompt-free conditional diffusion framework for multi-object image augmentation. Specifically, we introduce a local-global semantic fusion strategy to extract semantics from images to replace text, and inject knowledge into the diffusion model through LoRA to alleviate the category deviation between the original model and the target dataset. In addition, we design a reward-model-based counting loss to assist the traditional reconstruction loss for model training. By constraining the object counts of each category instead of imposing pixel-by-pixel constraints, this loss bridges the quantity deviation between the generated data and the original data while improving the diversity of the generated data. Experimental results demonstrate the superiority of the proposed method over several representative state-of-the-art baselines and showcase strong downstream task gain and out-of-domain generalization capabilities. Code is available at https://github.com/00why00/PFCD.
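The abstract names LoRA as the mechanism for injecting knowledge into the diffusion model. As background, the sketch below shows the generic low-rank update W + (alpha / r) * B A on a tiny weight matrix in pure Python. It is not the authors' code; the matrix sizes, `alpha`, and the helper names are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(weight, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    W (out x in) stays frozen; only B (out x rank) and A (rank x in)
    would be trained, which is far fewer parameters when rank is small.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Hypothetical 2x3 frozen weight with a rank-1 adapter.
frozen = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[1.0], [2.0]]          # out x rank
a = [[0.5, 0.0, -0.5]]      # rank x in
merged = lora_merge(frozen, b, a, alpha=2.0, rank=1)
trainable = len(b) * len(b[0]) + len(a) * len(a[0])
```

Here the adapter has 5 trainable values against 6 in the full matrix; the saving grows quickly for realistic layer sizes, where the rank is much smaller than either dimension.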
302: Enhancing Long-Tail Bundle Recommendations Utilizing Composition Pattern Modeling
Authors: Tianhui Ma, Shuyao Wang, Zhi Zheng, Hui Xiong
Location: Guangzhou | Day: TBD
Bundle recommendation aims to provide users with a one-stop service by offering a collection of related items. However, these systems face a significant challenge, where a small portion of bundles accumulate most interactions while the long-tail bundles receive few interactions. This imbalance leads to poor performance for long-tail bundles despite their potential to satisfy diverse user needs. Existing long-tail item recommendation methods fail to effectively address this problem, as long-tail bundle recommendation requires not only capturing the user-bundle interactions but also the item compositions in different bundles. Therefore, in this paper, we propose Composition-Aware Long-tail Bundle Recommendation (CALBRec), which leverages the inherent composition patterns shared across different bundles as valuable signals for further representation augmentation and recommendation enhancement. Specifically, to solve the complexity of modeling shared composition patterns due to the exponential explosion caused by the growing number of items and bundle sizes, we first introduce a composition-aware tail adapter to capture the shared composition patterns and then adaptively integrate them into individual bundle representations. Moreover, to mitigate the impact of noise in user-bundle interaction data, we propose to map the bundle representations into a set of learnable prototypes, and we further propose a prototype learning module to combine the composition patterns with interaction signals for tail bundles. Extensive experiments on three public datasets demonstrate that our method can improve the performance on bundle recommendation significantly, especially on the long-tail bundles.
303: INT: Instance-Specific Negative Mining for Task-Generic Promptable Segmentation
Authors: Jian Hu, Zixu Cheng, Shaogang Gong
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Computer Vision (2/3)
Task-generic promptable image segmentation aims to achieve segmentation of diverse samples under a single task description by utilizing only one task-generic prompt. Current methods leverage the generalization capabilities of Vision-Language Models (VLMs) to infer instance-specific prompts from these task-generic prompts in order to guide the segmentation process. However, when VLMs struggle to generalise to some image instances, the predicted instance-specific prompts become poor. To solve this problem, we introduce Instance-specific Negative Mining for Task-Generic Promptable Segmentation (INT). The key idea of INT is to adaptively reduce the influence of irrelevant (negative) prior knowledge whilst increasing the use of the most plausible prior knowledge, selected by negative mining with higher contrast, in order to optimise instance-specific prompt generation. Specifically, INT consists of two components: (1) instance-specific prompt generation, which progressively filters out incorrect information in prompt generation; (2) semantic mask generation, which ensures each image instance segmentation correctly matches the semantics of the instance-specific prompts. INT is validated on six datasets, including camouflaged objects and medical images, demonstrating its effectiveness, robustness and scalability.
309: HiTuner: Hierarchical Semantic Fusion Model Fine-Tuning on Text-Attributed Graphs
Authors: Zihan Fang, Zhiling Cai, Yuxuan Zheng, Shide Du, Yanchao Tan, Shiping Wang
Location: Guangzhou | Day: TBD
Text-Attributed Graphs (TAGs) are vital for modeling entity relationships across various domains. Graph Neural Networks have become a cornerstone for processing graph structures, while the integration of text attributes remains a prominent research topic. The development of Large Language Models (LLMs) provides new opportunities for advancing textual encoding in TAGs. However, LLMs face challenges in specialized domains due to their limited task-specific knowledge, and fine-tuning them for specific tasks demands significant resources. To cope with the above challenges, we propose HiTuner, a novel framework that leverages fine-tuned Pre-trained Language Models (PLMs) with domain expertise as a tuner to enhance the hierarchical LLM contextualized representations for modeling TAGs. Specifically, we first strategically select hierarchical hidden states of the LLM to form a set of diverse and complementary descriptions as input for the sparse projection operator. Concurrently, a hybrid representation learning scheme is developed to amalgamate the broad linguistic comprehension of LLMs with task-specific insights of the fine-tuned PLMs. Finally, HiTuner employs a confidence network to adaptively fuse the semantically-augmented representations. Empirical results across benchmark datasets spanning various domains validate the effectiveness of the proposed framework. Our code is available at: https://github.com/ZihanFang11/HiTuner
310: Multi-Sourced Compositional Generalization in Visual Question Answering
Authors: Chuanhao Li, Wenbo Ye, Zhen Li, Yuwei Wu, Yunde Jia
Location: Guangzhou | Day: TBD
Show Abstract
Compositional generalization is the ability to generalize to novel compositions of seen primitives, and has received much attention in vision-and-language (V&L) recently. Due to the multi-modal nature of V&L tasks, the primitives that make up compositions come from different modalities, resulting in multi-sourced novel compositions. However, the generalization ability over multi-sourced novel compositions, i.e., multi-sourced compositional generalization (MSCG), remains unexplored. In this paper, we explore MSCG in the context of visual question answering (VQA), and propose a retrieval-augmented training framework to enhance the MSCG ability of VQA models by learning unified representations for primitives from different modalities. Specifically, semantically equivalent primitives are retrieved for each primitive in the training samples, and the retrieved features are aggregated with the original primitive to refine the model. This process helps the model learn consistent representations for the same semantic primitives across different modalities. To evaluate the MSCG ability of VQA models, we construct a new GQA-MSCG dataset based on the GQA dataset, in which samples include three types of novel compositions composed of primitives from different modalities. The GQA-MSCG dataset is available at https://github.com/NeverMoreLCH/MSCG.
322: Improvements to the Generate-and-Complete Approach to Conformant Planning
Authors: Liangda Fang, Min Zhan, Jin Tong, Xiujie Huang, Ziliang Chen, Quanlong Guan
Location: Guangzhou | Day: TBD
Show Abstract
Conformant planning is a computationally challenging task that generates an action sequence to achieve a goal condition with uncertain initial states and non-deterministic actions. The generate-and-complete (GC for short) approach shows superior performance on conformant planning: it iteratively enumerates a solution of a planning subproblem for a single initial state and attempts to extend it to all initial states until a conformant solution is found. However, two major drawbacks of the GC approach hinder its performance: the computational overhead due to state exploration and the insertion of many redundant actions. To overcome the above drawbacks, we improve both the verification and completion procedures. Experimental results show that the improved GC planner significantly improves on the original GC approach in many instances with a large number of initial states. Our approach also outperforms all state-of-the-art planners, solving 989 instances in comparison to 784, which is the most solved by DNF.
324: Directing Mamba to Complex Textures: An Efficient Texture-Aware State Space Model for Image Restoration
Authors: Long Peng, Xin Di, Zhanfeng Feng, Wenbo Li, Renjing Pei, Yang Wang, Xueyang Fu, Yang Cao, Zheng-Jun Zha
Location: Guangzhou | Day: TBD
Show Abstract
Image restoration aims to recover details and enhance contrast in degraded images. With the growing demand for high-quality imaging (e.g., 4K and 8K), achieving a balance between restoration quality and computational efficiency has become increasingly critical. Existing methods, primarily based on CNNs, Transformers, or their hybrid approaches, apply uniform deep representation extraction across the image. However, these methods often struggle to effectively model long-range dependencies and largely overlook the spatial characteristics of image degradation (regions with richer textures tend to suffer more severe damage), making it hard to achieve the best trade-off between restoration quality and efficiency. To address these issues, we propose a novel texture-aware image restoration method, TAMambaIR, which simultaneously perceives image textures and achieves a trade-off between performance and efficiency. Specifically, we introduce a novel Texture-Aware State Space Model, which enhances texture awareness and improves efficiency by modulating the transition matrix of the state-space equation and focusing on regions with complex textures. Additionally, we design a Multi-Directional Perception Block to improve multi-directional receptive fields while maintaining low computational overhead. Extensive experiments on benchmarks for image super-resolution, deraining, and low-light image enhancement demonstrate that TAMambaIR achieves state-of-the-art performance with significantly improved efficiency, establishing it as a robust and efficient framework for image restoration.
325: Flow-based Time-aware Causal Structure Learning for Sequential Recommendation
Authors: Hangtong Xu, Yuanbo Xu, Huayuan Liu, En Wang
Location: Guangzhou | Day: TBD
Show Abstract
Sequential models aim to predict future interactions based on users’ historical interaction sequences. Traditional sequential methods primarily focus on capturing intra-historical sequence dependencies, overlooking the influence of unobserved confounders in recommendation scenarios. Recent studies incorporate time as additional information to help the model capture dynamic user preferences. However, time is just the external manifestation of the influence of confounders, not the actual cause of the dynamics of user preferences. Additionally, improperly integrating time with item embeddings can obstruct the model’s ability to capture sequence dependencies. To address these challenges, we first revisit the sequential recommendation problem from a causal perspective and incorporate confounders as a new task. We propose a new framework, Flow-based Time-aware Causal Structure for Sequential Recommendation (FCSRec), which explicitly incorporates unobserved confounders’ influence in the recommendation process. Specifically, we use Normalizing Flows to learn the causal graph of confounders and incorporate time information as conditional information to capture confounders’ time-sensitive representations. To balance the influence of confounders and sequence dependencies, we introduce a classifier-free training paradigm that randomly masks the influence of confounders during training to encourage the model to learn both sequence dependencies and confounders’ influence equally. We validate FCSRec on multiple real-world datasets, and experimental results show that FCSRec outperforms several state-of-the-art methods in recommendation performance. Our code is available at Code-link.
326: Linear Trading Position with Sparse Spectrum
Authors: Zhao-Rong Lai, Haisheng Yang
Location: Guangzhou | Day: TBD
Show Abstract
The principal portfolio approach is an emerging method in signal-based trading. However, these principal portfolios may not be diversified enough to explore the key features of the prediction matrix, or robust to different situations. To address this problem, we propose a novel linear trading position with sparse spectrum that can explore a larger spectral region of the prediction matrix. We also develop a Krasnosel’skii-Mann fixed-point algorithm to optimize this trading position, which possesses the descent property and achieves a linear convergence rate in the objective value. This is a new theoretical result for this type of algorithm. Extensive experiments show that the proposed method achieves good and robust performance in various situations.
332: A Primal-dual Perspective for Distributed TD-learning
Authors: Han Dong Lim, Donghwan Lee
Location: Montreal | Day: August 21st | Time: 15:00 | Session: ML: Reinforcement Learning (2/2)
Show Abstract
The goal of this paper is to investigate distributed temporal difference (TD) learning for a networked multi-agent Markov decision process. The proposed approach is based on distributed optimization algorithms, which can be interpreted as primal-dual ordinary differential equation (ODE) dynamics subject to null-space constraints. Based on the exponential convergence behavior of the primal-dual ODE dynamics subject to null-space constraints, we examine the behavior of the final iterate in various distributed TD-learning scenarios, considering both constant and diminishing step-sizes and incorporating both i.i.d. and Markovian observation models. Unlike existing methods, the proposed algorithm does not require the assumption that the underlying communication network structure is characterized by a doubly stochastic matrix.
336: Boost Embodied AI Models with Robust Compression Boundary
Authors: Chong Yu, Tao Chen, Zhongxue Gan
Location: Guangzhou | Day: TBD
Show Abstract
The rapid improvement of deep learning models with the integration of the physical world has dramatically improved embodied AI capabilities. Meanwhile, the powerful embodied AI models and their scales place an increasing burden on deployment efficiency. The efficiency issue is more apparent on embodied AI platforms than in data centers because they have more limited computational resources and memory bandwidth. Moreover, most embodied AI scenarios, like autonomous driving and robotics, are more sensitive to response time. Theoretically, traditional model compression techniques can help embodied AI models with more efficient computation, lower memory and energy consumption, and reduced latency. Because embodied AI models are expected to interact with the physical world, the corresponding compressed models are also expected to resist natural corruption caused by real-world events such as noise, blur, weather conditions, and even adversarial corruption. This paper explores a novel paradigm to boost the efficiency of embodied AI models and the robust compression boundary. Our method is shown to find the optimal balance between accuracy, efficiency, and robustness in real-world conditions.
354: kgMBQA: Quality Knowledge Graph-driven Multimodal Blind Image Assessment
Authors: Wuyuan Xie, Tingcheng Bian, Miaohui Wang
Location: Guangzhou | Day: TBD
Show Abstract
Blind image assessment aims to simulate human prediction of image quality distortion levels and provide quality scores. However, existing unimodal quality indicators have limited representational ability when facing complex contents and distortion types, and the predicted scores also fail to provide explanatory reasons, which further affects the credibility of their prediction results. To address these challenges, we propose a multimodal quality indicator with explanatory text descriptions, called kgMBQA. Specifically, we construct an image quality knowledge graph and conduct in-depth mining to generate explanatory texts. The text modality is further aligned and fused with the image modality, thereby improving the model performance while also outputting its corresponding quality explanatory description. The experimental results demonstrate that our kgMBQA achieves the best performance compared to recent representative methods on the KonIQ-10k, LIVE Challenge, BIQ2021, TID2013, and AIGC-3K datasets.
359: Multi-granularity Knowledge Transfer for Continual Reinforcement Learning
Authors: Chaofan Pan, Lingfei Ren, Yihui Feng, Linbo Xiong, Wei Wei, Yonghao Li, Xin Yang
Location: Guangzhou | Day: TBD
Show Abstract
Continual reinforcement learning (CRL) empowers RL agents with the ability to learn a sequence of tasks, accumulating knowledge learned in the past and using the knowledge for problem-solving or future task learning. However, existing methods often focus on transferring fine-grained knowledge across similar tasks, which neglects the multi-granularity structure of human cognitive control, resulting in insufficient knowledge transfer across diverse tasks. To enhance coarse-grained knowledge transfer, we propose a novel framework called MT-Core (as shorthand for Multi-granularity knowledge Transfer for Continual reinforcement learning). MT-Core has a key characteristic of multi-granularity policy learning: 1) a coarse-grained policy formulation for utilizing the powerful reasoning ability of the large language model (LLM) to set goals, and 2) a fine-grained policy learning through RL which is oriented by the goals. We also construct a new policy library (knowledge base) to store policies that can be retrieved for multi-granularity knowledge transfer. Experimental results demonstrate the superiority of the proposed MT-Core in handling diverse CRL tasks versus popular baselines.
363: Injecting Imbalance Sensitivity for Multi-Task Learning
Authors: Zhipeng Zhou, Liu Liu, Peilin Zhao, Wei Gong
Location: Guangzhou | Day: TBD
Show Abstract
Multi-task learning (MTL) has emerged as a promising approach for deploying deep learning models in real-life applications. Recent studies have proposed optimization-based learning paradigms to establish task-shared representations in MTL. However, our paper empirically argues that these studies, specifically gradient-based ones, primarily emphasize the conflict issue while neglecting the potentially more significant impact of imbalance/dominance in MTL. In line with this perspective, we enhance the existing baseline method by injecting imbalance-sensitivity through the imposition of constraints on the projected norms. To demonstrate the effectiveness of our proposed IMbalance-sensitive Gradient (IMGrad) descent method, we evaluate it on multiple mainstream MTL benchmarks, encompassing supervised learning tasks as well as reinforcement learning. The experimental results consistently demonstrate competitive performance.
366: FBQuant: FeedBack Quantization for Large Language Models
Authors: Yijiang Liu, Hengyu Fang, Liulu He, Rongyu Zhang, Yichuan Bai, Yuan Du, Li Du
Location: Guangzhou | Day: TBD
Show Abstract
Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to the limited computational resources of edge devices. In particular, the key bottleneck stems from memory bandwidth constraints related to weight loading.
Weight-only quantization effectively reduces memory access, yet often induces significant accuracy degradation.
Recent efforts to incorporate sub-branches have shown promise for mitigating quantization errors, but these methods either lack robust optimization strategies or rely on suboptimal objectives. To address these gaps, we propose FeedBack Quantization (FBQuant), a novel approach inspired by negative feedback mechanisms in automatic control.
FBQuant inherently ensures that the reconstructed weights remain bounded by the quantization process, thereby reducing the risk of overfitting.
To further offset the additional latency introduced by sub-branches, we develop an efficient CUDA kernel that reduces the extra inference time by 60%.
Comprehensive experiments demonstrate the efficiency and effectiveness of FBQuant across various LLMs. Notably, for 3-bit Llama2-7B, FBQuant improves zero-shot accuracy by 1.2%.
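As a point of reference for the weight-only quantization mentioned in this abstract, the following is a minimal sketch of plain round-to-nearest uniform quantization at 3 bits. It is not FBQuant and includes no sub-branch or feedback mechanism; the function name `quantize_uniform` and the example weights are assumptions made only for illustration.

```python
def quantize_uniform(weights, bits=3):
    """Round-to-nearest uniform quantization of a list of floats.

    Hypothetical helper for illustration only; this is the generic
    baseline technique, not the method proposed in the paper.
    """
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1                      # 7 steps for 3 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Integer codes in the range [0, levels].
    codes = [round((w - lo) / scale) for w in weights]
    # Reconstructed (dequantized) weights.
    dequant = [lo + c * scale for c in codes]
    return codes, dequant, scale


weights = [-1.0, -0.3, 0.0, 0.45, 1.0]          # made-up example values
codes, dequant, scale = quantize_uniform(weights, bits=3)
max_error = max(abs(w - d) for w, d in zip(weights, dequant))
```

With round-to-nearest, the reconstruction error of each weight is at most half a quantization step, which is the accuracy loss that error-compensation schemes try to reduce.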
384: Smart Contracts for Trustless Sampling of Correlated Equilibria
Authors: Togzhan Barakbayeva, Zhuo Cai, Amir Goharshady, Karaneh Keypoor
Location: Montreal | Day: August 21st | Time: 10:00 | Session: GTEP: Noncooperative games
Show Abstract
Correlated equilibria are a standard solution concept in game theory and generalize Nash equilibria. In a 2-player non-cooperative game in which player i has action set A_i, a correlated equilibrium is a self-enforcing probability distribution σ over A_1 × A_2. Specifically, when a strategy profile (s_1, s_2) in A_1 × A_2 is sampled according to σ, each player i can observe their own component s_i, but not the other player’s component. Knowing s_i and σ, player i cannot increase their expected payoff by defecting and playing a strategy s’_i different from s_i. Correlated equilibria are ubiquitous and crucial in mechanism design, including in the design of blockchain-based protocols which aim to incentivize honest behavior.

A correlated equilibrium depends on a centralized and impartial oracle, often called the “external signal” in game theory literature, to sample a strategy profile and disclose each player’s component to them, while keeping the other player’s component secret. However, there is currently no trustless method to achieve this on the blockchain without centralization or relying on trusted third parties.

In this work, we address this challenge and provide two novel protocols, one based on oblivious transfer and the other based on zkSNARKs, to replace the public signal with a smart contract. We prove that our approaches are secure and provide the desired privacy properties of a correlated equilibrium, while also being efficient in terms of gas usage and thus affordable in practice.
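The self-enforcing condition in the definition above can be checked directly for a small game. The sketch below is only an illustration of that definition, not of the paper's protocols; the function name, the payoff values (a textbook-style game of Chicken) and the action labels "D" and "C" are assumptions chosen for the example.

```python
from itertools import product


def is_correlated_equilibrium(payoffs, sigma, actions, tol=1e-9):
    """Check the correlated-equilibrium condition for a 2-player game.

    payoffs[i][(a1, a2)] is player i's payoff for the profile (a1, a2),
    sigma[(a1, a2)] is the probability of that profile, and actions[i]
    lists player i's actions. All names here are hypothetical.
    """
    for i in range(2):
        for s_i, deviation in product(actions[i], repeat=2):
            if s_i == deviation:
                continue
            # Expected gain (weighted by sigma) from playing `deviation`
            # whenever the recommended action is s_i.
            gain = 0.0
            for other in actions[1 - i]:
                told = (s_i, other) if i == 0 else (other, s_i)
                played = (deviation, other) if i == 0 else (other, deviation)
                gain += sigma.get(told, 0.0) * (payoffs[i][played] - payoffs[i][told])
            if gain > tol:
                return False
    return True


# Game of Chicken: "D" = dare, "C" = chicken (illustrative payoffs).
actions = [["D", "C"], ["D", "C"]]
payoffs = [
    {("D", "D"): 0, ("D", "C"): 7, ("C", "D"): 2, ("C", "C"): 6},  # player 1
    {("D", "D"): 0, ("D", "C"): 2, ("C", "D"): 7, ("C", "C"): 6},  # player 2
]
sigma = {("D", "C"): 1 / 3, ("C", "D"): 1 / 3, ("C", "C"): 1 / 3}
print(is_correlated_equilibrium(payoffs, sigma, actions))
```

Here a player told "C" expects 4 from obeying and 3.5 from deviating, and a player told "D" expects 7 from obeying and 6 from deviating, so the distribution passes the check. Placing all probability on ("C", "C") fails it, because either player gains by switching to "D".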
392: SCOUT: Semi-supervised Camouflaged Object Detection by Utilizing Text and Adaptive Data Selection
Authors: Weiqi Yan, Lvhai Chen, Shengchuan Zhang, Yan Zhang, Liujuan Cao
Location: Guangzhou | Day: TBD
Show Abstract
The difficulty of pixel-level annotation has significantly hindered the development of the Camouflaged Object Detection (COD) field. To save on annotation costs, previous works leverage the semi-supervised COD framework that relies on a small amount of labeled data and a large volume of unlabeled data. We argue that there is still significant room for improvement in the effective utilization of unlabeled data. To this end, we introduce Semi-supervised Camouflaged Object Detection by Utilizing Text and Adaptive Data Selection (SCOUT). It includes an Adaptive Data Augment and Selection (ADAS) module and a Text Fusion Module (TFM). The ADAS module selects valuable data for annotation through an adversarial augment and sampling strategy. The TFM module further leverages the selected valuable data by combining camouflage-related knowledge and text-visual interaction. To adapt to this work, we build a new dataset, namely RefTextCOD. Extensive experiments show that the proposed method surpasses previous semi-supervised methods in the COD field and achieves state-of-the-art performance. Our code will be released at https://github.com/Heartfirey/UCOD-DPL.
405: Unleashing the Potential of Transformer Flow for Photorealistic Face Restoration
Authors: Kepeng Xu, Li Xu, Gang He, Wei Chen, Xianyun Wu, Wenxin Yu
Location: Guangzhou | Day: TBD
Show Abstract
Face restoration is a challenging task due to the need to remove artifacts and restore details. Traditional methods usually use generative model priors to achieve face restoration, but the restored results are still insufficient in terms of realism and details. In this paper, we introduce OmniFace, a novel face restoration framework that leverages Transformer-based diffusion flow. By exploiting the scaling property of Transformers, OmniFace achieves high-resolution restoration with exceptional realism and detail. The framework integrates three key components: (1) a Transformer-driven vector estimation network, (2) a representation-aligned ControlNet, and (3) an adaptive training strategy for face restoration. The inherent scaling law of Transformer architectures enables the restoration of high-quality faces at high resolution. The ControlNet, combined with pre-trained diffusion representations, can be easily trained. The adaptive training strategy provides a vector field that is more suitable for face restoration. Comprehensive experiments demonstrate that OmniFace outperforms existing techniques in terms of restoration quality across multiple benchmark datasets, especially in restoring photographic-level texture details in high-resolution scenes.
410: DO-CoLM: Dynamic 3D Conformation Relationships Capture with Self-Adaptive Ordering Molecular Relational Modeling in Language Models
Authors: Zhuo Chen, Jiahui Zhang, Sihan Wang, Hongxin Xiang, Jianmin Wang, Wenjie Du, Yang Wang
Location: Guangzhou | Day: TBD
Show Abstract
Molecular Relational Learning (MRL) aims to understand interactions between molecular pairs, playing a critical role in advancing biochemical research. Recently, Large Language Models (LLMs), with their extensive knowledge bases and advanced reasoning capabilities, have emerged as powerful tools for MRL. However, existing LLMs, which primarily rely on SMILES strings and molecular graphs, face two major challenges. They struggle to capture molecular stereochemistry and dynamics, as molecules possess multiple 3D conformations with varying reactivity and dynamic transformation relationships that are essential for accurately predicting molecular interactions but cannot be effectively represented by 1D SMILES or 2D molecular graphs. Additionally, these models do not consider the autoregressive nature of LLMs, overlooking the impact of input order on model performance. To address these issues, we propose DO-CoLM: a Dynamic relationship capture and self-adaptive Ordering 3D molecular Conformation LM for MRL. By introducing modules to dynamically model intra-molecular and inter-molecular conformational relationships and adaptively adjust the molecular modality input order, DO-CoLM achieves superior performance, as demonstrated by experimental results on 12 cross-domain datasets.
449: InfVC: An Inference-Enhanced Local Search Algorithm for the Minimum Vertex Cover Problem in Massive Graphs
Authors: Rui Sun, Peiyan Liu, Yiyuan Wang, Zhaohui Liu, Liping Du, Jian Gao
Location: Guangzhou | Day: TBD
Show Abstract
The minimum vertex cover (MVC) problem is a classic NP-hard combinatorial optimization problem with extensive real-world applications. In this paper, we propose an efficient local search algorithm, InfVC, to solve the MVC problem in massive graphs, which comprises three ideas. First, we introduce an inference-driven optimization strategy that explores better feasible solutions through inference rules. Second, we develop a structural-determined perturbation strategy that is motivated by the structural features of high-quality solutions, prioritizing high-degree vertices into the candidate solution to guide the search process to some potential high-quality search area. Third, we design a self-adaptive local search framework that dynamically balances exploration and exploitation through a perturbation management mechanism. Extensive experiments demonstrate that InfVC outperforms all the state-of-the-art algorithms on almost all massive instances.
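For readers unfamiliar with the problem, the sketch below shows what a vertex cover is and how a simple greedy heuristic that prefers high-degree vertices builds one. This is a generic baseline for illustration only, not InfVC; the function names and the example graph are assumptions.

```python
def is_vertex_cover(edges, cover):
    """Return True if every edge has at least one endpoint in `cover`."""
    return all(u in cover or v in cover for u, v in edges)


def greedy_vertex_cover(edges):
    """Repeatedly add the vertex that covers the most uncovered edges.

    A plain greedy heuristic (hypothetical helper); it yields a valid
    cover but gives no guarantee of minimum size.
    """
    remaining = set(edges)
    cover = set()
    while remaining:
        degree = {}
        for u, v in remaining:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        # Sort first so ties are broken by the smallest vertex label.
        best = max(sorted(degree), key=degree.get)
        cover.add(best)
        remaining = {e for e in remaining if best not in e}
    return cover


edges = [(0, 1), (0, 2), (0, 3), (3, 4)]   # made-up example graph
cover = greedy_vertex_cover(edges)
print(sorted(cover), is_vertex_cover(edges, cover))
```

On this small graph the heuristic picks vertex 0 first (degree 3) and then vertex 3, which covers the one remaining edge.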
459: Adversarial Attacks on Both Face Recognition and Face Anti-spoofing Models
Authors: Fengfan Zhou, Qianyu Zhou, Heifei Ling, Xuequan Lu
Location: Guangzhou | Day: TBD
Show Abstract
Adversarial attacks on Face Recognition (FR) systems have demonstrated significant effectiveness against standalone FR models. However, their practicality diminishes in complete FR systems that incorporate Face Anti-Spoofing (FAS) models, as these models can detect and mitigate a substantial number of adversarial examples. To address this critical yet under-explored challenge, we introduce a novel attack setting that targets both FR and FAS models simultaneously, thereby enhancing the practicability of adversarial attacks on integrated FR systems. Specifically, we propose a new attack method, termed Reference-free Multi-level Alignment (RMA), designed to improve the capacity of black-box attacks on both FR and FAS models. The RMA framework is built upon three key components. Firstly, we propose an Adaptive Gradient Maintenance module to address the imbalances in gradient contributions between FR and FAS models. Secondly, we develop a Reference-free Intermediate Biasing module to improve the transferability of adversarial examples against FAS models. In addition, we introduce a Multi-level Feature Alignment module to reduce feature discrepancies at various levels of representation. Extensive experiments showcase the superiority of our proposed attack method to state-of-the-art adversarial attacks.
474: Towards a Unified View of Social Laws with Instantaneous Actions
Authors: Alexander Tuisov, Evgeny Mishlyakov, Alexander Shleyfman, Erez Karpas
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Planning and Scheduling (4/5)
Show Abstract
Multiple agents operating in a shared environment can interfere with each other’s ability to reach their goals. One of the approaches to address this issue is enacting a social law – a set of rules that restricts some possible behaviors of the agents. A social law is considered robust if it guarantees that each agent can achieve its goal independently of the actions of other agents. Recent work has shown how to verify that a given social law, encoded in an MA-STRIPS formalism, is robust by compilation to classical planning. Follow-up work presented an extended compilation which can handle numeric multi-agent planning. In this paper, we present a new compilation, which can handle both classical and numeric multi-agent planning formalisms, as well as any other multi-agent planning formalism with instantaneous actions, in which action preconditions can be negated using first-order logic with equality. This opens the door to using social laws in even richer planning formalisms. Our empirical evaluation shows that the added expressivity of the new compilation does not hurt its performance, and it achieves comparable performance to the previous state-of-the-art compilations.
475: L2M2: A Hierarchical Framework Integrating Large Language Model and Multi-agent Reinforcement Learning
Authors: Minghong Geng, Shubham Pateria, Budhitama Subagdja, Lin Li, Xin Zhao, Ah-Hwee Tan
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Agent-based and Multi-agent Systems (2/3)
Show Abstract
Multi-agent reinforcement learning (MARL) has demonstrated remarkable success in collaborative tasks, yet faces significant challenges in scaling to complex scenarios requiring sustained planning and coordination across long horizons. While hierarchical approaches help decompose these tasks, they typically rely on hand-crafted subtasks and domain-specific knowledge, limiting their generalizability. We present L2M2, a novel hierarchical framework that leverages large language models (LLMs) for high-level strategic planning and MARL for low-level execution. L2M2 enables zero-shot planning that supports both end-to-end training and direct integration with pre-trained MARL models. Experiments in the VMAS environment demonstrate that L2M2’s LLM-guided MARL achieves superior performance while requiring less than 20% of the training samples compared to baseline methods. In the MOSMAC environment, L2M2 demonstrates strong performance with pre-defined subgoals and maintains substantial effectiveness without subgoals – scenarios where baseline methods consistently fail. Analysis through kernel density estimation reveals L2M2’s ability to automatically generate appropriate navigation plans, demonstrating its potential for addressing complex multi-agent coordination tasks.
482: SetKE: Knowledge Editing for Knowledge Elements Overlap
Authors: Yifan Wei, Xiaoyan Yu, Ran Song, Hao Peng, Angsheng Li
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Natural Language Processing (2/2)
Show Abstract
Large Language Models (LLMs) excel in tasks such as retrieval and question answering but require updates to incorporate new knowledge and reduce inaccuracies and hallucinations.
Traditional updating methods, like fine-tuning and incremental learning, face challenges such as overfitting and high computational costs.
Knowledge Editing (KE) provides a promising alternative but often overlooks the Knowledge Element Overlap (KEO) phenomenon, where multiple triplets share common elements, leading to editing conflicts.
We identify the prevalence of KEO in existing KE datasets and show its significant impact on current KE methods, causing performance degradation in handling such triplets.
To address this, we propose a new formulation, Knowledge Set Editing (KSE), and introduce SetKE, a method that edits sets of triplets simultaneously.
Experimental results demonstrate that SetKE outperforms existing methods in KEO scenarios on mainstream LLMs. Additionally, we introduce EditSet, a dataset containing KEO triplets, providing a comprehensive benchmark.
486: T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction
Authors: Kun Peng, Chaodong Tong, Cong Cao, Hao Peng, Qian Li, Guanlin Wu, Lei Jiang, Yanbing Liu, Philip S. Yu
Location: Guangzhou | Day: TBD
Show Abstract
Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed of aspect terms, opinion terms, and sentiment polarities from given sentences. The table tagging method is a popular approach to addressing this task, which encodes a sentence into a 2-dimensional table, allowing for the tagging of relations between any two words. Previous efforts have focused on designing various downstream relation learning modules to better capture interactions between tokens in the table, revealing that a stronger capability in relation capture can lead to greater improvements in the model. Motivated by this, we attempt to directly utilize transformer layers as downstream relation learning modules. Due to the powerful semantic modeling capability of transformers, it is foreseeable that this will lead to excellent improvement. However, owing to the quadratic relation between the length of the table and the length of the input sentence sequence, using transformers directly faces two challenges: overly long table sequences and unfair local attention interaction. To address these challenges, we propose a novel Table-Transformer (T-T) for the tagging-based ASTE method. Specifically, we introduce a stripe attention mechanism with a loop-shift strategy to tackle these challenges. The former modifies the global attention mechanism to only attend to a 2-dimensional local attention window, while the latter facilitates interaction between different attention windows. Extensive and comprehensive experiments demonstrate that the T-T, as a downstream relation learning module, achieves state-of-the-art performance with lower computational costs.
487: Asymptotic Fair Division: Chores Are Easier Than Goods
Authors: Pasin Manurangsi, Warut Suksompong
Location: Montreal | Day: August 21st | Time: 11:30 | Session: GTEP: Fair division
Show Abstract
When dividing items among agents, two of the most widely studied fairness notions are envy-freeness and proportionality. We consider a setting where m chores are allocated to n agents and the disutility of each chore for each agent is drawn from a probability distribution. We show that an envy-free allocation exists with high probability provided that m >= 2n, and moreover, m must be at least n+Theta(n) in order for the existence to hold. On the other hand, we prove that a proportional allocation is likely to exist as long as m = omega(1), and this threshold is asymptotically tight. Our results reveal a clear contrast with the allocation of goods, where a larger number of items is necessary to ensure existence for both notions.
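The two fairness notions can be stated concretely for chores, where each agent wants a low total disutility. The following sketch checks both notions for a given allocation; the function names, the disutility matrix and the allocations are assumptions for illustration and are not taken from the paper.

```python
def bundle_cost(disutility, agent, bundle):
    """Total disutility that `agent` assigns to the chores in `bundle`."""
    return sum(disutility[agent][chore] for chore in bundle)


def is_envy_free(disutility, allocation):
    """No agent strictly prefers (finds less costly) another agent's bundle."""
    n = len(allocation)
    return all(
        bundle_cost(disutility, i, allocation[i])
        <= bundle_cost(disutility, i, allocation[j])
        for i in range(n)
        for j in range(n)
    )


def is_proportional(disutility, allocation, tol=1e-12):
    """Each agent bears at most a 1/n share of their total disutility."""
    n = len(allocation)
    return all(
        bundle_cost(disutility, i, allocation[i]) <= sum(disutility[i]) / n + tol
        for i in range(n)
    )


# disutility[i][c]: cost of chore c for agent i (made-up numbers).
disutility = [[1, 4, 2, 3], [3, 1, 4, 2]]
good = [[0, 2], [1, 3]]   # agent 0 gets chores 0 and 2, agent 1 gets 1 and 3
bad = [[1, 3], [0, 2]]    # the two bundles swapped
print(is_envy_free(disutility, good), is_proportional(disutility, good))
print(is_envy_free(disutility, bad), is_proportional(disutility, bad))
```

In the first allocation each agent's own bundle costs them 3 while the other bundle would cost 7, so it is both envy-free and proportional; swapping the bundles violates both notions.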
489: Prompt-Aware Controllable Shadow Removal
Authors: Kerui Chen, Zhiliang Wu, Wenjin Hou, Kun Li, Hehe Fan, Yi Yang
Location: Guangzhou | Day: TBD
Show Abstract
Shadow removal aims to restore the image content in shadowed regions. While deep learning-based methods have shown promising results, they still face key challenges: 1) uncontrolled removal of all shadows, or 2) controllable removal that relies heavily on precise shadow region masks. To address these issues, we introduce a novel paradigm: prompt-aware controllable shadow removal. Unlike existing approaches, our paradigm allows for targeted shadow removal from specific subjects based on user prompts (e.g., dots, lines, or subject masks). This approach eliminates the need for shadow annotations and offers flexible, user-controlled shadow removal. Specifically, we propose an end-to-end learnable model, the Prompt-Aware Controllable Shadow Removal Network (PACSRNet). PACSRNet consists of two key modules: a prompt-aware module that generates shadow masks for the specified subject based on the user prompt, and a shadow removal module that uses the shadow prior from the first module to restore the content in the shadowed areas. Additionally, we enhance the shadow removal module by incorporating feature information from the prompt-aware module through a linear operation, providing prompt-guided support for shadow removal. Recognizing that existing shadow removal datasets lack diverse user prompts, we contribute a new dataset specifically designed for prompt-based controllable shadow removal. Extensive experimental results demonstrate the effectiveness and superiority of PACSRNet.
493: Not in My Backyard! Temporal Voting Over Public Chores
Authors: Edith Elkind, Tzeh Yuan Neoh, Nicholas Teh
Location: Montreal | Day: August 19th | Time: 11:30 | Session: GTEP: Computational social choice (1/2)
We study a temporal voting model where voters have dynamic preferences over a set of public chores—projects that benefit society, but impose individual costs on those affected by their implementation. We investigate the computational complexity of optimizing utilitarian and egalitarian welfare. Our results show that while optimizing the former is computationally straightforward, minimizing the latter is computationally intractable, even in very restricted cases. Nevertheless, we identify several settings where this problem can be solved efficiently, either exactly or by an approximation algorithm. We also examine the effects of enforcing temporal fairness and its impact on social welfare, and analyze the competitive ratio of online algorithms. We then explore the strategic behavior of agents, providing insights into potential malfeasance in such decision-making environments. Finally, we discuss a range of fairness measures and their suitability for our setting.
496: PDDFormer: Pairwise Distance Distribution Graph Transformer for Crystal Material Property Prediction
Authors: Xiangxiang Shen, Zheng Wan, Lingfeng Wen, Licheng Sun, Jian Yang, Xuan Tang, Shing-Ho J. Lin, Xiao He, Mingsong Chen, Xian Wei
Location: Guangzhou | Day: TBD
Crystal structures can be simplified as a periodic point set that repeats across three-dimensional space along an underlying lattice. Traditionally, crystal representation methods rely on descriptors such as lattice parameters, symmetry, and space groups to characterize the structure. However, in reality, atoms in materials always vibrate above absolute zero, causing their positions to fluctuate continuously. This dynamic behavior disrupts the fundamental periodicity of the lattice, making crystal graphs based on static lattice parameters and conventional descriptors discontinuous under slight perturbations. Chemists proposed the pairwise distance distribution (PDD) method to address this. However, the completeness of PDD requires defining a large number of neighboring atoms, leading to high computational costs. Additionally, PDD does not account for atomic information, making it challenging to apply it directly to crystal material property prediction tasks. To tackle these challenges, we introduce the atom-weighted Pairwise Distance Distribution (WPDD) and Unit cell Pairwise Distance Distribution (UPDD) for the first time, applying them to the construction of multi-edge crystal graphs. We demonstrate the continuity and general completeness of crystal graphs under slight atomic position perturbations. Moreover, by modeling PDD as global information and integrating it into matrix-based message passing, we significantly reduce computational costs. Comprehensive evaluation results show that WPDDFormer achieves state-of-the-art predictive accuracy across tasks on benchmark datasets such as the Materials Project and JARVIS-DFT.
497: Probabilistic Analysis of Stable Matching in Large Markets with Siblings
Authors: Zhaohong Sun, Tomohiko Yokoyama, Makoto Yokoo
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Game Theory and Economic Paradigms
We study a practical centralized matching problem which assigns children to daycare centers. The collective preferences of siblings from the same family introduce complementarities, which can lead to the absence of stable matchings, as observed in the hospital-doctor matching problems involving couples. Intriguingly, stable matchings are consistently observed in real-world daycare markets, despite the prevalence of sibling applicants.

We conduct a probabilistic analysis of large random markets to examine the existence of stable matchings in such markets. Specifically, we focus on scenarios where daycare centers have similar priorities over children, a common characteristic in real-world markets. Our analysis reveals that as the market size approaches infinity, the likelihood of stable matchings existing converges to 1.

To facilitate our exploration, we refine an existing heuristic algorithm to address a more rigorous stability concept, as the original one may fail to meet this criterion. Through extensive experiments on both real-world and synthetic datasets, we demonstrate the effectiveness of our revised algorithm in identifying stable matchings, particularly when daycare priorities exhibit high similarity.
498: Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Authors: Xin He, Longhui Wei, Lingxi Xie, Qi Tian
Location: Guangzhou | Day: TBD
Multimodal Large Language Models (MLLMs) are experiencing rapid growth, yielding a plethora of novel works recently. The prevailing trend involves adopting data-driven methodologies, wherein diverse instruction-following datasets are collected. However, these approaches always face the challenge of limited visual perception capabilities, as they solely utilize CLIP-like encoders to extract visual information from inputs. Though these encoders are pre-trained on billions of image-text pairs, they still grapple with the information loss dilemma, given that textual captions only partially capture the contents depicted in images. To address this limitation, this paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. Specifically, this work introduces a novel method that incorporates multi-task encoders and existing visual tools into the MLLMs training and inference pipeline, aiming to provide a more comprehensive summarization of visual inputs. Extensive experiments have evaluated its effectiveness in advancing MLLMs, showcasing improved visual perception capability achieved through the integration of visual experts.
508: View-Association-Guided Dynamic Multi-View Classification
Authors: Xinyan Liang, Li Lv, Qian Guo, Bingbing Jiang, Feijiang Li, Liang Du, Lu Chen
Location: Guangzhou | Day: TBD
In multi-view classification tasks, integrating information from multiple views effectively is crucial for improving model performance. However, most existing methods fail to fully leverage the complex relationships between views, often treating them independently or using static fusion strategies. In this paper, we propose a View-Association-Guided Dynamic Multi-View Classification method (AssoDMVC) to address these limitations. Our approach dynamically models and incorporates the relationships between different views during the classification process. Specifically, we introduce a view-relation-guided mechanism that captures the dependencies and interactions between views, allowing for more flexible and adaptive feature fusion. This dynamic fusion strategy ensures that each view contributes optimally based on its contextual relevance and the inter-view relationships. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms traditional multi-view classification techniques, offering a more robust and efficient solution for tasks involving complex multi-view data.
509: Phenotypic Profile-Informed Generation of Drug-Like Molecules via Dual-Channel Variational Autoencoders
Authors: Hui Liu, Shiye Tian, Xuejun Liu
Location: Guangzhou | Day: TBD
The de novo generation of drug-like molecules capable of inducing desirable phenotypic changes is receiving increasing attention. However, previous methods predominantly rely on expression profiles to guide molecule generation, but overlook the perturbative effect of the molecules on cellular contexts. To overcome this limitation, we propose SmilesGEN, a novel generative model based on the variational autoencoder (VAE) architecture to generate molecules with potential therapeutic effects. SmilesGEN integrates a pre-trained drug VAE (SmilesNet) with an expression profile VAE (ProfileNet), jointly modeling the interplay between drug perturbations and transcriptional responses in a common latent space. Specifically, ProfileNet is required to reconstruct pre-treatment expression profiles when eliminating drug-induced perturbations in the latent space, while SmilesNet is informed by desired expression profiles to generate drug-like molecules. Our empirical experiments demonstrate that SmilesGEN outperforms current state-of-the-art models in generating molecules with a higher degree of validity, uniqueness, and novelty, as well as higher Tanimoto similarity to known ligands targeting the relevant proteins. Moreover, we evaluate SmilesGEN for scaffold-based molecule optimization and generation of therapeutic agents, and confirm its superior performance in generating molecules with higher similarity to approved drugs. SmilesGEN establishes a robust framework that leverages gene signatures to generate drug-like molecules that hold promising potential to induce desirable cellular phenotypic changes. The source code and datasets are available at: https://github.com/hliulab/SmilesGEN.
517: ExpertDiff: Head-less Model Reprogramming with Diffusion Classifiers for Out-of-Distribution Generalization
Authors: Jee Seok Yoon, Junghyo Sohn, Wootaek Jeong, Heung-Il Suk
Location: Montreal | Day: August 19th | Time: 11:30 | Session: ML: Diffusion Models
Vision-language models have achieved remarkable performance across various tasks by leveraging large-scale multimodal training data. However, their ability to generalize to out-of-distribution (OOD) domains requiring expert-level knowledge remains an open challenge. To address this, we investigate cross-domain transfer learning approaches for efficiently adapting diffusion classifiers to new target domains demanding expert-level domain knowledge. Specifically, we propose ExpertDiff, a head-less model reprogramming technique that optimizes the instruction-following abilities of text-to-image diffusion models via learnable prompts, while leveraging the diffusion classifier objective as a modular plug-and-play adaptor. Our approach eliminates the need for conventional output mapping layers (e.g., linear probes), enabling seamless integration with off-the-shelf diffusion frameworks like Stable Diffusion. We demonstrate the effectiveness of ExpertDiff on various OOD datasets (i.e., medical and satellite imagery). Furthermore, we qualitatively showcase ExpertDiff’s ability to faithfully reconstruct input images, highlighting its potential for both downstream discriminative and upstream generative tasks. Our work paves the way for effectively repurposing powerful foundation models for novel OOD applications requiring domain expertise.
519: Drafting and Revision: Advancing High-Fidelity Video Inpainting
Authors: Zhiliang Wu, Kun Li, Hehe Fan, Yi Yang
Location: Guangzhou | Day: TBD
Video inpainting aims to fill the missing regions in video with spatial-temporally coherent contents. Existing methods usually treat the missing contents as a whole and adopt a hybrid objective containing a reconstruction loss and an adversarial loss to train the model. However, these two kinds of loss focus on contents at different frequencies, and simply combining them may cause inter-frequency conflicts, leading the trained model to generate compromised results. Inspired by the common corrupted painting restoration process of “drawing a draft first and then revising the details later”, this paper proposes a Drafting-and-Revision Completion Network (DRCN) for video inpainting. Specifically, we first design a Drafting Network that utilizes the temporal information to complete the low-frequency semantic structure at low resolution. Then, a Revision Network is developed to hallucinate high-frequency details at high resolution by using the output of the Drafting Network. In this way, adversarial loss and reconstruction loss can be applied to high-frequency and low-frequency contents respectively, effectively mitigating inter-frequency conflicts. Furthermore, the Revision Network can be stacked in a pyramid manner to generate higher resolution details, which provides a feasible solution for high-resolution video inpainting. Experiments show that DRCN achieves improvements of 7.43% and 12.64% in E_warp and LPIPS, and can handle higher resolution videos on limited GPU memory.
523: OMS: One More Step Noise Searching to Enhance Membership Inference Attacks for Diffusion Models
Authors: Xiaomeng Fu, Xi Wang, Qiao Li, Jin Liu, Jiao Dai, Jizhong Han, Xingyu Gao
Location: Guangzhou | Day: TBD
The data-intensive nature of diffusion models amplifies the risks of privacy infringements and copyright disputes, particularly when training on extensive unauthorized data scraped from the Internet. Membership Inference Attacks (MIA) aim to determine whether a data sample has been utilized by the target model during training, thereby serving as a pivotal tool for privacy preservation. Current MIA methods employ the prediction loss to distinguish between training member samples and non-members.
These methods assume that members, having been encountered by the model during training, yield a smaller prediction loss than non-members. However, this assumption proves ineffective in diffusion models due to the random noise sampled during the training process. Rather than estimating the loss, our approach examines this random noise and reformulates the MIA as a noise search problem, assuming that the noise used in the training process is easier to find for members.
We formulate this noise search process as an optimization problem and employ the fixed-point iteration to solve it. We analyze current MIA methods through the lens of the noise search framework and reveal that they rely on the first residual as the discriminative metric to differentiate members and non-members. Inspired by this observation, we introduce OMS, which augments existing MIA methods by iterating One More fixed-point Step to include a further residual, i.e., the second residual.
We integrate our method into various MIA methods across different diffusion models. The experimental results validate the efficacy of our proposed approach.
547: Joint-Perturbation Simultaneous Pseudo-Gradient
Authors: Carlos Martin, Tuomas Sandholm
Location: Montreal | Day: August 21st | Time: 10:00 | Session: GTEP: Noncooperative games
We study the problem of computing an approximate Nash equilibrium of a game whose strategy space is continuous without access to gradients of the utility function.
Lack of access to gradients is common in reinforcement learning settings, where the environment is treated as a black box, as well as equilibrium finding in mechanisms such as auctions, where the mechanism’s payoffs are discontinuous in the players’ actions.
To tackle this problem, we turn to zeroth-order optimization techniques that combine pseudo-gradients with equilibrium-finding dynamics.
Specifically, we introduce a new technique that requires a number of utility function evaluations per iteration that is constant rather than linear in the number of players.
It achieves this by performing a single joint perturbation on all players’ strategies, rather than perturbing each one individually.
This is very important for many-player games, especially when the utility function is expensive to compute in terms of wall time, memory, money, or other resources.
We evaluate our approach on various games, including auctions, which have important real-world applications.
Our approach yields a dramatic improvement in performance in terms of the wall time required to reach an approximate Nash equilibrium.
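To illustrate the idea of a single joint perturbation described in the abstract above, here is a minimal sketch of a zeroth-order pseudo-gradient estimator. It is an assumption-laden illustration rather than the authors' algorithm: the function name, the Gaussian perturbation, and the antithetic (plus/minus) evaluation are choices made here for clarity. The `utilities` callable is assumed to return all players' utilities for a strategy profile in one evaluation, so each call to the estimator uses two evaluations regardless of the number of players.

```python
import numpy as np


def joint_perturbation_pseudo_gradient(utilities, strategies, sigma, rng):
    """Estimate each player's gradient of its own utility w.r.t. its own strategy.

    utilities:  callable mapping a list of per-player arrays to an array of
                per-player utilities (hypothetical interface).
    strategies: list of NumPy arrays, one per player.
    sigma:      perturbation scale.
    rng:        numpy.random.Generator used to draw the joint perturbation.
    """
    # One perturbation is drawn for the whole strategy profile at once.
    noise = [rng.standard_normal(s.shape) for s in strategies]
    plus = np.asarray(utilities([s + sigma * z for s, z in zip(strategies, noise)]))
    minus = np.asarray(utilities([s - sigma * z for s, z in zip(strategies, noise)]))
    # Directional finite difference of every player's utility along the joint noise.
    scale = (plus - minus) / (2.0 * sigma)
    # Player i keeps only its own block of the noise, weighted by its own difference.
    return [scale[i] * noise[i] for i in range(len(strategies))]
```

A single estimate is noisy; in practice it would be averaged over several draws or fed into a stochastic equilibrium-finding update. Perturbing each player separately would instead need a number of evaluations that grows linearly with the number of players.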
549: An Association-based Fusion Method for Speech Enhancement
Authors: Shijie Wang, Qian Guo, Lu Chen, Liang Du, Zikun Jin, Zhian Yuan, Xinyan Liang
Location: Guangzhou | Day: TBD
Deep learning-based speech enhancement (SE) methods predominantly draw upon two architectural frameworks: generative adversarial networks and diffusion models. In the realm of SE, capturing the local and global relations between signal frames is crucial for the success of these methods. These frameworks typically employ a UNet architecture as their foundational backbone, integrating Long Short-Term Memory (LSTM) networks or attention mechanisms within the UNet to effectively model both local and global signal relations. However, this coupled way of modeling relations may not fully harness the potential of these relations. In this paper, we propose an innovative Association-based Fusion Speech Enhancement method (AFSE), a decoupled method. AFSE first constructs a graph that encapsulates the association between each time window of the speech signal, and then models the global relations between frames by fusing the features of these time windows in a manner akin to graph neural networks. Furthermore, AFSE leverages a UNet with dilated convolutions to model the local relations, enabling the network to maintain a high-resolution representation while benefiting from a wider receptive field. Experimental results demonstrate that the AFSE method significantly improves performance in speech enhancement tasks, validating the effectiveness and superiority of our approach. The code is available at https://github.com/jie019/AFSE_IJCAI2025.
567: Integrating Independent Layer-Wise Rank Selection with Low-Rank SVD Training for Model Compression: A Theory-Driven Approach
Authors: Yifan Guo, Alyssa Yu
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Machine Learning (3/4)
In recent years, with the rise of large language models, model sizes have grown dramatically, garnering attention for their remarkable performance but also raising concerns about the substantial computational and communication resources they require. This has created significant challenges in fine-tuning or re-training models on devices with limited computing and memory resources. Efficient model compression through low-rank factorization has emerged as a promising solution, offering a way to balance the tradeoff between compression ratio and prediction accuracy. However, existing approaches to low-rank selection often rely on trial-and-error methods to determine the optimal rank, lacking theoretical guidance and incurring high computational costs. Furthermore, these methods typically treat low-rank factorization as a post-training process, resulting in suboptimal compressed models. In this paper, we design a novel approach by integrating rank selection into the low-rank training process and performing independent layer-wise rank selection under the guidance of a theoretical loss error bound. Specifically, we first conduct a comprehensive theoretical analysis to quantify how low-rank approximations impact the training losses. Building on these insights, we develop an efficient layer-wise rank search algorithm and seamlessly incorporate it into low-rank singular value decomposition (SVD) training. Our evaluation results on benchmark datasets demonstrate that our approach can achieve high prediction accuracy while delivering significant compression performance. Furthermore, our solution is generic and can be extended to broader learning models.
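For readers unfamiliar with low-rank factorization of a layer, the sketch below shows the basic mechanics with NumPy: choose the smallest rank whose truncated SVD stays within a relative Frobenius error budget, then split the weight matrix into two thin factors. The error-budget rule is a plain heuristic assumed here for illustration; it is not the theoretical loss bound or the rank search algorithm from the paper, and the function names are hypothetical.

```python
import numpy as np


def smallest_rank_within_error(weight, max_relative_error):
    """Smallest rank r whose best rank-r approximation has
    ||W - W_r||_F <= max_relative_error * ||W||_F."""
    squared = np.linalg.svd(weight, compute_uv=False) ** 2
    total = squared.sum()
    # tail[r] is the squared Frobenius error left after keeping the top r values.
    tail = np.concatenate([np.cumsum(squared[::-1])[::-1], [0.0]])
    for rank, discarded in enumerate(tail):
        if discarded <= (max_relative_error ** 2) * total:
            return rank
    return len(squared)  # unreachable for a non-negative budget; kept for safety


def low_rank_factors(weight, rank):
    """Return (left, right) with left @ right equal to the best rank-`rank`
    approximation of `weight` in Frobenius norm."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    left = u[:, :rank] * s[:rank]   # shape (rows, rank), singular values folded in
    right = vt[:rank, :]            # shape (rank, cols)
    return left, right
```

Replacing an m-by-n weight with the two factors reduces the parameter count from m*n to rank*(m+n), which is the source of the compression when the selected rank is small. Applying the rank choice independently to each layer mirrors the layer-wise selection the abstract describes, though the selection criterion here is only a stand-in.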
588: Learnable Frequency Decomposition for Image Forgery Detection and Localization
Authors: Dong Li, Jiayíng Zhu, Yidi Liu, Xin Lu, Xueyang Fu, Jiawei Liu, Aiping Liu, Zheng-Jun Zha
Location: Guangzhou | Day: TBD
Concern for image authenticity spurs research in image forgery detection and localization (IFDL). Most deep learning-based methods focus primarily on spatial domain modeling and have not fully explored frequency domain strategies. In this paper, we observe and analyze the frequency characteristic changes caused by image tampering. Observations indicate that manipulation traces are especially prominent in phase components and span both low and high-frequency bands. Based on these findings, we propose a forensic frequency decomposition network (F2D-Net), which incorporates deep Fourier transforms and leverages both phase information and high and low-frequency components to enhance IFDL. Specifically, F2D-Net consists of the Spectral Decomposition Subnetwork (SDSN) and the Frequency Separation Subnetwork (FSSN). The former decomposes the image into amplitude and phase, focusing on learning the semantic content in the phase spectrum to identify forged objects, thus improving forgery detection accuracy. The latter further adaptively decomposes the output of the SDSN to obtain corresponding high and low frequencies, and applies a divide-and-conquer strategy to refine each frequency band, mitigating the optimization difficulties caused by coupled forgery traces across different frequencies, thereby better capturing the pixels belonging to the forged object to improve localization accuracy. Experiments on multiple datasets demonstrate that our method outperforms state-of-the-art image forgery detection and localization techniques both qualitatively and quantitatively.
598: FissionVAE: Federated Non-IID Image Generation with Latent Space and Decoder Decomposition
Authors: Chen Hu, Hanchi Ren, Jingjing Deng, Xianghua Xie, Xiaoke Ma
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Federated Learning
Federated learning is a machine learning paradigm that enables decentralized clients to collaboratively learn a shared model while keeping all the training data local. While considerable research has focused on federated image generation, particularly Generative Adversarial Networks, Variational Autoencoders have received less attention. In this paper, we address the challenges of non-IID (non-independent and identically distributed) data environments featuring multiple groups of images of different types. Non-IID data distributions can lead to difficulties in maintaining a consistent latent space and can also result in local generators with disparate texture features being blended during aggregation. We therefore introduce FissionVAE, which decouples the latent space and constructs decoder branches tailored to individual client groups. This method allows for customized learning that aligns with the unique data distributions of each group. Additionally, we incorporate hierarchical VAEs and demonstrate the use of heterogeneous decoder architectures within FissionVAE. We also explore strategies for setting the latent prior distributions to enhance the decoupling process. To evaluate our approach, we assemble two composite datasets: the first combines MNIST and FashionMNIST; the second comprises RGB datasets of cartoon and human faces, wild animals, marine vessels, and remote sensing images. Our experiments demonstrate that FissionVAE greatly improves generation quality on these datasets compared to baseline federated VAE models.
602: Exploring Semantic Masked Autoencoder for Self-supervised Point Cloud Understanding
Authors: Yixin Zha, Chuxin Wang, Wenfei Yang, Tianzhu Zhang
Location: Guangzhou | Day: TBD
Point cloud understanding aims to acquire robust and general feature representations from unlabeled data. Masked point modeling-based methods have recently shown significant performance across various downstream tasks. These pre-training methods rely on random masking strategies to establish the perception of point clouds by restoring corrupted point cloud inputs, which causes the self-supervised models to fail to capture reasonable semantic relationships. To address this issue, we propose Semantic Masked Autoencoder, which comprises two main components: a prototype-based component semantic modeling module and a component semantic-enhanced masking strategy. Specifically, in the component semantic modeling module, we design a component semantic guidance mechanism to direct a set of learnable prototypes in capturing the semantics of different components from objects. Leveraging these prototypes, we develop a component semantic-enhanced masking strategy that addresses the limitations of random masking in effectively covering complete component structures. Furthermore, we introduce a component semantic-enhanced prompt-tuning strategy, which further leverages these prototypes to improve the performance of pre-trained models in downstream tasks. Extensive experiments conducted on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of our proposed modules.
605: Multi-Label Text Classification with Label Attention Aware and Correlation Aware Contrastive Learning
Authors: Zhengzhong Zhu, Pei Zhou, Zeting Li, Kejiang Chen, Jiangping Zhu
Location: Guangzhou | Day: TBD
Multi-label text classification (MLTC) is a challenging task where each document can be associated with multiple interdependent labels. This task is complicated by two key issues: the intricate correlations among labels and the partial overlap between labels and text relevance. Existing methods often fail to capture the semantic dependencies between labels or struggle to handle the ambiguities caused by partial overlaps, resulting in suboptimal representation learning.
To address these challenges, we propose the Unified Contextual and Label-Aware Framework (UCLAF), which integrates a Label Attention Aware Network (LAN) and Correlation Aware Contrastive Learning (CACL) in a synergistic design. The Label Attention Aware Network explicitly models label dependencies by embedding labels and texts into a shared semantic space, aligning text representations with label semantics. Meanwhile, Correlation Aware Contrastive Learning refines these representations by dynamically modeling sample-level relationships, leveraging a contrastive loss function that accounts for the proportional overlap of labels between samples. This complementary approach enables UCLAF to jointly address complex label correlations and partial label overlaps.
Extensive experiments on benchmark datasets demonstrate that UCLAF significantly outperforms state-of-the-art methods, showcasing its effectiveness in improving both representation learning and classification performance in MLTC tasks. We will release our code after the paper is accepted.
611: Data Poisoning Attack Defense and Evolutionary Domain Adaptation for Federated Medical Image Segmentation
Authors: Min Hyuk Kim, Seok Bong Yoo
Location: Montreal | Day: August 21st | Time: 10:00 | Session: CV: attacks
Federated learning has demonstrated significant potential in medical image segmentation to protect data privacy by retaining local data. However, its application is still hindered by two critical challenges: 1) the retained data poisoning attacks that severely compromise the accuracy of the global segmentation model and 2) domain gaps among clients, undermining its generalizability. To address these issues, we propose AdaShield-FL, a data poisoning attack defense and evolutionary domain adaptation method for federated medical image segmentation. AdaShield-FL incorporates a disentangled reconstruction and segmentation module that purifies data in the k-space domain to mitigate the effects of adversarial attacks iteratively. Moreover, it introduces a data poisoning attack detection mechanism that analyzes abnormal patterns in training loss sequences to identify malicious clients. This method also aligns local and global covariance matrices via evolutionary optimization to minimize the domain gap efficiently. The experimental validation on cardiac magnetic resonance imaging datasets demonstrates the robustness and superior performance of AdaShield-FL compared with other federated learning methods.
614: Learn to Think: Bootstrapping LLM Logic Through Graph Representation Learning
Authors: Hang Gao, Chenhao Zhang, Tie Wang, Junsuo Zhao, Fengge Wu, Changwen Zheng, Huaping Liu
Location: Guangzhou | Day: TBD
Large Language Models (LLMs) have achieved remarkable success across various domains. However, they still face significant challenges, including high computational costs for training and limitations in solving complex reasoning problems. Although existing methods have extended the reasoning capabilities of LLMs through structured paradigms, these approaches often rely on task-specific prompts and predefined reasoning processes, which constrain their flexibility and generalizability. To address these limitations, we propose a novel framework that leverages graph learning to enable more flexible and adaptive reasoning capabilities for LLMs. Specifically, this approach models the reasoning process of a problem as a graph and employs LLM-based graph learning to guide the adaptive generation of each reasoning step. To further enhance the adaptability of the model, we introduce a Graph Neural Network (GNN) module to perform representation learning on the generated reasoning process, enabling real-time adjustments to both the model and the prompt. Experimental results demonstrate that this method significantly improves reasoning performance across multiple tasks without requiring additional training or task-specific prompt design. Code can be found in https://github.com/zch65458525/L2T.
623: Learning Heterogeneous Performance-Fairness Trade-offs in Federated Learning
Authors: Rongguang Ye, Ming Tang
Location: Guangzhou | Day: TBD
Recent methods leverage a hypernet to handle the performance-fairness trade-offs in federated learning. This hypernet maps the clients’ preferences between model performance and fairness to preference-specific models on the trade-off curve, known as the local Pareto front. However, existing methods typically adopt a uniform preference sampling distribution to train the hypernet across clients, neglecting the inherent heterogeneity of their local Pareto fronts. Meanwhile, from the perspective of generalization, they do not consider the gap between local and global Pareto fronts on the global dataset. To address these limitations, we propose HetPFL to effectively learn both local and global Pareto fronts. HetPFL comprises Preference Sampling Adaptation (PSA) and Preference-aware Hypernet Fusion (PHF). PSA adaptively determines the optimal preference sampling distribution for each client to accommodate heterogeneous local Pareto fronts, while PHF performs preference-aware fusion of clients’ hypernets to ensure the performance of the global Pareto front. We prove that HetPFL converges linearly with respect to the number of rounds, under weaker assumptions than existing methods. Extensive experiments on four datasets show that HetPFL significantly outperforms seven baselines in terms of the quality of learned local and global Pareto fronts.
632: Multimodal Knowledge Retrieval-Augmented Iterative Alignment for Satellite Commonsense Conversation
Authors: Qian Li, Xuchen Li, Zongyu Chang, Yuzheng Zhang, Cheng Ji, Shangguang Wang
Location: Guangzhou | Day: TBD
Satellite technology has significantly influenced our daily lives, manifested in applications such as navigation and communication. With its development, a vast amount of multimodal satellite commonsense data has been generated, thus leading to an urgent demand for conversation about satellite data. However, existing large language models suffer from prevalent hallucinations and poor comprehensibility on multimodal satellite data due to their high professional content threshold and partial information opacity. To address these issues, we propose a multimodal satellite knowledge retrieval-augmented iterative alignment framework (Sat-RIA) for satellite commonsense conversation. We first construct multi-view retrieval expert knowledge to reduce hallucinations and enhance the interpretability of responses, which incorporates the satellite expert database, satellite rule, satellite image database, and a satellite knowledge graph. We next design commonsense conversation instructions to make the answers more legible and understandable. Furthermore, the retrieval-augmented iterative alignment module refines response precision by aligning outputs with task-specific standards through multi-stage evaluations. Finally, we construct satellite multi-turn dialogue and visual question-answer datasets for a more comprehensive evaluation of satellite commonsense conversation. Experimental results demonstrate that Sat-RIA outperforms existing large language models and provides more comprehensible answers with fewer hallucinations.
636: Variational Multi-Modal Hypergraph Attention Network for Multi-Modal Relation Extraction
Authors: Qian Li, Cheng Ji, Shu Guo, Kun Peng, Qianren Mao, Shangguang Wang
Location: Guangzhou | Day: TBD
Multi-modal relation extraction (MMRE) is a challenging task that seeks to identify relationships between entities with textual and visual attributes. However, existing methods struggle to handle the complexities posed by multiple entity pairs within a single sentence that share similar contextual information (e.g., identical text and image content). These scenarios amplify the difficulty of distinguishing relationships and hinder accurate extraction. To address these limitations, we propose the variational multi-modal hypergraph attention network (VM-HAN), a novel and robust framework for MMRE. Unlike previous approaches, VM-HAN constructs a multi-modal hypergraph for each sentence-image pair, explicitly modeling high-order intra-/inter-modal correlations among different entity pairs in the same context. This design enables a more detailed and nuanced understanding of entity relationships by capturing intricate cross-modal interactions that are often overlooked. Additionally, we introduce the variational hypergraph attention network (V-HAN). This variational attention mechanism dynamically refines the hypergraph structure, enabling the model to effectively handle the inherent ambiguity and complexity of multi-modal data. Comprehensive experiments on benchmark MMRE datasets demonstrate that VM-HAN achieves state-of-the-art performance, significantly surpassing existing methods in both accuracy and efficiency.
637: AlphaGAT: A Two-Stage Learning Approach for Adaptive Portfolio Selection
Authors: Shicheng Li, Jinshan Zhang, Feng Wang
Location: Guangzhou | Day: TBD
Portfolio selection is a critical task in finance, involving the allocation of resources across various assets. However, current methods often struggle to maintain robust performance due to the inherent low signal-to-noise ratio in raw financial data and shifts in data distribution. We propose AlphaGAT, a novel two-stage learning approach for portfolio selection, designed to adapt to different market scenarios. Inspired by the concept of alpha factors, which transform historical market data into actionable signals, the first stage introduces an advanced model named CATimeMixer for alpha factor generation with a novel loss function to improve the effectiveness and robustness. CATimeMixer integrates TimeMixer with Conv1D (C) and cross-asset Attention (A). Specifically, Conv1D enhances TimeMixer by capturing trend and seasonal features across different scales, while cross-asset attention enables TimeMixer to extract interrelationships between different assets. The second stage applies reinforcement learning to dynamically adjust weights, integrating alpha factors into trading signals. Recognizing the varying effectiveness of alpha factors across different periods, our RL agent innovatively transforms the alpha factors into graphs and employs graph attention networks (GAT) to discern the significance of different alpha factors, enhancing policy robustness. Extensive experiments on real-world market data show that our approach outperforms state-of-the-art methods.
641: Free Lunch of Image-mask Alignment for Anomaly Image Generation and Segmentation
Authors: Xiangyue Li, Xiaoyang Wang, Zhibin Wan, Quan Zhang, Yupei Wu, Tao Deng, Mingjie Sun
Location: Guangzhou | Day: TBD
This paper aims at generating anomalous images and their segmentation labels to address the lack of real-world anomaly samples and privacy issues. Departing from conventional approaches that use masks solely to guide the generation of anomaly images, we propose a dual-branch training strategy for the generative model. This strategy enables the simultaneous production of anomaly images and masks, with an alignment regularization loss that ensures the coherence between the generated images and their masks. During inference, only the image-generation branch is activated to produce synthetic samples for training the downstream segmentation model. Furthermore, we propose to integrate the well-trained generative model into the training of segmentation models, utilizing a generative feedback loss to refine the segmentation model’s performance. Experiments show our method’s IoU metrics exceed previous methods by 5.03%, 5.68% and 16.63% on Real-IAD (industrial), polyp (medical), and Floor Dirty (indoor) datasets. The code is publicly accessible at https://github.com/huan-yin/anomaly-alignment.
651: LensNet: An End-to-End Learning Framework for Empirical Point Spread Function Modeling and Lensless Imaging Reconstruction
Authors: Jiesong Bai, Yuhao Yin, Yihang Dong, Xiaofeng Zhang, Chi-Man Pun, Xuhang Chen
Location: Guangzhou | Day: TBD
Lensless imaging stands out as a promising alternative to conventional lens-based systems, particularly in scenarios demanding ultracompact form factors and cost-effective architectures. However, such systems are fundamentally governed by the Point Spread Function (PSF), which dictates how a point source contributes to the final captured signal. Traditional lensless techniques often require explicit calibrations and extensive pre-processing, relying on static or approximate PSF models. These rigid strategies can result in limited adaptability to real-world challenges, including noise, system imperfections, and dynamic scene variations, thus impeding high-fidelity reconstruction. In this paper, we propose LensNet, an end-to-end deep learning framework that integrates spatial-domain and frequency-domain representations in a unified pipeline. Central to our approach is a learnable Coded Mask Simulator (CMS) that enables dynamic, data-driven estimation of the PSF during training, effectively mitigating the shortcomings of fixed or sparsely calibrated kernels. By embedding a Wiener filtering component, LensNet refines global structure and restores fine-scale details, thus alleviating the dependency on multiple handcrafted pre-processing steps. Extensive experiments demonstrate LensNet’s robust performance and superior reconstruction quality compared to state-of-the-art methods, particularly in preserving high-frequency details and attenuating noise. The proposed framework establishes a novel convergence between physics-based modeling and data-driven learning, paving the way for more accurate, flexible, and practical lensless imaging solutions for applications ranging from miniature sensors to medical diagnostics. The code is available at https://github.com/baijiesong/Lensnet.
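For readers unfamiliar with the Wiener filtering step named in the abstract above, here is a minimal sketch of classical Wiener deconvolution with a known PSF. It is not LensNet's implementation: the 1D signal, the fixed three-tap PSF, the naive DFT, and the names `wiener_deconvolve` and `noise_to_signal` are all assumptions chosen for illustration, whereas the paper learns the PSF during training.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (fine for tiny signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def circular_convolve(signal, psf):
    """Simulate the measurement: each sample is blurred by the PSF (circularly)."""
    n = len(signal)
    return [sum(signal[(t - s) % n] * psf[s] for s in range(n)) for t in range(n)]

def wiener_deconvolve(measured, psf, noise_to_signal=1e-6):
    """Estimate the scene as conj(H) * Y / (|H|^2 + K) in the frequency domain."""
    Y = dft(measured)
    H = dft(psf)
    X = [h.conjugate() * y / (abs(h) ** 2 + noise_to_signal) for h, y in zip(H, Y)]
    return [v.real for v in dft(X, inverse=True)]

# Hypothetical toy data: a box-shaped scene and a short blur kernel.
n = 16
scene = [1.0 if 4 <= t < 8 else 0.0 for t in range(n)]
psf = [0.6, 0.3, 0.1] + [0.0] * (n - 3)
measured = circular_convolve(scene, psf)
restored = wiener_deconvolve(measured, psf)
```

The constant `noise_to_signal` regularizes frequencies where the PSF response is weak; a larger value trades sharpness for noise suppression.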
653: OS-GCL: A One-Shot Learner in Graph Contrastive Learning
Authors: Cheng Ji, Chenrui He, Qian Li, Qingyun Sun, Xingcheng Fu, Jianxin Li
Location: Guangzhou | Day: TBD
Graph contrastive learning (GCL) enhances the self-supervised learning capacity for graph representation learning. Nevertheless, the previous research has neglected to consider one fundamental nature of GCL — graph contrastive learning operates as a one-shot learner, guided by the widely utilized noise contrastive estimation (e.g., the InfoNCE loss). Theoretically, to initially investigate the factors that contribute to the one-shot learner essence, we analyze the InfoNCE-based objective and derive its equivalent form of the softmax-based cross-entropy function. It is concluded that the InfoNCE-based GCL is determined to be a (2n-1)-way 1-shot classifier (n is the number of nodes). In this particular context, each sample is indicative of a unique ideational class, and each class has only one sample. Consequently, the one-shot learning nature of GCL leads to the issue of the limited self-supervised signal. To further address the above issue, we propose a One-Shot Learner in Graph Contrastive Learning (OS-GCL). Firstly, we estimate the potential probability distributions of the deterministic node features and discrete graph topology. Secondly, we develop a probabilistic message-passing mechanism to propagate probability (of feature) on probability (of topology). Thirdly, we propose the ProbNCE loss functions to contrast distributions. Extensive experimental results demonstrate the superiority of OS-GCL. To the best of our knowledge, this is the first study to examine the one-shot learning essence and the limited self-supervised signal issue of GCL.
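The (2n-1)-way 1-shot reading in the abstract above can be checked numerically with a small sketch. It assumes a common two-view setup in which each anchor node has one positive (the same node in the other view) and 2(n-1) negatives (all other nodes in both views); the paper's exact formulation may differ, and the function names and temperature `tau` are hypothetical.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def candidate_logits(i, view1, view2, tau=0.5):
    """Similarities of anchor view1[i] to its 2n-1 candidates; index 0 is the positive."""
    anchor = view1[i]
    n = len(view1)
    cands = [view2[i]]                                  # the single positive
    cands += [view2[j] for j in range(n) if j != i]     # inter-view negatives
    cands += [view1[j] for j in range(n) if j != i]     # intra-view negatives
    return [cosine(anchor, c) / tau for c in cands]

def info_nce(logits):
    """InfoNCE written in its usual ratio form."""
    return -math.log(math.exp(logits[0]) / sum(math.exp(z) for z in logits))

def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax classifier over the candidates."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]

random.seed(0)
n, d = 5, 8
view1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
view2 = [[x + 0.1 * random.gauss(0, 1) for x in row] for row in view1]
logits = candidate_logits(0, view1, view2)
```

With n nodes the anchor sees 2n-1 candidate "classes", each represented by exactly one sample, and the two loss functions agree up to floating-point error.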
674: Noise Optimized Conditional Diffusion for Domain Adaptation
Authors: Lingkun Luo, Shiqiang Hu, Liming Chen
Location: Montreal | Day: August 19th | Time: 15:00 | Session: CV: Diffusion models
Pseudo-labeling is a cornerstone of Unsupervised Domain Adaptation (UDA), yet the scarcity of High-Confidence Pseudo-Labeled Target Domain Samples (hcpl-tds) often leads to inaccurate cross-domain statistical alignment, causing DA failures. To address this challenge, we propose Noise Optimized Conditional Diffusion for Domain Adaptation (NOCDDA), which seamlessly integrates the generative capabilities of conditional diffusion models with the decision-making requirements of DA to achieve task-coupled optimization for efficient adaptation. For robust cross-domain consistency, we modify the DA classifier to align with the conditional diffusion classifier within a unified optimization framework, enabling forward training on noise-varying cross-domain samples. Furthermore, we argue that the conventional N(0,I) initialization in diffusion models often generates class-confused hcpl-tds, compromising discriminative DA. To resolve this, we introduce a class-aware noise optimization strategy that refines sampling regions for reverse class-specific hcpl-tds generation, effectively enhancing cross-domain alignment. Extensive experiments across 5 benchmark datasets and 29 DA tasks demonstrate significant performance gains of NOCDDA over 31 state-of-the-art methods, validating its robustness and effectiveness.
680: Distribution-Aware Online Learning for Urban Spatiotemporal Forecasting on Streaming Data
Authors: Chengxin Wang, Gary Tan, Swagato Barman Roy, Beng Chin Ooi
Location: Guangzhou | Day: TBD
The intrinsic non-stationarity of urban spatiotemporal (ST) streams, particularly unique distribution shifts that evolve over time, poses substantial challenges for accurate urban ST forecasting. Existing works often overlook these dynamic shifts, limiting their ability to adapt to evolving trends effectively. To address this challenge, we propose DOL, a novel Distribution-aware Online Learning framework designed to handle the unique shifts in urban ST streams. DOL introduces a streaming update mechanism that leverages streaming memories to strategically adapt to gradual distribution shifts. By aligning network updates with these shifts, DOL avoids unnecessary updates, reducing computational overhead while improving prediction accuracy. DOL also incorporates an adaptive spatiotemporal network with a location-specific learner, enabling it to handle diverse urban distribution shifts across locations. Experimental results on four real-world datasets confirm DOL’s superiority over state-of-the-art models. The source code is available at https://github.com/cwang-nus/DOL.
686: Efficient Visual Representation Learning with Heat Conduction Equation
Authors: Zhemin Zhang, Xun Gong
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Computer Vision (1/3)
Foundation models, such as CNNs and ViTs, have powered the development of image representation learning. However, general guidance for model architecture design is still missing. Inspired by the connection between image representation learning and heat conduction, we model images by the heat conduction equation, where the essential idea is to conceptualize image features as temperatures and model their information interaction as the diffusion of thermal energy. Based on this idea, we find that many modern model architectures, such as residual structures, SE block, and feed-forward networks, can be interpreted from the perspective of the heat conduction equation. Therefore, we leverage the heat equation to design new and more interpretable models. As an example, we propose the Heat Conduction Layer and the Refinement Approximation Layer, inspired by solving the heat conduction equation using the Finite Difference Method and Fourier series, respectively. The main goal of this paper is to integrate the overall architectural design of neural networks into the theoretical framework of heat conduction. Nevertheless, our Heat Conduction Network (HcNet) still shows competitive performance, e.g., HcNet-T achieves 83.0% top-1 accuracy on ImageNet-1K while only requiring 28M parameters and 4.1G MACs. The code is publicly available at: https://github.com/ZheminZhang1/HcNet.
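To make the finite-difference idea in the abstract above concrete, the following sketch applies one explicit update of the 2D heat equation to a small grid of "temperatures". It is a textbook scheme, not the paper's Heat Conduction Layer; the function name `heat_step`, the step size `alpha`, and the zero-flux boundary handling are assumptions.

```python
def heat_step(u, alpha=0.2):
    """One explicit finite-difference step of du/dt = k * laplacian(u).

    alpha stands for k * dt / dx^2 and must be at most 0.25 for stability.
    Out-of-range neighbours are clamped to the edge (zero-flux boundary).
    """
    h, w = len(u), len(u[0])

    def at(i, j):
        return u[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

    out = []
    for i in range(h):
        row = []
        for j in range(w):
            laplacian = at(i - 1, j) + at(i + 1, j) + at(i, j - 1) + at(i, j + 1) - 4 * u[i][j]
            # The update has the form "input + correction".
            row.append(u[i][j] + alpha * laplacian)
        out.append(row)
    return out

# Hypothetical toy input: a single hot pixel in the centre of a 5x5 feature map.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
state = grid
for _ in range(10):
    state = heat_step(state)
```

Each step adds a neighbourhood-dependent correction to the input, which is one way to read the residual-structure analogy the abstract mentions; the total "heat" is conserved while the peak spreads out.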
690: Riding the Wave: Multi-Scale Spatial-Temporal Graph Learning for Highway Traffic Flow Prediction Under Overload Scenarios
Authors: Xigang Sun, Jiahui Jin, Hancheng Wang, Xiangguo Sun, Xiaoliang Wang, Jun Zhu
Location: Guangzhou | Day: TBD
Highway traffic flow prediction under overload scenarios (HIPO) is a critical problem in intelligent transportation systems, which aims to forecast future traffic patterns on highway segments during periods of exceptionally high demand. Despite its importance, this problem has rarely been explored in recent research due to the unique challenges posed by irregular flow patterns, complex traffic behaviors, and sparse contextual data. In this paper, we propose a Heterogeneous Spatial-Temporal graph network With Adaptive contrastiVE learning (HST-WAVE) to address the HIPO problem. Specifically, we first construct a heterogeneous traffic graph according to the physical highway structure. Then, we develop a multi-scale temporal weaving Transformer and a coupled heterogeneous graph attention network to capture the irregular traffic flow patterns and complex transition behaviors. Furthermore, we introduce an adaptive temporal enhancement contrastive learning strategy to bridge the gap between divergent temporal patterns and mitigate data sparsity. We conduct extensive experiments on two real-world highway network datasets (No. G56 and G60 in Hangzhou, China), showing that our model can effectively handle the HIPO problem and achieve state-of-the-art performance. The source code is available at https://github.com/luck-seu/HST-WAVE.
693: Incorporating Legal Logic into Deep Learning: An Intelligent Approach to Probation Prediction
Authors: Qinghua Wang, Xu Zhang, Lingyan Yang, Rui Shao, Bonan Wang, Fang Wang, Cunquan Qu
Location: Guangzhou | Day: TBD
Probation is a crucial institution in modern criminal law, embodying the principles of fairness and justice while contributing to the harmonious development of society. Despite its importance, the current Intelligent Judicial Assistant System (IJAS) lacks dedicated methods for probation prediction, and research on the underlying factors influencing probation eligibility remains limited. In addition, probation eligibility requires a comprehensive analysis of both criminal circumstances and remorse. Much of the existing research in IJAS relies primarily on data-driven methodologies, which often overlook the legal logic underpinning judicial decision-making. To address this gap, we propose a novel approach that integrates legal logic into deep learning models for probation prediction, implemented in three distinct stages. First, we construct a specialized probation dataset that includes fact descriptions and probation legal elements (PLEs). Second, we design a distinct probation prediction model named the Multi-Task Dual-Theory Probation Prediction Model (MT-DT), which is grounded in the legal logic of probation and the Dual-Track Theory of Punishment. Finally, our experiments on the probation dataset demonstrate that the MT-DT model outperforms baseline models, and an analysis of the underlying legal logic further validates the effectiveness of the proposed approach.
694: SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation
Authors: Zhaoxi Mu, Xinyu Yang, Gang Wang
Location: Guangzhou | Day: TBD
While contemporary speech separation technologies adeptly process lengthy mixed audio waveforms, they are frequently challenged by the intricacies of real-world environments, including noisy and reverberant settings, which can result in artifacts or distortions in the separated speech. To overcome these limitations, we introduce SepALM, a pioneering approach that employs audio language models (ALMs) to rectify and re-synthesize speech within the text domain following preliminary separation. SepALM comprises four core components: a separator, a corrector, a synthesizer, and an aligner. By integrating an ALM-based end-to-end error correction mechanism, we mitigate the risk of error accumulation and circumvent the optimization hurdles typically encountered in conventional methods that amalgamate automatic speech recognition (ASR) with large language models (LLMs). Additionally, we have developed Chain-of-Thought (CoT) prompting and knowledge distillation techniques to facilitate the reasoning and training processes of the ALM. Our experiments substantiate that SepALM not only elevates the precision of speech separation but also markedly bolsters adaptability in novel acoustic environments.
695: Subgraph Information Bottleneck with Causal Dependency for Stable Molecular Relational Learning
Authors: Peiliang Zhang, Jingling Yuan, Chao Che, Yongjun Zhu, Lin Li
Location: Guangzhou | Day: TBD
Molecular Relational Learning (MRL) is widely applied in molecular sciences. Recent studies attempt to retain molecular core information (e.g., substructures) by Graph Information Bottleneck but primarily focus on information compression without considering the causal dependencies of chemical reactions among substructures. This oversight neglects the core factors that determine molecular relationships, making it challenging to maintain stable MRL on distribution-shifted data. To bridge this gap, we propose the Causal Subgraph Information Bottleneck (CausalGIB) for stable MRL. CausalGIB leverages causal dependency to guide substructure representation and integrates subgraph information bottleneck to optimize the core substructure representation, generating stable representations. Specifically, we distinguish causal and confounding substructures by noise injection and substructure interaction based on causal analysis. Furthermore, by minimizing the discrepancy between causal and confounding information within subgraph information bottleneck, CausalGIB captures core substructures composed of causal substructures and aggregates them into molecular representations to improve their stability. Experimental results on nine datasets demonstrate that CausalGIB outperforms state-of-the-art models in two tasks and significantly enhances the model’s stability on distribution-shifted data.
705: MonoMixer: Marrying Convolution and Vision Transformer for Efficient Self-Supervised Monocular Depth Estimation
Authors: Zhiyong Chang, Yan Wang
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Computer Vision (2/3)
Self-supervised monocular depth estimation, which does not require hard-to-source depth labels for training, has been widely studied in recent years. Owing to significant and growing demand, many lightweight but effective architectures have been designed for edge devices. Convolutional Neural Networks (CNNs) have shown their extraordinary ability in monocular depth estimation. However, their limited receptive field restricts existing methods to reasoning only locally, inhibiting the effectiveness of the self-supervised paradigm. Recently, Transformers have achieved great success in estimating depth maps from monocular images. Nevertheless, the massive number of parameters in Transformers hinders deployment to edge devices. In this paper, we propose MonoMixer, a brand-new lightweight CNN-Transformer architecture with three main contributions: 1) The details-augmented (DA) block employs a graph reasoning unit to capture abundant local details, resulting in depth prediction at a higher level of precision. 2) The self-modulate channel attention (SMCA) block adaptively adjusts the channel weights of feature maps, aiming to emphasize the crucial features and aggregate channel-wise feature maps of different patterns. 3) The global-guided Transformer (G2T) block integrates a global semantic token into multi-scale local features and exploits cross-attention to encode long-range dependencies. Furthermore, experimental results demonstrate the superiority of our proposed MonoMixer in both model size and inference speed, and it achieves state-of-the-art performance on three datasets: KITTI, Make3D and Cityscapes. Specifically, our proposed MonoMixer outperforms MonoFormer by a large margin in accuracy, with about 95% fewer parameters.
719: Revealing Concept Shift in Spatio-Temporal Graphs via State Learning
Authors: Kuo Yang, Yunhe Guo, Qihe Huang, Zhengyang Zhou, Yang Wang
Location: Guangzhou | Day: TBD
Dynamic graphs are ubiquitous in the real world, presenting the temporal evolution of individuals within spatial associations. Dynamic graph learning research has flourished recently, striving to more effectively capture evolutionary patterns and spatial correlations. However, existing methods still fail to address the issue of concept shift in dynamic graphs. Concept shift manifests as a distribution shift in the mapping pattern between historical observations and future evolution. The reason is that some environment variables in dynamic graphs exert varying effects on evolution patterns, but these variables are not effectively captured by the models, leading to the intractable concept shift issue. To tackle this issue, we propose a State-driven environment inference framework (Samen) to achieve a dynamic graph learning framework equipped with concept generalization ability. Firstly, we propose a two-stage environment inference and compression strategy. From the perspective of state space, we introduce a prefix-suffix collaborative state learning mechanism to bidirectionally model the spatio-temporal states. A hierarchical state compressor is further designed to refine the state information that gives rise to concept shift. Secondly, we propose a skip-connection spatio-temporal prediction module, which effectively utilizes the inferred environments to improve the model’s generalization capability. Finally, we select seven datasets from different domains to validate the effectiveness of our model. By comparing the performance of different models on samples with concept shift, we verify that Samen gains generalization capacity that existing methods fail to capture.
720: The Proportional Veto Principle for Approval Ballots
Authors: Daniel Halpern, Ariel D. Procaccia, Warut Suksompong
Location: Montreal | Day: August 19th | Time: 11:30 | Session: GTEP: Computational social choice (1/2)
The proportional veto principle, which captures the idea that a candidate vetoed by a large group of voters should not be chosen, has been studied for ranked ballots in single-winner voting. We introduce a version of this principle for approval ballots, which we call flexible-voter representation (FVR). We show that while the approval voting rule and other natural scoring rules provide the optimal FVR guarantee only for some flexibility threshold, there exists a scoring rule that is FVR-optimal for all thresholds simultaneously. We also extend our results to multi-winner voting.
722: Towards Robust Deterministic and Probabilistic Modeling for Predictive Learning
Authors: Xuesong Nie, Haoyuan Jin, Vijayakumar Bhagavatula, Xiaofeng Liu
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Computer Vision (1/3)
Predictive modeling of unannotated spatiotemporal data presents inherent challenges, primarily due to the highly entangled visual dynamics in real-world scenes. To tackle these complexities, we introduce a novel insight through Disentangling Deterministic and Probabilistic (DDP) modeling. We note a key observation in spatiotemporal data where low-level details typically remain stable, whereas high-level motion frequently exhibits dynamic variations. The core motivation involves constructing two distinct pathways in the latent space: a deterministic path and a probabilistic path. The probabilistic path begins by defining the motion flow, which explicitly describes complex many-to-many motion patterns between patches, and models its probabilistic distribution using a motion diffuser. The deterministic path incorporates a spectral-aware enhancer to retain and amplify visual details in the frequency domain. These designs ensure visual consistency while also capturing intricate long-term motion dynamics. Extensive experiments demonstrate the superiority of DDP across diverse scenario evaluations.
728: Advancing Stain Transfer for Multi-Biomarkers: A Human Annotation-Free Method Based on Auxiliary Task Supervision
Authors: Siyuan Xu, Haofei Song, Yingjiao Deng, Jiansheng Wang, Yan Wang, Qingli Li
Location: Guangzhou | Day: TBD
Histopathological examination primarily relies on hematoxylin and eosin (H&E) and immunohistochemical (IHC) staining. Though IHC provides more crucial molecular information for diagnosis, it is more costly than H&E staining. Stain transfer technology seeks to efficiently generate virtual IHC images from H&E images. While current deep learning-based methods have made progress, they still struggle to maintain pathological and structural consistency across biomarkers without pixel-level aligned references. To address the problem, we propose an Auxiliary Task supervision-based Stain Transfer method for multi-biomarkers (ATST-Net), which is the first to employ human annotation-free masks as ground truth (GT). ATST-Net ensures pathological consistency, structural preservation and style transfer. It automatically annotates H&E masks in a cost-effective manner by utilizing consecutive IHC sections. Multiple auxiliary tasks provide diverse supervisory information on the location and intensity of biomarker expression, ensuring model accuracy and interpretability. We design a pretrained model-based generator to extract deep features in H&E images, improving generalization performance. Extensive experiments demonstrate the effectiveness of ATST-Net’s components. Compared to existing methods, ATST-Net achieves state-of-the-art (SOTA) accuracy on datasets with multiple biomarkers and intensity levels, while also reflecting high practical value. Code is available at https://github.com/SikangSHU/ATST-Net.
740: Beyond Individual and Point: Next POI Recommendation via Region-aware Dynamic Hypergraph with Dual-level Modeling
Authors: Xixi Li, Zhuo Gu, Rui Yao, Yong Zhou, Hancheng Zhu, Jiaqi Zhao, Wen-liang Du
Location: Guangzhou | Day: TBD
Next POI recommendation contributes to the prosperity of various intelligent location-based services. Existing studies focus on exploring sequential patterns and POI interactions using sequential and graph-based methods to enhance recommendation performance. However, they do not effectively exploit geographical information. In addition, methods that focus on modeling mobility patterns using limited individual data may suffer from data sparsity and the information cocoons problem. Moreover, most graph structures focus on adjacent nodes, failing to capture potential high-order associations among POIs. To address these challenges, we propose the Region-aware dynamic Hypergraph learning method with Dual-level interaction Modeling (ReHDM), which exploits users’ dynamic mobility beyond individual and point. Specifically, ReHDM utilizes regional encoding to mine the potential spatial relationships among POIs with coarse-grained geographical information. By incorporating POI-level and trajectory-level associations within a hypergraph convolutional network, ReHDM comprehensively captures cross-user collaborative information. Furthermore, ReHDM captures not only dependencies among POIs within each trajectory for a single user, but also the high-order collaborative information across individual user trajectories and associated users’ trajectories. Experimental results on three public datasets demonstrate the superiority of ReHDM over state-of-the-art methods.
749: Coupling Category Alignment for Graph Domain Adaptation
Authors: Nan Yin, Xiao Teng, Zhiguang Cao, Mengzhu Wang
Location: Guangzhou | Day: TBD
Graph domain adaptation (GDA), which transfers knowledge from a labeled source domain to an unlabeled target graph domain, attracts considerable attention in numerous fields. However, existing methods commonly employ message-passing neural networks (MPNNs) to learn domain-invariant representations by aligning the entire domain distribution, inadvertently neglecting category-level distribution alignment and potentially causing category confusion. To address the problem, we propose an effective framework named Coupling Category Alignment (CoCA) for GDA, which effectively addresses the category alignment issue with theoretical guarantees. CoCA incorporates a graph convolutional network branch and a graph kernel network branch, which explore graph topology in implicit and explicit manners. To mitigate category-level domain shifts, we leverage knowledge from both branches, iteratively filtering highly reliable samples from the target domain using one branch and fine-tuning the other accordingly. Furthermore, with these reliable target domain samples, we incorporate the coupled branches into a holistic contrastive learning framework. This framework includes multi-view contrastive learning to ensure consistent representations across the dual branches, as well as cross-domain contrastive learning to achieve category-level domain consistency. Theoretically, we establish a sharper generalization bound, which ensures the effectiveness of category alignment. Extensive experiments on benchmark datasets validate the superiority of the proposed CoCA compared with baselines.
750: scSiameseClu: A Siamese Clustering Framework for Interpreting Single-cell RNA Sequencing Data
Authors: Ping Xu, Zhiyuan Ning, Pengjiang Li, Wenhao Liu, Pengyang Wang, Jiaxu Cui, Yuanchun Zhou, Pengfei Wang
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Multidisciplinary Topics and Applications (1/2)
Single-cell RNA sequencing (scRNA-seq) reveals cell heterogeneity, with cell clustering playing a key role in identifying cell types and marker genes. Recent advances, especially graph neural network (GNN)-based methods, have significantly improved clustering performance. However, the analysis of scRNA-seq data remains challenging due to noise, sparsity, and high dimensionality. Compounding these challenges, GNNs often suffer from over-smoothing, limiting their ability to capture complex biological information. In response, we propose scSiameseClu, a novel Siamese Clustering framework for interpreting single-cell RNA-seq data, comprising three key steps: (1) Dual Augmentation Module, which applies biologically informed perturbations to the gene expression matrix and cell graph relationships to enhance representation robustness; (2) Siamese Fusion Module, which combines cross-correlation refinement and adaptive information fusion to capture complex cellular relationships while mitigating over-smoothing; and (3) Optimal Transport Clustering, which utilizes Sinkhorn distance to efficiently align cluster assignments with predefined proportions while maintaining balance. Comprehensive evaluations on seven real-world datasets demonstrate that scSiameseClu outperforms state-of-the-art methods in single-cell clustering, cell type annotation, and cell type classification, providing a powerful tool for scRNA-seq data interpretation.
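The third step in the abstract above, aligning cluster assignments with predefined proportions, can be illustrated with generic Sinkhorn iterations. This is a sketch of the standard entropic optimal-transport procedure, not scSiameseClu's code; the function name `sinkhorn_assign`, the toy scores, and the regularization strength `eps` are assumptions.

```python
import math

def sinkhorn_assign(scores, proportions, eps=0.5, iters=200):
    """Soft assignment plan with rows summing to 1/n and columns to `proportions`.

    scores[i][j] is the affinity of cell i for cluster j (higher is better).
    """
    n, k = len(scores), len(scores[0])
    kernel = [[math.exp(s / eps) for s in row] for row in scores]
    u = [1.0] * n
    v = [1.0] * k
    row_target = 1.0 / n
    for _ in range(iters):
        # Rescale rows so every cell carries equal mass.
        u = [row_target / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        # Rescale columns so every cluster receives its prescribed share.
        v = [proportions[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]

# Hypothetical toy affinities: a greedy argmax would put three of four cells in cluster 0.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
proportions = [0.5, 0.5]
plan = sinkhorn_assign(scores, proportions)
col_sums = [sum(plan[i][j] for i in range(len(plan))) for j in range(2)]
row_sums = [sum(row) for row in plan]
```

The column constraint is what enforces balance: mass is shifted toward the under-filled cluster even when raw affinities favor the other one.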
751: Sharpness-aware Zeroth-order Optimization for Graph Transformers
Authors: Yang Liu, Chuan Zhou, Yuhan Lin, Shuai Zhang, Yang Gao, Zhao Li, Shirui Pan
Location: Guangzhou | Day: TBD
Graph Transformers (GTs) have emerged as powerful tools for handling graph-structured data through global attention mechanisms. While GTs can effectively capture long-range dependencies, they introduce difficulties in optimization due to their complex, non-differentiable operators, which cannot be directly handled by standard gradient-based optimizers (such as Adam or AdamW). To address these issues, this work adopts Zeroth-Order Optimization (ZOO) techniques. However, direct integration of ZOO incurs considerable challenges due to the sharp loss landscape and steep gradients within the GT parameter space. Based on these observations, we propose a Sharpness-aware Zeroth-order Optimizer (SZO) that combines the Sharpness-Aware Minimization (SAM) technique, which facilitates convergence within a flatter neighborhood, with parallel computing for efficient gradient estimation. Theoretically, we provide a comprehensive analysis of the optimizer from both convergence and generalization perspectives. Empirically, we conduct extensive experiments on various classical GTs across a wide range of benchmark datasets, which underscore the superior performance of SZO over the state-of-the-art optimizers.
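To make the two ingredients concrete, here is a minimal sketch that pairs a standard two-point zeroth-order gradient estimate with a SAM-style update (estimate the gradient, step to a nearby "sharper" point, re-estimate there, then descend). It is not the SZO algorithm from the paper: the function names, step sizes, and the toy quadratic objective in the usage note are assumptions, and the parallel evaluation mentioned in the abstract is omitted.

```python
import math
import random

def zo_gradient(f, x, rng, mu=1e-3, n_samples=20):
    """Two-point random-direction estimate of the gradient of f at x.

    Only function values of f are used, so f need not be differentiable.
    """
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        grad = [g + slope * ui / n_samples for g, ui in zip(grad, u)]
    return grad

def sharpness_aware_zo_step(f, x, rng=None, lr=0.05, rho=0.05,
                            mu=1e-3, n_samples=20):
    """One SAM-style step driven by zeroth-order gradient estimates."""
    rng = rng or random.Random(0)
    g = zo_gradient(f, x, rng, mu, n_samples)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
    # Move to the approximate worst-case point within radius rho.
    x_adv = [xi + rho * gi / norm for xi, gi in zip(x, g)]
    # Descend using the gradient estimated at the perturbed point.
    g_adv = zo_gradient(f, x_adv, rng, mu, n_samples)
    return [xi - lr * gi for xi, gi in zip(x, g_adv)]
```

Calling `sharpness_aware_zo_step` repeatedly with a shared `random.Random` instance on a simple objective such as `lambda x: sum(v * v for v in x)` drives the iterate toward the minimum using function evaluations only.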
754: Towards Equilibrium: An Instantaneous Probe-and-Rebalance Multimodal Learning Approach
Authors: Yang Yang, Xixian Wu, Qing-Yuan Jiang
Location: Guangzhou | Day: TBD
The multimodal imbalance problem has been extensively studied to prevent the undesirable scenario where multimodal performance falls below that of unimodal models. However, existing methods typically assess the strength of modalities and perform learning simultaneously under the imbalanced status. This deferred strategy fails to rebalance multimodal learning instantaneously, leading to performance degeneration. To address this, we propose a novel multimodal learning approach, termed instantaneous probe-and-rebalance multimodal learning (IPRM), which employs a two-pass forward method to first probe (but not learn) and then perform rebalanced learning under the balanced status. Concretely, we first employ the geodesic multimodal mixup (GMM) to incorporate fusion representation and probe modality strength in the first forward phase. Then the weights are instantaneously recalibrated based on the probed strength, facilitating balanced training via the second forward pass. This process is applied dynamically throughout the entire training process. Extensive experiments reveal that our proposed IPRM outperforms all baselines, achieving state-of-the-art (SOTA) performance on numerous widely used datasets. The code is available at https://github.com/njustkmg/IJCAI25-IPRM.
769: BILE: An Effective Behavior-based Latent Exploration Scheme for Deep Reinforcement Learning
Authors: Yiming Wang, Kaiyan Zhao, Yan Li, Leong Hou U
Location: Guangzhou | Day: TBD
Efficient exploration of state spaces is critical for the success of deep reinforcement learning (RL). While many methods leverage exploration bonuses to encourage exploration instead of relying solely on extrinsic rewards, these bonus-based approaches often face challenges with learning efficiency and scalability, especially in environments with high-dimensional state spaces. To address these issues, we propose BehavIoral metric-based Latent Exploration (BILE). The core idea is to learn a compact representation within the behavioral metric space that preserves value differences between states. By introducing additional rewards to encourage exploration in this latent space, BILE drives the agent to visit states with higher value diversity and exhibit more behaviorally distinct actions, leading to more effective exploration of the state space. Additionally, we present a novel behavioral metric for efficient and robust training of the state encoder, backed by theoretical guarantees. Extensive experiments on high-dimensional environments, including realistic indoor scenarios in Habitat, robotic tasks in Robosuite, and challenging discrete Minigrid benchmarks, demonstrate the superiority and scalability of our method over other approaches.
787: Tackling Long-Tailed Data Challenges in Spiking Neural Networks via Heterogeneous Knowledge Distillation
Authors: Moqi Li, Xu Yang, Cheng Deng
Location: Montreal | Day: August 20th | Time: 10:00 | Session: ML: Spiking Neural Networks
Spiking Neural Networks (SNNs), inspired by the behavior of biological neurons, have gained significant research interest for resource-constrained edge devices and neuromorphic hardware due to their use of binary spike signals for inter-unit communication with low power consumption. However, the absence of research on spiking neural networks for long-tailed data has severely limited the deployment and application of this emerging network in practical scenarios. To fill this gap, this paper proposes a long-tail learning framework based on spiking neural networks, named LT-SpikingFormer, to alleviate the distribution bias between head and tail classes. LT-SpikingFormer adopts a widely trained Convolutional Neural Network to construct a heterogeneous knowledge distillation paradigm, offering balanced and reliable prior knowledge. Moreover, a multi-granularity hierarchical feature distillation objective is proposed that leverages cross-layer local features and global network predictions for refined information distillation, improving in particular the performance on tail classes. Extensive experimental results demonstrate that our method performs well on several benchmark datasets.
791: ESBN: Estimation Shift of Batch Normalization for Source-free Universal Domain Adaptation
Authors: Jiao Li, Houcheng Su, Bingli Wang, Yuandong Min, Mengzhu Wang, Nan Yin, Shanshan Wang, Jingcai Guo
Location: Guangzhou | Day: TBD
Domain adaptation (DA) is crucial for transferring models trained in one domain to perform well in a different, often unseen domain. Traditional methods, including unsupervised domain adaptation (UDA) and source-free domain adaptation (SFDA), have made significant progress. However, most existing DA methods rely heavily on Batch Normalization (BN) layers, which are not optimal in source-free settings, where the source domain is unavailable for comparison. In this study, we propose a novel method, ESBN, which addresses the challenge of domain shift by adjusting the placement of normalization layers and replacing BN with Batch-free Normalization (BFN). Unlike BN, BFN is less dependent on batch statistics and provides more robust feature representations through instance-specific statistics. We systematically investigate the effects of different BN layer placements across various network configurations and demonstrate that selective replacement with BFN improves generalization performance. Extensive experiments on multiple domain adaptation benchmarks show that our approach outperforms state-of-the-art methods, particularly in challenging scenarios such as Open-Partial Domain Adaptation (OPDA).
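The contrast between batch statistics and instance-specific statistics can be sketched as follows. This is only an illustration of the general idea: the per-sample normalisation below (layer-norm style, without learned scale or shift) is an assumption and may differ from the BFN layer actually used in ESBN.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalise each feature using the mean/variance over the batch."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(d)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(d)] for row in batch]

def per_instance_norm(batch, eps=1e-5):
    """Normalise each sample using only its own mean/variance."""
    out = []
    for row in batch:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

# The same sample placed in two batches with different companions.
sample = [1.0, 2.0, 6.0]
batch_a = [sample, [0.0, 0.0, 0.0], [1.0, 1.0, 2.0]]
batch_b = [sample, [5.0, -3.0, 2.0], [9.0, 9.0, 1.0]]
```

With batch normalisation the output for `sample` depends on which other samples share its batch, whereas the per-instance variant gives the same output in `batch_a` and `batch_b`, which is the property that matters when target-domain batches are small or shifted.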
799: Learning Real Facial Concepts for Independent Deepfake Detection
Authors: Ming-Hui Liu, Harry Cheng, Tianyi Wang, Xin Luo, Xin-Shun Xu
Location: Guangzhou | Day: TBD
Deepfake detection models often struggle with generalization to unseen datasets, manifesting as misclassifying real instances as fake in target domains. This is primarily due to an overreliance on forgery artifacts and a limited understanding of real faces. To address this challenge, we propose a novel approach RealID to enhance generalization by learning a comprehensive concept of real faces while assessing the probabilities of belonging to the real and fake classes independently. RealID comprises two key modules: the Real Concept Capture Module (RealC^2) and the Independent Dual-Decision Classifier (IDC). With the assistance of a Multi-Real Memory, RealC^2 maintains various prototypes for real faces, allowing the model to capture a comprehensive concept of real class. Meanwhile, IDC redefines the classification strategy by making independent decisions based on the concept of the real class and the presence of forgery artifacts. Through the combined effect of the above modules, the influence of forgery-irrelevant patterns is alleviated, and extensive experiments on five widely used datasets demonstrate that RealID significantly outperforms existing state-of-the-art methods, achieving a 1.74% improvement in average accuracy.
803: AdaMixT: Adaptive Weighted Mixture of Multi-Scale Expert Transformers for Time Series Forecasting
Authors: Huanyao Zhang, Jiaye Lin, Wentao Zhang, Haitao Yuan, Guoliang Li
Location: Guangzhou | Day: TBD
Multivariate time series forecasting involves predicting future values based on historical observations. However, existing approaches primarily rely on predefined single-scale patches or lack effective mechanisms for multi-scale feature fusion. These limitations hinder them from fully capturing the complex patterns inherent in time series, leading to constrained performance and insufficient generalizability. To address these challenges, we propose a novel architecture named Adaptive Weighted Mixture of Multi-Scale Expert Transformers (AdaMixT). Specifically, AdaMixT introduces various patches and leverages both General Pre-trained Models (GPM) and Domain-specific Models (DSM) for multi-scale feature extraction. To accommodate the heterogeneity of temporal features, AdaMixT incorporates a gating network that dynamically allocates weights among different experts, enabling more accurate predictions through adaptive multi-scale fusion. Comprehensive experiments on eight widely used benchmarks, including Weather, Traffic, Electricity, ILI, and four ETT datasets, consistently demonstrate the effectiveness of AdaMixT in real-world scenarios.
831: Can We Verify Step by Step for Incorrect Answer Detection?
Authors: Xin Xu, Shizhe Diao, Can Yang, Yang Wang
Location: Guangzhou | Day: TBD
Chain-of-Thought (CoT) prompting has marked a significant advancement in enhancing the reasoning capabilities of large language models (LLMs). Previous studies have developed various extensions of CoT, which focus primarily on enhancing end-task performance. In addition, there has been research on assessing the quality of reasoning chains in CoT. This raises an intriguing question: Is it possible to predict the accuracy of LLM outputs by scrutinizing the reasoning chains they generate? To answer this research question, we introduce a benchmark, R2PE, designed specifically to explore the relationship between reasoning chains and performance in various reasoning tasks spanning five different domains. This benchmark aims to measure the falsehood of the final output of LLMs based on the reasoning steps. To make full use of information in multiple reasoning chains, we propose the process discernibility score (PDS) framework that beats the answer-checking baseline by a large margin. Concretely, this results in an average increase of 5.1% in the F1 score and a 2.97% improvement in AUC-PR across all 45 subsets within R2PE. We further demonstrate our PDS’s efficacy in advancing open-domain QA accuracy. Code and data are available at https://github.com/XinXU-USTC/R2PE.git. For further details on the appendix, please refer to https://arxiv.org/abs/2402.10528.
832: Shaping a Stabilized Video by Mitigating Unintended Changes for Concept-Augmented Video Editing
Authors: Mingce Guo, Jingxuan He, Yufei Yin, Zhangye Wang, Shengeng Tang, Lechao Cheng
Location: Guangzhou | Day: TBD
Text-driven video editing powered by generative diffusion models holds significant promise for applications spanning film production, advertising, and beyond. However, the limited expressiveness of pre-trained word embeddings often restricts nuanced edits, especially when targeting novel concepts with specific attributes. In this work, we present a novel Concept-Augmented Textual Inversion (CATI) framework that flexibly integrates new object information from user-provided concept videos. By fine-tuning only the V (Value) projection in attention via Low-Rank Adaptation (LoRA), our approach preserves the original attention distribution of the diffusion model while efficiently incorporating external concept knowledge. To further stabilize editing results and mitigate the issue of attention dispersion when prompt keywords are modified, we introduce a Dual Prior Supervision (DPS) mechanism. DPS supervises cross-attention between the source and target prompts, preventing undesired changes to non-target areas and improving the fidelity of novel concepts. Extensive evaluations demonstrate that our plug-and-play solution not only maintains spatial and temporal consistency but also outperforms state-of-the-art methods in generating lifelike and stable edited videos. The source code is publicly available at https://guomc9.github.io/STIVE-PAGE/.
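The claim that adapting only the V projection leaves the attention distribution untouched can be checked on a toy example. The sketch below uses a small single-head self-attention layer in plain Python; the matrices, the rank-1 factors `a` and `b`, and the function names are hypothetical, and the real model applies LoRA inside a diffusion network's attention rather than this toy layer.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax_rows(m):
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(x, w_q, w_k, w_v):
    """Return (attention weights, attention output) for one head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
              for qi in q]
    weights = softmax_rows(scores)  # depends on Q and K only
    return weights, matmul(weights, v)

def lora_update(w, a, b, scale=1.0):
    """Return w + scale * (b @ a), a low-rank update of the frozen weight w."""
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

x = [[1.0, 0.0], [0.0, 1.0]]
w_q = [[1.0, 0.5], [0.0, 1.0]]
w_k = [[0.5, 0.0], [1.0, 1.0]]
w_v = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0], [0.5]]    # d_in x r, with rank r = 1
a = [[0.1, -0.2]]     # r x d_out
weights_before, out_before = attention(x, w_q, w_k, w_v)
weights_after, out_after = attention(x, w_q, w_k, lora_update(w_v, a, b))
```

Because the weights are a function of the query and key projections alone, `weights_before` and `weights_after` are identical while the outputs differ, so the update changes what is attended-to content without changing where attention goes.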
848: Streaming Multi-agent Pathfinding
Authors: Mingkai Tang, Lu Gan, Kaichen Zhang
Location: Guangzhou | Day: TBD
The task of the multi-agent pathfinding (MAPF) problem is to navigate a team of agents from their start points to their goal points. However, this setup is unsuitable for assembly line scenarios, which are periodic and run for long working hours. To address this issue, the study formalizes the streaming MAPF (S-MAPF) problem, which assumes that the agents in the same agent stream have a periodic start time and share the same action sequence. The proposed solution, Agent Stream Conflict-Based Search (ASCBS), is designed to tackle this problem by incorporating a cyclic vertex/edge constraint to handle conflicts. Additionally, this work explores the potential usage of the disjoint splitting strategy within ASCBS. Experimental results indicate that ASCBS surpasses traditional MAPF solvers in terms of runtime for scenarios with prolonged working hours.
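One way to see why conflicts become cyclic is to check when agents from two streams can meet. The sketch below is not ASCBS; it assumes that both streams share the same period, that streams run indefinitely, and that an agent occupies the grid only while it is on its path. Under those assumptions, an agent at step `i` of stream A and an agent at step `j` of stream B are at their vertices at the same time exactly when `start_a + i` and `start_b + j` agree modulo the period.

```python
def stream_vertex_conflict(path_a, start_a, path_b, start_b, period):
    """Find a vertex conflict between two periodic agent streams.

    path_a, path_b   : vertex sequences shared by every agent of a stream.
    start_a, start_b : release time of the first agent of each stream.
    period           : gap between consecutive releases (assumed common).

    Returns (i, j) if an agent at step i of stream A and an agent at
    step j of stream B occupy the same vertex at the same time, else None.
    """
    for i, vertex_a in enumerate(path_a):
        for j, vertex_b in enumerate(path_b):
            same_vertex = vertex_a == vertex_b
            # Times start_a + k*period + i and start_b + m*period + j
            # coincide for some agents k, m iff they agree modulo period.
            same_time = (start_a + i - start_b - j) % period == 0
            if same_vertex and same_time:
                return (i, j)
    return None

path_a = [(0, 0), (0, 1), (0, 2)]
path_b = [(1, 1), (0, 1), (-1, 1)]
```

Here the two paths cross at vertex `(0, 1)`. With equal start times and period 3 the streams collide there, while delaying stream B by one time step removes the collision for every pair of agents at once, which is the kind of statement a cyclic constraint expresses.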
850: SyncGaussian: Stable 3D Gaussian-Based Talking Head Generation with Enhanced Lip Sync via Discriminative Speech Features
Authors: Ke Liu, Jiwei Wei, Shiyuan He, Zeyu Ma, Chaoning Zhang, Ning Xie, Yang Yang
Location: Guangzhou | Day: TBD
Generating high-fidelity talking heads that maintain stable head poses and achieve robust lip sync remains a significant challenge. Although methods based on 3D Gaussian Splatting (3DGS) offer a promising solution via point-based deformation, they suffer from inconsistent head dynamics and mismatched mouth movements due to unstable Gaussian initialization and incomplete speech features. To overcome these limitations, we introduce SyncGaussian, a 3DGS-based framework that ensures stable head poses, enhanced lip sync, and realistic appearances with real-time rendering. SyncGaussian employs a stable head Gaussian initialization strategy to mitigate head jitter by optimizing commonly used rough head pose parameters. To enhance lip sync, we propose a sync-enhanced encoder that leverages audio-to-text and audio-to-visual speech features. Guided by a tailored cosine similarity loss function, the encoder integrates discriminative speech features through a multi-level sync adaptation mechanism, enabling the learning of an adaptive speech feature space. Extensive experiments demonstrate that SyncGaussian outperforms state-of-the-art methods in image quality, dynamic motion, and lip sync, with the potential for real-time applications.
852: METOR: A Unified Framework for Mutual Enhancement of Objects and Relationships in Open-vocabulary Video Visual Relationship Detection
Authors: Yongqi Wang, Xinxiao Wu, Shuo Yang
Location: Guangzhou | Day: TBD
Open-vocabulary video visual relationship detection aims to detect objects and their relationships in videos without being restricted by predefined object or relationship categories. Existing methods leverage the rich semantic knowledge of pre-trained vision-language models such as CLIP to identify novel categories. They typically adopt a cascaded pipeline to first detect objects and then classify relationships based on the detected objects, which may lead to error propagation and thus suboptimal performance. In this paper, we propose Mutual EnhancemenT of Objects and Relationships (METOR), a query-based unified framework to jointly model and mutually enhance object detection and relationship classification in open-vocabulary scenarios. Under this framework, we first design a CLIP-based contextual refinement encoding module that extracts visual contexts of objects and relationships to refine the encoding of text features and object queries, thus improving the generalization of encoding to novel categories. Then we propose an iterative enhancement module to alternately enhance the representations of objects and relationships by fully exploiting their interdependence to improve recognition performance. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate that our framework achieves state-of-the-art performance. Code is available at https://github.com/wangyongqi558/METOR.
854: Efficient Dynamic Ensembling for Multiple LLM Experts
Authors: Jinwu Hu, Yufeng Wang, Shuhai Zhang, Kai Zhou, Guohao Chen, Yu Hu, Bin Xiao, Mingkui Tan
Location: Guangzhou | Day: TBD
LLMs have demonstrated impressive performance across various language tasks. However, the strengths of LLMs can vary due to different architectures, model sizes, areas of training data, etc. Therefore, ensemble reasoning that combines the strengths of different LLM experts is critical to achieving consistent and satisfactory performance on diverse inputs across a wide range of tasks. However, existing LLM ensemble methods are either computationally intensive or incapable of leveraging complementary knowledge among LLM experts for various inputs. In this paper, we propose an efficient Dynamic Ensemble Reasoning paradigm, called DER, to integrate the strengths of multiple LLM experts conditioned on dynamic inputs. Specifically, we model the LLM ensemble reasoning problem as a Markov Decision Process, wherein an agent sequentially takes inputs to request knowledge from an LLM candidate and passes the output to a subsequent LLM candidate. Moreover, we devise a reward function to train a DER-Agent to dynamically select an optimal answering route given the input questions, aiming to achieve the highest performance with as few computational resources as possible. Last, to fully transfer the expert knowledge from the prior LLMs, we develop a Knowledge Transfer Prompt that enables the subsequent LLM candidates to transfer complementary knowledge effectively. Experiments demonstrate that our method uses fewer computational resources to achieve better performance compared to state-of-the-art baselines. Code and appendix are available at https://github.com/Fhujinwu/DER.
899: Set-Based Retrograde Analysis: Precomputing the Solution to 28-card Bridge Double Dummy Deals
Authors: Isaac Stone, Nathan R. Sturtevant, Jonathan Schaeffer
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Search
Among the most popular games played worldwide, Bridge stands out for having had little AI progress for over 25 years. Ginsberg’s Partition Search algorithm (1996) was a breakthrough for double-dummy Bridge play, allowing a program to reason about sets of states rather than individual states. Partition Search supports the current state of the art for both bidding and cardplay. In the time since, virtually no progress has been made in Bridge bidding. Inspired by Ginsberg’s idea, this paper presents Setrograde Analysis, a new set-based algorithm for perfectly solving Bridge hands. Using this approach, we have solved all 7-trick (28-card) hands, a space of 10^30 states that can be reduced to 10^17 unique states using preexisting techniques. This was done by considering five orders of magnitude fewer sets than the traditional state-based Retrograde Analysis algorithm. This work suggests that the entire 13-trick (52-card) state space can be solved with modern technology using this new approach. The 7-trick computation represents the largest endgame database to date in any game.
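For readers unfamiliar with the state-based baseline, the following sketch shows classic retrograde analysis on a toy take-away game (players alternately remove 1 to 3 stones, and the player who cannot move loses). It illustrates only the traditional technique of labelling terminal states and propagating values backwards to predecessors; it is not the set-based Setrograde algorithm, and the game and function name are chosen purely for illustration.

```python
def retrograde_solve(max_stones, moves=(1, 2, 3)):
    """Label every position 'W' (win) or 'L' (loss) for the player to move."""
    value = {0: "L"}  # terminal position: no stones left, player to move loses
    # Number of successors of each position not yet known to be wins.
    remaining = {n: sum(1 for m in moves if m <= n)
                 for n in range(1, max_stones + 1)}
    frontier = [0]
    while frontier:
        state = frontier.pop()
        for m in moves:
            pred = state + m  # a position from which `state` is reachable
            if pred > max_stones or pred in value:
                continue
            if value[state] == "L":
                # The predecessor can move to a lost position, so it wins.
                value[pred] = "W"
                frontier.append(pred)
            else:
                # One more successor of pred turned out to be a win.
                remaining[pred] -= 1
                if remaining[pred] == 0:
                    # Every move from pred leads to a win for the opponent.
                    value[pred] = "L"
                    frontier.append(pred)
    return value

table = retrograde_solve(20)
```

The resulting `table` plays the role of an endgame database: each entry is computed once, working backwards from the end of the game, and can then be looked up without search.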
900: Exploring the Trade-Offs: Quantization Methods, Task Difficulty, and Model Size in Large Language Models From Edge to Giant
Authors: Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, Yongin Kwon
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: Natural Language Processing (1/2)
Quantization has gained attention as a promising solution for the cost-effective deployment of large and small language models. However, most prior work has been limited to perplexity or basic knowledge tasks and lacks a comprehensive evaluation of recent models like Llama-3.3. In this paper, we conduct a comprehensive evaluation of instruction-tuned models spanning 1B to 405B parameters, applying four quantization methods across 13 datasets. Our findings reveal that (1) quantized models generally surpass smaller FP16 baselines, yet they often struggle with instruction-following and hallucination detection; (2) FP8 consistently emerges as the most robust option across tasks, and AWQ tends to outperform GPTQ in weight-only quantization; (3) smaller models can suffer severe accuracy drops at 4-bit quantization, while 70B-scale models maintain stable performance; (4) notably, hard tasks do not always experience the largest accuracy losses, indicating that quantization magnifies a model’s inherent weaknesses rather than simply correlating with task difficulty; and (5) an LLM-based judge (MT-Bench) highlights significant performance declines in Coding and STEM tasks, though it occasionally reports improvements in reasoning.
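As background for the 4-bit results, the sketch below shows plain symmetric round-to-nearest weight quantization, the simplest form of weight-only quantization. It is not AWQ, GPTQ, or FP8, all of which are more elaborate; the function names and the example weights are assumptions made for illustration.

```python
def quantize_symmetric(weights, bits=4):
    """Map float weights to signed integers in [-2^(bits-1), 2^(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    # One scale for the whole tensor; fall back to 1.0 if all weights are zero.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from integers and the scale."""
    return [q * scale for q in quantized]

def max_abs_error(weights, bits):
    quantized, scale = quantize_symmetric(weights, bits)
    restored = dequantize(quantized, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))

example_weights = [0.7, -0.35, 0.1, 0.0]
error_4bit = max_abs_error(example_weights, bits=4)
error_8bit = max_abs_error(example_weights, bits=8)
```

The rounding error is bounded by half the scale, and the scale grows as the bit width shrinks, which gives some intuition for why 4-bit settings are the ones where accuracy drops tend to appear.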
903: NuMDS: An Efficient Local Search Algorithm for Minimum Dominating Set Problem
Authors: Rui Sun, Zhaohui Liu, Yiyuan Wang, Han Xiao, Jiangnan Li, Jiejiang Chen
Location: Guangzhou | Day: TBD
The minimum dominating set (MDS) problem is a crucial NP-hard combinatorial optimization problem with wide applications in real-world scenarios. In this paper, we propose an efficient local search algorithm for the MDS problem, named NuMDS, which comprises three key ideas. First, we introduce a dominate propagation-based reduction method that fixes a portion of vertices in a given graph. Second, we develop a novel two-phase initialization method based on the decomposition method. Third, we propose a multi-stage local search procedure, which adopts three different search manners according to the current stage of the search. We conduct extensive experiments to demonstrate the outstanding effectiveness of NuMDS, and the results clearly indicate that NuMDS outperforms previous state-of-the-art algorithms on almost all instances.
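For reference, a dominating set is a vertex set such that every vertex is either in the set or adjacent to a member of it. The sketch below gives a validity check and a textbook greedy construction; it is a simple baseline for illustration only and does not implement NuMDS's reduction, initialization, or local search.

```python
def is_dominating_set(adj, chosen):
    """adj maps each vertex to the set of its neighbours."""
    chosen = set(chosen)
    return all(v in chosen or bool(chosen & adj[v]) for v in adj)

def greedy_dominating_set(adj):
    """Repeatedly pick the vertex that dominates the most uncovered vertices."""
    undominated = set(adj)
    chosen = set()
    while undominated:
        best = max(adj, key=lambda v: len(({v} | adj[v]) & undominated))
        chosen.add(best)
        undominated -= {best} | adj[best]
    return chosen

# A path on five vertices: 0 - 1 - 2 - 3 - 4.
path_graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
solution = greedy_dominating_set(path_graph)
```

The greedy result is always a valid dominating set but is not guaranteed to be minimum, which is where reduction rules and local search come in.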
908: Imagination-Limited Q-Learning for Offline Reinforcement Learning
Authors: Wenhui Liu, Zhijian Wu, Jingchao Wang, Dingjiang Huang, Shuigeng Zhou
Location: Guangzhou | Day: TBD
Offline reinforcement learning seeks to derive improved policies entirely from historical data but often struggles with over-optimistic value estimates for out-of-distribution (OOD) actions. This issue is typically mitigated via policy constraint or conservative value regularization methods. However, these approaches may impose overly strict constraints or biased value estimates, potentially limiting performance improvements. To balance exploitation and restriction, we propose an Imagination-Limited Q-learning (ILQ) method, which aims to maintain the optimism that OOD actions deserve within appropriate limits. Specifically, we utilize the dynamics model to imagine OOD action-values, and then clip the imagined values with the maximum behavior values. This design preserves a reasonable evaluation of OOD actions as far as possible while avoiding over-optimism. Theoretically, we prove the convergence of the proposed ILQ under tabular Markov decision processes. In particular, we demonstrate that the error bound between estimated values and optimality values of OOD state-actions possesses the same magnitude as that of in-distribution ones, thereby indicating that the bias in value estimates is effectively mitigated. Empirically, our method achieves state-of-the-art performance on a wide range of tasks in the D4RL benchmark.
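The imagine-then-clip idea can be sketched in a few lines. This is a hedged reading of the abstract, not the paper's exact update rule: it assumes the imagined value is a one-step model-based backup and that the "maximum behavior value" is the largest Q-value among actions observed in the dataset at the same state. All names and numbers are hypothetical.

```python
def imagination_limited_target(reward_hat, next_value_hat,
                               behavior_q_values, gamma=0.99):
    """Target value for an out-of-distribution (OOD) action.

    reward_hat, next_value_hat : predictions of a learned dynamics model.
    behavior_q_values          : Q-values of in-dataset actions at this state.
    """
    imagined = reward_hat + gamma * next_value_hat  # model-based backup
    limit = max(behavior_q_values)                  # best supported value
    return min(imagined, limit)                     # keep optimism within limit

# An over-optimistic imagined value is clipped to the behaviour maximum.
clipped = imagination_limited_target(1.0, 10.0, [2.0, 5.0], gamma=0.5)
# A modest imagined value passes through unchanged.
kept = imagination_limited_target(1.0, 2.0, [2.0, 5.0], gamma=0.5)
```

Compared with assigning OOD actions a uniformly pessimistic value, clipping leaves them as attractive as the model suggests, up to the best value the data actually supports.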
914: Vi3D-LLaMA: Observe and Understand the 3D Scene with A Video Sequence
Authors: Yingjie Wang, Jiajun Deng, Yao Li, Houqiang Li, Yanyong Zhang
Location: Montreal | Day: August 21st | Time: 15:00 | Session: CV: videos
Current 3D Multimodal Large Language Models (3D MLLMs) leverage explicit 3D input, e.g., point clouds, to understand the 3D world and enable spatial reasoning. These explicit 3D data are usually obtained through reconstruction or additional depth sensors, affecting the model’s scalability and deployment. In this work, we take a different stance and introduce Vi3D-LLaMA, a powerful MLLM operating without point cloud or depth data. Instead, we explore how to capture and interpret 3D scenes directly from RGB video sequences. The core idea of this work is to empower the video MLLM with the capability of understanding the 3D world from two aspects: (1) 3D-Aware Geometric Encoding: Camera parameters and a frustum-based 3D position encoder are used to transform video representations into 3D-aware tokens, enabling implicit modeling of 3D structures with RGB frames. (2) Fine-Grained Semantic Enhancement: High-resolution (HR) images are progressively incorporated into the video representation through a lightweight HR adapter, facilitating video tokens with semantic details. We conduct extensive experiments and demonstrate that Vi3D-LLaMA, using only RGB data, can achieve comparable results with state-of-the-art 3D MLLMs. Additionally, we benchmark our method on the new VSI-Bench, showing consistent improvement over the video MLLM baseline.
948: Verified Certificates via SAT and Computer Algebra Systems for the Ramsey R(3,8) and R(3,9) Problems
Authors: Zhengyu Li, Conor Duggan, Curtis Bright, Vijay Ganesh
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Constraint Satisfaction and Optimization (2/3)
The Ramsey problem R(3,k) seeks to determine the smallest value of n such that any red/blue edge coloring of the complete graph on n vertices must either contain a blue triangle (3-clique) or a red clique of size k. Despite its significance, many previous computational results for the Ramsey R(3,k) problem such as R(3,8) and R(3,9) lack formal verification. To address this issue, we use the software MathCheck to generate certificates for Ramsey problems R(3,8) and R(3,9) (and symmetrically R(8,3) and R(9,3)) by integrating a Boolean satisfiability (SAT) solver with a computer algebra system (CAS). Our SAT+CAS approach significantly outperforms traditional SAT-only methods, demonstrating an improvement of several orders of magnitude in runtime. For instance, our SAT+CAS approach solves R(3,8) (resp., R(8,3)) sequentially in 59 hours (resp., in 11 hours), while a SAT-only approach using state-of-the-art CaDiCaL solver times out after 7 days. Additionally, in order to be able to scale to harder Ramsey problems R(3,9) and R(9,3) we further optimized our SAT+CAS tool using a parallelized cube-and-conquer approach. Our results provide the first independently verifiable certificates for these Ramsey numbers, ensuring both correctness and completeness of the exhaustive search process of our SAT+CAS tool.
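The definition can be checked by brute force on the smallest case. The sketch below enumerates every red/blue colouring of the complete graph on n vertices and tests for a blue s-clique or a red t-clique; it confirms the well-known value R(3,3) = 6. It is an illustration of the problem statement only, not the SAT+CAS method, and the function names are chosen for this example.

```python
from itertools import combinations, product

BLUE, RED = 0, 1

def has_mono_clique(coloring, n, color, size):
    """True if some `size` vertices have all mutual edges in `color`."""
    return any(
        all(coloring[edge] == color for edge in combinations(clique, 2))
        for clique in combinations(range(n), size)
    )

def every_coloring_is_forced(n, s, t):
    """True if every colouring of K_n has a blue s-clique or a red t-clique."""
    edges = list(combinations(range(n), 2))
    for colors in product((BLUE, RED), repeat=len(edges)):
        coloring = dict(zip(edges, colors))
        if not (has_mono_clique(coloring, n, BLUE, s)
                or has_mono_clique(coloring, n, RED, t)):
            return False  # found a colouring that avoids both cliques
    return True
```

`every_coloring_is_forced(6, 3, 3)` holds while `every_coloring_is_forced(5, 3, 3)` does not, so 6 is the smallest such n. The number of colourings is 2^(n(n-1)/2), so this enumeration is hopeless long before the graph sizes relevant to R(3,8) and R(3,9), which is why solver-based search with verifiable certificates is needed.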
952: EAVIT: Efficient and Accurate Human Value Identification From Text Data via LLMs
Authors: Wenhao Zhu, Yuhang Xie, Guojie Song, Xin Zhang
Location: Guangzhou | Day: TBD
The rapid evolution of large language models (LLMs) has revolutionized various fields, including the identification and discovery of human values within text data. While traditional NLP models, such as BERT, have been employed for this task, their ability to represent textual data is significantly outperformed by emerging LLMs like GPTs. However, the performance of online LLMs often degrades when handling long contexts required for value identification, which also incurs substantial computational costs. To address these challenges, we propose EAVIT, an efficient and accurate framework for human value identification that combines the strengths of both locally fine-tunable and online black-box LLMs. Our framework employs a value detector—a small, local language model—to generate initial value estimations. These estimations are then used to construct concise input prompts for online LLMs, enabling accurate final value identification. To train the value detector, we introduce explanation-based training and data generation techniques specifically tailored for value identification, alongside sampling strategies to optimize the brevity of LLM input prompts. Our approach effectively reduces the number of input tokens by up to 1/6 compared to directly querying online LLMs, while consistently outperforming traditional NLP methods and other LLM-based strategies.
957: EyeSeg: An Uncertainty-Aware Eye Segmentation Framework for AR/VR
Authors: Zhengyuan Peng, Jianqing Xu, Shen Li, Jiazhen Ji, Yuge Huang, Jingyun Zhang, Jinmin Li, Shouhong Ding, Rizen Guo, Xin Tan, Lizhuang Ma
Location: Guangzhou | Day: TBD
Human-machine interaction through augmented reality (AR) and virtual reality (VR) is increasingly prevalent, requiring accurate and efficient gaze estimation which hinges on the accuracy of eye segmentation to enable smooth user experiences. We introduce EyeSeg, a novel eye segmentation framework designed to overcome key challenges that existing approaches struggle with: motion blur, eyelid occlusion, and train-test domain gaps. In these situations, existing models struggle to extract robust features, leading to suboptimal performance. Noting that these challenges can be generally quantified by uncertainty, we design EyeSeg as an uncertainty-aware eye segmentation framework for AR/VR wherein we explicitly model the uncertainties by performing Bayesian uncertainty learning of a posterior under the closed set prior. Theoretically, we prove that a statistic of the learned posterior indicates segmentation uncertainty levels and empirically outperforms existing methods in downstream tasks, such as gaze estimation. EyeSeg outputs an uncertainty score and the segmentation result, weighting and fusing multiple gaze estimates for robustness, which proves to be effective especially under motion blur, eyelid occlusion and cross-domain challenges. Moreover, empirical results suggest that EyeSeg achieves segmentation improvements of MIoU, E1, F1, and ACC surpassing previous approaches.
968: Heterogeneous Federated Learning with Scalable Server Mixture-of-Experts
Authors: Jingang Jiang, Yanzhao Chen, Xiangyang Liu, Haiqi Jiang, Chenyou Fan
Location: Guangzhou | Day: TBD
Classical Federated Learning (FL) encounters significant challenges when deploying large models on power-constrained clients. To tackle this, we propose an asymmetric FL mechanism that enables the aggregation of compact client models into a comprehensive server model. We design the server model as a Mixture-of-Experts (MoE), where each expert has the same architecture as each client model. This uniformity allows for efficient fusion of the most pertinent client models to update each server expert, based on the measured relevance between each client and server expert. To address the Non-IID data issue, we further optimize the server-side MoE architecture by incorporating a main expert that always activates alongside a set of selectively activated routed experts. This configuration ensures a balance between learning general knowledge and specific data distribution. Our Fed-MoE framework is model-agnostic and has demonstrated notable improvements on vision FL tasks with million-scale ResNet backbones, and language tasks with billion-scale BERT and GPT-2 backbones.
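The server-side layout described above, one main expert that always fires plus a few routed experts chosen per input, can be sketched as follows. This is a generic shared-expert Mixture-of-Experts forward pass under assumed conventions (softmax gate, top-k selection, renormalised weights); the Fed-MoE routing and aggregation rules may differ, and all names are hypothetical.

```python
import math

def moe_forward(x, main_expert, routed_experts, gate_logits, top_k=1):
    """Combine an always-active main expert with the top-k routed experts.

    x              : input vector (list of floats).
    main_expert    : callable applied to every input.
    routed_experts : list of callables, only top_k of which are applied.
    gate_logits    : one routing score per routed expert for this input.
    """
    top = max(gate_logits)
    exps = [math.exp(g - top) for g in gate_logits]
    probs = [e / sum(exps) for e in exps]
    selected = sorted(range(len(routed_experts)),
                      key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in selected)
    out = main_expert(x)  # general knowledge: always active
    for i in selected:    # specialised knowledge: sparsely active
        y = routed_experts[i](x)
        out = [o + probs[i] / norm * yi for o, yi in zip(out, y)]
    return out, selected

identity = lambda v: list(v)
times_ten = lambda v: [10.0 * a for a in v]
times_hundred = lambda v: [100.0 * a for a in v]
output, chosen = moe_forward([1.0, 2.0], identity,
                             [times_ten, times_hundred], [0.0, 5.0])
```

Because every routed expert shares the client architecture, a server of this shape can grow by adding experts without changing what each client has to train.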
1002: Dynamic Multiple High-order Correlations Fusion with Noise Filtering for Incomplete Multi-view Noisy-label Learning
Authors: Kaixiang Wang, Xiaojian Ding, Fan Yang
Location: Guangzhou | Day: TBD
Multi-view multi-label data often suffers from incomplete feature views and label noise. This paper is the first to address both challenges simultaneously, rectifying critical deficiencies in existing methodologies that inadequately extract and fuse high-order structural correlations across views while lacking robust solutions to mitigate label noise. We introduce a dynamic multiple high-order correlations fusion method with noise filtering, specifically designed for incomplete multi-view noisy-label learning. By capitalizing on a dynamic multi-hypergraph neural network, inspired by the principles of ensemble learning, we capture and integrate high-order correlations among samples from different views. The model’s capability is further augmented through a hypergraph fusion technique based on random walk theory, which enables it to combine both structural and feature information. Moreover, we propose noise-filtering matrices that are tightly embedded within the hypergraph neural network, devised to counteract the detrimental impact of label noise. Recognizing that label noise perturbs the data distribution in the label space, these filtering matrices exploit the distributional disparities between feature and label spaces. The high-order structural information derived from both domains underpins the learning and efficacy of the noise-filtering matrices. Empirical evaluations on benchmark datasets demonstrate that our method significantly outperforms contemporary state-of-the-art techniques.
1007: Towards Improved Risk Bounds for Transductive Learning
Authors: Bowei Zhu, Shaojie Li, Yong Liu
Location: Guangzhou | Day: TBD
Transductive learning is a popular setting in statistical learning theory, reasoning from observed, specific training cases to specific test cases, which has been widely used in many fields such as graph neural networks and semi-supervised learning. Existing results provide fast rates of convergence based on the traditional local techniques, which need the surrogate function that upper bounds the uniform error within a localized region to be “sub-root”. We derive a new version of the concentration inequality for empirical processes in transductive learning and apply the generic chaining technique to relax the assumptions and gain tighter results in empirical risk minimization. Furthermore, we concentrate on the generalization of the moment penalization algorithm. We design a novel estimator based on second moment (variance) penalization and derive its learning rates, which is the first theoretical generalization analysis considering variance-based algorithms.
1010: Zero-Shot Machine Unlearning with Proxy Adversarial Data Generation
Authors: Huiqiang Chen, Tianqing Zhu, Xin Yu, Wanlei Zhou
Location: Guangzhou | Day: TBD
Machine unlearning aims to remove the influence of specific samples from a trained model. A key challenge in this process is over-unlearning, where the model’s performance on the remaining data significantly drops due to the change in the model’s parameters. Existing unlearning algorithms depend on the remaining data to prevent this issue. As such, these methods are inapplicable in a more practical scenario, where only the unlearning samples are available (i.e., zero-shot unlearning). This paper presents a novel framework, ZS-PAG, to fill this gap. Our approach offers three key innovations: (1) we approximate the inaccessible remaining data by generating adversarial samples; (2) leveraging the generated samples, we pinpoint a specific subspace to perform the unlearning process, therefore preventing over-unlearning in the challenging zero-shot scenario; and (3) we consider the influence of the unlearning process on the remaining samples and design an influence-based pseudo-labeling strategy. As a result, our method further improves the model’s performance after unlearning. The proposed method holds a theoretical guarantee, and experiments on various benchmarks validate the effectiveness and superiority of our proposed method over several baselines.
1012: Non-collective Calibrating Strategy for Time Series Forecasting
Authors: Bin Wang, Yongqi Han, Minbo Ma, Tianrui Li, Junbo Zhang, Feng Hong, Yanwei Yu
Location: Guangzhou | Day: TBD
Deep learning-based approaches have demonstrated significant advancements in time series forecasting. Despite these ongoing developments, the complex dynamics of time series make it challenging to establish the rule of thumb for designing the golden model architecture. In this study, we argue that refining existing advanced models through a universal calibrating strategy can deliver substantial benefits with minimal resource costs, as opposed to elaborating and training a new model from scratch. We first identify a multi-target learning conflict in the calibrating process, which arises when optimizing variables across time steps, leading to the underutilization of the model’s learning capabilities. To address this issue, we propose an innovative calibrating strategy called Socket+Plug (SoP). This approach retains an exclusive optimizer and early-stopping monitor for each predicted target within each Plug while keeping the fully trained Socket backbone frozen. The model-agnostic nature of SoP allows it to directly calibrate the performance of any trained deep forecasting models, regardless of their specific architectures. Extensive experiments on various time series benchmarks and a spatio-temporal meteorological ERA5 dataset demonstrate the effectiveness of SoP, achieving up to a 22% improvement even when employing a simple MLP as the Plug (highlighted in Figure 1).
1020: Stability and Generalization for Stochastic (Compositional) Optimizations
Authors: Xiaokang Pan, Jin Liu, Hulin Kuang, Youqi Li, Lixing Chen, Zhe Qu
Location: Guangzhou | Day: TBD
The use of estimators instead of stochastic gradients for updates has been shown to improve algorithm convergence rates, but their impact on generalization remains under-explored. In this paper, we investigate how estimators influence generalization. Our focus is on two widely studied problems: stochastic optimization (SO) and stochastic compositional optimization (SCO), both under convex and non-convex settings. For SO problems, we first analyze the generalization error of the STORM algorithm as a foundational step. We then extend our analysis to SCO problems by introducing an algorithmic framework that encompasses several popular algorithmic approaches. Through this framework, we conduct a generalization analysis, uncovering new insights into the impact of estimators on generalization. Subsequently, we provide a detailed analysis of three specific algorithms within this framework: SCGD, SCSC, and COVER, to explore the effects of different estimator strategies. Furthermore, in the context of SCO, we propose a novel definition of stability and a new decomposition of excess risk in the non-convex setting. Our analysis indicates two key findings: (1) In SCO problems, eliminating the estimator for the gradient of the inner function does not impact generalization performance while significantly reducing computational and storage overhead. (2) Faster convergence rates are consistently associated with better generalization performance.
1024: MATCH: Modality-Calibrated Hypergraph Fusion Network for Conversational Emotion Recognition
Authors: Jiandong Shi, Ming Li, Lu Bai, Feilong Cao, Ke Lu, Jiye Liang
Location: Guangzhou | Day: TBD
Multimodal emotion recognition aims to identify emotions by integrating multimodal features derived from spoken utterances. However, existing work often neglects the calibration of conversational entities, focusing mainly on extracting potential intra- or cross-modal information. This leads to the underutilization of utterance information that is essential for accurately characterizing emotion. Additionally, the lack of effective modeling of conversational patterns limits the ability to capture emotional pathways across contexts, modalities and speakers, impacting the overall emotional understanding. In this study, we propose the modality-calibrated hypergraph fusion network (MATCH), which leverages multimodal fusion and hypergraph learning techniques to address these challenges. In particular, we introduce an entity calibration strategy that refines the representations of conversational entities both at the modality and context levels, allowing for deeper insights into emotion-related cues. Furthermore, we present an emotion-aligned hypergraph fusion method that incorporates a line graph to explore conversational patterns, facilitating flexible knowledge transfer across modalities through hyperedge-level and graph-level alignments. Experiments demonstrate that MATCH outperforms state-of-the-art approaches on two benchmark datasets.
1031: Counterfactual Thinking Driven Emotion Regulation for Image Sentiment Recognition
Authors: Xinyue Zhang, Zhaoxia Wang, Hailing Wang, Guitao Cao
Location: Guangzhou | Day: TBD
Image sentiment recognition (ISR) facilitates the practical application of affective computing on rapidly growing social platforms. Nowadays, region-based ISR methods that use affective regions to guide emotion prediction have gained significant attention. However, existing methods lack a causality-based mechanism to guide affective region generation and effective tools to quantitatively evaluate their quality. Inspired by the psychological theory of Emotion Regulation, we propose a counterfactual thinking driven emotion regulation network (CTERNet), which simulates the Emotion Regulation Theory by modeling the entire process of ISR based on human causality-driven mechanisms. Specifically, we first use multi-scale perception for feature extraction to simulate the stage of situation selection. Next, we combine situation modification, attentional deployment, and cognitive change into a counterfactual thinking based cognitive reappraisal module, which learns both affective regions (factual) and other potential affective regions (counterfactual). In the response modulation stage, we compare the factual and counterfactual outcomes to encourage the network to discover the most emotionally representative regions, thereby quantifying the quality of affective regions for ISR tasks. Experimental results demonstrate that our method outperforms or matches the state-of-the-art approaches, proving its effectiveness in addressing the key challenges of region-based ISR.
1032: HyperTrans: Efficient Hypergraph-Driven Cross-Domain Pattern Transfer in Image Anomaly Detection
Authors: Tengyu Zhang, Deyu Zeng, Baoqiang Li, Wei Wang, Wei Liu, Zongze Wu
Location: Guangzhou | Day: TBD
Anomaly detection plays a pivotal role in industrial quality assurance processes, with cross-domain problems, exemplified by the model upgrade from RGB to 3D, being prevalent in real-world scenarios yet remaining systematically underexplored. To address the severe challenges posed by the extreme lack of datasets in the target domain, we retain the knowledge from source models and explore a novel solution for anomaly detection through cross-domain learning, introducing HyperTrans. Targeting few-shot scenarios, HyperTrans centers around hypergraphs to model the relationships among the limited patch features and employs a perturbation-rectification-scoring architecture. The domain perturbation module injects and adapts channel-level statistical perturbations, mitigating style shifts during domain transfer. Subsequently, a residual hypergraph restoration module utilizes a cross-domain hypergraph to capture higher-order correlations in patches and align them across domains. Ultimately, with feature patterns exhibiting reduced domain shifts, an inter-domain scoring module aggregates similarity information between patches and normal patterns within the multi-domain subhypergraphs to make an integrated decision, generating multi-level anomaly predictions. Extensive experiments demonstrate that HyperTrans offers significant advantages in anomaly classification and anomaly segmentation tasks, outperforming state-of-the-art non-cross-domain methods in image-wise ROCAUC by 13%, 12%, and 15% in 1-shot, 2-shot, and 5-shot settings on MVTec3D AD.
1035: Reliable and Diverse Hierarchical Adapter for Zero-shot Video Classification
Authors: Wenxuan Ge, Peng Huang, Rui Yan, Hongyu Qu, Guosen Xie, Xiangbo Shu
Location: Guangzhou | Day: TBD
Adapting pre-trained vision-language models to downstream tasks has emerged as a novel paradigm for zero-shot learning. Existing test-time adaptation (TTA) methods such as TPT attempt to fine-tune visual or textual representations to accommodate downstream tasks but still require expensive optimization costs. To this end, Training-free Dynamic Adapter (TDA) maintains a cache containing visual features for each category in a parameter-free manner and measures sample confidence based on prediction entropy of test samples. Inspired by TDA, this work aims to develop the first training-free adapter for zero-shot video classification. Capturing the intrinsic temporal relationships within video data to construct and maintain the video cache is key to extending TDA to the video domain. In this work, we propose a reliable and diverse Hierarchical Adapter for zero-shot video classification, which consists of Frame-level Cache Refiner and Video-level Cache Updater. Before each video sample enters the corresponding cache, it needs to be refined at frame level based on prediction entropy and temporal probability difference. Due to the limited capacity of the cache, we update the cache during inference based on the principle of diversity. Experiments on four popular video classification benchmarks demonstrate the effectiveness of Hierarchical Adapter. The code is available at https://github.com/Gwxer/Hierarchical-Adapter.
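The entropy-based confidence measure mentioned above can be illustrated with a small sketch. This is not the released Hierarchical Adapter code: the function names and the fixed entropy threshold are hypothetical, and the temporal probability difference criterion from the abstract is deliberately left out.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector.
    Lower entropy means a more confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_confident_frames(frame_probs, max_entropy):
    """Hypothetical frame-level filter: keep the indices of frames whose
    prediction entropy does not exceed max_entropy (an assumed threshold)."""
    return [i for i, probs in enumerate(frame_probs)
            if prediction_entropy(probs) <= max_entropy]

# Toy usage: three frames, three classes.
frames = [
    [0.98, 0.01, 0.01],  # confident
    [0.34, 0.33, 0.33],  # near-uniform, so high entropy
    [0.90, 0.05, 0.05],  # fairly confident
]
kept = select_confident_frames(frames, max_entropy=0.5)
```

A near-uniform prediction over three classes has entropy close to ln 3, so it is filtered out here, while the two peaked predictions are kept.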
1044: DriftRemover: Hybrid Energy Optimizations for Anomaly Images Synthesis and Segmentation
Authors: Siyue Yao, Haotian Xu, Mingjie Sun, Siyue Yu, Jimin Xiao, Eng Gee Lim
Location: Guangzhou | Day: TBD
This paper tackles the challenge of anomaly image synthesis and segmentation to generate various anomaly images and their segmentation labels to mitigate the issue of data scarcity. Existing approaches employ precise masks to guide the generation, relying on additional mask generators, leading to increased computational costs and limited anomaly diversity. Although a few works use coarse masks as guidance to expand diversity, they lack effective generation of labels for synthetic images, thereby reducing their practicality. Therefore, our proposed method simultaneously generates anomaly images and their corresponding masks by utilizing coarse masks and anomaly categories. The framework utilizes attention maps from the synthesis process as mask labels and employs two optimization modules to tackle drift challenges, which are mismatches between synthetic results and real situations. Our evaluation demonstrates that our method improves pixel-level AP by 1.3% and F1-MAX by 1.8% in anomaly detection tasks on the MVTec dataset. Additionally, its successful application in practical scenarios highlights its effectiveness, improving IoU by 37.2% and F-measure by 25.1% on the Floor Dirt dataset. The code is available at https://github.com/JJessicaYao/DriftRemover.
1048: LP-Based Weighted Model Integration over Non-Linear Real Arithmetic
Authors: S. Akshay, Supratik Chakraborty, Soroush Farokhnia, Amir Goharshady, Harshit Jitendra Motwani, Đorđe Žikelić
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Uncertainty in AI
Weighted model integration (WMI) is a relatively recent formalism that has received significant interest as a technique for solving probabilistic inference tasks with complicated weight functions. Existing methods and tools are mostly focused on linear and polynomial functions and provide limited support for WMI of rational or radical functions, which naturally arise in several applications. In this work, we present a novel method for approximate WMI, which provides more effective support for the wide class of semi-algebraic functions that includes rational and radical functions, with literals defined over non-linear real arithmetic. Our algorithm leverages Farkas’ lemma and Handelman’s theorem from real algebraic geometry to reduce WMI to solving a number of linear programming (LP) instances. The algorithm provides formal guarantees on the error bound of the obtained approximation and can reduce it to any user-defined value epsilon. Furthermore, our approach is perfectly parallelizable. Finally, we present extensive experimental results, demonstrating the superior performance of our method on a range of WMI tasks for rational and radical functions when compared to state-of-the-art tools for WMI, in terms of both applicability and tightness.
1049: Interactive Multimodal Learning via Flat Gradient Modification
Authors: Qing-Yuan Jiang, Zhouyang Chi, Yang Yang
Location: Guangzhou | Day: TBD
Due to the notorious modality imbalance phenomenon, multimodal learning (MML) struggles to achieve satisfactory performance. Recently, multimodal learning with alternating unimodal adaptation (MLA) has been proven effective in mitigating the interference between modalities by capturing interaction through orthogonal projection, thus relieving the modality imbalance phenomenon to some extent. However, the projection strategy orthogonal to the original space can lead to poor plasticity as the alternating learning proceeds, thus affecting model performance. To address this issue, in this paper, we propose a novel multimodal learning method called interactive MML via flat gradient modification (IGM), which employs a flat gradient modification strategy to enhance interactive MML. Specifically, we first employ a flat projection-based gradient modification strategy that is independent of the original space, aiming to avoid the poor plasticity issue. Then we introduce the sharpness-aware minimization (SAM)-based optimization strategy to fully exploit the flatness of the learning objective and further enhance interaction during learning. As a result, the plasticity problem can be avoided and the overall performance is improved. Extensive experiments on widely used datasets demonstrate that IGM outperforms various state-of-the-art (SOTA) baselines. The source code is available at https://anonymous.4open.science/r/method-CC45.
1050: Endogenous Recovery via Within-modality Prototypes for Incomplete Multimodal Hashing
Authors: Sa Zhu, Dayan Wu, Chenming Wu, Pengwen Dai, Bo Li
Location: Guangzhou | Day: TBD
Multimodal hashing projects multimodal data into compact binary codes, enabling rapid and storage-efficient retrieval of large-scale multimedia content. In practical scenarios, the issue of missing modalities frequently arises when dealing with multimodal data. Existing incomplete multimodal hashing techniques directly recover missing modalities with neural networks, resulting in a disjointed representation space between the recovered and true data. In this paper, we present a novel recovery paradigm, namely Prototype-based Modality Completion Hashing (PMCH). Instead of directly synthesizing the missing modality from available modalities, PMCH adaptively aggregates associated within-modality prototypes to recover missing modality data. Specifically, PMCH introduces a within-modality prototype learning module to optimize representative prototypes for each modality. These prototypes act as recovery anchors and reside within the same representation space as their corresponding modality data. Subsequently, PMCH adaptively aggregates the associated within-modality prototypes with coefficients derived from the modality-specific Weight-Net. By utilizing prototypes from the same modality, the semantic disparity between the reconstructed and authentic data can be substantially diminished. Extensive experiments on three widely used benchmark datasets demonstrate that PMCH can effectively recover the missing modality and attain state-of-the-art performance in both complete and incomplete multimodal retrieval scenarios. Code is available at https://github.com/Sasa77777779/PMCH.git.
1057: Asynchronous Credit Assignment for Multi-Agent Reinforcement Learning
Authors: Yongheng Liang, Hejun Wu, Haitao Wang, Hao Cai
Location: Guangzhou | Day: TBD
Credit assignment is a critical problem in multi-agent reinforcement learning (MARL), aiming to identify agents’ marginal contributions for optimizing cooperative policies. Current credit assignment methods typically assume synchronous decision-making among agents. However, many real-world scenarios require agents to act asynchronously without waiting for others. This asynchrony introduces conditional dependencies between actions, which pose great challenges to current methods. To address this issue, we propose an asynchronous credit assignment framework, incorporating a Virtual Synchrony Proxy (VSP) mechanism and a Multiplicative Value Decomposition (MVD) algorithm. VSP enables physically asynchronous actions to be virtually synchronized during credit assignment. We theoretically prove that VSP preserves both task equilibrium and algorithm convergence. Furthermore, MVD leverages multiplicative interactions to effectively model dependencies among asynchronous actions, offering theoretical advantages in handling asynchronous tasks. Extensive experiments show that our framework consistently outperforms state-of-the-art MARL methods on challenging tasks while providing improved interpretability for asynchronous cooperation.
1063: Weakly-supervised Audio Temporal Forgery Localization via Progressive Audio-language Co-learning Network
Authors: Junyan Wu, Wenbo Xu, Wei Lu, Xiangyang Luo, Rui Yang, Shize Guo
Location: Guangzhou | Day: TBD
Audio temporal forgery localization (ATFL) aims to find the precise forgery regions of partially spoofed audio that has been purposefully modified. Existing ATFL methods rely on training efficient networks using fine-grained annotations, which are costly and challenging to obtain in real-world scenarios. To meet this challenge, in this paper, we propose a progressive audio-language co-learning network (LOCO) that adopts co-learning and self-supervision to promote localization performance under weak supervision scenarios. Specifically, an audio-language co-learning module is first designed to capture forgery consensus features by aligning semantics from temporal and global perspectives. In this module, forgery-aware prompts are constructed by using utterance-level annotations together with learnable prompts, which can incorporate semantic priors into temporal content features dynamically. In addition, a forgery localization module is applied to produce forgery proposals based on fused forgery-class activation sequences. Finally, a progressive refinement strategy is introduced to generate pseudo frame-level labels and leverage supervised semantic contrastive learning to amplify the semantic distinction between real and fake content, thereby continuously optimizing forgery-aware features. Extensive experiments show that the proposed LOCO achieves SOTA performance on three public benchmarks.
1066: PeSANet: Physics-encoded Spectral Attention Network for Simulating PDE-Governed Complex Systems
Authors: Han Wan, Rui Zhang, Qi Wang, Yang Liu, Hao Sun
Location: Guangzhou | Day: TBD
Accurately modeling and forecasting complex systems governed by partial differential equations (PDEs) is crucial in various scientific and engineering domains. However, traditional numerical methods struggle in real-world scenarios due to incomplete or unknown physical laws. Meanwhile, machine learning approaches often fail to generalize effectively when faced with scarce observational data and the challenge of capturing local and global features. To this end, we propose the Physics-encoded Spectral Attention Network (PeSANet), which integrates local and global information to forecast complex systems with limited data and incomplete physical priors. The model consists of two key components: a physics-encoded block that uses hard constraints to approximate local differential operators from limited data, and a spectral-enhanced block that captures long-range global dependencies in the frequency domain. Specifically, we introduce a novel spectral attention mechanism to model inter-spectrum relationships and learn long-range spatial features. Experimental results demonstrate that PeSANet outperforms existing methods across all metrics, particularly in long-term forecasting accuracy, providing a promising solution for simulating complex systems with limited data and incomplete physics.
1086: Gaussian Mixture Model for Graph Domain Adaptation
Authors: Mengzhu Wang, Wenhao Ren, Yu Zhang, Yanlong Fan, Dianxi Shi, Luoxi Jing, Nan Yin
Location: Guangzhou | Day: TBD
Unsupervised domain adaptation (UDA) has been widely studied with the goal of transferring knowledge from a label-rich source domain to a related but unlabeled target domain. Most UDA techniques achieve this by reducing the feature discrepancies between the two domains to learn domain-invariant feature representations. While domain-invariant feature representations can reduce the differences between the source and target domains, excessively simplifying these differences may cause the model to overlook important domain-specific features, resulting in a decline in transfer learning effectiveness. To address this issue, this paper proposes a novel Gaussian Mixture Model for graph domain adaptation (GMM). This model effectively reduces the distributional bias between the source and target domains by modeling the distribution differences on a graph structure. GMM leverages the local structural information of the graph and the clustering capability of the Gaussian mixture model to automatically learn the latent mapping relationships between the source and target domains. To the best of our knowledge, this is the first work to introduce a Gaussian mixture model into UDA. Extensive experimental results on three standard benchmarks demonstrate that the proposed GMM algorithm outperforms state-of-the-art unsupervised domain adaptation methods.
1095: Brain-Inspired Stepwise Patch Merging for Vision Transformers
Authors: Yonghao Yu, Dongcheng Zhao, Guobin Shen, Yiting Dong, Yi Zeng
Location: Guangzhou | Day: TBD
The hierarchical architecture has become a mainstream design paradigm for Vision Transformers (ViTs), with Patch Merging serving as the pivotal component that transforms a columnar architecture into a hierarchical one. Drawing inspiration from the brain’s ability to integrate global and local information for comprehensive visual understanding, we propose Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism’s ability to ‘see’ better. SPM consists of Multi-Scale Aggregation (MSA) and Guided Local Enhancement (GLE), striking a proper balance between long-range dependency modeling and local feature enhancement. Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic segmentation. Meanwhile, experiments show that combining SPM with different backbones can further improve performance. The code has been released at https://github.com/Yonghao-Yu/StepwisePatchMerging.
1116: Enhancing Multimodal Protein Function Prediction Through Dual-Branch Dynamic Selection with Reconstructive Pre-Training
Authors: Xiaoling Luo, Peng Chen, Chengliang Liu, Xiaopeng Jin, Jie Wen, Yumeng Liu, Junsong Wang
Location: Guangzhou | Day: TBD
Multimodal protein features play a crucial role in protein function prediction. However, these features encompass a wide range of information, ranging from structural data and sequence features to protein attributes and interaction networks, making it challenging to decipher their complex interconnections. In this work, we propose a multimodal protein function prediction method (DSRPGO) by utilizing dynamic selection and reconstructive pre-training mechanisms. To acquire complex protein information, we introduce reconstructive pre-training to mine more fine-grained information with low semantic levels. Moreover, we put forward the Bidirectional Interaction Module (BInM) to facilitate interactive learning among multimodal features. Additionally, to address the difficulty of hierarchical multi-label classification in this task, a Dynamic Selection Module (DSM) is designed to select the feature representation that is most conducive to current protein function prediction. Our proposed DSRPGO model improves significantly in BPO, MFO, and CCO on human datasets, thereby outperforming other benchmark models.
1117: Top-I2P: Explore Open-Domain Image-to-Point Cloud Registration Using Topology Relationship
Authors: Pei An, Jiaqi Yang, Muyao Peng, You Yang, Qiong Liu, Jie Ma, Liangliang Nan
Location: Guangzhou | Day: TBD
Image-to-point cloud (I2P) registration is a fundamental task in computer vision, which aims to align pixels in 2D images with corresponding points in 3D point clouds. While deep learning based methods dominate this field, they often fail to generalize to the open domain. In this paper, we address open-domain I2P registration from the perspective of topology relationships. First, we find that topology relationships reflect sparse connections between pixels and points, which show significant potential for enhancing cross-modality feature interaction in the open domain. Building on this insight, we develop an I2P registration framework using topology relationships. Then, to construct and leverage the topology relationships between the heterogeneous 2D and 3D spaces, we design a registration network, Top-I2P, with correction-based topology reasoning and fast topology feature interaction modules. Extensive experiments on 7-Scenes, RGBD-V2, ScanNet, and self-collected I2P datasets demonstrate that Top-I2P achieves superior registration performance in open-domain scenarios.
1132: Decoupled Imbalanced Label Distribution Learning
Authors: Yongbiao Gao, Xiangcheng Sun, Miaogen Ling, Chao Tan, Yi Zhai, Guohua Lv
Location: Guangzhou | Day: TBD
Label Distribution Learning (LDL) has been successfully implemented in numerous practical applications. However, the imbalance in label distributions presents a significant challenge due to the substantial variation in annotation information. To tackle this issue, we introduce Decoupled Imbalance Label Distribution Learning (DILDL), which decomposes the imbalanced label distribution into a dominant label distribution and a non-dominant label distribution. Our empirical findings reveal that an excessively high description degree of dominant labels can result in substantial gradient information attenuation for non-dominant labels during the learning process. Therefore, we employ the decoupling approach to balance the description degrees of both dominant and non-dominant labels independently. Furthermore, we align the feature representations with the representations of dominant and non-dominant labels separately, aiming to effectively mitigate the distribution shift problem. Experimental results demonstrate that our proposed DILDL outperforms other state-of-the-art methods for imbalanced label distribution learning.
1134: On the Generalization of Feature Incremental Learning
Authors: Chao Xu, Xijia Tang, Lijun Zhang, Chenping Hou
Location: Guangzhou | Day: TBD
Show Abstract
In many real applications, the data attributes are incremental and the samples are stored in gradually accumulating feature spaces. Although there are several elegant approaches to tackling this problem, the theoretical analysis is still limited. There exist at least two fundamental questions: 1) How to derive the generalization bounds of these approaches? 2) Under what conditions do these approaches have a strong generalization guarantee? To solve these crucial but rarely studied problems, we provide a comprehensive theoretical analysis in this paper. We begin by summarizing and refining four strategies for addressing feature incremental data. Subsequently, we derive their generalization bounds, providing rigorous and quantitative insights. The theoretical findings highlight the key factors influencing the generalization abilities of different strategies. In tackling the above two fundamental problems, we also provide valuable guidance for exploring other learning challenges in dynamic environments. Finally, the comprehensive experimental and theoretical results mutually validate each other, underscoring the reliability of our conclusions.
1137: Suit the Node Pair to the Case: A Multi-Scale Node Pair Grouping Strategy for Graph-MLP Distillation
Authors: Rui Dong, Jiaxing Li, Weihuang Zheng, Youyong Kong
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) are powerful in solving various graph-related tasks, but their message passing mechanism may lead to latency at inference time. Multi-Layer Perceptrons (MLPs) can achieve fast inference speed but with limited performance. One solution to fill this gap is Knowledge Distillation. However, current distillation methods follow a "node-to-node" paradigm; given the complex relationships between different node pairs, such direct distillation fails to capture the multiple-granularity features in the GNN. Furthermore, current methods that focus on aligning logits in the final layer ignore further learning within the layers of the student MLP. Therefore, in this paper, we introduce a multi-scale knowledge distillation method (MSN-GDM) aiming to capture multiple kinds of knowledge from GNN to MLP. We first propose a multi-scale node-pair grouping strategy to assign node pairs to different-scale groups according to node pair similarity metrics. The similarity metrics consider both node features and topological structures of the given node pair. Then, based on the preprocessed node-set groups, we design a multi-scale distillation method that can capture comprehensive knowledge in the corresponding node-set groups. The hierarchical weighted sum of each layer is applied as the final output. Extensive experiments on eight real-world datasets demonstrate the effectiveness of our proposed method.
1141: Efficient Differentiable Approximation of Generalized Low-rank Regularization
Authors: Naiqi Li, Yuqiu Xie, Peiyuan Liu, Tao Dai, Yong Jiang, Shu-Tao Xia
Location: Guangzhou | Day: TBD
Show Abstract
Low-rank regularization (LRR) has been widely applied in various machine learning tasks, but the associated optimization is challenging. Directly optimizing the rank function under constraints is NP-hard in general. To overcome this difficulty, various relaxations of the rank function were studied. However, optimization of these relaxed LRRs typically depends on singular value decomposition, which is a time-consuming and nondifferentiable operator that cannot be optimized with gradient-based techniques. To address these challenges, in this paper we propose an efficient differentiable approximation of the generalized LRR. The considered LRR form subsumes many popular choices like the nuclear norm, the Schatten-p norm, and various nonconvex relaxations. Our method enables LRR terms to be appended to loss functions in a plug-and-play fashion, and the GPU-friendly operations enable efficient and convenient implementation. Furthermore, convergence analysis is presented, which rigorously shows that both the bias and the variance of our rank estimator rapidly reduce with increased sample size and iteration steps. In the experimental study, the proposed method is applied to various tasks, which demonstrates its versatility and efficiency. Code is available at https://github.com/naiqili/EDLRR.
1151: Exploring the Frontiers of Animation Video Generation in the Sora Era: Method, Dataset and Benchmark
Authors: Yudong Jiang, Baohan Xu, Siqian Yang, Mingyu Ying, Jing Liu, Chao Xu, Siqi Wang, Yidi Wu, Bingwen Zhu, Yue Zhang, Jinlong Hou, Huyang Sun
Location: Montreal | Day: August 21st | Time: 11:30 | Session: CV: Benchmarks
Show Abstract
Animation has gained significant interest in the recent film and TV industry. Despite the success of advanced video generation models like Sora, Kling, and CogVideoX in generating natural videos, they lack the same effectiveness in handling animation videos. Evaluating animation video generation is also a great challenge due to its unique artistic styles, violations of the laws of physics, and exaggerated motions. In this paper, we present a comprehensive system, AniSora, designed for animation video generation, which includes a data processing pipeline, a controllable generation model, and an evaluation benchmark. Supported by the data processing pipeline with over 10M high-quality data, the generation model incorporates a spatiotemporal mask module to facilitate key animation production functions such as image-to-video generation, frame interpolation, and localized image-guided animation. We also collect an evaluation benchmark of 948 various animation videos, with specifically developed metrics for animation video generation. Our entire project is publicly available at https://github.com/bilibili/Index-anisora/tree/main.
1157: Adaptive Language-Aware Image Reflection Removal Network
Authors: Siyan Fang, Yuntao Wang, Jinpu Zhang, Ziwen Li, Yuehuan Wang
Location: Guangzhou | Day: TBD
Show Abstract
Existing image reflection removal methods struggle to handle complex reflections. Accurate language descriptions can help the model understand the image content to remove complex reflections. However, due to blurred and distorted interferences in reflected images, machine-generated language descriptions of the image content are often inaccurate, which harms the performance of language-guided reflection removal. To address this, we propose the Adaptive Language-Aware Network (ALANet) to remove reflections even with inaccurate language inputs. Specifically, ALANet integrates both filtering and optimization strategies. The filtering strategy reduces the negative effects of language while preserving its benefits, whereas the optimization strategy enhances the alignment between language and visual features. ALANet also utilizes language cues to decouple specific layer content from feature maps, improving its ability to handle complex reflections. To evaluate the model’s performance under complex reflections and varying levels of language accuracy, we introduce the Complex Reflection and Language Accuracy Variance (CRLAV) dataset. Experimental results demonstrate that ALANet surpasses state-of-the-art methods for image reflection removal. The code and dataset are available at https://github.com/fashyon/ALANet.
1167: Open-World Semi-Supervised Learning with Class Semantic Correlations
Authors: Yuxin Fan, Junbiao Cui, Jiye Liang, Jianqing Liang
Location: Guangzhou | Day: TBD
Show Abstract
Open-world semi-supervised learning (OWSSL) aims to recognize both known and unknown classes, but the labeled samples only cover the known classes. Existing OWSSL methods primarily represent classes as symbolic variables, which ignores the rich internal semantic information associated with the classes and thus hampers their ability to recognize unknown classes. Recent studies incorporate textual descriptions of classes to facilitate training, but these methods overlook the class semantic correlations, which constrains their effectiveness in recognizing unknown classes. To address these issues, we propose a novel OWSSL method. Our method fine-tunes only the image encoder during training while keeping the text encoder frozen, thereby preserving the rich semantic correlations learned during the pre-training phase. Furthermore, we employ a semantic margin to extract class semantic correlations from textual descriptions, which are then utilized in enhancing image representation discriminability. Experimental results across multiple datasets demonstrate that our method significantly outperforms representative OWSSL methods in the recognition of both known and unknown classes.
1169: TCDM: A Temporal Correlation-Empowered Diffusion Model for Time Series Forecasting
Authors: Huibo Xu, Likang Wu, Xianquan Wang, Zhiding Liu, Qi Liu
Location: Guangzhou | Day: TBD
Show Abstract
Although previous studies have applied diffusion models to time series forecasting, these efforts have struggled to preserve the intrinsic temporal correlations within the series, leading to suboptimal predictive outcomes. This failure primarily results from the introduction of independent, identically distributed (i.i.d.) noise. In the forward process, the addition of i.i.d. noise to the time series gradually diminishes these temporal correlations. The reverse process starts with i.i.d. noise and lacks priors related to temporal correlations, which can result in directional biases during sampling. From a frequency-domain perspective, noise disrupts the low-frequency-dominated structure of trend components, making it difficult for the model to learn long-term temporal dependencies. To address these limitations, we introduce a decomposition prediction framework to complement the novel Temporal Correlation-Empowered Diffusion Model. Overall, we decompose the time series into trend and residual components, predict them using a base model and a diffusion model, and then combine the results. Specifically, a frequency-domain MLP model is adopted as the base model because it does not distort the original sequence and better captures long-range temporal dependencies. The diffusion model incorporates two key modules to capture short- and mid-range temporal correlations: the Maintaining Temporal Correlation Module and the Redesigned Initial Module. Extensive experiments across multiple datasets demonstrate that the proposed method significantly outperforms related strong baselines.
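The decompose, predict, and combine flow described in this abstract can be illustrated with a minimal sketch. It assumes a centered moving average as the trend extractor, which is a stand-in chosen for illustration and not necessarily the decomposition used by the authors; the function names `moving_average_trend`, `decompose`, and `combine` are hypothetical, and the two forecasting models are left out entirely.

```python
def moving_average_trend(series, window):
    """Centered moving average; the window shrinks at the edges.

    Assumption: a moving average stands in for the trend extractor.
    """
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    return trend


def decompose(series, window=5):
    """Split a series into a smooth trend and the remaining residual."""
    trend = moving_average_trend(series, window)
    residual = [x - t for x, t in zip(series, trend)]
    return trend, residual


def combine(trend_forecast, residual_forecast):
    """Final forecast is the element-wise sum of the two component forecasts."""
    return [t + r for t, r in zip(trend_forecast, residual_forecast)]
```

In such a setup, one model would forecast the trend component and another the residual component, and `combine` would add the two forecasts back together.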
1175: Underground Diagnosis in 3D GPR Data by Learning in CuCoRes Model Space
Authors: Xiren Zhou, Shikang Liu, Xinyu Yan, Xiangyu Wang, Huanhuan Chen
Location: Guangzhou | Day: TBD
Show Abstract
Ground Penetrating Radar (GPR) provides detailed subterranean insights. Nevertheless, underground diagnosis via GPR is hindered by the fact that training data typically contain only normal samples, along with the complexity of GPR data’s wave-collection characteristics. This paper proposes subsurface anomaly detection within the Cubic Correlation Reservoir Network (CuCoRes) model space. CuCoRes incorporates three reservoirs with spatial correlation adjustment in each direction to adequately and accurately capture multi-directional dynamics (i.e., changing information) within GPR data. By fitting GPR data with CuCoRes and representing data with fitted models, the original GPR data is mapped into a category-discriminative CuCoRes model space, where anomalies can be efficiently identified and categorized based on model dissimilarities. Our approach leverages only limited, easily accessible normal GPR data to support subsequent anomaly detection and categorization, enhancing its applicability in practical scenarios. Experiments on real-world data demonstrate its effectiveness, outperforming state-of-the-art methods.
1179: Fault Diagnosis in REDNet Model Space
Authors: Xiren Zhou, Ziyu Tang, Shikang Liu, Ao Chen, Xiangyu Wang, Huanhuan Chen
Location: Guangzhou | Day: TBD
Show Abstract
Fault Diagnosis (FD) in time-varying data presents considerations such as limited training data, intra- and inter-dimensional correlations, and constraints of training time. In response, this paper introduces FD in the Reservoir-Embedded-Directional Network (REDNet) model space. Model-oriented methods utilize well-fitted networks or functions, denoted as "models", that capture data’s changing information, as more stable and parsimonious representations of the data. Our approach employs REDNet for data fitting, wherein multiple reservoirs are organized along intrinsic correlation directions to establish intra- and inter-dimensional dependencies, thereby capturing multi-directional dynamics in high-dimensional data. Representing each data instance with an independently fitted REDNet model maps these instances into a class-separable REDNet model space, where FD can be performed on the models rather than the original data. Concentrating on the data-intrinsic dynamics, our method achieves rapid training speeds and maintains robust performance even with minimal training data. Experiments on several datasets demonstrate its effectiveness.
1183: Stabilizing Holistic Semantics in Diffusion Bridge for Image Inpainting
Authors: Jinjia Peng, Mengkai Li, Huibing Wang
Location: Guangzhou | Day: TBD
Show Abstract
Image inpainting aims to restore the original image from a damaged version. Recently, a special type of diffusion bridge model has achieved promising performance by directly mapping the degradation process and restoring corrupted images through the corresponding reverse process. However, due to the lack of explicit semantic priors during the denoising process, the inpainted results typically exhibit inferior context-stability and semantic consistency. To this end, this paper proposes a novel Global Structure-Guided Diffusion Bridge framework (GSGDiff), which incorporates an additional structure restorer to stabilize the generation of holistic semantics. Specifically, to acquire richer semantic structure priors, this paper proposes a posterior sampling approach that captures semantically global and consistent structures at each timestep, efficiently integrating them into the texture generation through the corresponding guidance module. Additionally, considering the characteristics of diffusion models with low denoising levels at larger timesteps, this paper proposes a semantic fusion schedule to avoid noise interference by reducing the weight of ineffective guided semantics in the early stages. By applying the proposed posterior sampling to the texture denoising process, GSGDiff can achieve more stable and superior inpainting results over competitive baselines. Experiments on Places2, Paris Street View and CelebA-HQ datasets validate the efficacy of the proposed method.
1206: Binary Event-Driven Spiking Transformer
Authors: Honglin Cao, Zijian Zhou, Wenjie Wei, Yu Liang, Ammar Belatreche, Dehao Zhang, Malu Zhang, Yang Yang, Haizhou Li
Location: Guangzhou | Day: TBD
Show Abstract
Transformer-based Spiking Neural Networks (SNNs) introduce a novel event-driven self-attention paradigm that combines the high performance of Transformers with the energy efficiency of SNNs. However, the larger model size and increased computational demands of the Transformer structure limit their practicality in resource-constrained scenarios. In this paper, we integrate binarization techniques into Transformer-based SNNs and propose the Binary Event-Driven Spiking Transformer, i.e., BESTformer. The proposed BESTformer can significantly reduce storage and computational demands by representing weights and attention maps with a mere 1 bit. However, BESTformer suffers from a severe performance drop from its full-precision counterpart due to the limited representation capability of binarization. To address this issue, we propose a Coupled Information Enhancement (CIE) method, which consists of a reversible framework and information enhancement distillation. By maximizing the mutual information between the binary model and its full-precision counterpart, the CIE method effectively mitigates the performance degradation of the BESTformer. Extensive experiments on static and neuromorphic datasets demonstrate that our method achieves superior performance to other binary SNNs, showcasing its potential as a compact yet high-performance model for resource-limited edge devices. The repository of this paper is available at https://github.com/CaoHLin/BESTFormer.
1212: Optical Flow Estimation for Tiny Objects: New Problem, Specialized Benchmark, and Bioinspired Scheme
Authors: Xueyao Ji, Gang Wang, Yizheng Wang
Location: Guangzhou | Day: TBD
Show Abstract
Optical flow is pivotal in video-based tasks, yet existing methods mostly focus on medium-/large-size objects, while underperforming when characterizing the motion of tiny objects. To bridge this gap, we introduce the On-off Time-delay with Hassenstein-Reichardt correlator (OTHR), a computationally efficient scheme inspired by the primate visual cortex’s direction selectivity mechanism. OTHR kernels, applied across multiple frames, discern bright/dark luminance changes along a specific direction over a time delay, effectively estimating the motion of tiny objects amidst noise and static backgrounds. Notably, OTHR integrates seamlessly with leading deep learning flow estimation models such as RAFT and FlowFormer. We also propose refined evaluation metrics for tiny objects and contribute a new dataset featuring such objects to aid algorithm development. Our experiments confirm OTHR’s superiority over competing methods, particularly in enhancing state-of-the-art models’ performance on tiny object motion estimation at minimal cost. Specifically, for objects smaller than 100 pixels, OTHR reduces RAFT and FlowFormer’s errors by 22.03% and 83.50%, respectively. The code will be accessible at https://github.com/JaneEliot/OTHR.
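For readers unfamiliar with the correlator named in this abstract, the sketch below shows the classic textbook Hassenstein-Reichardt scheme on two neighboring luminance signals: each signal is delayed, multiplied with the undelayed neighbor, and the two mirror-symmetric products are subtracted. This is only the generic correlator, not the OTHR kernels themselves; the function names `delayed` and `hr_correlator` and the one-dimensional setup are illustrative assumptions.

```python
def delayed(signal, delay):
    """Shift a signal later in time by `delay` samples, zero-padding the start."""
    kept = signal[:max(0, len(signal) - delay)]
    return [0.0] * delay + list(kept)


def hr_correlator(a, b, delay=1):
    """Classic Hassenstein-Reichardt correlator for two neighboring receptors.

    a, b: luminance over time at two adjacent positions (same length).
    The output is positive when a feature moves from a towards b,
    and negative when it moves from b towards a.
    """
    a_d = delayed(a, delay)
    b_d = delayed(b, delay)
    return [ad * bv - av * bd for av, bv, ad, bd in zip(a, b, a_d, b_d)]
```

Summing the output over time gives a signed direction estimate; for example, a bright spot that appears at position `a` one sample before it appears at position `b` yields a positive sum with `delay=1`.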
1223: One-step Label Shift Adaptation via Robust Weight Estimation
Authors: Ruidong Fan, Xiao Ouyang, Tingjin Luo, Lijun Zhang, Chenping Hou
Location: Guangzhou | Day: TBD
Show Abstract
Label shift is a prevalent phenomenon encountered in open environments, characterized by a notable discrepancy in the label distributions between the source (training) and target (test) domains, whereas the conditional distributions given the labels remain invariant. Existing label shift methods adopt a two-step strategy: initially computing the importance weight and subsequently utilizing it to calibrate the target outputs. However, this conventional strategy overlooks the intricate interplay between output adjustment and weight estimation. In this paper, we introduce a novel approach termed as One-step Label Shift Adaptation (OLSA). Our methodology jointly learns the predictive model and the corresponding weights through a bi-level optimization framework, with the objective of minimizing an upper bound on the target risk. To enhance the robustness of our proposed model, we incorporate a debiasing term into the upper-level classifier training and devise a regularization term for the lower-level weight estimation. Furthermore, we present theoretical analyses about the generalization bounds, offering guarantees for the model’s performance. Extensive experimental results substantiate the efficacy of our proposal.
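As a point of reference, the conventional two-step strategy mentioned in this abstract can be sketched as follows: first form importance weights as the ratio of target to source label priors, then reweight and renormalize the source classifier's posterior. This is the standard baseline correction under the label shift assumption, not the OLSA method; the function names are hypothetical, and the target prior is assumed to be given here even though in practice it must be estimated from unlabeled target data.

```python
def importance_weights(source_prior, target_prior):
    """Step 1: w[y] = p_target(y) / p_source(y) for each class y."""
    return [t / s for s, t in zip(source_prior, target_prior)]


def adjust_posterior(source_posterior, weights):
    """Step 2: reweight p_source(y | x) by w[y] and renormalize.

    Under label shift (p(x | y) unchanged), the result is
    proportional to the target posterior p_target(y | x).
    """
    unnormalized = [w * p for w, p in zip(weights, source_posterior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

For instance, with a balanced source prior and a target prior skewed toward the first class, an input that the source classifier finds ambiguous is pushed toward the first class after adjustment.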
1250: GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype
Authors: Changxi Chi, Jun Xia, Jingbo Zhou, Jiabei Cheng, Chang Yu, Stan Z. Li
Location: Guangzhou | Day: TBD
Show Abstract
Predicting genetic perturbations enables the identification of potentially crucial genes prior to wet-lab experiments, significantly improving overall experimental efficiency. Since genes are the foundation of cellular life, building gene regulatory networks (GRNs) is essential to understand and predict the effects of genetic perturbations. However, current methods fail to fully leverage gene-related information, and solely rely on simple evaluation metrics to construct coarse-grained GRNs. More importantly, they ignore functional differences between biotypes, limiting the ability to capture potential gene interactions. In this work, we leverage a pre-trained large language model and a DNA sequence model to extract features from gene descriptions and DNA sequence data, respectively, which serve as the initialization for gene representations. Additionally, we introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes, while capturing implicit gene relationships through graph structure learning (GSL). We propose GRAPE, a heterogeneous graph neural network (HGNN) that leverages gene representations initialized with features from descriptions and sequences, models the distinct roles of genes with different biotypes, and dynamically refines the GRN through GSL. The results on publicly available datasets show that our method achieves state-of-the-art performance. The code for reproducing the results is available at https://github.com/ChangxiChi/GRAPE.
1273: Differentiable Prompt Learning for Vision Language Models
Authors: Zhenhan Huang, Tejaswini Pedapati, Pin-Yu Chen, Jianxi Gao
Location: Montreal | Day: August 21st | Time: 15:00 | Session: ML: Large Language Models
Show Abstract
Prompt learning is an effective way to exploit the potential of large-scale pre-trained foundational models. Continuous prompts parameterize context tokens in prompts by turning them into differentiable vectors. Deep continuous prompts insert prompts not only in the input but also in the intermediate hidden representations. Manually designed deep continuous prompts exhibit a remarkable improvement compared to the zero-shot pre-trained model on downstream tasks. How to automate the continuous prompt design is an underexplored area, and a fundamental question arises: is the manually designed deep prompt strategy optimal? To answer this question, we propose a method dubbed differentiable prompt learning (DPL). The DPL method is formulated as an optimization problem to automatically determine the optimal context length of the prompt to be added to each layer, where the objective is to maximize the performance. We test the DPL method on the pre-trained CLIP. We empirically find that by using only limited data, our DPL method can find a deep continuous prompt configuration with high confidence. The performance on the downstream tasks exhibits the superiority of the automatic design: our method boosts the average test accuracy by 2.60% on 11 datasets compared to baseline methods. Besides, our method focuses only on the prompt configuration (i.e., context length for each layer), which means that our method is compatible with the baseline methods that have sophisticated designs to boost the performance. We release our code at https://github.com/Zhenhan-Huang/Differentiable-Prompt-Learn.
1280: Self-Classification Enhancement and Correction for Weakly Supervised Object Detection
Authors: Yufei Yin, Lechao Cheng, Wengang Zhou, Jiajun Deng, Zhou Yu, Houqiang Li
Location: Guangzhou | Day: TBD
Show Abstract
In recent years, weakly supervised object detection (WSOD) has attracted much attention due to its low labeling cost. The success of recent WSOD models is often ascribed to the two-stage multi-class classification (MCC) task, i.e., multiple instance learning and online classification refinement. Despite achieving non-trivial progress, these methods overlook potential classification ambiguities between these two MCC tasks and fail to leverage their unique strengths. In this work, we introduce a novel WSOD framework to ameliorate these two issues. For one thing, we propose a self-classification enhancement module that integrates intra-class binary classification (ICBC) to bridge the gap between the two distinct MCC tasks. The ICBC task enhances the network’s discrimination between positive and mis-located samples in a class-wise manner and forges a mutually reinforcing relationship with the MCC task. For another, we propose a self-classification correction algorithm during inference, which combines the results of both MCC tasks to effectively reduce the mis-classified predictions. Extensive experiments on the prevalent VOC 2007 & 2012 datasets demonstrate the superior performance of our framework.
1281: QuantileFormer: Probabilistic Time Series Forecasting with a Pattern-Mixture Decomposed VAE Transformer
Authors: Yimiao Shao, Wenzhong Li, Kang Xia, Kaijie Lin, Mingkai Lin, Sanglu Lu
Location: Guangzhou | Day: TBD
Show Abstract
Probabilistic time series forecasting has attracted increasing attention in the machine learning community for its potential applications in the fields of renewable energy, traffic management, healthcare, etc. Previous research mainly focused on extracting long-range dependencies for point-wise prediction, which fails to capture complex temporal patterns and statistical characteristics for probabilistic analysis. In this paper, we propose a novel pattern-mixture decomposition method that decomposes long-term series into quantile drift, divergence patterns, and Gaussian mixture components, which can effectively capture the intricate temporal patterns and stochastic characteristics in time series. Based on pattern-mixture decomposition, we propose a novel Transformer-based model called QuantileFormer for probabilistic time series forecasting. It takes the comprehensive drift-divergence mixture patterns as features, and designs a variational inference based fusion Transformer architecture to generate quantile prediction results. Extensive experiments show that the proposed method consistently boosts the baseline methods by a large margin and achieves state-of-the-art performance on six real-world benchmarks.
1288: CompLex: Music Theory Lexicon Constructed by Autonomous Agents for Automatic Music Generation
Authors: Zhejing Hu, Yan Liu, Gong Chen, Bruce X. B. Yu
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Multidisciplinary Topics and Applications (2/2)
Show Abstract
Generative artificial intelligence in music has made significant strides, yet it still falls short of the substantial achievements seen in natural language processing, primarily due to the limited availability of music data. Knowledge-informed approaches have been shown to enhance the performance of music generation models, even when only a few pieces of musical knowledge are integrated. This paper seeks to leverage comprehensive music theory in AI-driven music generation tasks, such as algorithmic composition and style transfer, which traditionally require significant manual effort with existing techniques. We introduce a novel automatic music lexicon construction model that generates a lexicon, named CompLex, comprising 37,432 items derived from just 9 manually input category keywords and 5 sentence prompt templates. A new multi-agent algorithm is proposed to automatically detect and mitigate hallucinations. CompLex demonstrates impressive performance improvements across three state-of-the-art text-to-music generation models, encompassing both symbolic and audio-based methods. Furthermore, we evaluate CompLex in terms of completeness, accuracy, non-redundancy, and executability, confirming that it possesses the key characteristics of an effective lexicon.
1293: CFII-Net: Explicit Class Embeddings and Feature Maps Through Iterative Interaction for Boosting Medical Image Segmentation
Authors: Xinyu Zhu, Xiwen Liu, Lianghua He, Yin Wen
Location: Guangzhou | Day: TBD
Show Abstract
Prior knowledge of category structure is essential in medical image segmentation, especially with significant organ structure differences. However, current hybrid architectures primarily focus on enhancing pixel-level representation learning, often neglecting or weakening the key prior knowledge of categorical structures, which poses challenges in capturing category relationships and achieving accurate segmentation. To address this concern, we propose a novel network using Explicit Class Embeddings and Feature Maps through Iterative Interaction (CFII-Net) for boosting medical image segmentation. CFII-Net effectively segments images by exploring the relationship between explicit class embeddings and pixels in images. Specifically, we propose an Explicit Class Embedding Generator (ECEG) to obtain high-quality class semantic embeddings, incorporating category structure priors, which are used to guide high-accuracy segmentation. We then introduce an iterative Interactor, which utilizes transformers to facilitate the interaction between feature maps and class embeddings, thereby exploring pixel-to-class relationships. Furthermore, we propose updating strategies to refine the class embeddings and feature maps during the iteration process for achieving refined image segmentation. Extensive empirical evidence shows that any codec can be easily integrated into CFII-Net and yields improvements over the state-of-the-art methods in four public benchmarks.
1295: Robust Graph Contrastive Learning for Incomplete Multi-view Clustering
Authors: Deyin Zhuang, Jian Dai, Xingfeng Li, Xi Wu, Yuan Sun, Zhenwen Ren
Location: Guangzhou | Day: TBD
Show Abstract
In recent years, multi-view clustering (MVC) has become a promising approach for analyzing heterogeneous multi-source data. However, during the collection of multi-view data, factors such as environmental interference or sensor failure often lead to the loss of view sample data, resulting in incomplete multi-view clustering (IMVC). Graph contrastive IMVC has demonstrated promising performance as an effective solution, which typically utilizes in-graph instances as positive pairs and out-of-graph instances as negative pairs. However, the construction of positive and negative pairs in this paradigm inevitably leads to graph noise correspondence (GNC). To this end, we propose a new IMVC framework, namely robust graph contrastive learning (RGCL). Specifically, RGCL first completes the missing data by using a multi-view consistency transfer relationship graph. Then, to mitigate the impact of false negative pairs from graph contrastive learning, we propose noise-robust graph contrastive learning to mine intra-view consistency accurately. Finally, we present cross-view graph-level alignment to fully exploit the complementary information across different views. Experimental results on six multi-view datasets demonstrate that our RGCL exhibits superiority and effectiveness compared with 9 state-of-the-art IMVC methods. The source code is available at https://github.com/DYZ163/RGCL.git.
1296: Misclassification-driven Fingerprinting for DNNs Using Frequency-aware GANs
Authors: Weixing Liu, Shenghua Zhong
Location: Guangzhou | Day: TBD
Show Abstract
Deep neural networks (DNNs) have become valuable assets due to their success in various tasks, but their high training costs also make them targets for model theft. Fingerprinting techniques are commonly used to verify model ownership, but existing methods either require training many additional models, leading to increased costs, or rely on GANs to generate fingerprints near decision boundaries, which may compromise image quality. To address these challenges, we propose a GAN-based fingerprint generation method that applies frequency-domain perturbations to normal samples, effectively creating fingerprints. This approach not only resists intellectual property (IP) threats, but also improves fingerprint acquisition efficiency while maintaining high imperceptibility. Extensive experiments demonstrate that our method achieves a state-of-the-art (SOTA) AUC of 0.98 on the Tiny-ImageNet dataset under IP removal attacks, outperforming existing methods by 8%, and consistently achieves the best ABP for three types of IP detection and erasure attacks on the GTSRB dataset. Our source code is available at https://github.com/wason981/Frequency-Fingerprinting.
1298: Exploiting Label Skewness for Spiking Neural Networks in Federated Learning
Authors: Di Yu, Xin Du, Linshan Jiang, Huijing Zhang, Shuiguang Deng
Location: Guangzhou | Day: TBD
The energy efficiency of deep spiking neural networks (SNNs) aligns with the constraints of resource-limited edge devices, positioning SNNs as a promising foundation for intelligent applications leveraging the extensive data collected by these devices. To safeguard data privacy, federated learning (FL) facilitates collaborative SNN-based model training by leveraging data distributed across edge devices without transmitting local data to a central server. However, existing FL approaches encounter challenges in handling label-skewed data across devices, inducing drift in the local SNN model and consequently impairing the performance of the global SNN model. To tackle these problems, we propose a novel framework called FedLEC, which incorporates intra-client label weight calibration to balance the learning intensity across local labels and inter-client knowledge distillation to mitigate local SNN model bias caused by label absence. Extensive experiments with three different structured SNNs across five datasets (i.e., three non-neuromorphic and two neuromorphic datasets) demonstrate the efficiency of FedLEC. Compared to seven state-of-the-art FL algorithms, FedLEC achieves an average accuracy improvement of approximately 11.59% for the global SNN model under various label skew distribution settings.
1303: Computational Complexity of Planning for Recursive Primitive Task Networks: Selective Action Nullification with State Preservation
Authors: Yifan Zhang, Pascal Bercher
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Planning and Scheduling (2/5)
This paper investigates fundamental aspects of Hierarchical Task Network (HTN) planning by systematically exploring recursive arrangements of primitive task networks. Working within a general framework that aligns with recently identified ACKERMANN-complete HTN problems, we map the computational complexity across various recursive configurations, revealing a rich complexity landscape. Through a novel proof technique that we call selective action nullification with state preservation, we demonstrate that even a highly restricted class of regular HTN problems remains PSPACE-complete, establishing a profound connection to classical planning. We hope these findings contribute to a deeper and broader understanding of the theoretical foundations of HTN planning.
1306: AdaR: An Adaptive Gradient Method with Cyclical Restarting of Moment Estimations
Authors: Yangchuan Wang, Lianhong Ding, Peng Shi
Location: Guangzhou | Day: TBD
Adaptive gradient methods, primarily based on Adam, are prevalent in training neural networks, adjusting step sizes via exponentially decaying averages of gradients and squared gradients. Adam assigns small weights to distant gradients, termed long-tail gradients in this paper. However, these gradients persistently influence update behavior, potentially degrading generalization performance. To address this issue, we incorporate a restart mechanism into moment estimations, proposing AdaR (ADAptive gradient methods via Restarting moment estimations). Specifically, AdaR divides a training epoch into fixed-iteration intervals, alternating between two sets of moment estimations for parameter updates and discarding prior moment estimations at the beginning of each interval. Within each interval, one set updates parameters and will be discarded in the subsequent interval, while the other is reset at the midpoint to estimate moments for updates in the subsequent interval. The restart mechanism cyclically discards distant gradients, initiates fresh moment estimations for parameter updates, and stabilizes training. By prioritizing recent gradients, the method increases estimation accuracy and enhances step size adjustment. Empirically, AdaR outperforms state-of-the-art optimization algorithms on image classification and language modeling tasks, demonstrating superior generalization and faster convergence.
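The restart scheme described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `Moments`, the function `restarted_adam`, the scalar test problem, and all hyperparameter values are assumptions chosen for illustration, and the abstract does not specify details such as bias correction, which is borrowed here from standard Adam.

```python
import math


class Moments:
    """First and second moment estimates with their own step counter."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Discard all accumulated gradient history.
        self.m, self.v, self.t = 0.0, 0.0, 0

    def update(self, grad, beta1, beta2):
        self.t += 1
        self.m = beta1 * self.m + (1.0 - beta1) * grad
        self.v = beta2 * self.v + (1.0 - beta2) * grad * grad

    def direction(self, beta1, beta2, eps):
        # Adam-style bias-corrected update direction (assumed, not from the abstract).
        m_hat = self.m / (1.0 - beta1 ** self.t)
        v_hat = self.v / (1.0 - beta2 ** self.t)
        return m_hat / (math.sqrt(v_hat) + eps)


def restarted_adam(grad_fn, x0, steps=600, interval=50, lr=0.02,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise a scalar function using two alternating sets of moments."""
    x = x0
    active, standby = Moments(), Moments()
    half = interval // 2
    for k in range(steps):
        pos = k % interval
        if pos == 0 and k > 0:
            # New interval: the warmed-up standby set takes over the updates.
            active, standby = standby, active
        if pos == half:
            # Midpoint: restart the standby set so it only sees recent gradients.
            standby.reset()
        g = grad_fn(x)
        active.update(g, beta1, beta2)
        if pos >= half:
            standby.update(g, beta1, beta2)
        x -= lr * active.direction(beta1, beta2, eps)
    return x


# Hypothetical usage: minimise f(x) = (x - 3)^2, whose gradient is 2 (x - 3).
x_final = restarted_adam(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

In this sketch the set that drives the updates never starts from zero after the first interval, because it has already accumulated half an interval of gradients before it takes over.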
1311: Top-Down Guidance for Learning Object-Centric Representations
Authors: Junhong Zou, Xiangyu Zhu, Zhaoxiang Zhang, Zhen Lei
Location: Guangzhou | Day: TBD
Humans’ innate ability to decompose scenes into objects allows for efficient understanding, predicting, and planning. In light of this, Object-Centric Learning (OCL) attempts to endow networks with similar capabilities, learning to represent scenes as compositions of objects. However, existing OCL models learn only by reconstructing the input images, which does not help the model distinguish objects and results in suboptimal object-centric representations. This flaw limits current object-centric models to relatively simple downstream tasks. To address this issue, we draw on humans’ top-down vision pathway and propose the Top-Down Guided Network (TDGNet), which includes a top-down pathway to improve object-centric representations. During training, the top-down pathway constructs guidance from high-level object-centric representations to optimize the low-level grid features output by the backbone. During inference, it refines object-centric representations by detecting and resolving conflicts between low- and high-level features. We show that TDGNet outperforms current object-centric models on multiple datasets of varying complexity. In addition, we expand the downstream task scope of object-centric representations by applying TDGNet to the field of robotics, validating its effectiveness in downstream tasks including video prediction and visual planning. Code will be available at https://github.com/zoujunhong/RHGNet.
1316: Detecting Hallucination in Large Language Models Through Deep Internal Representation Analysis
Authors: Luan Zhang, Dandan Song, Zhijing Wu, Yuhang Tian, Changzhi Zhou, Jing Xu, Ziyi Yang, Shuhao Zhang
Location: Guangzhou | Day: TBD
Large language models (LLMs) have shown exceptional performance across various domains. However, LLMs are prone to hallucinate facts and generate non-factual responses, which can undermine their reliability in real-world applications. Current hallucination detection methods suffer from external resource demands, substantial time overhead, difficulty overcoming LLMs’ intrinsic limitations, and insufficient modeling. In this paper, we propose MHAD, a novel internal-representation-based hallucination detection method. MHAD utilizes linear probing to select neurons and layers within LLMs. The selected neurons and layers are shown to exhibit significant awareness of hallucinations at the initial and final generation steps. By concatenating the outputs of the selected neurons in the selected layers at the initial and final generation steps, a hallucination awareness vector is formed, enabling precise hallucination detection via an MLP. Additionally, we introduce SOQHD, a novel benchmark for evaluating hallucination detection in Open-Domain QA (ODQA). Extensive experiments show that MHAD outperforms existing hallucination detection methods across multiple LLMs, demonstrating superior effectiveness.
1326: ST-USleepNet: A Spatial-Temporal Coupling Prominence Network for Multi-Channel Sleep Staging
Authors: Jingying Ma, Qika Lin, Ziyu Jia, Mengling Feng
Location: Guangzhou | Day: TBD
Sleep staging is critical to assess sleep quality and diagnose disorders. Despite advancements in artificial intelligence enabling automated sleep staging, significant challenges remain: (1) Simultaneously extracting prominent temporal and spatial sleep features from multi-channel raw signals, including characteristic sleep waveforms and salient spatial brain networks. (2) Capturing the spatial-temporal coupling patterns essential for accurate sleep staging. To address these challenges, we propose a novel framework named ST-USleepNet, comprising a spatial-temporal graph construction module (ST) and a U-shaped sleep network (USleepNet). The ST module converts raw signals into a spatial-temporal graph based on signal similarity, temporal, and spatial relationships to model spatial-temporal coupling patterns. The USleepNet employs a U-shaped structure for both the temporal and spatial streams, mirroring its original use in image segmentation to isolate significant targets. Applied to raw sleep signals and graph data from the ST module, USleepNet effectively segments these inputs, simultaneously extracting prominent temporal and spatial sleep features. Testing on three datasets demonstrates that ST-USleepNet outperforms existing baselines, and model visualizations confirm its efficacy in extracting prominent sleep features and temporal-spatial coupling patterns across various sleep stages. The code is available at https://github.com/Majy-Yuji/ST-USleepNet.
1333: Initial Models and Serialisability in Abstract Dialectical Frameworks
Authors: Lars Bengel, Matthias Thimm
Location: Montreal | Day: August 19th | Time: 15:00 | Session: KRR: Argumentation
We introduce initial models for abstract dialectical frameworks (ADFs) as a notion of minimal justifiable valuations and, based on that, generalise the concept of serialisability of argumentation semantics to ADFs. In particular, we show that the characteristic operator-based semantics for ADFs can be characterised through serialisation sequences, which are, essentially, decompositions of a model into a series of initial models, representing a more fine-grained view into why a model is acceptable with respect to the semantics. We also analyse the computational complexity of tasks related to initial models.
1337: SCNNs: Spike-based Coupling Neural Networks for Understanding Structural-Functional Relationships in the Human Brain
Authors: Shaolong Wei, Shu Jiang, Mingliang Wang, Liang Sun, Haonan Rao, Weiping Ding, Jiashuang Huang
Location: Guangzhou | Day: TBD
Structural-functional coupling (SC-FC coupling) offers an effective approach for analyzing structural-functional relationships, capable of revealing the dependency of functional activity on the underlying white matter architecture. However, extant SC-FC coupling analysis methods primarily center on disclosing the statistical association between the topological patterns of structural connectivity (SC) and functional connectivity (FC), while often neglecting the neurobiological mechanisms by which the brain typically transmits and processes information in the form of spikes. To address this, we propose a biologically inspired deep-learning model called spike-based coupling neural networks (SCNNs). It can simulate spiking neural activity to more realistically reproduce the interaction between brain regions and the dynamic behavior of neuronal networks. Specifically, we first use spike neurons to capture the FC temporal characteristics of the original functional magnetic resonance imaging (fMRI) time series and the SC spatial characteristics of the structural brain network. Then, we use synaptic and neuronal filter effects to simulate the coupling mechanism of SC and FC in the brain at different temporal and spatial scales, thereby quantifying SC-FC coupling and providing support for the identification of brain diseases. The results on real datasets show that the proposed method can identify brain diseases and provide a new perspective for understanding SC-FC relationships.
1341: TreeKV: Smooth Key-Value Cache Compression with Tree Structures
Authors: Ziwei He, Jian Yuan, Haoli Bai, Jingwen Leng, Bo Jiang
Location: Guangzhou | Day: TBD
Efficient key-value (KV) cache compression is critical for scaling transformer-based Large Language Models (LLMs) in long sequences and resource-limited settings. Existing methods evict tokens based on their positions or importance, but position-based strategies can miss crucial information outside predefined regions, while those relying on global importance scores result in strong regional biases, limiting the KV cache’s overall context retention and potentially impairing the performance of LLMs on complex tasks. Our wavelet analysis reveals that as tokens approach the end of the sequence, their contributions to generation gradually increase and tend to diverge more from those of neighboring tokens, indicating a smooth transition with increasing complexity and variability from distant to nearby context. Motivated by this observation, we propose TreeKV, an intuitive, training-free method that employs a tree structure for smooth cache compression. TreeKV maintains a fixed cache size, allowing LLMs to deliver high-quality output in long-text scenarios, and is applicable during both the generation and prefilling stages. TreeKV consistently surpasses all baseline models in language modeling tasks on PG19 and OpenWebText2, allowing LLMs trained with a short context window to generalize to a longer window with a 16x cache reduction. On the LongBench benchmark, TreeKV achieves the best performance with only 6% of the budget at optimal efficiency.
1350: Airdrop Games
Authors: Sotiris Georganas, Aggelos Kiayias, Paolo Penna
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Game Theory
Launching a new blockchain system or application is frequently facilitated by a so-called airdrop, where the system designer chooses a pre-existing set of potentially interested parties and allocates newly minted tokens to them with the expectation that they will participate in the system. Such engagement, especially if it is significant, facilitates the system and raises its value as well as the value of its newly minted token, hence benefiting the airdrop recipients. A number of challenging questions befuddle designers in this setting, such as how to choose the set of interested parties and how to allocate tokens to them. To address these considerations, we put forward a game-theoretic model for such airdrop games. Our model can be used to guide the designer’s choices based on the way the system’s value depends on participation (modeled by a “technology function” in our framework) and the costs that participation incurs. We identify both bad and good equilibria, as well as the settings and choices through which the designer can steer the players towards good equilibria in an expedient manner.
1356: ADC-GS: Anchor-Driven Deformable and Compressed Gaussian Splatting for Dynamic Scene Reconstruction
Authors: He Huang, Qi Yang, Mufan Liu, Yiling Xu, Zhu Li
Location: Guangzhou | Day: TBD
Existing 4D Gaussian Splatting methods rely on per-Gaussian deformation from a canonical space to target frames, which overlooks redundancy among adjacent Gaussian primitives and results in suboptimal performance. To address this limitation, we propose Anchor-Driven Deformable and Compressed Gaussian Splatting (ADC-GS), a compact and efficient representation for dynamic scene reconstruction. Specifically, ADC-GS organizes Gaussian primitives into an anchor-based structure within the canonical space, enhanced by a temporal significance-based anchor refinement strategy. To reduce deformation redundancy, ADC-GS introduces a hierarchical coarse-to-fine pipeline that captures motions at varying granularities. Moreover, a rate-distortion optimization is adopted to achieve an optimal balance between bitrate consumption and representation fidelity. Experimental results demonstrate that ADC-GS outperforms the per-Gaussian deformation approaches in rendering speed by 300%-800% while achieving state-of-the-art storage efficiency without compromising rendering quality. The code is released at https://github.com/H-Huang774/ADC-GS.git.
1358: Multimodal Inference with Incremental Tabular Attributes
Authors: Xinda Chen, Zhen Xing, Zixian Zhang, Weimin Tan, Bo Yan
Location: Guangzhou | Day: TBD
Multimodal learning with visual and tabular modalities has become increasingly popular, especially in healthcare. As new equipment is adopted or new factors are introduced, the tabular modality keeps changing. However, the standard process of training multimodal AI models requires tables to have fixed columns in training and inference; thus, it is not suitable for handling dynamically changing tables. Therefore, new methods are needed for efficiently handling such tables in multimodal learning. In this paper, we introduce a new task, multimodal inference with incremental tabular attributes, which aims to enable trained multimodal models to efficiently leverage incremental attributes in the tabular modality during the inference stage. We implement a specialized encoder that disentangles the latent representation of the incremental tabular attributes, both among themselves and from the old attributes, to reduce information redundancy, and further aligns the incremental attributes with the visual modality through a consistency loss to improve information richness. Experimental results across five public datasets show that our method effectively utilizes incremental tabular attributes, achieving state-of-the-art performance in general scenarios. Beyond inference, we also find that our method achieves better performance in fully supervised settings, suggesting a new training style for multimodal learning with tables.
1369: Beyond Statistical Analysis: Multimodal Framework for Time Series Forecasting with LLM-Driven Temporal Pattern
Authors: Jiahong Xiong, Chengsen Wang, Haifeng Sun, Yuhan Jing, Qi Qi, Zirui Zhuang, Lei Zhang, Jianxin Liao, Jingyu Wang
Location: Guangzhou | Day: TBD
Accurate forecasting of time series is crucial for many applications in the real world. Conventional methods primarily rely on statistical analysis of historical data, often leading to overfitting and failing to account for background information and constraints imposed by external events. Therefore, introducing large language models (LLMs) with robust textual capabilities holds significant potential. However, due to the inherent limitations of LLMs in handling numerical data, they do not exhibit advantages in precise numerical prediction tasks. Therefore, we propose a framework to integrate LLMs with conventional methods synergistically. Rather than directly outputting numerical predictions, we leverage the capabilities of the LLMs to generate textual temporal patterns, thereby fully utilizing their inherent knowledge and reasoning abilities. Additionally, we introduce a memory network designed to decode these textual representations into a format that numerical models can effectively interpret. This approach not only capitalizes on the strengths of the LLM in text processing but also bridges the gap between textual and numerical data, enhancing the overall predictive performance of the model. Our experimental results demonstrate the framework’s effectiveness, achieving state-of-the-art performance on various benchmark datasets.
1372: Approximate Lifted Model Construction
Authors: Malte Luttermann, Jan Speller, Marcel Gehrke, Tanya Braun, Ralf Möller, Mattis Hartwig
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Uncertainty in AI
Probabilistic relational models such as parametric factor graphs enable efficient (lifted) inference by exploiting the indistinguishability of objects. In lifted inference, a representative of indistinguishable objects is used for computations. To obtain a relational (i.e., lifted) representation, the Advanced Colour Passing (ACP) algorithm is the state of the art. The ACP algorithm, however, requires the underlying distributions, encoded as potential-based factorisations, to match exactly in order to identify and exploit indistinguishabilities. Hence, ACP is unsuitable for practical applications, where potentials learned from data inevitably deviate even if the associated objects are indistinguishable. To mitigate this problem, we introduce the ε-Advanced Colour Passing (ε-ACP) algorithm, which allows for a deviation of potentials depending on a hyperparameter ε. ε-ACP efficiently uncovers and exploits indistinguishabilities that are not exact. We prove that the approximation error induced by ε-ACP is strictly bounded, and our experiments show that the approximation error is close to zero in practice.
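The idea of tolerating small deviations between potentials can be illustrated with a toy sketch. It does not reproduce ε-ACP or colour passing: the relative-deviation test in `within_eps`, the greedy grouping against the first member of each group, and the averaging step in `merge_group` are all assumptions made for illustration, and the potential tables are made up.

```python
def within_eps(p, q, eps):
    """Assumed criterion: entries of two positive potential tables differ by
    at most a relative factor eps, measured against the smaller entry."""
    return len(p) == len(q) and all(
        abs(a - b) <= eps * min(a, b) for a, b in zip(p, q)
    )


def group_factors(tables, eps):
    """Greedily group table indices; each table is compared with the first
    member (representative) of every existing group."""
    groups = []
    for i, table in enumerate(tables):
        for group in groups:
            if within_eps(table, tables[group[0]], eps):
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def merge_group(tables, group):
    """Replace a group of near-identical tables by their entry-wise mean."""
    n = len(group)
    return [sum(tables[i][j] for i in group) / n for j in range(len(tables[group[0]]))]


# Hypothetical potentials: the first two tables differ only slightly.
tables = [[1.0, 2.0], [1.05, 1.95], [3.0, 0.5]]
groups = group_factors(tables, eps=0.1)    # first two tables share a group
exact = group_factors(tables, eps=0.0)     # exact matching separates all three
shared = merge_group(tables, groups[0])
```

With `eps=0.0` the sketch falls back to exact matching, which is the behaviour the abstract attributes to plain ACP.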
1377: Prototype-guided Knowledge Propagation with Adaptive Learning for Lifelong Person Re-identification
Authors: Zhijie Lu, Wuxuan Shi, He Li, Mang Ye
Location: Guangzhou | Day: TBD
Lifelong Person Re-identification (LReID) is essential in dynamic camera networks, where a model must continually adapt to new environments while preserving previously acquired knowledge. Existing LReID techniques often preserve samples from past datasets to maintain old knowledge, potentially leading to privacy risks. While prototype-based methods offer privacy advantages, current approaches primarily focus on adjusting classifiers for image classification tasks, neglecting representation biases between old and new identities in person re-identification. This study introduces a novel Prototype-guided Knowledge Propagation (PKP) method, which mitigates discrepancies in similar identity images between old and new tasks by guiding prototype construction through triplet loss constraints. Additionally, to address disparities between prototypes and the updated feature extractor, an Adaptive Parameter Evolution (APE) strategy is proposed. APE optimizes the integration of the old and new models by assessing the importance of the new tasks, dynamically selecting the most pertinent parameters for updates according to their contribution to the current task. Extensive experiments on the LReID benchmark demonstrate that our approach surpasses state-of-the-art prototype-based LReID methods in terms of mAP and rank-1 accuracy. Code is available at https://github.com/joyner-7/IJCAI2025-PKA.
1379: The Role of Video Generation in Enhancing Data-Limited Action Understanding
Authors: Wei Li, Dezhao Luo, Dongbao Yang, Zhenhang Li, Weiping Wang, Yu Zhou
Location: Montreal | Day: August 21st | Time: 15:00 | Session: CV: videos
Video action understanding tasks in real-world scenarios often suffer from data limitations. In this paper, we address the data-limited action understanding problem by alleviating data scarcity. We propose a novel method that leverages a text-to-video diffusion transformer to generate annotated data for model training. This paradigm enables the generation of realistic annotated data at an unlimited scale without human intervention. We propose the Information Enhancement Strategy and the Uncertainty-Based Soft Target, both tailored to training on generated samples. Through quantitative and qualitative analyses, we find that real samples generally contain richer information than generated samples. Based on this observation, the Information Enhancement Strategy is designed to enhance the informational content of the generated samples from two perspectives: the environment and the character. Furthermore, we observe that a portion of low-quality generated samples might negatively affect model training. To address this, we devise an uncertainty-based label-smoothing strategy that increases the smoothing applied to these low-quality samples, thereby reducing their impact. We demonstrate the effectiveness of the proposed method on four datasets and five tasks, and achieve state-of-the-art performance for zero-shot action recognition.
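The uncertainty-based label smoothing mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' formulation: the abstract does not say how uncertainty is measured or how it maps to a smoothing strength, so the normalised-entropy score, the linear mapping, and the names `normalized_entropy`, `soft_target`, and `max_smoothing` are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Uncertainty score in [0, 1]: entropy of a predicted class distribution
    divided by its maximum, log(K). (Assumed measure of sample quality.)"""
    entropy = sum(-p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def soft_target(label, num_classes, uncertainty, max_smoothing=0.5):
    """Label-smoothed target whose smoothing grows with the uncertainty."""
    clipped = min(max(uncertainty, 0.0), 1.0)
    smoothing = max_smoothing * clipped
    target = [smoothing / num_classes] * num_classes
    target[label] += 1.0 - smoothing
    return target


# Hypothetical usage: a confident sample keeps a hard label,
# an ambiguous sample gets a flatter target.
confident = soft_target(1, 4, normalized_entropy([0.0, 1.0, 0.0, 0.0]))
ambiguous = soft_target(1, 4, normalized_entropy([0.25, 0.25, 0.25, 0.25]))
```

A flatter target lowers the loss weight placed on the nominal class of a doubtful generated sample, which is one way to reduce its influence on training.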
1383: GRAML: Goal Recognition As Metric Learning
Authors: Matan Shamir, Reuth Mirsky
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Planning and Scheduling (3/5)
Goal Recognition (GR) is the problem of recognizing an agent’s objectives based on observed actions. Recent data-driven approaches for GR alleviate the need for costly, manually crafted domain models. However, these approaches can only reason about a pre-defined set of goals, and time-consuming training is needed for new emerging goals. To keep this model-learning automated while enabling quick adaptation to new goals, this paper introduces GRAML: Goal Recognition As Metric Learning. GRAML frames GR as a deep metric learning problem, using a Siamese network composed of recurrent units to learn an embedding space where traces leading to the same goal are close, and those leading to different goals are distant. This metric is particularly effective for adapting to new goals, even when only a single example trace is available per goal. Evaluated on a versatile set of environments, GRAML shows speed, flexibility, and runtime improvements over the state-of-the-art GR while maintaining accurate recognition.
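Recognition by distance in an embedding space, as described in this abstract, can be sketched in a few lines. The sketch is illustrative only: `embed` is a hand-written stand-in (the mean observation) for the learned recurrent Siamese encoder, `contrastive_loss` is the standard pairwise contrastive loss rather than a loss confirmed by the abstract, and the goals and traces are made up.

```python
import math


def embed(trace):
    """Stand-in encoder: mean of the observation vectors in a trace.
    (GRAML learns this mapping with a recurrent Siamese network.)"""
    dim = len(trace[0])
    return [sum(obs[i] for obs in trace) / len(trace) for i in range(dim)]


def contrastive_loss(distance, same_goal, margin=1.0):
    """Pull same-goal pairs together; push other pairs at least margin apart."""
    if same_goal:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def recognize(trace, goal_examples):
    """Return the goal whose single example trace is nearest in embedding space."""
    z = embed(trace)
    return min(goal_examples,
               key=lambda goal: math.dist(z, embed(goal_examples[goal])))


# Hypothetical usage: one example trace per goal, 2-D observations.
goal_examples = {
    "left": [(-1.0, 0.0), (-2.0, 0.0)],
    "right": [(1.0, 0.0), (2.0, 0.0)],
}
predicted = recognize([(0.5, 0.0), (1.5, 0.1)], goal_examples)
```

Adding a new goal in this sketch only requires adding one entry to `goal_examples`; no retraining step is involved, which mirrors the adaptation scenario the abstract describes.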
1386: Reliable and Calibrated Semantic Occupancy Prediction by Hybrid Uncertainty Learning
Authors: Song Wang, Zhongdao Wang, Jiawei Yu, Wentong Li, Bailan Feng, Junbo Chen, Jianke Zhu
Location: Guangzhou | Day: TBD
Vision-centric semantic occupancy prediction plays a crucial role in autonomous driving, which requires accurate and reliable predictions from low-cost sensors. Although camera-based methods have notably narrowed the accuracy gap with LiDAR, little research has explored reliability and calibration in predicting semantic occupancy from cameras. In this paper, we conduct a comprehensive evaluation of existing semantic occupancy prediction models from a reliability perspective for the first time. Despite the gradual alignment of camera-based models with LiDAR in terms of accuracy, a significant reliability gap still persists. To address this concern, we propose ReliOcc, a method designed to enhance the reliability of camera-based occupancy networks. ReliOcc provides a plug-and-play scheme for existing models, which integrates hybrid uncertainty from individual voxels with sampling-based noise and relative voxels through mix-up learning. Besides, an uncertainty-aware calibration strategy is devised to further improve model reliability in offline mode. Extensive experiments under various settings demonstrate that ReliOcc significantly enhances the reliability of the learned model while maintaining accuracy for both geometric and semantic predictions. Notably, our proposed approach exhibits robustness to sensor failures and out-of-domain noise during inference.
1393: Transferable Relativistic Predictor: Mitigating Cross-Task Cold-Start Issue in NAS
Authors: Nan Li, Bing Xue, Lianbo Ma, Mengjie Zhang
Location: Guangzhou | Day: TBD
In neural architecture search (NAS), the relativistic predictor has recently emerged as an attractive technique for addressing the ranking issue in performance evaluation by predicting the relativistic ranking of an architecture pair rather than the absolute performance of a single architecture. However, it suffers from a significant cold-start issue, requiring a large number of evaluated architectures to train an effective predictor on new datasets. In this paper, we propose a transferable relativistic predictor (TRP). Specifically, we construct a proxy dataset using transferable, cheaper-to-obtain performance estimates to softly label the rank between architecture pairs. The soft label with a smooth and easy-to-optimize loss function facilitates the learning of expressive and generalizable representations on the proxy dataset. Furthermore, we construct a Chebyshev interpolation of the correlation curve to adaptively determine the number of evaluated architectures required on each dataset. Extensive experimental results in different search spaces show the superior performance of TRP compared with state-of-the-art predictors. TRP requires only 54 and 73 evaluated architectures for a warm start on CIFAR-10 and CIFAR-100, respectively, under the DARTS search space.
1406: Few-shot Novel Category Discovery
Authors: Chunming Li, Shidong Wang, Haofeng Zhang
Location: Guangzhou | Day: TBD
The recently proposed Novel Category Discovery (NCD) adopts a transductive learning paradigm, which hinders its application in more real-world scenarios. In fact, a few labeled examples for some of the new categories can alleviate this burden considerably, which is consistent with the ease with which people can label a few examples of a new category. Therefore, this paper presents a new setting in which a trained agent is able to flexibly switch between the tasks of identifying examples of known (labelled) classes and clustering novel (completely unlabeled) classes as the number of query examples increases, by leveraging knowledge learned from only a handful of support examples. Drawing inspiration from the discovery of novel categories using prior-based clustering algorithms, we introduce a novel framework that further relaxes its assumptions to the real-world open set level by unifying the concept of model adaptability in few-shot learning. We refer to this setting as Few-Shot Novel Category Discovery (FSNCD) and propose Semi-supervised Hierarchical Clustering (SHC) and Uncertainty-aware K-means Clustering (UKC) to examine the model’s reasoning capabilities. Extensive experiments and detailed analysis on five commonly used datasets demonstrate that our methods can achieve leading performance levels across different task settings and scenarios. Code is available at: https://github.com/Ashengl/FSNCD.
1407: A Medical Image Classification Network Based on Multi-View Consistent Momentum Contrastive Learning
Authors: Chuangui Cao, Shifei Ding, Lili Guo
Location: Guangzhou | Day: TBD
Due to variations in imaging conditions, images often exhibit discrepancies in color reproduction. Furthermore, motion-induced blur can lead to edge degradation, making color sensitivity and edge blurriness two prevalent and challenging issues in both natural image processing and medical image analysis. To address these challenges, we propose a model termed the Three-View Consistency Momentum Contrastive with Sobel Operator (SVCMC). Specifically, we first design a three-view momentum-update architecture that employs a Sobel-augmented ResNet as the backbone. We then introduce a novel contrastive loss, referred to as the Three-View Consistency Momentum Contrastive Loss. Next, to mitigate the oscillations and slow convergence commonly observed in contrastive learning, we construct a dynamic contrastive loss function that adapts in real time over the training process. Finally, we validated the superiority of our model on two medical image datasets and one natural image dataset, where its classification accuracy and convergence speed significantly outperformed existing state-of-the-art contrastive models.
1408: Enhancing User-Oriented Proactivity in Open-Domain Dialogues with Critic Guidance
Authors: Yufeng Wang, Jinwu Hu, Ziteng Huang, Kunyang Lin, Zitian Zhang, Peihao Chen, Yu Hu, Qianyue Wang, Zhuliang Yu, Bin Sun, Xiaofen Xing, Qingfang Zheng, Mingkui Tan
Location: Guangzhou | Day: TBD
Open-domain dialogue systems aim to generate natural and engaging conversations, providing significant practical value in real applications such as social robotics and personal assistants. The advent of large language models (LLMs) has greatly advanced this field by improving context understanding and conversational fluency. However, existing LLM-based dialogue systems often fall short in proactively understanding the user’s chatting preferences and guiding conversations toward user-centered topics. This lack of user-oriented proactivity can lead users to feel unappreciated, reducing their satisfaction and willingness to continue the conversation in human-computer interactions. To address this issue, we propose a User-oriented Proactive Chatbot (UPC) to enhance the user-oriented proactivity. Specifically, we first construct a critic to evaluate this proactivity inspired by the LLM-as-a-judge strategy. Given the scarcity of high-quality training data, we then employ the critic to guide dialogues between the chatbot and user agents, generating a corpus with enhanced user-oriented proactivity. To ensure the diversity of the user backgrounds, we introduce the ISCO-800, a diverse user background dataset for constructing user agents. Moreover, considering the communication difficulty varies among users, we propose an iterative curriculum learning method that trains the chatbot from easy-to-communicate users to more challenging ones, thereby gradually enhancing its performance. Experiments demonstrate that our proposed training method is applicable to different LLMs, improving user-oriented proactivity and attractiveness in open-domain dialogues. Code and appendix are available at github.com/wang678/LLM-UPC.
1412: Multi-player Multi-armed Bandits with Delayed Feedback
Authors: Jingqi Fan, Zilong Wang, Shuai Li, Linghe Kong
Location: Guangzhou | Day: TBD
Multi-player multi-armed bandits (MP-MAB) have been extensively studied due to their application in cognitive radio networks. In this setting, multiple players simultaneously select arms and instantly receive feedback. However, in realistic decentralized networks, feedback is often delayed due to sensing latency and signal processing. Without a central coordinator, explicit communication is impossible, and delayed feedback disrupts implicit coordination, since it depends on synchronous observations. As a result, collisions are frequent and system performance degrades significantly. In this paper, we propose an algorithm in MP-MAB with stochastic delay feedback. Each player in the algorithm independently maintains an estimate of the optimal arm set based on their own delayed rewards but only pulls arms from the set, which is, with high probability, identical to those of other players, thus avoiding collisions. The identical arm set also enables implicit communication, allowing players to utilize the exploration results of others. We establish a regret upper bound and derive a lower bound to prove the algorithm is near-optimal. Numerical experiments on both synthetic and real-world datasets validate the effectiveness of our algorithm.
1420: Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation
Authors: Jie Li, Shifei Ding, Lili Guo, Xuan Li
Location: Guangzhou | Day: TBD
Emotion Recognition in Conversation (ERC) aims to detect the emotions of individual utterances within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted using different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations and achieve state-of-the-art performance in emotion recognition. Our code is available at: https://github.com/JieLi-dd/MAGTKD.
1423: Collaborative Multi-LoRA Experts with Achievement-based Multi-Tasks Loss for Unified Multimodal Information Extraction
Authors: Li Yuan, Yi Cai, Xudong Shen, Qing Li, Qingbao Huang, Zikun Deng, Tao Wang
Location: Guangzhou | Day: TBD
Multimodal Information Extraction (MIE) has gained attention for extracting structured information from multimedia sources. Traditional methods tackle MIE tasks separately, missing opportunities to share knowledge across tasks. Recent approaches unify these tasks into a generation problem using instruction-based T5 models with visual adaptors, optimized through full-parameter fine-tuning. However, this method is computationally intensive, and multi-task fine-tuning often faces gradient conflicts, limiting performance.
To address these challenges, we propose collaborative multi-LoRA experts with achievement-based multi-task loss (C-LoRAE) for MIE tasks. C-LoRAE extends the low-rank adaptation (LoRA) method by incorporating a universal expert to learn shared multimodal knowledge from cross-MIE tasks and task-specific experts to learn specialized instructional task features. This configuration enhances the model’s generalization ability across multiple tasks while maintaining the independence of various instruction tasks and mitigating gradient conflicts. Additionally, we propose an achievement-based multi-task loss to balance training progress across tasks, addressing the imbalance caused by varying numbers of training samples in MIE tasks. Experimental results on seven benchmark datasets across three key MIE tasks demonstrate that C-LoRAE achieves superior overall performance compared to traditional fine-tuning methods and LoRA methods while utilizing a comparable number of training parameters to LoRA.
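The abstract does not spell out how the universal and task-specific experts are combined. As a rough sketch only, assume each expert is a standard LoRA pair (A, B) whose low-rank update B(Ax) is added to the frozen layer output, and assume the two updates are simply summed. All names here (`lora_delta`, `forward`, the task key `"ner"`) are hypothetical and not taken from C-LoRAE.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_delta(A, B, x, scale=1.0):
    """Low-rank update B @ (A @ x); A is r x d_in, B is d_out x r."""
    return [scale * y for y in matvec(B, matvec(A, x))]


def forward(W, x, universal, task_experts, task):
    """Frozen base output plus one shared and one task-specific LoRA update.

    Summing the two updates is an assumption made for illustration.
    """
    base = matvec(W, x)
    shared = lora_delta(universal[0], universal[1], x)
    specific = lora_delta(task_experts[task][0], task_experts[task][1], x)
    return [b + s + t for b, s, t in zip(base, shared, specific)]


W = [[1.0, 0.0], [0.0, 1.0]]                      # frozen weights
universal = ([[1.0, 0.0]], [[1.0], [0.0]])        # rank-1 shared expert
task_experts = {"ner": ([[0.0, 1.0]], [[0.0], [1.0]])}
output = forward(W, [1.0, 2.0], universal, task_experts, "ner")
```

Only the small A and B matrices would be trained in such a setup, which is why the trainable parameter count stays close to that of plain LoRA.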
1428: Smoothed Online Convex Optimization with Delayed Feedback
Authors: Sifan Yang, Wenhao Yang, Wei Jiang, Yuanyu Wan, Lijun Zhang
Location: Guangzhou | Day: TBD
Smoothed online convex optimization (SOCO), in which the online player incurs both a hitting cost and a switching cost for changing its decisions, has garnered significant attention in recent years. While existing studies typically assume that the gradient information is revealed immediately, such an assumption may not hold in some real-world applications. To overcome this limitation, we investigate SOCO with delayed feedback, and develop two online algorithms that can minimize the dynamic regret with switching cost. Firstly, we extend Mild-OGD, an existing algorithm that adopts the meta-expert framework for online convex optimization with delayed feedback, to account for switching cost. Specifically, we analyze the switching cost in the expert-algorithm of Mild-OGD, and then modify its meta-algorithm to incorporate this cost when assigning the weight to each expert. We demonstrate that our proposed method, Smelt-DOGD, can achieve an O(√(dT(P_T+1))) dynamic regret bound with switching cost, where d is the maximum delay and P_T is the path-length. Secondly, we develop an efficient variant to reduce the number of projections per round from O(log T) to 1, while maintaining the same theoretical guarantee. The key idea is to construct a new surrogate loss defined over a simpler domain for expert-algorithms so that these experts do not need to perform the complex projection operations in each round. Finally, we conduct experiments to validate the effectiveness and efficiency of our algorithms.
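To make the setting concrete, here is a minimal sketch of the problem, not of Smelt-DOGD or Mild-OGD. It runs plain projected online gradient descent in one dimension, where the gradient from round t only becomes usable at the end of round t + delay. It tracks the hitting cost and the switching cost separately. The quadratic losses, the fixed delay, the interval domain, and the function name `delayed_ogd` are all assumptions chosen for illustration.

```python
def delayed_ogd(targets, delay, step, lo=-1.0, hi=1.0):
    """Projected OGD on f_t(x) = (x - targets[t])**2 with delayed gradients.

    Returns (decisions, hitting_cost, switching_cost).
    """
    x = 0.0
    decisions = []
    pending = []          # list of (arrival_round, gradient)
    hitting = 0.0
    switching = 0.0
    previous = None
    for t, target in enumerate(targets):
        decisions.append(x)
        hitting += (x - target) ** 2
        if previous is not None:
            switching += abs(x - previous)
        previous = x
        # The gradient of f_t at the played point arrives `delay` rounds later.
        pending.append((t + delay, 2.0 * (x - target)))
        arrived = [g for (r, g) in pending if r <= t]
        pending = [(r, g) for (r, g) in pending if r > t]
        for g in arrived:
            # Gradient step followed by projection onto [lo, hi].
            x = min(hi, max(lo, x - step * g))
    return decisions, hitting, switching


decisions, hitting, switching = delayed_ogd([0.5] * 50, delay=2, step=0.1)
```

With a delay of 2, the first three decisions stay at the starting point because no gradient has arrived yet. This is the kind of lag a delay-aware analysis has to account for.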
1430: Device-Cloud Collaborative Correction for On-Device Recommendation
Authors: Tianyu Zhan, Shengyu Zhang, Zheqi Lv, Jieming Zhu, Jiwei Li, Fan Wu, Fei Wu
Location: Guangzhou | Day: TBD
With the rapid development of recommendation models and device computing power, device-based recommendation has become an important research area due to its better real-time performance and privacy protection. Previously, Transformer-based sequential recommendation models have been widely applied in this field because they outperform Recurrent Neural Network (RNN)-based recommendation models in terms of performance. However, as the length of interaction sequences increases, Transformer-based models introduce significantly more space and computational overhead compared to RNN-based models, posing challenges for device-based recommendation. To balance real-time performance and high performance on devices, we propose Device-Cloud Collaborative Correction Framework for On-Device Recommendation (CoCorrRec). CoCorrRec uses a self-correction network (SCN) to correct parameters with extremely low time cost. By updating model parameters during testing based on the input token, it achieves performance comparable to current optimal but more complex Transformer-based models. Furthermore, to prevent SCN from overfitting, we design a global correction network (GCN) that processes hidden states uploaded from devices and provides a global correction solution. Extensive experiments on multiple datasets show that CoCorrRec outperforms existing Transformer-based and RNN-based device recommendation models in terms of performance, with fewer parameters and lower FLOPs, thereby achieving a balance between real-time performance and high efficiency. Code is available at https://github.com/Yuzt-zju/CoCorrRec.
1441: TextMEF: Text-guided Prompt Learning for Multi-exposure Image Fusion
Authors: Jinyuan Liu, Qianjun Huang, Guanyao Wu, Di Wang, Zhiying Jiang, Long Ma, Risheng Liu, Xin Fan
Location: Guangzhou | Day: TBD
Multi-exposure image fusion (MEF) aims to integrate a set of low dynamic range images, producing a single image with a higher dynamic range than any of the inputs. Despite significant advancements, current MEF approaches still struggle to handle extremely over- or under-exposed conditions, resulting in unsatisfactory visual effects such as hallucinated details and distorted color tones. In this regard, we propose TextMEF, a prompt-driven fusion method enhanced by prompt learning, for multi-exposure image fusion. Specifically, we learn a set of prompts based on text-image similarity among negative and positive samples (over-exposed, under-exposed images, and well-exposed ones). These learned prompts are seamlessly integrated into the loss function, providing high-level guidance for constraining non-uniform exposure regions. Furthermore, we develop an attention Mamba module that effectively translates over-/under-exposed regional features into an exposure-invariant space and ensures that they build efficient long-range dependencies with the high dynamic range image. Extensive experimental results on three publicly available benchmarks demonstrate that our TextMEF significantly outperforms state-of-the-art approaches in both visual inspection and objective analysis.
1445: Dual-Balancing for Physics-Informed Neural Networks
Authors: Chenhong Zhou, Jie Chen, Zaifeng Yang, Ching Eng Png
Location: Guangzhou | Day: TBD
Physics-informed neural networks (PINNs) have emerged as a new learning paradigm for solving partial differential equations (PDEs) by enforcing the constraints of physical equations, boundary conditions (BCs), and initial conditions (ICs) into the loss function. Despite their successes, vanilla PINNs still suffer from poor accuracy and slow convergence due to the intractable multi-objective optimization issue. In this paper, we propose a novel Dual-Balanced PINN (DB-PINN), which dynamically adjusts loss weights by integrating inter-balancing and intra-balancing to alleviate two imbalance issues in PINNs. Inter-balancing aims to mitigate the gradient imbalance between PDE residual loss and condition-fitting losses by determining an aggregated weight that offsets their gradient distribution discrepancies. Intra-balancing acts on condition-fitting losses to tackle the imbalance in fitting difficulty across diverse conditions. By evaluating the fitting difficulty based on the loss records, intra-balancing can allocate the aggregated weight proportionally to each condition loss according to its fitting difficulty level. We further introduce a robust weight update strategy to prevent abrupt spikes and arithmetic overflow in instantaneous weight values caused by large loss variances, enabling smooth weight updating and stable training. Extensive experiments demonstrate that DB-PINN achieves significantly better performance than popular gradient-based weighting methods in terms of convergence speed and prediction accuracy. Our code and supplementary material are available at https://github.com/chenhong-zhou/DualBalanced-PINNs.
1451: Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization
Authors: Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Jiang Zong, Hao Peng, Jianwei Yin
Location: Guangzhou | Day: TBD
Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches often fail to compute "multi-stage" influence and lack scalability to billion-scale LLMs.
In this paper, we propose multi-stage influence functions to attribute the downstream predictions of fine-tuned LLMs to pre-training data under the full-parameter fine-tuning paradigm. To enhance the efficiency and practicality of our multi-stage influence function, we leverage Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation.
Empirical results validate the superior scalability of EK-FAC approximation and the effectiveness of our multi-stage influence function. Additionally, case studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power, with exemplars illustrating insights provided by multi-stage influence estimates.
1453: Explaining Black-box Model Predictions via Two-level Nested Feature Attributions with Consistency Property
Authors: Yuya Yoshikawa, Masanari Kimura, Ryotaro Shimizu, Yuki Saito
Location: Montreal | Day: August 21st | Time: 10:00 | Session: ML: Explainable/Interpretable machine learning
Techniques that explain the predictions of black-box machine learning models are crucial to make the models transparent, thereby increasing trust in AI systems.
The input features to the models often have a nested structure that consists of high- and low-level features, and each high-level feature is decomposed into multiple low-level features.
For such inputs, both high-level feature attributions (HiFAs) and low-level feature attributions (LoFAs) are important for better understanding the model’s decision.
In this paper, we propose a model-agnostic local explanation method that effectively exploits the nested structure of the input to estimate the two-level feature attributions simultaneously.
A key idea of the proposed method is to introduce the consistency property that should exist between the HiFAs and LoFAs, thereby bridging the separate optimization problems for estimating them.
Thanks to this consistency property, the proposed method can produce HiFAs and LoFAs that are both faithful to the black-box models and consistent with each other, using a smaller number of queries to the models.
In experiments on image classification in multiple instance learning and text classification using language models, we demonstrate that the HiFAs and LoFAs estimated by the proposed method are accurate, faithful to the behaviors of the black-box models, and provide consistent explanations.
1473: OT-DETECTOR: Delving into Optimal Transport for Zero-shot Out-of-Distribution Detection
Authors: Yu Liu, Hao Tang, Haiqi Zhang, Jing Qin, Zechao Li
Location: Guangzhou | Day: TBD
Out-of-distribution (OOD) detection is crucial for ensuring the reliability and safety of machine learning models in real-world applications. While zero-shot OOD detection, which requires no training on in-distribution (ID) data, has become feasible with the emergence of vision-language models like CLIP, existing methods primarily focus on semantic matching and fail to fully capture distributional discrepancies. To address these limitations, we propose OT-DETECTOR, a novel framework that employs Optimal Transport (OT) to quantify both semantic and distributional discrepancies between test samples and ID labels. Specifically, we introduce cross-modal transport mass and transport cost as semantic-wise and distribution-wise OOD scores, respectively, enabling more robust detection of OOD samples. Additionally, we present a semantic-aware content refinement (SaCR) module, which utilizes semantic cues from ID labels to amplify the distributional discrepancy between ID and hard OOD samples. Extensive experiments on several benchmarks demonstrate that OT-DETECTOR achieves state-of-the-art performance across various OOD detection tasks, particularly in challenging hard-OOD scenarios.
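For readers unfamiliar with optimal transport, the sketch below shows the generic entropy-regularized Sinkhorn iteration. It produces a transport plan between two discrete mass distributions, and the total transport cost follows from that plan. It is not the OT-DETECTOR scoring rule, whose exact definition is not given in the abstract. The cost matrix, the regularization strength `reg`, and the function names are assumptions. In a zero-shot OOD setting the cost could, hypothetically, be a distance between image features and label text embeddings.

```python
import math


def sinkhorn(cost, row_mass, col_mass, reg=0.1, iters=200):
    """Entropy-regularized optimal transport plan via Sinkhorn scaling."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(plan, cost):
    """Total cost of moving mass according to the plan."""
    return sum(p * c
               for plan_row, cost_row in zip(plan, cost)
               for p, c in zip(plan_row, cost_row))


cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, [0.5, 0.5], [0.5, 0.5])
total = transport_cost(plan, cost)
```

In this toy case nearly all mass stays on the zero-cost diagonal, so the total transport cost is close to zero. A sample that matches none of the labels well would incur a higher cost.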
1474: Curriculum Hierarchical Knowledge Distillation for Bias-Free Survival Prediction
Authors: Chaozhuo Li, Zhihao Tang, Mingji Zhang, Zhiquan Liu, Litian Zhang, Xi Zhang
Location: Guangzhou | Day: TBD
Survival prediction is a pivotal task for estimating mortality risk within a given timeframe based on whole slide images (WSIs). Conventional models typically assume that WSIs across patients are independent and identically distributed, an assumption that may not hold due to inherent variability in WSI preparation and the uncertain condition of infected tissues. These uncontrollable external factors introduce significant variability in the numbers and resolutions of WSIs across patients, leading to bias and compromised performance, particularly for tail patients with limited data. In this paper, we propose a novel approach, PathoKD, based on knowledge distillation. Recognizing the hierarchical nature of disease progression and the data scarcity issues associated with vanilla knowledge distillation methods, PathoKD integrates a novel curriculum learning framework with hierarchical knowledge distillation. This integration effectively mitigates the performance gap between head and tail patients, thereby enhancing prediction accuracy across patient groups. Our proposal is extensively evaluated over popular datasets and experimental results demonstrate its superiority.
1476: An Efficient Core-Guided Solver for Weighted Partial MaxSAT
Authors: Shiwei Pan, Yiyuan Wang, Shaowei Cai
Location: Guangzhou | Day: TBD
The maximum satisfiability problem (MaxSAT) is a crucial combinatorial optimization problem with widespread applications across various critical domains. This paper presents CASHWMaxSAT, an efficient core-guided MaxSAT solver based on two novel ideas.
The first and most important idea is the introduction of an extended stratification technique that progressively focuses on solving high-weight soft clauses. Second, we integrate disjoint unsatisfiable cores with the goal of minimizing the unsatisfiable core, allowing the solver to learn multiple high-quality clauses in a single conflict analysis step. These innovations enable our MaxSAT solver to efficiently identify key constraints and reduce redundant reasoning, significantly enhancing solving efficiency. Experimental results on benchmarks from the complete weighted track of the MaxSAT Evaluations 2022-2024 demonstrate that the proposed methods lead to substantial improvements, with CASHWMaxSAT outperforming state-of-the-art MaxSAT solvers across all benchmarks. Additionally, it enabled us to achieve the top two positions in the exact weighted category of the MaxSAT Evaluation 2024.
1480: DiffusionIMU: Diffusion-Based Inertial Navigation with Iterative Motion Refinement
Authors: Xiaoqiang Teng, Chenyang Li, Shibiao Xu, Zhihao Hao, Deke Guo, Jingyuan Li, Haisheng Li, Weiliang Meng, Xiaopeng Zhang
Location: Guangzhou | Day: TBD
Inertial navigation enables self-contained localization using only Inertial Measurement Units (IMUs), making it widely applicable in various domains such as navigation, augmented reality, and robotics. However, existing methods suffer from drift accumulation due to the sensor noise and difficulty capturing long-range temporal dependencies, limiting their robustness and accuracy. To address these challenges, we propose DiffusionIMU, a novel diffusion-based framework for inertial navigation. DiffusionIMU enhances direct velocity regression from IMU data through an iterative generative denoising process, progressively refining motion state estimation. It integrates the noise-adaptive feature modulation for sensor variability handling, the feature alignment mechanism for representation consistency, and the diffusion-based temporal modeling to decrease accumulated drift. Experiments show that DiffusionIMU consistently outperforms existing methods, demonstrating superior generalization to unseen users while alleviating the impact of the sensor noise.
1484: Beyond Feature Mapping GAP: Integrating Real HDRTV Priors for Superior SDRTV-to-HDRTV Conversion
Authors: Gang He, Kepeng Xu, Li Xu, Wenxin Yu, Xianyun Wu
Location: Guangzhou | Day: TBD
The rise of HDR-WCG display devices has highlighted the need to convert SDRTV to HDRTV, as most video sources are still in SDR. Existing methods primarily focus on designing neural networks to learn a single-style mapping from SDRTV to HDRTV. However, the limited information in SDRTV and the diversity of styles in real-world conversions render this process an ill-posed problem, thereby constraining the performance and generalization of these methods. Inspired by generative approaches, we propose a novel method for SDRTV to HDRTV conversion guided by real HDRTV priors. Despite the limited information in SDRTV, introducing real HDRTV as reference priors significantly constrains the solution space of the originally high-dimensional ill-posed problem. This shift transforms the task from solving an unreferenced prediction problem to making a referenced selection, thereby markedly enhancing the accuracy and reliability of the conversion process. Specifically, our approach comprises two stages: the first stage employs a Vector Quantized Generative Adversarial Network to capture HDRTV priors, while the second stage matches these priors to the input SDRTV content to recover realistic HDRTV outputs. We evaluate our method on public datasets, demonstrating its effectiveness with significant improvements in both objective and subjective metrics across real and synthetic datasets.
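The core operation of a vector-quantized model is a nearest-neighbor lookup: each feature vector is replaced by the closest entry in a learned codebook. The sketch below shows that lookup only. How the paper's second stage matches HDRTV priors to SDRTV content is not described in the abstract, so the toy codebook, the squared Euclidean distance, and the function name `quantize` are assumptions for illustration.

```python
def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    Returns (indices, quantized_vectors), using squared Euclidean distance.
    """
    indices = []
    for feature in features:
        distances = [sum((a - b) ** 2 for a, b in zip(feature, code))
                     for code in codebook]
        indices.append(min(range(len(codebook)), key=distances.__getitem__))
    return indices, [codebook[i] for i in indices]


codebook = [[0.0, 0.0], [1.0, 1.0]]            # stand-in for learned priors
features = [[0.1, -0.1], [0.9, 0.8]]           # stand-in for encoder outputs
indices, quantized = quantize(features, codebook)
```

Because the output is restricted to codebook entries, the prediction becomes a selection among stored prototypes rather than an unconstrained regression. This mirrors the "referenced selection" framing in the abstract.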
1499: Progressive Prefix-Memory Tuning for Complex Logical Query Answering on Knowledge Graphs
Authors: Xingrui Zhuo, Shirui Pan, Jiapu Wang, Gongqing Wu, Zan Zhang, Rui Li, Zizhong Wei, Xindong Wu
Location: Guangzhou | Day: TBD
Conducting complex logical queries over knowledge graphs remains a significant challenge. Recent research has successfully leveraged Pre-trained Language Models (PLMs) to tackle Knowledge Graph Complex Query Answering (KGCQA) tasks, which is attributed to PLMs’ ability to comprehend logical semantics of queries through context learning. However, existing PLM-based KGCQA methods usually overlook the harm of disordered syntax or fragmented contexts within a serialized query, posing the problem of “impossible language” to limit PLMs in grasping the logical semantics. To address this problem, we propose a Progressive Prefix-Memory Tuning (PPMT) framework for KGCQA tasks, which effectively rectifies erroneous segments in serialized queries to assist PLMs in query answering. First, we propose a prefix-memory rectification mechanism embedded in a PLM module. This mechanism assigns rectification parameters in memory stores to polish the language segments of entities, relations, and queries through specific prefixes. To further capture the logical semantics in queries, we design a progressive fine-tuning strategy, which optimizes our model through a conditional gradient update process guided by knowledge translation constraints. Extensive experiments on widely used KGCQA benchmarks demonstrate the significant superiority of PPMT in terms of HR@3 and MRR. Our codes are available at https://github.com/lazyloafer/PPMT.
1521: Causal View of Time Series Imputation: Some Identification Results on Missing Mechanism
Authors: Ruichu Cai, Kaitao Zheng, Junxian Huang, Zijian Li, Zhengming Chen, Boyan Xu, Zhifeng Hao
Location: Guangzhou | Day: TBD
Time series imputation is one of the most challenging problems and has broad applications in various fields like health care and the Internet of Things. Existing methods mainly aim to model the temporally latent dependencies and the generation process from the observed time series data. In real-world scenarios, different types of missing mechanisms, like MAR (Missing At Random) and MNAR (Missing Not At Random), can occur in time series data. However, existing methods often overlook the differences among the aforementioned missing mechanisms and use a single model for time series imputation, which can easily lead to misleading results due to mechanism mismatching. In this paper, we propose a framework for the time series imputation problem by exploring Different Missing Mechanisms (DMM in short) and tailoring solutions accordingly. Specifically, we first analyze the data generation processes with temporal latent states and missing cause variables for different mechanisms. Subsequently, we model these generation processes via variational inference and estimate prior distributions of latent variables via a normalizing flow-based neural architecture. Furthermore, we establish identifiability results under the nonlinear independent component analysis framework to show that latent variables are identifiable. Experimental results show that our method surpasses existing time series imputation techniques across various datasets with different missing mechanisms, demonstrating its effectiveness in real-world applications.
1523: The Devil is in Fine-tuning and Long-tailed Problems: A New Benchmark for Scene Text Detection
Authors: Tianjiao Cao, Jiahao Lyu, Weichao Zeng, Weimin Mu, Yu Zhou
Location: Montreal | Day: August 21st | Time: 11:30 | Session: CV: Benchmarks
Scene text detection has seen the emergence of high-performing methods that excel on academic benchmarks. However, these detectors often fail to replicate such success in real-world scenarios. We uncover two key factors contributing to this discrepancy through extensive experiments. First, a Fine-tuning Gap, where models leverage the Dataset-Specific Optimization (DSO) paradigm for one domain at the cost of reduced effectiveness in others, leads to inflated performances on academic benchmarks. Second, the suboptimal performance in practical settings is primarily attributed to the long-tailed distribution of texts, where detectors struggle with rare and complex categories such as artistic or overlapped text. Given that the DSO paradigm might undermine the generalization ability of models, we advocate for a Joint-Dataset Learning (JDL) protocol to alleviate the Fine-tuning Gap. Additionally, an error analysis is conducted to identify three major categories and 13 subcategories of challenges in long-tailed scene text, upon which we propose a Long-Tailed Benchmark (LTB). LTB facilitates a comprehensive evaluation of the ability to handle a diverse range of long-tailed challenges. We further introduce MAEDet, a self-supervised learning-based method, as a strong baseline for LTB. The code is available at https://github.com/pd162/LTB.
1532: Critical Node-aware Augmentation for Hypergraph Contrastive Learning
Authors: Zhuo Li, Yuena Lin, Yipeng Wang, Wenmao Liu, Mingliang Yu, Zhen Yang, Gengyu Lyu
Location: Guangzhou | Day: TBD
Hypergraph contrastive learning enables effective representation learning for hypergraphs without requiring labels. However, existing methods typically rely on randomly deleting or replacing nodes during hypergraph augmentation, which may lead to the absence of critical nodes and further disrupt the higher-order structural relationships within augmented hypergraphs. To address this issue, we propose a Critical Node-aware hypergraph contrastive learning method, which is the first attempt to leverage hyperedge prediction to retain critical nodes and accordingly maintain the reliable higher-order structural relationships within augmented hypergraphs. Specifically, we first employ contrastive learning to align the augmented hypergraphs, and then generate hyperedge embeddings to characterize node representations and their structural correlations. During the hyperedge embedding encoding process, we introduce a hyperedge prediction discriminator to score these embeddings, which quantifies the nodes’ contributions to identify the critical nodes and maintain the higher-order structural relationships within augmented hypergraphs. Compared with previous studies, our proposed method can effectively alleviate the erroneous deletion or replacement of critical nodes and steadily maintain the inherent structural relationships between original hypergraph and augmented hypergraphs, naturally guiding better hypergraph representations for downstream tasks. Extensive experiments on various tasks demonstrate that our method is significantly superior to state-of-the-art methods.
1533: Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models
Authors: Xin Huang, Ruibin Li, Tong Jia, Wei Zheng, Ya Wang
Location: Guangzhou | Day: TBD
Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model, thereby enhancing its overall performance. AHNPL also introduces a contrastive learning approach using a multimodal hard negative loss to improve the model’s discrimination of hard negatives within each modality and a dynamic margin loss that adjusts the contrastive margin according to sample difficulty to enhance the distinction of challenging sample pairs. Experiments on three public datasets demonstrate that our method effectively boosts VLMs’ performance on complex CR tasks. The source code is available at https://github.com/nynu-BDAI/AHNPL.
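The abstract does not give the form of the dynamic margin loss. The following is one possible reading, sketched under explicit assumptions. It is a hinge-style contrastive loss whose margin grows with the similarity of the hard negative, so that harder negatives must be pushed further from the positive. The difficulty measure, the linear margin schedule, the default constants, and the name `dynamic_margin_loss` are all hypothetical and should not be taken as AHNPL's actual formulation.

```python
def dynamic_margin_loss(pos_sims, neg_sims, base_margin=0.1, scale=0.2):
    """Mean hinge loss with a margin that depends on negative difficulty.

    pos_sims / neg_sims: cosine similarities in [-1, 1] for matched pairs
    and for hard negatives. Difficulty is assumed to be the negative's
    similarity rescaled to [0, 1].
    """
    losses = []
    for s_pos, s_neg in zip(pos_sims, neg_sims):
        difficulty = max(0.0, min(1.0, (s_neg + 1.0) / 2.0))
        margin = base_margin + scale * difficulty
        # Penalize only when the positive does not beat the negative by `margin`.
        losses.append(max(0.0, margin + s_neg - s_pos))
    return sum(losses) / len(losses)


easy = dynamic_margin_loss([0.9], [-1.0])   # negative is far away
hard = dynamic_margin_loss([0.5], [0.5])    # negative ties with positive
```

An easy pair contributes nothing, while a hard pair is penalized by its full, enlarged margin.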
1545: Adaptive Deep Learning from Crowds
Authors: Hang Yang, Zhiwu Li, Witold Pedrycz
Location: Guangzhou | Day: TBD
In the data-driven era, crowdsourcing, i.e., collecting high-quality labeled data through human labor, is a common approach for training data-hungry models. Recently, end-to-end learning from crowds has shown its flexibility and practicality. However, existing end-to-end works focus on learning after all labels have been collected, which results in noisy annotations and incurs annotation cost. Inspired by computerized adaptive testing, we argue that the characteristics of workers should be mined as soon as possible to make the best use of talents. To this end, we propose an adaptive learning from crowds method, AdaCrowd, as a cost-effective solution. Specifically, we propose a probabilistic model to capture the informativeness of possible instances for each worker. The informativeness is considered to be the uncertainty of the annotation prediction model output in its current status. The adaptive learning procedure is optimized by maximizing data likelihood and can be used with existing crowdsourcing models. Extensive experiments are conducted on real-world datasets, LabelMe and CIFAR-10H. The experimental results, e.g., the reduction of annotations without performance degradation, demonstrate the effectiveness of AdaCrowd.
1551: Mixture-of-Queries Transformer: Camouflaged Instance Segmentation via Queries Cooperation and Frequency Enhancement
Authors: Weiwei Feng, Nanqing Xu, Tengfei Liu, Weiqiang Wang
Location: Guangzhou | Day: TBD
Due to the high similarity between camouflaged instances and the surroundings and the widespread camouflage-like scenarios, the recently proposed camouflaged instance segmentation (CIS) is a challenging and relevant task. Previous approaches achieve some progress on CIS, while many overlook camouflaged objects’ color and contour nature and then decide on each candidate instinctively. In this paper, we contribute a Mixture-of-Queries Transformer (MoQT) in an end-to-end manner for CIS based on two key designs (a Frequency Enhancement Feature Extractor and a Mixture-of-Queries Decoder). First, the Frequency Enhancement Feature Extractor is responsible for capturing the camouflaged clues in the frequency domain. To expose camouflaged instances, the extractor enhances the effectiveness of contour, eliminates the interference color, and obtains suitable features simultaneously. Second, a Mixture-of-Queries Decoder utilizes multiple newly initialized experts of queries (a group of queries considered an expert) in each layer for spotting camouflaged characteristics with cooperation. These experts collaborate to generate outputs with the mixture-of-queries mechanism, refined hierarchically to a fine-grained level for more accurate instance masks. Coupling these two components enables MoQT to use multiple experts to integrate effective clues of camouflaged objects in both spatial and frequency domains. Extensive experimental results demonstrate our MoQT outperforms 19 state-of-the-art CIS approaches on both COD10K and NC4K datasets.
1555: BankTweak: Adversarial Attack Against Multi-Object Trackers by Manipulating Feature Banks
Authors: Woojin Shin, Donghwa Kang, Daejin Choi, Brent Byunghoon Kang, Jinkyu Lee, Hyeongboo Baek
Location: Montreal | Day: August 21st | Time: 10:00 | Session: CV: attacks
Modern multi-object tracking (MOT) predominantly relies on the tracking-by-detection paradigm to construct object trajectories. Traditional MOT attacks primarily degrade detection quality in specific frames only, lacking efficiency, while state-of-the-art (SOTA) approaches induce persistent identity (ID) switches by manipulating object positions during the association phase, even after the attack ends. In this paper, we reveal that these SOTA attacks can be easily counteracted by adjusting distance-related parameters in the association phase, exposing their lack of robustness. To overcome these limitations, we propose BankTweak, a novel adversarial attack targeting feature-based MOT systems to induce persistent ID switches (efficiency) without modifying object positions (robustness). BankTweak exploits a critical vulnerability in the Hungarian matching algorithm of MOT systems by strategically injecting altered features into feature banks during the association phase. Extensive experiments on MOT17 and MOT20 datasets, combining various detectors, feature extractors, and trackers, demonstrate that BankTweak significantly outperforms SOTA attacks up to 11.8 times, exposing fundamental vulnerabilities in the tracking-by-detection framework.
1571: Circuit-Aware d-DNNF Compilation
Authors: Vincent Derkinderen, Jean-Marie Lagniez
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Knowledge Representation and Reasoning (4/4)
Boolean circuits in d-DNNF (deterministic Decomposable Negation Normal Form) enable tractable probabilistic inference, motivating research into compilers that transform arbitrary Boolean circuits into this form. However, d-DNNF compilers commonly require the input to be in conjunctive normal form (CNF), which means that a user must first convert their Boolean circuit into CNF. In this work, we argue that d-DNNF compilation would substantially benefit from reasoning over the original input circuit’s structure, rather than solely relying on its CNF representation. To this end, we adapt an existing compiler and implement an optimisation that becomes more readily available once we reason over the input circuit: the identification and elimination of don’t care variables. We empirically demonstrate the effectiveness of this approach, achieving a significant improvement in both the number of solved instances and the size of the resulting circuits.
1579: Cyclic Vision-Language Manipulator: Towards Reliable and Fine-Grained Image Interpretation for Automated Report Generation
Authors: Yingying Fang, Zihao Jin, Shaojie Guo, Jinda Liu, Zhiling Yue, Yijian Gao, Junzhi Ning, Zhi Li, Simon Walsh, Guang Yang
Location: Montreal | Day: August 21st | Time: 10:00 | Session: AI Ethics, Trust, Fairness (2/3)
Despite significant advancements in automated report generation, the opaqueness of text interpretability continues to cast doubt on the reliability of the content produced. This paper introduces a novel approach to identify specific image features in X-ray images that influence the outputs of report generation models. Specifically, we propose Cyclic Vision-Language Manipulator (CVLM), a module to generate a manipulated X-ray from an original X-ray and its report from a designated report generator. The essence of CVLM is that cycling manipulated X-rays to the report generator produces altered reports aligned with the alterations pre-injected into the reports for X-ray generation, achieving the term “cyclic manipulation”. This process allows direct comparison between original and manipulated X-rays, clarifying the critical image features driving changes in reports and enabling model users to assess the reliability of the generated texts. Empirical evaluations demonstrate that CVLM can identify more precise and reliable features compared to existing explanation methods, significantly enhancing the transparency and applicability of AI-generated reports.
1585: Zero-shot Generalist Graph Anomaly Detection with Unified Neighborhood Prompts
Authors: Chaoxi Niu, Hezhe Qiao, Changlu Chen, Ling Chen, Guansong Pang
Location: Montreal | Day: August 21st | Time: 15:00 | Session: DM: Graph Data Mining
Graph anomaly detection (GAD), which aims to identify nodes in a graph that significantly deviate from normal patterns, plays a crucial role in broad application domains. However, existing GAD methods are one-model-for-one-dataset approaches, i.e., training a separate model for each graph dataset. This largely limits their applicability in real-world scenarios. To overcome this limitation, we propose a novel zero-shot generalist GAD approach UNPrompt that trains a one-for-all detection model, requiring the training of one GAD model on a single graph dataset and then effectively generalizing to detect anomalies in other graph datasets without any retraining or fine-tuning. The key insight in UNPrompt is that i) the predictability of latent node attributes can serve as a generalized anomaly measure and ii) generalized normal and abnormal graph patterns can be learned via latent node attribute prediction in a properly normalized node attribute space. UNPrompt achieves a generalist mode for GAD through two main modules: one module aligns the dimensionality and semantics of node attributes across different graphs via coordinate-wise normalization, while another module learns generalized neighborhood prompts that support the use of latent node attribute predictability as an anomaly score across different datasets. Extensive experiments on real-world GAD datasets show that UNPrompt significantly outperforms diverse competing methods under the generalist GAD setting, and it also has strong superiority under the one-model-for-one-dataset setting. Code is available at https://github.com/mala-lab/UNPrompt.
1588: GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer
Authors: Yihong Lin, Zhaoxin Fan, Xianjia Wu, Lingyu Xiong, Xiandong Li, Wenxiong Kang, Liang Peng, Songju Lei, Huang Xu
Location: Guangzhou | Day: TBD
Speech-driven talking head generation is a critical yet challenging task with applications in augmented reality and virtual human modeling. While recent approaches using autoregressive and diffusion-based models have achieved notable progress, they often suffer from modality inconsistencies, particularly misalignment between audio and mesh, leading to reduced motion diversity and lip-sync accuracy. To address this, we propose GLDiTalker, a novel speech-driven 3D facial animation model based on a Graph Latent Diffusion Transformer. GLDiTalker resolves modality misalignment by diffusing signals within a quantized spatiotemporal latent space. It employs a two-stage training pipeline: the Graph-Enhanced Quantized Space Learning Stage ensures lip-sync accuracy, while the Space-Time Powered Latent Diffusion Stage enhances motion diversity. Together, these stages enable GLDiTalker to generate realistic, temporally stable 3D facial animations. Extensive evaluations on standard benchmarks demonstrate that GLDiTalker outperforms existing methods, achieving superior results in both lip-sync accuracy and motion diversity.
1596: Guaranteed Top-Adaptive-K in Recommendation
Authors: Nitin Bisht, Xiuwen Gong, Guandong Xu
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Machine Learning (1/4)
Recommender systems (RS) are crucial in offering personalized suggestions tailored to user preferences. While the Top-K recommendation approach is conventionally widely adopted, its reliance on fixed recommendation sizes overlooks the diverse needs of users, leading to some relevant items not being recommended or vice versa. While recent work has made progress, it determines K by searching over all possible recommendation sizes for each user during inference. In real-world scenarios, with large datasets and numerous users with diverse and extensive preferences, this process becomes computationally impractical. Moreover, there is no theoretical guarantee of improved performance with the personalized K. In this paper, we propose a novel framework, Top-Adaptive-K, which determines a dynamic K-prediction set size for each user efficiently and effectively. Generally, the framework formulates the recommendation problem within the Conformal Risk Control paradigm and proposes the loss function based on user utility functions. A novel greedy optimization algorithm, K-Adapt, is designed to efficiently learn prediction sets. Theoretical analysis is provided to ensure recommendation performance by establishing upper bounds on the expected risk. Extensive experiments on multiple datasets demonstrate that the Top-Adaptive-K framework outperforms baseline methods in both performance and time efficiency, offering a guaranteed solution to the fixed Top-K challenges.
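To illustrate the Conformal Risk Control paradigm mentioned above, here is a minimal sketch of the generic calibration step, not the paper's K-Adapt algorithm. It assumes a loss bounded by 1 (the fraction of a user's relevant items missed), a hypothetical calibration set of (scores, relevant items) pairs, and a score threshold `t` that defines each user's prediction set. All names and data are illustrative.

```python
def miss_rate(scores, relevant, t):
    """Fraction of a user's relevant items whose score falls below threshold t."""
    missed = sum(1 for item in relevant if scores[item] < t)
    return missed / len(relevant)


def calibrate_threshold(calib, alpha, grid):
    """Largest threshold on the grid whose adjusted empirical risk is <= alpha.

    Uses the standard conformal risk control adjustment (n * risk + B) / (n + 1)
    with loss upper bound B = 1. Returns None if no grid value qualifies.
    """
    n = len(calib)
    for t in sorted(grid, reverse=True):  # the loss shrinks as t decreases
        risk = sum(miss_rate(s, r, t) for s, r in calib) / n
        if (n * risk + 1.0) / (n + 1) <= alpha:
            return t
    return None


# Hypothetical calibration users: item scores and their truly relevant items.
calib = [
    ({"a": 0.9, "b": 0.6, "c": 0.2}, {"a", "b"}),
    ({"a": 0.3, "b": 0.8, "c": 0.7}, {"b"}),
    ({"a": 0.5, "b": 0.4, "c": 0.9}, {"c", "a"}),
    ({"a": 0.7, "b": 0.1, "c": 0.6}, {"a"}),
]
grid = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
t_hat = calibrate_threshold(calib, alpha=0.35, grid=grid)

# Each user's recommendation set is every item scoring at least t_hat,
# so the set size K varies from user to user.
sizes = [sum(1 for v in scores.values() if v >= t_hat) for scores, _ in calib]
print(t_hat, sizes)
```

A single calibrated threshold yields different set sizes per user, which is one simple way a risk-controlled procedure can replace a fixed K.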
1600: Polynomial-Time Relational Probabilistic Inference in Open Universes
Authors: Luise Ge, Brendan Juba, Kris Nilsson
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Uncertainty in AI
Reasoning under uncertainty is a fundamental challenge in Artificial Intelligence. As with most of these challenges, there is a harsh dilemma between the expressive power of the language used, and the tractability of the computational problem posed by reasoning. Inspired by human reasoning, we introduce a method of first-order relational probabilistic inference that satisfies both criteria, and can handle hybrid (discrete and continuous) variables. Specifically, we extend sum-of-squares logic of expectation to relational settings, demonstrating that lifted reasoning in the bounded-degree fragment for knowledge bases of bounded quantifier rank can be performed in polynomial time, even with an a priori unknown and/or countably infinite set of objects. Crucially, our notion of tractability is framed in proof-theoretic terms, which extends beyond the syntactic properties of the language or queries. We are able to derive the tightest bounds provable by proofs of a given degree and size and establish completeness in our sum-of-squares refutations for fixed degrees.
1607: SE(3)-Equivariant Diffusion Models for 3D Object Analysis
Authors: Xie Min, Zhao Jieyu, Shen Kedi, Chen Kangxin
Location: Guangzhou | Day: TBD
SE(3)-equivariance is a critical property for capturing pose information in 3D vision tasks, enabling models to handle transformations such as rotations and translations effectively. While equivariant diffusion models have recently demonstrated promise in 3D object reassembly due to their generative and denoising capabilities, they face key challenges when applied to this task. Specifically, traditional diffusion models rely on fixed input sizes, which limits their adaptability to varying part quantities, and their linear noise addition and removal processes struggle to address the inherently nonlinear transformations of 3D parts. To overcome these limitations, this paper proposes an SE(3)-equivariant diffusion model for pose denoising and 3D object reassembly from fragmented parts. The model incorporates an equivariant encoder to extract SE(3)-equivariant features, a Lie algebra mapping to linearize noise addition and removal, and an elastic diffusion framework capable of adapting to varying part quantities and nonlinear transformations. By leveraging these components, the method achieves accurate and robust pose predictions across diverse input configurations. Experiments conducted on the Breaking Bad dataset, the real-world RePAIR dataset, and a self-constructed 3D mannequin dataset demonstrate the effectiveness of the proposed model, which outperforms state-of-the-art methods across metrics such as root mean square error and part accuracy. Ablation studies further validate the critical contributions of key modules, emphasizing their roles in improving accuracy and robustness for 3D part reassembly tasks.
1625: CASA: CNN Autoencoder-based Score Attention for Efficient Multivariate Long-term Time-series Forecasting
Authors: Minhyuk Lee, Hyekyung Yoon, MyungJoo Kang
Location: Montreal | Day: August 20th | Time: 14:00 | Session: ML: time series, sequences and signals
Multivariate long-term time series forecasting is critical for applications such as weather prediction and traffic analysis. In addition, the implementation of Transformer variants has improved prediction accuracy. Following these variants, different input data processing approaches have also advanced the field, such as tokenization techniques including point-wise, channel-wise, and patch-wise tokenization. However, previous studies still have limitations in time complexity, computational resources, and cross-dimensional interactions. To address these limitations, we introduce a novel CNN Autoencoder-based Score Attention mechanism (CASA), which can be integrated into diverse Transformers in a model-agnostic manner, reducing memory usage and improving model performance. Experiments on eight real-world datasets validate that CASA decreases computational resources by up to 77.7%, accelerates inference by 44.0%, and achieves state-of-the-art performance, ranking first in 87.5% of evaluated metrics. Our code is available at https://github.com/lmh9507/CASA.
1626: Evaluating and Mitigating Linguistic Discrimination in Large Language Models: Perspectives on Safety Equity and Knowledge Equity
Authors: Guoliang Dong, Haoyu Wang, Jun Sun, Xinyu Wang
Location: Montreal | Day: August 21st | Time: 10:00 | Session: AI Ethics, Trust, Fairness (2/3)
Large language models (LLMs) typically provide multilingual support and demonstrate remarkable capabilities in solving tasks described in different languages. However, LLMs can exhibit linguistic discrimination due to the uneven distribution of training data across languages. That is, LLMs struggle to maintain consistency when handling the same task in different languages, compromising both safety equity and knowledge equity. In this paper, we first systematically evaluate the linguistic discrimination of LLMs from two aspects: safety and quality, using a form of metamorphic testing. The metamorphic relationship we examine is that LLMs are expected to deliver outputs with similar semantics when prompted with inputs that have the same meaning. We conduct this evaluation with two datasets based on four representative LLMs. The results show that LLMs exhibit stronger human alignment capabilities with queries in English, French, Russian, and Spanish compared to queries in Bengali, Georgian, Nepali and Maithili. Moreover, for queries in English, Danish, Czech and Slovenian, LLMs tend to produce responses with a higher quality compared to the other languages. Upon these findings, we propose LDFighter, a similarity-based voting method, to mitigate the linguistic discrimination in LLMs. We comprehensively evaluate LDFighter against a spectrum of queries including benign, harmful, and adversarial prompts. The results show that LDFighter significantly reduces jailbreak success rates and improves response quality. All code, data, and the technical appendix are publicly available at: https://github.com/dgl-prc/ldfighter.
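The abstract names similarity-based voting without giving details, so the following is only a rough sketch of what such a vote could look like. It assumes several candidate responses to the same query are already available in one common language, and it swaps in word-level Jaccard similarity as a stand-in for whatever semantic similarity LDFighter actually uses. The function names and example strings are hypothetical.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings (a crude stand-in)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def vote(responses):
    """Return the response with the highest average similarity to the others.

    Assumes at least two candidate responses.
    """
    def avg_similarity(i):
        sims = [jaccard(responses[i], r) for j, r in enumerate(responses) if j != i]
        return sum(sims) / len(sims)

    best = max(range(len(responses)), key=avg_similarity)
    return responses[best]


# Hypothetical responses to one query, already rendered in a common language.
responses = [
    "I cannot help with that request",
    "I cannot help with that",
    "Sorry, I cannot help",
    "Sure, here are the steps",
]
print(vote(responses))
```

The outlier response receives no support from the others and loses the vote, which is the intuition behind using agreement across languages to suppress inconsistent outputs.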
1636: Decision-Aware Preference Modeling for Multi-Behavior Recommendation
Authors: Qingfeng Li, Wei Liu, Zaiqiao Meng, Jian Yin
Location: Guangzhou | Day: TBD
In recommender systems, multi-behavior methods have demonstrated significant effectiveness in addressing issues such as data sparsity—challenges commonly encountered by traditional single-behavior recommendation methods. These methods typically infer user preferences from various auxiliary behaviors and apply them to recommendations for the target behavior. However, existing methods face challenges in uncovering the interaction patterns for different behaviors from multi-behavior implicit feedback, as users exhibit varying preference strengths for different items across behaviors. To address this issue, this paper introduces a novel approach, Decision-Aware Preference Modeling (DAPM), for multi-behavior recommendation. We first construct a behavior-agnostic graph to learn comprehensive representations that are not affected by behavior factors, complementing the behavior-specific representations. Subsequently, we introduce an innovative contrastive learning paradigm that emphasizes inter-behavior consistency and intra-behavior uniformity to alleviate the “false repulsion” problem in traditional contrastive learning. Furthermore, we propose a multi-behavior hinge loss with boundary constraints to explicitly model users’ decision boundaries across different behaviors, thereby enhancing the model’s ability to accurately capture users’ inconsistent preference intensities. Extensive experiments on three real-world datasets demonstrate the consistent improvements achieved by DAPM over thirteen state-of-the-art baselines. We release our code at https://github.com/Breeze-del/DAPM.
1646: G3PT: Unleash the Power of Autoregressive Modeling in 3D Generation via Cross-Scale Querying Transformer
Authors: Jinzhi Zhang, Feng Xiong, Guangyu Wang, Mu Xu
Location: Guangzhou | Day: TBD
Autoregressive transformers have revolutionized generative models in language processing and shown substantial promise in image and video generation. However, these models face significant challenges when extended to 3D generation tasks due to their reliance on next-token prediction to learn token sequences, which is incompatible with the unordered nature of 3D data. Instead of imposing an artificial order on 3D data, in this paper, we introduce G3PT, a scalable, coarse-to-fine 3D native generative model with cross-scale vector quantization and cross-scale autoregressive modeling. The key is to map point-based 3D data into discrete tokens with different levels of detail, naturally establishing a sequential relationship across a variety of scales suitable for autoregressive modeling. Remarkably, our method connects tokens globally across different levels of detail without manually specified ordering. Benefiting from this approach, G3PT features a versatile 3D generation pipeline that effortlessly supports the generation of 3D shapes under diverse conditional modalities. Extensive experiments demonstrate that G3PT achieves superior generation quality and generalization ability compared to previous baselines. Most importantly, for the first time in 3D generation, scaling up G3PT reveals distinct power-law scaling behaviors.
1649: Picturized and Recited with Dialects: A Multimodal Chinese Representation Framework for Sentiment Analysis of Classical Chinese Poetry
Authors: Xiaocong Du, Haoyu Pei, Haipeng Zhang
Location: Guangzhou | Day: TBD
Classical Chinese poetry is a vital and enduring part of Chinese literature, conveying profound emotional resonance. Existing studies analyze sentiment based on textual meanings, overlooking the unique rhythmic and visual features inherent in poetry, especially since it is often recited and accompanied by Chinese paintings. In this work, we propose a dialect-enhanced multimodal framework for classical Chinese poetry sentiment analysis. We extract sentence-level audio features from the poetry and incorporate audio from multiple dialects, which may retain regional ancient Chinese phonetic features, enriching the phonetic representation. Additionally, we generate sentence-level visual features, and the multimodal features are fused with textual features enhanced by LLM translation through multimodal contrastive representation learning. Our framework outperforms state-of-the-art methods on two public datasets, achieving at least 2.51% improvement in accuracy and 1.63% in macro F1. We open-source the code to facilitate research in this area and provide insights for general multimodal Chinese representation.
1655: BTPG: A Platform and Benchmark for Behavior Tree Planning in Everyday Service Robots
Authors: Xinglin Chen, Yishuai Cai, Minglong Li, Yunxin Mao, Zhou Yang, Wenjing Yang, Weixia Xu, Ji Wang
Location: Guangzhou | Day: TBD
Behavior Trees (BTs) are a widely used control architecture in robotics, renowned for their robustness and safety, which are especially crucial for everyday service robots. Recently, several methods have been proposed to automatically plan BTs to accomplish specific tasks. However, existing research in BT planning has two main shortcomings: (1) the absence of a standard platform for modeling and planning BTs, along with testing benchmarks; and (2) insufficient metrics for a comprehensive evaluation of BT planning algorithms. In this paper, we propose Behavior Tree Planning Gym (BTPG), the first platform and benchmark for BT planning in everyday service robots. In BTPG, behavior nodes are represented by predicate logic, and objects are categorized to better define the predicate domains and action models. The BT planning problem is then formulated in the STRIPS style. We support four environments and three simulators with different action models, which cover most of the needs of everyday service activities. We design a dataset generator for each environment and test three state-of-the-art BT planning algorithms, as well as one proposed by us, using various common metrics. In addition, we design three advanced metrics, planning progress, region distance, and execution robustness, to gain deeper insights into these BT planning algorithms. With a standard test benchmark, we hope BTPG can inspire and accelerate progress in the field of BT planning. Our code is available at https://github.com/DIDS-EI/BTPG.
1675: A Game-Theoretic Perspective on Inconsistency Handling
Authors: Yakoub Salhi
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Knowledge Representation and Reasoning (4/4)
This paper introduces a game-theoretic framework for restoring consistency in propositional bases. The process is modeled as an interactive dialogue between two agents: a Proponent, who seeks to isolate a unique, consistent subset by posing strategic questions, and an Opponent, who aims to obstruct that goal through adversarial responses. We show that this framework provides a foundation for quantifying the effort involved in restoring consistency, revealing a connection between this effort and entropy in information theory. Focusing on the case where consistency is achieved by isolating a single maximal consistent subset, we establish links between the structure and number of such subsets and the existence of winning strategies. Finally, we demonstrate how the quantified restoration effort can serve as a basis for measuring inconsistency.
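Since the abstract ties winning strategies to the structure and number of maximal consistent subsets, a brute-force enumeration of those subsets may help make the object concrete. This is a minimal sketch under illustrative assumptions: formulas are modeled as Python predicates over a truth assignment, consistency is checked by trying every assignment, and the example base is hypothetical. It does not implement the dialogue game itself.

```python
from itertools import combinations, product


def consistent(formulas, variables):
    """True if some truth assignment satisfies every formula (brute force)."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(f(assignment) for f in formulas):
            return True
    return False


def maximal_consistent_subsets(base, variables):
    """Enumerate maximal consistent subsets of the base, as sets of indices."""
    found = []
    # Visit larger subsets first, so any consistent subset that is not
    # contained in an already found one must itself be maximal.
    for size in range(len(base), -1, -1):
        for idxs in combinations(range(len(base)), size):
            candidate = set(idxs)
            if any(candidate < m for m in found):
                continue
            if consistent([base[i] for i in idxs], variables):
                found.append(candidate)
    return found


# Hypothetical inconsistent base: p, not p, p -> q, not q.
base = [
    lambda a: a["p"],
    lambda a: not a["p"],
    lambda a: (not a["p"]) or a["q"],
    lambda a: not a["q"],
]
mcs = maximal_consistent_subsets(base, ["p", "q"])
print(mcs)
```

For this base there are three maximal consistent subsets, so a Proponent who wants to isolate exactly one of them has to distinguish among three candidates.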
1685: Model Rake: A Defense Against Stealing Attacks in Split Learning
Authors: Qinbo Zhang, Xiao Yan, Yanfeng Zhao, Fangcheng Fu, Quanqing Xu, Yukai Ding, Xiaokai Zhou, Chuang Hu, Jiawei Jiang
Location: Guangzhou | Day: TBD
Split learning is a prominent framework for vertical federated learning, where multiple clients collaborate with a central server for model training by exchanging intermediate embeddings. Recently, it has been shown that an adversarial server can exploit the intermediate embeddings to train surrogate models to replace the bottom models on the clients (i.e., model stealing). The surrogate models can also be used to reconstruct private training data of the clients (i.e., data stealing). To defend against these stealing attacks, we propose Model Rake (i.e., Rake), which runs two bottom models on each client and differentiates their output spaces to make the two models distinct. Rake hinders the stealing attacks because it is difficult for a surrogate model to approximate two distinct bottom models. We prove that, under some assumptions, the surrogate model converges to the average of the two bottom models and thus will be inaccurate. Extensive experiments show that Rake is much more effective than existing methods in defending against both model and data stealing attacks, and the accuracy of normal model training is not affected.
1691: Deep Learning-Based Pedestrian Simulation with Limited Real-World Training Data: An Evaluation Framework
Authors: Vahid Mahzoon, Abigail Liu, Slobodan Vucetic
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Agent-based and Multi-agent Systems (2/3)
Simulating pedestrian movement is important for applications such as disaster management, robotics, and game design. While deep learning models have been extensively used on related problems, their use as pedestrian simulators remains relatively unexplored. This paper aims to encourage more research in this direction in two ways. First, it proposes an evaluation framework that is applicable to both traditional and deep learning based simulators. Second, it proposes and evaluates several ideas related to input representation, choice of neural architecture, exploiting knowledge-based simulators in data poor regimes, and repurposing trajectory prediction models. Our extensive experiments provide several useful insights for future research in pedestrian simulation. The code is available at https://github.com/vmahzoon76/DL-Crowd-Sim.
1692: Approximately EFX and fPO Allocations for Bivalued Chores
Authors: Zehan Lin, Xiaowei Wu, Shengwei Zhou
Location: Guangzhou | Day: TBD
We consider the computation for allocations of indivisible chores that are approximately EFX and fractional Pareto optimal (fPO). It has been shown that 3-EFX and fPO allocations for bi-valued instances always exist, where the cost of an item to an agent is either 1 or k (where k > 1), by rounding the (fractional) earning restricted equilibrium. In this work, we improve the approximation ratio to (2-1/k), while preserving the fractional Pareto optimality. Instead of rounding fractional equilibrium, our algorithm starts with the integral EF1 equilibrium for bi-valued chores and reallocates items until approximate EFX is achieved. We further improve our result for the case when k=2 and devise an algorithm that computes EFX and fPO allocations.
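For readers unfamiliar with approximate EFX for chores, the sketch below checks the usual condition: after removing any single chore from an agent's own bundle, the remaining cost is at most alpha times that agent's cost for any other bundle. It assumes additive costs and uses a hypothetical bi-valued instance with k = 3. It verifies a given allocation only, checks nothing about fPO, and is not the paper's algorithm.

```python
def is_alpha_efx(costs, allocation, alpha):
    """Check alpha-EFX for chores under additive costs.

    costs[i][e] is agent i's cost for chore e; allocation[i] is agent i's bundle.
    """
    n = len(allocation)
    for i in range(n):
        own = sum(costs[i][e] for e in allocation[i])
        for e in allocation[i]:
            reduced = own - costs[i][e]  # own cost after dropping chore e
            for j in range(n):
                if j == i:
                    continue
                other = sum(costs[i][x] for x in allocation[j])
                if reduced > alpha * other:
                    return False
    return True


# Hypothetical bi-valued instance: every cost is either 1 or k.
k = 3
costs = [
    [1, 1, k, k],  # agent 0
    [1, 1, k, 1],  # agent 1
]
allocation = [{0, 1}, {2, 3}]

print(is_alpha_efx(costs, allocation, 1.0))        # exact EFX
print(is_alpha_efx(costs, allocation, 2 - 1 / k))  # the (2-1/k) ratio
```

In this instance agent 1 still holds cost 3 after dropping the cheap chore but sees cost 2 in the other bundle, so exact EFX fails while the (2-1/k) relaxation holds.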
1698: Long-Term Individual Causal Effect Estimation via Identifiable Latent Representation Learning
Authors: Ruichu Cai, Junjie Wan, Weilin Chen, Zeqin Yang, Zijian Li, Peng Zhen, Jiecheng Guo
Location: Guangzhou | Day: TBD
Estimating long-term causal effects by combining long-term observational and short-term experimental data is a crucial but challenging problem in many real-world scenarios. In existing methods, several ideal assumptions, e.g. latent unconfoundedness assumption or additive equi-confounding bias assumption, are proposed to address the latent confounder problem raised by the observational data. However, in real-world applications, these assumptions are typically violated, which limits their practical effectiveness. In this paper, we tackle the problem of estimating the long-term individual causal effects without the aforementioned assumptions. Specifically, we propose to utilize the natural heterogeneity of data, such as data from multiple sources, to identify latent confounders, thereby significantly avoiding reliance on idealized assumptions. Practically, we devise a latent representation learning-based estimator of long-term causal effects. Theoretically, we establish the identifiability of latent confounders, with which we further achieve long-term effect identification. Extensive experimental studies, conducted on multiple synthetic and semi-synthetic datasets, demonstrate the effectiveness of our proposed method.
1699: Wave-wise Discriminative Tracking by Phase-Amplitude Separation, Augmentation and Mixture
Authors: Huibin Tan, Mingyu Cao, Kun Hu, Xihuai He, Zhe Wang, Hao Li, Long Lan, Mengzhu Wang
Location: Guangzhou | Day: TBD
Distinguishing key features in complex visual tasks is challenging. A novel approach treats image patches (tokens) as waves. By using both phase and amplitude, it captures richer semantics and specific invariances compared to pixel-based methods, and allows for feature fusion across regions for a holistic image representation. Based on this, we propose the Wave-wise Discriminative Transformer Tracker (WDT). During tracking, WDT represents features via phase-amplitude separation, enhancement, and mixture. First, we designed a Mutual Exclusive Phase-Amplitude Extractor (MEPAE) to separate phase and amplitude features with distinct semantics, representing spatial target info and background brightness respectively. Then, Wave-wise Feature Augmentation is carried out with two submodules: Phase-Amplitude Feature Augmentation and Mixture. The augmentation module disrupts the separated features in the same batch, and the mixture module recombines them to generate positive and negative waves. The original features are aggregated into the original wave. Positive waves have the same phase but different amplitudes, and negative waves have different phase components. Finally, self-supervised and tracking-supervised losses guide the global and local representation learning for original, positive, and negative waves, enhancing wave-level discrimination. Experiments on five benchmarks prove the effectiveness of our method.
1712: Asymptotic Analysis of Weighted Fair Division
Authors: Pasin Manurangsi, Warut Suksompong, Tomohiko Yokoyama
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Game Theory and Economic Paradigms
Several resource allocation settings involve agents with unequal entitlements represented by weights. We analyze weighted fair division from an asymptotic perspective: if m items are divided among n agents whose utilities are independently sampled from a probability distribution, when is it likely that a fair allocation exists? We show that if the ratio between the weights is bounded, a weighted envy-free allocation exists with high probability provided that m = Ω(n log n / log log n), generalizing a prior unweighted result. For weighted proportionality, we establish a sharp threshold of m = n / (1 − μ) for the transition from non-existence to existence, where μ ∈ (0, 1) denotes the mean of the distribution. In addition, we prove that for two agents, a weighted envy-free (and weighted proportional) allocation is likely to exist if m = ω(√r), where r denotes the ratio between the two weights.
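The sketch below illustrates the weighted envy-freeness notion commonly used in this literature: agent i does not envy agent j when i's utility for its own bundle, divided by i's weight, is at least i's utility for j's bundle divided by j's weight. It assumes additive utilities and uses hypothetical integer data. The exact definition adopted in the paper is not stated in the abstract, so treat this as an assumption.

```python
def is_weighted_envy_free(utils, weights, allocation):
    """Check u_i(A_i) / w_i >= u_i(A_j) / w_j for all agents i != j.

    utils[i][g] is agent i's utility for item g (additive utilities assumed);
    allocation[i] is the set of items given to agent i.
    """
    n = len(weights)
    for i in range(n):
        own = sum(utils[i][g] for g in allocation[i])
        for j in range(n):
            if i == j:
                continue
            other = sum(utils[i][g] for g in allocation[j])
            # Cross-multiplied form of own / w_i >= other / w_j.
            if own * weights[j] < other * weights[i]:
                return False
    return True


# Hypothetical instance: agent 0 has twice the entitlement of agent 1.
weights = [2, 1]
utils = [
    [6, 5, 4],  # agent 0
    [3, 2, 9],  # agent 1
]

print(is_weighted_envy_free(utils, weights, [{0, 1}, {2}]))
print(is_weighted_envy_free(utils, weights, [{2}, {0, 1}]))
```

Giving the higher-weight agent the two items it values most passes the check, while swapping the bundles leaves that agent envious once weights are taken into account.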
1714: Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition
Authors: Xiaogang Xu, Kun Zhou, Tao Hu, Jiafei Wu, Ruixing Wang, Hao Peng, Bei Yu
Location: Guangzhou | Day: TBD
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. In this paper, we present an innovative video decomposition strategy that incorporates view-independent and view-dependent components to enhance the performance of LLVE. We leverage dynamic cross-frame correspondences for the view-independent term (which primarily captures intrinsic appearance) and impose a scene-level continuity constraint on the view-dependent term (which mainly describes the shading condition) to achieve consistent and satisfactory decomposition results. To further ensure consistent decomposition, we introduce a dual-structure enhancement network featuring a cross-frame interaction mechanism. By supervising different frames simultaneously, this network encourages them to exhibit matching decomposition features. This mechanism can seamlessly integrate with encoder-decoder single-frame networks, incurring minimal additional parameter costs. Extensive experiments are conducted on widely recognized LLVE benchmarks, covering diverse scenarios. Our framework consistently outperforms existing methods, establishing a new SOTA performance.
1726: Egocentric Object-Interaction Anticipation with Retentive and Predictive Learning
Authors: Guo Chen, Yifei Huang, Yin-dong Zheng, Yicheng Liu, Jiahao Wang, Tong Lu
Location: Guangzhou | Day: TBD
Egocentric object-interaction anticipation is critical for applications like augmented reality and robotics, but existing methods struggle with misaligned egocentric encoding, insufficient supervision, and underutilized historical context. These limitations stem from a lack of focus on retention, i.e., retaining long-term object-centric interactions, and prediction, i.e., future-centric encoding and future uncertainty modeling. We introduce EgoAnticipator, a novel Retentive and Predictive Learning framework that addresses these challenges. Our approach combines retentive pre-training for domain-specific encoding, predictive pre-training for future uncertainty modeling, and mirror distillation to transfer future-informed knowledge. Additionally, we propose long-term memory prompting to integrate historical interaction cues. We evaluate the effectiveness of our framework using the Ego4D short-term object interaction anticipation benchmark, covering both STAv1 and STAv2. Extensive experiments demonstrate that our framework outperforms existing methods, while ablation studies highlight the effectiveness of each design inside our retentive and predictive learning framework.
1736: Knowledge Editing for Multi-Hop Question Answering Using Semantic Analysis
Authors: Dominic Simon, Rickard Ewetz
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Natural Language Processing (2/2)
Large Language Models (LLMs) require lightweight avenues of updating stored information that has fallen out of date. Knowledge Editing (KE) approaches have been successful in updating model knowledge for simple factual queries but struggle with handling tasks that require compositional reasoning such as multi-hop question answering (MQA). We observe that existing knowledge editors leverage decompositional techniques that result in illogical reasoning processes. In this paper, we propose a knowledge editor for MQA based on semantic analysis called CHECK. Our framework is based on insights from an analogy between compilers and reasoning using LLMs. Similar to how source code is first compiled before being executed, we propose to semantically analyze reasoning chains before executing the chains to answer questions. Reasoning chains with semantic errors are revised to ensure consistency through logic optimization and re-prompting the LLM at a higher temperature. We evaluate the effectiveness of CHECK against five state-of-the-art frameworks on four datasets and achieve an average improvement of 22.8% in MQA accuracy.
1738: A Novel Local Search Algorithm for the Vertex Bisection Minimization Problem
Authors: Rui Sun, Xinyu Wang, Yiyuan Wang, Jiangnan Li, Yi Zhou
Location: Guangzhou | Day: TBD
The vertex bisection minimization problem (VBMP) is a fundamental graph partitioning problem with numerous real-world applications. In this study, we propose a (k, l, S)-cluster guided local search algorithm to address this challenge.
First, we propose a novel (k, l, S)-cluster enumeration procedure, which is based on two key concepts: the (k, l, S)-cluster and the local cluster core. The (k, l, S)-cluster limits both the connectivity and distinct boundaries of a given vertex set, and the local cluster core represents the most cohesive substructure within a (k, l, S)-cluster. Building upon the above (k, l, S)-cluster enumeration procedure, we present a novel (k, l, S)-cluster guided perturbation mechanism designed to escape from local optima.
Next, we propose a two-manner local search procedure that employs two distinct search models to explore the neighboring search space efficiently. Experimental results demonstrate that the proposed algorithm performs best on nearly all instances.
1748: Few-Shot Incremental Multi-modal Learning via Touch Guidance and Imaginary Vision Synthesis
Authors: Lina Wei, Yuhang Ma, Zhongsheng Lin, Fangfang Wang, Canghong Jin, Hanbin Zhao, Dapeng Chen
Location: Guangzhou | Day: TBD
Multimodal perception, which integrates vision and touch, is increasingly demonstrating its significance in domains such as embodied intelligence and human-computer interaction. However, in open-world scenarios, multimodal data streams face significant challenges, including catastrophic forgetting and overfitting, during few-shot class incremental learning (FSCIL), leading to a severe degradation in model performance. In this work, we propose a novel approach named Few-Shot Incremental Multi-modal Learning via Touch Guidance and Imaginary Vision Synthesis (TIFS). Our method leverages vision imagination synthesis to enhance semantic understanding and integrates touch and vision fusion to alleviate the problem of modal imbalance. Specifically, we introduce a framework that employs touch-guided vision information for cross-modal contrastive learning to address the challenges of few-shot learning. Additionally, we incorporate multiple learning mechanisms, including regularization, memory mechanisms, and attention mechanisms, to mitigate catastrophic forgetting during multi-incremental step learning. Experimental results on the Touch and Go and VisGel datasets demonstrate that the TIFS framework exhibits robust continuous learning capabilities and strong generalization performance in touch-vision few-shot incremental learning tasks. Our code is available at https://github.com/Vision-Multimodal-Lab-HZCU/TIFS.
1755: Backdoor Attack on Vertical Federated Graph Neural Network Learning
Authors: Jirui Yang, Peng Chen, Zhihui Lu, Jianping Zeng, Qiang Duan, Xin Du, Ruijun Deng
Location: Guangzhou | Day: TBD
Federated Graph Neural Networks (FedGNN) integrate federated learning (FL) with graph neural networks (GNNs) to enable privacy-preserving training on distributed graph data. Vertical Federated Graph Neural Network (VFGNN), a key branch of FedGNN, handles scenarios where data features and labels are distributed among participants. Despite the robust privacy-preserving design of VFGNN, we have found that it still faces the risk of backdoor attacks, even in situations where labels are inaccessible. This paper proposes BVG, a novel backdoor attack method that leverages multi-hop triggers and backdoor retention, requiring only four target-class nodes to execute effective attacks. Experimental results demonstrate that BVG achieves nearly 100% attack success rates across three commonly used datasets and three GNN models, with minimal impact on the main task accuracy. We also evaluated various defense methods, and the BVG method maintained high attack effectiveness even under existing defenses. This finding highlights the need for advanced defense mechanisms to counter sophisticated backdoor attacks in practical VFGNN applications.
1774: RLBCD: Residual-guided Latent Brownian-bridge Co-Diffusion for Anatomical-to-Metabolic Image Synthesis
Authors: Tianxu Lv, Hongnian Tian, Jiansong Fan, Yuan Liu, Lihua Li, Xiang Pan
Location: Guangzhou | Day: TBD
While metabolic imaging can facilitate early diagnosis by revealing physiological changes of lesions, it is limited by high cost, high radiation risk, and potential renal impairment. Thus, an effective approach for Anatomical-to-Metabolic Image Synthesis (A2MIS) is highly desirable. However, existing methods are heavily hindered by the gap between distinct domains, and fail to provide a confidence score for the synthesized images, severely restricting their clinical applications. Here, we propose a novel Residual-guided Latent Brownian-bridge Co-Diffusion (RLBCD) model for A2MIS. Specifically, RLBCD starts with a co-diffusion process that leverages a residual diffusion branch to capture inter-domain differences, which are injected into an enhanced diffusion branch to maximally reconstruct modality-specific details. Furthermore, to explore desired residual guidance, we investigate the encoder and decoder features in diffusion models, and accordingly design a Hybrid-Granularity Fusion to integrate consistent semantics and complementary information for fine-grained reconstruction. Additionally, a latent consistency score is developed to enhance the restoration of modality-specific information, which also serves as an indicator of the inherent confidence of the synthesized images. Extensive experiments conducted on five public and in-house datasets demonstrate that RLBCD not only outperforms state-of-the-art methods for A2MIS, but is also valuable for downstream clinical applications.
1781: EDyGS: Event Enhanced Dynamic 3D Radiance Fields from Blurry Monocular Video
Authors: Mengxu Lu, Zehao Chen, Yan Liu, De Ma, Huajin Tang, Qian Zheng, Gang Pan
Location: Montreal | Day: August 21st | Time: 15:00 | Session: CV: videos
The task of generating novel views in dynamic scenes plays a critical role in the 3D vision domain. Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have shown great promise in this domain but struggle with motion blur, which often arises in real-world scenarios due to camera or object motion. Existing methods address camera motion blur but fall short in dynamic scenes, where the coupling of camera and object motion complicates multi-view consistency and temporal coherence. In this work, we propose EDyGS, a model designed to reconstruct sharp novel views from event streams and monocular videos of dynamic scenes with motion blur. Our approach introduces a motion-mask 3D Gaussian model that assigns each Gaussian an additional attribute to distinguish between static and dynamic regions. By leveraging this motion mask field, we separate and optimize the static and dynamic regions independently. A progressive learning strategy is adopted, where static regions are reconstructed by jointly optimizing camera poses and learnable 3D Gaussians, while dynamic regions are modeled using an implicit deformation field alongside learnable 3D Gaussians. We conduct both quantitative and qualitative experiments on synthetic and real-world data. Experimental results demonstrate that EDyGS effectively handles blurry inputs in dynamic scenes.
1788: Projection, Interaction and Fusion: A Progressive Difference Fusion Network for Salient Object Detection
Authors: Xiao Ke, Weijie Zhou, Yuzhen Niu
Location: Guangzhou | Day: TBD
In recent years, deep learning-based Salient Object Detection (SOD) methods have made tremendous progress; however, their performance in complex scenarios has reached a bottleneck. In this paper, we propose a novel Progressive Difference Fusion Network (PDFNet) based on fine-grained feature fusion. First, to address the scale variability of salient objects, we introduce a Self-Guided Module (SGM) with dynamic receptive fields. Second, to tackle the shape variability of salient objects, we design a Feature Aggregation Module (FAM) incorporating cross convolutions and a feedback loop. Finally, to alleviate the issue of confusion between global and detail information during multi-scale feature fusion in existing models, we develop a Progressive Difference Fusion Unit (PDFU) to project multi-scale features into fine-grained nodes and enhance them through node interaction based on difference features. Additionally, we propose a Conditional Random Field Based on Patch (CRFbp), which focuses on handling discrete points, further improving the model’s performance. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance on five benchmark datasets. Code is available at: https://github.com/pdfnet2025/PDFNet.git.
1790: Recalling The Forgotten Class Memberships: Unlearned Models Can Be Noisy Labelers to Leak Privacy
Authors: Zhihao Sui, Liang Hu, Jian Cao, Dora D. Liu, Usman Naseem, Zhongyuan Lai, Qi Zhang
Location: Guangzhou | Day: TBD
Machine Unlearning (MU) technology facilitates the removal of the influence of specific data instances from trained models on request. Despite rapid advancements in MU technology, its vulnerabilities are still underexplored, posing potential risks of privacy breaches through leaks of ostensibly unlearned information. Current limited research on MU attacks requires access to original models containing privacy data, which violates the critical privacy-preserving objective of MU. To address this gap, we initiate the innovative study on recalling the forgotten class memberships from unlearned models (ULMs) without requiring access to the original one. Specifically, we implement a Membership Recall Attack (MRA) framework with a teacher-student knowledge distillation architecture, where ULMs serve as noisy labelers to transfer knowledge to student models. Then, it is translated into a Learning with Noisy Labels (LNL) problem for inferring correct labels of the forgetting instances. Extensive experiments on state-of-the-art MU methods with multiple real datasets demonstrate that the proposed MRA strategy exhibits high efficacy in recovering class memberships of unlearned instances. As a result, our study and evaluation have established a benchmark for future research on MU vulnerabilities.
1791: Progressive Modality-Adaptive Interactive Network for Multi-Modality Image Fusion
Authors: Chaowei Huang, Yaru Su, Huangbiao Xu, Xiao Ke
Location: Guangzhou | Day: TBD
Multi-modality image fusion (MMIF) integrates features from distinct modalities to enhance visual quality and improve downstream task performance. However, existing methods often overlook the sparsity variations and dynamic correlations between infrared and visible images, potentially limiting the utilization of both modalities. To address these challenges, we propose the Progressive Modality-Adaptive Interactive Network (PoMAI), a novel framework that not only dynamically adapts to the sparsity and structural disparities of each modality but also enhances inter-modal correlations, thereby optimizing fusion quality. The training process consists of two stages: in the first stage, the Neighbor-Group Matching Model (NGMM) models the high sparsity of infrared features, while the Context-Aware Modeling Network (CAMN) captures rich structural details in visible features, jointly refining modality-specific characteristics for fusion. In the second stage, the Modality-Interactive Compensation Module (MICM) refines inter-modal correlations via a dynamic compensation mechanism, while the first-stage modules are frozen so that MICM focuses solely on the compensation task. Extensive experiments on benchmark datasets demonstrate that PoMAI surpasses state-of-the-art methods in fusion quality and excels in downstream tasks.
1792: Concentrate on Weakness: Mining Hard Prototypes for Few-Shot Medical Image Segmentation
Authors: Jianchao Jiang, Haofeng Zhang
Location: Guangzhou | Day: TBD
Few-Shot Medical Image Segmentation (FSMIS) has been widely used to train a model that can perform segmentation from only a few annotated images. However, most existing prototype-based FSMIS methods generate multiple prototypes from the support image solely by random sampling or local averaging, which can cause particularly severe boundary blurring because normal features tend to account for the majority of the features of a specific category. Consequently, we propose to pay more attention to those weaker features that are crucial for a clear segmentation boundary. Specifically, we design a Support Self-Prediction (SSP) module to identify such weak features by comparing the true support mask with the one predicted by the global support prototype. Then, a Hard Prototypes Generation (HPG) module is employed to generate multiple hard prototypes based on these weak features. Subsequently, a Multiple Similarity Maps Fusion (MSMF) module is devised to generate the final segmentation mask in a dual-path fashion to mitigate the imbalance between foreground and background in medical images. Furthermore, we introduce a boundary loss to further constrain the edges of the segmentation. Extensive experiments on three publicly available medical image datasets demonstrate that our method achieves state-of-the-art performance. Code is available at https://github.com/jcjiang99/CoW.
1793: FGeo-HyperGNet: Geometric Problem Solving Integrating FormalGeo Symbolic System and Hypergraph Neural Network
Authors: Xiaokai Zhang, Yang Li, Na Zhu, Cheng Qin, Zhenbing Zeng, Tuo Leng
Location: Guangzhou | Day: TBD
Geometric problem solving is a long-standing challenge in the fields of mathematical reasoning and artificial intelligence. We built a neural-symbolic system, called FGeo-HyperGNet, to automatically perform human-like geometric problem solving. The symbolic component is a formal system built on FormalGeo, which can automatically perform geometric relational reasoning and algebraic calculations and organize the solution into a hypergraph with conditions as hypernodes and theorems as hyperedges. The neural component, called HyperGNet, is a hypergraph neural network based on the attention mechanism, including an encoder to effectively encode the structural and semantic information of the hypergraph and a theorem predictor to provide guidance in solving problems. The neural component predicts theorems according to the hypergraph, and the symbolic component applies theorems and updates the hypergraph, thus forming a predict-apply cycle to ultimately achieve readable and traceable automatic solving of geometric problems. Experiments demonstrate the correctness and effectiveness of this neural-symbolic architecture. We achieved state-of-the-art results with a TPA of 93.50% and a PSSR of 88.36% on the FormalGeo7K dataset.
1801: Open-Vocabulary Fine-Grained Hand Action Detection
Authors: Ting Zhe, Mengya Han, Xiaoshuai Hao, Yong Luo, Zheng He, Xiantao Cai, Jing Zhang
Location: Guangzhou | Day: TBD
In this work, we address the new challenge of open-vocabulary fine-grained hand action detection, which aims to recognize hand actions from both known and novel categories using textual descriptions. Traditional hand action detection methods are limited to closed-set detection, making it difficult for them to generalize to new, unseen hand action categories. While current open-vocabulary detection (OVD) methods are effective at detecting novel objects, they face challenges with fine-grained action recognition, particularly when data is limited and heterogeneous. This often leads to poor generalization and performance bias between base and novel categories. To address these issues, we propose a novel approach, Open-FGHA (Open-vocabulary Fine-Grained Hand Action), which learns to distinguish fine-grained features across multiple modalities from limited heterogeneous data. It then identifies optimal matching relationships among these features, enabling accurate open-vocabulary fine-grained hand action detection. Specifically, we introduce three key components: Hierarchical Heterogeneous Low-Rank Adaptation, Bidirectional Selection and Fusion Mechanism, and Cross-Modality Query Generator. These components work in unison to enhance the alignment and fusion of multimodal fine-grained features. Extensive experiments demonstrate that Open-FGHA outperforms existing OVD methods, showing its strong potential for open-vocabulary hand action detection. The source code is available at OV-FGHAD.
1806: DaringFed: A Dynamic Bayesian Persuasion Pricing for Online Federated Learning Under Two-sided Incomplete Information
Authors: Yun Xin, Jianfeng Lu, Shuqin Cao, Gang Li, Haozhao Wang, Guanghui Wen
Location: Guangzhou | Day: TBD
Online Federated Learning (OFL) is a real-time learning paradigm that sequentially executes parameter aggregation immediately for each randomly arriving client. To motivate clients to participate in OFL, it is crucial to offer appropriate incentives to offset the training resource consumption. However, the design of incentive mechanisms in OFL is constrained by the dynamic variability of Two-sided Incomplete Information (TII) concerning resources, where the server is unaware of the clients’ dynamically changing computational resources, while clients lack knowledge of the real-time communication resources allocated by the server. To incentivize clients to participate in training by offering dynamic rewards to each arriving client, we design a novel Dynamic Bayesian persuasion pricing for online Federated learning (DaringFed) under TII. Specifically, we begin by formulating the interaction between the server and clients as a dynamic signaling and pricing allocation problem within a Bayesian persuasion game, and then demonstrate the existence of a unique Bayesian persuasion Nash equilibrium. By deriving the optimal design of DaringFed under one-sided incomplete information, we further analyze the approximate optimal design of DaringFed with a specific bound under TII. Finally, extensive evaluations conducted on real datasets demonstrate that DaringFed improves accuracy and convergence speed by 16.99%, while experiments with synthetic datasets validate the convergence of the estimates of unknown values and the effectiveness of DaringFed in improving the server’s utility by up to 12.6%.
1808: AdaptPFL: Unlocking Cross-Device Palmprint Recognition via Adaptive Personalized Federated Learning with Feature Decoupling
Authors: Zirui Zhang, Donghai Guan, Çetin Kaya Koç, Jie Wen, Qi Zhu
Location: Guangzhou | Day: TBD
Contactless palmprint recognition has recently emerged as a promising biometric technology. However, traditional methods that require sharing user data introduce substantial security risks. While federated learning offers privacy-preserving solutions, it often compromises recognition accuracy due to feature distribution drift caused by external factors such as lighting and devices. To address this issue, we propose an adaptive personalized federated learning framework (AdaptPFL). The central innovation lies in decomposing palmprint features into identity-related and context-related components using a feature decoupling mechanism. This design isolates the influence of external environmental factors on identity recognition through de-entanglement. Furthermore, two adaptive aggregation strategies are introduced to correct client drift: (1) Intra-Local Adaptive Aggregation (ILAA), which addresses intra-client drift by adaptively combining the two decoupled feature types; (2) Global-Local Adaptive Aggregation (GLAA), which corrects inter-client drift by adaptively aggregating model parameters. Experimental results demonstrate that AdaptPFL achieves superior performance compared to existing state-of-the-art methods.
1815: Credit Assignment and Fine-Tuning Enhanced Reinforcement Learning for Collaborative Spatial Crowdsourcing
Authors: Wei Chen, Yafei Li, Baolong Mei, Guanglei Zhu, Jiaqi Wu, Mingliang Xu
Location: Guangzhou | Day: TBD
Collaborative spatial crowdsourcing leverages distributed workers’ collective intelligence to accomplish spatial tasks. A central challenge is to efficiently assign suitable workers to collaborate on these tasks. Although mainstream reinforcement learning (RL) methods have proven effective in task allocation, they face two key obstacles: delayed reward feedback and non-stationary data distributions, both hindering optimal allocation and collaborative efficiency. To address these limitations, we propose CAFE (credit assignment and fine-tuning enhanced), a novel multi-agent RL framework for spatial crowdsourcing. CAFE introduces a credit assignment mechanism that distributes rewards based on workers’ contributions and spatiotemporal constraints, coupled with bi-level meta-optimization to jointly optimize credit assignment and RL policy. To handle non-stationary spatial task distributions, CAFE employs an adaptive fine-tuning procedure that efficiently adjusts credit assignment parameters while preserving collaborative knowledge. Experiments on two real-world datasets validate the effectiveness of our framework, demonstrating superior performance in terms of task completion and equitable reward redistribution.
1823: CADP: Towards Better Centralized Learning for Decentralized Execution in MARL
Authors: Yihe Zhou, Shunyu Liu, Yunpeng Qing, Tongya Zheng, Kaixuan Chen, Jie Song, Mingli Song
Location: Guangzhou | Day: TBD
Centralized Training with Decentralized Execution (CTDE) has recently emerged as a popular framework for cooperative Multi-Agent Reinforcement Learning (MARL), where agents can use additional global state information to guide training in a centralized way and make their own decisions based only on decentralized local policies. Despite the encouraging results achieved, CTDE makes an independence assumption on agent policies, which prevents agents from adopting global cooperative information from each other during centralized training. Therefore, we argue that the existing CTDE framework cannot fully utilize global information for training, leading to inefficient joint exploration and perception, which can degrade the final performance. In this paper, we introduce a novel Centralized Advising and Decentralized Pruning (CADP) framework for MARL that not only enables an efficacious message exchange among agents during training but also guarantees independent policies for decentralized execution. First, CADP endows agents with an explicit communication channel to seek and take advice from different agents for more centralized training. To further ensure decentralized execution, we propose a smooth model pruning mechanism to progressively constrain the agent communication into a closed one without degradation in agent cooperation capability. Empirical evaluations on different benchmarks and across various MARL backbones demonstrate that the proposed framework achieves superior performance compared with the state-of-the-art counterparts. Our code is available at https://github.com/zyh1999/CADP
1829: BEVTrack: A Simple and Strong Baseline for 3D Single Object Tracking in Bird’s-Eye View
Authors: Yuxiang Yang, Yingqi Deng, Mian Pan, Zheng-Jun Zha, Jing Zhang
Location: Guangzhou | Day: TBD
3D Single Object Tracking (SOT) is a fundamental task in computer vision and plays a critical role in applications like autonomous driving. However, existing algorithms often involve complex designs and multiple loss functions, making model training and deployment challenging. Furthermore, their reliance on fixed probability distribution assumptions (e.g., Laplacian or Gaussian) hinders their ability to adapt to diverse target characteristics such as varying sizes and motion patterns, ultimately affecting tracking precision and robustness. To address these issues, we propose BEVTrack, a simple yet effective motion-based tracking method. BEVTrack directly estimates object motion in Bird’s-Eye View (BEV) using a single regression loss. To enhance accuracy for targets with diverse attributes, it learns adaptive likelihood functions tailored to individual targets, avoiding the limitations of fixed distribution assumptions in previous methods. This approach provides valuable priors for tracking and significantly boosts performance. Comprehensive experiments on KITTI, NuScenes, and Waymo Open Dataset demonstrate that BEVTrack achieves state-of-the-art results while operating at 200 FPS, enabling real-time applicability. The code will be released at https://github.com/xmm-prio/BEVTrack.
1833: VQCounter: Designing Visual Prompt Queue for Accurate Open-World Counting
Authors: Fanfan Ye, Yiqi Fan, Qiaoyong Zhong, Shicai Yang, Di Xie, Jie Song, Mingli Song
Location: Guangzhou | Day: TBD
Class-agnostic counting enables enumerating arbitrary object classes beyond those seen during training. Recent studies have attempted to exploit the potential of visual foundation models such as GroundingDINO. Despite the considerable progress, we observe certain shortcomings, including the limited diversity of visual prompts and a suboptimal training regimen.
To address these issues, we introduce VQCounter, which incorporates a visual prompt queue mechanism designed to enrich the diversity of visual prompts.
A random modality switching strategy is proposed during training to strengthen both textual and visual modalities.
Besides, in light of weak point supervision, a Voronoi diagram-based cost (VoronoiCost) is designed to improve Hungarian matching, leading to more stable and faster convergence.
Building upon the Voronoi diagram, we also propose a novel set of more stringent evaluation metrics, which take point localization into account.
Extensive experiments on the FSC-147 and CARPK datasets demonstrate that VQCounter achieves state-of-the-art performance in both zero-shot and few-shot settings, significantly outperforming existing methods across nearly all evaluations.
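To make the matching step mentioned in this abstract concrete, the sketch below pairs predicted points with ground-truth points so that the total cost is minimal. It is only an illustration under loud assumptions: plain Euclidean distance stands in for the paper's VoronoiCost, which the abstract does not define, and exhaustive search over permutations stands in for the Hungarian algorithm (both return a minimum-cost one-to-one assignment, but exhaustive search is only practical for a handful of points). The function name is hypothetical.

```python
from itertools import permutations
from math import dist


def match_points(pred, gt):
    """Minimum-cost one-to-one matching between equally sized point lists.

    Returns (total_cost, assignment), where assignment[i] is the index of the
    ground-truth point matched to prediction i. Cost is Euclidean distance,
    used here as a placeholder for a task-specific matching cost.
    """
    if len(pred) != len(gt):
        raise ValueError("this sketch assumes equally many predictions and targets")
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(gt))):
        cost = sum(dist(pred[i], gt[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, list(best_perm)


pred = [(0.0, 0.0), (10.0, 10.0)]
gt = [(9.0, 9.0), (1.0, 1.0)]
cost, assignment = match_points(pred, gt)
print(assignment)  # each prediction is paired with its nearby target
```

For realistic numbers of points, a polynomial-time solver for the linear assignment problem would replace the permutation loop; the cost function is the part a method like the one described would customize.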
1837: A General Framework for Representing Controlled Natural Language Sentences and Translation to KR Formalisms
Authors: Simone Caruso, Carmine Dodaro, Marco Maratea, Alice Tarzariol
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Knowledge Representation and Reasoning (4/4)
Languages for Knowledge Representation and Reasoning, such as ASP, CP, and SMT, excel at solving some complex problems, but encoding them into a higher-level language may be more profitable, leaving these formalisms as targets for solving. Recent studies aim to convert controlled natural languages into formal representations, yet these solutions are often tailored to specific languages and require significant effort.
This paper introduces a general framework that generates grammars for target representation languages, enabling the translation of problems stated in CNL into formal representations. The related system, CNLWizard, offers a flexible, high-level approach to defining desired grammars, significantly reducing the time and effort needed to create custom grammars. Finally, we demonstrate the system’s effectiveness through an experimental analysis.
1839: Multi Objective Quantile Based Reinforcement Learning for Modern Urban Planning
Authors: Lukasz Pelcner, Leandro Soriano Marcolino, Matheus Aparecido do Carmo Alves, Paula A. Harrison, Peter M. Atkinson
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Agent-based and Multi-agent Systems (3/3)
We present a novel Multi-Agent Reinforcement Learning approach to understand and improve policy development by land-shaping agents, such as governments and institutional bodies. We derive the underlying policy decisions by analyzing the land and developing an intelligent system that proposes optimal land conversion strategies. The aim is an efficient method for allocating residential spaces while considering the dynamic population influx in different regions, jurisdictional constraints, and the intrinsic characteristics of the land. Our main goal is to be sustainable, preserving desirable land types such as forests and fluvial lands while optimizing land organization. We introduce an attractiveness metric that quantifies the proximity to different land types and other factors to optimize land usage. It distinguishes two types of agents: “top-down” agents, which are policymakers and shareholders, and “bottom-up” agents representing individuals or groups with specific housing preferences. Our main objective is to create a synergistic environment where the top-down policy meets the bottom-up preferences to devise a comprehensive land use and conversion strategy. This paper, thus, serves as a pivotal reference point for future urban planning and policy-making processes, contributing to a sustainable and efficient landscape design model.
1840: Physical Adversarial Camouflage Through Gradient Calibration and Regularization
Authors: Jiawei Liang, Siyuan Liang, Jianjie Huang, Chenxi Si, Ming Zhang, Xiaochun Cao
Location: Guangzhou | Day: TBD
The advancement of deep object detectors has greatly affected safety-critical fields like autonomous driving. However, physical adversarial camouflage poses a significant security risk by altering object textures to deceive detectors. Existing techniques struggle with variable physical environments, facing two main challenges: 1) inconsistent sampling point densities across distances hinder the gradient optimization from ensuring local continuity, and 2) updating texture gradients from multiple angles causes conflicts, reducing optimization stability and attack effectiveness. To address these issues, we propose a novel adversarial camouflage framework based on gradient optimization. First, we introduce a gradient calibration strategy, which ensures consistent gradient updates across distances by propagating gradients from sparsely to unsampled texture points, thereby expanding the attack’s effective range. Additionally, we develop a gradient decorrelation method, which prioritizes and orthogonalizes gradients based on loss values, enhancing stability and effectiveness in multi-angle optimization by eliminating redundant or conflicting updates. Extensive experimental results on various detection models, angles, and distances show that our method significantly surpasses the state-of-the-art, with an average attack success rate (ASR) increase of 13.46% across distances and 11.03% across angles. Furthermore, experiments in real-world settings confirm the method’s threat potential, highlighting the urgent need for more robust autopilot systems less prone to spoofing.
1844: DDPA-3DVG: Vision-Language Dual-Decoupling and Progressive Alignment for 3D Visual Grounding
Authors: Hongjie Gu, Jinlong Fan, Liang Zheng, Jing Zhang, Yuxiang Yang
Location: Guangzhou | Day: TBD
3D visual grounding aims to localize target objects in point clouds based on free-form natural language, which often describes both target and reference objects. Effective alignment between visual and text features is crucial for this task. However, existing two-stage methods that rely solely on object-level features can yield suboptimal accuracy, while one-stage methods that align only point-level features can be prone to noise. In this paper, we propose DDPA-3DVG, a novel framework that progressively aligns visual locations and language descriptions at multiple granularities. Specifically, we decouple natural language descriptions into distinct representations of target objects, reference objects, and their mutual relationships, while disentangling 3D scenes into object-level, voxel-level, and point-level features. By progressively fusing these dual-decoupled features from coarse to fine, our method enhances cross-modal alignment and achieves state-of-the-art performance on three challenging benchmarks—ScanRefer, Nr3D, and Sr3D. The code will be released at https://github.com/HDU-VRLab/DDPA-3DVG.
1852: DONIS: Importance Sampling for Training Physics-Informed DeepONet
Authors: Shudong Huang, Rui Huang, Ming Hu, Wentao Feng, Jiancheng Lv
Location: Guangzhou | Day: TBD
Deep Operator Network (DeepONet) effectively learns complex operator mappings, especially for systems governed by differential equations. Physics-informed DeepONet (PI-DeepONet) extends these capabilities by integrating physical constraints, enabling robust performance with limited or no labeled data. However, combining operator learning with these constraints increases computational complexity, which makes training more difficult and convergence slower, particularly for nonlinear or high-dimensional problems. In this work, we present an enhanced PI-DeepONet framework that applies importance sampling to both DeepONet inputs (i.e., the functions and the collocation points) to alleviate these training challenges. By focusing on critical data regions in both input domains, our approach achieves accelerated convergence and improved accuracy across various complex applications.
1861: Most Probable Explanation in Probabilistic Answer Set Programming
Authors: Damiano Azzolini, Giuseppe Mazzotta, Francesco Ricca, Fabrizio Riguzzi
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Uncertainty in AI
Most Probable Explanation (MPE) is a fundamental problem in statistical relational artificial intelligence. In the context of Probabilistic Answer Set Programming (PASP), solving MPE is still an open research problem. In this paper, we present three novel approaches for solving the MPE task in PASP that are based on: i) Algebraic Model Counting, ii) Answer Set Programming (ASP), and iii) ASP with quantifiers (ASP(Q)). These approaches are implemented and evaluated against existing solvers across different datasets and configurations. Empirical results demonstrate that the novel solutions consistently outperform existing alternatives for non-stratified programs.
1866: A Dynamic Stiefel Graph Neural Network for Efficient Spatio-Temporal Time Series Forecasting
Authors: Jiankai Zheng, Liang Xie
Location: Guangzhou | Day: TBD
Spatio-temporal time series (STTS) have been widely used in many applications. However, accurately forecasting STTS is challenging due to complex dynamic correlations in both time and space dimensions. Existing graph neural networks struggle to balance effectiveness and efficiency in modeling dynamic spatio-temporal relations. To address this problem, we propose the Dynamic Spatio-Temporal Stiefel Graph Neural Network (DST-SGNN) to efficiently process STTS. For DST-SGNN, we first introduce the novel Stiefel Graph Spectral Convolution (SGSC) and Stiefel Graph Fourier Transform (SGFT). The SGFT matrix in SGSC is constrained to lie on the Stiefel manifold, and SGSC can be regarded as a filtered graph spectral convolution. We also propose the Linear Dynamic Graph Optimization on Stiefel Manifold (LDGOSM), which can efficiently learn the SGFT matrix from the dynamic graph and significantly reduce the computational complexity. Finally, we propose a multi-layer SGSC (MSGSC) that efficiently captures complex spatio-temporal correlations. Extensive experiments on seven spatio-temporal datasets show that DST-SGNN outperforms state-of-the-art methods while maintaining relatively low computational costs.
1880: Sanitizing Backdoored Graph Neural Networks: A Multidimensional Approach
Authors: Rong Zhao, Jilian Zhang, Yu Wang, Yinyan Zhang, Jian Weng
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Reinforcement learning (1/2)
Graph Neural Networks (GNNs) are known to be prone to adversarial attacks, among which backdoor attack is a major security threat. By injecting backdoor triggers into a graph and assigning a target class label to nodes attached to the triggers, the attacker can mislead the GNN model trained on the poisoned graph to classify test nodes attached with a trigger to the target class. To defend against backdoor attacks, existing defense methods rely on anomaly detection in feature distribution or label transformation. However, these approaches are incapable of detecting in-distribution triggers or clean-label attacks that do not alter the class label of target nodes. To tackle these threats, we empirically analyze triggers from a multidimensional aspect, and our analysis shows that there are clear distinctions between trigger nodes and normal ones in terms of node feature values, node embeddings, and class prediction probabilities. Based on these findings, we propose a Multidimensional Anomaly Detection framework (MAD) that can effectively minimize the impact of triggers by pruning away anomalous nodes and edges. Extensive experiments show that at the cost of slight loss in clean classification accuracy, MAD achieves considerably lower attack success rate as compared to state-of-the-art backdoor defense methods.
1890: FreqMoE: Dynamic Frequency Enhancement for Neural PDE Solvers
Authors: Tianyu Chen, Haoyi Zhou, Ying Li, Hao Wang, Zhenzhe Zhang, Tianchen Zhu, Shanghang Zhang, Jianxin Li
Location: Guangzhou | Day: TBD
Fourier Neural Operators (FNO) have emerged as promising solutions for efficiently solving partial differential equations (PDEs) by learning infinite-dimensional function mappings through frequency domain transformations. However, the sparsity of high-frequency signals limits computational efficiency for high-dimensional inputs, and fixed-pattern truncation often causes high-frequency signal loss, reducing performance in scenarios such as high-resolution inputs or long-term predictions. To address these challenges, we propose FreqMoE, an efficient and progressive training framework that exploits the dependency of high-frequency signals on low-frequency components. The model first learns low-frequency weights and then applies a sparse upward-cycling strategy to construct a mixture of experts (MoE) in the frequency domain, effectively extending the learned weights to high-frequency regions. Experiments on both regular and irregular grid PDEs demonstrate that FreqMoE achieves up to 16.6 percent accuracy improvement while using merely 2.1 percent parameters (47.32x reduction) compared to dense FNO. Furthermore, the approach demonstrates remarkable stability in long-term predictions and generalizes seamlessly to various FNO variants and grid structures, establishing a new “Low frequency Pretraining, High frequency Fine-tuning” paradigm for solving PDEs.
1895: Physics-Assisted and Topology-Informed Deep Learning for Weather Prediction
Authors: Jiaqi Zheng, Qing Ling, Yerong Feng
Location: Guangzhou | Day: TBD
Although deep learning models have demonstrated remarkable potential in weather prediction, most of them overlook either the physics of the underlying weather evolution or the topology of the Earth’s surface. In light of these disadvantages, we develop PASSAT, a novel Physics-ASSisted And Topology-informed deep learning model for weather prediction. PASSAT attributes the weather evolution to two key factors: (i) the advection process that can be characterized by the advection equation and the Navier-Stokes equation; (ii) the Earth-atmosphere interaction that is difficult to both model and calculate. PASSAT also takes the topology of the Earth’s surface into consideration, rather than simply treating it as a plane. With these considerations, PASSAT numerically solves the advection equation and the Navier-Stokes equation on the spherical manifold, utilizes a spherical graph neural network to capture the Earth-atmosphere interaction, and generates the initial velocity fields that are critical to solving the advection equation from the same spherical graph neural network. On the 5.625-degree resolution ERA5 data set, PASSAT outperforms both the state-of-the-art deep learning-based weather prediction models and the operational numerical weather prediction model IFS T42.
1915: Conditional Information Bottleneck-Based Multivariate Time Series Forecasting
Authors: Xinhui Li, Liang Duan, Lixing Yu, Kun Yue, Yuehua Li
Location: Guangzhou | Day: TBD
Multivariate time series (MTS) forecasting endeavors to anticipate the forthcoming sequence of interdependent variables through the utilization of past observations. The prevailing methodologies, relying on deep neural networks, Transformer, or information bottleneck frameworks, persist in confronting challenges such as overlooking or inadequately capturing the inter-/intra-series correlations evident in practical MTS datasets. In response to these challenges, we introduce a conditional information bottleneck-based strategy for MTS forecasting, grounded in information theory. Initially, we establish a conditional information bottleneck principle to capture the inter-series correlations via conditioning on non-target variables. Subsequently, a conditional mutual information-based technique is introduced to extract intra-series correlations by conditioning historical data, ensuring temporal consistency within each variable. Lastly, we devise a unified optimization objective and propose a training algorithm to collectively capture inter-/intra-series correlations. Empirical investigations on authentic datasets underscore the superiority of our proposed approach over other cutting-edge competitors. Our code is available at https://github.com/Xinhui-Lee/CIB-MTSF.
1921: FreEformer: Frequency Enhanced Transformer for Multivariate Time Series Forecasting
Authors: Wenzhen Yue, Yong Liu, Xianghua Ying, Bowei Xing, Ruohao Guo, Ji Shi
Location: Guangzhou | Day: TBD
This paper presents FreEformer, a simple yet effective model that leverages a Frequency Enhanced Transformer for multivariate time series forecasting. Our work is based on the assumption that the frequency spectrum provides a global perspective on the composition of series across various frequencies and is highly suitable for robust representation learning. Specifically, we first convert time series into the complex frequency domain using the Discrete Fourier Transform (DFT). The Transformer architecture is then applied to the frequency spectra to capture cross-variate dependencies, with the real and imaginary parts processed independently. However, we observe that the vanilla attention matrix exhibits a low-rank characteristic, thus limiting representation diversity. To address this, we enhance the vanilla attention mechanism by introducing an additional learnable matrix to the original attention matrix, followed by row-wise L1 normalization. Theoretical analysis demonstrates that this enhanced attention mechanism improves both feature diversity and gradient flow. Extensive experiments demonstrate that FreEformer consistently outperforms state-of-the-art models on eighteen real-world benchmarks covering electricity, traffic, weather, healthcare and finance. Notably, the enhanced attention mechanism also consistently improves the performance of state-of-the-art Transformer-based forecasters. Code is available at https://anonymous.4open.science/r/FreEformer.
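The enhanced attention step described in the FreEformer abstract (adding a learnable matrix to the attention matrix, then applying row-wise L1 normalization) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, and the abstract does not say how the learnable matrix is parameterized, so it is passed in here as a plain matrix of numbers.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def enhanced_attention_weights(scores, learnable):
    """Add an extra matrix to softmax attention, then L1-normalize each row.

    scores:    N x N raw attention logits (list of lists).
    learnable: N x N additional matrix (assumption: trained elsewhere).
    """
    weights = []
    for score_row, extra_row in zip(scores, learnable):
        attn = softmax(score_row)
        mixed = [a + b for a, b in zip(attn, extra_row)]
        l1 = sum(abs(x) for x in mixed)
        # Guard against an all-zero row; otherwise divide by the L1 norm.
        weights.append([x / l1 if l1 > 0 else 0.0 for x in mixed])
    return weights


scores = [[0.2, 1.0, -0.5], [0.0, 0.0, 0.0], [2.0, -1.0, 0.3]]
learnable = [[0.5, 0.0, 0.1], [0.0, 0.3, 0.0], [0.1, 0.1, 0.4]]
weights = enhanced_attention_weights(scores, learnable)
```

In the second row the logits are uniform, so any difference between the resulting weights comes entirely from the added matrix.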
1923: Bidirectional Search while Ensuring Meet-In-The-Middle via Effective and Efficient-to-Compute Termination Conditions
Authors: Yi Wang, Eyal Weiss, Bingxian Mu, Oren Salzman
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Search
In bidirectional heuristic search, the meeting-in-the-middle property (MMP) and the theory of must-expand pairs (MEP) have driven significant recent developments in search efficiency. However, these methodologies typically terminate the search based on minimal priority metrics in the forward and backward open lists, requiring exploration of all potentially better solutions and potentially incurring substantial computational burden. In this paper, we investigate the reasons that contribute to the potential inefficiency in MM, and introduce a tighter termination condition that enables earlier termination without exhaustive exploration while still ensuring both MMP and optimality. This results in a highly efficient bidirectional search algorithm. Experimental comparisons demonstrate that our algorithm outperforms MM in terms of running time by at least two orders of magnitude and is on par or better compared to A*, highlighting its potential in a wide range of applications.
1932: Contrastive Unlearning: A Contrastive Approach to Machine Unlearning
Authors: Hong kyu Lee, Qiuchen Zhang, Carl Yang, Jian Lou, Li Xiong
Location: Montreal | Day: August 21st | Time: 10:00 | Session: MTA: Security and privacy
Machine unlearning aims to eliminate the influence of a subset of training samples (i.e., unlearning samples) from a trained model. Effectively and efficiently removing the unlearning samples without negatively impacting the overall model performance is challenging. Existing works mainly exploit input and output space and classification loss, which can result in ineffective unlearning or performance loss. In addition, they utilize unlearning or remaining samples ineffectively, sacrificing either unlearning efficacy or efficiency. Our main insight is that direct optimization on the representation space utilizing both unlearning and remaining samples can effectively remove the influence of unlearning samples while maintaining representations learned from remaining samples. We propose a contrastive unlearning framework, leveraging the concept of representation learning for more effective unlearning. It removes the influence of unlearning samples by contrasting their embeddings against the remaining samples’ embeddings so that their embeddings are closer to the embeddings of unseen samples. Experiments on a variety of datasets and models on both class unlearning and sample unlearning showed that contrastive unlearning achieves the best unlearning effects and efficiency with the lowest performance loss compared with the state-of-the-art algorithms. In addition, it is generalizable to different contrastive frameworks and other models such as vision-language models. Our main code is available on github.com/Emory-AIMS/Contrastive-Unlearning.
1941: FCKT: Fine-Grained Cross-Task Knowledge Transfer with Semantic Contrastive Learning for Targeted Sentiment Analysis
Authors: Wei Chen, Zhao Zhang, Meng Yuan, Kepeng Xu, Fuzhen Zhuang
Location: Guangzhou | Day: TBD
In this paper, we address the task of targeted sentiment analysis (TSA), which involves two sub-tasks, i.e., identifying specific aspects from reviews and determining their corresponding sentiments. Aspect extraction forms the foundation for sentiment prediction, highlighting the critical dependency between these two tasks for effective cross-task knowledge transfer. While most existing studies adopt a multi-task learning paradigm to align task-specific features in the latent space, they predominantly rely on coarse-grained knowledge transfer. Such approaches lack fine-grained control over aspect-sentiment relationships, often assuming uniform sentiment polarity within related aspects. This oversimplification neglects contextual cues that differentiate sentiments, leading to negative transfer. To overcome these limitations, we propose FCKT, a fine-grained cross-task knowledge transfer framework tailored for TSA. By explicitly incorporating aspect-level information into sentiment prediction, our framework achieves fine-grained knowledge transfer, effectively mitigating negative transfer and enhancing task performance. Extensive experiments on three real-world datasets, including comparisons with various baselines and large language models (LLMs), demonstrate the effectiveness of FCKT. The source code is available on https://github.com/cwei01/FCKT.
1950: Multi-Objective Neural Bandits with Random Scalarization
Authors: Ji Cheng, Bo Xue, Chengyu Lu, Ziqiang Cui, Qingfu Zhang
Location: Guangzhou | Day: TBD
Multi-objective multi-armed bandit (MOMAB) problems are crucial for complex decision-making scenarios where multiple conflicting objectives must be simultaneously optimized. However, most existing works are based on the linear assumption of the feedback rewards, which significantly constrains their applicability and efficacy in capturing the intricate dynamics of real-world environments. This paper explores a multi-objective neural bandit (MONB) framework, which integrates the universal approximators, neural networks, with the classical MOMABs. We adopt random scalarization to accommodate the special needs of a practitioner by setting an appropriate distribution on the regions of interest. Using the trade-off capabilities of upper confidence bound (UCB) and Thompson sampling (TS) strategies, we propose two novel algorithms, MONeural-UCB and MONeural-TS. Theoretical and empirical analyses demonstrate the superiority of our methods in multi-objective or multi-task bandit problems, showing a marked improvement over the classical linear MOMABs.
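As a rough illustration of random scalarization as mentioned in the abstract above, the sketch below draws a weight vector from a distribution over the probability simplex and collapses each arm's vector reward into a scalar. It makes several assumptions that the abstract does not state: a symmetric Dirichlet distribution over weights, a linear weighted-sum scalarization, and hypothetical function names and reward values. It is not the MONeural-UCB or MONeural-TS algorithm.

```python
import random


def sample_simplex_weights(num_objectives, rng, concentration=1.0):
    """Draw weights from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vector, weights):
    """Weighted-sum scalarization of a multi-objective reward."""
    return sum(w * r for w, r in zip(weights, reward_vector))


rng = random.Random(0)
weights = sample_simplex_weights(3, rng)

# Hypothetical estimated reward vectors for two arms (three objectives each).
arms = {"a": [0.9, 0.1, 0.2], "b": [0.4, 0.5, 0.6]}
best_arm = max(arms, key=lambda name: scalarize(arms[name], weights))
```

Changing the concentration parameter or replacing the Dirichlet draw with another distribution is one way a practitioner could place more probability mass on the trade-offs of interest.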
1959: Leveraging Pretrained Diffusion Models for Zero-Shot Part Assembly
Authors: Ruiyuan Zhang, Qi Wang, Jiaxiang Liu, Yuchi Huo, Chao Wu
Location: Guangzhou | Day: TBD
3D part assembly aims to understand part relationships and predict their 6-DoF poses to construct realistic 3D shapes, addressing the growing demand for autonomous assembly, which is crucial for robots. Existing methods mainly estimate the transformation of each part by training neural networks under supervision, which requires a substantial quantity of manually labeled data. However, the high cost of data collection and the immense variability of real-world shapes and parts make traditional methods impractical for large-scale applications. In this paper, we propose the first zero-shot part assembly method that utilizes pre-trained point cloud diffusion models as discriminators in the assembly process, guiding the manipulation of parts to form realistic shapes. Specifically, we theoretically demonstrate that utilizing a diffusion model for zero-shot part assembly can be transformed into an Iterative Closest Point (ICP) process. Then, we propose a novel pushing-away strategy to address overlapping parts, thereby further enhancing the robustness of the method. To verify our work, we conduct extensive experiments and quantitative comparisons with several strong baseline methods, demonstrating the effectiveness of the proposed approach, which even surpasses the supervised learning method. The code has been released on https://github.com/Ruiyuan-Zhang/Zero-Shot-Assembly.
1963: Trace: Structural Riemannian Bridge Matching for Transferable Source Localization in Information Propagation
Authors: Li Sun, Suyang Zhou, Bowen Fang, Hechuan Zhang, Junda Ye, Yutong Ye, Philip S. Yu
Location: Guangzhou | Day: TBD
Source localization, the inverse problem of information diffusion, shows fundamental importance for understanding social dynamics. While achieving notable progress, existing solutions are typically exposed to the risk of error accumulation, and require a large number of observations for effective inference. However, it is often impractical to obtain quantities of observations in real scenarios, highlighting the need for a transferable model with broad applicability. Recently, Riemannian geometry has demonstrated its effectiveness in information diffusion and offers guidance in knowledge transfer, but has yet to be explored in source localization. In light of the issues above, we propose to study transferable source localization from a fresh geometric perspective, and present a novel approach (Trace) on the Riemannian manifold. Concretely, we establish a structural Schrödinger bridge to directly model the map between source and final distributions, where a functional curvature, encapsulating the graph structure, is formulated to govern the Schrödinger bridge and facilitate domain adaptation. Furthermore, we design a simple yet effective learning algorithm for Riemannian Schrödinger bridges (geodesics bridge matching) in which we prove the optimal projection holds for Riemannian measure so that the expensive iterative procedure is avoided. Extensive experiments demonstrate the effectiveness and transferability of Trace on both synthetic and real datasets.
1978: Equitable Mechanism Design for Facility Location
Authors: Toby Walsh
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Agent-based and Multi-agent Systems (1/3)
We consider strategy-proof mechanisms for facility location which maximize equitability between agents. As is common in the literature, we measure equitability with the Gini index. We first prove a simple but fundamental impossibility result that no strategy-proof mechanism can bound the approximation ratio of the optimal Gini index of utilities for one or more facilities. We propose instead computing approximation ratios of the complemented Gini index of utilities, and consider how well both deterministic and randomized mechanisms approximate this. In addition, as Nash welfare is often put forward as an equitable compromise between egalitarian and utilitarian outcomes, we consider how well mechanisms approximate the Nash welfare.
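For readers unfamiliar with the quantities named in this abstract, the sketch below computes the Gini index of a list of agent utilities using the standard mean-absolute-difference formula, together with two related values. It rests on assumptions not confirmed by the abstract: the complemented Gini index is taken to be one minus the Gini index, Nash welfare is taken to be the product of utilities, and the function names and example utilities are hypothetical.

```python
import math


def gini_index(utilities):
    """Gini index: sum of |u_i - u_j| over ordered pairs, over 2 * n * sum(u)."""
    n = len(utilities)
    total = sum(utilities)
    if n == 0 or total == 0:
        return 0.0
    pairwise = sum(abs(a - b) for a in utilities for b in utilities)
    return pairwise / (2 * n * total)


def complemented_gini(utilities):
    """Assumed definition: 1 minus the Gini index (higher is more equitable)."""
    return 1.0 - gini_index(utilities)


def nash_welfare(utilities):
    """Assumed definition: product of the agents' utilities."""
    return math.prod(utilities)


equal = [0.5, 0.5, 0.5]      # perfectly equal utilities
skewed = [0.0, 0.0, 1.0]     # one agent receives all the utility
```

With equal utilities the Gini index is 0, and when a single agent out of n holds all the utility this formula gives (n - 1) / n.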
1980: A Centrality-based Graph Learning Framework
Authors: Jiajun Yu, Zhihao Wu, Jielong Lu, Tianyue Wang, Haishuai Wang
Location: Guangzhou | Day: TBD
Graph Neural Networks (GNNs) have become powerful models for both node- and graph-level tasks. While node-level learning focuses on individual nodes and their local structures, graph-level learning encounters challenges in capturing the global properties of graphs. In this paper, we conduct a theoretical and experimental analysis of existing graph-level learning frameworks and find that these frameworks typically adopt a single-view perspective based solely on node degree, which limits their ability to capture comprehensive graph characteristics.
To address these issues, we propose a multi-view approach that leverages different types of centrality measures to capture diverse aspects of graph structure. We design an attention-based mechanism to adaptively integrate these multiple views, and use it as a readout function to perform weighted summation of node embeddings, termed as Adaptive Centrality Readout (ACRead). ACRead demonstrates enhanced flexibility and effectiveness when integrated with various GNN architectures, outperforming state-of-the-art readout methods, including KerRead and Set Transformer.
Additionally, this multi-view centrality approach can serve as a standalone graph-level learning framework without relying on GNNs, referred to as Adaptive Centrality-based Graph Learning (ACGL), which achieves competitive performance by effectively combining different centrality perspectives.
1990: Bridging Local and Global Knowledge via Transformer in Board Games
Authors: Yan-Ru Ju, Tai-Lin Wu, Chung-Chin Shih, Ti-Rong Wu
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Multidisciplinary Topics and Applications (1/2)
Although AlphaZero has achieved superhuman performance in board games, recent studies reveal its limitations in handling scenarios requiring a comprehensive understanding of the entire board, such as recognizing long-sequence patterns in Go. To address this challenge, we propose ResTNet, a network that interleaves residual and Transformer blocks to bridge local and global knowledge. ResTNet improves playing strength across multiple board games, increasing win rate from 54.6% to 60.8% in 9×9 Go, 53.6% to 60.9% in 19×19 Go, and 50.4% to 58.0% in 19×19 Hex. In addition, ResTNet effectively processes global information and tackles two long-sequence patterns in 19×19 Go, including circular pattern and ladder pattern. It reduces the mean square error for circular pattern recognition from 2.58 to 1.07 and lowers the attack probability against an adversary program from 70.44% to 23.91%. ResTNet also improves ladder pattern recognition accuracy from 59.15% to 80.01%. By visualizing attention maps, we demonstrate that ResTNet captures critical game concepts in both Go and Hex, offering insights into AlphaZero’s decision-making process. Overall, ResTNet shows a promising approach to integrating local and global knowledge, paving the way for more effective AlphaZero-based algorithms in board games. Our code is available at https://rlg.iis.sinica.edu.tw/papers/restnet.
1997: SDDiff: Boosting Radar Perception via Spatial-Doppler Diffusion
Authors: Shengpeng Wang, Xin Luo, Yulong Xie, Wei Wang
Location: Guangzhou | Day: TBD
Point cloud extraction (PCE) and ego velocity estimation (EVE) are key capabilities gaining attention in 3D radar perception. However, existing work typically treats these two tasks independently, which may neglect the interplay between radar’s spatial and Doppler domain features, potentially introducing additional bias. In this paper, we observe an underlying correlation between 3D points and ego velocity, which offers reciprocal benefits for PCE and EVE. To fully unlock this potential, we take the first step to design a Spatial-Doppler Diffusion (SDDiff) model for simultaneous dense PCE and accurate EVE. To seamlessly tailor it to radar perception, SDDiff improves the conventional latent diffusion process in three major aspects. First, we introduce a representation that embodies both spatial occupancy and Doppler features. Second, we design a directional diffusion with radar priors to streamline the sampling. Third, we propose Iterative Doppler Refinement to enhance the model’s adaptability to density variations and ghosting effects. Extensive evaluations show that SDDiff significantly outperforms state-of-the-art baselines, achieving 59% higher EVE accuracy and 4× greater valid generation density while boosting PCE effectiveness and reliability. The code and dataset will be available on https://github.com/StellarEsti/SDDiff.
2001: Large-Scale Trade-Off Curve Computation for Incentive Allocation with Cardinality and Matroid Constraints
Authors: Yu Cong, Chao Xu, Yi Zhou
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Constraint Satisfaction and Optimization (3/3)
We consider a large-scale incentive allocation problem where the entire trade-off curve between budget and profit has to be maintained approximately at all times. The application originally comes from assigning coupons to users of ride-sharing apps, where each user can have a limit on the number of coupons assigned. We consider a more general form, where the coupons for each user form a matroid, and the coupons assigned to each user must be an independent set. We show the entire trade-off curve can be maintained approximately in near real time.
2003: Seeing the Unseen: Composing Outliers for Compositional Zero-Shot Learning
Authors: Chenchen Jing, Mingyu Liu, Hao Chen, Yuling Xi, Xingyuan Bu, Dong Gong, Chunhua Shen
Location: Guangzhou | Day: TBD
Compositional zero-shot learning (CZSL) aims to recognize unseen attribute-object compositions by learning from seen compositions. The distribution shift between unseen compositions and seen compositions poses challenges to CZSL models, especially when test images are mixed with both seen and unseen compositions. The challenge will be addressed more easily if a model can distinguish unseen/seen compositions and treat them with specific recognition strategies. However, identifying images with unseen compositions is non-trivial, considering that unseen compositions are absent in training and usually contain only subtle differences from seen compositions. In this paper, we propose a novel compositional zero-shot learning method called COMO, which composes outliers in training to distinguish seen and unseen compositions and then applies specific strategies to them. Specifically, we compose attribute-object representations for unseen compositions based on primitive representations of training images as outliers to enable the model to identify unseen compositions in inference. At test time, the method distinguishes images containing seen/unseen compositions and uses different weights for composition classification and primitive classification to recognize seen/unseen compositions. Experimental results on three datasets show the effectiveness of our method in both the closed-world setting and the open-world setting.
2020: Automated Strategy Invention for Confluence of Term Rewrite Systems
Authors: Liao Zhang, Fabian Mitterwallner, Jan Jakubuv, Cezary Kaliszyk
Location: Guangzhou | Day: TBD
Term rewriting plays a crucial role in software verification and compiler optimization. With dozens of highly parameterizable techniques developed to prove various system properties, automatic term rewriting tools work in an extensive parameter space. This complexity exceeds human capacity for parameter selection, motivating an investigation into automated strategy invention. In this paper, we focus on confluence of term rewrite systems, and apply AI techniques to invent strategies for automatic confluence proving. Moreover, we randomly generate a large dataset to analyze confluence for term rewrite systems. We improve the state-of-the-art automatic confluence prover CSI: When equipped with our invented strategies, it surpasses its human-designed strategies both on the augmented dataset and on the original human-created benchmark dataset ARI-COPS, proving/disproving the confluence of several term rewrite systems for which no automated proofs were known before.
2035: Enhancing Transferability of Audio Adversarial Example for Both Frequency- and Time-domain
Authors: Zilin Tian, Yunfei Long, Liguo Zhang, Jiahong Zhao
Location: Guangzhou | Day: TBD
Audio adversarial examples impose acoustically imperceptible perturbations to clean audio examples, fooling classification models into producing incorrect results. Transferability is a critical property of audio adversarial examples, making black-box attacks applicable in practice and attracting increasing interest. Despite recent studies achieving transferability across models within the same domain, they consistently fail to achieve transferability across different domains. Given that time-domain and frequency-domain models are the two predominant approaches in audio classification, we observe that adversarial examples generated for one domain demonstrate significantly constrained transferability to the other. To address this limitation, we propose an Adaptive Inter-domain Ensemble (AIE) attack, which integrates transferable adversarial information from both domains and dynamically optimizes their contributions through adaptive weighting, improving the cross-domain transferability of audio adversarial examples. Extensive evaluations on diverse datasets consistently demonstrate that AIE outperforms existing methods, establishing its effectiveness in enhancing adversarial transferability across domains.
2036: Model-Based Closed-Loop Control Algorithm for Stochastic Partial Differential Equation Control
Authors: Peiyan Hu, Haodong Feng, Yue Wang, Zhiming Ma
Location: Guangzhou | Day: TBD
Neural operators have demonstrated promise in modeling and controlling systems governed by Partial Differential Equations (PDEs). Beyond PDEs, Stochastic Partial Differential Equations (SPDEs) play a critical role in modeling systems influenced by randomness, with applications in finance, physics, and beyond. However, controlling SPDE-governed systems remains a significant challenge. On the one hand, the regularity of the system’s state (which can be intuitively understood as smoothness) deteriorates, making modeling and generalization more challenging. On the other hand, this stochasticity also renders control more unstable and thus less accurate. To address this gap, we propose the Model-Based Closed-Loop Control Algorithm (MB-CC), the first model-based closed-loop control method for SPDEs. MB-CC introduces two key innovations to enhance control robustness and efficiency: a Regularity Feature (RF) block and a closed-loop strategy with an operator-encoded policy network. The RF block, inspired by the regularity structure theory of SPDEs, addresses noise-induced irregularities by transforming the network’s input—including the system state and noise-perturbed external forces—into a refined feature space for improved forward prediction. Compared to previous works using regularity features, we introduce a new parameterization, data augmentation, and extend the RF block as a plug-and-play component. Additionally, to achieve closed-loop control, we introduce an operator-encoded policy network to map the current state to optimal control, which integrates physical priors and swiftly makes decisions based on states returned by the environment. We conduct a systematic evaluation of MB-CC on two notable SPDEs, showcasing its effectiveness and efficiency. The ablation studies show its ability to handle stochasticity more effectively.
2038: Partial Label Clustering
Authors: Yutong Xie, Fuchao Yang, Yuheng Jia
Location: Guangzhou | Day: TBD
Partial label learning (PLL) is a significant weakly supervised learning framework, where each training example corresponds to a set of candidate labels and only one label is the ground-truth label. For the first time, this paper investigates the partial label clustering problem, which takes advantage of the limited available partial labels to improve the clustering performance. Specifically, we first construct a weight matrix of examples based on their relationships in the feature space and disambiguate the candidate labels to estimate the ground-truth label based on the weight matrix. Then, we construct a set of must-link and cannot-link constraints based on the disambiguation results. Moreover, we propagate the initial must-link and cannot-link constraints using an adversarial-prior-promoted dual-graph learning approach. Finally, we integrate weight matrix construction, label disambiguation, and pairwise constraint propagation into a joint model to achieve mutual enhancement. We also theoretically prove that a better disambiguated label matrix can help improve clustering performance. Comprehensive experiments demonstrate that our method achieves superior performance compared with state-of-the-art constrained clustering methods, and outperforms PLL and semi-supervised PLL methods when only limited samples are annotated. The code and appendix are publicly available at https://github.com/xyt-ml/PLC.
2041: Privacy Preserving Solution of DCOPs by Local Search
Authors: Shmuel Goldklang, Tal Grinshpoun, Tamir Tassa
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Constraint Satisfaction and Optimization (1/3)
One of the main reasons for solving constraint optimization problems in a distributed manner is maintaining agents’ privacy. Several studies in the past decade devised privacy-preserving versions of Distributed Constraint Optimization Problem (DCOP) algorithms. Some of those algorithms were complete, i.e., finding an optimal solution, while others were incomplete. The main advantage of the incomplete approach is its scalability to large problems. One of the important incomplete paradigms for solving DCOPs is local search. Yet, so far no privacy-preserving algorithm for solving DCOPs by means of local search has been devised. We present P-DSA, a privacy-preserving implementation of the classical local-search algorithm DSA that preserves topology, constraint, and assignment/decision privacy. Comparing its performance to that of P-Max-Sum, which is another privacy-preserving implementation of an incomplete DCOP algorithm, shows that P-DSA is significantly more scalable and yields much better solutions than P-Max-Sum. Therefore, P-DSA emerges as a suitable solution for practitioners addressing large-scale DCOPs with privacy considerations.
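For readers unfamiliar with the local-search algorithm that P-DSA builds on, the following is a minimal sketch of a classical synchronous DSA round loop. It contains no privacy layer at all, so it is not P-DSA; the function names, the activation probability `p`, and the dictionary-based problem encoding are illustrative assumptions rather than anything taken from the paper.

```python
import random

def dsa(domains, constraints, p=0.7, rounds=50, seed=0):
    """Plain (non-private) synchronous DSA sketch.

    domains:     dict mapping variable -> list of candidate values
    constraints: dict mapping (u, v) -> cost function f(value_u, value_v)
    p:           probability of switching to an improving value (assumed)
    """
    rng = random.Random(seed)
    assign = {v: rng.choice(d) for v, d in domains.items()}

    def local_cost(var, val, snapshot):
        # Sum the costs of all constraints that involve `var`.
        total = 0
        for (u, v), f in constraints.items():
            if u == var:
                total += f(val, snapshot[v])
            elif v == var:
                total += f(snapshot[u], val)
        return total

    for _ in range(rounds):
        snapshot = dict(assign)  # every agent sees the previous round's values
        for var, dom in domains.items():
            current = local_cost(var, snapshot[var], snapshot)
            best = min(dom, key=lambda x: local_cost(var, x, snapshot))
            improves = local_cost(var, best, snapshot) < current
            if improves and rng.random() < p:
                assign[var] = best
    return assign

def total_cost(assign, constraints):
    return sum(f(assign[u], assign[v]) for (u, v), f in constraints.items())

# Toy problem: two agents that pay a cost of 1 when they pick the same value.
domains = {"a": [0, 1], "b": [0, 1]}
constraints = {("a", "b"): lambda x, y: 1 if x == y else 0}
result = dsa(domains, constraints)
```

The random activation step is what keeps neighbouring agents from flipping in lockstep forever; a privacy-preserving variant would additionally have to hide the values and costs that this sketch exchanges in the clear.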
2050: fairGNN-WOD: Fair Graph Learning Without Complete Demographics
Authors: Zichong Wang, Fang Liu, Shimei Pan, Jun Liu, Fahad Saeed, Meikang Qiu, Wenbin Zhang
Location: Montreal | Day: August 21st | Time: 11:30 | Session: ETF: Fairness and diversity
Graph Neural Networks (GNNs) have excelled in diverse applications due to their outstanding predictive performance, yet they often overlook fairness considerations, prompting numerous recent efforts to address this societal concern. However, most fair GNNs assume complete demographics by design, which is impractical in most real-world socially sensitive applications due to privacy, legal, or regulatory restrictions. For example, the Consumer Financial Protection Bureau (CFPB) mandates that creditors ensure fairness without requesting or collecting information about an applicant’s race, religion, nationality, sex, or other demographics. To this end, this paper proposes fairGNN-WOD, a first-of-its-kind framework that mitigates unfairness in graph learning without using demographic information. In addition, this paper provides a theoretical perspective on analyzing bias in node representations and establishes the relationship between utility and fairness objectives. Experiments on three real-world graph datasets illustrate that fairGNN-WOD not only outperforms state-of-the-art baselines in achieving fairness but also maintains comparable prediction performance.
2056: HSRMamba: Contextual Spatial-Spectral State Space Model for Single Hyperspectral Image Super-Resolution
Authors: Shi Chen, Lefei Zhang, Liangpei Zhang
Location: Guangzhou | Day: TBD
Mamba has demonstrated exceptional performance in visual tasks due to its powerful global modeling capabilities and linear computational complexity, offering considerable potential in hyperspectral image super-resolution (HSISR). However, in HSISR, Mamba faces challenges as transforming images into 1D sequences neglects the spatial-spectral structural relationships between locally adjacent pixels, and its performance is highly sensitive to input order, which affects the restoration of both spatial and spectral details. In this paper, we propose HSRMamba, a contextual spatial-spectral modeling state space model for HSISR, to address these issues both locally and globally. Specifically, a local spatial-spectral partitioning mechanism is designed to establish patch-wise causal relationships among adjacent pixels in 3D features, mitigating the local forgetting issue. Furthermore, a global spectral reordering strategy based on spectral similarity is employed to enhance the causal representation of similar pixels across both spatial and spectral dimensions. Finally, experimental results demonstrate our HSRMamba outperforms the state-of-the-art methods in quantitative quality and visual results. Code is available at: https://github.com/Tomchenshi/HSRMamba.
2066: Steady-State Strategy Synthesis for Swarms of Autonomous Agents
Authors: Martin Jonáš, Antonín Kučera, Vojtěch Kůr, Jan Mačák
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Agent-based and Multi-agent Systems (1/3)
The steady-state synthesis aims to construct a policy for a given MDP D such that the long-run average frequencies of visits to the vertices of D satisfy given numerical constraints. This problem is solvable in polynomial time, and memoryless policies are sufficient for approximating an arbitrary frequency vector achievable by a general (infinite-memory) policy.

We study the steady-state synthesis problem for multiagent systems, where multiple autonomous agents jointly strive to achieve a suitable frequency vector. We show that the problem for multiple agents is computationally hard (PSPACE or NP hard, depending on the variant), and memoryless strategy profiles are insufficient for approximating achievable frequency vectors. Furthermore, we prove that even evaluating the frequency vector achieved by a given memoryless profile is computationally hard. This reveals a severe barrier to constructing an efficient synthesis algorithm, even for memoryless profiles. Nevertheless, we design an efficient and scalable synthesis algorithm for a subclass of full memoryless profiles, and we evaluate this algorithm on a large class of randomly generated instances. The experimental results demonstrate a significant improvement against a naive algorithm based on strategy sharing.
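As a rough illustration of the quantity being constrained, the sketch below computes the long-run visit frequencies that a memoryless policy induces in a small single-agent MDP. It covers only the single-agent case and assumes the induced Markov chain is ergodic, so that simple power iteration converges; the data layout and function names are hypothetical and not taken from the paper.

```python
def induced_chain(transitions, policy):
    """Markov chain obtained by fixing a memoryless randomized policy.

    transitions[s][a] is a list of next-state probabilities;
    policy[s][a] is the probability of choosing action a in state s.
    """
    n = len(transitions)
    chain = []
    for s in range(n):
        row = [0.0] * n
        for a, p_action in enumerate(policy[s]):
            for t, p_next in enumerate(transitions[s][a]):
                row[t] += p_action * p_next
        chain.append(row)
    return chain

def long_run_frequencies(chain, iters=1000):
    """Power iteration; valid under the assumption that the chain is ergodic."""
    n = len(chain)
    dist = [1.0 / n] * n
    for _ in range(iters):
        dist = [sum(dist[s] * chain[s][t] for s in range(n)) for t in range(n)]
    return dist

# Two states, two actions each: action 0 stays put, action 1 moves to the other state.
transitions = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
policy = [[0.9, 0.1], [0.5, 0.5]]
freq = long_run_frequencies(induced_chain(transitions, policy))
```

Here the policy spends about five sixths of the time in the first state. In the single-agent setting this evaluation is straightforward; the abstract's hardness result says that the analogous evaluation for a memoryless multiagent profile is computationally hard.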
2068: Towards Anytime Retrieval: A Benchmark for Anytime Person Re-Identification
Authors: Xulin Li, Yan Lu, Bin Liu, Jiaze Li, Qinhong Yang, Tao Gong, Qi Chu, Mang Ye, Nenghai Yu
Location: Guangzhou | Day: TBD
In real applications, person re-identification (ReID) is expected to retrieve the target person at any time, including both daytime and nighttime, ranging from short-term to long-term. However, existing ReID tasks and datasets cannot meet this requirement, as they are constrained by available time and only provide training and evaluation for specific scenarios. Therefore, we investigate a new task called Anytime Person Re-identification (AT-ReID), which aims to achieve effective retrieval in multiple scenarios based on variations in time. To address the AT-ReID problem, we collect the first large-scale dataset, AT-USTC, which contains 135k images of individuals wearing multiple clothes captured by RGB and IR cameras. Our data collection spans an entire year, and 270 volunteers were photographed on average 29.1 times across different dates or scenes, 4-15 times more than current datasets, providing conditions for follow-up investigations in AT-ReID. Further, to tackle the new challenge of multi-scenario retrieval, we propose a unified model named Uni-AT, which comprises a multi-scenario ReID (MS-ReID) framework for scenario-specific feature learning, a Mixture-of-Attribute-Experts (MoAE) module to alleviate inter-scenario interference, and a Hierarchical Dynamic Weighting (HDW) strategy to ensure balanced training across all scenarios. Extensive experiments show that our model leads to satisfactory results and exhibits excellent generalization to all scenarios.
2078: InstGAN: Instant Actor-Critic-Driven GAN for De Novo Molecule Generation and Property Optimization
Authors: Huidong Tang, Chen Li, Sayaka Kamei, Yoshihiro Yamanishi, Yasuhiko Morimoto
Location: Guangzhou | Day: TBD
Deep generative models, such as generative adversarial networks (GANs), have been employed for de novo molecular generation in drug discovery. Most prior studies have utilized reinforcement learning (RL) algorithms, particularly Monte Carlo tree search (MCTS), to handle the discrete nature of molecular representations in GANs. However, due to the inherent instability in training GANs and RL models, along with the high computational cost associated with MCTS sampling, MCTS RL-based GANs struggle to scale to large chemical databases. To tackle these challenges, this study introduces a novel GAN based on actor-critic RL with instant and global rewards, called InstGAN, to generate molecules at the token-level with multi-property optimization. Furthermore, maximized information entropy is leveraged to alleviate the mode collapse. The experimental results demonstrate that InstGAN outperforms other baselines, achieves comparable performance to state-of-the-art models, and efficiently generates molecules with multi-property optimization. The code is available at: https://github.com/tang777777/InstGAN.
2090: Screening, Rectifying, and Re-Screening: A Unified Framework for Tuning Vision-Language Models with Noisy Labels
Authors: Chaowei Fang, Hangfei Ma, Zhihao Li, De Cheng, Yue Zhang, Guanbin Li
Location: Guangzhou | Day: TBD
Pre-trained vision-language models have shown remarkable potential for downstream tasks. However, their fine-tuning under noisy labels remains an open problem due to challenges like self-confirmation bias and the limitations of conventional small-loss criteria. In this paper, we propose a unified framework to address these issues, consisting of three key steps: Screening, Rectifying, and Re-Screening. First, a dual-level semantic matching mechanism is introduced to categorize samples into clean, ambiguous, and noisy samples by leveraging both macro-level and micro-level textual prompts. Second, we design tailored pseudo-labeling strategies to rectify noisy and ambiguous labels, enabling their effective incorporation into the training process. Finally, a re-screening step, utilizing cross-validation with an auxiliary vision-language model, mitigates self-confirmation bias and enhances the robustness of the framework. Extensive experiments across ten datasets demonstrate that the proposed method significantly outperforms existing approaches for tuning vision-language pre-trained models with noisy labels.
2098: Inverse Game Theory: An Incenter-Based Approach
Authors: Lvye Cui, Haoran Yu, Pierre Pinson, Dario Paccagnan
Location: Guangzhou | Day: TBD
Estimating player utilities from observed equilibria is crucial for many applications. Existing approaches to tackle this problem are either limited to specific games or do not scale well with the number of players. Our work addresses these issues by proposing a novel utility estimation method for general multi-player non-cooperative games. Our main idea consists in reformulating the inverse game problem as an inverse variational inequality problem and in selecting, among all utility parameters consistent with the data, the so-called incenter. We show that the choice of the incenter can produce parameters that are most robust to the observed equilibrium behaviors. However, its computation is challenging, as the number of constraints in the corresponding optimization problem increases with the number of players and the behavior space size. To tackle this challenge, we propose a loss function-based algorithm, making our method scalable to games with many players or a continuous action space. Furthermore, we show that our method can be extended to incorporate prior knowledge of player utilities, and that it can handle inconsistent data, i.e., data where players do not play exact equilibria. Numerical experiments on three game applications demonstrate that our methods outperform the state of the art. The code, datasets, and supplementary material are available at https://github.com/cuilvye/Incenter-Project.
2101: DcDsDiff: Dual-Conditional and Dual-Stream Diffusion Model for Generative Image Tampering Localization
Authors: Qixian Hao, Shaozhang Niu, Jiwei Zhang, Kai Wang
Location: Guangzhou | Day: TBD
Generative Image Tampering (GIT), due to its high diversity and realism, poses a significant challenge to traditional image tampering localization techniques. Consequently, this paper introduces DcDsDiff, a model based on denoising diffusion probabilistic models, which comprises a Dual-View Conditional Network (DVCN) and a Dual-Stream Denoising Network (DSDN). DVCN provides clues about the tampered areas. It extracts tampering features in the high-frequency view and integrates them with spatial domain features using attention mechanisms. DSDN jointly generates a mask image and a detail image, enhancing the generalization capability of the model against new tampering forms through iterative denoising. A multi-stream interaction mechanism enables the two generative tasks to promote each other, prompting the model to generate localization results that are rich in detail and complete. Experiments show that DcDsDiff outperforms mainstream methods in accurate localization, generalization, extensibility, and robustness. Code page: https://github.com/QixianHao/DcDsDiff-and-GIT10K.
2104: RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations
Authors: Zunhai Su, Hanyu Wei, Zhe Chen, Wang Shen, Linge Li, Huangqi Yu, Kehong Yuan
Location: Guangzhou | Day: TBD
The Key-Value (KV) cache facilitates efficient large language model (LLM) inference by avoiding recomputation of past KVs.
As the batch size and context length increase, the oversized KV caches become a significant memory bottleneck, highlighting the need for efficient compression.
Existing KV quantization methods rely on fine-grained quantization or the retention of a significant portion of high bit-width caches, both of which compromise the compression ratio and often fail to maintain robustness at extremely low average bit-widths.
In this work, we explore the potential of rotation techniques for 2-bit KV quantization and propose RotateKV, which achieves accurate and robust performance through the following innovations:
(i) Outlier-Aware Rotation, which utilizes channel-reordering to adapt the rotations to varying channel-wise outlier distributions without sacrificing the computational efficiency of the fast Walsh-Hadamard transform (FWHT);
(ii) Pre-RoPE Grouped-Head Rotation, which mitigates the impact of rotary position embedding (RoPE) on proposed outlier-aware rotation and further smooths outliers across heads;
(iii) Attention-Sink-Aware Quantization, which leverages the massive activations to precisely identify and protect attention sinks.
RotateKV achieves less than 0.3 perplexity (PPL) degradation with 2-bit quantization on WikiText-2 using LLaMA-2-13B and maintains strong CoT reasoning and long-context capabilities, with less than 1.7% degradation on GSM8K, outperforming existing methods even at lower average bit-widths.
RotateKV also showcases a 3.97× reduction in peak memory usage, supports 5.75× larger batch sizes, and achieves a 2.32× speedup in the decoding stage.
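To give some intuition for why rotating before quantizing can help, here is a small pure-Python sketch: a vector with one outlier channel is quantized to 2 bits directly, and again after an orthonormal fast Walsh-Hadamard transform. This shows only the generic rotation idea. It does not implement RotateKV's channel reordering, pre-RoPE grouped-head rotation, or attention-sink handling, and the toy vector and per-vector min-max quantizer are assumptions made for illustration.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def quantize_2bit(x):
    """Uniform min-max quantization to 4 levels, returned in dequantized form."""
    lo, hi = min(x), max(x)
    step = (hi - lo) / 3 or 1.0  # fall back to 1.0 for a constant vector
    return [lo + round((v - lo) / step) * step for v in x]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Toy channel vector with a single large outlier in the last channel.
x = [0.1, -0.2, 0.05, 0.15, -0.1, 0.2, -0.05, 8.0]

direct_error = mse(x, quantize_2bit(x))

# The orthonormal FWHT is its own inverse, so applying it again undoes the rotation.
restored = fwht(quantize_2bit(fwht(x)))
rotated_error = mse(x, restored)
```

On this toy input the rotation spreads the outlier's energy across all channels, which narrows the quantization range and lowers the reconstruction error relative to direct quantization. Real KV tensors have different statistics, so the size of the gain here should not be read as representative.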
2106: Accurate Sublayer Pruning for Large Language Models by Exploiting Latency and Tunability Information
Authors: Seungcheol Park, Sojin Lee, Jongjin Kim, Jinsik Lee, Hyunjik Jo, U Kang
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Natural Language Processing (2/2)
How can we accelerate large language models (LLMs) without sacrificing accuracy? The slow inference speed of LLMs prevents us from benefiting from their remarkable performance in diverse applications. This is mainly because numerous sublayers are stacked together in LLMs. Sublayer pruning compresses and expedites LLMs by removing unnecessary sublayers. However, existing sublayer pruning algorithms are limited in accuracy since they naively select sublayers to prune, overlooking the different characteristics of each sublayer.
In this paper, we propose SPRINT (Sublayer Pruning with Latency and Tunability Information), an accurate sublayer pruning method for LLMs. SPRINT accurately selects a target sublayer to prune by considering 1) the amount of latency reduction after pruning and 2) the tunability of sublayers. SPRINT iteratively prunes redundant sublayers and swiftly tunes the parameters of remaining sublayers. Experiments show that SPRINT achieves the best accuracy-speedup trade-off, exhibiting up to 23.88%p higher accuracy on zero-shot commonsense reasoning benchmarks compared to existing pruning algorithms.
2112: Q-MiniSAM2: A Quantization-based Benchmark for Resource-Efficient Video Segmentation
Authors: Xuanxuan Ren, Xiangyu Li, Kun Wei, Xu Yang, Yanhua Yang
Location: Guangzhou | Day: TBD
Segment Anything Model 2 (SAM2) is a new-generation, high-precision model for image and video segmentation, offering extensive application prospects across numerous computer vision fields. However, as a large-scale model, its huge memory demands and expensive computing costs pose challenges for practical deployment. This paper presents Q-MiniSAM2, an efficient Quantization-based segmentation benchmark tailored to optimize SAM2 by Minimizing memory consumption and accelerating computations. We begin by applying Post-Training Quantization (PTQ) to SAM2, requiring only a relatively small dataset for network calibration, thereby eliminating the need for retraining. Building upon PTQ, we further introduce a Hierarchy-based Video Quantization method to enhance the model’s capacity to capture video semantics and temporal correlations across different time scales. Furthermore, we observe that SAM2’s memory overhead is predominantly concentrated on processing historical frames, and that redundant cross-attention computations significantly increase memory and computational costs, because frames separated by short time intervals change only imperceptibly. To tackle this issue, an Adaptive Mutual-KV mechanism is proposed to mitigate excessive cross-attention by leveraging inter-frame similarities. Comprehensive experiments demonstrate that the proposed approach achieves superior performance compared to state-of-the-art methods, underscoring its potential for efficient and scalable video segmentation.
2116: ARPDL: Adaptive Relational Prior Distribution Loss as an Adapter for Document-Level Relation Extraction
Authors: Huangming Xu, Fu Zhang, Jingwei Cheng, Xin Li
Location: Guangzhou | Day: TBD
The goal of document-level relation extraction (DocRE) is to identify relations between entities from multiple sentences. As a multi-label classification task, a common approach is to determine whether there are relations for an entity pair by selecting a multi-label classification threshold, with scores of relations above the threshold predicted as positive and the rest as negative. However, we find that predicting multiple relations for entity pairs decreases the predicted scores of positive classes. This could lead to many positive classes being incorrectly predicted as negative. Additionally, our analysis suggests that fitting the distribution of predicted relations to the prior distribution of relations can help improve prediction performance. However, previous studies have not explored or leveraged the prior distribution of relations. To address these issues and findings, we propose, for the first time, the idea of incorporating the relational prior distribution into the loss calculation in DocRE tasks. We propose an Adaptive Relational Prior Distribution Loss (ARPDL), which can adaptively adjust relation prediction scores based on the relational prior distribution. Our designed relational prior distribution component can also be integrated as an adapter into other threshold-based losses to improve prediction performance. Experimental results demonstrate that ARPDL consistently improves the performance of existing DocRE models, achieving new state-of-the-art results. Furthermore, integrating our relational prior distribution adapter into other losses significantly enhances their performance in DocRE tasks, validating the effectiveness and generality of our approach. Code is available at https://github.com/xhm-code/ARPDL.
2117: EFX Feasible Scheduling for Time-dependent Resources
Authors: Jiazhu Fang, Qizhi Fang, Minming Li, Wenjing Liu
Location: Guangzhou | Day: TBD
In this paper, we study a fair resource scheduling problem involving the assignment of a set of interval jobs among a group of heterogeneous machines. Each job is associated with a release time, a deadline, and a processing time. A machine can process a job if the entire processing period falls within the release time and deadline of the job. Each machine can process at most one job at any given time, and different jobs yield different utilities for the machines. The goal is to find a fair and efficient schedule of the jobs. We discuss the compatibility between envy-freeness up to any item (EFX) and various efficiency concepts. Additionally, we present polynomial-time algorithms for various settings.
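For reference, the standard EFX condition can be checked mechanically: no agent may prefer another agent's bundle once any single item is removed from it. The sketch below assumes additive, non-negative utilities and ignores the scheduling side entirely (release times, deadlines, and the one-job-at-a-time constraint), so it illustrates only the fairness notion, not the paper's feasibility model; all names are hypothetical.

```python
def is_efx(utilities, allocation):
    """Check envy-freeness up to any item (EFX) under additive utilities.

    utilities[i][g]: value of item (job) g to agent (machine) i
    allocation[i]:   set of items assigned to agent i
    """
    def value(i, bundle):
        return sum(utilities[i][g] for g in bundle)

    for i in range(len(allocation)):
        own = value(i, allocation[i])
        for j in range(len(allocation)):
            if i == j:
                continue
            # Agent i must not envy j's bundle after removing ANY single item from it.
            for g in allocation[j]:
                if own < value(i, allocation[j] - {g}):
                    return False
    return True

# Two machines with identical valuations over three jobs.
utilities = [[5, 3, 1], [5, 3, 1]]
fair = is_efx(utilities, [{0}, {1, 2}])     # True: 5 vs at most 3, and 4 vs 0
unfair = is_efx(utilities, [{2}, {0, 1}])   # False: 1 < 3 after removing job 0
```

In the paper's setting an allocation would additionally have to be schedulable on each machine, which is where the compatibility questions between EFX and efficiency arise.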
2120: A Correlation Manifold Self-Attention Network for EEG Decoding
Authors: Chen Hu, Rui Wang, Xiaoning Song, Tao Zhou, Xiao-Jun Wu, Nicu Sebe, Ziheng Chen
Location: Guangzhou | Day: TBD
Riemannian neural networks, which generalize the deep learning paradigm to non-Euclidean geometries, have garnered widespread attention across diverse applications in artificial intelligence. Among these, the representative attention models have been studied on various non-Euclidean spaces to geometrically capture the spatiotemporal dependencies inherent in time series data, e.g., electroencephalography (EEG). Recent studies have highlighted the full-rank correlation matrix as an advantageous alternative to the covariance matrix for data representation, owing to its invariance to the scale of variables. Motivated by these advancements, we propose the Correlation Attention Network (CorAtt) tailored for full-rank correlation matrices and implement it under the permutation-invariant and computationally efficient Off-Log and Log-Scaled geometries, respectively. Extensive evaluations on three benchmarking EEG datasets provide substantial evidence for the effectiveness of our introduced CorAtt. The code and supplementary material can be found at https://github.com/ChenHu-ML/CorAtt.
2131: Exploiting Text Semantics for Few and Zero Shot Node Classification on Text-attributed Graph
Authors: Yuxiang Wang, Xiao Yan, Shiyu Jin, Quanqing Xu, Chuang Hu, Yuanyuan Zhu, Bo Du, Jia Wu, Jiawei Jiang
Location: Guangzhou | Day: TBD
A text-attributed graph (TAG) provides a text description for each graph node, and few- and zero-shot node classification on TAGs has many applications in fields such as academia and social networks. Existing work utilizes various graph-based augmentation techniques to train the node and text embeddings, while text-based augmentations are largely unexplored. In this paper, we propose Text Semantics Augmentation (TSA) to improve accuracy by introducing more text semantic supervision signals. Specifically, we design two augmentation techniques, i.e., positive semantics matching and negative semantics contrast, to provide more reference texts for each graph node or text description. Positive semantics matching retrieves texts with similar embeddings to match with a graph node. Negative semantics contrast adds a negative prompt to construct a text description with the opposite semantics, which is contrasted with the original node and text. We evaluate TSA on 5 datasets and compare with 13 state-of-the-art baselines. The results show that TSA consistently outperforms all baselines, and its accuracy improvements over the best-performing baseline are usually over 5%. The code is at https://github.com/wyx11112/TSA.
2135: Unleashing the Semantic Adaptability of Controlled Diffusion Model for Image Colorization
Authors: Xiangcheng Du, Zhao Zhou, Yanlong Wang, Yingbin Zheng, Xingjiao Wu, Peizhu Gong, Cheng Jin
Location: Guangzhou | Day: TBD
Recent data-driven image colorization methods have leveraged pre-trained Text-to-Image (T2I) diffusion models as a generative prior, yet they still suffer from unsatisfactory and inaccurate semantic-level color control. To address these issues, we propose a Semantic Adaptation method (SeAda) that enhances the prior while considering the semantic discrepancy between color and grayscale image pairs. SeAda employs a semantic adapter to produce refined semantic embeddings and a controlled T2I diffusion model to create reasonably colored images. Specifically, the semantic adapter transfers the embedding from the grayscale to the color domain, while the diffusion model utilizes the refined embedding and prior knowledge to achieve realistic and diverse results. We also design a three-stage training strategy to improve semantic comprehension and prior integration for further performance improvement. Extensive experiments on public datasets demonstrate that our method outperforms existing state-of-the-art techniques, yielding superior performance in image colorization.
2157: TsCA: On the Semantic Consistency Alignment via Conditional Transport for Compositional Zero-Shot Learning
Authors: Miaoge Li, Jingcai Guo, Richard Yi Da Xu, Dongsheng Wang, Xiaofeng Cao, Zhijie Rao, Song Guo
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Machine Learning (3/4)
Compositional Zero-Shot Learning (CZSL) aims to recognize novel state-object compositions by leveraging the shared knowledge of their primitive components. Despite considerable progress, effectively calibrating the bias between semantically similar multimodal representations, as well as generalizing pre-trained knowledge to novel compositional contexts, remains an enduring challenge. In this paper, we revisit the conditional transport (CT) theory and its homology to the visual-semantics interaction in CZSL, and further propose a novel Trisets Consistency Alignment framework (dubbed TsCA) that addresses these issues. Concretely, we utilize three distinct yet semantically homologous sets, i.e., patches, primitives, and compositions, to construct pairwise CT costs to minimize their semantic discrepancies. To further ensure consistent transfer within these sets, we implement a cycle-consistency constraint that refines the learning by guaranteeing the feature consistency of the self-mapping during transport flow, regardless of modality. Moreover, we extend the CT plans to an open-world setting, which enables the model to effectively filter out infeasible pairs, thereby speeding up inference as well as increasing accuracy. Extensive experiments are conducted to verify the effectiveness of the proposed method. The code is available at https://github.com/keepgoingjkg/TsCA.
2164: Leveraging Peer-Informed Label Consistency for Robust Graph Neural Networks with Noisy Labels
Authors: Kailai Li, Jiawei Sun, Jiong Lou, Zhanbo Feng, Hefeng Zhou, Chentao Wu, Guangtao Xue, Wei Zhao, Jie Li
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) excel in many applications but struggle when trained with noisy labels, especially as noise can propagate through the graph structure.
Despite recent progress in developing robust GNNs, few methods exploit the intrinsic properties of graph data to filter out noise.
In this paper, we introduce ProCon, a novel framework that identifies mislabeled nodes by measuring label consistency among semantically similar peers, which are determined by feature similarity and graph adjacency.
Mislabeled nodes typically exhibit lower consistency with these peers, a signal we measure using pseudo-labels derived from representational prototypes.
A Gaussian Mixture Model is fitted to the consistency distribution to identify clean samples, which refine prototype quality in an iterative feedback loop.
Experiments on multiple datasets demonstrate that ProCon significantly outperforms state-of-the-art methods, effectively mitigating label noise and enhancing GNN robustness.
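The clean-sample selection step described in this abstract can be illustrated with a small sketch. It assumes each node already has a scalar consistency score in [0, 1]; the scores, the function name `fit_two_gaussians`, and the 0.5 posterior cut-off are hypothetical and not taken from the paper. A two-component 1-D Gaussian mixture is fitted with plain EM, and nodes assigned to the higher-mean component are treated as clean.

```python
import math

def fit_two_gaussians(xs, iters=100, var_floor=1e-3):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (means, posterior of the higher-mean component for each x).
    """
    n = len(xs)
    mean = sum(xs) / n
    spread = max(sum((x - mean) ** 2 for x in xs) / n, var_floor)
    # Deterministic initialisation: one component at each extreme.
    mu, var, w = [min(xs), max(xs)], [spread, spread], [0.5, 0.5]
    resp = [0.5] * n
    for _ in range(iters):
        # E-step: posterior probability that each point belongs to component 1.
        resp = []
        for x in xs:
            dens = [
                w[k] * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            total = dens[0] + dens[1]
            resp.append(dens[1] / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean and variance of both components.
        for k in range(2):
            r = resp if k == 1 else [1.0 - v for v in resp]
            nk = max(sum(r), 1e-12)
            mu[k] = sum(ri * x for ri, x in zip(r, xs)) / nk
            var[k] = max(
                sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / nk, var_floor
            )
            w[k] = nk / n
    if mu[0] > mu[1]:  # make sure the posterior refers to the higher-mean component
        resp = [1.0 - v for v in resp]
    return mu, resp

# Hypothetical peer-consistency scores, one per labelled node.
scores = [0.9, 0.85, 0.95, 0.88, 0.2, 0.15, 0.92, 0.1]
means, posterior = fit_two_gaussians(scores)
is_clean = [p > 0.5 for p in posterior]
```

In an iterative scheme of the kind the abstract describes, the nodes flagged as clean would then be used to recompute the prototypes before the scores are measured again; that loop is not shown here.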
2169: Zero-shot Federated Unlearning via Transforming from Data-Dependent to Personalized Model-Centric
Authors: Wenhan Wu, Huanghuang Liang, Jingling Yuan, Jiawei Jiang, Kanye Ye Wang, Chuang Hu, Xiaobo Zhou, Dazhao Cheng
Location: Guangzhou | Day: TBD
Show Abstract
Federated Unlearning (FU) addresses the "right to be forgotten" in federated learning by removing specific client data’s contribution without retraining from scratch. Existing FU methods are data-dependent: they assume that systems can access the original training data or stored historical parameter updates during unlearning. However, this assumption does not always hold in practice, as users usually request the deletion of client data and historical parameter updates due to privacy concerns or storage limitations. Therefore, it is crucial to develop a zero-shot FU method without such data access. The key challenge is how to distinguish and remove the impact of target clients without data-level information. Motivated by the idea that if we can learn client-specific personalized information from the model instead of data, FU can be model-centric and data-free, we present the first zero-shot FU framework ZeroFU. By embedding client contributions into the model during learning via condition computation, ZeroFU enables the model to possess personalized features for unlearning. The unlearning is achieved using a proposed GAN-based distillation framework that obfuscates the personalized feature of the target client. Evaluations demonstrate its effectiveness in unlearning under non-IID settings.
2173: Training-free Fourier Phase Diffusion for Style Transfer
Authors: Siyuan Zhang, Wei Ma, Libin Liu, Zheng Li, Hongbin Zha
Location: Guangzhou | Day: TBD
Show Abstract
Diffusion models have shown significant potential for image style transfer tasks. However, achieving effective stylization while preserving content in a training-free setting remains a challenging issue due to the tightly coupled representation space and inherent randomness of the models. In this paper, we propose a Fourier phase diffusion model that addresses this challenge. Given that the Fourier phase spectrum encodes an image’s edge structures, we propose modulating the intermediate diffusion samples with the Fourier phase of a content image to conditionally guide the diffusion process. This ensures content retention while fully utilizing the diffusion model’s style generation capabilities. To implement this, we introduce a content phase spectrum incorporation method that aligns with the characteristics of the diffusion process, preventing interference with generative stylization. To further enhance content preservation, we integrate homomorphic semantic features extracted from the content image at each diffusion stage. Extensive experimental results demonstrate that our method outperforms state-of-the-art models in both content preservation and stylization. Code is available at https://github.com/zhang2002forwin/Fourier-Phase-Diffusion-for-Style-Transfer.
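The observation that the Fourier phase carries structural information can be illustrated with a toy sketch. It is not the paper's method: a short 1-D signal stands in for an image, a naive DFT replaces a 2-D FFT, and no diffusion model is involved. The helper names `dft`, `idft` and `combine` and both signals are hypothetical. The sketch keeps the amplitude spectrum of one signal and takes the phase spectrum from another.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a 1-D sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

def combine(amplitude_source, phase_source):
    """Build a signal with the amplitude spectrum of one input and the phase of the other."""
    amplitude = [abs(c) for c in dft(amplitude_source)]
    phase = [cmath.phase(c) for c in dft(phase_source)]
    mixed = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    # Both inputs are real, so the mixed spectrum is conjugate-symmetric and the
    # imaginary part of the inverse transform is numerical noise.
    return [c.real for c in idft(mixed)]

content = [0, 0, 0, 1, 1, 1, 1, 1]                 # a single step edge
style = [0.3, 0.5, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8]   # arbitrary texture
stylized = combine(style, content)
```

Combining a signal with its own phase returns the signal unchanged, which is a quick sanity check for the decomposition.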
2182: Risk-Aware Task Migration for Multiplex Unmanned Swarm Networks in Adversarial Environments
Authors: Kai Di, Tienyu Zuo, Pan Li, Yuanshuang Jiang, Fulin Chen, Yichuan Jiang
Location: Guangzhou | Day: TBD
Show Abstract
With the rapid development and deep integration of artificial intelligence and automation technologies, autonomous unmanned swarms dynamically organize into multiplex network structures based on diverse task requirements in adversarial environments. Frequent task variations lead to load imbalances among agents and between network layers, significantly increasing the risk of enemy detection and destruction. Existing approaches typically simplify multiplex networks into single-layer structures for task scheduling, failing to address these load imbalance issues. Moreover, the coupling between task dynamics and network multiplexity dramatically increases the complexity of designing task migration strategies, and it is proven NP-hard to achieve such load balancing. To address these challenges, this paper proposes a risk-aware task migration method that achieves dynamic load balancing by matching task requirements with both intra-layer agent capabilities and inter-layer swarm capabilities. Simulation results demonstrate that our approach significantly outperforms benchmark algorithms in task completion cost, task completion proportion, and system robustness. In particular, the algorithm achieves solutions statistically indistinguishable from the optimal solutions computed by the CPLEX solver, while exhibiting significantly reduced computational overhead.
2204: DHTAGK: Deep Hierarchical Transitive-Aligned Graph Kernels for Graph Classification
Authors: Xinya Qin, Lu Bai, Lixin Cui, Ming Li, Ziyu Lyu, Hangyuan Du, Edwin Hancock
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we propose a family of novel Deep Hierarchical Transitive-Aligned Graph Kernels (DHTAGK) for graph classification. To this end, we commence by developing a new Hierarchical Aligned Graph Auto-Encoder (HA-GAE) to construct transitive-aligned embedding graphs that encapsulate the structural correspondence information between graphs. The DHTAGK kernels then measure either the Jensen-Shannon Divergence between the adjacency matrices or the Gaussian kernel between the node feature matrices of the embedding graphs. Unlike the classical R-convolution kernels and node-based alignment kernels, the DHTAGK kernels can capture the transitive structural correspondence information and thus ensure the positive definiteness. Furthermore, the HA-GAE enables the DHTAGK kernels to simultaneously reflect both local and global graph structures and identify common structural patterns. Experimental results show that the DHTAGK kernels outperform state-of-the-art graph kernels and deep learning methods on benchmark datasets.
2229: POLO: An LLM-Powered Project-Level Code Performance Optimization Framework
Authors: Jiameng Bai, Ruoyi Xu, Sai Wu, Dingyu Yang, Junbo Zhao, Gang Chen
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: MTA: Software engineering
Show Abstract
Program performance optimization is essential for achieving high execution efficiency, yet it remains a challenging task that requires expertise in both software and hardware.
Large Language Models (LLMs), trained on high-quality code from platforms like GitHub and other open-source sources, have shown promise in generating optimized code for simple snippets. However, current LLM-based solutions often fall short when tackling project-level programs due to the complexity of call graphs and the intricate interactions among functions. In this paper, we emulate the process a human expert might follow when optimizing project-level programs and introduce a three-phase framework POLO (PrOject-Level Optimizer) to address this limitation.
First, we profile the program to identify performance bottlenecks using an iterative weighting algorithm.
Next, we conduct structural analysis by scanning the project and generating a graph that represents the program’s structure.
Finally, two LLM agents collaborate in iterative cycles to rewrite and optimize the code at these hotspots, gradually improving performance.
We conduct experiments on open-source and proprietary projects. The results demonstrate that POLO accurately identifies performance bottlenecks and successfully applies optimizations. Under the O3 compilation flag, the optimized programs achieved speedups ranging from 1.34x to 21.5x.
2234: AKBR: Learning Adaptive Kernel-based Representations for Graph Classification
Authors: Lu Bai, Feifei Qian, Lixin Cui, Ming Li, Hangyuan Du, Yue Wang, Edwin Hancock
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we propose a new model to learn Adaptive Kernel-based Representations (AKBR) for graph classification. Unlike state-of-the-art R-convolution graph kernels that are defined by merely counting any pair of isomorphic substructures between graphs and cannot provide an end-to-end learning mechanism for the classifier, the proposed AKBR approach aims to define an end-to-end representation learning model to construct an adaptive kernel matrix for graphs. To this end, we commence by leveraging a novel feature-channel attention mechanism to capture the interdependencies between different substructure invariants of original graphs. The proposed AKBR model can thus effectively identify the structural importance of different substructures, and compute the R-convolution kernel between pairwise graphs associated with the more significant substructures specified by their structural attentions. Furthermore, the proposed AKBR model employs all sample graphs as the prototype graphs, naturally providing an end-to-end learning architecture between the kernel computation as well as the classifier. Experimental results show that the proposed AKBR model outperforms existing state-of-the-art graph kernels and deep learning methods on standard graph benchmarks.
2243: Sentiment-enhanced Multi-hop Connected Graph Attention Network for Multimodal Aspect-Based Sentiment Analysis
Authors: Linlin Zhu, Heli Sun, Xiaoyong Huang, Qi Zhang, Ruichen Cao, Liang He
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal aspect-based sentiment analysis aims to extract aspects from different data sources and recognize the corresponding sentiments. While current research has broadly focused on syntax relation-driven semantic comprehension, the impact of the importance of different syntactic relations on semantic understanding has not been adequately investigated. To address this issue, we propose a Sentiment-enhanced Multi-hop Connected Graph Attention Network (MCG), aiming to enhance the discriminative capability of the model for sentiments and to delve into the syntactic relationships within the text. Firstly, we design a contrastive sentiment-enhanced pre-training task that expands the diversity and complexity of training samples to improve the recognition of multiple sentiments. Secondly, we construct a multi-hop connected syntactic dependency graph to deeply explore the rich syntactic dependencies in the text and to reveal the differences among syntactic relations. Moreover, we develop a multi-hop connected graph attention mechanism that enables the model to focus on the key syntactic relations within the syntactic structure, thereby enhancing the comprehension and predictive capabilities of the model in multimodal sentiment analysis. Experimental results on two benchmark datasets demonstrate that our method outperforms state-of-the-art methods. The source code is provided in the supplementary materials.
2257: Improved Rank Aggregation Under Fairness Constraint
Authors: Diptarka Chakraborty, Himika Das, Sanjana Dey, Alvin Hong Yao Yan
Location: Montreal | Day: August 21st | Time: 11:30 | Session: ETF: Fairness and diversity
Show Abstract
Aggregating multiple input rankings into a consensus ranking is essential in various fields such as social choice theory, hiring, college admissions, web search, and databases. A major challenge is that the optimal consensus ranking might be biased against individual candidates or groups, especially those from marginalized communities. This concern has led to recent studies focusing on fairness in rank aggregation. The goal is to ensure that candidates from different groups are fairly represented in the top-k positions of the aggregated ranking.

We study this fair rank aggregation problem by considering the Kendall tau as the underlying metric. While we know of a polynomial-time approximation scheme (PTAS) for the classical rank aggregation problem, the corresponding fair variant only possesses a quite straightforward 3-approximation algorithm due to Wei et al., SIGMOD’22, and Chakraborty et al., NeurIPS’22, which finds the closest fair ranking for each input ranking and then simply outputs the best one.

In this paper, we first provide a novel algorithm that achieves (2+ε)-approximation (for any ε > 0), significantly improving over the 3-approximation bound. Next, we provide a 2.881-approximation fair rank aggregation algorithm that works irrespective of the fairness notion, given one can find a closest fair ranking, beating the 3-approximation bound. We complement our theoretical guarantee by performing extensive experiments on various real-world datasets to establish the effectiveness of our algorithm further by comparing it with the performance of state-of-the-art algorithms.
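For readers unfamiliar with the metric, the baseline that this abstract improves on can be sketched as follows. The Kendall tau distance counts the candidate pairs that two rankings order differently. The baseline maps every input ranking to a fair ranking and returns the candidate with the smallest total distance to all inputs. The fairness step is left as a hypothetical placeholder, `closest_fair`, which defaults to the identity; computing the closest fair ranking and the paper's improved algorithms are not shown.

```python
from itertools import combinations

def kendall_tau(r1, r2):
    """Number of candidate pairs that the two rankings order differently."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )

def best_of_inputs(rankings, closest_fair=lambda r: r):
    """Return the (fair version of an) input ranking with the least total distance.

    closest_fair is a placeholder for a routine that returns the closest fair
    ranking to its argument; the identity used by default enforces no fairness.
    """
    candidates = [closest_fair(r) for r in rankings]
    return min(candidates, key=lambda c: sum(kendall_tau(c, r) for r in rankings))

rankings = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["d", "c", "b", "a"],
]
consensus = best_of_inputs(rankings)       # ['a', 'c', 'b', 'd']
total_cost = sum(kendall_tau(consensus, r) for r in rankings)  # 6
```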
2265: A Dual Stream Visual Tokenizer for LLM Image Generation
Authors: Yongqian Li, Yong Luo, Xiantao Cai, Zheng He, Zhennan Meng, Nidong Wang, Yunlin Chen, Zhifei Li
Location: Guangzhou | Day: TBD
Show Abstract
We propose a novel visual tokenizer by combining high-level semantic tokens and low-level pixel tokens to represent images, aiming to address the challenges of image-to-sequence conversion for Large Language Models (LLMs). Existing visual tokenizers, such as VQ-VAE and diffusion-based models, either struggle with token explosion as image resolution increases or fail to capture detailed structural information. Our method introduces a dual-token system: high-level semantic tokens capture the main content of the image, while low-level pixel tokens preserve structural details. By integrating these tokens in a hybrid architecture, we leverage a VQ-VAE branch to generate low-resolution guidance and a diffusion process to reconstruct high-resolution images with both semantic coherence and structural accuracy. This approach significantly reduces the number of required tokens and enhances image reconstruction quality, offering an efficient solution for tasks like image generation and understanding based on LLMs.
2286: Graph Embedded Contrastive Learning for Multi-View Clustering
Authors: Hongqing He, Jie Xu, Guoqiu Wen, Yazhou Ren, Na Zhao, Xiaofeng Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Recently, numerous multi-view clustering (MVC) and multi-view graph clustering (MVGC) methods have been proposed. Despite significant progress, they still face two issues: I) MVC and MVGC are often developed independently for multi-view and multi-graph data. They have redundancy but lack a unified methodology to combine their strengths. II) Contrastive learning is usually adopted to explore the associations across multiple views. However, traditional contrastive losses ignore the neighbor relationship in multi-view scenarios and easily lead to false associations in sample pairs. To address these issues, we propose Graph Embedded Contrastive Learning for Multi-View Clustering. Concretely, we propose a process of view-specific pre-training with adaptive graph convolution to make our method compatible with both multi-view and multi-graph data, which aggregates the graph information into data and leverages autoencoders to learn view-specific representations. Furthermore, to explore the view-cross associations, we introduce the process of view-cross contrastive learning and clustering, where we propose the graph-guided contrastive learning that can generate global graph to mitigate the false association issue as well as the cluster-guided contrastive clustering for improving the model robustness. Finally, extensive experiments demonstrate that our method achieves superior performance on both MVC and MVGC tasks.
2292: Exploiting Self-Refining Normal Graph Structures for Robust Defense against Unsupervised Adversarial Attacks
Authors: Bingdao Feng, Di Jin, Xiaobao Wang, Dongxiao He, Jingyi Cao, Zhen Wang
Location: Guangzhou | Day: TBD
Show Abstract
Defending against adversarial attacks on graphs has become increasingly important. Graph refinement to enhance the quality and robustness of representation learning is a critical area that requires thorough investigation. We observe that representations learned from attacked graphs are often ineffective for refinement due to perturbations that cause the endpoints of perturbed edges to become more similar, complicating the defender’s ability to distinguish them. To address this challenge, we propose a robust unsupervised graph learning framework that utilizes cleaner graphs to learn effective representations. Specifically, we introduce an anomaly detection model based on contrastive learning to obtain a rough graph excluding a large number of perturbed structures. Subsequently, we propose the Graph Pollution Degree (GPD), a mutual information-based measure that leverages the encoder’s representation capability on the rough graph to assess the trustworthiness of the predicted graph and refine the learned representations. Extensive experiments on four benchmark datasets demonstrate that our method outperforms nine state-of-the-art defense models, effectively defending against adversarial attacks and enhancing node classification performance.
2296: GBGC: Efficient and Adaptive Graph Coarsening via Granular-ball Computing
Authors: Shuyin Xia, Guan Wang, Gaojie Xu, Sen Zhao, Guoyin Wang
Location: Guangzhou | Day: TBD
Show Abstract
The objective of graph coarsening is to generate smaller, more manageable graphs while preserving key information of the original graph. Previous work was mainly based on spectrum preservation, using predefined coarsening rules to make the eigenvalues of the Laplacian matrix of the original graph and the coarsened graph match as closely as possible. However, it largely overlooked the fact that the original graph is composed of subregions at different levels of granularity, where highly connected and similar nodes should be more inclined to be aggregated together as nodes in the coarsened graph. By combining the multi-granularity characteristics of the graph structure, we can generate a coarsened graph at the optimal granularity. To this end, inspired by the application of granular-ball computing in multi-granularity, we propose a new multi-granularity, efficient, and adaptive coarsening method via granular-ball computing (GBGC), which significantly improves the coarsening results and efficiency. Specifically, GBGC introduces an adaptive granular-ball graph refinement mechanism, which adaptively splits the original graph from coarse to fine into granular-balls of different sizes and optimal granularity, and constructs the coarsened graph using these granular-balls as supernodes. In addition, compared with other state-of-the-art graph coarsening methods, the processing speed of this method can be increased by tens to hundreds of times, and it has lower time complexity. The accuracy of GBGC is almost always higher than that of the original graph due to the good robustness and generalization of granular-ball computing, so it has the potential to become a standard graph data preprocessing method.
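The last step mentioned in this abstract, building a coarsened graph whose supernodes are groups of original nodes, can be sketched generically. The partition below is written by hand; how GBGC actually splits a graph into granular-balls is not reproduced, and the function name `coarsen` and the edge-count weighting are assumptions made for illustration.

```python
from collections import defaultdict

def coarsen(edges, assignment):
    """Collapse each group of nodes into one supernode.

    edges: iterable of undirected (u, v) pairs
    assignment: dict mapping every node to its supernode id
    Returns {(a, b): weight} with a < b, where weight counts the original
    edges running between the two groups; edges inside a group are dropped.
    """
    weights = defaultdict(int)
    for u, v in edges:
        a, b = assignment[u], assignment[v]
        if a != b:
            weights[(min(a, b), max(a, b))] += 1
    return dict(weights)

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
assignment = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
coarse = coarsen(edges, assignment)  # {('A', 'B'): 1}
```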
2311: CFDONEval: A Comprehensive Evaluation of Operator-Learning Neural Network Models for Computational Fluid Dynamics
Authors: Menghan Liu, Jianhuan Cen, Ziyang Zhou, Haolong Fan, Hongji Li, Ping Wei, Guohang Peng, Changye He, Yuzhe Qin, Yutong Lu, Qingsong Zou
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we introduce CFDONEval, a comprehensive evaluation of 12 operator-learning-based neural network (ON) models to simulate 7 benchmark fluid dynamics problems. These problems cover a range of 2D scenarios, including Darcy flow, two-phase flow, Taylor-Green vortex, lid-driven cavity flow, tube flow, and circular cylinder flow, as well as 3D periodic hill flow. For a rigorous evaluation, we establish 22 fluid dynamics datasets for these benchmark problems, 18 of which are newly generated using traditional numerical methods, such as the finite element method. Our evaluation tackles 5 key challenges: multiscale phenomena, convection dominance, long-term predictions, multiphase flows, and unstructured meshes over complex geometries. We assess computational accuracy, efficiency, and flow field visualization, offering valuable insights into the application of ON models in fluid dynamics research. Our findings show that attention-based models perform well in handling almost all challenges; models with a U-shaped structure excel in handling multiscale problems; and the NU-FNO model demonstrates the smallest relative error in L2 norm when processing nonuniform grid data. The related code, dataset, and appendix are publicly available at: https://github.com/Sysuzqs/CFDNNEval.
2320: Enhanced Graph Similarity Learning via Adaptive Multi-scale Feature Fusion
Authors: Cuifang Zou, Guangquan Lu, Wenzhen Zhang, Xuxia Zeng, Shilong Lin, Longqing Du, Shichao Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Graph similarity computation plays a crucial role in a variety of fields such as chemical molecular structure comparison, social network analysis, and code clone detection. However, due to inadequate feature representation, existing methods often struggle to cope with complex graph structures, which in turn limits the feature fusion capability and leads to low accuracy of similarity computation. To address these issues, this paper introduces an Adaptive Multi-scale Feature Fusion (AMFF) framework. AMFF first enhances feature extraction through a residual graph neural network, which robustly captures key information in complex graph structures. Based on this, a multi-pooled attention network is used to aggregate multi-scale features and accurately extract key node features while minimizing information loss. Finally, the adaptive multi-scale feature fusion mechanism dynamically adjusts the feature fusion weights according to the interactions between nodes and graph embeddings, thus improving the accuracy and sensitivity of similarity computation. Extensive experiments on benchmark datasets including AIDS700nef, LINUX, IMDBMulti, and PTC show that AMFF significantly outperforms existing methods on several metrics. These results confirm the efficiency and robustness of AMFF in graph similarity computation, providing a promising solution for assessing the similarity of complex graph data.
2322: Rolling in Classical Planning with Conditional Effects and Constraints
Authors: Matteo Cardellini, Enrico Giunchiglia
Location: Guangzhou | Day: TBD
Show Abstract
In classical planning, conditional effects (CEs) allow modelling non-idempotent actions, where the resulting state may depend on how many times each action is consecutively repeated.
Though CEs have been widely studied in the literature, no one has ever studied how to exploit rolling, i.e., how to effectively model the consecutive repetition of an action.
In this paper, we fill this void by (i) showing that planning with CEs remains PSPACE-complete even in the limit case of problems with a single action, (ii) presenting a correct and complete planning as satisfiability encoding exploiting rolling while effectively dealing with constraints imposed on the set of reachable states, and (iii) theoretically and empirically showing its substantial benefits.
2323: Simulating Misinformation Diffusion on Social Media Through CoNVaI: A Textual- and Agent-Based Diffusion Model
Authors: Raquel Rodríguez-García, Roberto Centeno, Álvaro Rodrigo
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Agent-based and Multi-agent Systems (2/3)
Show Abstract
Misinformation has experienced increased online diffusion, leveraging strategies such as emotional manipulation to influence users’ opinions. Efforts are underway to develop tools to mitigate its effects, such as misinformation propagation models used to simulate the diffusion of information. There are different approaches within these models, although they show a significant limitation by disregarding the content of the information shared, which is crucial to the diffusion. We consider it the central aspect of modeling information dissemination. To this end, we focus on Agent-Based Modeling due to its suitability to simulate the complex interactions and heterogeneous behaviors observed on social media. We base our approach on a state-of-the-art Agent-Based Model that we modify and extend to account for the texts of the messages shared, focusing on two aspects that influence agents’ decisions: i) the novelty of the content; and ii) its diffusion and behavior over time. To determine whether this content proves informative, we conduct an empirical evaluation using social media data from Twitter. Based on our experimental results, we observe that our textual-based approach reflects information diffusion more realistically than the state of the art, reducing the error regarding real diffusion.
2327: ST-TAR: An Efficient Spatio-Temporal Learning Framework for Traffic Accident Risk Forecasting
Authors: Hongyu Wang, Lisi Chen, Shuo Shang, Peng Han, Christian S. Jensen
Location: Guangzhou | Day: TBD
Show Abstract
Traffic accidents represent a significant concern due to their devastating consequences. The ability to predict future traffic accident risks is of key importance to accident prevention activities in transportation systems. Although existing studies have made substantial efforts to model spatio-temporal correlations, they fall short when it comes to addressing the zero-inflated data issue and capturing spatio-temporal heterogeneity, which reduces their predictive abilities. In addition, improving efficiency is an urgent requirement for traffic accident forecasting. To overcome these limitations, we propose an efficient Spatio-Temporal learning framework for Traffic Accident Risk forecasting (ST-TAR). Taking long-term and short-term data as separate inputs, the ST-TAR model integrates hierarchical multi-view GCN and long short-term cross-attention mechanism to encode spatial dependencies and temporal patterns. We leverage long-term periodicity and short-term proximity for spatio-temporal contrastive learning to capture spatio-temporal heterogeneity. A tailored adaptive risk-level weighted loss function based on efficient locality-sensitive hashing is introduced to alleviate the zero-inflated issue. Extensive experiments on two real-world datasets offer evidence that ST-TAR is capable of advancing state-of-the-art forecasting accuracy with improved efficiency. This makes ST-TAR suitable for applications that require accurate real-time forecasting.
2328: ProMEA: Prompt-driven Expansion and Alignment for Single Domain Generalization
Authors: Yunyun Wang, Yi Guo, Xiaodong Liu, Songcan Chen
Location: Guangzhou | Day: TBD
Show Abstract
In single Domain Generalization (single-DG), data scarcity in the single source domain hampers the learning of invariant features, leading to overfitting to the source domain and poor generalization to unseen target domains. Existing single-DG methods primarily augment the source domain by adversarial generation. However, there are still two key challenges. i) With simple feature perturbation to confuse the classifier, it may generate unnatural samples with semantic ambiguity or distortion. ii) It is still difficult to cover the sufficient shift in a real domain by generating indistinguishable samples from source data, so the learned model inevitably overfits to the single source domain. To this end, we turn to augmenting the domain prompt, considering that text prompt perturbation is easier to generate and generalize.
Then the source domain is expanded with the guidance of augmented text prompts, which are learnable with both semantic consistency and style diversity. Specifically, we propose a ProMpt-driven Expansion and Alignment (ProMEA) method for single-DG, in which a Domain Prompt Expansion module is first developed to expand the single source domain with frequency features of augmented text prompts, where the amplitude spectrum predominantly harbors the domain style information. With source prompts, a Domain Prompt Alignment module is further designed in inference for adapting target samples to the expanded source domains, in order to reduce the domain discrepancy. Finally, empirical results over single-DG benchmarks demonstrate the superiority of our proposal.
2330: DisPIM: Distilling PreTrained Image Models for Generalizable Visuo-Motor Control
Authors: Haitao Wang, Hejun Wu
Location: Guangzhou | Day: TBD
Show Abstract
We introduce DisPIM, a framework that leverages pretrained image models (PIMs) for visuo-motor control. Applying PIMs to visuo-motor control is difficult because of the distribution shift between the visual environmental states and the pretraining datasets. Due to such a distribution shift, fine-tuning PIMs specifically for visuo-motor control may hurt the generalizability of PIMs, while adding additional tunable parameters for specific actions leads to high computational costs. DisPIM addresses these challenges using a novel feature distillation approach, which obtains a compact model that not only inherits the generalization capability of PIMs but also acquires task-specific skills for visuo-motor control. This dual benefit is mainly achieved by means of a target Q-ensemble mechanism, which is inspired by double Q-learning. This Q-ensemble mechanism can adaptively adjust the distillation rate, so as to balance the objectives of generalization and task-specific ability during training. With this balancing mechanism, DisPIM achieves both task-specific and generalizable control at a low computation cost. Across a series of algorithms, task domains, and evaluation metrics in both simulation and on a real robot, DisPIM demonstrates significant improvements in generalization and overall performance with low computational overhead.
2331: HA-SCN: Learning Hierarchical Aligned Subtree Convolutional Networks for Graph Classification
Authors: Xinya Qin, Lu Bai, Lixin Cui, Ming Li, Hangyuan Du, Yue Wang, Edwin Hancock
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we propose a Hierarchical Aligned Subtree Convolutional Network (HA-SCN) for graph classification. Our idea is to transform graphs of arbitrary sizes into fixed-sized aligned graphs and construct a normalized K-layer m-ary subtree for each node in the aligned graphs. By sliding convolutional filters over the entire subtree at each node, we define a novel subtree convolution and pooling operation that hierarchically abstracts node-level information. We demonstrate that the proposed HA-SCN model not only realizes the convolution mechanism similar to the Convolutional Neural Networks (CNNs), which have the characteristics of weight sharing and fixed-sized receptive fields, but also effectively mitigates the over-squashing problem. Meanwhile, it establishes the correspondence information between nodes, alleviating the information loss issue. Experimental results on various benchmark graph datasets show that our approach achieves state-of-the-art performance in graph classification tasks.
2332: Towards Micro-Action Recognition with Limited Annotations: An Asynchronous Pseudo Labeling and Training Approach
Authors: Yan Zhang, Lechao Cheng, Yaxiong Wang, Zhun Zhong, Meng Wang
Location: Guangzhou | Day: TBD
Show Abstract
Micro-Action Recognition (MAR) aims to classify subtle human actions in video. However, annotating MAR datasets is particularly challenging due to the subtlety of actions. To this end, we introduce the setting of Semi-Supervised MAR (SSMAR), where only a portion of the samples is labeled. We first evaluate traditional Semi-Supervised Learning (SSL) methods on SSMAR and find that these methods tend to overfit on inaccurate pseudo-labels, leading to error accumulation and degraded performance. This issue primarily arises from the common practice of directly using the classifier's predictions as pseudo-labels to train the model. To solve this issue, we propose a novel framework, called Asynchronous Pseudo Labeling and Training (APLT), which explicitly separates the pseudo-labeling process from model training. Specifically, we introduce a semi-supervised clustering method during the offline pseudo-labeling phase to generate more accurate pseudo-labels. Moreover, a self-adaptive thresholding strategy is proposed to dynamically filter noisy labels of different classes. We then build a memory-based prototype classifier based on the filtered pseudo-labels, which is fixed and used to guide the subsequent model training phase. By alternating the pseudo-labeling and model training phases in an asynchronous manner, the model can not only be learned with more accurate pseudo-labels but also avoid the overfitting issue. Experiments on three MAR datasets show that APLT outperforms state-of-the-art SSL methods by a large margin. For instance, APLT improves accuracy by 14.5% over FixMatch on the MA-12 dataset when using only 50% labeled data. Code is available at https://github.com/zy-hfut/APLT
2334: TESTN: A Triad-Enhanced Spatio-Temporal Network for Multi-Temporal POI Relationship Inference
Authors: Hongyu Wang, Lisi Chen, Shuo Shang
Location: Guangzhou | Day: TBD
Show Abstract
Multi-temporal Point-of-Interest (POI) relationship inference aims to identify evolving relationships among locations over time, providing critical insights for location-based services. While existing studies have made substantial efforts to model relationships with custom-designed graph neural networks, they face the challenge of leveraging POI contextual information characterized by spatial dependencies and temporal dynamics, as well as capturing the heterogeneity of multi-type relationships. To address these challenges, we propose a Triad-Enhanced Spatio-Temporal Network (TESTN), which conceptualizes triads as interactions between relationships for capturing potential interplay. Specifically, TESTN incorporates the spatial 2-hop aggregation layer to capture geographical and semantic information beyond first-order neighbors and the temporal context extractor to integrate relational dynamics within adjacent time segments. Furthermore, we introduce a self-supervised pairwise neighboring relation consistency detection scheme to preserve the heterogeneity of multi-type relationships. Extensive experiments on three real-world datasets demonstrate the superior performance of our TESTN framework.
2337: Mitigating Message Imbalance in Fraud Detection with Dual-View Graph Representation Learning
Authors: Yudan Song, Yuecen Wei, Yuhang Lu, Qingyun Sun, Minglai Shao, Li-e Wang, Chunming Hu, Xianxian Li, Xingcheng Fu
Location: Guangzhou | Day: TBD
Show Abstract
Graph representation learning has become a mainstream method for fraud detection due to its strong expressive power, which focuses on enhancing node representations through improved neighborhood knowledge capture. However, the focus on local interactions leads to imbalanced transmission of global topological information and increased risk of node-specific information being overwhelmed during aggregation due to the imbalance between fraud and benign nodes. In this paper, we first summarize the impact of topology and class imbalance on downstream tasks in GNN-based fraud detection, as the problem of imbalanced supervisory messages is caused by fraudsters’ topological behavior obfuscation and identity feature concealment. Based on statistical validation, we propose a novel dual-view graph representation learning method to mitigate Message imbalance in Fraud Detection (MimbFD). Specifically, we design a topological message reachability module for high-quality node representation learning to penetrate fraudsters’ camouflage and alleviate insufficient propagation. Then, we introduce a local confounding debiasing module to adjust node representations, enhancing the stable association between node representations and labels to balance the influence of different classes. Finally, we conducted experiments on three public fraud datasets, and the results demonstrate that MimbFD exhibits outstanding performance in fraud detection.
2345: Granular-Ball-Induced Multiple Kernel K-Means
Authors: Shuyin Xia, Yifan Wang, Lifeng Shen, Guoyin Wang
Location: Guangzhou | Day: TBD
Show Abstract
Most existing multi-kernel clustering algorithms, such as multi-kernel K-means, often struggle with computational efficiency and robustness when faced with complex data distributions. These challenges stem from their dependence on point-to-point relationships for optimization, which can make it difficult to accurately capture the inherent structure and diversity of data sets. Additionally, the intricate interplay between multiple kernels in such algorithms can further exacerbate these issues, impairing their ability to cluster data points in high-dimensional spaces. In this paper, we leverage granular-ball computing to improve the multi-kernel clustering framework. The core of granular-ball computing is to adaptively fit the data distribution with balls, from coarse to acceptable levels. Each ball can enclose data points based on a density consistency measurement. Such a ball-based data description thus improves computational efficiency and robustness to unknown noise. Specifically, based on granular-ball representations, we introduce the granular-ball kernel (GBK) and its corresponding granular-ball multi-kernel K-means framework (GB-MKKM) for efficient clustering. Using granular-ball relationships in multiple kernel spaces, the proposed GB-MKKM framework shows its superiority in efficiency and clustering performance in the empirical evaluation of various clustering tasks.
2347: Settling the Complexity of Popularity in Additively Separable and Fractional Hedonic Games
Authors: Martin Bullinger, Matan Gilboa
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Game Theory
Show Abstract
We study coalition formation in the framework of hedonic games. These games model the problem of partitioning a set of agents, each having a preference order over the coalitions they can be part of. A partition is called popular if it does not lose a majority vote among the agents against any other partition. Unfortunately, hedonic games need not admit popular partitions. We go further and settle the complexity of the existence problem concerning popularity in additively separable and fractional hedonic games by showing that it is Σ_2^p-complete in both cases. This is thus the first work to prove a completeness result for popularity at the second level of the polynomial hierarchy.
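The popularity definition in this abstract can be illustrated with a small brute-force check. The sketch below is not taken from the paper: it assumes an additively separable game in which an agent's utility is the sum of the values it assigns to its coalition mates, and the names `partitions`, `utility`, `is_popular`, and the three-agent example are hypothetical. It enumerates every partition, so it is only usable for a handful of agents.

```python
from itertools import combinations


def partitions(agents):
    """Yield every partition of a list of agents as a list of frozensets."""
    if not agents:
        yield []
        return
    first, rest = agents[0], agents[1:]
    # Choose which of the remaining agents share a coalition with `first`.
    for size in range(len(rest) + 1):
        for mates in combinations(rest, size):
            coalition = frozenset((first,) + mates)
            remaining = [a for a in rest if a not in mates]
            for sub in partitions(remaining):
                yield [coalition] + sub


def utility(agent, partition, values):
    """Additively separable utility (assumed): sum of values for coalition mates."""
    coalition = next(c for c in partition if agent in c)
    return sum(values[agent][other] for other in coalition if other != agent)


def is_popular(partition, agents, values):
    """True if no other partition wins a majority vote against `partition`."""
    for rival in partitions(agents):
        for_rival = sum(
            utility(a, rival, values) > utility(a, partition, values) for a in agents
        )
        for_current = sum(
            utility(a, partition, values) > utility(a, rival, values) for a in agents
        )
        if for_rival > for_current:
            return False
    return True


# Hypothetical example: a and b like each other and dislike c; c likes both.
agents = ["a", "b", "c"]
values = {
    "a": {"b": 1, "c": -1},
    "b": {"a": 1, "c": -1},
    "c": {"a": 1, "b": 1},
}
pair_and_single = [frozenset("ab"), frozenset("c")]
grand_coalition = [frozenset("abc")]
print(is_popular(pair_and_single, agents, values))
print(is_popular(grand_coalition, agents, values))
```

In this toy game the partition {a, b}, {c} survives every majority vote, while the grand coalition loses to it because a and b both prefer to exclude c.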
2353: Modality-Fair Preference Optimization for Trustworthy MLLM Alignment
Authors: Songtao Jiang, Yan Zhang, Ruizhe Chen, Tianxiang Hu, Yeying Jin, Qinglin He, Yang Feng, Jian Wu, Zuozhu Liu
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal large language models (MLLMs) have achieved remarkable success across various tasks. However, separate training of visual and textual encoders often results in a misalignment of the modality. Such misalignment may lead models to generate content that is absent from the input image, a phenomenon referred to as hallucination. These inaccuracies severely undermine the trustworthiness of MLLMs in real-world applications. Despite attempts to optimize text preferences to mitigate this issue, our initial investigation indicates that the trustworthiness of MLLMs remains inadequate. Specifically, these models tend to provide preferred answers even when the input image is heavily distorted. Analysis of visual token attention also indicates that the model focuses primarily on the surrounding context rather than the key object referenced in the question. These findings highlight a misalignment between the modalities, where answers inadequately leverage input images. Motivated by our findings, we propose Modality-Fair Preference Optimization (MFPO), which comprises three components: the construction of a multimodal preference dataset in which dispreferred images differ from originals solely in key regions; an image reward loss function encouraging the model to generate answers better aligned with the input images; and an easy-to-hard iterative alignment strategy to stabilize joint modality training. Extensive experiments on three trustworthiness benchmarks demonstrate that MFPO significantly enhances the trustworthiness of MLLMs. In particular, it enables the 7B models to attain trustworthiness levels on par with, or even surpass, those of the 13B, 34B, and larger models.
2372: Towards Recognizing Spatial-temporal Collaboration of EEG Phase Brain Networks for Emotion Understanding
Authors: Jiangfeng Sun, Kaiwen Xue, Qika Lin, Yufei Qiao, Yifan Zhu, Zhonghong Ou, Meina Song
Location: Guangzhou | Day: TBD
Show Abstract
Emotion recognition from EEG signals is crucial for understanding complex brain dynamics. Existing methods typically rely on static frequency bands and graph convolutional networks (GCNs) to model brain connectivity. However, EEG signals are inherently non-stationary and exhibit substantial individual variability, making static-band approaches inadequate for capturing their dynamic properties. Moreover, spatial-temporal dependencies in EEG often lead to feature degradation during node aggregation, ultimately limiting recognition performance. To address these challenges, we propose the Spatial-Temporal Electroencephalograph Collaboration framework (Stella). Our approach introduces an Adaptive Bands Selection module (ABS) that dynamically extracts low- and high-frequency components, generating dual-path features comprising phase brain networks for connectivity modeling and time-series representations for local dynamics. To further mitigate feature degradation, the Fourier Graph Operator (FGO) operates in the spectral domain, while the Spatial-Temporal Encoder (STE) enhances representation stability and density. Extensive experiments on benchmark EEG datasets demonstrate that Stella achieves state-of-the-art performance in emotion recognition, offering valuable insights for graph-based modeling of non-stationary neural signals. The code is available at https://github.com/sun2017bupt/EEGBrainNetwork.
2390: Preventing Latent Diffusion Model-Based Image Mimicry via Angle Shifting and Ensemble Learning
Authors: Minghao Li, Rui Wang, Ming Sun, Lihua Jing
Location: Guangzhou | Day: TBD
Show Abstract
The remarkable progress of Latent Diffusion Models (LDMs) in image generation has raised concerns about the potential for unauthorized image mimicry. To address these concerns, studies on adversarial attacks against LDMs have gained increasing attention in recent years. However, existing methods face bottlenecks when attacking the denoising module. In this work, we reveal that the robustness of the denoising module stems from two key factors: the cancellation effect between adversarial perturbations and estimated noise, and unstable gradients caused by randomly sampled timesteps and Gaussian noise. Based on these insights, we introduce a cosine similarity adversarial loss to prevent the generation of perturbations that are easily impaired and develop a more stable optimization strategy by ensembling gradients and fixing the noise in the latent space. Additionally, we propose an alternating iterative framework to reduce memory usage by mathematically dividing the optimization process into two spaces: latent space and pixel space. Compared to previous strategies, our proposed framework reduces video memory demands without sacrificing attack effectiveness. Extensive experiments demonstrate that the alternating iterative framework and the stable optimization strategy on cosine similarity loss are more efficient and more effective. Code is available at https://github.com/MinghaoLi01/cosattack.
2415: Adversarial Training for Graph Convolutional Networks: Stability and Generalization Analysis
Authors: Chang Cao, Han Li, Yulong Wang, Rui Wu, Hong Chen
Location: Guangzhou | Day: TBD
Show Abstract
Recently, numerous methods have been proposed to enhance the robustness of Graph Convolutional Networks (GCNs), which are vulnerable to adversarial attacks. Despite their empirical success, a significant gap remains in understanding GCNs’ adversarial robustness from the theoretical perspective. This paper addresses this gap by analyzing generalization against both node and structure attacks for multi-layer GCNs through the framework of uniform stability. Under the smoothness assumption of the loss function, we establish the first adversarial generalization bound of GCNs in expectation. Our theoretical analysis contributes to a deeper understanding of how adversarial perturbations and graph architectures influence generalization performance, which provides meaningful insights for designing robust models. Experimental results on benchmark datasets confirm the validity of our theoretical findings, highlighting their practical significance.
2416: Strategies, Credences, and Shannon Entropy: Reasoning about Strategic Uncertainty in Stochastic Environments
Authors: Wojciech Jamroga, Michał Tomasz Godziszewski, Aniello Murano
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Agent-based and Multi-agent Systems (3/3)
Show Abstract
Multi-agent systems (MAS) often include multiple layers of uncertainty. One source comes from agents’ limited ability to observe their environment, while another arises from the unpredictability of natural events and the actions of other agents, which, though uncertain, can be estimated through experiments or past experiences. A central focus in MAS is the agents’ ability to achieve their goals. For intelligent agents, these goals are often epistemic, involving the acquisition of partial or complete knowledge about a crucial fact A. Many such properties can be expressed using PATLK, an extension of probabilistic alternating-time temporal logic (PATL) with knowledge operators, or PATLC that extends PATL with probabilistic beliefs.

In many scenarios, however, the goal of the players is not to achieve high confidence about A being true, but rather to reduce their uncertainty about A (be it true or false). Similarly, in scenarios where the goal is to keep A secret, the outsiders’ uncertainty about A should be maintained above a certain threshold. To capture such properties, we introduce PATLH, a logic extending PATL with information-theoretic modalities based on Shannon entropy. The logic enables the specification of agents’ capabilities concerning the uncertainty of a player about a given set of facts. We define it over multi-agent systems with stochastic transitions and probabilistic imperfect information, capturing two key uncertainties: the agents’ partial observability of their environment and the stochastic nature of state transitions. As technical results, we compare the epistemic and information-theoretic extensions of PATL with respect to their expressiveness, succinctness, and complexity of model checking.
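The distinction drawn above between confidence that A is true and uncertainty about A can be made concrete with the standard Shannon entropy of a binary belief. The sketch below only illustrates that quantity and does not reproduce the semantics of PATLH; the function name `binary_entropy` is hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon entropy, in bits, of a belief assigning probability p to fact A."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # Full certainty either way carries no uncertainty.
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Believing A with probability 0.9 or 0.1 gives the same (low) uncertainty,
# while a 0.5 belief is maximally uncertain.
for belief in (0.1, 0.5, 0.9):
    print(belief, binary_entropy(belief))
```

Entropy is symmetric in p and 1 − p, so a goal stated as "keep entropy below a threshold" is met whether the agent becomes confident that A holds or that it does not, and a secrecy goal corresponds to keeping the outsider's entropy above a threshold.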
2417: Identifying and Reusing Learnwares Across Different Label Spaces
Authors: Jian-Dong Liu, Zhi-Hao Tan, Zhi-Hua Zhou
Location: Guangzhou | Day: TBD
Show Abstract
The learnware paradigm focuses on leveraging numerous established high-performing models to solve machine learning tasks instead of starting from scratch. As the key concept of this paradigm, a learnware consists of a well-trained model of any structure and a specification that characterizes the model’s capabilities, allowing it to be identified and reused for future tasks. Given the existence of numerous real-world models trained on diverse label spaces, effectively identifying and combining these models to address tasks involving previously unseen label spaces represents a critical challenge in this paradigm. In this paper, we make the first attempt to identify and reuse effective learnware combinations for tackling learning tasks across different label spaces, extending their applicability beyond the original purposes of individual learnwares. To this end, we introduce a statistical class-wise specification for establishing similarity relations between various label spaces. Leveraging these relations, we model the utility of a learnware combination as a minimum-cost maximum-flow problem, and further develop fine-grained learnware identification and assembly methods. Extensive experiments with thousands of heterogeneous models validate our approach, demonstrating that reusing identified learnware combinations can outperform both training from scratch and fine-tuning a generic pre-trained model.
2424: SALE-MLP: Structure Aware Latent Embeddings for GNN to Graph-free MLP Distillation
Authors: Harsh Pal, Sarthak Malik, Rajat Patel, Aakarsh Malhotra
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Reinforcement learning (1/2)
Show Abstract
Graph Neural Networks (GNNs), with their ability to effectively handle non-Euclidean data structures, have demonstrated state-of-the-art performance in learning node and graph-level representations. However, GNNs face significant computational overhead due to their message-passing mechanisms, making them impractical for real-time large-scale applications. Recently, Graph-to-MLP (G2M) knowledge distillation has emerged as a promising solution, utilizing MLPs to reduce inference latency. However, existing methods often lack structural awareness (SA), limiting their ability to capture essential graph-specific information. Moreover, some methods require access to large-scale graphs, undermining their scalability. To address these issues, we propose SALE-MLP (Structure-Aware Latent Embeddings for GNN-to-Graph-Free MLP Distillation), a novel graph-free and structure-aware approach that leverages unsupervised structural losses to align the MLP feature space with the underlying graph structure. SALE-MLP neither relies on precomputed GNN embeddings nor requires the graph during inference, making it efficient for real-world applications. Extensive experiments demonstrate that SALE-MLP outperforms existing G2M methods across tasks and datasets, achieving a 3–4% improvement in node classification for inductive settings while maintaining strong transductive performance.
2426: RLMiniStyler: Light-weight RL Style Agent for Arbitrary Sequential Neural Style Generation
Authors: Jing Hu, Chengming Feng, Shu Hu, Ming-Ching Chang, Xin Li, Xi Wu, Xin Wang
Location: Guangzhou | Day: TBD
Show Abstract
Arbitrary style transfer aims to apply the style of any given artistic image to another content image. Still, existing deep learning-based methods often incur significant computational costs to generate diverse stylized results. Motivated by this, we propose RLMiniStyler, a novel reinforcement learning-based framework for arbitrary style transfer. This framework leverages a unified reinforcement learning policy to iteratively guide the style transfer process by exploring and exploiting stylization feedback, generating smooth sequences of stylized results while keeping the model lightweight. Furthermore, we introduce an uncertainty-aware multi-task learning strategy that automatically adjusts loss weights to adapt to the content and style balance requirements at different training stages, thereby accelerating model convergence. Through a series of experiments across various image resolutions, we have validated the advantages of RLMiniStyler over other state-of-the-art methods in generating high-quality, diverse artistic image sequences at a lower cost. Code is available at https://github.com/fengxiaoming520/RLMiniStyler.
2428: Unveiling the Power of Noise Priors: Enhancing Diffusion Models for Mobile Traffic Prediction
Authors: Zhi Sheng, Daisy Yuan, Jingtao Ding, Qi Yan, Xi Zheng, Yue Sun, Yong Li
Location: Montreal | Day: August 19th | Time: 11:30 | Session: ML: Diffusion Models
Show Abstract
Accurate prediction of mobile traffic, i.e., network traffic from cellular base stations, is crucial for optimizing network performance and supporting urban development. However, the non-stationary nature of mobile traffic, driven by human activity and environmental changes, leads to both regular patterns and abrupt variations. Diffusion models excel in capturing such complex temporal dynamics due to their ability to capture the inherent uncertainties. Most existing approaches prioritize designing novel denoising networks but often neglect the critical role of noise itself, potentially leading to sub-optimal performance. In this paper, we introduce a novel perspective by emphasizing the role of noise in the denoising process. Our analysis reveals that noise fundamentally shapes mobile traffic predictions, exhibiting distinct and consistent patterns. We propose NPDiff, a framework that decomposes noise into prior and residual components, with the prior derived from data dynamics, enhancing the model’s ability to capture both regular and abrupt variations. NPDiff can seamlessly integrate with various diffusion-based prediction models, delivering predictions that are effective, efficient, and robust. Extensive experiments demonstrate that it achieves superior performance with an improvement of over 30%, offering a new perspective on leveraging diffusion models in this domain. We provide code and data at https://github.com/tsinghua-fib-lab/NPDiff.
2430: Self-Consistent Model-based Adaptation for Visual Reinforcement Learning
Authors: Xinning Zhou, Chengyang Ying, Yao Feng, Hang Su, Jun Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Visual reinforcement learning agents typically face serious performance declines in real-world applications caused by visual distractions. Existing methods rely on fine-tuning the policy’s representations with hand-crafted augmentations. In this work, we propose Self-Consistent Model-based Adaptation (SCMA), a novel method that fosters robust adaptation without modifying the policy. By transferring cluttered observations to clean ones with a denoising model, SCMA can mitigate distractions for various policies as a plug-and-play enhancement. To optimize the denoising model in an unsupervised manner, we derive an unsupervised distribution matching objective with a theoretical analysis of its optimality. We further present a practical algorithm to optimize the objective by estimating the distribution of clean observations with a pre-trained world model. Extensive experiments on multiple visual generalization benchmarks and real robot data demonstrate that SCMA effectively boosts performance across various distractions and exhibits better sample efficiency.
2440: CMFS: CLIP-Guided Modality Interaction for Mitigating Noise in Multi-Modal Image Fusion and Segmentation
Authors: Guilin Su, Yuqing Huang, Chao Yang, Zhenyu He
Location: Guangzhou | Day: TBD
Show Abstract
Infrared-visible image fusion and semantic segmentation are pivotal tasks for robust scene understanding under challenging conditions such as low light. However, existing methods often struggle with high noise, modality inconsistencies, and inefficient cross-modal interactions, limiting fusion quality and segmentation accuracy. To this end, we propose CMFS, a unified framework that leverages CLIP-guided modality interaction to mitigate noise in multi-modal image fusion and segmentation. Our approach features a region-aware Modal Interaction Alignment module that combines a VMamba-based encoder with an additional shuffle layer to obtain more robust features and a CLIP-guided, regionally constrained multi-modal feature interaction block to emphasize foreground targets while suppressing low-light noise. Additionally, a Frequency-Spatial Collaboration module uses selective scanning and integrates wavelet-, spatial-, and Fourier-domain features to achieve adaptive denoising and balanced feature allocation. Furthermore, we employ a low-rank mixture-of-experts with dynamic routing to improve region-specific fusion and enhance pixel-level accuracy. Extensive experiments on several benchmarks show that, compared with state-of-the-art methods, the proposed approach demonstrates effectiveness in both image fusion quality and semantic segmentation accuracy, especially in complex environments. The source code will be released at IJCAI2025-CMFS.
2446: Exact Algorithms with New Upper Bounds for the Maximum k-plex Problem
Authors: Jiongzhi Zheng, Mingming Jin, Kun He
Location: Guangzhou | Day: TBD
Show Abstract
The Maximum k-plex Problem (MKP) is a degree relaxation of the widely known Maximum Clique Problem. As a practical NP-hard problem, MKP has many important real-world applications, such as the analysis of various complex networks. Branch-and-bound (BnB) algorithms are a well-studied and effective class of exact algorithms for MKP, and the key to BnB algorithms is the bound design. Recent BnB MKP algorithms involve two kinds of upper bounds, based on graph coloring and partition, respectively, which work from different perspectives and are thus complementary. We first propose a new coloring-based upper bound, termed Relaxed Graph Color Bound (RelaxGCB), that significantly outperforms the previous coloring-based upper bound. Then we further propose another new upper bound, termed RelaxPUB, that incorporates RelaxGCB and a partition-based upper bound in a novel way, making use of their complementarity. We apply RelaxGCB and RelaxPUB to state-of-the-art BnB MKP algorithms and produce eight new BnB algorithms. Extensive experiments using diverse k values on hundreds of instances based on dense or massive sparse graphs demonstrate the excellent performance and robustness of our proposed methods.
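The "degree relaxation" mentioned above refers to the standard k-plex definition: a vertex set S is a k-plex if every vertex in S is adjacent to at least |S| − k other vertices of S, so a 1-plex is a clique. The sketch below checks only that definition; it is not the paper's branch-and-bound algorithm or either of its bounds, and the function name and the example graph are hypothetical.

```python
def is_k_plex(vertices, adjacency, k):
    """Check whether `vertices` induce a k-plex in the graph `adjacency`.

    `adjacency` maps each vertex to the set of its neighbours.
    """
    members = set(vertices)
    needed = len(members) - k  # Minimum number of neighbours inside the set.
    return all(len(adjacency[v] & members) >= needed for v in members)


# Hypothetical example: a 4-cycle a-b-c-d-a.
adjacency = {
    "a": {"b", "d"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"a", "c"},
}
print(is_k_plex("abcd", adjacency, 1))  # Clique test on the whole cycle.
print(is_k_plex("abcd", adjacency, 2))  # Each vertex may miss one extra neighbour.
```

On the 4-cycle every vertex has two neighbours among the four, so the whole vertex set is a 2-plex but not a 1-plex (clique).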
2449: Modality-Guided Dynamic Graph Fusion and Temporal Diffusion for Self-Supervised RGB-T Tracking
Authors: Shenglan Li, Rui Yao, Yong Zhou, Hancheng Zhu, Kunyang Sun, Bing Liu, Zhiwen Shao, Jiaqi Zhao
Location: Guangzhou | Day: TBD
Show Abstract
To reduce the reliance on large-scale annotations, self-supervised RGB-T tracking approaches have garnered significant attention. However, the omission of the object region by erroneous pseudo-labels or the introduction of background noise affects the efficiency of modality fusion, while pseudo-label noise triggered by similar-object noise can further affect the tracking performance. In this paper, we propose GDSTrack, a novel approach that introduces dynamic graph fusion and temporal diffusion to address the above challenges in self-supervised RGB-T tracking. GDSTrack dynamically fuses the modalities of neighboring frames, treats them as distractor noise, and leverages the denoising capability of a generative model. Specifically, by constructing an adjacency matrix via an Adjacency Matrix Generator (AMG), the proposed Modality-guided Dynamic Graph Fusion (MDGF) module uses a dynamic adjacency matrix to guide graph attention, focusing on and fusing the object’s coherent regions. Temporal Graph-Informed Diffusion (TGID) models MDGF features from neighboring frames as interference, thereby improving robustness against similar-object noise. Extensive experiments conducted on four public RGB-T tracking datasets demonstrate that GDSTrack outperforms the existing state-of-the-art methods. The source code is available at https://github.com/LiShenglana/GDSTrack.
2452: Trajectory-Dependent Generalization Bounds for Pairwise Learning with φ-mixing Samples
Authors: Liyuan Liu, Hong Chen, Weifu Li, Tieliang Gong, Hao Deng, Yulong Wang
Location: Guangzhou | Day: TBD
Show Abstract
Recently, a mathematical tool from fractal geometry (i.e., fractal dimension) has been employed to investigate the optimization trajectory-dependent generalization ability of some pointwise learning models with independent and identically distributed (i.i.d.) observations. This paper goes beyond the limitations of pointwise learning and i.i.d. samples, and establishes generalization bounds for pairwise learning with uniformly strong mixing samples. The derived theoretical results fill the gap in trajectory-dependent generalization analysis for pairwise learning, and can be applied to a wide range of learning paradigms, e.g., metric learning, ranking, and gradient learning. Technically, our framework brings concentration estimation with Rademacher complexity and trajectory-dependent fractal dimension together in a coherent way for learning theory analysis. In addition, the efficient computation of fractal dimension can be guaranteed for random algorithms (e.g., stochastic gradient descent algorithm for deep neural networks) by bridging topological data analysis tools and the trajectory-dependent fractal dimension.
2453: Self-calibration Enhanced Whole Slide Pathology Image Analysis
Authors: Haoming Luo, Xiaotian Yu, Shengxuming Zhang, Jiabin Xia, Jian Yang, Yuning Sun, Xiuming Zhang, Jing Zhang, Zunlei Feng
Location: Guangzhou | Day: TBD
Show Abstract
Pathology images are considered the “gold standard” for cancer diagnosis and treatment, with gigapixel images providing extensive tissue and cellular information. Existing methods fail to efficiently extract global structural and local detail features at the same time for comprehensive pathology image analysis. To address these limitations, we propose a self-calibration enhanced framework for whole slide pathology image analysis, comprising three components: a global branch, a focus predictor, and a detailed branch. The global branch initially classifies using the pathological thumbnail, while the focus predictor identifies relevant regions for classification based on the last layer features of the global branch. The detailed extraction branch then assesses whether the magnified regions correspond to the lesion area. Finally, a feature consistency constraint between the global and detail branches ensures that the global branch focuses on the appropriate region and extracts sufficient discriminative features for final identification. These focused discriminative features can facilitate the discovery of novel prognostic tumor markers, from the perspective of feature uniqueness and tissue spatial distribution. Extensive experimental results demonstrate that the proposed framework can rapidly deliver accurate and explainable results for pathological grading and prognosis tasks.
2466: RDPA: Real-Time Distributed-Concentrated Penetration Attack for Point Cloud Learning
Authors: Youtong Shi, Lixin Chen, Yu Zang, Chenhui Yang, Cheng Wang
Location: Guangzhou | Day: TBD
Show Abstract
Partial point attack approaches focus on leveraging the fewest points to achieve the best attack efficiency for easy implementation in the physical domain. For the first time, this paper proposes that the partial point attack strategy should pay attention to not only the selection and disturbance of points, but also the penetration of current defense methods. By re-examining the characteristics of previous partial point attack approaches that lead to performance improvement, we discover two fundamental principles: first, the selection of attacked points should consider not only favourable visual salience but also proper position concentration, so as to achieve effective structural destruction while remaining imperceptible; second, the perturbation of target points should form meaningful structures rather than outliers. To achieve this, we first propose a novel distributed-concentrated point selection (DPS) strategy, which more readily concentrates salient points containing rich local information in a few tiny regions. Additionally, to enhance the penetration efficacy and real-time performance of attack point clouds against defenses, we further design a perturbation network based on the multi-scale penetration loss (L_msp), which can generate adversarial samples with as few outliers as possible through only a single forward propagation. Experimental results demonstrate that the real-time distributed-concentrated penetration attack (RDPA) framework can achieve state-of-the-art (SOTA) success rates by perturbing only 3.5% of points, and has the best penetration against mainstream defense methods such as SRS and SOR.
2480: S-EPOA: Overcoming the Indistinguishability of Segments with Skill-Driven Preference-Based Reinforcement Learning
Authors: Ni Mu, Yao Luan, Yiqin Yang, Bo Xu, Qing-Shan Jia
Location: Guangzhou | Day: TBD
Preference-based reinforcement learning (PbRL) stands out by utilizing human preferences as a direct reward signal, eliminating the need for intricate reward engineering. However, despite its potential, traditional PbRL methods are often constrained by the indistinguishability of segments, which impedes the learning process. In this paper, we introduce Skill-Enhanced Preference Optimization Algorithm (S-EPOA), which addresses the segment indistinguishability issue by integrating skill mechanisms into the preference learning framework. Specifically, we first conduct the unsupervised pretraining to learn useful skills. Then, we propose a novel query selection mechanism to balance the information gain and distinguishability over the learned skill space. Experimental results on a range of tasks, including robotic manipulation and locomotion, demonstrate that S-EPOA significantly outperforms conventional PbRL methods in terms of both robustness and learning efficiency. The results highlight the effectiveness of skill-driven learning in overcoming the challenges posed by segment indistinguishability.
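To illustrate why near-identical segments are hard to label, the sketch below assumes a Bradley-Terry preference model over summed segment rewards, which is a common choice in PbRL. The abstract does not state which preference model S-EPOA uses, so the function name and the reward values are hypothetical.

```python
import math

def preference_probability(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred over segment B.

    Assumption: preferences depend only on the summed per-step rewards.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Logistic form of exp(Ra) / (exp(Ra) + exp(Rb)), written to avoid overflow.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

# Two nearly identical segments: the probability sits close to 0.5, so a
# label on this pair says little about the underlying reward.
similar = preference_probability([0.10, 0.20, 0.30], [0.10, 0.20, 0.31])

# Two clearly different segments: the preference is close to certain.
distinct = preference_probability([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
```

Under this assumed model, queries over clearly different behaviours give a much stronger learning signal than queries over near-duplicates.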
2481: Improved Approximation Ratio for Strategyproof Facility Location on a Cycle
Authors: Krzysztof Rogowski, Marcin Dziubiński
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Game Theory and Economic Paradigms
We study the problem of designing strategy-proof in expectation (SP) mechanisms for facility location on a cycle, with the objective of minimizing the sum of costs of n agents. We show that there exists an SP mechanism that attains an approximation ratio of 7/4 with respect to the sum of costs of the agents, thus improving the best known upper bound of 2 – 2/n in the cases of n ≥ 5. The mechanism obtaining the bound randomizes between two mechanisms known in the literature: the Random Dictator (RD) and the Proportional Circle Distance (PCD) mechanism of Meir (2019). To prove the result, we propose a cycle-cutting technique that allows for estimating the problem on a cycle by a problem on a line.
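The following sketch covers only the Random Dictator half of the mechanism, on a hypothetical instance where its expected cost reaches the 2 – 2/n bound. The PCD mechanism and the mixing probabilities are not described in the abstract and are not modelled. The unit circumference and the agent positions are assumptions chosen for illustration.

```python
def cycle_distance(x, y, circumference=1.0):
    """Shortest arc length between two points on a cycle."""
    d = abs(x - y) % circumference
    return min(d, circumference - d)

def social_cost(facility, agents, circumference=1.0):
    """Sum of the agents' distances to the facility."""
    return sum(cycle_distance(facility, a, circumference) for a in agents)

def random_dictator_cost(agents, circumference=1.0):
    """Expected social cost when a uniformly random agent's location is chosen."""
    return sum(social_cost(a, agents, circumference) for a in agents) / len(agents)

def optimal_cost(agents, circumference=1.0):
    # The sum of arc distances is piecewise linear in the facility position
    # and attains its minimum at some agent's location, so it is enough to
    # check those locations.
    return min(social_cost(a, agents, circumference) for a in agents)

# Hypothetical instance with n = 4: three agents share a point and one
# agent sits a quarter of the way around the cycle.
agents = [0.0, 0.0, 0.0, 0.25]
ratio = random_dictator_cost(agents) / optimal_cost(agents)
# ratio is 1.5, which equals 2 - 2/n for n = 4
```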
2483: Online Planning in MDPs with Stochastic Durative Actions
Authors: Tal Berman, Ronen I. Brafman, Erez Karpas
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Planning and Scheduling (2/5)
Stochastic planning problems are typically modeled as Markov Decision Processes, in which actions are assumed to be instantaneous and applied sequentially. Yet, real-world actions often have durations and are applied concurrently. This paper presents an online planning approach that can deal with durative actions with stochastic outcomes. Our approach relies on Monte Carlo Tree Search with a new backpropagation procedure and temporal reasoning techniques that address the need to not only choose which action to execute, but also when to execute it. We also introduce a novel heuristic that combines reasoning about time and probabilities. Overall, we present the first online planner for stochastic temporal planning, solving a richer problem representation than previous work while achieving state-of-the-art empirical results.
2488: Non-expansive Fuzzy ALC
Authors: Stefan Gebhart, Lutz Schröder, Paul Wild
Location: Montreal | Day: August 21st | Time: 10:00 | Session: KRR: Learning and reasoning
Fuzzy description logics serve the representation of vague knowledge, typically letting concepts take truth degrees in the unit interval. Expressiveness, logical properties, and complexity vary strongly with the choice of propositional base. The Łukasiewicz propositional base is generally perceived to have preferable logical properties but often entails high complexity or even undecidability. Contrastingly, the less expressive Zadeh propositional base comes with low complexity but entails essentially no change in logical behaviour compared to the classical case. To strike a balance between these poles, we propose non-expansive fuzzy ALC, in which the Zadeh base is extended with Łukasiewicz connectives where one side is restricted to be a rational constant, that is, with constant shift operators. This allows, for instance, modelling dampened inheritance of properties along roles. We present an unlabelled tableau method for non-expansive fuzzy ALC, which allows reasoning over general TBoxes in EXPTime like in two-valued ALC.
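As an informal illustration of the connectives mentioned above, the sketch below implements the Zadeh operations and the two constant shifts obtained by fixing one argument of the Łukasiewicz conjunction and disjunction to a rational constant. It assumes that "non-expansive" means the truth functions never enlarge the difference between truth degrees; the function names and the test grid are hypothetical and do not come from the paper.

```python
from fractions import Fraction

def zadeh_and(x, y):
    return min(x, y)

def zadeh_or(x, y):
    return max(x, y)

def zadeh_not(x):
    return 1 - x

def shift_down(x, c):
    """Łukasiewicz conjunction with the constant 1 - c: max(x - c, 0)."""
    return max(x - c, Fraction(0))

def shift_up(x, c):
    """Łukasiewicz disjunction with the constant c: min(x + c, 1)."""
    return min(x + c, Fraction(1))

grid = [Fraction(i, 10) for i in range(11)]
c = Fraction(1, 4)

def is_non_expansive(f):
    """Check |f(x) - f(y)| <= |x - y| on the sample grid."""
    return all(abs(f(x) - f(y)) <= abs(x - y) for x in grid for y in grid)

# A constant shift keeps differences between truth degrees bounded.
dampened = is_non_expansive(lambda x: shift_down(x, c))

# The unrestricted Łukasiewicz conjunction of x with itself, max(2x - 1, 0),
# doubles differences near 1 and so fails the check.
unrestricted = is_non_expansive(lambda x: max(2 * x - 1, 0))
```

Read this way, `shift_down(x, c)` gives one possible reading of dampened inheritance: a property holding to degree x is passed on to degree x - c, floored at 0.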
2495: Enhancing Chemical Reaction and Retrosynthesis Prediction with Large Language Model and Dual-task Learning
Authors: Xuan Lin, Qingrui Liu, Hongxin Xiang, Daojian Zeng, Xiangxiang Zeng
Location: Guangzhou | Day: TBD
Chemical reaction and retrosynthesis prediction are fundamental tasks in drug discovery. Recently, large language models (LLMs) have shown potential in many domains. However, directly applying LLMs to these tasks faces two major challenges: (i) the lack of a large-scale chemical synthesis-related instruction dataset; (ii) existing fine-tuning strategies ignore the close correlation between reaction and retrosynthesis prediction. To address these challenges, we propose ChemDual, a novel LLM framework for accurate chemical synthesis. Specifically, considering the high cost of data acquisition for reaction and retrosynthesis, ChemDual regards the reaction-and-retrosynthesis of molecules as a related recombination-and-fragmentation process and constructs a large-scale dataset of 4.4 million instructions. Furthermore, ChemDual introduces an enhanced LLaMA, equipped with a multi-scale tokenizer and dual-task learning strategy, to jointly optimize the process of recombination and fragmentation as well as the tasks between reaction and retrosynthesis prediction. Extensive experiments on Mol-Instruction and USPTO-50K datasets demonstrate that ChemDual achieves state-of-the-art performance in both reaction and retrosynthesis prediction, outperforming the existing conventional single-task approaches and the general open-source LLMs. Through molecular docking analysis, ChemDual generates compounds with diverse and strong protein binding affinity, further highlighting its strong potential in drug design.
2503: Accelerating Diffusion-based Super-Resolution with Dynamic Time-Spatial Sampling
Authors: Rui Qin, Qijie Wang, Ming Sun, Haowei Zhu, Chao Zhou, Bin Wang
Location: Guangzhou | Day: TBD
Diffusion models have gained attention for their success in modeling complex distributions, achieving impressive perceptual quality in super-resolution (SR) tasks. However, existing diffusion-based SR methods often suffer from high computational costs, requiring numerous iterative steps for training and inference. Existing acceleration techniques, such as distillation and solver optimization, are generally task-agnostic and do not fully leverage the specific characteristics of low-level tasks like SR. In this study, we analyze the frequency- and spatial-domain properties of diffusion-based SR methods, revealing key insights into the temporal and spatial dependencies of high-frequency signal recovery. Specifically, high-frequency details benefit from concentrated optimization during early and late diffusion iterations, while spatially textured regions demand adaptive denoising strategies. Building on these observations, we propose the Time-Spatial-aware Sampling strategy (TSS) for the acceleration of Diffusion SR without any extra training cost. TSS combines Time Dynamic Sampling (TDS), which allocates more iterations to refining textures, and Spatial Dynamic Sampling (SDS), which dynamically adjusts strategies based on image content. Extensive evaluations across multiple benchmarks demonstrate that TSS achieves state-of-the-art (SOTA) performance with significantly fewer iterations, improving MUSIQ scores by 0.2 to 3.0 and outperforming the current acceleration methods with only half the number of steps.
2508: KP-PINNs: Kernel Packet Accelerated Physics Informed Neural Networks
Authors: Siyuan Yang, Cheng Song, Zhilu Lai, Wenjia Wang
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Multidisciplinary Topics and Applications (1/2)
Differential equations are involved in modeling many engineering problems. Many efforts have been devoted to solving differential equations. Due to the flexibility of neural networks, Physics Informed Neural Networks (PINNs) have recently been proposed to solve complex differential equations and have demonstrated superior performance in many applications. While the L2 loss function is usually a default choice in PINNs, it has been shown that the corresponding numerical solution is incorrect and unstable for some complex equations. In this work, we propose a new PINNs framework named Kernel Packet accelerated PINNs (KP-PINNs), which gives a new expression of the loss function using the reproducing kernel Hilbert space (RKHS) norm and uses the Kernel Packet (KP) method to accelerate the computation. Theoretical results show that KP-PINNs can be stable across various differential equations. Numerical experiments illustrate that KP-PINNs can solve differential equations effectively and efficiently. This framework provides a promising direction for improving the stability and accuracy of PINNs-based solvers in scientific computing.
2512: Multimodal Cancer Survival Analysis via Hypergraph Learning with Cross-Modality Rebalance
Authors: Mingcheng Qu, Guang Yang, Donglin Di, Tonghua Su, Yue Gao, Yang Song, Lei Fan
Location: Guangzhou | Day: TBD
Multimodal pathology-genomic analysis has become increasingly prominent in cancer survival prediction. However, existing studies mainly utilize multi-instance learning to aggregate patch-level features, neglecting the loss of contextual and hierarchical details within pathology images. Furthermore, the disparity in data granularity and dimensionality between pathology and genomics leads to a significant modality imbalance. The high spatial resolution inherent in pathology data gives it a dominant role, overshadowing genomics in multimodal integration. In this paper, we propose a multimodal survival prediction framework that incorporates hypergraph learning to effectively capture both contextual and hierarchical details from pathology images. Moreover, it employs a modality rebalance mechanism and an interactive alignment fusion strategy to dynamically reweight the contributions of the two modalities, thereby mitigating the pathology-genomics imbalance. Quantitative and qualitative experiments are conducted on five TCGA datasets, demonstrating that our model outperforms advanced methods by over 3.4% in C-Index performance. Code: https://github.com/MCPathology/MRePath.
2522: SyncAnimation: A Real-Time End-to-End Framework for Audio-Driven Human Pose and Talking Head Animation
Authors: Yujian Liu, Shidang Xu, Jing Guo, Dingbin Wang, Zairan Wang, Xianfeng Tan, Xiaoli Liu
Location: Guangzhou | Day: TBD
Generating audio-driven talking avatars remains a significant challenge. Existing methods typically require high computational costs and often lack sufficient facial detail and realism, making them unsuitable for applications that demand high real-time performance and visual quality. Additionally, while some methods can synchronize lip movement, they still face issues with consistency between facial expressions and upper body movement, particularly during silent periods. In this paper, we introduce SyncAnimation, the first NeRF-based method that achieves audio-driven, stable, and real-time generation of speaking avatars by combining generalized audio-to-pose matching and audio-to-expression synchronization. By integrating AudioPose Syncer and AudioEmotion Syncer, SyncAnimation achieves high-precision pose and expression generation, progressively producing audio-synchronized upper body, head, and lip shapes. Furthermore, the High-Synchronization Human Renderer ensures seamless integration of the head and upper body, and achieves audio-synchronized lip movement. The project page can be found at https://syncanimation.github.io.
2523: Denoising Diffusion Models are Good General Gaze Feature Learners
Authors: Guanzhong Zeng, Jingjing Wang, Pengwei Yin, Zefu Xu, Mingyang Zhou
Location: Guangzhou | Day: TBD
Since the collection of labeled gaze data is laborious and time-consuming, methods which can learn generalizable features by leveraging large-scale available unlabeled data are desirable. In recent years, we have witnessed the tremendous capabilities of diffusion models in generating images as well as their potential in feature representation learning. In this paper, we investigate whether they can acquire discriminative representations for gaze estimation via generative pre-training. To achieve this goal, we propose a self-supervised learning framework with diffusion models for gaze estimation, called GazeDiff. Specifically, we utilize a conditional diffusion model to generate a target image with the gaze direction specified by the reference image as the pre-training task. To help the diffusion model learn gaze-related features as the condition, we propose a disentangled feature learning strategy, which first learns the appearance feature, head pose feature, and eye direction feature separately, and then combines them as the conditional features. Extensive experiments demonstrate that denoising diffusion models are also good general gaze feature learners.
2540: An Inverse Optimization Approach to Contextual Inverse Optimization
Authors: Yasunari Hikima, Naoyuki Kamiyama
Location: Montreal | Day: August 20th | Time: 14:00 | Session: ML: Machine Learning 6/8
Contextual Inverse Optimization (CIO) is a generalized framework of the predict-then-optimize approach, also referred to as decision-focused learning or contextual optimization, aiming to learn a model that predicts the unknown parameters of a nominal optimization problem using related covariates without compromising the solution quality. Unlike the predict-then-optimize approach, which assumes access to datasets containing realized unknown parameters, CIO considers a setting where only historical optimal solutions are available. Previous work has primarily focused on CIO under linear programming problems and proposed methods based on optimality conditions. In this study, we propose a general algorithm based on inverse optimization as a more general approach that does not require optimality conditions. To validate its effectiveness, we apply the proposed method to multiple CIO problems and demonstrate that it performs comparably to or better than existing predict-then-optimize methods, even without ground-truth unknown parameters.
2551: An End-to-End Simple Clustering Hierarchical Pooling Operation for Graph Learning Based on Top-K Node Selection
Authors: Zhehan Zhao, Lu Bai, Ming Li, Lixin Cui, Hangyuan Du, Yue Wang, Edwin Hancock
Location: Guangzhou | Day: TBD
Graph Neural Networks (GNNs) are powerful tools for graph learning, but one of the important challenges is how to effectively extract representations for graph-level tasks. In this paper, we propose an end-to-end Simple Clustering Hierarchical Pooling (SCHPool) operation, which is based on Top-K node selection for learning expressive graph representations. Specifically, SCHPool considers each node and its local neighborhood as a cluster, and introduces a novel multi-view scoring function to evaluate node importance. Based on these scores, clusters centered around the Top-K nodes are retained. This design eliminates the need for complex clustering operations, significantly reducing computational overhead. Furthermore, during the coarsening process, SCHPool employs a lightweight yet comprehensive attention mechanism to adaptively aggregate both the node features within clusters and the edge connectivity strengths between clusters. This facilitates the construction of more informative coarsened graphs, enhancing model performance. Experimental results demonstrate the effectiveness of the proposed model.
2552: SIFAR: A Simple Faster Accelerated Variance-Reduced Gradient Method
Authors: Zhize Li
Location: Guangzhou | Day: TBD
In this paper, we propose a simple faster accelerated gradient method called SIFAR for solving finite-sum optimization problems. Concretely, we consider both general convex and strongly convex settings: i) For general convex finite-sum problems, SIFAR improves the previous state-of-the-art result given by Varag. In particular, for large-scale problems or when the convergence error is not very small, SIFAR obtains the first optimal result O(n), matching the lower bound. ii) For strongly convex finite-sum problems, we also show that SIFAR can achieve the optimal convergence rate matching the lower bound. Besides, SIFAR enjoys a simpler loopless algorithmic structure while previous algorithms use double-loop structures. Moreover, we provide a novel dynamic multi-stage convergence analysis, which is the key to improving previous results to the optimal rates. Our new theoretical rates and novel convergence analysis for the fundamental finite-sum problem can directly lead to key improvements for many other related problems, such as distributed/federated/decentralized optimization problems. Finally, the numerical experiments show that SIFAR converges faster than the previous state-of-the-art Varag, validating our theoretical results and confirming the practical superiority of SIFAR.
2554: Label Distribution Learning with Biased Annotations Assisted by Multi-Label Learning
Authors: Zhiqiang Kou, Si Qin, Hailin Wang, Jing Wang, Mingkun Xie, Shuo Chen, Yuheng Jia, Tongliang Liu, Masashi Sugiyama, Xin Geng
Location: Guangzhou | Day: TBD
Multi-label learning (MLL) has gained attention for its ability to represent real-world data. Label Distribution Learning (LDL), an extension of MLL to learning from label distributions, faces challenges in collecting accurate label distributions. To address the issue of biased annotations, based on the low-rank assumption, existing works recover true distributions from biased observations by exploring the label correlations. However, recent evidence shows that the label distribution tends to be full-rank, and naively applying low-rank approximation to biased observations leads to inaccurate recovery and performance degradation. In this paper, we address the LDL with biased annotations problem from a novel perspective, where we first degenerate the soft label distribution into a hard multi-hot label and then recover the true label information for each instance. This idea stems from an insight that assigning hard multi-hot labels is often easier than assigning a soft label distribution, and it shows stronger immunity to noise disturbances, leading to smaller label bias. Moreover, assuming that the multi-label space for predicting label distributions is low-rank offers a more reasonable approach to capturing label correlations. Theoretical analysis and experiments confirm the effectiveness and robustness of our method on real-world datasets.
2571: Multi-Agent Corridor Generating Algorithm
Authors: Arseni Pertzovskiy, Roni Stern, Roie Zivan, Ariel Felner
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Agent-based and Multi-agent Systems (1/3)
In this paper, we propose the Multi-Agent Corridor Generating Algorithm (MACGA) for solving the Multi-agent Pathfinding (MAPF) problem, where a group of agents need to find non-colliding paths to their target locations. Existing approaches struggle to solve dense MAPF instances. In MACGA, the agents build corridors, which are sequences of connected vertices, from current locations towards agents’ goals, and evacuate other agents out of the corridors to avoid collisions and deadlocks. We also present the MACGA+PIBT algorithm, which integrates the well-known rule-based PIBT algorithm into MACGA to improve runtime and solution quality. The proposed algorithms run in polynomial time and have a reachability property, i.e., every agent is guaranteed to reach its goal location at some point. We demonstrate experimentally that MACGA and MACGA+PIBT outperform baseline algorithms in terms of success rate, runtime, and makespan across diverse MAPF benchmark grids.
2587: RoLocMe: A Robust Multi-agent Source Localization System with Learning-based Map Estimation
Authors: Thanh Dat Le, Lyuzhou Ye, Yan Huang
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Agent-based and Multi-agent Systems (3/3)
This paper addresses the source localization problem by introducing RoLocMe, a multi-agent reinforcement learning system that integrates SkipNet – a skip-connection-based RSS estimation model – with parallel Q-learning. SkipNet predicts RSS propagation of the entire search region, enabling agents to explore efficiently. The agents leverage dueling DQN, value decomposition, and λ-returns to learn cooperative policies. RoLocMe converges faster and achieves at least 20% higher success rates than existing methods in dense and sparse reward settings. A drop-one ablation study confirms each component’s importance and RoLocMe’s effectiveness for larger teams.
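For readers unfamiliar with λ-returns, the sketch below shows the standard backward recursion for computing them over a finite trajectory. It is a generic textbook form, not RoLocMe's implementation. The function name, default hyperparameters, and the use of a plain value estimate for bootstrapping are assumptions (in a Q-learning setting the bootstrap value would typically be a maximum over action values).

```python
def lambda_returns(rewards, next_values, gamma=0.99, lam=0.8):
    """Compute G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).

    rewards[t] is the reward at step t and next_values[t] is the bootstrap
    estimate V(s_{t+1}). The final step bootstraps fully from next_values[-1].
    """
    returns = [0.0] * len(rewards)
    next_return = next_values[-1]
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * (
            (1 - lam) * next_values[t] + lam * next_return
        )
        next_return = returns[t]
    return returns

# lam = 1 recovers the full discounted return; lam = 0 recovers one-step targets.
monte_carlo = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
one_step = lambda_returns([1.0, 1.0, 1.0], [5.0, 5.0, 5.0], gamma=0.5, lam=0.0)
```

Intermediate values of `lam` trade the low bias of long returns against the low variance of bootstrapped targets.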
2590: Diffusion-aware Censored Gaussian Processes for Demand Modelling
Authors: Filipe Rodrigues
Location: Montreal | Day: August 19th | Time: 11:30 | Session: ML: Diffusion Models
Inferring the true demand for a product or a service from aggregate data is often challenging due to the limited available supply, thus resulting in observations that are censored and correspond to the realized demand, thereby not accounting for the unsatisfied demand. Censored regression models are able to account for the effect of censoring due to the limited supply, but they don’t consider the effect of substitutions, which may cause the demand for similar alternative products or services to increase. This paper proposes Diffusion-aware Censored Demand Models, which combine a Tobit likelihood with a graph-based diffusion process in order to model the latent process of transfer of unsatisfied demand between similar products or services. We instantiate this new class of models under the framework of GPs and, based on both simulated and real-world data for modeling sales, bike-sharing demand, and EV charging demand, demonstrate its ability to better recover the true demand and produce more accurate out-of-sample predictions.
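The Tobit likelihood mentioned above can be sketched as follows for the upper-censoring case, where an observation equal to the available supply only tells us that demand was at least that large. This is a minimal standalone version with a fixed latent mean per observation. The graph-based diffusion of unsatisfied demand and the GP prior from the paper are not modelled, and all function and argument names are hypothetical.

```python
import math

def normal_logpdf(y, mean, sigma):
    """Log density of N(mean, sigma^2) at y."""
    z = (y - mean) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def normal_logsf(y, mean, sigma):
    """log P(Y >= y) for Y ~ N(mean, sigma^2).

    Note: this simple form underflows for y far above the mean.
    """
    z = (y - mean) / sigma
    return math.log(0.5 * math.erfc(z / math.sqrt(2)))

def tobit_loglik(observed, censored, latent_mean, sigma):
    """Sum the Gaussian log density for uncensored points and the log
    survival function for points censored from above."""
    total = 0.0
    for y, is_censored, f in zip(observed, censored, latent_mean):
        if is_censored:
            # Supply ran out at y, so the true demand is at least y.
            total += normal_logsf(y, f, sigma)
        else:
            total += normal_logpdf(y, f, sigma)
    return total

# A censored observation exactly at the latent mean has probability 1/2.
example = tobit_loglik([10.0], [True], [10.0], sigma=2.0)
```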
2591: Towards Fairness with Limited Demographics via Disentangled Learning
Authors: Zichong Wang, Anqi Wu, Nuno Moniz, Shu Hu, Bart Knijnenburg, Xingquan Zhu, Wenbin Zhang
Location: Montreal | Day: August 20th | Time: 10:00 | Session: AI Ethics, Trust, Fairness (1/3)
Fairness in artificial intelligence has garnered increasing attention due to concerns about discriminatory AI-based decision-making, prompting the development of numerous mitigation approaches. However, most existing methods assume that demographic information is readily available, which may not align with real-world scenarios where such information is often incomplete. To this end, this paper tackles the pervasive yet overlooked challenge of developing fair machine learning algorithms with limited demographics. Specifically, we explore leveraging limited demographic information to accurately infer missing demographics while simultaneously evaluating and optimizing model fairness. We argue that this approach better aligns with common real-world socially sensitive scenarios involving limited demographics. Extensive experiments on three benchmark datasets highlight the effectiveness of the proposed method, surpassing state-of-the-art with significant gains in fairness while maintaining comparable utility.
2595: Probabilistic Multimodal Learning with von Mises-Fisher Distributions
Authors: Peng Hu, Yang Qin, Yuanbiao Gou, Yunfan Li, Mouxing Yang, Xi Peng
Location: Guangzhou | Day: TBD
Multimodal learning is pivotal for the advancement of artificial intelligence, enabling machines to integrate complementary information from diverse data sources for holistic perception and understanding. Despite significant progress, existing methods struggle with challenges such as noisy inputs, noisy correspondence, and the inherent uncertainty of multimodal data, limiting their reliability and robustness. To address these issues, this paper presents a novel Probabilistic Multimodal Learning framework (PML) that models each data point as a von Mises-Fisher (vMF) distribution, effectively capturing intrinsic uncertainty and enabling robust fusion. Unlike traditional Gaussian-based models, PML learns directional representation with a concentration parameter to quantify reliability directly, enhancing stability and interpretability. To enhance discrimination, we propose a von Mises-Fisher Prototypical Contrastive Learning paradigm (vMF-PCL), which projects data onto a hypersphere by pulling within-class samples closer to their class prototype while pushing between-class prototypes apart, adaptively learning the reliability estimations. Building upon the estimated reliability, we develop a Reliable Multimodal Fusion mechanism (RMF) that dynamically adjusts the contribution and conflict of each modality, ensuring robustness against noisy data, noisy correspondence, and uncertainty. Extensive experiments on nine benchmarks demonstrate the superiority of PML, consistently outperforming 14 state-of-the-art methods. Code is available at https://github.com/XLearning-SCU/2025-IJCAI-PML.
2597: Imitation Learning via Focused Satisficing
Authors: Rushit N. Shah, Nikolaos Agadakos, Synthia Sasulski, Ali Farajzadeh, Sanjiban Choudhury, Brian Ziebart
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Robotics
Imitation learning often assumes that demonstrations are close to optimal according to some fixed, but unknown, cost function. However, according to satisficing theory, humans often choose acceptable behavior based on their personal (and potentially dynamic) levels of aspiration, rather than achieving (near-) optimality. For example, a lunar lander demonstration that successfully lands without crashing might be acceptable to a novice despite being slow or jerky. Using a margin-based objective to guide deep reinforcement learning, our focused satisficing approach to imitation learning seeks a policy that surpasses the demonstrator’s aspiration levels (defined over trajectories or portions of trajectories) on unseen demonstrations without explicitly learning those aspirations. We show experimentally that this focuses the policy to imitate the highest quality (portions of) demonstrations better than existing imitation learning methods, providing much higher rates of guaranteed acceptability to the demonstrator, and competitive true returns on a range of environments.
2603: FAST: A Lightweight Mechanism Unleashing Arbitrary Client Participation in Federated Learning
Authors: Zhe Li, Seyedsina Nabavirazavi, Bicheng Ying, Sitharama Iyengar, Haibo Yang
Location: Guangzhou | Day: TBD
Federated Learning (FL) provides a flexible distributed platform where numerous clients with high data and system heterogeneity can collaborate to learn a model. While previous research has shown that FL can handle diverse data, it often assumes fully idealized conditions. In practice, real-world factors make it hard to predict or design individual client participation. This complexity results in an unknown participation pattern – arbitrary client participation (ACP). Hence, the key open problem is to understand the impact of client participation and develop a lightweight mechanism to support ACP in FL. In this paper, we first empirically investigate the influence of client participation in FL, revealing that FL algorithms are adversely impacted by ACP. To alleviate the impact, we propose a lightweight solution, Federated Average with Snapshot (FAST), that supports almost arbitrary client participation for FL and can seamlessly integrate with other classic FL algorithms. Specifically, FAST requires clients to take a snapshot once in a while and facilitates ACP for the majority of training processes. We prove that the convergence rates of FAST in non-convex and strongly-convex cases match those under ideal client participation. Furthermore, we empirically introduce an adaptive strategy to dynamically configure the snapshot frequency, tailored to accommodate diverse FL systems. Extensive experiments show that FAST significantly improves performance under ACP and high data heterogeneity.
2614: Human-Readable Neuro-Fuzzy Networks from Frequent Yet Discernible Patterns in Reward-Based Environments
Authors: John Wesley Hostetter, Adittya Soukarjya Saha, Md Mirajul Islam, Tiffany Barnes, Min Chi
Location: Montreal | Day: August 21st | Time: 10:00 | Session: KRR: Learning and reasoning
We propose self-organizing and simplifying neuro-fuzzy networks (NFNs) to yield transparent human-readable policies by exploiting fuzzy information granulation and graph theory. Deriving from social network analysis, we retain only the frequent-yet-discernible (FYD) patterns in NFNs and apply them to reward-based scenarios. The effectiveness of NFNs from FYD patterns is shown in classic control and a real-world classroom using an intelligent tutoring system to teach students.
2619: Interval Selection with Binary Predictions
Authors: Christodoulos Karavasilis
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Planning and Scheduling (4/5)
Following a line of work that takes advantage of vast machine-learned data to enhance online algorithms with (possibly erroneous) information about future inputs, we consider predictions in the context of deterministic algorithms for the problem of selecting a maximum weight independent set of intervals arriving on the real line. We look at two weight functions, unit (constant) weights, and weights proportional to the interval’s length. In the classical online model of irrevocable decisions, no algorithm can achieve constant competitiveness. In this setting, we show that a simple algorithm that is faithful to the predictions is optimal, and achieves an objective value of at least OPT – η, with η being the total error in the predictions, both for unit, and proportional weights.
When revocable acceptances (a form of preemption) are allowed, the optimal deterministic algorithm for unit weights is 2k-competitive, where k is the number of different interval lengths. We give an algorithm with performance OPT − η (and therefore 1-consistent), that is also (2k + 1)-robust. For proportional weights, there is an optimal (2φ + 1)-competitive algorithm, where φ is the golden ratio. We present an algorithm with parameter λ > 1 that is 3λ / (λ – 1) -consistent, and (4λ^2 + 2λ) / (λ – 1)-robust. Although these bounds are not tight, we show that for λ > 3.42 we achieve consistency better than the optimal online guarantee, while maintaining bounded robustness.
We conclude with some experimental results on real-world data that complement our theoretical findings, and show the benefit of prediction algorithms for online interval selection, even in the presence of high error.
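The break-even value of λ quoted above can be checked with a few lines of arithmetic. The sketch below uses only the bounds stated in the abstract; the function names are hypothetical.

```python
import math

phi = (1 + math.sqrt(5)) / 2
online_guarantee = 2 * phi + 1  # optimal online competitive ratio, about 4.236

def consistency(lam):
    """Consistency bound 3λ / (λ - 1) for λ > 1."""
    return 3 * lam / (lam - 1)

def robustness(lam):
    """Robustness bound (4λ^2 + 2λ) / (λ - 1) for λ > 1."""
    return (4 * lam**2 + 2 * lam) / (lam - 1)

# Solving 3λ / (λ - 1) = 2φ + 1 for λ gives the break-even parameter,
# which is roughly 3.427.
threshold = online_guarantee / (online_guarantee - 3)
```

Above this threshold the consistency bound falls below 2φ + 1. For example, λ = 4 gives a consistency of 4 and a robustness of 24, so robustness remains bounded.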
2627: Parallel Belief Contraction via Order Aggregation
Authors: Jake Chandler, Richard Booth
Location: Guangzhou | Day: TBD
The standard “serial” (aka “singleton”) model of belief contraction models the manner in which an agent’s corpus of beliefs responds to the removal of a single item of information. One salient extension of this model introduces the idea of “parallel” (aka “package” or “multiple”) change, in which an entire set of items of information are simultaneously removed. Existing research on the latter has largely focussed on single-step parallel contraction: understanding the behaviour of beliefs after a single parallel contraction. It has also focussed on generalisations to the parallel case of serial contraction operations whose characteristic properties are extremely weak. Here we consider how to extend serial contraction operations that obey stronger properties. Potentially more importantly, we also consider the iterated case: the behaviour of beliefs after a sequence of parallel contractions. We propose a general method for extending serial iterated belief change operators to handle parallel change based on an n-ary generalisation of Booth & Chandler’s TeamQueue binary order aggregators.
2629: Parallel Belief Revision via Order Aggregation
Authors: Jake Chandler, Richard Booth
Location: Guangzhou | Day: TBD
Show Abstract
Despite efforts to better understand the constraints that operate on single-step parallel (aka “package”, “multiple”) revision, very little work has been carried out on how to extend the model to the iterated case. A recent paper by Delgrande & Jin outlines a range of relevant rationality postulates. While many of these are plausible, they lack an underlying unifying explanation. We draw on recent work on iterated parallel contraction to offer a general method for extending serial iterated belief revision operators to handle parallel change. This method, based on a family of order aggregators known as TeamQueue aggregators, provides a principled way to recover the independently plausible properties that can be found in the literature, without yielding the more dubious ones.
2632: Balans: Multi-Armed Bandits-based Adaptive Large Neighborhood Search for Mixed-Integer Programming Problems
Authors: Junyang Cai, Serdar Kadioğlu, Bistra Dilkina
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Constraint Satisfaction and Optimization (2/3)
Show Abstract
Mixed-integer programming (MIP) is a powerful paradigm for modeling and solving various important combinatorial optimization problems. Recently, learning-based approaches have shown a potential to speed up MIP solving via offline training that then guides important design decisions during the search. However, a significant drawback of these methods is their heavy reliance on offline training, which requires collecting training datasets and computationally costly training epochs yet offers only limited generalization to unseen (larger) instances. In this paper, we propose Balans, an adaptive meta-solver for MIPs with online learning capability that does not require any supervision or a priori training. At its core, Balans is based on adaptive large-neighborhood search, operating on top of an MIP solver by successive applications of destroy and repair neighborhood operators. During the search, the selection among different neighborhood definitions is guided on the fly for the instance at hand via multi-armed bandit algorithms. Our extensive experiments on hard optimization instances show that Balans offers significant performance gains over the default MIP solver, is better than committing to any single best neighborhood, and improves over the state-of-the-art large-neighborhood search for MIPs. Finally, we release Balans as highly configurable, MIP-solver-agnostic, open-source software.
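The idea of choosing among neighborhoods with a multi-armed bandit during the search can be illustrated with a minimal sketch. This is not the Balans implementation or its API: the epsilon-greedy policy, the arm names, the binary reward and the toy destroy-and-repair step (which only simulates an objective improvement instead of calling an MIP solver) are all assumptions made for illustration.

```python
import random

# Hypothetical neighbourhoods with made-up average improvements (toy data).
TOY_GAINS = {"neighbourhood_a": 0.2, "neighbourhood_b": 1.0, "neighbourhood_c": 0.5}


class EpsilonGreedyBandit:
    """Epsilon-greedy bandit that tracks the mean reward of each arm."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Explore with probability epsilon, otherwise exploit the best mean.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[arm])

    def update(self, arm, reward):
        # Incremental update of the running mean reward.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def toy_destroy_repair(arm, incumbent, rng):
    """Stand-in for one destroy-and-repair step (minimisation, no real solver)."""
    improvement = max(0.0, rng.gauss(TOY_GAINS[arm], 0.5))
    return incumbent - improvement


def adaptive_search(iterations=200, seed=0):
    rng = random.Random(seed)
    bandit = EpsilonGreedyBandit(TOY_GAINS, epsilon=0.1, seed=seed)
    incumbent = 100.0  # hypothetical starting objective value
    for _ in range(iterations):
        arm = bandit.select()
        candidate = toy_destroy_repair(arm, incumbent, rng)
        # Assumed reward scheme: 1 if the incumbent improved, else 0.
        bandit.update(arm, 1.0 if candidate < incumbent else 0.0)
        incumbent = min(incumbent, candidate)
    return incumbent, bandit
```

Calling `adaptive_search()` returns the final toy objective and the bandit, whose `counts` show how often each hypothetical neighbourhood was tried.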
2637: Combining Deep Reinforcement Learning and Search with Generative Models for Game-Theoretic Opponent Modeling
Authors: Zun Li, Marc Lanctot, Kevin R. McKee, Luke Marris, Ian Gemp, Daniel Hennes, Paul Muller, Kate Larson, Yoram Bachrach, Michael P. Wellman
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Agent-based and Multi-agent Systems (3/3)
Show Abstract
Opponent modeling methods typically involve two crucial steps: building a belief distribution over opponents’ strategies, and exploiting this opponent model by playing a best response. However, existing approaches typically require domain-specific heuristics to come up with such a model, and algorithms for approximating best responses are hard to scale in large, imperfect information domains.

In this work, we introduce a scalable and generic multiagent training regime for opponent modeling using deep game-theoretic reinforcement learning. We first propose Generative Best Response (GenBR), a best response algorithm based on Monte-Carlo Tree Search (MCTS) with a learned deep generative model that samples world states during planning. This new method scales to large imperfect information domains and can be used in a plug-and-play fashion in a variety of multiagent algorithms. We use this new method under the framework of Policy Space Response Oracles (PSRO), to automate the generation of an offline opponent model via iterative game-theoretic reasoning and population-based training. We propose using solution concepts based on bargaining theory to build up an opponent mixture, which we find identifies profiles that are near the Pareto frontier. Then GenBR keeps updating an online opponent model and reacts against it during gameplay. We conduct behavioral studies where human participants negotiate with our agents in Deal-or-No-Deal, a class of bilateral bargaining games. Search with generative modeling finds stronger policies during both training time and test time, enables online Bayesian co-player prediction, and can produce agents that achieve comparable social welfare and Nash bargaining score negotiating with humans as humans trading among themselves.
2639: Consensus-Guided Incomplete Multi-view Clustering via Cross-view Affinities Learning
Authors: Qian Liu, Huibing Wang, Jinjia Peng, Yawei Chen, Mingze Yao, Xianping Fu, Yang Wang
Location: Guangzhou | Day: TBD
Show Abstract
Incomplete multi-view clustering (IMC) has garnered substantial attention due to its capacity to handle unlabeled data. Existing methods predominantly explore pairwise consistency between every two views. However, such consistency is highly susceptible to missing samples and outliers within a certain view and thus deviates from the true clustering distribution. Moreover, dual-view interaction neglects the collaboration effects of multiple views, making it challenging to capture the holistic characteristics across views. In response to these issues, we propose a novel Consensus-Guided Incomplete Multi-view Clustering via Cross-view Affinities Learning (CAL). Specifically, CAL reconstructs views with available instances to mine sample-wise affinities and harness comprehensive content information within views. Subsequently, to extract clean structural information, CAL imposes a structured sparse constraint on the representation tensor to eliminate biased errors. Furthermore, by integrating the consensus representation into a representation tensor, CAL can employ high-order interaction of multiple views to depict the semantic correlation between views while acquiring a unified structural graph across multiple views. Extensive experiments on seven benchmark datasets demonstrate that CAL outperforms some state-of-the-art methods in clustering performance. The code is available at https://github.com/whbdmu/CAL.
2642: A Multi-view Fusion Approach for Enhancing Speech Signals via Short-time Fractional Fourier Transform
Authors: Zikun Jin, Yuhua Qian, Xinyan Liang, Haijun Geng
Location: Guangzhou | Day: TBD
Show Abstract
Deep learning-based speech enhancement (SE) methods focus on reconstructing speech from the time or frequency domain. However, these domains cannot provide enough information to capture the dynamics of non-stationary signals accurately. To enrich information, this work proposes a multi-view fusion SE method (MFSE). Specifically, MFSE extends the representation space of speech to the dynamic domain (also called fractional domain) between the time and frequency domains by using the short-time fractional Fourier transform (STFrFT). Subsequently, we construct inputs as modes of the primary short-time Fourier transform (STFT) spectrum and the auxiliary STFrFT spectrum views and adaptively identify the optimal fractional STFrFT spectrum from the infinitely continuous fractional domain by leveraging the average spectral centroids. The framework extracts potential features through multiple designed convolutional modules and captures the correlation between different speech frequencies through multi-granularity attention.
Experimental results show that the proposed method significantly improves performance in several metrics compared to existing single-channel SE methods based on time and frequency domains. Furthermore, the results of its generalizability evaluation show that the multi-view method outperforms the single-view method under a wide range of SNR conditions.
2648: Enhancing Counterfactual Estimation: A Focus on Temporal Treatments
Authors: Xin Wang, Shengfei Lyu, Kangyang Luo, Lishan Yang, Huanhuan Chen, Chunyan Miao
Location: Guangzhou | Day: TBD
Show Abstract
In the medical field, treatment sequences significantly influence future outcomes through complex temporal interactions. Therefore, highlighting the role of temporal treatments within the model is crucial for accurate counterfactual estimation, which is often overlooked in current methods. To address this, we employ Koopman theory, known for its capability to model complex dynamic systems, and introduce a novel model named the Counterfactual Temporal Dynamics Network via Neural Koopman Operators (CTD-NKO). This model utilizes Koopman operators to encapsulate sequential treatment data, aiming to capture the causal dynamics within the system induced by temporal interactions between treatments. Moreover, CTD-NKO implements a weighting strategy that aligns joint and marginal distributions of the system state and the current treatment to mitigate time-varying confounding bias. This deviates from the balanced representation strategy employed by existing methods, as we demonstrate that such a strategy may suffer from the potential information loss of historical treatments. These designs allow CTD-NKO to exploit treatment information more thoroughly and effectively, resulting in superior performance on both synthetic and real-world datasets.
2661: Prototype-based Optimal Transport for Out-of-Distribution Detection
Authors: Ao Ke, Wenlong Chen, Chuanwen Feng, Yukun Cao, Xike Xie, S. Kevin Zhou, Lei Feng
Location: Guangzhou | Day: TBD
Show Abstract
Detecting Out-of-Distribution (OOD) inputs is crucial for improving the reliability of deep neural networks in the real-world deployment. In this paper, inspired by the inherent distribution shift between in-distribution (ID) and OOD data, we propose a novel method that leverages optimal transport to measure the distribution discrepancy between test inputs and ID prototypes. The resulting transport costs are used to quantify the individual contribution of each test input to the overall discrepancy, serving as a desirable measure for OOD detection. To address the issue that solely relying on the transport costs to ID prototypes is inadequate for identifying OOD inputs closer to ID data, we generate virtual outliers to approximate the OOD region via linear extrapolation. By combining the transport costs to ID prototypes with the costs to virtual outliers, the detection of OOD data near ID data is emphasized, thereby enhancing the distinction between ID and OOD inputs. Extensive evaluations demonstrate the superiority of our method over state-of-the-art methods.
2663: Variational Offline Multi-agent Skill Discovery
Authors: Jiayu Chen, Tian Lan, Vaneet Aggarwal
Location: Guangzhou | Day: TBD
Show Abstract
Skills are effective temporal abstractions established for sequential decision making, which enable efficient hierarchical learning for long-horizon tasks and facilitate multi-task learning through their transferability. Despite extensive research, research gaps remain in multi-agent scenarios, particularly for automatically extracting subgroup coordination patterns in a multi-agent task. To fill this gap, we propose two novel auto-encoder schemes, VO-MASD-3D and VO-MASD-Hier, which simultaneously capture subgroup- and temporal-level abstractions and form multi-agent skills, and which are the first to address the aforementioned challenge. An essential algorithm component of these schemes is a dynamic grouping function that can automatically detect latent subgroups based on agent interactions in a task. Further, our method can be applied to offline multi-task data, and the discovered subgroup skills can be transferred across relevant tasks without retraining. Empirical evaluations on StarCraft tasks indicate that our approach significantly outperforms existing hierarchical multi-agent reinforcement learning (MARL) methods. Moreover, skills discovered using our method can effectively reduce the learning difficulty in MARL scenarios with delayed and sparse reward signals. The codebase is available at: https://github.com/LucasCJYSDL/VOMASD.
2666: POMP: Pathology-omics Multimodal Pre-training Framework for Cancer Survival Prediction
Authors: Suixue Wang, Shilin Zhang, Huiyuan Lai, Weiliang Huo, Qingchen Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Cancer survival prediction is an important direction in precision medicine, aiming to help clinicians tailor treatment regimens for patients. With the rapid development of high-throughput sequencing and computational pathology technologies, survival prediction has shifted from clinical features to joint modeling of multi-omics data and pathology images. However, existing multimodal learning methods struggle to effectively learn pathology-omics interactions due to the lack of proper alignment of multimodal data before fusion. In this paper, we propose POMP, a pathology-omics multimodal pre-training framework jointly learned with three training tasks for integrating pathological images and omics data for cancer survival prediction. To better perform cross-modal learning, we introduce a pathology-omics contrastive learning method to align the pathology and omics information. POMP leverages the principle of pre-trained models and explores the benefit of aligning multimodal information from the same patient, achieving state-of-the-art results on six cancer datasets from the Cancer Genome Atlas (TCGA). We also show that our contrastive learning method allows us to exploit the cosine similarity of pathological images and omics data as the survival risk score, which can further boost prediction performance compared with other commonly used methods. The code is available at https://github.com/SuixueWang/POMP.
2667: How to Mitigate Information Loss in Knowledge Graphs for GraphRAG: Leveraging Triple Context Restoration and Query-Driven Feedback
Authors: Manzong Huang, Chenyang Bu, Yi He, Xindong Wu
Location: Guangzhou | Day: TBD
Show Abstract
Knowledge Graph (KG)-augmented Large Language Models (LLMs) have recently propelled significant advances in complex reasoning tasks, thanks to their broad domain knowledge and contextual awareness. Unfortunately, current methods often assume KGs to be complete, which is impractical given the inherent limitations of KG construction and the potential loss of contextual cues when converting unstructured text into entity-relation triples.
In response, this paper proposes the Triple Context Restoration and Query-driven Feedback (TCR-QF) framework, which reconstructs the textual context underlying each triple to mitigate information loss, while dynamically refining the KG structure by iteratively incorporating query-relevant missing knowledge.
Experiments on five benchmark question-answering datasets substantiate the effectiveness of TCR-QF in KG and LLM integration, where it achieves a 29.1% improvement in Exact Match and a 15.5% improvement in F1 over its state-of-the-art GraphRAG competitors. The code is publicly available at https://github.com/HFUT-DMiC-Lab/TCR-QF.git.
2676: Inconsistency-Based Federated Active Learning
Authors: Chen-Chen Zong, Tong Jin, Sheng-Jun Huang
Location: Guangzhou | Day: TBD
Show Abstract
Federated learning (FL) enables distributed collaborative learning across local clients while preserving data privacy. However, its practical application in weakly supervised learning (WSL), where only a small subset of data is labeled, remains underexplored. Active learning (AL) is a promising solution for label-limited scenarios, but its adaptation to federated settings presents unique challenges, such as data heterogeneity and noise. In this paper, we propose Inconsistency-based Federated Active Learning (IFAL), a novel approach to address these challenges. First, we introduce a data-driven probability formulation that aligns the biases between local and global models in heterogeneous FL settings. Next, to mitigate noise, we propose an inter-model inconsistency criterion that filters out noisy examples and focuses on those with beneficial prediction discrepancies. Additionally, we introduce an intra-model inconsistency criterion to query examples that help refine the model’s decision boundaries. By combining these strategies with clustering, IFAL effectively selects a diverse and informative query set. Extensive experiments on benchmark datasets demonstrate that IFAL outperforms state-of-the-art methods.
2687: CSF-GAN: Cross-modal Semantic Fusion-based Generative Adversarial Network for Text-guided Image Inpainting
Authors: Shilin Zhang, Suixue Wang, Qingchen Zhang, Liang Zhao, Weiliang Huo, Sijia Hou, Chunjiang Fu
Location: Guangzhou | Day: TBD
Show Abstract
Most visual-guided image inpainting methods based on generative adversarial networks (GANs) struggle when the missing region has weak correlations with the surrounding visual context. Recently, diffusion-based methods guided by textual context have been proposed to address this limitation by leveraging additional semantic information to restore corrupted objects. However, these models typically involve more parameters and exhibit slower generation speeds compared to GAN-based approaches. To address this problem, we propose a novel text-guided image inpainting model, the cross-modal semantic fusion generative adversarial network (CSF-GAN). CSF-GAN is designed as a one-stage GAN with the following key contributions. First, a novel semantic fusion module (SFM) is introduced to integrate sentence- and word-level textual context into the inpainting process, enabling more effective guidance from multi-granularity semantic information. Second, a newly designed word-level local discriminator provides detailed feedback to the generator, enhancing the accuracy of generated content in alignment with word-level semantics. Third, two loss functions, the inpainting loss and edge loss, are employed to enhance both structural coherence and textural realism in the generated results. Extensive experiments on two benchmark datasets demonstrate that CSF-GAN outperforms state-of-the-art methods.
2693: K-Buffers: A Plug-in Method for Enhancing Neural Fields with Multiple Buffers
Authors: Haofan Ren, Zunjie Zhu, Xiang Chen, Ming Lu, Rongfeng Lu, Chenggang Yan
Location: Guangzhou | Day: TBD
Show Abstract
Neural fields are now the central focus of research in 3D vision and computer graphics. Existing methods mainly focus on various scene representations, such as neural points and 3D Gaussians. However, few works have studied the rendering process to enhance the neural fields. In this work, we propose a plug-in method named K-Buffers that leverages multiple buffers to improve the rendering performance. Our method first renders K buffers from scene representations and constructs K pixel-wise feature maps. Then, we introduce a K-Feature Fusion Network (KFN) to merge the K pixel-wise feature maps. Finally, we adopt a feature decoder to generate the rendering image. We also introduce an acceleration strategy to improve rendering speed and quality. We apply our method to well-known radiance field baselines, including neural point fields and 3D Gaussian Splatting (3DGS). Extensive experiments demonstrate that our method effectively enhances the rendering performance of neural point fields and 3DGS.
2695: From Sparse to Complete: Semantic Understanding Based on Stroke Evolution in On-the-fly Sketch-based Image Retrieval
Authors: Yingge Liu, Dawei Dai, Xiangling Hou, Shilin Zhao, Guoyin Wang
Location: Guangzhou | Day: TBD
Show Abstract
In contrast with human sketching, which pre-conceptualizes outlines and features, conventional sketch retrieval models rely primarily on pixel-level processing and feature extraction, limiting their ability to capture early sketch intent. Consequently, these models are susceptible to subjective stroke noise, reducing retrieval accuracy. To address this issue, we propose a novel on-the-fly noise stroke retrieval framework designed to align with human sketch-drawing cognition. The proposed framework introduces two core innovations. (i) A stroke consistency detection module that effectively discriminates and suppresses noise strokes by quantifying the structural similarity between the current stroke and the target image, as well as its alignment with key skeletal components. (ii) An adaptive gated mixture of experts module that dynamically selects and integrates features from multiple expert networks during the early, sparse stages of sketching, thereby capturing relevant information with greater precision. Experimental results across diverse sketch datasets demonstrate that the proposed method effectively identifies and suppresses early noise strokes, significantly enhances sketch retrieval performance, and exhibits strong robustness across varying sketch styles.
2696: Strategyproofness and Monotone Allocation of Auction in Social Networks
Authors: Yuhang Guo, Dong Hao, Bin Li, Mingyu Xiao, Bakh Khoussainov
Location: Guangzhou | Day: TBD
Show Abstract
Strategyproofness in network auctions requires that bidders not only report their valuations truthfully, but also do their best to invite neighbours from the social network. In contrast to canonical auctions, where the value-monotone allocation in Myerson’s Lemma is a cornerstone, a general principle of allocation rules for strategyproof network auctions is still missing. We show that, due to the absence of such a principle, even extensions to multi-unit network auctions with single-unit demand present unexpected difficulties, and all pioneering works fail to be strategyproof.
For the first time in this field, we identify two categories of monotone allocation rules on networks: Invitation-Depressed Monotonicity (ID-MON) and Invitation-Promoted Monotonicity (IP-MON). They encompass all existing allocation rules of network auctions as specific instances. For any given ID-MON or IP-MON allocation rule, we characterize the existence and sufficient conditions for the strategyproof payment rules, and show that among all such payment rules, the revenue-maximizing one exists and is computationally feasible.
With these results, the obstacle of combinatorial network auction with single-minded bidders is now resolved.
2698: Higher-order Logical Knowledge Representation Learning
Authors: Suixue Wang, Weiliang Huo, Shilin Zhang, Qingchen Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Real-world knowledge graphs abound with higher-order logical relations that simple triples, limited to pairwise connections, fail to represent. Thus, capturing higher-order logical relations involving multiple entities has garnered significant attention. However, existing methods ignore the structural information in higher-order relations. To this end, we propose a higher-order logical knowledge representation learning method, named LORE, which leverages network motifs, the patterns/subgraphs that naturally capture the structural information in graphs, to extract higher-order features and ultimately, learn effective representations of knowledge graphs. Compared to existing approaches, LORE aggregates the attribute features of entities with the extracted higher-order logical relations to form enhanced representations of knowledge graphs. In particular, three aggregators (i.e., Hadamard, Connection, and Summation) are proposed and employed. Extensive experiments have been conducted on six real-world datasets for two downstream tasks (i.e., entity classification and link prediction). The results show that LORE outperforms baselines significantly and consistently.
2712: Viral Marketing and Convergence Properties in Generalised Voter Model
Authors: Abhiram Manohara, Ahad N. Zehmakan
Location: Montreal | Day: August 19th | Time: 11:30 | Session: GTEP: Computational social choice (1/2)
Show Abstract
Consider a social network where each node (user) is blue or red, corresponding to positive or negative opinion on a topic. In the voter model, in discrete time rounds, each node picks a neighbour uniformly at random and adopts its colour. Despite its significant popularity, this model does not capture some fundamental real-world characteristics such as the difference in the strengths of connections, individuals with no initial opinion, and users who are reluctant to update. To address these issues, we introduce a generalisation of the voter model.

We study the problem of selecting a set of seed blue nodes to maximise the expected number of blue nodes after some rounds. We prove that the problem is NP-hard and provide a polynomial time approximation algorithm with the best possible approximation guarantee. Our experiments on real-world and synthetic graph data demonstrate that the proposed algorithm outperforms other algorithms.

We also prove that the process could take an exponential number of rounds to converge. However, if we limit ourselves to strongly connected graphs, the convergence time is polynomial and the convergence period (size of the stationary configuration) is bounded by the highest common divisor of cycle lengths in the network.
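The classic voter model described at the start of the abstract above can be sketched in a few lines. This is a minimal illustration of the basic model only, not the generalised model proposed in the paper (no edge weights, uncoloured nodes or stubborn users). The synchronous update, the function names and the example graphs are assumptions for illustration.

```python
import random


def voter_round(neighbours, colours, rng):
    """One synchronous round: every node copies a uniformly random neighbour.

    neighbours: dict mapping each node to a non-empty list of neighbours
    colours:    dict mapping each node to "blue" or "red"
    """
    return {
        node: colours[rng.choice(neighbours[node])]
        for node in neighbours
    }


def simulate(neighbours, colours, rounds, seed=0):
    """Run the given number of rounds and return the final colouring."""
    rng = random.Random(seed)
    for _ in range(rounds):
        colours = voter_round(neighbours, colours, rng)
    return colours


# Hypothetical example: a 4-node cycle with a single blue seed node.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
start = {0: "blue", 1: "red", 2: "red", 3: "red"}
final = simulate(cycle, start, rounds=5)
```

On a two-node graph with one blue and one red node, the two colours simply swap every round under this synchronous update, a small example of a process that settles into a periodic configuration and never reaches a fixed one.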
2719: Hybrid Mesh-Gaussian Representation for Efficient Indoor Scene Reconstruction
Authors: Binxiao Huang, Zhihao Li, Shiyong Liu, Xiao Tang, Jiajun Tang, Jiaqi Lin, Yuxin Cheng, Zhenyu Chen, Xiaofei Wu, Ngai Wong
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: Computer Vision (3/3)
Show Abstract
3D Gaussian splatting (3DGS) has demonstrated exceptional performance in image-based 3D reconstruction and real-time rendering. However, regions with complex textures require numerous Gaussians to capture significant color variations accurately, leading to inefficiencies in rendering speed. To address this challenge, we introduce a hybrid representation for indoor scenes that combines 3DGS with textured meshes. Our approach uses textured meshes to handle texture-rich flat areas, while retaining Gaussians to model intricate geometries. The proposed method begins by pruning and refining the extracted mesh to eliminate geometrically complex regions. We then employ a joint optimization for 3DGS and mesh, incorporating a warm-up strategy and transmittance-aware supervision to balance their contributions seamlessly. Extensive experiments demonstrate that the hybrid representation maintains comparable rendering quality and achieves superior frames per second (FPS) with fewer Gaussian primitives.
2724: TrajCogn: Leveraging LLMs for Cognizing Movement Patterns and Travel Purposes from Trajectories
Authors: Zeyu Zhou, Yan Lin, Haomin Wen, Shengnan Guo, Jilin Hu, Youfang Lin, Huaiyu Wan
Location: Guangzhou | Day: TBD
Show Abstract
Spatio-temporal trajectories are crucial for data mining tasks, requiring versatile learning methods that can accurately extract movement patterns and travel purposes. While large language models (LLMs) have shown remarkable versatility through training on extensive datasets, and trajectories share similarities with natural language, standard LLMs cannot directly handle spatio-temporal features or extract trajectory-specific information.
We propose TrajCogn, a model that effectively adapts LLMs for trajectory learning. TrajCogn incorporates a novel trajectory semantic embedder to process spatio-temporal features and extract movement patterns and travel purposes, along with a trajectory prompt that integrates this information into LLMs for various downstream tasks. Experiments on three real-world datasets and four representative tasks demonstrate TrajCogn’s effectiveness.
2735: M4Bench: A Benchmark of Multi-domain Multi-granularity Multi-image Understanding for Multi-modal Large Language Models
Authors: Xiaojun Ye, Guanbao Liang, Chun Wang, Liangcheng Li, Pengfei Ke, Rui Wang, Bingxin Jia, Gang Huang, Qiao Sun, Sheng Zhou
Location: Guangzhou | Day: TBD
Show Abstract
The increasing demand for analyzing complex associated scenes makes it necessary to study multi-image understanding abilities. Compared with understanding individual images, both the alignments and differences between images are essential aspects of understanding the intricate relationships for multi-image inference tasks. However, existing benchmarks face difficulties in addressing both of these aspects simultaneously, resulting in obstacles to modeling relationships under various granularities and domains of images. In this paper, we introduce M4Bench to enhance the capability of aligning and distinguishing multi-images with multi-domain multi-granularity comparison. We carefully design five comparison tasks related to coarse and fine-grained granularities in single and multiple domains of images and evaluate them on 13 state-of-the-art multi-modal large language models with various sizes. Besides, we analyze the evaluation results and provide several observations and viewpoints for the multi-image understanding research. The data and evaluation code are available at https://github.com/eaglelab-zju/M4Bench.
2740: Neuron Similarity-Based Neural Network Verification via Abstraction and Refinement
Authors: Yuehao Liu, Yansong Dong, Liang Zhao, Wensheng Wang, Cong Tian
Location: Guangzhou | Day: TBD
Show Abstract
Deep neural networks (DNNs) have become integral to numerous safety-critical applications, necessitating rigorous verification of their trustworthiness. However, the problem of verifying DNNs has high computational complexity, and existing techniques have limited efficiency, insufficient to deal with large-scale network models. To address this challenge, we propose a novel abstraction-refinement verification method that reduces network size while maintaining verification accuracy. Specifically, the method quantifies the similarity between neurons based on various factors such as their interval outputs, and then merges similar neurons to generate a smaller abstract network. In addition, a counterexample-guided refinement process is developed to mitigate the impact of potential spurious counterexamples, so that verification results from the abstract network are applicable to the original network. We have implemented this method as a tool named ARVerifier and integrated it with three state-of-the-art verification tools for evaluation on ACAS Xu and MNIST benchmarks. Experimental results demonstrate that ARVerifier significantly reduces network size and yields verification time reductions by 11.61%, 18.70%, and 12.20% compared to α,β-CROWN, Verinet, and Marabou, respectively. Moreover, ARVerifier exhibits efficiency improvements by 26.64% and 46.87% compared to existing abstraction-refinement methods NARv and CEGAR-NN, respectively.
2743: PALA: Class-imbalanced Graph Domain Adaptation via Prototype-anchored Learning and Alignment
Authors: Xin Ma, Yifan Wang, Siyu Yi, Wei Ju, Bei Wu, Ziyue Qiao, Chenwei Tang, Jiancheng Lv
Location: Guangzhou | Day: TBD
Show Abstract
Graph domain adaptation is a key subfield of graph transfer learning that aims to bridge domain gaps by transferring knowledge from a label-rich source graph to an unlabeled target graph. However, most existing methods assume balanced labels in the source graph, which often fails in practice and leads to biased knowledge transfer. To address this, in this paper, we propose a prototype-anchored learning and alignment framework for class-imbalanced graph domain adaptation. Specifically, we incorporate pointwise node mutual information into the graph encoder to capture high-order topological proximity and learn generalized node representations. Leveraging this, we then introduce categorical prototypes with adversarial proto-instances for prototype-anchored learning and recalibration to represent the source graph under an imbalanced class distribution. Finally, we introduce a weighted prototype contrastive adaptation strategy that aligns target pseudo-labels with source prototypes to handle class imbalance during adaptation. Extensive experiments show that our PALA outperforms the state-of-the-art methods. Our code is available at https://github.com/maxin88scu/PALA.
2761: Accelerating Adversarial Training on Under-Utilized GPU
Authors: Zhuoxin Zhan, Ke Wang, Pulei Xiong
Location: Montreal | Day: August 21st | Time: 10:00 | Session: Machine Learning (4/4)
Deep neural networks are vulnerable to adversarial attacks and adversarial training has been proposed to defend against such attacks by adaptively generating attacks, i.e., adversarial examples, during training. However, adversarial training is significantly slower than traditional training due to the search for worst attacks for each minibatch. To speed up adversarial training, existing work has considered a subset of a minibatch for generating attacks and reduced the steps in the search for attacks. We propose a novel adversarial training acceleration method, called AttackRider, by exploring under-utilized GPU hardware to reduce the number of calls to attack generation without increasing the time of each call. We characterize the extent of under-utilization of GPU for given GPU and model size, hence the potential for speedup, and present the application scenarios where this opportunity exists. The results on various machine learning tasks and datasets show that AttackRider can speed up state-of-the-art adversarial training algorithms with comparable robust accuracy. The source code of AttackRider is available at https://github.com/zxzhan/AttackRider.
2774: Beyond Low-rankness: Guaranteed Matrix Recovery via Modified Nuclear Norm
Authors: Jiangjun Peng, Yisi Luo, Xiangyong Cao, Shuang Xu, Deyu Meng
Location: Guangzhou | Day: TBD
The nuclear norm (NN) has been widely explored in matrix recovery problems, such as Robust PCA and matrix completion, leveraging the inherent global low-rank structure of the data. In this study, we introduce a new modified nuclear norm (MNN) framework, where the MNN family norms are defined by adopting suitable transformations and performing the NN on the transformed matrix. The MNN framework offers two main advantages: (1) it jointly captures both local information and global low-rankness without requiring trade-off parameter tuning; (2) under mild assumptions on the transformation, we provide theoretical recovery guarantees for both Robust PCA and MC tasks—an achievement not shared by existing methods that combine local and global information. Thanks to its general and flexible design, MNN can accommodate various proven transformations, enabling a unified and effective approach to structured low-rank recovery. Extensive experiments demonstrate the effectiveness of our method. Code and supplementary material are available at https://github.com/andrew-pengjj/modified_nuclear_norm.
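In symbols, the construction described in this abstract can be written as below. The symbol $\mathcal{D}$ is placeholder notation for the chosen transformation and $\sigma_i$ denotes the $i$-th singular value; neither is taken from the paper, whose own notation may differ.

```latex
\|X\|_{\mathrm{MNN}} \;:=\; \|\mathcal{D}(X)\|_{*} \;=\; \sum_{i} \sigma_i\bigl(\mathcal{D}(X)\bigr)
```

With $\mathcal{D}$ set to the identity map this reduces to the ordinary nuclear norm, so the standard NN is one member of the family.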
2778: Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization
Authors: Xinhao Yao, Hongjin Qian, Xiaolin Hu, Gengze Xu, Wei Liu, Jian Luan, Bin Wang, Yong Liu
Location: Guangzhou | Day: TBD
Large Language Models (LLMs), built on Transformer architectures, exhibit remarkable generalization across a wide range of tasks. However, fine-tuning these models for specific tasks remains resource-intensive due to their extensive parameterization. In this paper, we explore two remarkable phenomena related to the attention mechanism during the fine-tuning of LLMs (where Wq, Wk, and Wv denote the weights of the query, key, and value layers, respectively). The first phenomenon, termed “Unequal Importance of Attention Matrices”, highlights the impact of fine-tuning different weight matrices. It shows that optimizing the Wv matrix yields significantly better performance than optimizing the Wk matrix. Fine-tuning only the Wq and Wv matrices is computationally efficient while delivering results comparable to, or even better than, fine-tuning all three matrices (Wq, Wk, and Wv). The second phenomenon, “Attention Matrices with Customized Learning Rate Lead to Better Convergence”, emphasizes the importance of assigning distinct learning rates to these matrices. Specifically, a higher learning rate for the Wv matrix compared to Wq and Wk accelerates convergence and improves performance. Building on these insights, we propose a new strategy that improves fine-tuning efficiency in terms of both storage and time. Experimental results on benchmark datasets validate the effectiveness of this approach, supporting our theoretical findings. Our analysis lays the theoretical groundwork for configuring and improving algorithms in LLM fine-tuning.
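The two observations (tune only Wq and Wv, and give Wv a higher learning rate) can be sketched as a helper that builds per-group optimizer settings. The parameter-name patterns `q_proj` and `v_proj`, the base learning rate, and the 4x ratio are all assumptions for illustration, not values from the paper.

```python
def attention_param_groups(named_params, base_lr=1e-4, v_lr_ratio=4.0,
                           q_key="q_proj", v_key="v_proj"):
    """Split parameters into W_q and W_v groups with separate learning rates.

    Parameters whose names match neither key (including W_k) are left out
    of the returned groups. The name patterns and ratio are assumptions.
    """
    q_params, v_params = [], []
    for name, param in named_params:
        if v_key in name:
            v_params.append(param)
        elif q_key in name:
            q_params.append(param)
    return [
        {"params": q_params, "lr": base_lr},
        {"params": v_params, "lr": base_lr * v_lr_ratio},
    ]

# Stand-in for model.named_parameters(); strings replace real tensors here.
toy_params = [
    ("layer0.attn.q_proj.weight", "Wq0"),
    ("layer0.attn.k_proj.weight", "Wk0"),
    ("layer0.attn.v_proj.weight", "Wv0"),
]
groups = attention_param_groups(toy_params)
```

The returned list of dictionaries follows the per-parameter-group shape that PyTorch optimizers accept, so with real tensors it could be passed directly to an optimizer constructor. Parameters left out of the groups would still need their gradients disabled separately.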
2793: TP-Eval: Tap Multimodal LLMs’ Potential in Evaluation by Customizing Prompts
Authors: Yuxuan Xie, Tianhua Li, Wenqi Shao, Kaipeng Zhang
Location: Guangzhou | Day: TBD
Recently, multimodal large language models (MLLMs) have received much attention for their impressive capabilities. The evaluation of MLLMs is becoming critical to analyzing attributes of MLLMs and providing valuable insights. However, current benchmarks overlook the problem of prompt sensitivity – minor prompt variations may lead to significant performance fluctuations. Thus, inappropriate prompts may obscure the models’ capabilities, underestimating the models’ performance. Moreover, different models have different preferences for different prompts, and thus, using the same prompt for all models will cause evaluation bias. This paper analyzes this deficiency in existing benchmarks and further introduces a new evaluation framework named TP-Eval, which introduces a prompt customization method to reduce evaluation biases and tap models’ potential. TP-Eval will rewrite the original prompts to different customized prompts for different models. In particular, we propose some well-designed modules for prompt customization tailored to the scenario of MLLM evaluation. Extensive experiments demonstrate the effectiveness of our approach to uncovering models’ capabilities, and TP-Eval should benefit the community in developing more comprehensive and convincing MLLM evaluation benchmarks.
2810: Multi-Task Curriculum Graph Contrastive Learning with Clustering Entropy Guidance
Authors: Chusheng Zeng, Bocheng Wang, Jinghui Yuan, Mulin Chen, Xuelong Li
Location: Guangzhou | Day: TBD
Recent advances in unsupervised deep graph clustering have been significantly promoted by contrastive learning. Despite these strides, most graph contrastive learning models face two challenges: 1) graph augmentation is used to improve learning diversity, but commonly used random augmentation methods may destroy inherent semantics and introduce noise; 2) the fixed positive and negative sample selection strategy ignores the difficulty distribution of samples when dealing with complex real data, thereby impeding the model’s capability to capture fine-grained patterns and trapping the model in sub-optimal clustering solutions. To mitigate these problems, we propose the Clustering-guided Curriculum Graph contrastive Learning (CurGL) framework. CurGL uses clustering entropy to guide the subsequent graph augmentation and contrastive learning. Specifically, according to the clustering entropy, the intra-class edges and important features are emphasized in augmentation. Then, a multi-task curriculum learning scheme is proposed, which employs the clustering guidance to shift the focus from the discrimination task to the clustering task. In this way, the sample selection strategy of contrastive learning can be adjusted adaptively from the early to the late stage, which enhances the model’s flexibility for complex data structures. Experimental results demonstrate that CurGL achieves excellent performance compared to state-of-the-art competitors.
2815: Template-based Uncertainty Multimodal Fusion Network for RGBT Tracking
Authors: Zhaodong Ding, Chenglong Li, Shengqing Miao, Jin Tang
Location: Guangzhou | Day: TBD
RGBT tracking aims to localize predefined targets in video sequences by effectively leveraging the information from both visible light (RGB) and thermal infrared (TIR) modalities. However, the quality of different modalities changes dynamically in complex scenes, and effectively perceiving modal quality for multimodal fusion remains a significant challenge. To address this challenge, we propose to employ the reliability of the initial template to explore the uncertainty across different modalities, and design a novel template-based uncertainty computation framework for robust multimodal fusion in RGBT tracking. In particular, we introduce an Uncertainty-aware Multimodal Fusion Module (UMFM), which constructs the uncertainty of each modality by leveraging the correlation between the template and search region in the Subjective Logic framework, aiming to achieve robust multimodal fusion. In addition, existing methods focus on dynamic template update while overlooking the potential role of a reliable initial template in the template updating process. To this end, we design a simple yet effective Contrastive Template Update Module (CTUM) to assess the reliability of the new template by comparing its quality with that of the initial template. Extensive experiments suggest that our method outperforms existing approaches on four RGBT tracking benchmarks.
2816: Rethinking Graph Contrastive Learning Through Relative Similarity Preservation
Authors: Zhiyuan Ning, Pengfei Wang, Ziyue Qiao, Pengyang Wang, Yuanchun Zhou
Location: Montreal | Day: August 21st | Time: 15:00 | Session: DM: Graph Data Mining
Graph contrastive learning (GCL) has achieved remarkable success by following the computer vision paradigm of preserving absolute similarity between augmented views. However, this approach faces fundamental challenges in graphs due to their discrete, non-Euclidean nature — view generation often breaks semantic validity and similarity verification becomes unreliable. Through analyzing 11 real-world graphs, we discover a universal pattern transcending the homophily-heterophily dichotomy: label consistency systematically diminishes as structural distance increases, manifesting as smooth decay in homophily graphs and oscillatory decay in heterophily graphs. We establish theoretical guarantees for this pattern through random walk theory, proving label distribution convergence and characterizing the mechanisms behind different decay behaviors. This discovery reveals that graphs naturally encode relative similarity patterns, where structurally closer nodes exhibit collectively stronger semantic relationships. Leveraging this insight, we propose RELGCL, a novel GCL framework with complementary pairwise and listwise implementations that preserve these inherent patterns through collective similarity objectives. Extensive experiments demonstrate that our method consistently outperforms 20 existing approaches across both homophily and heterophily graphs, validating the effectiveness of leveraging natural relative similarity over artificial absolute similarity.
2834: A Simple yet Effective Hypergraph Clustering Network
Authors: Qianqian Wang, Bowen Zhao, Zhengming Ding, Xiangdong Zhang, Quanxue Gao
Location: Guangzhou | Day: TBD
Hypergraph Clustering has gained significant attention due to its capability of capturing high order structural information. Among different approaches, contrastive learning-based methods leverage self-supervised learning and data augmentation, exhibiting impressive performance. However, most of them come with the following limitations: 1) Augmentation strategies like feature dropout can potentially disrupt the intrinsic clustering structure of hypergraphs. 2) High computational demands hinder their real-world application. To address the above issues, we propose a simple yet effective Hypergraph Clustering Network framework (HCN). Specifically, HCN replaces the hypergraph convolution operation with smoothing preprocessing, which avoids high computational complexity. Besides, to retain intrinsic structure, it develops two key modules: the self-diagonal consistency module and the structure alignment module. They respectively align the similarity matrix with the identity matrix and the structural affinity matrix, which ensures intra-cluster compactness and inter-cluster separability. Extensive experiments on five benchmark datasets demonstrate HCN’s superiority over state-of-the-art methods.
2846: Wavelet Multi-scale Region-Enhanced Network for Medical Image Segmentation
Authors: Hang Lu, Liang Du, Peng Zhou
Location: Guangzhou | Day: TBD
Medical image segmentation is an important task in medical artificial intelligence. Traditional segmentation methods often suffer from the information loss problem, especially in medical image data which contain many different-scale organs or tissues. To address this problem, we propose a novel medical image segmentation method called Wavelet Multi-scale Region-Enhanced Network (WMREN), which has a UNet structure. In the encoder, we design a bi-branch feature extraction architecture, which simultaneously learns the representations with Haar wavelet transform and the residual blocks. The bi-branch architecture can effectively tackle the information loss problem when extracting features. In the decoder we design an innovative Spatial Adaptive Fusion Module to enhance the regions of interest. As we know, the boundaries of objects play an important role in segmentation. To this end, we also carefully design a Contrast Refinement Enhancement Module to highlight the boundaries of the medical objects. Extensive experiments on several benchmark datasets show that our method outperforms state-of-the-art medical image segmentation methods, demonstrating its effectiveness and superiority. The source code is publicly available at https://github.com/C101812/WMREN/tree/master.
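For readers unfamiliar with the Haar wavelet transform mentioned in this abstract, the following minimal sketch computes one level of a 2D Haar decomposition on a plain list-of-lists image. It halves the resolution while remaining invertible, which is why it is a natural fit for downsampling without discarding information. How WMREN actually integrates the transform into its encoder is not specified here; the function and sub-band names are illustrative assumptions.

```python
def haar_dwt2(image):
    """One level of a 2D Haar transform on an even-sized grid of numbers.

    Returns four half-resolution sub-bands computed from non-overlapping
    2x2 blocks, using the orthonormal scaling factor 1/2:
    approx (local average), col_diff (left vs right column),
    row_diff (top vs bottom row), and diag (diagonal detail).
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    approx, col_diff, row_diff, diag = [], [], [], []
    for r in range(0, rows, 2):
        a_row, c_row, r_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            a_row.append((a + b + d + e) / 2)  # low-pass in both directions
            c_row.append((a - b + d - e) / 2)  # difference across columns
            r_row.append((a + b - d - e) / 2)  # difference across rows
            d_row.append((a - b - d + e) / 2)  # diagonal difference
        approx.append(a_row)
        col_diff.append(c_row)
        row_diff.append(r_row)
        diag.append(d_row)
    return approx, col_diff, row_diff, diag

approx, col_diff, row_diff, diag = haar_dwt2([[1, 2], [3, 4]])
```

With the 1/2 scaling the transform is orthonormal, so the sum of squared coefficients across the four sub-bands equals the sum of squared pixel values of the input.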
2852: Meta Label Correction with Generalization Regularizer
Authors: Tao Tong, Yujie Mo, Yucheng Xie, Songyue Cai, Xiaoshuang Shi, Xiaofeng Zhu
Location: Guangzhou | Day: TBD
Deep neural networks easily over-fit when trained on noisy labels. However, previous label correction methods for dealing with noisy labels often incur a high computation cost to be effective and ignore the generalization ability of the model. To address these issues, in this paper, we propose a new meta-based self-correction method to achieve accurate filtering of noisy labels and to enhance the generalization ability of the label correction model. Specifically, we first investigate a new gradient score method to filter noisy labels with less computation cost, and then theoretically design a new generalization regularizer for the meta-learner and the base learner, both to correct noisy labels and to improve generalization. Experimental results on real datasets verify the effectiveness of our proposed method on different classification tasks.
2863: Learning from Logical Constraints with Lower- and Upper-Bound Arithmetic Circuits
Authors: Lucile Dierckx, Alexandre Dubray, Siegfried Nijssen
Location: Montreal | Day: August 21st | Time: 10:00 | Session: Machine Learning (4/4)
An important class of neuro-symbolic (NeSy) methods relies on knowledge compilation (KC) techniques to transform logical constraints into a differentiable exact arithmetic circuit (AC) that represents all models of a logical formula. However, given the complexity of KC, compiling such exact circuits can be infeasible. Previous works in such cases proposed to compile a circuit for a subset of models. In this work, we will show that gradients calculated on a subset of models can be very far from true gradients. We propose a new framework that calculates gradients based on compiling logical constraints partially in not only a lower-bound circuit but also an upper-bound circuit. We prove that from this pair of ACs, gradients that are within a bounded distance from true gradients can be calculated. Our experiments show that adding the upper-bound AC also helps the learning process in practice, allowing for similar or better generalisation than working solely with fully compiled ACs, even with less than 150 seconds of partial compilation.
2866: Inter3D: A Benchmark and Strong Baseline for Human-Interactive 3D Object Reconstruction
Authors: Gan Chen, Ying He, Mulin Yu, F.Richard Yu, Gang Xu, Fei Ma, Ming Li, Guang Zhou
Location: Guangzhou | Day: TBD
Recent advancements in implicit 3D reconstruction methods, e.g., neural rendering fields and Gaussian splatting, have primarily focused on novel view synthesis of static or dynamic objects with continuous motion states. However, these approaches struggle to efficiently model a human-interactive object with n movable parts, requiring 2^n separate models to represent all discrete states. To overcome this limitation, we propose Inter3D, a new benchmark and approach for novel state synthesis of human-interactive objects. We introduce a self-collected dataset featuring commonly encountered interactive objects and a new evaluation pipeline, where only individual part states are observed during training, while part combination states remain unseen. We also propose a strong baseline approach that leverages Space Discrepancy Tensors to efficiently model all states of an object. To alleviate the impractical constraints on camera trajectories across training states, we propose a Mutual State Regularization mechanism to enhance the spatial density consistency of movable parts. In addition, we explore two occupancy grid sampling strategies to facilitate training efficiency. We conduct extensive experiments on the proposed benchmark, showcasing the challenges of the task and the superiority of our approach. The code and data are publicly available at https://github.com/Inter3D-ui/Inter3D.
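The gap between the 2^n possible states and the individual part states observed during training can be made concrete with a short enumeration. This is only a counting sketch: encoding each part as a binary flag and treating the all-zero rest state as observed are assumptions, not details from the benchmark.

```python
from itertools import product

def split_states(num_parts):
    """Enumerate all binary part states and split them by what training sees.

    A state is a tuple of 0/1 flags, one per movable part. States with at
    most one part moved count as observed; counting the all-zero rest state
    as observed is an assumption. All combination states are unseen.
    """
    all_states = list(product((0, 1), repeat=num_parts))
    observed = [s for s in all_states if sum(s) <= 1]
    unseen = [s for s in all_states if sum(s) > 1]
    return observed, unseen

observed, unseen = split_states(3)
```

Under this encoding the observed set grows linearly (n + 1 states) while the full state space grows as 2^n, so for ten parts there would be 11 observed states against 1,013 unseen combinations.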
2870: Find and Perceive: Tell Visual Change with Fine-Grained Comparison
Authors: Feixiao Lv, Rui Wang, Lihua Jing, Lijun Liu
Location: Guangzhou | Day: TBD
The goal of the image change captioning task is to capture the differences between two similar images and describe them in natural language. In this paper, we decompose this task into two sub-problems, i.e., fine-grained change feature learning and discrimination of changed regions. Compared with existing methods which only focus on change feature learning, we propose a novel change captioning learning paradigm, Find and Perceive (F&P). Our proposed F&P consists of two main ideas, i.e., the Fine-Grained Semantic Change Perception (FGSCP) module for improving the model’s perception ability of subtle changes and the Weakly-Supervised Discriminator (WSD) of changed regions for improving the model’s sensitivity of localising the important regions. Specifically, the FGSCP deploys a two-step manner, firstly introducing the fine-grained categorisation and then enhancing the interaction of the two paired images. And the WSD adopts the contributions of each image region for final generated captions, accurately indicating which regions are important for change captions without any extra annotations. Finally, we conduct extensive experiments on four change captioning datasets, and experimental results show that our proposed method F&P outperforms existing change caption methods and achieves new state-of-the-art performance.
2886: MCF-Spouse: A Multi-Label Causal Feature Selection Method with Optimal Spouses Discovery
Authors: Lin Ma, Liang Hu, Qiang Huang, Pingting Hao, Juncheng Hu
Location: Guangzhou | Day: TBD
Multi-label causal feature selection has garnered considerable attention for its ability to identify the most informative features while accounting for the causal dependencies between labels and features. However, previous work often overlooks the unique contributions of labels to the target variables in multi-label settings, focusing instead on prioritizing feature variables. Moreover, existing methods typically rely on traditional Markov Blanket (MB) discovery to construct an initial MB, which often fails to explore the most valuable form of spouse variables for feature selection in multi-label scenarios, leading to significant computational overhead due to redundant Conditional Independence (CI) tests required for spouse search. To address these challenges, we propose the Multi-label Causal Feature Selection Method with Optimal Spouses Discovery, MCF-Spouse, which leverages mutual information to quantify the contributions of both labels and features, ensuring the retention of the most informative variables in multi-label settings. Moreover, we systematically analyze all potential forms of spouse variables to identify the optimal spouse case, significantly reducing the spouse search space and alleviating the time overhead associated with CI tests. Experiments conducted on diverse real-world datasets demonstrate that MCF-Spouse consistently outperforms state-of-the-art methods across multiple metrics, offering a scalable and interpretable solution for multi-label causal feature selection.
2893: Disentangled and Personalized Representation Learning for Next Point-of-Interest Recommendation
Authors: Xuan Rao, Shuo Shang, Lisi Chen, Renhe Jiang, Peng Han
Location: Guangzhou | Day: TBD
Next POInt-of-Interest (POI) recommendation predicts a user’s next move and facilitates location-based services such as navigation and travel planning. SOTA methods fuse each POI and its contexts (e.g., time, category, and region) into a single representation to model sequential user movement. This hinders the effective utilization of context information, and diverse user preferences are also neglected. To tackle these limitations, we propose Disentangled and Personalized Representation Learning (DPRL) as a novel method for next POI recommendation. DPRL decouples POIs and contexts during representation learning, capturing their sequential regularities independently using separate recurrent neural networks (RNNs). To model the preference of each user, DPRL adopts an aggregation mechanism that integrates dynamic user preferences and spatial-temporal factors into the learned representations. We compare DPRL with 16 state-of-the-art baselines. The results show that DPRL outperforms all baselines and achieves an average accuracy improvement of 10.53% over the best-performing baseline.
2900: Do You Steal My Model? Signature Diffusion Embedded Dual-Verification Watermarking for Protecting Intellectual Property of Hyperspectral Image Classification Models
Authors: Yufei Yang, Song Xiao, Lixiang Li, Wenqian Dong, Jiahui Qu
Location: Guangzhou | Day: TBD
Due to the high cost of data collection and training, well-performing hyperspectral image (HSI) classification models are of great value and vulnerable to piracy during transmission and use. Model watermarking is a promising technology for intellectual property (IP) protection of models. However, the existing model watermarking methods for RGB image classification models ignore the complexity of ground objects and the high dimensionality of HSIs, which makes trigger samples easy to detect and forge. To address this problem, we propose a signature diffusion embedded dual-verification watermarking method, which generates imperceptible trigger samples with explicit owner information to achieve dual verification of both model ownership and legality of the trigger set. Specifically, a subpixel-space owner signature diffusion based imperceptible trigger set generation method is proposed, which incorporates the owner signature into the abundance matrix of seeds via a diffusion model in subpixel space, thus balancing the perceptual quality of trigger samples and signature extraction capability. To resist ownership confusion, dual-stamp ownership verification is proposed, which queries the suspicious model with trigger samples for ownership verification and further extracts the signature from trigger samples to guarantee their legality. Extensive experiments demonstrate that the proposed method can effectively protect the IP of HSI classification models.
2902: Bi-DiffCD: Bidirectional Diffusion Guided Collaborative Change Detection for Arbitrary-Modal Remote Sensing Images
Authors: Jingyu Zhao, Jiahui Qu, Wenqian Dong
Location: Guangzhou | Day: TBD
Change detection aims to identify land cover changes by analyzing multitemporal images that cover the same area. However, it may be difficult to effectively obtain high-quality multitemporal images with the same modality in real dynamic scenarios. The rapid development of remote sensing technology enables collaborative observation of multimodal images, but it is challenging for uni-modal image-specific methods to overcome modal discrepancy and achieve complementary advantage detection. To this end, we propose a bidirectional diffusion guided collaborative change detection model (Bi-DiffCD) for arbitrary-modal images, which eliminates the modal discrepancy between arbitrary-modal images through bidirectional diffusion and makes full use of the multilevel complementary advantage features to improve the detection accuracy. Specifically, a conditional diffusion-based bidirectional modal alignment module (CDBMA) is designed to step-wise align the modal attribute bidirectionally while preserving the multimodal complementary features. Furthermore, a multilevel complementary feature collaborative change detection module (MLCCD) is proposed to collaborate the multilevel enhanced complementary change information from transformed images and potential features for change detection. Experiments have been conducted on three widely used multimodal datasets and one self-made multimodal dataset to demonstrate the effectiveness of the proposed method with different combinations of modalities. Code is available at https://github.com/Jiahuiqu/Bi-DiffCD.
2918: Can Retelling Have Adequate Information for Reasoning? An Enhancement Method for Imperfect Video Understanding with Large Language Model
Authors: Mingxin Li, Wenhao Wang, Hongru Ji, Xianghua Li, Chao Gao
Location: Guangzhou | Day: TBD
Large Language Models (LLMs) demonstrate strong capabilities in video understanding. However, they exhibit hallucinations and factual errors in video description. Existing Multimodal Large Language Models (MLLMs) are primarily trained by combining language models and vision models, with their visual understanding capabilities depending on the performance of the backbone. Moreover, video descriptions often suffer from incomplete content and the possibility of errors. Given the proven strong reasoning capabilities of LLMs, this paper proposes ERSR, a novel Entity and Relationship based Self-Enhanced Reasoning method for imperfect video understanding. Specifically, an entities and relationships strategy is designed to construct scene graphs based on the limited observed entity relationships, thereby enhancing video descriptions. Furthermore, by providing question feedback, a self-enhanced forward and feedback reasoning strategy is provided to enhance reasoning logic. Finally, the predicted question answering results are re-validated through rethinking and verifying using the LLMs. Extensive experiments show that the proposed method achieves competitive results on real-world video understanding datasets, with an overall improvement of no less than 1.4%.
2926: FedBG: Proactively Mitigating Bias in Cross-Domain Graph Federated Learning Using Background Data
Authors: Sheng Huang, Lele Fu, Tianchi Liao, Bowen Deng, Chuanfu Zhang, Chuan Chen
Location: Guangzhou | Day: TBD
Federated graph learning focuses on aggregating knowledge from multi-source graph data and training graph neural networks. Unlike the data handled by traditional federated learning, graph data carry additional topological information. Further, there are also biases in features and topologies among clients, increasing the difficulty of training models. Previous methods usually seek global calibration information; however, this approach may suffer from information bias caused by data skews, and it is also difficult to naturally combine feature and topology information. Therefore, adjusting the bias before it occurs is a promising way to address the learning difficulties caused by the skew. In view of this, we employ background graph data, which works as reference information for local training, to proactively correct bias before it occurs. As a kind of graph data, background graphs are naturally capable of combining feature and topology information to accomplish bias correction among clients in a comprehensive way. A mixing strategy is employed on the background graph to additionally provide privacy-preserving capabilities. Graph generation methods are employed to restore the diversity of background graphs that are blurred by the mixing strategy. Extensive experiments on two real-world datasets demonstrate the sufficient motivation and effectiveness of the proposed method.
2929: Enhancing the Performance of Global Model by Improving the Adaptability of Local Models in Federated Learning
Authors: Wujun Zhou, Shu Ding, Zelin Li, Wei Wang
Location: Guangzhou | Day: TBD
Federated learning enables the clients to collaboratively train a global model, which is aggregated from local models. Due to the heterogeneous data distributions over clients and data privacy in federated learning, it is difficult to train local models to achieve a well-performed global model. In this paper, we introduce the adaptability of local models, i.e., the average performance of local models on data distributions over clients, and enhance the performance of the global model by improving the adaptability of local models. Since each client does not know the data distributions over other clients, the adaptability of the local model cannot be directly optimized. First, we provide the property of an appropriate local model which has good adaptability on the data distributions over clients. Then, we formalize the property into the local training objective with a constraint and propose a feasible solution to train the local model. Extensive experiments on federated learning benchmarks demonstrate that our method significantly improves the adaptability of local models and achieves a well-performed global model that consistently outperforms the baseline methods.
2942: LLM-TPF: Multiscale Temporal Periodicity-Semantic Fusion LLMs for Time Series Forecasting
Authors: Qihong Pan, Haofei Tan, Guojiang Shen, Xiangjie Kong, Mengmeng Wang, Chenyang Xu
Location: Guangzhou | Day: TBD
Large language models have demonstrated remarkable generalization capabilities and strong performance across various fields. Recent research has highlighted their significant potential in time series forecasting. However, time series data often exhibit complex periodic characteristics, posing a substantial challenge in enabling these models to effectively capture latent patterns. To address this challenge, we propose a novel framework, LLM-TPF, which leverages individuality and commonality fusion to enhance time series forecasting. In the frequency domain, periodic features are extracted to reveal the intrinsic periodicity of the data, while textual prototypes are used to indicate temporal trends. In the time domain, carefully designed prompts are employed to guide the models in comprehending global information. A commonality fusion mechanism further aggregates heterogeneous information across dimensions, and three distinct language models are utilized to independently process different types of information. Extensive real-world experiments demonstrate that LLM-TPF is a powerful tool for time series forecasting, achieving superior performance compared to state-of-the-art specialized models and exhibiting exceptional generalization ability in zero-shot scenarios. Code is available at https://github.com/switchsky/LLM-TPF.
2943: Guiding LLM-based Smart Contract Generation with Finite State Machine
Authors: Hao Luo, Yuhao Lin, Xiao Yan, Xintong Hu, Yuxiang Wang, Qiming Zeng, Hao Wang, Jiawei Jiang
Location: Guangzhou | Day: TBD
A smart contract is self-executing code built on blockchain technology with a wide range of application scenarios, but traditional development relies on manual coding and expert auditing, which imposes a high barrier to entry and low efficiency. Although Large Language Models (LLMs) show great potential in programming tasks, they still face challenges in smart contract generation w.r.t. effectiveness and security. To solve these problems, we propose FSM-SCG, a smart contract generation framework based on finite state machines (FSMs) and LLMs, which significantly improves the quality of the generated code by abstracting user requirements to generate an FSM, guiding LLMs to generate smart contracts, and iteratively optimizing the code with feedback from compilation and security checks. The experimental results show that FSM-SCG significantly improves the quality of smart contract generation. Compared to the best baseline, FSM-SCG improves the compilation success rate of generated smart contract code by up to 48%, and reduces the average vulnerability risk score by approximately 68%.
2949: Hybrid Relational Graphs with Sentiment-laden Semantic Alignment for Multimodal Emotion Recognition in Conversation
Authors: Hongru Ji, Xianghua Li, Mingxin Li, Meng Zhao, Chao Gao
Location: Guangzhou | Day: TBD
Multimodal Emotion Recognition in Conversation (MERC) focuses on detecting the emotions expressed by speakers in each utterance. Recent research has increasingly leveraged graph-based models to capture interactive relationships in conversations, enhancing the ability to extract emotional cues. However, existing methods primarily focus on explicit utterance-level relationships, neglecting both the implicit connections within each modality and the differences in implicit relationships across modalities. Moreover, these methods often overlook the role that sentiment features from the conversation history play in cross-modal semantic alignment. To address these issues, we propose a novel model that employs modality-adaptive hybrid relational graphs to enrich the dialogue graph by inferring implicit relationships between nodes within each modality. Furthermore, we introduce historical sentiment through a progressive strategy that utilizes contrastive learning to refine cross-modal semantic alignment. Experimental results demonstrate the superior performance of our approach over state-of-the-art methods on the IEMOCAP and MELD datasets. Our code is available at https://github.com/cgao-comp/HRG-SSA.
2974: Pseudo-Label Reconstruction for Partial Multi-Label Learning
Authors: Yu Chen, Fang Li, Na Han, Guanbin Li, Hongbo Gao, Sixian Chan, Xiaozhao Fang
Location: Guangzhou | Day: TBD
In Partial Multi-Label Learning (PML), each instance is associated with a candidate label set containing multiple relevant labels along with other false positive labels. Currently, most PML methods directly extract instance correlation from instance features while ignoring the candidate labels, which may contain more discriminative instance-related information. This paper argues that, with a well-designed model, more accurate instance correlation can be mined from the candidate labels to facilitate label disambiguation. To this end, we propose a novel PML method based on pseudo-label reconstruction (PML-PLR). Specifically, we first propose a novel orthogonal candidate label reconstruction method, which is jointly optimized with instance features to extract more consistent instance correlation. Then, we use the instance correlation as reconstruction coefficients to reconstruct pseudo-labels. Subsequently, through local manifold learning, the reconstructed pseudo-labels are leveraged to propagate the consistency relationship between labels and instances, thereby improving the accuracy of pseudo-labels. Extensive experiments and analyses demonstrate that the proposed PML-PLR outperforms state-of-the-art methods.
2982: DiffFERV: Diffusion-based Facial Editing of Real Videos
Authors: Xiangyi Chen, Han Xue, Li Song
Location: Montreal | Day: August 19th | Time: 15:00 | Session: CV: Diffusion models
Face video editing presents significant challenges, requiring precise preservation of facial identity, temporal consistency, and background details. Existing methods encounter three major challenges: difficulty in achieving accurate facial reconstruction, struggles with challenging real-world videos and reliance on a crop-edit-stitch paradigm that confines editing to localized facial regions. In response, we introduce DiffFERV, a novel diffusion-based framework for realistic face video editing that addresses these limitations through three core contributions. (1) A specialization stage that extends large Text-to-Image (T2I) models’ general prior to faces while retaining their broad generative capabilities. This enables robust performance on non-aligned and challenging face images. (2) Temporal modeling, implemented through two distinct attention mechanisms, complements the specialization stage to ensure joint and temporally consistent processing of video frames. (3) Finally, we present a holistic editing pipeline and the concept of preservation features, which leverages our model’s enhanced priors and temporal mechanisms to achieve faithful edits of entire video frames without the need for cropping, excelling even in real-world scenarios. Extensive experiments demonstrate that DiffFERV achieves state-of-the-art performance in both reconstruction and editing tasks.
2994: Spatially Resolved Transcriptomics Data Clustering with Tailored Spatial-scale Modulation
Authors: Yuang Xiao, Yanran Zhu, Chang Tang, Xiao Zheng, Yuanyuan Liu, Kun Sun, Xinwang Liu
Location: Guangzhou | Day: TBD
Spatial transcriptomics, comprising spatial location and high-throughput gene expression information, provides revolutionary insights into disease discovery and cellular evolution. Spatial transcriptomic clustering, which pinpoints distinct spatial domains within tissues, reveals cellular interactions and enhances our understanding of the intricate architecture of tissues. Existing methods typically construct spatial graphs using a static radius based on spatial coordinates, which hinders the accurate identification of spatial domains and complicates the precise partitioning of boundary nodes within clusters. To address this issue, we introduce a novel spatially resolved transcriptomics data clustering network (TSstc). Specifically, we employ a tailored spatial-scale modulation approach, constructing different spatial graphs incrementally as the radius of the spatial domain expands, and a Spatiality-Aware Sampling (SAS) strategy is proposed to aggregate node representations by considering the spatial dependencies between spots. We then use GCN encoders to learn gene embedding with gene graph and multiple spatial embeddings with spatial graphs. During training, we incorporate cross-view correlation-based tailored spatial regularization constraints to preserve high-quality neighbor relationships across spatial embeddings at different scales. Finally, a zero-inflated negative binomial model is utilized to capture the global probability distribution of gene expression profiles. Extensive experimental results demonstrate that our approach surpasses existing state-of-the-art methods in clustering tasks and related downstream applications.
2996: On Middle Grounds for Preference Statements
Authors: Anne-Marie George, Ana Ozaki
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Knowledge Representation and Reasoning (3/4)
In group decisions or deliberations, stakeholders are often confronted with conflicting opinions. We investigate a logic-based way of expressing such opinions and a formal general notion of a middle ground between stakeholders. Inspired by the literature on preferences with hierarchical and lexicographic models, we instantiate our general framework to the case where stakeholders express their opinions using preference statements of the form ‘I prefer a to b’, where a and b are alternatives expressed over some attributes, e.g., in a trolley problem, one can express ‘I prefer to save 1 adult and 1 child to 2 adults (and 0 children)’. We prove theoretical results on the existence and uniqueness of middle grounds. In particular, we show that, for preference statements, middle grounds may not exist and may not be unique. We provide algorithms for deciding the existence and finding middle grounds.
2998: IE-PMMA: Point Cloud Completion Through Inverse Edge-aware Upsampling and Precise Multi-Modal Feature Alignment
Authors: Ran Jia, Junpeng Xue, Shuai Ma, Wenbo Lu, Kelei Wang
Location: Guangzhou | Day: TBD
Point cloud completion is a crucial task in 3D computer vision. Multi-modal completion approaches have gained attention among the popular two-stage point cloud completion methods. However, there is a notable lack of research focused on accurately aligning data from different modalities within these methods. Additionally, in other point cloud-based tasks, edge point information often provides unexpected positive contributions. In this paper, we propose a novel point cloud completion method that leverages edge point information for the first time in the completion task, which also addresses the precise alignment of multi-modal data. In particular, we implement a two-step local-to-global module to achieve better alignment of multi-modal data during the preliminary point cloud generation process. Besides, we introduce a new spatial representation structure capable of extracting a fixed number of edge points. Moreover, with the assistance of edge information, we further design an inverse edge-aware upsampler to refine the point cloud. We evaluate our method on three typical datasets, and the results demonstrate that our IE-PMMA outperforms the existing state-of-the-art methods quantitatively and visually.
3008: Interpretable DNFs
Authors: Martin C. Cooper, Imane Bousdira, Clément Carbonnel
Location: Montreal | Day: August 21st | Time: 10:00 | Session: ML: Explainable/Interpretable machine learning
A classifier is considered interpretable if each of its decisions has an explanation which is small enough to be easily understood by a human user. A DNF can be seen as a binary classifier kappa over boolean domains. The size of an explanation of a positive decision taken by a DNF kappa is bounded by the size of the terms in kappa, since we can explain a positive decision by giving a term of kappa that evaluates to true. Since both positive and negative decisions must be explained, we consider that interpretable DNFs are those kappa for which both kappa and its complement can be expressed as DNFs composed of terms of bounded size. In this paper, we investigate the family of k-DNFs whose complements can also be expressed as k-DNFs. We compare two such families, namely depth-k decision trees and nested k-DNFs, a novel family of models. Experimental evidence indicates that nested k-DNFs are an interesting alternative to decision trees in terms of interpretability and accuracy.
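The explanation idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' method or their nested k-DNF construction: the formula `kappa`, its complement `not_kappa`, and the helper names are hypothetical, chosen only to show that a positive decision is explained by a satisfied term of the DNF and a negative decision by a satisfied term of the complement DNF.

```python
from typing import Dict, List, Optional, Tuple

# A literal is (variable, required value); a term is a conjunction of literals;
# a DNF is a disjunction (list) of terms.
Literal = Tuple[str, bool]
Term = List[Literal]


def satisfied_term(dnf: List[Term], x: Dict[str, bool]) -> Optional[Term]:
    """Return the first term of `dnf` that evaluates to true on x, else None."""
    for term in dnf:
        if all(x[var] == val for var, val in term):
            return term
    return None


# Hypothetical 2-DNF: kappa = (a AND b) OR (NOT a AND c)
kappa: List[Term] = [[("a", True), ("b", True)], [("a", False), ("c", True)]]
# Its complement is also a 2-DNF: (a AND NOT b) OR (NOT a AND NOT c)
not_kappa: List[Term] = [[("a", True), ("b", False)], [("a", False), ("c", False)]]


def explain(x: Dict[str, bool]) -> Tuple[bool, Optional[Term]]:
    """Classify x and return a term of size at most 2 that justifies the decision."""
    term = satisfied_term(kappa, x)
    if term is not None:
        return True, term
    return False, satisfied_term(not_kappa, x)
```

Because both `kappa` and `not_kappa` consist of terms with at most two literals, every decision in this toy example, positive or negative, comes with an explanation of size at most two.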
3012: Exploring Transferable Homogenous Groups for Compositional Zero-Shot Learning
Authors: Zhijie Rao, Jingcai Guo, Miaoge Li, Yang Chen, Mengzhu Wang
Location: Guangzhou | Day: TBD
Conditional dependency presents one of the trickiest problems in Compositional Zero-Shot Learning, leading to significant property variations of the same state (object) across different objects (states). To address this problem, existing approaches often adopt either all-to-one or one-to-one representation paradigms. However, these extremes create an imbalance in the seesaw between transferability and discriminability, favoring one at the expense of the other. Comparatively, humans are adept at analogizing and reasoning in a hierarchical clustering manner, intuitively grouping categories with similar properties to form cohesive concepts. Motivated by this, we propose Homogeneous Group Representation Learning (HGRL), a new perspective that formulates state (object) representation learning as multiple homogeneous sub-group representation learning. HGRL seeks to achieve a balance between semantic transferability and discriminability by adaptively discovering and aggregating categories with shared properties, learning distributed group centers that retain group-specific discriminative features. Our method integrates three core components designed to simultaneously enhance both the visual and prompt representation capabilities of the model. Extensive experiments on three benchmark datasets validate the effectiveness of our method. Code is available at https://github.com/zjrao/HGRL.
3015: Indirect Online Preference Optimization via Reinforcement Learning
Authors: En Wang, Xingyu Lin, Du Su, Chenfu Bao, Zhonghou Lv, Funing Yang, Yuanbo Xu, Wenbin Liu
Location: Guangzhou | Day: TBD
Human preference alignment (HPA) aims to ensure that Large Language Models (LLMs) respond appropriately to meet human moral and ethical requirements. Existing methods, such as RLHF and DPO, rely heavily on high-quality human annotation, which restricts the efficiency of iterative online model refinement. To address the inefficiencies of human annotation acquisition, the iterated online strategy advocates the use of fine-tuned LLMs to self-generate preference data. However, this approach is prone to distribution bias because of differences between human and model annotations, as well as modeling errors between simulators and real-world contexts. To mitigate the impact of distribution bias, we adopt the principles of adversarial training, framing a zero-sum two-player game with a protagonist agent and an adversarial agent. With the adversarial agent challenging the alignment of the protagonist agent, we continuously refine the protagonist’s performance. By utilizing min-max equilibrium and Nash equilibrium strategies, we propose the Indirect Online Preference Optimization (IOPO) mechanism, which enables the protagonist agent to converge without bias while maintaining linear computational complexity. Extensive experiments across three real-world datasets demonstrate that IOPO outperforms state-of-the-art alignment methods in both offline and online scenarios, as evidenced by standard alignment metrics and human evaluations. This innovation reduces the time required for model iterations from months to one week, alleviates distribution shifts, and significantly cuts annotation costs.
3032: Learning to Extrapolate and Adjust: Two-Stage Meta-Learning for Concept Drift in Online Time Series Forecasting
Authors: Weiqi Chen, Zhaoyang Zhu, Yifan Zhang, Lefei Shen, Linxiao Yang, Qingsong Wen, Liang Sun
Location: Guangzhou | Day: TBD
The inherent non-stationarity of time series in practical applications poses significant challenges for accurate forecasting. This paper tackles the concept drift problem where the underlying distribution or environment of time series changes. To better describe the characteristics and effectively model concept drifts, we first classify them into macro-drift (stable, long-term changes) and micro-drift (sudden, short-term fluctuations). Next, we propose a unified meta-learning framework called LEAF (Learning to Extrapolate and Adjust for Forecasting), where an extrapolation module is first introduced to track and extrapolate the prediction model in latent space considering macro-drift, and then an adjustment module incorporates a meta-learnable surrogate loss to capture sample-specific micro-drift patterns. LEAF’s dual-stage approach effectively addresses diverse concept drifts and is model-agnostic, making it compatible with any deep prediction model. We further provide theoretical analysis to justify why the proposed framework can handle macro-drift and micro-drift. To facilitate further research in this field, we release three electric load time series datasets collected from real-world scenarios, exhibiting diverse and typical concept drifts. Extensive experiments on multiple datasets demonstrate the effectiveness of LEAF.
3033: Relational Decomposition for Program Synthesis
Authors: Céline Hocquette, Andrew Cropper
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Knowledge Representation and Reasoning (2/4)
We introduce a relational approach to program synthesis. The key idea is to decompose synthesis tasks into simpler relational synthesis subtasks. Specifically, our representation decomposes a training input-output example into sets of input and output facts respectively. We then learn relations between the input and output facts. We demonstrate our approach using an off-the-shelf inductive logic programming (ILP) system on four challenging synthesis datasets. Our results show that (i) our representation can outperform a standard one, and (ii) an off-the-shelf ILP system with our representation can outperform domain-specific approaches.
3035: BRIGHT-VO: Brightness-Guided Hybrid Transformer for Visual Odometry with Multi-modality Refinement Module
Authors: Dongzhihan Wang, Yang Yang, Xuyang Chen, Liang Xu
Location: Guangzhou | Day: TBD
Visual odometry (VO) plays a crucial role in autonomous driving, robotic navigation, and other related tasks by estimating the position and orientation of a camera based on visual input. Significant progress has been made in data-driven VO methods, particularly those leveraging deep learning techniques to extract image features and estimate camera poses. However, these methods often struggle in low-light conditions because of the reduced visibility of features and the increased difficulty of matching keypoints. To address this limitation, we introduce BrightVO, a novel VO model based on Transformer architecture, which not only performs front-end visual feature extraction, but also incorporates a multi-modality refinement module in the back-end that integrates Inertial Measurement Unit (IMU) data. Using pose graph optimization, this module iteratively refines pose estimates to reduce errors and improve both accuracy and robustness. Furthermore, we create a synthetic low-light dataset, KiC4R, which includes a variety of lighting conditions to facilitate the training and evaluation of VO frameworks in challenging environments. Experimental results demonstrate that BrightVO achieves state-of-the-art performance on both the KiC4R dataset and the KITTI benchmarks. Specifically, it provides an average improvement of 20% in pose estimation accuracy in normal outdoor environments and 25% in low-light conditions, outperforming existing methods. This work is open-source at https://github.com/Anastasiawd/BrightVO.
3044: Structure-Aware Handwritten Text Recognition via Graph-Enhanced Cross-Modal Mutual Learning
Authors: Ji Gan, Yupeng Zhou, Yanming Zhang, Jiaxu Leng, Xinbo Gao
Location: Guangzhou | Day: TBD
Existing handwriting recognition methods only focus on learning visual patterns by modeling low-level relationships of adjacent pixels, while overlooking the intrinsic geometric structures of characters. In this paper, we propose a novel graph-enhanced cross-modal mutual learning network GCM to fully process handwritten text images alongside their corresponding geometric graphs, which consists of one shared cross-modal encoder and two parallel inverse decoders. Specifically, the encoder simultaneously extracts visual and geometric information from the cross-modal inputs, and the decoders fuse the multi-modal features for prediction under the guidance of cross-modal fusion. Moreover, two parallel decoders sequentially aggregate cross-modal features in inverse orders (V→G and G→V) but are enhanced through mutual distillation at each time-step, which involves one-to-one knowledge transfer and fully leverages complementary cross-modal information from both directions. Notably, only one branch of GCM is activated in inference, thus avoiding the increase of the model parameters and computation costs for testing. Experiments show that our method outperforms previous state-of-the-art methods on public benchmarks such as IAM, RIMES, and ICDAR-2013 when no extra training data is utilized.
3048: Richer Semantics, Better Alignment: Aligning Visual Features with Explicit and Enriched Semantics for Visible-Infrared Person Re-Identification
Authors: Neng Dong, Shuanglin Yan, Liyan Zhang, Jinhui Tang
Location: Guangzhou | Day: TBD
Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities. Existing methods learn visual features solely from images, failing to align them into the modality-invariant semantic space. In this paper, we propose a novel framework, termed Richer Semantics, Better Alignment (RSBA), to align visual features with explicit and enriched semantics. Specifically, we first develop an Explicit Semantics-Guided Feature Alignment (ESFA) module, which supplements textual descriptions for cross-modality images and aligns image-text pairs within each modality, alleviating the distribution discrepancy of visual features. We then devise a Consistent Similarity-Guided Indirect Alignment (CSIA) module, which constrains the similarity between intra-modality image-text pairs to be consistent with that between inter-modality text-text pairs, indirectly aligning visual features with cross-modality semantics. Furthermore, we design a Cross-View Semantics Compensation (CVSC) module, which integrates multi-view texts and extends the one-to-one image-text matching in ESFA and CSIA to one-to-many, further strengthening the alignment of visual features within the semantic space. Extensive experimental results on three public datasets demonstrate the effectiveness and superiority of our proposed RSBA.
3049: Wisdom from Diversity: Bias Mitigation Through Hybrid Human-LLM Crowds
Authors: Axel Abels, Tom Lenaerts
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: AI Ethics, Trust, Fairness (3/3)
Despite their performance, large language models (LLMs) can inadvertently perpetuate biases found in the data they are trained on. By analyzing LLM responses to bias-eliciting headlines, we find that these models often mirror human biases. To address this, we explore crowd-based strategies for mitigating bias through response aggregation. We first demonstrate that simply averaging responses from multiple LLMs, intended to leverage the “wisdom of the crowd”, can exacerbate existing biases due to the limited diversity within LLM crowds. In contrast, we show that locally weighted aggregation methods more effectively leverage the wisdom of the LLM crowd, achieving both bias mitigation and improved accuracy. Finally, recognizing the complementary strengths of LLMs (accuracy) and humans (diversity), we demonstrate that hybrid crowds containing both significantly enhance performance and further reduce biases across ethnic and gender-related contexts.
3053: Pixel-wise Divide and Conquer for Federated Vessel Segmentation
Authors: Tian Chen, Wenke Huang, Zhihao Wang, Zekun Shi, He Li, Wenhui Dong, Mang Ye, Bo Du, Yongchao Xu
Location: Guangzhou | Day: TBD
Accurate vessel segmentation is essential for diagnosing and managing vascular and ophthalmic diseases. Traditional learning-based vessel segmentation methods heavily rely on high-quality, pixel-level annotated datasets. However, segmentation performance suffers significantly when applied in federated learning settings due to vessel morphology inconsistency and vessel-background imbalance. The former limits the ability of models to capture fine-grained vessels, while the latter overemphasizes background pixels and biases the model towards them. To address these challenges, we propose a novel method named Federated Vessel-Aware Calibration (FVAC), which leverages global uncertainty to provide differentiated guidance for clients, focusing on pixels of various morphologies that are difficult to distinguish. Furthermore, we introduce a foreground-background decoupling alignment strategy that utilizes more stable and balanced global features to mitigate semantic drift caused by vessel-background imbalance in local clients. Comprehensive experiments confirm the effectiveness of our method.
3057: MMGIA: Gradient Inversion Attack Against Multimodal Federated Learning via Intermodal Correlation
Authors: Lele Zheng, Yang Cao, Leo Yu Zhang, Wei Wang, Yulong Shen, Xiaochun Cao
Location: Guangzhou | Day: TBD
Multimodal federated learning (MMFL) enables collaborative model training across multiple modalities, such as images and text, without requiring direct data sharing. However, the inherent correlations between modalities introduce new privacy vulnerabilities, making MMFL more susceptible to gradient inversion attacks. In this work, we propose MMGIA, an intermodal correlation-driven gradient inversion attack that systematically exploits multimodal correlation to enhance data reconstruction quality. MMGIA consists of a two-stage optimization framework: the first stage independently reconstructs each modality using traditional gradient inversion techniques, while the second stage refines these reconstructions through pre-trained feature extractors to align modalities in a shared latent space. To further improve reconstruction accuracy, we introduce a quality-weighted fusion strategy, which dynamically integrates multimodal embeddings into a global fused representation that serves as a guiding signal for refining each modality’s reconstruction. This ensures that high-quality reconstructions contribute more to the optimization process, preventing degradation in well-reconstructed modalities while enhancing weaker ones. We conduct extensive experiments on multiple multimodal scenarios, demonstrating that MMGIA outperforms both the only existing multimodal attack and state-of-the-art single-modal attacks, revealing the heightened privacy risks in MMFL.
3061: Solving MDPs with LTLf+ and PPLTL+ Temporal Objectives
Authors: Giuseppe De Giacomo, Yong Li, Sven Schewe, Christoph Weinhuber, Pian Yu
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Planning and Scheduling (1/5)
The temporal logics LTLf+ and PPLTL+ have recently been introduced to express objectives over infinite traces. These logics are appealing because they match the expressive power of LTL on infinite traces while enabling efficient DFA-based techniques, which have been crucial to the scalability of reactive synthesis and adversarial planning in LTLf and PPLTL over finite traces. In this paper, we demonstrate that these logics are also highly effective in the context of MDPs. Introducing a technique tailored for probabilistic systems, we leverage the benefits of efficient DFA-based methods and compositionality. This approach is simpler than its nonprobabilistic counterparts in reactive synthesis and adversarial planning, as it accommodates a controlled form of nondeterminism ("good for MDPs") in the automata when transitioning from finite to infinite traces. Notably, by exploiting compositionality, our solution is both implementation-friendly and well-suited for straightforward symbolic implementations.
3074: LTLf+ and PPLTL+: Extending LTLf and PPLTL to Infinite Traces
Authors: Benjamin Aminof, Giuseppe De Giacomo, Sasha Rubin, Moshe Y. Vardi
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Planning and Scheduling (1/5)
We study two logics, LTLf+ and PPLTL+, to express properties of infinite traces, that are based on the linear-time temporal logics LTLf and PPLTL on finite traces. LTLf+/PPLTL+ use levels of Manna and Pnueli’s LTL safety-progress hierarchy, and thus have the same expressive power as LTL. However, they also retain a crucial characteristic of reactive synthesis for the base logics: the game arena for strategy extraction can be derived from deterministic finite automata (DFA). Consequently, these logics circumvent the notorious difficulties associated with determinizing infinite trace automata, typical of LTL synthesis. We present an optimal DFA-based technique for solving reactive synthesis for LTLf+ and PPLTL+. Additionally, we adapt these algorithms to optimally solve satisfiability and model-checking for these two logics.
3078: Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs
Authors: Maris F. L. Galesloot, Roman Andriushchenko, Milan Ceska, Sebastian Junges, Nils Jansen
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Planning and Scheduling (5/5)
Partially observable Markov decision processes (POMDPs) model specific environments in sequential decision-making under uncertainty. Critically, optimal policies for POMDPs may not be robust against perturbations in the environment. Hidden-model POMDPs (HM-POMDPs) capture sets of different environment models, that is, POMDPs with a shared action and observation space. The intuition is that the true model is hidden among a set of potential models, and it is unknown which model will be the environment at execution time. A policy is robust for a given HM-POMDP if it achieves sufficient performance for each of its POMDPs. We compute such robust policies by combining two orthogonal techniques: (1) a deductive formal verification technique that supports tractable robust policy evaluation by computing a worst-case POMDP within the HM-POMDP, and (2) subgradient ascent to optimize the candidate policy for a worst-case POMDP. The empirical evaluation shows that, compared to various baselines, our approach (1) produces policies that are more robust and generalize better to unseen POMDPs, and (2) scales to HM-POMDPs that consist of over a hundred thousand environments.
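The robustness notion in this abstract can be written compactly. The following is only a sketch under assumed notation that does not appear in the source: $\mathcal{M}$ denotes the set of POMDPs in the HM-POMDP, $V_M(\pi)$ the expected performance of policy $\pi$ in POMDP $M$, and $\tau$ a required performance threshold.

```latex
% Assumed notation (not from the source): \mathcal{M}, V_M(\pi), \tau.
% A policy is robust if it performs sufficiently well in every model,
% i.e., if its worst-case value meets the threshold:
\pi \text{ is robust} \iff \min_{M \in \mathcal{M}} V_M(\pi) \ge \tau
% Optimizing the candidate policy against a worst-case POMDP then
% corresponds to the max-min objective:
\pi^{*} \in \arg\max_{\pi} \; \min_{M \in \mathcal{M}} V_M(\pi)
```

Read this way, the two techniques named in the abstract map onto the two operators: robust policy evaluation finds a minimizing model for the current policy, and subgradient ascent improves the policy against that model.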
3079: Secure and Efficient Watermarking for Latent Diffusion Models in Model Distribution Scenarios
Authors: Liangqi Lei, Keke Gai, Jing Yu, Liehuang Zhu, Qi Wu
Location: Guangzhou | Day: TBD
Latent diffusion models have exhibited considerable potential in generative tasks. Watermarking is considered a means to safeguard the copyright of generative models and prevent their misuse. However, in model distribution scenarios, the accessibility of models to a large number of users brings new challenges to the security, efficiency, and robustness of existing watermark solutions. To address these issues, we propose a secure and efficient watermarking solution. A new security mechanism is designed to prevent watermark leakage and watermark escape, which considers watermark randomness and watermark-model association as two constraints for mandatory watermark injection. To reduce the time cost of training the security module, watermark injection and the security mechanism are decoupled, ensuring that fine-tuning the VAE only accomplishes the security mechanism without the burden of learning watermark patterns. A watermark distribution-based verification strategy is proposed to enhance the robustness against diverse attacks in model distribution scenarios. Experimental results show that our watermarking consistently outperforms six existing baselines on effectiveness and robustness against ten image processing attacks and adversarial attacks, while enhancing security in distribution scenarios. The code is available at https://anonymous.4open.science/r/DistriMark-F11F/.
3096: Responsibility Anticipation and Attribution in LTLf
Authors: Giuseppe De Giacomo, Emiliano Lorini, Timothy Parker, Gianmarco Parretti
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Agent-based and Multi-agent Systems (2/3)
Responsibility is one of the key notions in machine ethics and in the area of autonomous systems. It is a multi-faceted notion involving counterfactual reasoning about actions and strategies. In this paper, we study different variants of responsibility for LTLf outcomes based on strategic reasoning. We show a connection with notions in reactive synthesis, including the synthesis of winning, dominant, and best-effort strategies. This connection provides a strong computational grounding of responsibility, allowing us to characterize the worst-case computational complexity and devise sound, complete, and optimal algorithms for anticipating and attributing responsibility.
3108: MsRAG: Knowledge Augumented Image Captioning with Object-level Multi-source RAG
Authors: Yuming Qiao, Yuechen Wang, Dan Meng, Haonan Lu, Zhenyu Yang, Xudong Zhang
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Machine Learning (2/4)
Language-Visual Large Models (LVLMs) have made significant strides in enhancing visual understanding capabilities. However, these models often struggle with knowledge-based visual tasks due to constraints in the scope and timeliness of their pre-training data. Existing Retrieval-Augmented Generation (RAG) methods can effectively solve the problem but primarily rely on user queries, limiting their applicability in scenarios without explicit language input. To overcome these challenges, we introduce MsRAG, a knowledge-augmented captioning framework designed to effectively retrieve and utilize external real-world knowledge, particularly in the absence of user queries, and perform dense captioning for subjects. MsRAG comprises three key components: (1) Parallel Visual Search Module. It retrieves fine-grained object-level knowledge using both online visual search engines and offline domain-knowledge databases, enhancing the robustness and richness of retrieved information. (2) Prompt Templates Pool. The prompt pool dynamically assigns appropriate prompts based on retrieved information, optimizing LVLMs’ ability to leverage relevant data under complex RAG conditions. (3) Visual-RAG Alignment Module. It employs a novel visual prompting method to bridge the modality gap between textual RAG content and corresponding visual objects, enabling precise alignment of visual elements with their text-format RAG content. To validate the effectiveness of MsRAG, we conducted a series of qualitative and quantitative experiments. The evaluation results demonstrate the superiority of MsRAG over other methods.
3118: MiniMal: Hard-Label Adversarial Attack Against Static Malware Detection with Minimal Perturbation
Authors: Chengyi Li, Zhiyuan Jiang, Yongjun Wang, Tian Xia, Yayuan Zhang, Yuhang Mao
Location: Guangzhou | Day: TBD
Static malware detectors based on machine learning are integral to contemporary antivirus systems, but they are vulnerable to adversarial attacks. While existing research has demonstrated success with adversarial attacks in black-box hard-label scenarios, challenges such as high perturbation rates and incomplete retention of functional integrity remain. To address these issues, we propose a novel black-box hard-label attack method, MiniMal. MiniMal begins with initialized adversarial examples and utilizes binary search and particle swarm optimization algorithms to streamline the perturbation content, significantly reducing the perturbation rate of the adversarial examples. Furthermore, we propose a functionality verification method grounded in file format parsing and control flow graph comparisons to ensure the functional integrity of the adversarial examples. Experimental results indicate that MiniMal achieves an attack success rate of over 98% against three leading machine learning detectors, improving performance by approximately 4.8% to 7.1% compared to state-of-the-art methods. MiniMal reduces perturbation rates to below 40%, making them 9 to 11 times lower than those of previous methods. Additionally, functional verification via Cuckoo Sandbox revealed that the adversarial examples generated by MiniMal retained 100% functional integrity, even with various modifications applied.
3126: CrossVTON: Mimicking the Logic Reasoning on Cross-Category Virtual Try-On Guided by Tri-Zone Priors
Authors: Donghao Luo, Yujie Liang, Xu Peng, Xiaobin Hu, Boyuan Jiang, Chengming Xu, Taisong Jin, Chengjie Wang, Yanwei Fu
Location: Guangzhou | Day: TBD
Despite remarkable progress in image-based virtual try-on systems, generating realistic and robust fitting images for cross-category virtual try-on remains a challenging task. The primary difficulty arises from the absence of human-like reasoning, which involves addressing size mismatches between garments and models while recognizing and leveraging the distinct functionalities of various regions within the model images. To address this issue, we draw inspiration from human cognitive processes and disentangle the complex reasoning required for cross-category try-on into a structured framework. This framework systematically decomposes the model image into three distinct regions: try-on, reconstruction, and imagination zones. Each zone plays a specific role in accommodating the garment and facilitating realistic synthesis. To endow the model with robust reasoning capabilities for cross-category scenarios, we propose an iterative data constructor. This constructor encompasses diverse scenarios, including intra-category try-on, any-to-dress transformations (replacing any garment category with a dress), and dress-to-any transformations (replacing a dress with another garment category). Utilizing the generated dataset, we introduce a tri-zone priors generator that intelligently predicts the try-on, reconstruction, and imagination zones by analyzing how the input garment is expected to align with the model image. Guided by these tri-zone priors, our proposed method, CrossVTON, achieves state-of-the-art performance, surpassing existing baselines in both qualitative and quantitative evaluations. Notably, it demonstrates superior capability in handling cross-category virtual try-on, meeting the complex demands of real-world applications.
3128: Beyond the Known: Decision Making with Counterfactual Reasoning Decision Transformer
Authors: Minh Hoang Nguyen, Linh Le Pham Van, Thommen George Karimpanal, Sunil Gupta, Hung Le
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Machine Learning (1/4)
Decision Transformers (DT) play a crucial role in modern reinforcement learning, leveraging offline datasets to achieve impressive results across various domains. However, DT requires high-quality, comprehensive data to perform optimally. In real-world applications, the lack of training data and the scarcity of optimal behaviours make training on offline datasets challenging, as suboptimal data can hinder performance. To address this, we propose the Counterfactual Reasoning Decision Transformer (CRDT), a novel framework inspired by counterfactual reasoning. CRDT enhances DT’s ability to reason beyond known data by generating and utilizing counterfactual experiences, enabling improved decision-making in unseen scenarios. Experiments across Atari and D4RL benchmarks, including scenarios with limited data and altered dynamics, demonstrate that CRDT outperforms conventional DT approaches. Additionally, reasoning counterfactually allows the DT agent to obtain stitching abilities, combining suboptimal trajectories, without architectural modifications. These results highlight the potential of counterfactual reasoning to enhance reinforcement learning agents’ performance and generalization capabilities.
3130: Continuous Diffusive Prediction Network for Multi-Station Weather Prediction
Authors: Chujie Xu, Yuqing Ma, Haoyuan Deng, Yajun Gao, Yudie Wang, Kai Lv, Xianglong Liu
Location: Guangzhou | Day: TBD
Multi-station weather prediction provides weather forecasts for specific geographical locations, playing an important role in various aspects of daily life. Existing methods consider the relationships between individual stations discretely, making it difficult to model the continuous spatiotemporal processes of atmospheric motion, which results in suboptimal prediction outcomes. This paper proposes the Continuous Diffusive Prediction Network (CDPNet) to model the real-world continuous weather change process from discrete station observation data. CDPNet consists of two core modules: the Continuous Calibrated Initialization (CCI) and the Diffusive Difference Estimation (DDE). The CCI module interpolates data between observation stations to construct a spatially continuous physical field and ensures temporal continuity by integrating directional information from a global perspective. It accurately represents the current physical state and provides a foundation for future weather prediction. Moreover, the DDE module explicitly captures the spatial diffusion process and estimates the diffusive differences between consecutive time steps, effectively modeling spatio-temporally continuous atmospheric motion. Likewise, directional information on weather changes is introduced from the entire historical series to mitigate estimation uncertainty and improve the performance of weather prediction. Extensive experiments on the Weather2K and Global Wind/Temp datasets demonstrate that CDPNet outperforms state-of-the-art models.
3133: DenseSAM: Semantic Enhance SAM for Efficient Dense Object Segmentation
Authors: Linyun Zhou, Jiacong Hu, Shengxuming Zhang, Xiangtong Du, Mingli Song, Xiuming Zhang, Zunlei Feng
Location: Guangzhou | Day: TBD
Dense object segmentation is essential for various applications, particularly in pathology image and remote sensing image analysis. However, distinguishing numerous similar and densely packed objects in this task presents significant challenges. Several methods, including CNN- and ViT-based approaches, have been proposed to tackle these issues. Yet, models trained on limited datasets exhibit limited generalization ability. The Segment Anything Model (SAM) has recently achieved significant progress in zero-shot segmentation but relies heavily on precise positional guidance. However, providing numerous accurate location prompts in dense scenarios is time-consuming. To overcome this limitation, we conducted an in-depth exploration of the SAM mechanism and found that its strong generalization ability stems from the encoder’s edge detection capability, which is semantically independent, making location prompts essential for segmentation. This insight inspired the development of DenseSAM, which replaces location prompts with semantic guidance for automatic segmentation in dense scenarios. Specifically, it uses local details to weaken the edges of background objects, leverages global context to enhance intra-class feature similarity, while further increasing contrast with the background, and integrates a dual-head decoding process to enable lightweight automatic semantic segmentation. Extensive experiments on pathology images demonstrate that DenseSAM delivers remarkable performance with minimal training parameters, providing a cost-effective and efficient solution. Moreover, experiments on remote sensing images further validate its excellent scalability, making DenseSAM suitable for various dense object segmentation domains. The code is available at https://github.com/imAzhou/DenseSAM.
3135: FedSaaS: Class-Consistency Federated Semantic Segmentation via Global Prototype Supervision and Local Adversarial Harmonization
Authors: Xiaoyang Yu, Xiaoming Wu, Xin Wang, Dongrun Li, Ming Yang, Peng Cheng
Location: Guangzhou | Day: TBD
Federated semantic segmentation enables pixel-level classification in images through collaborative learning while maintaining data privacy. However, existing research commonly overlooks the fine-grained class relationships within the semantic space when addressing heterogeneous problems, particularly domain shift. This oversight results in ambiguities between class representations. To overcome this challenge, we propose a novel federated segmentation framework that enforces class consistency, termed FedSaaS. Specifically, we introduce class exemplars as a criterion for both local- and global-level class representations. On the server side, the uploaded class exemplars are leveraged to model class prototypes, which supervise the global branch of the clients, ensuring alignment with the global-level representation. On the client side, we incorporate an adversarial mechanism to harmonize the contributions of the global and local branches, leading to consistent output. Moreover, multilevel contrastive losses are employed on both sides to enforce consistency between the two-level representations in the same semantic space. Extensive experiments on five driving scene segmentation datasets demonstrate that our framework outperforms state-of-the-art methods, significantly improving average segmentation accuracy and effectively addressing the class-consistency representation problem.
3136: A Theoretical Perspective on Why Stochastic Population Update Needs an Archive in Evolutionary Multi-objective Optimization
Authors: Shengjie Ren, Zimin Liang, Miqing Li, Chao Qian
Location: Guangzhou | Day: TBD
Evolutionary algorithms (EAs) have been widely applied to multi-objective optimization due to their population-based nature. Population update, a key component in multi-objective EAs (MOEAs), is usually performed in a greedy, deterministic manner. However, recent studies have questioned this practice and shown that stochastic population update (SPU), which gives inferior solutions a chance to be preserved, can help MOEAs jump out of local optima more easily. Nevertheless, SPU risks losing high-quality solutions, potentially requiring a large population. Intuitively, a possible solution to this issue is to introduce an archive that stores the best solutions ever found. In this paper, we theoretically show that using an archive allows a small population and may enhance the search performance of SPU-based MOEAs. We examine two classic algorithms, SMS-EMOA and NSGA-II, on the bi-objective problem OneJumpZeroJump, and prove that using an archive can reduce the upper bound on the expected running time (even exponentially). The comparison between SMS-EMOA and NSGA-II also suggests that the (μ+μ) update mode may be more suitable for SPU than the (μ+1) update mode. We also validate our findings empirically. We hope this work provides theoretical support for exploring different ideas in the design of evolutionary multi-objective optimization algorithms.
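To make the archive idea concrete, the following is a minimal sketch, not the authors' algorithm. It assumes the definition of OneJumpZeroJump that is commonly used in the runtime-analysis literature (the abstract itself does not state it), treats both objectives as maximized, and uses hypothetical helper names (`one_jump_zero_jump`, `dominates`, `update_archive`). The archive simply keeps every non-dominated solution seen so far, so a stochastic population update cannot lose them.

```python
from typing import List, Tuple

Bits = Tuple[int, ...]


def one_jump_zero_jump(x: Bits, k: int) -> Tuple[int, int]:
    """Bi-objective OneJumpZeroJump value of bit string x (assumed definition)."""
    n = len(x)
    ones = sum(x)
    zeros = n - ones
    # f1 rewards ones, with a "gap" of width k just before the all-ones string.
    f1 = k + ones if (ones <= n - k or ones == n) else n - ones
    # f2 is the symmetric objective counting zeros.
    f2 = k + zeros if (zeros <= n - k or zeros == n) else n - zeros
    return f1, f2


def dominates(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True if objective vector a Pareto-dominates b (maximization)."""
    return all(ai >= bi for ai, bi in zip(a, b)) and a != b


def update_archive(archive: List[Bits], x: Bits, k: int) -> List[Bits]:
    """Return the archive after offering candidate x (hypothetical helper)."""
    fx = one_jump_zero_jump(x, k)
    for y in archive:
        fy = one_jump_zero_jump(y, k)
        if fy == fx or dominates(fy, fx):
            return archive  # x adds nothing new
    # Drop archived solutions that x dominates, then store x.
    kept = [y for y in archive if not dominates(fx, one_jump_zero_jump(y, k))]
    return kept + [x]
```

In an SPU-based MOEA, `update_archive` would be called for each offspring while the working population is free to discard solutions at random; the final answer is read from the archive rather than from the population.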
3137: Dual-level Fuzzy Learning with Patch Guidance for Image Ordinal Regression
Authors: Chunlai Dong, Haochao Ying, Qibo Qiu, Jinhong Wang, Danny Chen, Jian Wu
Location: Guangzhou | Day: TBD
Ordinal regression bridges regression and classification by assigning objects to ordered classes. While human experts rely on discriminative patch-level features for decisions, current approaches are limited by the availability of only image-level ordinal labels, overlooking fine-grained patch-level characteristics. In this paper, we propose a Dual-level Fuzzy Learning with Patch Guidance framework, named DFPG, that learns precise feature-based grading boundaries from ambiguous ordinal labels, with patch-level supervision. Specifically, we propose patch-labeling and filtering strategies that enable the model to focus on patch-level features even when only image-level ordinal labels are available. We further design a dual-level fuzzy learning module, which leverages fuzzy logic to quantitatively capture and handle label ambiguity from both patch-wise and channel-wise perspectives. Extensive experiments on various image ordinal regression datasets demonstrate the superiority of our proposed method, further confirming its ability to distinguish samples from difficult-to-classify categories. The code is available at https://github.com/ZJUMAI/DFPG-ord.
3142: GraphAD: Interaction Scene Graph for End-to-end Autonomous Driving
Authors: Yunpeng Zhang, Deheng Qian, Ding Li, Yifeng Pan, Yong Chen, Zhenbao Liang, Zhiyao Zhang, Yingzong Liu, Jianhui Mei, Maolei Fu, Yun Ye, Zhujin Liang, Yi Shan, Dalong Du
Location: Guangzhou | Day: TBD
Modeling complicated interactions among the ego-vehicle, road agents, and map elements is crucial for safety-critical autonomous driving. Previous work on end-to-end autonomous driving relies on the attention mechanism to handle heterogeneous interactions, which fails to capture geometric priors and is also computationally intensive. In this paper, we propose the Interaction Scene Graph (ISG) as a unified method to model the interactions among the ego-vehicle, road agents, and map elements. With the representation of the ISG, the driving agents aggregate essential information from the most influential elements, including the road agents with potential collisions and the map elements to follow. Since a large number of unnecessary interactions are omitted, the more efficient scene-graph-based framework can focus on indispensable connections, leading to better performance. We evaluate the proposed method for end-to-end autonomous driving on the nuScenes dataset. Compared with strong baselines, our method performs significantly better in full-stack driving tasks.
3148: Dyn-D^2P: Dynamic Differentially Private Decentralized Learning with Provable Utility Guarantee
Authors: Zehan Zhu, Yan Huang, Xin Wang, Shouling Ji, Jinming Xu
Location: Guangzhou | Day: TBD
Most existing decentralized learning methods with differential privacy (DP) guarantee rely on constant gradient clipping bounds and fixed-level DP Gaussian noises for each node throughout the training process, leading to a significant accuracy degradation compared to non-private counterparts. In this paper, we propose a new Dynamic Differentially Private Decentralized learning approach (termed Dyn-D^2P) tailored for general time-varying directed networks. Leveraging the Gaussian DP (GDP) framework for privacy accounting, Dyn-D^2P dynamically adjusts gradient clipping bounds and noise levels based on gradient convergence. This proposed dynamic noise strategy enables us to enhance model accuracy while preserving the total privacy budget. Extensive experiments on benchmark datasets demonstrate the superiority of Dyn-D^2P over its counterparts employing fixed-level noises, especially under strong privacy guarantees. Furthermore, we provide a provable utility bound for Dyn-D^2P that establishes an explicit dependency on network-related parameters, with a scaling factor of 1/sqrt{n} in terms of the number of nodes n up to a bias error term induced by gradient clipping. To our knowledge, this is the first model utility analysis for differentially private decentralized non-convex optimization with dynamic gradient clipping bounds and noise levels.
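The core step behind differentially private gradient sharing, clipping a gradient and adding Gaussian noise scaled to the clipping bound, can be sketched as below. This is an illustrative sketch only: the geometric decay in `decayed_bound` is an assumed schedule chosen for simplicity, not the convergence-based rule of Dyn-D^2P, and all function names are hypothetical. No privacy accounting (GDP or otherwise) is performed here.

```python
import math
import random
from typing import List, Sequence


def clip(grad: Sequence[float], bound: float) -> List[float]:
    """Scale grad so that its L2 norm is at most bound."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def privatize(grad: Sequence[float], bound: float,
              noise_multiplier: float, rng: random.Random) -> List[float]:
    """Clip grad, then add Gaussian noise with std = noise_multiplier * bound."""
    std = noise_multiplier * bound
    return [g + rng.gauss(0.0, std) for g in clip(grad, bound)]


def decayed_bound(initial_bound: float, decay: float, step: int) -> float:
    """Assumed schedule: shrink the clipping bound geometrically per step."""
    return initial_bound * decay ** step


if __name__ == "__main__":
    rng = random.Random(0)
    grad = [3.0, 4.0]
    for step in range(3):
        bound = decayed_bound(1.0, 0.5, step)
        print(step, bound, privatize(grad, bound, noise_multiplier=1.0, rng=rng))
```

Because the noise standard deviation is tied to the clipping bound, shrinking the bound as gradients get smaller also shrinks the injected noise, which is the intuition the abstract gives for why a dynamic scheme can improve accuracy over fixed-level noise.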
3150: Cost-Effective On-Device Sequential Recommendation with Spiking Neural Networks
Authors: Di Yu, Changze Lv, Xin Du, Linshan Jiang, Qing Yin, Wentao Tong, Xiaoqing Zheng, Shuiguang Deng
Location: Guangzhou | Day: TBD
On-device sequential recommendation (SR) systems are designed to make local inferences using real-time features, thereby alleviating the communication burden on server-based recommenders when handling concurrent requests from millions of users.
However, the resource constraints of edge devices, including limited memory and computational capacity, pose significant challenges to deploying efficient SR models.
Inspired by the energy-efficient and sparse computing properties of deep Spiking Neural Networks (SNNs), we propose a cost-effective on-device SR model named SSR, which encodes dense embedding representations into sparse spike-wise representations and integrates novel spiking filter modules to extract temporal patterns and critical features from item sequences, optimizing computational and memory efficiency without sacrificing recommendation accuracy.
Extensive experiments on real-world datasets demonstrate the superiority of SSR. Compared to other SR baselines, SSR achieves comparable recommendation performance while reducing energy consumption by an average of 59.43%. In addition, SSR significantly lowers memory usage, making it particularly well-suited for deployment on resource-constrained edge devices.
3152: Fine-grained Prompt Screening: Defending Against Backdoor Attack on Text-to-Image Diffusion Models
Authors: Yiran Xu, Nan Zhong, Guobiao Li, Anda Cheng, Yinggui Wang, Zhenxing Qian, Xinpeng Zhang
Location: Guangzhou | Day: TBD
Text-to-image (T2I) diffusion models have exhibited impressive generation capabilities in recent studies. However, they are vulnerable to backdoor attacks, where model outputs are manipulated by malicious triggers. In this paper, we propose a novel input-level defense method, called Fine-grained Prompt Screening (GrainPS). Our method is motivated by the phenomenon of Semantics Misalignment, where the backdoor trigger causes an inconsistency between the cross-attention projections of object words (the key words that determine the main content of the generated image) and their true semantics. In particular, we divide each prompt into pieces and conduct fine-grained analysis by examining the impact of the trigger on object words in the cross-attention layers rather than its global influence on the entire generated image. To assess the impact of each word on object words, we formulate a “semantics alignment score” as the metric, with a carefully crafted detection strategy to identify the trigger. Therefore, our implementation can detect backdoor input prompts and localize triggers simultaneously. Evaluations across four advanced backdoor attack scenarios demonstrate the effectiveness of our proposed defense method.
3154: TEST-V: TEst-time Support-set Tuning for Zero-shot Video Classification
Authors: Rui Yan, Jin Wang, Hongyu Qu, Xiaoyu Du, Dong Zhang, Jinhui Tang, Tieniu Tan
Location: Guangzhou | Day: TBD
Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification by tuning class embeddings with a few prompts (Test-time Prompt Tuning, TPT) or replacing class names with generated visual samples (support-set) has shown promising results. However, TPT cannot avoid the semantic gap between modalities, while the support-set cannot be tuned. To this end, we combine the strengths of both and propose a novel framework, namely TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts queried from LLMs to enrich the diversity of the support-set. ii) TSE tunes the support-set with factorized learnable weights according to the temporal prediction consistency in a self-supervised manner to dig out pivotal supporting cues for each class. TEST-V achieves state-of-the-art results across four benchmarks and shows good interpretability.
3164: SocialMP: Learning Social Aware Motion Patterns via Additive Fusion for Pedestrian Trajectory Prediction
Authors: Tianci Gao, Yuzhen Zhang, Hang Guo, Pei Lv
Location: Guangzhou | Day: TBD
Accurately capturing social interaction in complex scenarios is essential for the pedestrian trajectory prediction task. The uncertainty in pedestrian interactions and the physical constraints imposed by the environment make this task challenging. To solve this problem, existing methods adopt dimensionality reduction algorithms to capture explainable human motions and behaviors. However, these approaches not only suffer from weak social awareness due to inadequate feature extraction, but also overlook physical constraints, so that predicted trajectories often cross unwalkable areas. To overcome these problems, we build an attention-based motion pattern representation, named SocialMP, which can effectively enhance the social awareness and environmental perception of motion patterns. Specifically, our method first characterizes the motion patterns through singular value decomposition and defines a visual field-based rule to model environmental social interaction. Then, an attention-based additive fusion mechanism is designed to enhance the social awareness and environmental perception of motion patterns. Therein, we integrate social interactions into motion patterns through a cross-attention mechanism to generate latent motion patterns, and feed them into our devised additive fusion structure with backward connection for multiple iterations. Lastly, we design a map loss function by adding a penalty to the average displacement error to prevent pedestrians from passing through unwalkable areas. Extensive experiments on the ETH-UCY and SDD datasets demonstrate that SocialMP can not only improve prediction accuracy but also generate plausible trajectories.
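The idea of a map loss, average displacement error (ADE) plus a penalty for predicted positions that fall in unwalkable areas, can be illustrated with the sketch below. It is an assumption-laden illustration rather than the SocialMP loss: the walkability map is a hypothetical boolean grid with unit cells indexed as `walkable[row][col]`, the penalty is the fraction of violating points times a hypothetical `penalty_weight`, and a trainable version would need a differentiable penalty instead of this hard count.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in grid units


def average_displacement_error(pred: Sequence[Point],
                               truth: Sequence[Point]) -> float:
    """Mean Euclidean distance between predicted and true positions."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


def is_walkable(point: Point, walkable: List[List[bool]]) -> bool:
    """Look up the grid cell containing point; outside the map is unwalkable."""
    col, row = math.floor(point[0]), math.floor(point[1])
    if row < 0 or row >= len(walkable) or col < 0 or col >= len(walkable[0]):
        return False
    return walkable[row][col]


def map_loss(pred: Sequence[Point], truth: Sequence[Point],
             walkable: List[List[bool]], penalty_weight: float = 1.0) -> float:
    """ADE plus a weighted fraction of predicted points in unwalkable cells."""
    ade = average_displacement_error(pred, truth)
    violations = sum(1 for p in pred if not is_walkable(p, walkable))
    return ade + penalty_weight * violations / len(pred)
```

With `penalty_weight=0.0` the function reduces to plain ADE, which makes it easy to see how much of the loss comes from map violations.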
3167: ECC-SNN: Cost-Effective Edge-Cloud Collaboration for Spiking Neural Networks
Authors: Di Yu, Changze Lv, Xin Du, Linshan Jiang, Wentao Tong, Zhenyu Liao, Xiaoqing Zheng, Shuiguang Deng
Location: Guangzhou | Day: TBD
Most edge-cloud collaboration frameworks rely on the substantial computational and storage capabilities of cloud-based artificial neural networks (ANNs). However, this reliance results in significant communication overhead between edge devices and the cloud, as well as high computational energy consumption, especially when applied to resource-constrained edge devices. To address these challenges, we propose ECC-SNN, a novel edge-cloud collaboration framework that incorporates energy-efficient spiking neural networks (SNNs) to offload more computational workload from the cloud to the edge, thereby improving cost-effectiveness and reducing reliance on the cloud. ECC-SNN employs a joint training approach that integrates ANN and SNN models, enabling edge devices to leverage knowledge from cloud models for enhanced performance while reducing energy consumption and processing latency. Furthermore, ECC-SNN features an on-device incremental learning algorithm that enables edge models to continuously adapt to dynamic environments, reducing the communication overhead and resource consumption associated with frequent cloud update requests. Extensive experimental results on four datasets demonstrate that ECC-SNN improves accuracy by 4.15%, reduces average energy consumption by 79.4%, and lowers average processing latency by 39.1%.
3176: A Priori Estimation of the Approximation, Optimization and Generalization Errors of Random Neural Networks for Solving Partial Differential Equations
Authors: Xianliang Xu, Ye Li, Zhongyi Huang
Location: Guangzhou | Day: TBD
In recent years, neural networks have achieved remarkable progress in various fields and have also attracted much attention for their application to scientific problems. A line of methods involving neural networks for solving partial differential equations (PDEs), such as Physics-Informed Neural Networks (PINNs) and the Deep Ritz Method (DRM), has emerged. Although these methods outperform classical numerical methods in certain cases, the optimization problems involving neural networks are typically non-convex and non-smooth, which can result in unsatisfactory solutions for PDEs. In contrast to deterministic neural networks, the hidden weights of random neural networks are sampled from some prior distribution and only the output weights participate in training. This makes training much simpler, but it remains unclear how to select the prior distribution. In this paper, we focus on Barron type functions and approximate them under Sobolev norms by random neural networks with a clearly specified prior distribution. In addition to the approximation error, we also derive bounds for the optimization and generalization errors of random neural networks for solving PDEs when the solutions are Barron type functions.
3180: LRGR: Self-Supervised Incomplete Multi-View Clustering via Local Refinement and Global Realignment
Authors: Yanwanyu Xi, Xiao Zheng, Chang Tang, Xingchen Hu, Yuanyuan Liu, Jun-Jie Huang, Xinwang Liu
Location: Guangzhou | Day: TBD
Incomplete Multi-View Clustering (IMVC) aims to explore comprehensive representations from multiple views with missing samples.
Recent studies have revealed that IMVC methods benefit from Graph Convolutional Network (GCN) in achieving robust feature imputation and effective representation learning. Despite these notable improvements, GCN imputation methods often cause a distribution shift between the imputed and original representations, particularly when the neighbors of the imputed nodes are assigned to different groups. Moreover, GCN learning methods tend to produce homogeneous imputed representations, which blur cluster boundaries and hinder effective discriminative clustering.
To remedy these challenges, the Local Refinement and Global Realignment (LRGR) Self-supervised model is proposed for incomplete multi-view clustering, which includes two stages.
In the first stage, a local imputed refinement module is designed to enhance the versatility of imputed representations through cross-view contrastive learning guided by view-specific prototypes.
In the second stage, a global realignment module is introduced to achieve semantic consistency across views, alleviating distribution shifts by leveraging pseudo-labels and their corresponding confidence scores as guidance.
Experiments on five widely used multi-view datasets demonstrate the competitiveness and superiority of our method compared to state-of-the-art approaches.
3185: Single-Node Trigger Backdoor Attacks in Graph-Based Recommendation Systems
Authors: Runze Li, Di Jin, Xiaobao Wang, Dongxiao He, Bingdao Feng, Zhen Wang
Location: Guangzhou | Day: TBD
Graph recommendation systems have been widely studied due to their ability to effectively capture the complex interactions between users and items. However, these systems also exhibit certain vulnerabilities when faced with attacks. The prevailing shilling attack methods typically manipulate recommendation results by injecting a large number of fake nodes and edges. However, such attack strategies face two primary challenges: low stealth and high destructiveness. To address these challenges, this paper proposes a novel graph backdoor attack method that aims to enhance the exposure of target items to the target user in a covert manner, without affecting other unrelated nodes. Specifically, we design a single-node trigger generator, which can effectively expose multiple target items to the target user by inserting only one fake user node. Additionally, we introduce constraint conditions between the target nodes and irrelevant nodes to mitigate the impact of fake nodes on the recommendation system’s performance. Experimental results show that the exposure of the target items reaches no less than 50% in 99% of the target users, while the impact on the recommendation system’s performance is controlled within approximately 5%.
3197: CorrDetail: Visual Detail Enhanced Self-Correction for Face Forgery Detection
Authors: Binjia Zhou, Hengrui Lou, Lizhe Chen, Haoyuan Li, Dawei Luo, Shuai Chen, Jie Lei, Zunlei Feng, Yijun Bei
Location: Guangzhou | Day: TBD
With the swift progression of image generation technology, the widespread emergence of facial deepfakes poses significant challenges to the field of security, thus amplifying the urgent need for effective deepfake detection. Existing techniques for face forgery detection can broadly be categorized into two primary groups: visual-based methods and multimodal approaches. The former often lacks clear explanations for forgery details, while the latter, which merges visual and linguistic modalities, is more prone to the issue of hallucinations. To address these shortcomings, we introduce a visual detail enhanced self-correction framework, designated CorrDetail, for interpretable face forgery detection. CorrDetail is meticulously designed to rectify authentic forgery details when provided with error-guided questioning, with the aim of fostering the ability to uncover forgery details rather than yielding hallucinated responses. Additionally, to bolster the reliability of its findings, a visual fine-grained detail enhancement module is incorporated, supplying CorrDetail with more precise visual forgery details. Ultimately, a fusion decision strategy is devised to further augment the model’s discriminative capacity in handling extreme samples, through the integration of visual information compensation and model bias reduction. Experimental results demonstrate that CorrDetail not only achieves state-of-the-art performance compared to the latest methodologies but also excels in accurately identifying forged details, all while exhibiting robust generalization capabilities.
3203: Antibody Design and Optimization with Multi-scale Equivariant Graph Diffusion Models for Accurate Complex Antigen Binding
Authors: Jiameng Chen, Xiantao Cai, Jia Wu, Wenbin Hu
Location: Guangzhou | Day: TBD
Antibody design remains a critical challenge in therapeutic and diagnostic development, particularly for complex antigens with diverse binding interfaces. Current computational methods face two main limitations: (1) capturing geometric features while preserving symmetries, and (2) generalizing to novel antigen interfaces. Despite recent advancements, these methods often fail to accurately capture molecular interactions and maintain structural integrity. To address these challenges, we propose AbMEGD, an end-to-end framework integrating Multi-scale Equivariant Graph Diffusion for antibody sequence and structure co-design. Leveraging advanced geometric deep learning, AbMEGD combines atomic-level geometric features with residue-level embeddings, capturing local atomic details and global sequence-structure interactions. Its E(3)-equivariant diffusion method ensures geometric precision, computational efficiency, and robust generalizability for complex antigens. Furthermore, experiments using the SAbDab database demonstrate a 10.13% increase in amino acid recovery, a 3.32% rise in improvement percentage, and a 0.062 Å reduction in root mean square deviation within the critical CDR-H3 region compared to DiffAb, a leading antibody design model. These results highlight AbMEGD’s ability to balance structural integrity with improved functionality, establishing a new benchmark for sequence-structure co-design and affinity optimization. The code is available at: https://github.com/Patrick221215/AbMEGD.
3214: MPPQ: Enhancing Post-Training Quantization for LLMs via Mixed Supervision, Proxy Rounding, and Pre-Searching
Authors: Mingrun Wei, Yeyu Yan, Dong Wang
Location: Guangzhou | Day: TBD
Recent post-training quantization (PTQ) methods for large language models (LLMs) primarily focus on tackling the challenges caused by outliers. Scaling transformation has proven effective, but how to enhance the performance of extremely low-bitwidth (e.g., 2-bit) PTQ under it remains largely unexplored. In this work, a new PTQ framework, namely MPPQ, is established. First, MPPQ proposes an enhanced reconstruction loss based on Mixed metric supervision to mitigate the distribution inconsistency caused by quantization while providing strong regularization for learnable parameters. Second, we introduce a Proxy-based adaptive rounding scheme in weight quantization, which replaces the round-to-nearest (RTN) function to minimize the overall quantization errors through element-wise scaling. Furthermore, a factor coarse Pre-searching mechanism is presented to ensure proper coordination between quantization and clipping patterns, while achieving optimal initialization of clipping factors before training. Extensive experiments show that MPPQ consistently outperforms state-of-the-art methods in low-bit quantization settings. For instance, the WikiText2 perplexity of the LLaMA-2-7B model quantized with W4A4 is reduced to 8.85 (3.9 ↓ vs. 12.75 for the latest method, LRQuant).
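For readers unfamiliar with the round-to-nearest (RTN) baseline that the MPPQ abstract says its proxy-based rounding replaces, the following is a minimal sketch of symmetric per-tensor RTN weight quantization. It is not MPPQ's method; the function name `rtn_quantize` and the per-tensor scaling choice are assumptions made for illustration only.

```python
import math

def rtn_quantize(weights, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (illustrative baseline).

    Returns the integer codes, the dequantized values, and the scale.
    """
    qmax = 2 ** (bits - 1) - 1              # largest signed code, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = []
    for w in weights:
        q = math.floor(w / scale + 0.5)     # round half up (avoids banker's rounding)
        q = max(-qmax, min(qmax, q))        # clip to the representable range
        codes.append(q)
    dequantized = [q * scale for q in codes]
    return codes, dequantized, scale

if __name__ == "__main__":
    codes, deq, scale = rtn_quantize([0.7, -0.32, 0.11, 0.0], bits=4)
    print(codes, deq, scale)
```

Each weight is rounded independently here, so the per-element error is at most half a quantization step; adaptive rounding schemes instead choose the rounding direction to reduce the overall reconstruction error.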
3217: Exploring the Over-smoothing Problem of Graph Neural Networks for Graph Classification: An Entropy-based Viewpoint
Authors: Feifei Qian, Lu Bai, Lixin Cui, Ming Li, Hangyuan Du, Yue Wang, Edwin Hancock
Location: Guangzhou | Day: TBD
Over-smoothing has emerged as a major challenge in the development of Graph Neural Networks (GNNs). While existing state-of-the-art methods effectively mitigate the diminishing distance between nodes and improve the performance of node classification, they tend to be elusive for graph-level tasks. This paper introduces a novel entropy-based perspective to explore the over-smoothing problem, simultaneously enhancing the distinguishability of non-isomorphic graphs. We provide a theoretical analysis of the relationship between smoothness and entropy for graphs, highlighting how over-smoothing in high-entropic regions negatively impacts graph classification performance. To tackle this issue, we propose a simple yet effective method to Sample and Discretize node features in high-Entropic regions (SDE), aiming to preserve the critical and complicated structural information. Moreover, we introduce a new evaluation metric to assess over-smoothing for graph-level tasks, focusing on node distributions. Experimental results demonstrate that the proposed SDE method significantly outperforms existing state-of-the-art methods, establishing a new benchmark in the field of GNNs.
3233: Unlocking the Potential of Lightweight Quantized Models for Deepfake Detection
Authors: Renshuai Tao, Ziheng Qin, Yifu Ding, Chuangchuang Tan, Jiakai Wang, Wei Wang
Location: Guangzhou | Day: TBD
Deepfake detection is increasingly crucial due to the rapid rise of AI-generated content. Existing methods achieve high performance by relying on computationally intensive large models, making real-time detection on resource-constrained edge devices challenging. Given that deepfake detection is a binary classification task, there is potential for model compression and acceleration. In this paper, we propose a low-bit quantization framework for lightweight and efficient deepfake detection. The Connected Quantized Block extracts common forgery features via the quantized path and retains method-specific textures through the shortcut connections. Additionally, the Shifted Logarithmic Redistribution Quantizer mitigates information loss in near-zero domains by unfolding the unbalanced activations, enabling finer quantization granularity. Comprehensive experiments demonstrate that this new framework reduces computational costs by 10.8x and storage requirements by 12.4x while maintaining high detection performance, even surpassing SOTA methods using less than 5% of the FLOPs, paving the way for efficient deepfake detection in resource-limited scenarios.
3248: MedualTime: A Dual-Adapter Language Model for Medical Time Series-Text Multimodal Learning
Authors: Jiexia Ye, Weiqi Zhang, Ziyue Li, Jia Li, Meng Zhao, Fugee Tsung
Location: Guangzhou | Day: TBD
The recent rapid advancements in language models (LMs) have garnered attention in medical time series-text multimodal learning.
However, existing contrastive learning-based and prompt-based LM approaches tend to be biased, often assigning a primary role to time series modality while treating text modality as secondary. We classify these approaches under a temporal-primary paradigm, which may overlook the unique and critical task-relevant information embedded in text modality like clinical reports, thus failing to fully leverage mutual benefits and complementarity of different modalities.
To fill this gap, we propose a novel textual-temporal multimodal learning paradigm that enables either modality to serve as the primary while being enhanced by the other, thereby effectively capturing modality-specific information and fostering cross-modal interaction. Specifically, we design MedualTime, a language model composed of dual adapters to implement temporal-primary and textual-primary modeling simultaneously. Within each adapter, lightweight adaptation tokens are injected into the top layers of the LM to encourage high-level modality fusion. The LM pipeline shared by the dual adapters not only achieves adapter alignment but also enables efficient fine-tuning, reducing computational resources. Empirically, MedualTime demonstrates superior performance on medical data, achieving notable improvements of 8% accuracy and 12% F1 in supervised settings.
Furthermore, MedualTime’s transferability is validated by few-shot transfer experiments from coarse-grained to fine-grained medical data.
3251: IterMeme: Expert-Guided Multimodal LLM for Interactive Meme Creation with Layout-Aware Generation
Authors: Yaqi Cai, Shancheng Fang, Yadong Qu, Xiaorui Wang, Meng Shao, Hongtao Xie
Location: Guangzhou | Day: TBD
Meme creation is a creative process that blends images and text. However, existing methods lack critical components, failing to support intent-driven caption-layout generation and personalized generation, making it difficult to generate high-quality memes. To address this limitation, we propose IterMeme, an end-to-end interactive meme creation framework that utilizes a unified Multimodal Large Language Model (MLLM) to facilitate seamless collaboration among multiple components. To overcome the absence of a caption-layout generation component, we develop a robust layout representation method and construct a large-scale image-caption-layout dataset, MemeCap, which enhances the model’s ability to comprehend emotions and coordinate caption-layout generation effectively.
To address the lack of a personalization component, we introduce a parameter-shared dual-LLM architecture that decouples the intricate representations of reference images and text. Furthermore, we incorporate the expert-guided M³OE for fine-grained identity properties (IP) feature extraction and cross-modal fusion. By dynamically injecting features into every layer of the model, we enable adaptive refinement of both visual and semantic information.
Experimental results demonstrate that IterMeme significantly advances the field of meme creation by delivering consistently high-quality outcomes. The code, model, and dataset will be open-sourced to the community.
3263: Flow Matching Based Sequential Recommender Model
Authors: Feng Liu, Lixin Zou, Xiangyu Zhao, Min Tang, Liming Dong, Dan Luo, Xiangyang Luo, Chenliang Li
Location: Guangzhou | Day: TBD
Generative models, particularly diffusion models, have emerged as powerful tools for sequential recommendation. However, accurately modeling user preferences remains challenging due to the noise perturbations inherent in the forward and reverse processes of diffusion-based methods. To this end, this study introduces FMRec, a Flow Matching based model that employs a straight flow trajectory and a modified loss tailored for the recommendation task. Additionally, from the diffusion-model perspective, we integrate a reconstruction loss to improve robustness against noise perturbations, thereby retaining user preferences during the forward process. In the reverse process, we employ a deterministic reverse sampler, specifically an ODE-based updating function, to eliminate unnecessary randomness, thereby ensuring that the generated recommendations closely align with user needs. Extensive evaluations on four benchmark datasets reveal that FMRec achieves an average improvement of 6.53% over state-of-the-art methods. The replication code is available at https://github.com/FengLiu-1/FMRec.
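The "straight flow trajectory" and "deterministic, ODE-based reverse sampler" mentioned in the FMRec abstract can be illustrated with the generic flow-matching recipe: interpolate linearly between a source point and a target point, regress a velocity field onto their difference, and integrate that field with Euler steps. The sketch below is a generic illustration, not FMRec's implementation; all function names are hypothetical, and an oracle velocity stands in for a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target along a straight path: constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(x0, velocity_fn, steps=10):
    """Deterministic sampler: integrate dx/dt = v(x, t) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

if __name__ == "__main__":
    source, target = [0.0, 0.0], [1.0, -2.0]
    # Oracle velocity field used in place of a learned model (assumption).
    oracle = lambda x, t: target_velocity(source, target)
    print(interpolate(source, target, 0.5))
    print(euler_sample(source, oracle, steps=10))
```

Because the path is straight, the velocity is constant, so even a coarse Euler integration recovers the target point up to floating-point error; no random noise is injected during sampling.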
3271: LogiCase: Effective Test Case Generation from Logical Description in Competitive Programming
Authors: Sicheol Sung, Aditi, Dogyu Kim, Yo-Sub Han, Sang-Ki Ko
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: MTA: Software engineering
Automated Test Case Generation (ATCG) is crucial for evaluating software reliability, particularly in competitive programming where robust algorithm assessments depend on diverse and accurate test cases. However, existing ATCG methods often fail to meet complex specifications or generate effective corner cases, limiting their utility. In this work, we introduce Context-Free Grammars with Counters (CCFGs), a formalism that captures both syntactic and semantic structures in input specifications. Using a fine-tuned CodeT5 model, we translate natural language input specifications into CCFGs, enabling the systematic generation of high-quality test cases. Experiments on the CodeContests dataset demonstrate that CCFG-based test cases outperform baseline methods in identifying incorrect algorithms, achieving significant gains in validity and effectiveness. Our approach provides a scalable and reliable grammar-driven framework for enhancing automated competitive programming evaluations.
3275: State Revisit and Re-explore: Bridging Sim-to-Real Gaps in Offline-and-Online Reinforcement Learning with An Imperfect Simulator
Authors: Xingyu Chen, Jiayi Xie, Zhijian Xu, Ruixun Liu, Shuai Yang, Zeyang Liu, Lipeng Wan, Xuguang Lan
Location: Guangzhou | Day: TBD
In reinforcement learning (RL) based robot skill acquisition, a high-fidelity simulator is usually indispensable but unattainable since the real environment dynamics are difficult to model, which leads to severe sim-to-real gaps. Existing methods solve this problem by combining offline and online RL to jointly learn transferable policies from limited offline data and imperfect simulators. However, due to the unrestricted exploration in the imperfect simulator, the hybrid offline-and-online RL methods inevitably suffer from low sample efficiency and insufficient state-action space coverage during training. To solve this problem, we propose a State Revisit and Re-exploration (SR2) hybrid offline-and-online RL framework. In particular, the proposed algorithm employs a meta-policy and a sub-policy, where the meta-policy aims to find high-quality states in the offline trajectories for online exploration, and the sub-policy learns the robot skill using mixed offline and online data. By introducing the state revisit and explore mechanism, our approach efficiently improves performance on a set of sim-to-real robotic tasks. Through extensive simulation and real-world tasks, we demonstrate the superior performance of our approach against other state-of-the-art methods.
3276: A Multi-Granularity Clustering Approach for Federated Backdoor Defense with the Adam Optimizer
Authors: Jidong Yuan, Qihang Zhang, Naiyue Chen, Shengbo Chen, Baomin Xu
Location: Guangzhou | Day: TBD
Federated learning is vulnerable to backdoor attacks due to its distributed nature and the inability to access local datasets. Meanwhile, the heterogeneity of distributed data further complicates the detection of such attacks. However, existing defense strategies often overlook the presence of non-stationary objectives and noisy gradients across multiple clients, making it challenging to accurately and efficiently identify malicious participants. To address these challenges, we propose a backdoor defense method for Federated Learning with Adam optimizer and multi-granularity Clustering (FLAC), incorporating both coarse-grained and fine-grained clustering mechanisms to neutralize backdoor attacks. First, the Adam optimizer accelerates the learning process by mitigating the impact of noisy gradients and addressing the non-stationary objectives posed by different clients under attack. Second, a multi-granularity clustering process is considered to differentiate between benign clients and potential attackers. This is followed by an adaptive clipping strategy to further alleviate the influence of malicious attackers. Our theoretical analysis demonstrates the consistent convergence of Adam in a federated backdoor defense environment. Extensive experimental results validate the effectiveness of our defense approach.
3279: Multi-Scale Temporal Neural Network for Stock Trend Prediction Enhanced by Temporal Hyperedge Learning
Authors: Lingyun Song, Haodong Li, Siyu Chen, Xinbiao Gan, Binze Shi, Jie Ma, Yudai Pan, Xiaoqi Wang, Xuequn Shang
Location: Guangzhou | Day: TBD
Existing research in Stock Trend Prediction (STP) focuses on temporal features extracted from a temporal sequence of stock data with a look-back window, which frequently leads to the omission of important periodic patterns, such as weekly and monthly variations in stock prices. Furthermore, these methods examine stocks individually, ignoring the temporal variation patterns among stocks that share higher-order relationships, like those within the same industry. These relationships typically provide contextual insights into market investments influencing stock price fluctuations. To tackle these issues, we propose a Multi-Scale Temporal Neural Network (MSTNN) framework tailored for STP. This architecture explores the periodic fluctuation behaviors of individual stocks through an innovative 3D convolutional neural network, alongside examining temporal variation patterns of stocks linked to specific industries via a temporal hypergraph attention mechanism. Empirical results from two real-world benchmark datasets show that MSTNN significantly outperforms prior state-of-the-art STP methods. The code of our MSTNN is available at https://github.com/sunlitsong/MSTNN.
3295: Volumetric Axial Disentanglement Enabling Advancing in Medical Image Segmentation
Authors: Xingru Huang, Jian Huang, Yihao Guo, Tianyun Zhang, Zhao Huang, Yaqi Wang, Ruipu Tang, Guangliang Cheng, Shaowei Jiang, Zhiwen Zheng, Jin Liu, Renjie Ruan, Xiaoshuai Zhang
Location: Guangzhou | Day: TBD
Information retrieved from three dimensions is treated uniformly in CNN-based volumetric segmentation methods. However, such neglect of axial disparities fails to capture true spatio-temporal variations. This paper introduces the volumetric axial disentanglement to address the disparities in spatial information along different axial dimensions. Building on this concept, we propose the Post-Axial Refiner (PaR) module to refine segmentation masks by implementing axial disentanglement on the specific axis of the volumetric medical sequences. As a plug-and-play enhancement to existing volumetric segmentation architecture, PaR further utilizes specialized attention approaches to learn disentangled post-decoding features, enhancing spatial representation and structural detail. Validation on various datasets demonstrates PaR’s consistent elevation of segmentation precision and boundary clarity across 11 baselines and different imaging modalities, achieving state-of-the-art performance on multiple datasets. Experimental tests demonstrate the ability of volumetric axial disentanglement to refine the segmentation of volumetric medical images. Code is released at https://github.com/IMOP-lab/PaR-Pytorch.
3314: SSTrack: Sample-interval Scheduling for Lightweight Visual Object Tracking
Authors: Yutong Kou, Shubo Lin, Liang Li, Bing Li, Weiming Hu, Jin Gao
Location: Guangzhou | Day: TBD
In recent years, CPU real-time object tracking has gained significant attention due to its broad applications such as UAV tracking. To maintain computational efficiency, most existing CPU real-time object trackers rely on lightweight backbones and employ a single initial template image without intermediate online templates. Under this single-template setting, the appearance variance between the template and the search image is larger, while the representation ability of lightweight backbones is weaker, which poses a challenge when training lightweight object trackers. To address this issue, we propose SSTrack, a new easier-to-harder training schedule for lightweight object trackers. From the data perspective, our method designs a success-aware sample scheduler that gradually increases the number of difficult training samples with longer template-search time intervals and reduces the number of easier samples, so the training cost remains unchanged. From the optimization perspective, we utilize a gradient scaling strategy that retains the original training objective of easier samples despite the reduction in their quantity. With the collective effort from both perspectives, our method achieves state-of-the-art CPU real-time accuracy on 5 UAV-tracking benchmarks and 5 general object tracking benchmarks. Codes and models will be available at https://github.com/Kou-99/SSTrack.
3315: UniCT Depth: Event-Image Fusion Based Monocular Depth Estimation with Convolution-Compensated ViT Dual SA Block
Authors: Luoxi Jing, Dianxi Shi, Zhe Liu, Songchang Jin, Chunping Qiu, Ziteng Qiao, Yuxian Li, Jianqiang Xia
Location: Guangzhou | Day: TBD
Depth estimation plays a crucial role in 3D scene understanding and is extensively used in a wide range of vision tasks. Image-based methods struggle in challenging scenarios, while event cameras offer high dynamic range and temporal resolution but face difficulties with sparse data. Combining event and image data provides significant advantages, yet effective integration remains challenging. Existing CNN-based fusion methods struggle with occlusions and depth disparities due to limited receptive fields, while Transformer-based fusion methods often lack deep modality interaction. To address these issues, we propose UniCT Depth, an event-image fusion method that unifies CNNs and Transformers to model local and global features. We propose the Convolution-compensated ViT Dual SA (CcViT-DA) Block, designed for the encoder, which integrates Context Modeling Self-Attention (CMSA) to capture spatial dependencies and Modal Fusion Self-Attention (MFSA) for effective cross-modal fusion. Furthermore, we design the tailored Detail Compensation Convolution (DCC) Block to improve texture details and enhance edge representations. Extensive experiments show that UniCT Depth outperforms existing image, event, and fusion-based monocular depth estimation methods across key metrics.
3322: ElaD-Net: An Elastic Semantic Decoupling Network for Lesion Segmentation in Breast Ultrasound Images
Authors: Lijuan Xu, Kai Wang, Fuqiang Yu, Fenghua Tong, Mengran Li, Dawei Zhao
Location: Guangzhou | Day: TBD
Breast diseases pose a significant threat to women’s health. Automatic lesion segmentation in breast ultrasound images (BUSI) plays a crucial role in fast diagnosis. While various enhanced U-Net-based models have achieved success in multi-scale feature analysis and handling blurred boundaries, two key challenges persist that could guide the improvement of BUSI segmentation networks: 1) significant fluctuations in pixel intensity distribution similarity between the lesion and surrounding tissues, and 2) inconsistent transmission of spatial detail due to multi-scale lesion sampling. These issues highlight the necessity of semantic elasticity understanding and consistency control. To this end, we propose ElaD-Net, an Elastic Semantic Decoupling Network for lesion segmentation in BUSI. This network uses the pre-trained EfficientNet-B2 for multi-scale encoding of BUSI. The decoding stage features two key modules: Elastic Semantic Decoupling (ESD) and Spatial Semantic Reconstruction (SSR). ESD learns and decouples multi-frequency semantics in multi-scale channels with a self-calibration mechanism, enabling dynamic adjustment of receptive depth to resist similarity fluctuations. SSR further optimizes ESD outputs via feature branching, compression, and excitation to ensure spatial semantic consistency, thereby separately reconstructing edge and body.
3331: AdvGrasp: Adversarial Attacks on Robotic Grasping from a Physical Perspective
Authors: Xiaofei Wang, Mingliang Han, Tianyu Hao, Cegang Li, Yunbo Zhao, Keke Tang
Location: Montreal | Day: August 21st | Time: 10:00 | Session: AI Ethics, Trust, Fairness (2/3)
Adversarial attacks on robotic grasping provide valuable insights into evaluating and improving the robustness of these systems. Unlike studies that focus solely on neural network predictions while overlooking the physical principles of grasping, this paper introduces AdvGrasp, a framework for adversarial attacks on robotic grasping from a physical perspective. Specifically, AdvGrasp targets two core aspects: lift capability, which evaluates the ability to lift objects against gravity, and grasp stability, which assesses resistance to external disturbances. By deforming the object’s shape to increase gravitational torque and reduce stability margin in the wrench space, our method systematically degrades these two key grasping metrics, generating adversarial objects that compromise grasp performance. Extensive experiments across diverse scenarios validate the effectiveness of AdvGrasp, while real-world validations demonstrate its robustness and practical applicability.
3368: Reasoning About Causal Knowledge in Nondeterministic Domains
Authors: Shakil M. Khan, Yves Lespérance, Maryam Rostamigiv
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Knowledge Representation and Reasoning (3/4)
Reasoning about causality and agent causal knowledge is critical for effective decision-making and planning in multi-agent contexts. Previous work in the area generally assumes that the domain is deterministic, but in fact many agents operate in nondeterministic domains where the outcome of their actions depends on unpredictable environment reactions. In this paper, we propose a situation calculus-based framework for reasoning about causal knowledge in nondeterministic domains. In such domains, the agent may not know the environment reactions to her actions and their outcomes, and may be uncertain about which actions caused a condition to come about. But she can perform sensing actions to acquire knowledge about the state and use it to gain knowledge about causes. Our formalization recognizes sensing actions as causes of both physical and epistemic effects. We also examine how regression can be used to reason about causal knowledge.
3373: Inference of Human-derived Specifications of Object Placement via Demonstration
Authors: Alex Cuellar, Ho Chit Siu, Julie A Shah
Location: Montreal | Day: August 21st | Time: 10:00 | Session: Humans and AI
As robots’ manipulation capabilities improve for pick-and-place tasks (e.g., object packing, sorting, and kitting), methods focused on understanding human-acceptable object configurations remain limited in their ability to express spatial relationships important to humans. To advance robotic understanding of human rules for object arrangement, we introduce positionally-augmented RCC (PARCC), a formal logic framework based on region connection calculus (RCC) for describing the relative position of objects in space. Additionally, we introduce an inference algorithm for learning PARCC specifications via demonstrations. Finally, we present the results from a human study, which demonstrate our framework’s ability to capture a human’s intended specification and the benefits of learning-from-demonstration approaches over human-provided specifications.
3385: Optimal Transport on Categorical Data for Counterfactuals Using Compositional Data and Dirichlet Transport
Authors: Agathe Fernandes Machado, Ewen Gallic, Arthur Charpentier
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Machine Learning (1/4)
Recently, optimal transport-based approaches have gained attention for deriving counterfactuals, e.g., to quantify algorithmic discrimination. However, in the general multivariate setting, these methods are often opaque and difficult to interpret. To address this, alternative methodologies have been proposed, using causal graphs combined with iterative quantile regressions or sequential transport to examine fairness at the individual level, often referred to as "counterfactual fairness." Despite these advancements, transporting categorical variables remains a significant challenge in practical applications with real datasets. In this paper, we propose a novel approach to address this issue. Our method involves (1) converting categorical variables into compositional data and (2) transporting these compositions within the probabilistic simplex of the Euclidean space. We demonstrate the applicability and effectiveness of this approach through an illustration on real-world data, and discuss limitations.
3388: KIPPO: Koopman-Inspired Proximal Policy Optimization
Authors: Andrei Cozma, Landon Harris, Hairong Qi
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Machine Learning (3/4)
Reinforcement Learning (RL) has made significant strides in various domains, and policy gradient methods like Proximal Policy Optimization (PPO) have gained popularity due to their balance in performance, training stability, and computational efficiency. These methods directly optimize policies through gradient-based updates. However, developing effective control policies for environments with complex and non-linear dynamics remains a challenge. High variance in gradient estimates and non-convex optimization landscapes often lead to unstable learning trajectories. Koopman Operator Theory has emerged as a powerful framework for studying non-linear systems through an infinite-dimensional linear operator that acts on a higher-dimensional space of measurement functions. In contrast with their non-linear counterparts, linear systems are simpler, more predictable, and easier to analyze. In this paper, we present Koopman-Inspired Proximal Policy Optimization (KIPPO), which learns an approximately linear latent-space representation of the underlying system’s dynamics while retaining essential features for effective policy learning. This is achieved through a Koopman-approximation auxiliary network that can be added to the baseline policy optimization algorithms without altering the architecture of the core policy or value function. Extensive experimental results demonstrate consistent improvements over the PPO baseline with 6–60% increased performance while reducing variability by up to 91% when evaluated on various continuous control tasks.
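The Koopman idea referenced in the KIPPO abstract, that a non-linear system can evolve linearly once it is lifted into a suitable space of measurement functions, can be seen on a standard textbook toy system. The sketch below uses a hand-picked lift rather than a learned one, so it illustrates the principle only and is not KIPPO's auxiliary network; the system, its coefficients, and all function names are assumptions.

```python
def step_nonlinear(x1, x2, a=0.9, b=0.5, c=0.2):
    """Toy non-linear dynamics: x2 depends quadratically on x1."""
    return a * x1, b * x2 + c * x1 ** 2

def lift(x1, x2):
    """Hand-picked observables (x1, x2, x1^2) in which the dynamics are linear."""
    return [x1, x2, x1 ** 2]

def koopman_matrix(a=0.9, b=0.5, c=0.2):
    """Linear operator K such that lift(next state) == K @ lift(state)."""
    return [
        [a, 0.0, 0.0],
        [0.0, b, c],
        [0.0, 0.0, a * a],
    ]

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

if __name__ == "__main__":
    state = (1.0, -0.5)
    z = lift(*state)
    K = koopman_matrix()
    for _ in range(5):
        state = step_nonlinear(*state)   # non-linear rollout
        z = matvec(K, z)                 # linear rollout in the lifted space
    print(lift(*state), z)
```

Both rollouts agree up to floating-point error, which is the property a learned, approximately linear latent representation aims to reproduce for systems where no exact finite lift is known.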
3390: PanComplex: Leveraging Complex-Valued Neural Networks for Enhanced Pansharpening
Authors: Chunhui Luo, Dong Li, Xiaoliang Ma, Xin Lu, Zhiyuan Wang, Jiangtong Tan, Xueyang Fu
Location: Guangzhou | Day: TBD
Pansharpening combines panchromatic and low-resolution multispectral images to generate high-resolution multispectral images. Previous studies have explored the connection between pansharpening and the frequency domain, but mostly in the real-valued domain, leaving the complex domain relatively unexplored. To redefine the pansharpening task, we propose a complex-valued spatial-frequency dual-domain framework, PanComplex. To achieve this, we first establish complex representations and introduce basic complex operators tailored to pansharpening, enabling the transformation of multispectral real-valued signals into the complex domain for learning. We then model both spatial and frequency branches to capture global frequency features and local spatial features comprehensively. Finally, we employ a complex-based interaction module to fuse the spatial and frequency features, achieving complementary information across both domains. By using the representation power of the complex domain, PanComplex effectively extracts complementary features from PAN and MS images, thereby enhancing pansharpening performance. Experiments on multiple datasets demonstrate that our method achieves optimal performance with the fewest parameters and exhibits strong generalization ability to other tasks. The source code for this work is publicly available at https://github.com/lch-ustc/PanComplex.
3399: EfficientPIE: Real-Time Prediction on Pedestrian Crossing Intention with Sole Observation
Authors: Fang Qu, Pengzhan Zhou, Yuepeng He, Kaixin Gao, Youyu Luo, Xin Feng, Yu Liu, Songtao Guo
Location: Guangzhou | Day: TBD
Present Advanced Driving Assistance Systems (ADAS) respond to the dangerous crossing of pedestrians only after the incident occurs, occasionally causing severe accidents due to the stringent response window. Inferring pedestrian crossing intention may help vehicles act in advance and enhance vehicle safety by predicting the crossing probability. Recent studies usually ignore the demand for real-time forecasting required in realistic driving scenarios, and mainly focus on improving the model representation capacity on public datasets by increasing modality and observation time. Consequently, a new framework named EfficientPIE is proposed to predict the pedestrian crossing intention in real time with sole observation of the incident. To achieve reliable predictions, we propose incremental learning based on intention domain to relieve forgetting and promote performance with a progressive perturbation method. Our EfficientPIE outperforms all the SOTA models on two datasets, PIE and JAAD, running nearly 7.4x faster than the previously fastest model. Our code is available at https://github.com/heinideyibadiaole/EfficientPIE.
3405: Improved MMS Approximations for Few Agent Types
Authors: Parnian Shahkar, Jugal Garg
Location: Montreal | Day: August 21st | Time: 11:30 | Session: GTEP: Fair division
We study fair division of indivisible goods under the maximin share (MMS) fairness criterion in settings where agents are grouped into a small number of types, with agents within each type having identical valuations. For the special case of a single type, an exact MMS allocation is always guaranteed to exist. However, for two or more distinct agent types, exact MMS allocations do not always exist, shifting the focus to establishing the existence of approximate-MMS allocations. A series of works over the last decade has resulted in the best-known approximation guarantee of 3/4 + 3/3836.

In this paper, we improve the approximation guarantees for settings where agents are grouped into two or three types, a scenario that arises in many practical settings. Specifically, we present novel algorithms that guarantee a 4/5-MMS allocation for two agent types and a 16/21-MMS allocation for three agent types. Our approach leverages the MMS partition of the majority type and adapts it to provide improved fairness guarantees for all types.
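As a rough illustration of the fairness notion used above, the sketch below computes an agent's maximin share by brute force and checks whether an allocation is alpha-MMS. It assumes additive, non-negative valuations; the function names and the toy instance are hypothetical, and this is not the algorithm from the paper, which the abstract does not spell out.

```python
from itertools import product


def maximin_share(values, n_agents):
    """Brute-force MMS: the best worst-bundle value over all partitions
    of the goods into n_agents bundles (additive valuations assumed)."""
    best = 0
    # Try every assignment of goods to bundles; exponential, toy sizes only.
    for assignment in product(range(n_agents), repeat=len(values)):
        bundles = [0] * n_agents
        for good, bundle in enumerate(assignment):
            bundles[bundle] += values[good]
        best = max(best, min(bundles))
    return best


def is_alpha_mms(allocation, valuations, alpha):
    """allocation[i] lists the good indices given to agent i;
    valuations[i][g] is agent i's value for good g."""
    n = len(allocation)
    return all(
        sum(valuations[i][g] for g in allocation[i])
        >= alpha * maximin_share(valuations[i], n)
        for i in range(n)
    )


# Two agents of different types (hypothetical toy valuations).
valuations = [[5, 4, 3, 2, 1, 1], [1, 1, 1, 1, 1, 1]]
print(maximin_share(valuations[0], 2))
print(is_alpha_mms([[0, 2], [1, 3, 4, 5]], valuations, 0.8))
```

With few agent types, many agents share the same valuation vector and hence the same MMS partition, which is the structure the abstract says the approach exploits.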
3408: On the Discrimination and Consistency for Exemplar-Free Class Incremental Learning
Authors: Tianqi Wang, Jingcai Guo, Depeng Li, Zhi Chen
Location: Montreal | Day: August 21st | Time: 10:00 | Session: Machine Learning (4/4)
Exemplar-free class incremental learning (EF-CIL) is a nontrivial task that requires continuously enriching model capability with new classes while maintaining previously learned knowledge without storing and replaying any old class exemplars. An emerging theory-guided framework for CIL trains task-specific models for a shared network, shifting the pressure of forgetting to task-id prediction. In EF-CIL, task-id prediction is more challenging due to the lack of inter-task interaction (e.g., replays of exemplars). To address this issue, we conduct a theoretical analysis of the importance and feasibility of preserving a discriminative and consistent feature space, upon which we propose a novel method termed DCNet. Concretely, it progressively maps class representations into a hyperspherical space, in which different classes are orthogonally distributed to achieve ample inter-class separation. Meanwhile, it also introduces compensatory training to adaptively adjust supervision intensity, thereby aligning the degree of intra-class aggregation. Extensive experiments and theoretical analysis verified the superiority of DCNet. Code is available at https://github.com/Tianqi-Wang1/DCNet.
3409: Situational-Constrained Sequential Resources Allocation via Reinforcement Learning
Authors: Libo Zhang, Yang Chen, Toru Takisaka, Kaiqi Zhao, Weidong Li, Jiamou Liu
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Machine Learning (1/4)
Sequential Resource Allocation with situational constraints presents a significant challenge in real-world applications, where resource demands and priorities are context-dependent. This paper introduces a novel framework, SCRL, to address this problem. We formalize situational constraints as logic implications and develop a new algorithm that dynamically penalizes constraint violations. To handle situational constraints effectively, we propose a probabilistic selection mechanism to overcome limitations of traditional constraint reinforcement learning (CRL) approaches. We evaluate SCRL across two scenarios: medical resource allocation during a pandemic and pesticide distribution in agriculture. Experiments demonstrate that SCRL outperforms existing baselines in satisfying constraints while maintaining high resource efficiency, showcasing its potential for real-world, context-sensitive decision-making tasks.
3425: FedDLAD: A Federated Learning Dual-Layer Anomaly Detection Framework for Enhancing Resilience Against Backdoor Attacks
Authors: Binbin Ding, Penghui Yang, Sheng-Jun Huang
Location: Guangzhou | Day: TBD
In Federated Learning (FL), the decentralized nature of client training introduces vulnerabilities, notably backdoor attacks. Prevailing anomaly detection approaches typically perform binary classification, dividing clients into trusted and untrusted groups. However, these methods face two critical challenges: the insider threat, where malicious clients concealed within the trusted group compromise the global model, and the benign exclusion, where legitimate contributions from benign clients are mistakenly classified as untrusted and disregarded. These issues weaken both the robustness and fairness of FL systems, exposing inherent defense vulnerabilities. To address these challenges, we propose FedDLAD, a Federated Learning Dual-Layer Anomaly Detection framework designed to enhance resilience against backdoor attacks. The framework leverages the Connectivity-Based Outlier Factor (COF) module to perform a robust initial classification of clients by analyzing structural data connectivity. The Interquartile Range (IQR) module further reinforces this by mitigating the insider threat through the removal of residual malicious influences within the trusted group. Furthermore, the Pardon module dynamically reintegrates misclassified benign clients from the untrusted group, thereby preserving their valuable contributions and addressing the benign exclusion. We conduct extensive evaluations of FedDLAD against state-of-the-art defenses on real-world datasets, demonstrating its superior ability to reduce backdoor attack success rates while maintaining robust model performance. Code is available at: https://github.com/dingbinb/FedDLAD.
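For readers unfamiliar with interquartile-range filtering, the following is a minimal sketch of the generic Tukey fence rule applied to one scalar score per client. What the score measures, the multiplier k = 1.5, and the function name are all assumptions made for illustration; the abstract does not describe how FedDLAD's IQR module is actually implemented.

```python
from statistics import quantiles


def iqr_inliers(scores, k=1.5):
    """Return indices whose score lies inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, s in enumerate(scores) if low <= s <= high]


# Hypothetical per-client anomaly scores; the last client stands out.
client_scores = [1.0, 1.1, 0.9, 1.05, 0.95, 5.0]
print(iqr_inliers(client_scores))
```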
3426: T2S: High-resolution Time Series Generation with Text-to-Series Diffusion Models
Authors: Yunfeng Ge, Jiawei Li, Yiji Zhao, Haomin Wen, Zhao Li, Meikang Qiu, Hongyan Li, Ming Jin, Shirui Pan
Location: Guangzhou | Day: TBD
Text-to-Time Series generation holds significant potential to address challenges such as data sparsity, imbalance, and limited availability of multimodal time series data across domains. While diffusion models have achieved remarkable success in Text-to-X (e.g., vision and audio data) generation, their use in time series generation remains limited. Existing approaches face two critical limitations: (1) reliance on domain-specific captions that generalize poorly, and (2) inability to generate time series of arbitrary length, limiting real-world use. In this work, we first introduce a new multimodal dataset containing over 600,000 high-resolution text-time series pairs. Second, we propose Text-to-Series (T2S), a diffusion-based framework that bridges the gap between natural language and time series in a domain-agnostic manner. It employs a length-adaptive VAE to encode time series of varying lengths into consistent latent embeddings. On top of that, T2S effectively aligns textual representations with latent embeddings by utilizing Flow Matching and employing DiT as the denoiser. We train T2S in an interleaved paradigm across multiple lengths, allowing it to generate sequences of arbitrary lengths. Extensive evaluations demonstrate that T2S achieves state-of-the-art performance across 13 datasets spanning 12 domains.
3436: MMET: A Multi-Input and Multi-Scale Transformer for Efficient PDEs Solving
Authors: Yichen Luo, Jia Wang, Dapeng Lan, Yu Liu, Zhibo Pang
Location: Guangzhou | Day: TBD
Partial Differential Equations (PDEs) are fundamental for modeling physical systems, yet solving them in a generic and efficient manner using machine learning-based approaches remains challenging due to limited multi-input and multi-scale generalization capabilities, as well as high computational costs. This paper proposes the Multi-input and Multi-scale Efficient Transformer (MMET), a novel framework designed to address the above challenges. MMET decouples mesh and query points as two sequences and feeds them into the encoder and decoder, respectively, and uses a Gated Condition Embedding (GCE) layer to embed input variables or functions with varying dimensions, enabling effective solutions for multi-scale and multi-input problems. Additionally, a Hilbert curve-based reserialization and patch embedding mechanism decreases the input length. This significantly reduces the computational cost when dealing with large-scale geometric models. These innovations enable efficient representations and support multi-scale resolution queries for large-scale and multi-input PDE problems. Experimental evaluations on diverse benchmarks spanning different physical fields demonstrate that MMET outperforms SOTA methods in both accuracy and computational efficiency. This work highlights the potential of MMET as a robust and scalable solution for real-time PDE solving in engineering and physics-based applications, paving the way for future explorations into pre-trained large-scale models in specific domains. This work is open-sourced at https://github.com/YichenLuo-0/MMET.
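To give a feel for Hilbert-curve serialization, the sketch below orders points on a 2D integer grid by their position along a Hilbert curve, so that neighbours in the sequence are also neighbours in space and can be grouped into contiguous patches. It uses the standard bitwise index computation and assumes a square grid whose side is a power of two; the function names are hypothetical, and how MMET quantizes mesh points or forms patches is not stated in the abstract.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n x n grid.
    n must be a power of two and 0 <= x, y < n."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def serialize(points, n):
    """Sort grid points by Hilbert index (a locality-preserving 1D order)."""
    return sorted(points, key=lambda p: hilbert_index(n, p[0], p[1]))


grid = [(x, y) for x in range(4) for y in range(4)]
ordered = serialize(grid, 4)
# Consecutive points are grid neighbours, so fixed-size chunks are compact patches.
patches = [ordered[i:i + 4] for i in range(0, len(ordered), 4)]
print(patches[0])
```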
3441: LEKA: LLM-Enhanced Knowledge Augmentation
Authors: Xinhao Zhang, Jinghan Zhang, Fengran Mo, Dongjie Wang, Yanjie Fu, Kunpeng Liu
Location: Montreal | Day: August 21st | Time: 15:00 | Session: ML: Large Language Models
Humans excel in analogical learning and knowledge transfer and, more importantly, possess a unique understanding of identifying appropriate sources of knowledge. From a model’s perspective, this presents a unique challenge. If models could autonomously retrieve knowledge relevant for transfer or decision-making to solve problems, they would transition from passively acquiring to actively accessing and learning from knowledge. However, filling models with knowledge is relatively straightforward—it simply requires more training and accessible knowledge bases. The more complex task is teaching models about which knowledge can be analogized and transferred. Therefore, we design a knowledge augmentation method, LEKA, for knowledge transfer that actively searches for suitable knowledge sources that can enrich the target domain’s knowledge. This LEKA method extracts key information from the target domain’s textual information, retrieves pertinent data from external data libraries, and harmonizes retrieved data with the target domain data in feature space and marginal probability measures. We validate the effectiveness of our approach through extensive experiments across various domains and demonstrate significant improvements over traditional methods in automating data alignment and optimizing transfer learning outcomes.
3445: New Sequence-Independent Lifting Techniques for Cover Inequalities and When They Induce Facets
Authors: Siddharth Prasad, Ellen Vitercik, Maria-Florina Balcan, Tuomas Sandholm
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Constraint Satisfaction and Optimization (1/3)
Sequence-independent lifting is a procedure for strengthening valid inequalities of an integer program. We generalize the sequence-independent lifting method of Gu, Nemhauser, and Savelsbergh (GNS lifting) for cover inequalities and correct an error in their proposed generalization. We obtain a new sequence-independent lifting technique—piecewise-constant (PC) lifting—with a number of important properties. We derive a broad set of sufficient conditions under which PC lifting yields facets—the first characterization of facet-defining sequence-independent liftings that are efficiently computable from the underlying cover. Finally, we demonstrate via experiments that PC lifting can be a useful alternative to GNS lifting. We test PC lifting atop a number of novel cover inequality generation routines, which prove to be effective in experiments with CPLEX. PC lifting delivers strong numerical properties making it practically relevant for integer programming solvers.
3456: Distance Preservation Games
Authors: Haris Aziz, Hau Chan, Patrick Lederer, Shivika Narang, Toby Walsh
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Game Theory
We introduce and analyze distance preservation games (DPGs). In DPGs, agents express ideal distances to other agents and need to choose locations in the unit interval while preserving their ideal distances as closely as possible. We analyze the existence and computation of location profiles that are jump stable (i.e., no agent can benefit by moving to another location) or welfare optimal for DPGs, respectively. Specifically, we prove that there are DPGs without jump stable location profiles and identify important cases where such outcomes always exist and can be computed efficiently. Similarly, we show that finding welfare optimal location profiles is NP-complete and present approximation algorithms for finding solutions with social welfare close to optimal. Finally, we prove that DPGs have a price of anarchy of at most 2.
3483: GSDet: Gaussian Splatting for Oriented Object Detection
Authors: Zeyu Ding, Jiaqi Zhao, Yong Zhou, Wen-liang Du, Hancheng Zhu, Rui Yao
Location: Guangzhou | Day: TBD
Oriented object detection has advanced with the development of convolutional neural networks (CNNs) and transformers. However, modern detectors still rely on predefined object candidates, such as anchors in CNN-based methods or queries in transformer-based methods, which struggle to capture spatial information effectively. To address the limitations, we propose GSDet, a novel framework that formulates oriented object detection as Gaussian splatting. Specifically, our approach performs detection within a 3D feature space constructed from image features, where 3D Gaussians are employed to represent oriented objects. These 3D Gaussians are projected onto the image plane to form 2D Gaussians, which are then transformed into oriented boxes. Furthermore, we optimize the mean, anisotropic covariance, and confidence scores of these randomly initialized 3D Gaussians, using a decoder that incorporates 3D Gaussian sampling. Moreover, our method exhibits flexibility, enabling adaptive control and a dynamic number of Gaussians during inference. Experiments on 3 datasets indicate that GSDet achieves AP50 gains of 0.7% on DIOR-R, 0.3% on DOTA-v1.0, and 0.55% on DOTA-v1.5 when evaluated with adaptive control and outperforms mainstream detectors.
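The step from a 2D Gaussian to an oriented box can be illustrated with an eigen-decomposition of the covariance. The sketch below assumes the common convention that the covariance equals R diag((w/2)^2, (h/2)^2) R^T for a box of width w, height h and rotation R; that scaling, the function name, and the output layout are assumptions for illustration and may differ from what GSDet uses.

```python
import math


def gaussian_to_obb(mean, cov):
    """Convert a 2D Gaussian (mean, 2x2 symmetric covariance) into an
    oriented box (cx, cy, w, h, theta), theta in radians."""
    (a, b), (_, c) = cov  # cov = [[a, b], [b, c]]
    half_trace = (a + c) / 2.0
    delta = math.hypot((a - c) / 2.0, b)
    lam_major = half_trace + delta             # larger eigenvalue
    lam_minor = max(half_trace - delta, 0.0)   # clamp tiny negatives from rounding
    theta = 0.5 * math.atan2(2.0 * b, a - c)   # direction of the major axis
    return (mean[0], mean[1], 2.0 * math.sqrt(lam_major), 2.0 * math.sqrt(lam_minor), theta)


# A Gaussian elongated along the 45-degree diagonal (toy numbers).
print(gaussian_to_obb((10.0, 20.0), [[2.5, 1.5], [1.5, 2.5]]))
```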
3486: Distributed Cascaded Manifold Hashing Network for Compact Image Set Representation
Authors: Xiaxin Wang, Haoyu Cai, Xiaobo Shen, Xia Wu
Location: Guangzhou | Day: TBD
Conventional image set methods typically learn from image sets stored in a single location. However, in real-world applications, image sets are often distributed across different locations. Learning from such distributed sets using deep neural networks poses challenges for efficient image set classification and retrieval. To address this, we propose Distributed Cascade Manifold Hashing Network (DCMHN) for compact image set representation. DCMHN represents each image set using an SPD manifold and utilizes a manifold hashing network to generate hash codes, enabling efficient classification and retrieval. The network is trained in a cascaded manner, where the bilinear mapping in the BiMap layer is learned first, followed by joint learning of the hash function and classifier in the hash layer. DCMHN enforces local consistency on global variables across neighboring nodes, allowing parallel optimization. Extensive experiments on three benchmark image set datasets demonstrate that the proposed DCMHN achieves competitive accuracies in distributed settings, and outperforms state-of-the-art methods in terms of computation and storage efficiency.
3493: Rotation Invariant Spatial Networks for Single-View Point Cloud Classification
Authors: Feng Luan, Jiarui Hu, Changshi Zhou, Zhipeng Wang, Jiguang Yue, Yanmin Zhou, Bin He
Location: Guangzhou | Day: TBD
Point cloud classification is critical for three-dimensional scene understanding. However, in real-world scenarios, depth cameras often capture partial, single-view point clouds of objects with different poses, making their accurate classification a challenge. In this paper, we propose a novel point cloud classification network that captures the detailed spatial structure of objects by constructing tetrahedra, which is different from point-wise operations. Specifically, we propose a RISpaNet block to extract rotation-invariant features. A rotation-invariant property generation module is designed in RISpaNet for constructing rotation-invariant tetrahedron properties (RITPs). Meanwhile, a multi-scale pooling module and a hybrid encoder are used to process RITPs to generate integrated rotation-invariant features. Further, for single-view point clouds, a complete point cloud auxiliary branch and a part-whole correlation module are jointly employed to obtain complete point cloud features from partial point clouds. Experimental results show that this network performs better than other state-of-the-art methods, evaluated on four public datasets. We achieved an overall accuracy of 94.7% (+2.0%) on ModelNet40, 93.4% (+5.9%) on MVP, 94.7% (+6.3%) on PCN and 94.8% (+1.7%) on ScanObjectNN. Our project website is https://luxurylf.github.io/RISpaNet_project/.
3496: ASCENT-ViT: Attention-based Scale-aware Concept Learning Framework for Enhanced Alignment in Vision Transformers
Authors: Sanchit Sinha, Guangzhi Xiong, Aidong Zhang
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: ETF: Explainability and interpretability
As Vision Transformers (ViTs) are increasingly adopted in sensitive vision applications, there is a growing demand for improved interpretability. This has led to efforts to forward-align these models with carefully annotated abstract, human-understandable semantic entities – concepts. Concepts provide global rationales to the model predictions and can be quickly understood/intervened on by domain experts. Most current research focuses on designing model-agnostic, plug-and-play generic concept-based explainability modules that do not incorporate the inner workings of foundation models (e.g., inductive biases, scale invariance, etc.) during training. To alleviate this issue for ViTs, in this paper, we propose ASCENT-ViT, an attention-based, concept learning framework that effectively composes scale and position-aware representations from multiscale feature pyramids and ViT patch representations, respectively. Further, these representations are aligned with concept annotations through attention matrices – which incorporate spatial and global (semantic) concepts. ASCENT-ViT can be utilized as a classification head on top of standard ViT backbones for improved predictive performance and accurate and robust concept explanations as demonstrated on five datasets, including three widely used benchmarks (CUB, Pascal APY, Concept-MNIST) and two real-world datasets (AWA2, KITS). An appendix of the paper with more comprehensive results is available at https://arxiv.org/abs/2501.09221.
3500: Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution
Authors: Zihang Liu, Zhenyu Zhang, Hao Tang
Location: Guangzhou | Day: TBD
Diffusion-based image super-resolution (SR) methods have demonstrated remarkable performance. Recent advancements have introduced deterministic sampling processes that reduce inference from 15 iterative steps to a single step, thereby significantly improving the inference speed of existing diffusion models. However, their efficiency remains limited when handling complex semantic regions due to the single-step inference.
To address this limitation, we propose SAMSR, a semantic-guided diffusion framework that incorporates semantic segmentation masks into the sampling process. Specifically, we introduce the SAM-Noise Module, which refines Gaussian noise using segmentation masks to preserve spatial and semantic features. Furthermore, we develop a pixel-wise sampling strategy that dynamically adjusts the residual transfer rate and noise strength based on pixel-level semantic weights, prioritizing semantically rich regions during the diffusion process. To enhance model training, we also propose a semantic consistency loss, which aligns pixel-wise semantic weights between predictions and ground truth.
Extensive experiments on both real-world and synthetic datasets demonstrate that SAMSR significantly improves perceptual quality and detail recovery, particularly in semantically complex images.
3502: Counterfactual Knowledge Maintenance for Unsupervised Domain Adaptation
Authors: Yao Li, Yong Zhou, Jiaqi Zhao, Wen-liang Du, Rui Yao, Bing Liu
Location: Guangzhou | Day: TBD
Traditional unsupervised domain adaptation (UDA) struggles to extract rich semantics due to backbone limitations. Recent large-scale pre-trained visual-language models (VLMs) have shown strong zero-shot learning capabilities in UDA tasks. However, directly using VLMs results in a mixture of semantic and domain-specific information, complicating knowledge transfer. Complex scenes with subtle semantic differences are prone to misclassification, which in turn can result in the loss of features that are crucial for distinguishing between classes. To address these challenges, we propose a novel counterfactual knowledge maintenance UDA framework. Specifically, we employ counterfactual disentanglement to separate the representation of semantic information from domain features, thereby reducing domain bias. Furthermore, to clarify ambiguous visual information specific to classes, we maintain the discriminative knowledge of both visual and textual information. This approach synergistically leverages multimodal information to preserve modality-specific distinguishable features. We conducted extensive experimental evaluations on several public datasets to demonstrate the effectiveness of our method. The source code is available at https://github.com/LiYaolab/CMKUDA
3503: Where and How to Enhance: Discovering Bit-Width Contribution for Mixed Precision Quantization
Authors: Haidong Kang, Lianbo Ma, Guo Yu, Shangce Gao
Location: Guangzhou | Day: TBD
Mixed precision quantization (MPQ) is an effective quantization approach to achieve an accuracy-complexity trade-off for neural networks by assigning different bit-widths to network activations and weights in each layer. The typical way of existing MPQ methods is to optimize quantization policies (i.e., bit-width allocation) in a gradient descent manner, termed Differentiable MPQ (DMPQ). At the end of the search, the bit-width associated with the quantization parameter that has the largest value is selected to form the final mixed precision quantization policy, with the implicit assumption that the values of quantization parameters reflect the operation contribution to the accuracy improvement. While much has been discussed about the MPQ’s improvement, the bit-width selection process has received little attention. We study this problem and argue that the magnitude of quantization parameters does not necessarily reflect the actual contribution of the bit-width to the task performance. Then, we propose a Shapley-based MPQ (SMPQ) method, which measures the bit-width operation’s direct contribution on the MPQ task. To reduce computation cost, a Monte Carlo sampling-based approximation strategy is proposed for Shapley computation. Extensive experiments on mainstream benchmarks demonstrate that our SMPQ consistently outperforms gradient-based competitors and achieves state-of-the-art performance.
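The Monte Carlo approximation of Shapley values mentioned above can be sketched generically with permutation sampling: draw random orderings of the players and average each player's marginal contribution. The player names, the toy value function, and the sample count below are hypothetical; in an MPQ setting the value function would be some measure of task performance for a set of bit-width choices, which the abstract does not specify.

```python
import random


def shapley_monte_carlo(players, value_fn, n_samples=2000, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled player orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(n_samples):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            totals[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: t / n_samples for p, t in totals.items()}


# Toy additive game: each hypothetical candidate contributes a fixed amount.
gain = {"2bit": 1, "4bit": 3, "8bit": 5}
estimates = shapley_monte_carlo(list(gain), lambda s: sum(gain[p] for p in s))
print(estimates)
```

Each sampled ordering costs one value-function evaluation per player, so the sample count trades estimation variance against compute, in contrast to the exponential cost of exact Shapley computation.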
3505: Dual-Agent Reinforcement Learning for Automated Feature Generation
Authors: Wanfu Gao, Zengyao Man, Hanlin Pan, Kunpeng Liu
Location: Guangzhou | Day: TBD
Feature generation involves creating new features from raw data to capture complex relationships among the original features, improving model robustness and machine learning performance. Current methods using reinforcement learning for feature generation have made feature exploration more flexible and efficient. However, several challenges remain: first, during feature expansion, a large number of redundant features are generated. When removing them, current methods only retain the best features each round, neglecting those that perform poorly initially but could improve later. Second, the state representation used by current methods fails to fully capture complex feature relationships. Third, there are significant differences between discrete and continuous features in tabular data, requiring different operations for each type. To address these challenges, we propose a novel dual-agent reinforcement learning method for feature generation. Two agents are designed: the first generates new features, and the second determines whether they should be preserved. A self-attention mechanism enhances state representation, and diverse operations distinguish interactions between discrete and continuous features. The experimental results on multiple datasets demonstrate that the proposed method is effective.
3512: LLM-enhanced Score Function Evolution for Causal Structure Learning
Authors: Zidong Wang, Fei Liu, Qi Feng, Qingfu Zhang, Xiaoguang Gao
Location: Guangzhou | Day: TBD
Causal structure learning (CSL) plays a pivotal role in causality and is often formulated as an optimization problem within score-and-search methods. Under the assumption of an infinite dataset and a predefined distribution, several well-established and consistent score functions have been shown to be both optimal and reliable for identifying ground-truth causal graphs. However, in practice, these idealized assumptions are often infeasible, which can result in CSL algorithms learning suboptimal structures. In this paper, we introduce L-SFE, a framework designed to automatically discover effective score functions by exploring the "score function space". L-SFE addresses this task from a bi-level optimization perspective. First, it leverages a Large Language Model (LLM) to interpret the characteristics of score functions and generate the corresponding code implementations. Next, L-SFE employs evolutionary algorithms, along with carefully designed operators, to search for solutions with higher fitness. Additionally, we take the BIC as an example and prove the consistency of the generated score functions. Experimental evaluations, conducted on discrete, continuous, and real datasets, demonstrate the high stability, generality and effectiveness of L-SFE.
3514: Aggregation Mechanism Based Graph Heterogeneous Networks Distillation
Authors: Xiaobin Hong, Mingkai Lin, Xiangkai Ma, Wenzhong Li, Sanglu Lu
Location: Guangzhou | Day: TBD
Graph Neural Networks (GNNs) have demonstrated remarkable effectiveness across various tasks but are often hindered by their high computational overhead. GNN-to-MLP distillation provides a promising remedy by transferring knowledge from complex GNNs to lightweight MLPs. However, existing methods largely overlook the differences in aggregation mechanisms and heterogeneous architectures. Simplifying such intricate information into MLP potentially causes information loss or distortion, ultimately resulting in suboptimal performance. This paper proposes an aggregation mechanism enhanced GNN distillation framework (AMEND). AMEND introduces multi-scope aggregation context preservation to replicate the teacher’s broad aggregation scopes and an aggregation-enhanced centered kernel alignment method to match the teacher’s aggregation patterns. To ensure efficient and robust knowledge transfer, we integrate a manifold mixup strategy, enabling the student to capture the teacher’s insights into mixed data distributions. Experimental results on 8 standard and 4 large-scale datasets demonstrate that AMEND consistently outperforms state-of-the-art distillation methods.
3518: Frequency-Aware Deep Depth from Focus
Authors: Tao Yan, Yingying Wang, Jiangfeng Zhang, Yuhua Qian, Jieru Jia, Lu Chen, Feijiang Li
Location: Guangzhou | Day: TBD
In large aperture imaging, the shallow depth of field (DoF) phenomenon requires capturing multiple images at different focal levels, allowing us to infer depth information using depth from focus (DFF) techniques. However, most previous works design convolutional neural networks from a time domain perspective, often leading to blurred fine details in depth estimation. In this work, we propose a frequency-aware deep DFF network (FAD) that couples multi-scale spatial domain local features with frequency domain global structural features. Our main innovations include two key points: First, we introduce a frequency domain feature extraction module that uses the Fourier transform to transfer latent focus features into the frequency domain. This module adaptively captures essential frequency information for focus changes through element-wise multiplication, enhancing fine details in depth results while preserving global structural integrity. Second, the time-frequency joint module of FAD improves the consistency of depth information in sparse texture regions and the continuity in transition areas from both local and global complementary perspectives. Comprehensive experiments demonstrate that our model achieves compelling generalization and state-of-the-art depth prediction across various datasets. Additionally, it can be quickly adapted to real-world applications as a pre-trained model.
3531: Towards Debiased Generalized Category Discovery
Authors: Pengcheng Guo, Yonghong Song, Boyu Wang
Location: Guangzhou | Day: TBD
Generalized Category Discovery (GCD) aims at classifying unlabeled training data coming from old and novel classes by leveraging the information of partially labeled old classes. In this paper, we reveal that existing methods often suffer from competition between new and old classes, where the focus on learning new classes often results in a notable performance degradation on the old classes. Moreover, we delve into the reason behind this problem: the GCD classifier can be overconfident and biased towards the new class. With this insight, we propose Debiased GCD (DeGCD), a simple but effective approach that mitigates the bias caused by the overconfidence from new categories by a debiased head. Specifically, we first propose a semantic calibration loss that aids the GCD classifier in debiasing by enforcing neighborhood prediction consistency with the latent representation of the debiased head. Furthermore, a debiased contrastive objective is proposed to refine the similarity matrix from the GCD classifier and the debiased classifier, suppressing the overconfidence in new classes in unlabeled data. In addition, an alignment constraint loss is designed to prevent damaging the distribution of the old categories caused by overconfidence in the new categories. Experiments on various datasets show that DeGCD achieves state-of-the-art performance and maintains a good balance between new and old classes. In addition, this method can be seamlessly adapted to other GCD methods, not only to achieve further performance gains but also to effectively balance the performance of the new class with that of the old class.
3532: Boosting Zero-shot Stereo Matching Using Large-Scale Mixed Images Sources in the Real World
Authors: Yuran Wang, Yingping Liang, Ying Fu
Location: Guangzhou | Day: TBD
Stereo matching methods rely on dense pixel-wise ground truth labels, which are laborious to obtain, especially for real-world datasets. The scarcity of labeled data and domain gaps between synthetic and real-world images also pose notable challenges. In this paper, we propose a novel framework, BooSTer, that leverages both vision foundation models and large-scale mixed image sources, including synthetic, real, and single-view images. First, to fully unleash the potential of large-scale single-view images, we design a data generation strategy combining monocular depth estimation and diffusion models to generate dense stereo matching data from single-view images. Second, to tackle sparse labels in real-world datasets, we transfer knowledge from monocular depth estimation models, using pseudo-mono depth labels and a dynamic scale- and shift-invariant loss for additional supervision. Furthermore, we incorporate a vision foundation model as an encoder to extract robust and transferable features, boosting accuracy and generalization. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach, achieving significant improvements in accuracy over existing methods, particularly in scenarios with limited labeled data and domain shifts.
3533: Going Beyond Consistency: Target-oriented Multi-view Graph Neural Network
Authors: Sujia Huang, Lele Fu, Shuman Zhuang, Yide Qiu, Bo Huang, Zhen Cui, Tong Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Multi-view learning has emerged as a pivotal research area driven by the growing heterogeneity of real-world data, and graph neural network-based models, which model multi-view data as multi-view graphs, have achieved remarkable performance by revealing their deep semantics. However, by assuming cross-view consistency, most approaches collect not only task-relevant (determinative) semantics but also symbiotic yet task-irrelevant (incidental) factors, which obscure model inference. Furthermore, these approaches often lack rigorous theoretical analysis that bridges training data to test data. To address these issues, we propose Target-oriented Graph Neural Network (TGNN), a novel framework that goes beyond traditional consistency by prioritizing task-relevant information, ensuring alignment with the target. Specifically, TGNN employs a class-level dual-objective loss to minimize the classification similarity between determinative and incidental factors, accentuating the former while suppressing the latter during model inference. Meanwhile, to ensure consistency between the learned semantics and predictions in representation learning, we introduce a penalty term that aims to amplify the divergence between these two types of factors. Furthermore, we derive an upper bound on the loss discrepancy between training and test data, providing formal guarantees for generalization to test domains. Extensive experiments conducted on three types of multi-view datasets validate the superiority of TGNN.
3551: BMIP: Bi-directional Modality Interaction Prompt Learning for VLM
Authors: Song-Lin Lv, Yu-Yang Chen, Zhi Zhou, Ming Yang, Lan-Zhe Guo
Location: Guangzhou | Day: TBD
Show Abstract
Vision-language models (VLMs) have exhibited remarkable generalization capabilities, and prompt learning for VLMs has attracted great attention for the ability to adapt pre-trained VLMs to specific downstream tasks. However, existing studies mainly focus on single-modal prompts or uni-directional modality interaction, overlooking the powerful alignment effects resulting from the interaction between the vision and language modalities. To this end, we propose a novel prompt learning method called Bi-directional Modality Interaction Prompt (BMIP), which dynamically weights bi-modal information through learning the information of the attention layer, enhancing trainability and inter-modal consistency compared to simple information aggregation methods. To evaluate the effectiveness of prompt learning methods, we propose a more realistic evaluation paradigm called open-world generalization complementing the widely adopted cross-dataset transfer and domain generalization tasks. Comprehensive experiments on various datasets reveal that BMIP not only outperforms current state-of-the-art methods across all three evaluation paradigms but is also flexible enough to be combined with other prompt-based methods for consistent performance enhancement.
3575: Learn from Global Rather Than Local: Consistent Context-Aware Representation Learning for Multi-View Graph Clustering
Authors: Lele Fu, Bowen Deng, Sheng Huang, Tianchi Liao, Chuanfu Zhang, Chuan Chen
Location: Guangzhou | Day: TBD
Show Abstract
Multi-view graph clustering (MVGC) has attracted widespread interest owing to its ability to capture the complementary information among views, thereby enhancing the performance of node clustering. Despite the impressive achievements of existing methods, they share a common deficiency: they suffer from the curse of the local manifold and fail to perceive the global manifold structure. In light of this drawback, we propose a Consistent Context-Aware Representation Learning (CCARL) method for MVGC, aiming to learn node representations from global space rather than just local topology. Concretely, we define a set of anchors to establish the global coordinate, which are optimally mapped to multi-view graphs with minimal cost via fused Gromov-Wasserstein optimal transport. To fuse the complementary information in various views, the attention mechanism is employed to integrate multiple graph embeddings into a consistent representation. By transforming to the global coordinate connecting with anchors, the consistent representation captures the contextual information, and its clustering-friendliness is further enhanced through a self-training strategy. Finally, extensive experiments on four multi-view graph datasets demonstrate the effectiveness of the proposed CCARL over existing MVGC methods.
3579: On the Learning with Augmented Class via Forests
Authors: Fan Xu, Wuyang Chen, Wei Gao
Location: Guangzhou | Day: TBD
Show Abstract
Decision trees and forests have achieved successes in various real applications, most working with all testing classes known in training data. In this work, we focus on learning with augmented class via forests, where an augmented class may appear in testing data yet not in training data. We incorporate information of the augmented class into trees’ splitting; that is, we introduce the augmented Gini impurity, a new splitting criterion that exploits some unlabeled data from the testing distribution. We then develop the Learning with Augmented Class via Forests (short for LACForest) approach, which constructs shallow forests according to the augmented Gini impurity and then splits forests with pseudo-labeled augmented instances for better performance. We also develop deep neural forests via an optimization objective based on our augmented Gini impurity, which essentially utilizes the representation power of neural networks for forests. Theoretically, we present the convergence analysis for our augmented Gini impurity, and we finally conduct experiments to evaluate our approaches. The code is available at https://github.com/nju-xuf/LACForest.
3581: Noise-Resistant Label Reconstruction Feature Selection for Partial Multi-Label Learning
Authors: Wanfu Gao, Hanlin Pan, Qingqi Han, Kunpeng Liu
Location: Guangzhou | Day: TBD
Show Abstract
The "curse of dimensionality" is prevalent across various data patterns, which increases the risk of model overfitting and leads to a decline in model classification performance. However, few studies have focused on this issue in Partial Multi-label Learning (PML), where each sample is associated with a set of candidate labels, at least one of which is correct. Existing PML methods addressing this problem are mainly based on the low-rank assumption. However, the low-rank assumption is difficult to satisfy in practical situations and may lead to loss of high-dimensional information. Furthermore, we find that existing methods have poor ability to identify positive labels, which is important in real-world scenarios. In this paper, a PML feature selection method is proposed considering two important characteristics of the dataset: label relationship’s noise-resistance and label connectivity. Our proposed method utilizes label relationship’s noise-resistance to disambiguate labels. Then the learning process is designed through the reformed low-rank assumption. Finally, representative labels are found through label connectivity, and the weight matrix is reconstructed to select features with strong identification ability to these labels. The experimental results on benchmark datasets demonstrate the superiority of the proposed method.
3585: FS-KEN: Few-shot Knowledge Graph Reasoning by Adversarial Negative Enhancing
Authors: Lingyuan Meng, Ke Liang, Zeyu Zhu, Xinwang Liu, Wenpeng Lu
Location: Guangzhou | Day: TBD
Show Abstract
Few-shot knowledge graph reasoning (FS-KGR) tries to infer missing facts in a knowledge graph using limited data (such as only 3/5 samples). Existing strategies have shown good performance by mining more supervised information for few-shot learning through meta-learning and self-supervised learning. However, the problem of insufficient samples has not been fundamentally solved. In this paper, we propose a novel algorithm based on adversarial learning for Enhancing Negative samples in few-shot scenarios of FS-KGR, termed FS-KEN. Specifically, we are the first to use GANs to conduct data augmentation in the FS-KGR scenario. FS-KEN uses policy gradient GANs for negative sample augmentation, solving the gradient back-propagation issue in traditional GANs. The generator aims to produce high-quality negative entities, while the objective of the discriminator is to distinguish between generated entities and real entities. Comprehensive experiments conducted on two few-shot knowledge graph completion datasets reveal that FS-KEN surpasses other baseline models, achieving state-of-the-art results.
3591: Two-Stage Feature Generation with Transformer and Reinforcement Learning
Authors: Wanfu Gao, Zengyao Man, Zebin He, Yuhao Tang, Jun Gao, Kunpeng Liu
Location: Guangzhou | Day: TBD
Show Abstract
Feature generation is a critical step in machine learning, aiming to enhance model performance by capturing complex relationships within the data and generating meaningful new features. Traditional feature generation methods heavily rely on domain expertise and manual intervention, making the process labor-intensive and challenging to adapt to different scenarios. Although automated feature generation techniques address these issues to some extent, they often face challenges such as feature redundancy, inefficiency in feature space exploration, and limited adaptability to diverse datasets and tasks. To address these problems, we propose a Two-Stage Feature Generation (TSFG) framework, which integrates a Transformer-based encoder-decoder architecture with Proximal Policy Optimization (PPO). The encoder-decoder model in TSFG leverages the Transformer’s self-attention mechanism to efficiently represent and transform features, capturing complex dependencies within the data. PPO further enhances TSFG by dynamically adjusting the feature generation strategy based on task-specific feedback, optimizing the process for improved performance and adaptability. TSFG dynamically generates high-quality feature sets, significantly improving the predictive performance of machine learning models. Experimental results demonstrate that TSFG outperforms existing state-of-the-art methods in terms of feature quality and adaptability.
3600: R2DQG: A Quality Meets Diversity Framework for Question Generation over Knowledge Bases
Authors: Yimeng Ren, Yanhua Yu, Lizi Liao, Yuhu Shang, Kangkang Lu, Mingliang Yan
Location: Guangzhou | Day: TBD
Show Abstract
The task of Knowledge-Based Question Generation (KBQG) involves generating natural language questions from structured knowledge sources, posing unique challenges in balancing linguistic diversity and semantic relevance. Existing models often focus on maximizing surface-level similarity to ground-truth questions, neglecting the need for diverse syntactic forms and leading to semantic drift during generation. To overcome these challenges, we propose Refine-Reinforced Diverse Question Generation (R2DQG), a two-phase framework leveraging a generation-then-refinement paradigm. The Generator first constructs a diverse set of expressive templates using dependency parse tree similarity, capturing a wide range of syntactic patterns and styles. These templates guide the creation of question drafts, ensuring both diversity and semantic relevance. In the second phase, a Corrector module refines the drafts to mitigate semantic drift and enhance overall coherence and quality. Experiments on public datasets show that R2DQG outperforms state-of-the-art models in generating diverse, contextually accurate questions. Moreover, synthetic datasets generated by R2DQG enhance downstream QA performance, underscoring the practical utility of our approach.
3603: STLSP: Integrating Structure and Text with Large Language Models for Link Sign Prediction of Networks
Authors: Lijia Ma, Haoyang Fu, Zhijie Cao, Xiongnan Jin, Qiuzhen Lin, Jianqiang Li
Location: Montreal | Day: August 21st | Time: 15:00 | Session: DM: Graph Data Mining
Show Abstract
Link Sign Prediction (LSP) in signed networks is a critical task with applications in recommendation systems, community detection, and social network analysis. Existing methods primarily rely on graph neural networks to exploit structural information, often neglecting the valuable insights from edge-level textual data. Furthermore, utilizing large language models (LLMs) for LSP faces challenges in reliability and interpreting graph structures. To address these issues, we propose a novel STLSP framework that integrates signed networks’ Structural and Textual information with LLMs for the LSP task. STLSP leverages structural balance theory to generate node embeddings that capture positive and negative relationships. These embeddings are transformed into natural language representations through clustering techniques, allowing LLMs to utilize the structural context fully. By integrating these representations with edge text, STLSP improves the accuracy and reliability of the LSP task. Extensive experiments conducted on five real-world datasets demonstrate that STLSP outperforms state-of-the-art baselines, achieving an 8.7% improvement in terms of accuracy. Moreover, STLSP shows robust performance across various LLMs, making it adaptable to different computational environments. The code and data are publicly available at https://github.com/sss483/STLSP.
3621: PNAct: Crafting Backdoor Attacks in Safe Reinforcement Learning
Authors: Weiran Guo, Guanjun Liu, Ziyuan Zhou, Ling Wang
Location: Guangzhou | Day: TBD
Show Abstract
Reinforcement Learning (RL) is widely used in tasks where agents interact with an environment to maximize rewards. Building on this foundation, Safe Reinforcement Learning (Safe RL) incorporates a cost metric alongside the reward metric, ensuring that agents adhere to safety constraints during decision-making. In this paper, we identify that Safe RL is vulnerable to backdoor attacks, which can manipulate agents into performing unsafe actions. First, we introduce the relevant concepts and evaluation metrics for backdoor attacks in Safe RL. Then, we present the first attack framework in the Safe RL field that uses both Positive and Negative Action samples (PNAct) to implant backdoors, where positive action samples provide reference actions and negative action samples indicate actions to be avoided. We theoretically point out the properties of PNAct and design an attack algorithm. Finally, we conduct experiments to evaluate the effectiveness of our proposed backdoor attack framework, evaluating it with the established metrics. This paper highlights the potential risks associated with Safe RL and underscores the feasibility of such attacks. Our code and supplementary material are available at https://github.com/azure-123/PNAct.
3632: Optimal Policy Adaptation Under Covariate Shift
Authors: Xueqing Liu, Qinwei Yang, Zhaoqing Tian, Ruocheng Guo, Peng Wu
Location: Guangzhou | Day: TBD
Show Abstract
Transfer learning of prediction models has been extensively studied, while the corresponding policy learning approaches are rarely discussed. In this paper, we propose principled approaches for learning the optimal policy in the target domain by leveraging two datasets: one with full information from the source domain and the other from the target domain with only covariates. First, in the setting of covariate shift, we formulate the problem from a perspective of causality and present the identifiability assumptions for the reward induced by a given policy. Then, we derive the efficient influence function and the semiparametric efficiency bound for the reward. Based on this, we construct a doubly robust and semiparametric efficient estimator for the reward and then learn the optimal policy by optimizing the estimated reward. Moreover, we theoretically analyze the bias and the generalization error bound for the learned policy. Furthermore, in the presence of both covariate and concept shifts, we propose a novel sensitivity analysis method to evaluate the robustness of the proposed policy learning approach. Extensive experiments demonstrate that the approach not only estimates the reward more accurately but also yields a policy that closely approximates the theoretically optimal policy.
3645: TSTAI: A Time-varying Brain Effective Connectivity Network Construction Method Combining with Brain Active Information
Authors: Qi Chen, Zhiqiong Wang, Jiaxin Li, Jinying Tao, Junchang Xin
Location: Guangzhou | Day: TBD
Show Abstract
More accurate construction of brain effective connectivity networks remains a great challenge to achieve accurate auxiliary diagnosis of brain diseases and in-depth exploration of brain function. However, existing methods only consider higher-order or non-stationary assumptions, rather than simultaneously constructing higher-order and non-stationary networks. Among many existing methods, Bayesian network methods demonstrate superior network structure learning ability. In this work, the forward-backward search (FBS) method is optimized by using brain active information, which is improved to a higher-order network structure learning method, called TSTAI. Firstly, in the process of non-stationary network structure learning, a two-stage idea is used to search the change points. Then, in the process of learning higher-order network structure, the FBS method is combined with two kinds of brain active information to improve the condition set filtering process and scoring function, respectively. Finally, the pruning strategy is used to reduce the search space. Extensive experiments on simulated and real data demonstrate the effectiveness of TSTAI. In these experiments, TSTAI is compared with state-of-the-art higher-order network construction methods, and the proposed method achieves an improvement of 3.6% and 17.4% respectively in the network construction accuracy.
3647: Efficient Inter-Operator Scheduling for Concurrent Recommendation Model Inference on GPU
Authors: Shuxi Guo, Zikang Xu, Jiahao Liu, Jinyi Zhang, Qi Qi, Haifeng Sun, Jun Huang, Jianxin Liao, Jingyu Wang
Location: Guangzhou | Day: TBD
Show Abstract
Deep learning-based recommendation systems are increasingly important in the industry. To meet strict SLA requirements, serving frameworks must efficiently handle concurrent queries. However, current serving systems fail to serve concurrent queries due to the following problems: (1) inefficient operator (op) scheduling due to the query-wise op launching mechanism, and (2) heavy contention caused by the mutable nature of recommendation model inference. This paper presents RecOS, a system designed to optimize concurrent recommendation model inference on GPUs. RecOS efficiently schedules ops from different queries by monitoring GPU workloads and assigning ops to the most suitable streams. This approach reduces contention and enhances inference efficiency by leveraging inter-op parallelism and op characteristics. To maintain correctness across multiple CUDA streams, RecOS introduces a unified asynchronous tensor management mechanism. Evaluations demonstrate that RecOS improves online service performance, reducing latency by up to 68%.
3663: Distilling A Universal Expert from Clustered Federated Learning
Authors: Zeqi Leng, Chunxu Zhang, Guodong Long, Riting Xia, Bo Yang
Location: Guangzhou | Day: TBD
Show Abstract
Clustered Federated Learning (CFL) addresses the challenges posed by non-IID data by training multiple group- or cluster-specific expert models. However, existing methods often overlook the shared information across clusters, which represents the generalizable knowledge valuable to all participants in the Federated Learning (FL) system. To overcome this limitation, this paper introduces a novel FL framework that distills a universal expert model from the knowledge of multiple clusters. This universal expert captures globally shared information across all clients and is subsequently distributed to each client as the initialization for the next round of model training. The proposed FL framework operates in three iterative steps: (1) local model training at each client, (2) cluster-specific model aggregation, and (3) universal expert distillation. This three-step learning paradigm ensures the preservation of fine-grained non-IID characteristics while effectively incorporating shared knowledge across clusters. Compared to traditional gradient-based aggregation methods, the distillation-based model aggregation introduces greater flexibility in handling model heterogeneity and reduces conflicts among cluster-specific experts. Extensive experimental results demonstrate the superior performance of the proposed method across various scenarios, highlighting its potential to advance the state of CFL by balancing personalized and shared knowledge more effectively.
3672: Exploring Efficient and Effective Sequence Learning for Visual Object Tracking
Authors: Dongdong Li, Zhinan Gao, Yangliu Kuai, Rui Chen
Location: Guangzhou | Day: TBD
Show Abstract
Sequence learning based tracking frameworks are popular in the tracking community. In practice, their auto-regressive sequence generation manner leads to inferior performance and high latency compared with the latest advanced trackers. In this paper, to mitigate this issue, we propose an efficient and effective sequence-to-sequence tracking framework named FastSeqTrack. FastSeqTrack differs from previous sequence learning based trackers in terms of token initialization and sequence generation manner. Four tracking tokens are appended to patch embeddings and generated in the encoder as initial guesses for the bounding box sequence, which improves the tracking accuracy compared with randomly initialized tokens. Tracking tokens are then fed into the decoder in parallel in a one-pass manner, which greatly boosts the forward inference speed compared with the auto-regressive manner. Inspired by the early-exit mechanism, we inject internal classifiers after each decoder layer to terminate forward inference early when the softmax confidence is sufficiently reliable. In easy tracking frames, early exits avoid network overthinking and unnecessary computation. Extensive experiments on multiple benchmarks demonstrate that FastSeqTrack runs at over 100 fps and showcases superior performance against state-of-the-art trackers. Codes and models are available at https://github.com/vision4drones/FastSeqTrack.
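The early-exit idea in this abstract (stop forward inference once an internal classifier's softmax confidence is high enough) can be sketched generically as follows. This is an illustration under stated assumptions, not FastSeqTrack's implementation: the per-layer logits are supplied by a hypothetical generator standing in for the decoder layers, and the 0.9 threshold is an arbitrary choice.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(layer_logits, threshold=0.9):
    """Consume per-layer logits lazily and stop at the first confident layer.

    layer_logits: iterable yielding one logit list per decoder layer
                  (hypothetical stand-in for the internal classifiers).
    Returns (predicted_index, layers_used).
    """
    probs, depth = None, 0
    for logits in layer_logits:
        depth += 1
        probs = softmax(logits)
        if max(probs) >= threshold:
            break  # confident enough: remaining layers are never evaluated
    if probs is None:
        raise ValueError("layer_logits must yield at least one layer")
    return probs.index(max(probs)), depth


def fake_decoder_layers():
    # Toy logits that become more peaked with depth.
    yield [0.1, 0.2, 0.0]
    yield [0.0, 5.0, 0.0]
    yield [0.0, 9.0, 0.0]


print(early_exit_predict(fake_decoder_layers()))  # prints (1, 2)
```

Passing the layers as a generator means the third layer is never computed in this example, which is where the saving on easy frames would come from.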
3680: RAMer: Reconstruction-based Adversarial Model for Multi-party Multi-modal Multi-label Emotion Recognition
Authors: Xudong Yang, Yizhang Zhu, Hanfeng Liu, Zeyi Wen, Nan Tang, Yuyu Luo
Location: Guangzhou | Day: TBD
Show Abstract
Conventional Multi-modal multi-label emotion recognition (MMER) assumes complete access to visual, textual, and acoustic modalities. However, real-world multi-party settings often violate this assumption, as non-speakers frequently lack acoustic and textual inputs, leading to a significant degradation in model performance. Existing approaches also tend to unify heterogeneous modalities into a single representation, overlooking each modality’s unique characteristics. To address these challenges, we propose RAMer (Reconstruction-based Adversarial Model for Emotion Recognition), which refines multi-modal representations by not only exploring modality commonality and specificity but crucially by leveraging reconstructed features, enhanced by contrastive learning, to overcome data incompleteness and enrich feature quality. RAMer also introduces a personality auxiliary task to complement missing modalities using modality-level attention, improving emotion reasoning. To further strengthen the model’s ability to capture label and modality interdependency, we propose a stack shuffle strategy to enrich correlations between labels and modality-specific features. Experiments on three benchmarks, i.e., MEmoR, CMU-MOSEI, and M³ED, demonstrate that RAMer achieves state-of-the-art performance in dyadic and multi-party MMER scenarios.
3693: Beyond Symmetry in Repeated Games with Restarts
Authors: Henry Fleischmann, Kiriaki Fragkia, Ratip Emin Berker
Location: Montreal | Day: August 21st | Time: 10:00 | Session: GTEP: Noncooperative games
Show Abstract
Infinitely repeated games support equilibrium concepts beyond those present in one-shot games (e.g., cooperation in the prisoner’s dilemma). Nonetheless, repeated games fail to capture our real-world intuition for settings with many anonymous agents interacting in pairs. Repeated games with restarts, introduced by Berker and Conitzer, address this concern by giving players the option to restart the game with someone new whenever their partner deviates from an agreed-upon sequence of actions. In their work, they studied symmetric games with symmetric strategies. We significantly extend these results, introducing and analyzing more general notions of equilibria in asymmetric games with restarts. We characterize which goal strategies players can be incentivized to play in equilibrium, and we consider the computational problem of finding such sequences of actions with minimal cost for the agents. We show that this problem is NP-hard in general. However, when the goal sequence maximizes social welfare, we give a pseudo-polynomial time algorithm.
3700: Where and When: Predict Next POI and Its Explicit Timestamp in Sequential Recommendation
Authors: Yuanbo Xu, Hongxu Shen, Yiheng Jiang, En Wang
Location: Guangzhou | Day: TBD
Show Abstract
Sequential point-of-interest (POI) recommendation aims to recommend the next POI for users in accordance with their historical check-in information. However, few attempts treat timestamps of check-ins as a core factor for sequence models, leading to insufficient insight into user behavior and subsequently suboptimal recommendations. To address these limitations, we propose to assign equal importance to both POIs and their timestamps, shifting the point of view to recommend the next POI and predict the corresponding timestamp. Along these lines, we present the Time-Aware POI Recommender with Timestamp Prediction (TAPT), a multi-task learning framework for explainable POI recommendations. Specifically, we begin by decoupling timestamps into multi-dimensional vectors and propose a timestamp encoding module to explicitly encode these vectors. Additionally, we design a specialized timestamp prediction module built on the traditional sequence-based POI recommender backbone, effectively learning the strong correlation between POIs and their corresponding timestamps through these two modules. We evaluated the proposed model with three real-world LBSN datasets and demonstrated that TAPT achieves comparable or superior performance in POI recommendation compared to the baseline backbone. Besides, TAPT can not only recommend the next POI but also predict the corresponding timestamp in the future.
3716: Metapath and Hypergraph Structure-based Multi-Channel Graph Contrastive Learning for Student Performance Prediction
Authors: Lingyun Song, Xiaofan Sun, Xinbiao Gan, Yudai Pan, Xiaolin Han, Jie Ma, Jun Liu, Xuequn Shang
Location: Guangzhou | Day: TBD
Show Abstract
Considerable attention has been paid to predicting student performance on exercises. The performance of prior studies is determined by the quality of the trait features of students and exercises. Nevertheless, most prior studies primarily examine simple pairwise interactions in learning trait features, like those between students and exercises or exercises and concepts, while disregarding the complex higher-order interactions that typically exist among these components, which in turn hinders the prediction results. In this paper, we propose an innovative Multi-Channel Graph Contrastive Learning (MCGCL) framework that integrates various high-order interactions for predicting student performance. MCGCL characterizes graph structures reflecting various high-order relationships among students, exercises, and concepts through multiple channels, thereby enhancing the trait features of both students and exercises. Moreover, graph contrastive learning is employed to enhance the representation of trait features acquired from high-order graph structures in diverse views. Extensive experiments on real-world datasets show that MCGCL achieves state-of-the-art results on the task of predicting student performance. The code is available at https://github.com/sunlitsong/MCGCL.
3726: ShortcutProbe: Probing Prediction Shortcuts for Learning Robust Models
Authors: Guangtao Zheng, Wenqian Ye, Aidong Zhang
Location: Montreal | Day: August 21st | Time: 10:00 | Session: Machine Learning (4/4)
Show Abstract
Deep learning models often achieve high performance by inadvertently learning spurious correlations between targets and non-essential features. For example, an image classifier may identify an object via its background that spuriously correlates with it. This prediction behavior, known as spurious bias, severely degrades model performance on data that lacks the learned spurious correlations. Existing methods on spurious bias mitigation typically require a variety of data groups with spurious correlation annotations called group labels. However, group labels require costly human annotations and often fail to capture subtle spurious biases such as relying on specific pixels for predictions. In this paper, we propose a novel post hoc spurious bias mitigation framework without requiring group labels. Our framework, termed ShortcutProbe, identifies prediction shortcuts that reflect potential non-robustness in predictions in a given model’s latent space. The model is then retrained to be invariant to the identified prediction shortcuts for improved robustness. We theoretically analyze the effectiveness of the framework and empirically demonstrate that it is an efficient and practical tool for improving a model’s robustness to spurious bias on diverse datasets.
3731: Approximated Behavioral Metric-based State Projection for Federated Reinforcement Learning
Authors: Zengxia Guo, Bohui An, Zhongqi Lu
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Federated Learning
Show Abstract
Federated reinforcement learning (FRL) methods usually share the encrypted local state or policy information and help each client to learn from others while preserving everyone’s privacy. In this work, we propose that sharing the approximated behavior metric-based state projection function is a promising way to enhance the performance of FRL while concurrently providing effective protection of sensitive information. We introduce FedRAG, an FRL framework that learns a computationally practical projection function of states for each client and aggregates the parameters of projection functions at a central server. The FedRAG approach shares no sensitive task-specific information, yet provides information gain for each client. We conduct extensive experiments on the DeepMind Control Suite to demonstrate insightful results.
3743: GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers
Authors: Xinyu Li, Qi Yao, Yuanda Wang
Location: Montreal | Day: August 19th | Time: 15:00 | Session: CV: Diffusion models
Show Abstract
Garment sewing patterns are fundamental design elements that bridge the gap between design concepts and practical manufacturing. The generative modeling of sewing patterns is crucial for creating diversified garments. However, existing approaches are limited either by reliance on a single input modality or by suboptimal generation efficiency. In this work, we present GarmentDiffusion, a new generative model capable of producing centimeter-precise, vectorized 3D sewing patterns from multimodal inputs (text, image, and incomplete sewing pattern). Our method efficiently encodes 3D sewing pattern parameters into compact edge token representations, achieving a sequence length that is 10 times shorter than that of the autoregressive SewingGPT in DressCode. By employing a diffusion transformer, we simultaneously denoise all edge tokens along the temporal axis, while maintaining a constant number of denoising steps regardless of dataset-specific edge and panel statistics. With all of our design choices combined, the sewing pattern generation speed is accelerated by 100 times compared to SewingGPT. We achieve new state-of-the-art results on DressCodeData, as well as on the largest sewing pattern dataset, namely GarmentCodeData. The project website is available at https://shenfu-research.github.io/Garment-Diffusion.
3756: Learning Advanced Self-Attention for Linear Transformers in the Singular Value Domain
Authors: Hyowon Wi, Jeongwhan Choi, Noseong Park
Location: Montreal | Day: August 21st | Time: 15:00 | Session: ML: Large Language Models
Show Abstract
Transformers have demonstrated remarkable performance across diverse domains. The key component of Transformers is self-attention, which learns the relationship between any two tokens in the input sequence. Recent studies have revealed that the self-attention can be understood as a normalized adjacency matrix of a graph. Notably, from the perspective of graph signal processing (GSP), the self-attention can be equivalently defined as a simple graph filter, applying GSP using the value vector as the signal. However, the self-attention is a graph filter defined with only the first order of the polynomial matrix, and acts as a low-pass filter preventing the effective leverage of various frequency information. Consequently, existing self-attention mechanisms are designed in a rather simplified manner. Therefore, we propose a novel method, called Attentive Graph Filter (AGF), interpreting the self-attention as learning the graph filter in the singular value domain from the perspective of graph signal processing for directed graphs with the linear complexity w.r.t. the input length. In our experiments, we demonstrate that AGF achieves state-of-the-art performance on various tasks, including Long Range Arena benchmark and time series classification. Code is available at https://github.com/hyowonwi/agf.
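The graph-filter reading of self-attention in the abstract above can be illustrated with a small sketch. It is not the AGF method and does not have linear complexity; it only shows, under our own naming (`attention_matrix` and `polynomial_graph_filter` are hypothetical helpers), that the softmax attention matrix is row-stochastic and that standard attention equals a first-order polynomial filter applied to the value matrix.

```python
import numpy as np

def attention_matrix(Q, K):
    """Row-wise softmax of scaled dot products; each row sums to 1."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def polynomial_graph_filter(A, V, coeffs):
    """Apply sum_k coeffs[k] * A^k to the graph signal V."""
    out = np.zeros_like(V)
    power = V.copy()  # A^0 V
    for c in coeffs:
        out = out + c * power
        power = A @ power  # advance to the next power of A
    return out

rng = np.random.default_rng(0)
n, d = 6, 4  # illustrative sequence length and head dimension
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

A = attention_matrix(Q, K)  # normalized adjacency of a directed graph
# Standard self-attention: only the first-order term A^1 is used.
first_order = polynomial_graph_filter(A, V, coeffs=[0.0, 1.0])
# A richer (hypothetical) filter mixing orders 0, 1 and 2.
higher_order = polynomial_graph_filter(A, V, coeffs=[0.5, 1.0, -0.5])
```

Here `first_order` coincides with the usual `A @ V`, while `higher_order` shows what a filter with more polynomial terms would look like; the coefficients are arbitrary illustration values.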
3758: Automated Superscalar Processor Design by Learning Data Dependencies
Authors: Shuyao Cheng, Rui Zhang, Wenkai He, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Yifan Hao, Guanglin Xu, Yuanbo Wen, Ling Li, Qi Guo, Yunji Chen
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Machine Learning (2/4)
Show Abstract
Automated processor design, which can significantly reduce human efforts and accelerate design cycles, has received considerable attention. While recent advancements have automatically designed single-cycle processors that execute one instruction per cycle, their performance cannot compete with modern superscalar processors that execute multiple instructions per cycle. Previous methods fail on superscalar processor design because they cannot address inter-instruction data dependencies, leading to inefficient sequential instruction execution.

This paper proposes a novel approach to automatically designing superscalar processors using a hardware-friendly model called the Stateful Binary Speculation Diagram (State-BSD). We observe that processor parallelism can be enhanced through on-the-fly inter-instruction dependent data predictors, reusing the processor’s internal states to learn the data dependency. To meet the challenges of both limited hardware resources and functional correctness of the design, State-BSD consists of two components: 1) a lightweight state-selector trained by simulated annealing to detect the most reusable processor states and store them in a small buffer; and 2) a highly precise state-speculator trained by the BSD expansion method to predict the inter-instruction dependent data using the selected states. This is the first work to achieve automated superscalar processor design, i.e., QiMeng-CPU-v2, which improves performance by about 380x over the state-of-the-art automated design and is comparable to human-designed superscalar processors such as the ARM Cortex A53.
3768: Efficient Diversity-based Experience Replay for Deep Reinforcement Learning
Authors: Kaiyan Zhao, Yiming Wang, Yuyang Chen, Yan Li, Leong Hou U, Xiaoguang Liu
Location: Guangzhou | Day: TBD
Show Abstract
Experience replay is widely used to improve learning efficiency in reinforcement learning by leveraging past experiences. However, existing experience replay methods, whether based on uniform or prioritized sampling, often suffer from low efficiency, particularly in real-world scenarios with high-dimensional state spaces. To address this limitation, we propose a novel approach, Efficient Diversity-based Experience Replay (EDER). EDER employs a determinantal point process to model the diversity between samples and prioritizes replay accordingly. To further enhance learning efficiency, we incorporate Cholesky decomposition for handling large state spaces in realistic environments. Additionally, rejection sampling is applied to select samples with higher diversity, thereby improving overall learning efficacy. Extensive experiments are conducted on robotic manipulation tasks in MuJoCo, Atari games, and realistic indoor environments in Habitat. The results demonstrate that our approach not only significantly improves learning efficiency but also achieves superior performance in high-dimensional, realistic environments.
3776: Graph Random Walk with Feature-Label Space Alignment: A Multi-Label Feature Selection Method
Authors: Wanfu Gao, Jun Gao, Qingqi Han, Hanlin Pan, Kunpeng Liu
Location: Guangzhou | Day: TBD
Show Abstract
The rapid growth in feature dimension may introduce implicit associations between features and labels in multi-label datasets, making the relationships between features and labels increasingly complex. Moreover, existing methods often adopt low-dimensional linear decomposition to explore the associations between features and labels. However, linear decomposition struggles to capture complex nonlinear associations and may lead to misalignment between the feature space and the label space. To address these two critical challenges, we propose innovative solutions. First, we design a random walk graph that integrates feature-feature, label-label, and feature-label relationships to accurately capture nonlinear and implicit indirect associations, while optimizing the latent representations of associations between features and labels after low-rank decomposition. Second, we align the variable spaces by leveraging low-dimensional representation coefficients, while preserving the manifold structure between the original high-dimensional multi-label data and the low-dimensional representation space. Extensive experiments and ablation studies conducted on seven benchmark datasets and three representative datasets using various evaluation metrics demonstrate the superiority of the proposed method.
3777: Unveiling Maternity and Infant Care Conversations: A Chinese Dialogue Dataset for Enhanced Parenting Support
Authors: Bo Xu, Liangzhi Li, Junlong Wang, Xuening Qiao, Erchen Yu, Yiming Qian, Linlin Zong, Hongfei Lin
Location: Guangzhou | Day: TBD
Show Abstract
The rapid development of large language models has greatly advanced human-computer dialogue research. However, applying these models to specialized fields like maternity and infant care often leads to subpar performance due to a lack of domain-specific datasets. To address this problem, we have created MicDialogue, a Chinese dialogue dataset for maternity and infant care. MicDialogue involves a wide range of specialized topics, including gynecological health, pediatric care, pregnancy preparation, emotional counseling and other related topics. This dataset is curated from two types of Chinese social media: short videos and blog posts. Short videos capture real-time interactions and pragmatic dialogue patterns, while blog posts offer comprehensive coverage of various topics within the domain. We have also included detailed annotations for topics, diseases, symptoms, and causes, enabling in-depth research. Additionally, we developed a knowledge-driven benchmark model using LLM-based prompt learning and multiple knowledge graphs to address diverse dialogue topics. Experiments validate MicDialogue’s usability, providing benchmarks for future research and essential data for fine-tuning language models in maternity and infant care.
3796: Unlocking Dark Vision Potential for Medical Image Segmentation
Authors: Hongpeng Yang, Xiangyu Hu, Yingxin Chen, Siyu Chen, Srihari Nelakuditi, Yan Tong, Shiqiang Ma, Fei Guo
Location: Guangzhou | Day: TBD
Show Abstract
Accurate segmentation of lesions is crucial for disease diagnosis and treatment planning. However, blurring and low contrast in the imaging process can affect segmentation results. We have observed that noninvasive medical imaging shares considerable similarities with natural images under low-light conditions and that nocturnal animals possess extremely strong night vision capabilities. Inspired by the dark vision of these nocturnal animals, we propose a novel plug-and-play dark vision network (DVNet) to enhance the model’s perception of low-contrast medical images. Specifically, by employing the wavelet transform, we decompose medical images into subbands of varying frequencies, mimicking the sensitivity of photoreceptor cells to different light intensities. To simulate the antagonistic receptive fields of horizontal cells and bipolar cells, we design a Mamba-Enhanced Fusion Module to achieve global information correlation and enhance contrast between lesions and surrounding healthy tissues. Extensive experiments demonstrate that DVNet achieves SOTA performance in various medical image segmentation tasks.
3805: Dividing Conflicting Items Fairly
Authors: Ayumi Igarashi, Pasin Manurangsi, Hirotaka Yoneda
Location: Montreal | Day: August 21st | Time: 11:30 | Session: GTEP: Fair division
Show Abstract
We study the allocation of indivisible goods under conflict constraints, represented by a graph. In this framework, vertices correspond to goods and edges correspond to conflicts between a pair of goods. Each agent is allocated an independent set in the graph. In a recent work of Kumar et al. (AAMAS, 2024), it was shown that a maximal EF1 allocation exists for interval graphs and two agents with monotone valuations. We significantly extend this result by establishing that a maximal EF1 allocation exists for any graph when the two agents have monotone valuations. To compute such an allocation, we present a polynomial-time algorithm for additive valuations, as well as a pseudo-polynomial time algorithm for monotone valuations. Moreover, we complement our findings by providing a counterexample demonstrating that a maximal EF1 allocation may not exist for three agents with monotone valuations; further, we establish NP-hardness of determining the existence of such allocations for every fixed number n >= 3 of agents. All of our results for goods also apply to the allocation of chores.
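The three properties named in the abstract above (independent bundles, EF1, maximality) can be checked on a concrete allocation with a short sketch. It covers additive valuations over goods only and is a verifier, not the paper's algorithm. The function names are hypothetical, and "maximal" is read here as "no unallocated good can be added to any bundle without creating a conflict", which is our assumption about the definition.

```python
from itertools import combinations

def is_independent(bundle, conflicts):
    """True if no two goods in the bundle are joined by a conflict edge."""
    return all(frozenset(pair) not in conflicts for pair in combinations(bundle, 2))

def is_ef1(bundles, valuations):
    """EF1 for additive valuations: envy vanishes after removing one good from the envied bundle."""
    for i, own in enumerate(bundles):
        own_value = sum(valuations[i][g] for g in own)
        for j, other in enumerate(bundles):
            if i == j or not other:
                continue
            other_value = sum(valuations[i][g] for g in other)
            best_removal = max(valuations[i][g] for g in other)
            if own_value < other_value - best_removal:
                return False
    return True

def is_maximal(bundles, goods, conflicts):
    """True if no unallocated good fits into any bundle without a conflict (assumed definition)."""
    allocated = set().union(*bundles)
    for g in goods - allocated:
        if any(is_independent(bundle | {g}, conflicts) for bundle in bundles):
            return False
    return True

# Path graph a - b - c: good b conflicts with both a and c.
goods = {"a", "b", "c"}
conflicts = {frozenset({"a", "b"}), frozenset({"b", "c"})}
valuations = [{"a": 1, "b": 2, "c": 1}, {"a": 1, "b": 2, "c": 1}]
bundles = [{"a", "c"}, {"b"}]

feasible = all(is_independent(b, conflicts) for b in bundles)
fair = is_ef1(bundles, valuations)
maximal = is_maximal(bundles, goods, conflicts)
```

In this toy instance both agents value their own bundle at 2, so the allocation is envy-free and therefore EF1, and every good is allocated, so it is also maximal.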
3814: Endowing Interpretability for Neural Cognitive Diagnosis by Efficient Kolmogorov-Arnold Networks
Authors: Shangshang Yang, Linrui Qin, Xiaoshan Yu, Ziwen Wang, Xueming Yan, Haiping Ma, Ye Tian
Location: Guangzhou | Day: TBD
Show Abstract
Cognitive diagnosis is crucial for intelligent education because of its ability to reveal students’ proficiency in knowledge concepts. Although neural network-based cognitive diagnosis models (CDMs) have exhibited significantly better performance than traditional models, neural cognitive diagnosis is criticized for poor model interpretability due to the multi-layer perceptron (MLP) employed, even with the monotonicity assumption. Therefore, this paper proposes to improve the interpretability of neural cognitive diagnosis models through efficient Kolmogorov-Arnold networks (KANs), in an approach named KAN2CD, where KANs are used to enhance interpretability in two ways. In the first, KANs directly replace the MLPs used in existing neural CDMs; in the second, the student embedding, exercise embedding, and concept embedding are directly processed by several KANs, and their outputs are then combined and learned in a unified KAN to obtain the final predictions. In addition, the implementation of the original KANs is modified, without affecting interpretability, to overcome their slow training. Extensive experiments show that KAN2CD outperforms traditional CDMs and slightly surpasses existing neural CDMs, and that its learned structures ensure interpretability on par with traditional CDMs and better than neural CDMs. The datasets, associated code, and more experimental results are available at https://github.com/null233QAQ/KAN2CD.
3822: Reliable Disentanglement Multi-view Learning Against View Adversarial Attacks
Authors: Xuyang Wang, Siyuan Duan, Qizhi Li, Guiduo Duan, Yuan Sun, Dezhong Peng
Location: Guangzhou | Day: TBD
Show Abstract
Trustworthy multi-view learning has attracted extensive attention because evidence learning can provide reliable uncertainty estimation to enhance the credibility of multi-view predictions. Existing trusted multi-view learning methods implicitly assume that multi-view data is secure. However, in safety-sensitive applications such as autonomous driving and security monitoring, multi-view data often faces threats from adversarial perturbations, which deceive or disrupt multi-view models. This inevitably leads to the adversarial unreliability problem (AUP) in trusted multi-view learning. To overcome this tricky problem, we propose a novel multi-view learning framework, namely Reliable Disentanglement Multi-view Learning (RDML). Specifically, we first propose evidential disentanglement learning to decompose each view into clean and adversarial parts under the guidance of the corresponding evidence, which is extracted by a pretrained evidence extractor. Then, we employ the feature recalibration module to mitigate the negative impact of adversarial perturbations and extract potential informative features from them. Finally, to further ignore the irreparable adversarial interferences, a view-level evidential attention mechanism is designed. Extensive experiments on multi-view classification tasks with adversarial attacks show that RDML outperforms the state-of-the-art methods by a relatively large margin. Our code is available at https://github.com/Willy1005/2025-IJCAI-RDML.
3825: RePST: Language Model Empowered Spatio-Temporal Forecasting via Semantic-Oriented Reprogramming
Authors: Hao Wang, Jindong Han, Wei Fan, Leilei Sun, Hao Liu
Location: Guangzhou | Day: TBD
Show Abstract
Spatio-temporal forecasting is pivotal in numerous real-world applications, including transportation planning, energy management, and climate monitoring. In this work, we aim to harness the reasoning and generalization abilities of Pre-trained Language Models (PLMs) for more effective spatio-temporal forecasting, particularly in data-scarce scenarios. However, recent studies uncover that PLMs, which are primarily trained on textual data, often falter when tasked with modeling the intricate correlations in numerical time series, thereby limiting their effectiveness in comprehending spatio-temporal data. To bridge the gap, we propose RePST, a semantic-oriented PLM reprogramming framework tailored for spatio-temporal forecasting. Specifically, we first propose a semantic-oriented decomposer that adaptively disentangles spatially correlated time series into interpretable sub-components, which facilitates PLM to understand sophisticated spatio-temporal dynamics via a divide-and-conquer strategy. Moreover, we propose a selective discrete reprogramming scheme, which introduces an expanded spatio-temporal vocabulary space to project spatio-temporal series into discrete representations. This scheme minimizes the information loss during reprogramming and enriches the representations derived by PLMs. Extensive experiments on real-world datasets show that the proposed RePST outperforms twelve state-of-the-art baseline methods, particularly in data-scarce scenarios, highlighting the effectiveness and superior generalization capabilities of PLMs for spatio-temporal forecasting. Codes and Appendix can be found at https://github.com/usail-hkust/REPST.
3838: Graph Prompts: Adapting Video Graph for Video Question Answering
Authors: Yiming Li, Xiaoshan Yang, Bing-Kun Bao, Changsheng Xu
Location: Guangzhou | Day: TBD
Show Abstract
Due to the dynamic nature of videos, perceiving and reasoning about temporal information are the key focus of Video Question Answering (VideoQA). In recent years, several methods have explored relationship-level temporal modeling with graph-structured video representation. Unfortunately, these methods rely heavily on the question text, making it challenging to perceive and reason about video content that is not explicitly mentioned in the question. To address this challenge, we propose Graph Prompts-based VideoQA (GP-VQA), which adopts a video-based graph structure for enhanced video understanding. The proposed GP-VQA contains two stages, i.e., pre-training and prompt tuning. In pre-training, we define a pretext task that requires GP-VQA to reason about the randomly masked nodes or edges in the video graph, thus prompting GP-VQA to learn the reasoning ability with video-guided information. In prompt tuning, we organize the textual question into a question graph and implement message passing from the video graph to the question graph, thereby transferring the video-based reasoning ability from video graph completion to VideoQA. Extensive experiments on various datasets have demonstrated the promising performance of GP-VQA.
3845: From Individual to Universal: Regularized Multi-view Joint Representation for Multi-view Subspace-Preserving Recovery
Authors: Libin Wang, Yulong Wang, Xinwei He, Qiwei Xie, Kit Ian Kou, Yuan Yan Tang
Location: Guangzhou | Day: TBD
Show Abstract
Recent years have witnessed an explosion of Multi-view Subspace Classification (MSCla) and Multi-view Subspace Clustering (MSClu) methods for various applications. However, their theoretical foundations have not been well explored and understood. In this paper, we investigate the multi-view subspace-preserving recovery theory, which is the theoretical underpinning of MSCla and MSClu methods. Specifically, we derive novel geometrically interpretable conditions for the success of multi-view subspace-preserving recovery. Compared with prior related works, we make the following innovations: First, our theory does not require the equality constraint, which is a common requirement in prior theoretical works and may be too restrictive in reality. Second, we provide both an Individual Theoretical Guarantee (ITG) and a Universal Theoretical Guarantee (UTG) for multi-view subspace-preserving recovery, while prior works only give the UTG. Third, we also apply the proposed theory to establish theoretical guarantees for MSCla and MSClu, respectively. Numerical results validate the proposed theory for multi-view subspace-preserving recovery.
3856: HIPP: Protecting Image Privacy via High-Quality Reversible Protected Version
Authors: Xi Ye, Lina Wang, Run Wang, Jiatong Liu, Geying Yang
Location: Guangzhou | Day: TBD
Show Abstract
With the rapid development of the internet, sharing photos through Social Network Platforms (SNPs) has become a new way for people to socialize, which poses serious threats to personal privacy. Recently, a thumbnail-preserving image privacy protection technique has emerged and garnered widespread attention. However, the existing schemes based on this technique often introduce noticeable noise into the protected image, resulting in poor visual quality. Motivated by the observation that a latent vector can be decoupled into detail and contour components, in this paper we propose HIPP, a thumbnail-preserving image privacy protection scheme that decouples the detail and contour information contained in the latent vector corresponding to the original image and reconstructs details using a generative model. As a result, the generated protected image appears natural and has a thumbnail similar to the original one. Moreover, the protected images can be restored to versions that are indistinguishable from the original images. Experiments on the CelebA, Helen, and LSUN datasets show that the SSIM between the restored and original images reaches 0.9899. Furthermore, compared to previous works, HIPP achieves the lowest runtime and file expansion rate, with values of 0.07 seconds and 1.1046, respectively.
3863: Adaptive Wizard for Removing Cross-Tier Misconfigurations in Active Directory
Authors: Huy Q. Ngo, Mingyu Guo, Hung X. Nguyen
Location: Montreal | Day: August 21st | Time: 10:00 | Session: MTA: Security and privacy
Show Abstract
Security vulnerabilities in Windows Active Directory (AD) systems are typically modeled using an attack graph, and hardening AD systems involves an iterative workflow: security teams propose an edge to remove, and IT operations teams manually review these fixes before implementing the removal. As verification requires significant manual effort, we formulate an Adaptive Path Removal Problem to minimize the number of steps in this iterative removal process. In our model, a wizard proposes an attack path in each step and presents it as a set of multiple-choice options to the IT admin. The IT admin then selects one edge from the proposed set to remove. This process continues until the target t is disconnected from source s or the number of proposed paths reaches B. The model aims to optimize the human effort by minimizing the expected number of interactions between the IT admin and the security wizard. We first prove that the problem is #P-hard. We then propose a set of solutions including an exact algorithm, an approximate algorithm, and several scalable heuristics. Our best heuristic, called DPR, can operate effectively on larger-scale graphs compared to the exact algorithm and consistently outperforms the approximate algorithm across all graphs. We verify the effectiveness of our algorithms on several synthetic AD graphs and an AD attack graph collected from a real organization.
3867: Human-Imperceptible, Machine-Recognizable Images
Authors: Fusheng Hao, Fengxiang He, Yikai Wang, Fuxiang Wu, Jing Zhang, Dacheng Tao, Jun Cheng
Location: Guangzhou | Day: TBD
Show Abstract
Massive human-related data is collected to train neural networks for computer vision tasks. This exposes a major conflict for software engineers between developing better AI systems and keeping their distance from the sensitive training data. To reconcile this conflict, the paper proposes an efficient privacy-preserving learning paradigm, where images are encrypted to become “human-imperceptible, machine-recognizable” via one of two encryption strategies: (1) randomly shuffling equally sized patches and (2) mixing up sub-patches. Then, minimal adaptations are made to the vision transformer to enable it to learn on the encrypted images for vision tasks, including image classification and object detection. Extensive experiments on ImageNet and COCO show that the proposed paradigm achieves accuracy comparable to that of competitive methods. Decrypting the encrypted images requires solving an NP-hard jigsaw puzzle or an ill-posed inverse problem, which is empirically shown to be intractable for various attackers, including a powerful vision transformer-based attacker. We thus show that the proposed paradigm can ensure the encrypted images have become human-imperceptible while preserving machine-recognizable information.
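The first encryption strategy mentioned above, shuffling equally sized patches, can be sketched as follows. This is a minimal illustration under our own assumptions (a single image of shape height x width x channels, one global permutation drawn from a seeded generator), not the paper's implementation. The helper names `to_tiles`, `from_tiles`, `shuffle_patches` and `unshuffle_patches` are hypothetical.

```python
import numpy as np

def to_tiles(image, patch):
    """Split an (H, W, C) image into a stack of (patch, patch, C) tiles in row-major order."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be multiples of the patch size")
    gh, gw = h // patch, w // patch
    return image.reshape(gh, patch, gw, patch, c).swapaxes(1, 2).reshape(gh * gw, patch, patch, c)

def from_tiles(tiles, shape):
    """Inverse of to_tiles: reassemble tiles into an image of the given (H, W, C) shape."""
    h, w, c = shape
    patch = tiles.shape[1]
    gh, gw = h // patch, w // patch
    return tiles.reshape(gh, gw, patch, patch, c).swapaxes(1, 2).reshape(h, w, c)

def shuffle_patches(image, patch, rng):
    """Permute the patch positions; returns the scrambled image and the permutation used."""
    tiles = to_tiles(image, patch)
    perm = rng.permutation(len(tiles))
    return from_tiles(tiles[perm], image.shape), perm

def unshuffle_patches(scrambled, patch, perm):
    """Undo shuffle_patches for someone who knows the permutation."""
    tiles = to_tiles(scrambled, patch)
    return from_tiles(tiles[np.argsort(perm)], scrambled.shape)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in for a real photo
scrambled, perm = shuffle_patches(image, patch=8, rng=rng)
restored = unshuffle_patches(scrambled, patch=8, perm=perm)
```

Shuffling keeps every pixel value and only moves whole patches, which is why an attacker without the permutation faces a jigsaw-style reassembly problem, while anyone holding `perm` can invert the scrambling exactly.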
3881: DANCE: Resource-Efficient Neural Architecture Search with Data-Aware and Continuous Adaptation
Authors: Maolin Wang, Tianshuo Wei, Sheng Zhang, Ruocheng Guo, Wangyu Wang, Shanshan Ye, Lixin Zou, Xuetao Wei, Xiangyu Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Neural Architecture Search (NAS) has emerged as a powerful approach for automating neural network design. However, existing NAS methods face critical limitations in real-world deployments: architectures lack adaptability across scenarios, each deployment context requires costly separate searches, and performance consistency across diverse platforms remains challenging. We propose DANCE (Dynamic Architectures with Neural Continuous Evolution), which reformulates architecture search as a continuous evolution problem through learning distributions over architectural components. DANCE introduces three key innovations: a continuous architecture distribution enabling smooth adaptation, a unified architecture space with learned selection gates for efficient sampling, and a multi-stage training strategy for effective deployment optimization. Extensive experiments across five datasets demonstrate DANCE’s effectiveness. Our method consistently outperforms state-of-the-art NAS approaches in terms of accuracy while significantly reducing search costs. Under varying computational constraints, DANCE maintains robust performance while smoothly adapting architectures to different hardware requirements. The code and appendix can be found at https://github.com/Applied-Machine-Learning-Lab/DANCE.
3882: Towards Generalizable Neural Simulators: Addressing Distribution Shifts Induced by Environmental and Temporal Variations
Authors: Jiaqi Liu, Jiaxu Cui, Shiang Sun, Yizhu Zhao, Bo Yang
Location: Guangzhou | Day: TBD
Show Abstract
With advancements in deep learning, neural simulators have become increasingly important for improving the efficiency and effectiveness of simulating complex dynamical systems in various scientific and technological fields. This paper presents a novel neural simulator called Context-informed Polymorphic Neural ODE Processes (CoPoNDP), aimed at addressing the challenges of modeling dynamical systems encountering concurrent environmental and temporal distribution shifts, which are common in real-world scenarios. CoPoNDP employs a context-driven neural stochastic process governed by a combination of basic differential equations in a time-sensitive manner to adaptively modulate the evolution of system states. This allows for flexible adaptation to changing temporal dynamics and generalization across different environments. Extensive experiments conducted on dynamical systems from ecology, chemistry, physics, and energy demonstrate that by effectively utilizing contextual information, CoPoNDP outperforms the state-of-the-art models in handling joint distribution shifts. It also shows robustness in sparse and noisy settings, making it a promising approach for modeling dynamical systems in complex real-world applications.
3883: Dual Encoder Contrastive Learning with Augmented Views for Graph Anomaly Detection
Authors: Nannan Wu, Hongdou Dong, Wenjun Wang, Yiming Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Graph anomaly detection (GAD), which aims to identify patterns that deviate significantly from normal nodes in attributed networks, is widely used in financial fraud detection, cybersecurity, and bioinformatics. The paradigm of jointly optimizing contrastive learning and reconstruction learning has shown significant potential in this field. However, when GNNs are used as the encoder, this paradigm still faces the problem of over-smoothing and has difficulty capturing the fine-grained topology information of the graph. In this paper, we introduce an innovative approach: Dual Encoder Contrastive Learning with Augmented Views for Graph Anomaly Detection, named DECLARE. Specifically, the dual encoder integrates the strengths of GNNs and Graph Transformers to learn graph representations comprehensively from multiple perspectives. Although contrastive learning enhances the model’s ability to learn discriminative features, it cannot directly identify anomalous patterns. To address this, the reconstruction module independently reconstructs graph structures and attributes, helping the model focus on learning the normal patterns of both structure and attributes. Through extensive experimental analysis, we demonstrate the superiority of DECLARE over the state-of-the-art baselines on six benchmark datasets.
3894: Graph OOD Detection via Plug-and-Play Energy-based Evaluation and Propagation
Authors: Yunxia Zhang, Mingchen Sun, Yutong Zhang, Funing Yang, Ying Wang
Location: Guangzhou | Day: TBD
Show Abstract
Existing graph neural network (GNN) methods are typically built upon the i.i.d. assumption, emphasizing the enhancement of the test performance for in-distribution (ID) data. However, there has been limited exploration of their adaptability to scenarios involving unknown distribution data. On the one hand, in real-world application scenarios, graph data often expands continuously with the acquisition of external knowledge, which means that new nodes with unknown categories may be added to the graph data. The gap between the new node distribution and the original node distribution can make existing GNN methods less effective. On the other hand, existing out-of-distribution (OOD) detection methods often rely on the softmax confidence score, which makes the OOD data suffer from overconfident posterior distributions. To address the above issues, we propose an Energy Propagation-based Graph Neural Network (EPGNN), which improves the OOD generalization ability by endowing the GNN with the capacity to detect the OOD nodes in the graph. Specifically, we first construct a GNN encoder to obtain node embeddings that incorporate neighborhood structural information. Then, we design a plug-and-play energy-based OOD evaluator by assigning corresponding energy values to different nodes. Finally, we construct a plug-and-play structure-aware energy propagation module and joint alignment regularization, which make the node energy more flexible during the training process. Extensive experiments on benchmark datasets demonstrate the superiority of our method.
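As a rough illustration of what an energy-based OOD score and its propagation over a graph can look like, the sketch below uses the commonly used negative log-sum-exp energy of the classifier logits and a simple neighbor-averaging step. Both are assumptions on our part: the abstract does not specify EPGNN's energy function or propagation rule, and the names `energy_score` and `propagate_energy`, as well as the `alpha` and `steps` parameters, are hypothetical.

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    """Assumed energy: -T * logsumexp(logits / T) per node; lower means more confidently in-distribution."""
    z = logits / temperature
    m = z.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    return -temperature * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))

def propagate_energy(energy, adjacency, alpha=0.5, steps=2):
    """Assumed propagation: mix each node's energy with the mean energy of its neighbors."""
    degree = adjacency.sum(axis=1)
    safe_degree = np.where(degree > 0, degree, 1.0)  # isolated nodes keep their own energy share
    out = energy.astype(float)
    for _ in range(steps):
        neighbor_mean = (adjacency @ out) / safe_degree
        out = alpha * out + (1.0 - alpha) * neighbor_mean
    return out

# Three nodes on a triangle graph; node 0 has confident logits, the others are flat.
logits = np.array([[10.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
adjacency = np.array([[0.0, 1.0, 1.0],
                      [1.0, 0.0, 1.0],
                      [1.0, 1.0, 0.0]])

raw = energy_score(logits)
smoothed = propagate_energy(raw, adjacency)
is_ood = smoothed > smoothed.mean()  # illustrative threshold only
```

Unlike the softmax maximum, this energy depends on the overall scale of the logits, which is the usual motivation for using it as an OOD indicator; the threshold on the last line is purely illustrative.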
3899: Efficient Quantum Approximate kNN Algorithm via Granular-Ball Computing
Authors: Shuyin Xia, Xiaojiang Tian, Suzhen Yuan, Jeremiah D. Deng
Location: Guangzhou | Day: TBD
Show Abstract
High time complexity is one of the biggest challenges faced by k-Nearest Neighbors (kNN). Although current classical and quantum kNN algorithms have made some improvements, they still have a speed bottleneck when facing large amounts of data. To address this issue, we propose an innovative algorithm called Granular-Ball based Quantum kNN (GB-QkNN). This approach achieves higher efficiency by first employing granular-balls, which reduce the amount of data that needs to be processed. The search process is then accelerated by adopting a Hierarchical Navigable Small World (HNSW) method. Moreover, we optimize the time-consuming steps of HNSW, such as distance calculation, via quantization, further reducing the time complexity of the construction and search processes. By combining granular-balls with quantization of the HNSW method, our approach takes advantage of both treatments and significantly reduces the time complexity of kNN-like algorithms, as revealed by a comprehensive complexity analysis.
3909: A Reduction-Based Algorithm for the Clique Interdiction Problem
Authors: Chenghao Zhu, Yi Zhou, Haoyu Jiang
Location: Guangzhou | Day: TBD
Show Abstract
The Clique Interdiction Problem (CIP) aims to minimize the size of the largest clique in a given graph by removing a given number of vertices. The CIP models a special Stackelberg game and has important applications in fields such as pandemic control and terrorist identification. However, the CIP is a bilevel graph optimization problem, making it very challenging to solve. Recently, data reduction techniques have been successfully applied in many (single-level) graph optimization problems like vertex cover. Motivated by this, we investigate a set of novel reduction rules and design a reduction-based algorithm, RECIP, for practically solving the CIP. RECIP enjoys an effective preprocessing procedure that systematically reduces the input graph, making the problem much easier to solve. Extensive experiments on 124 large real-world networks demonstrate the superior performance of RECIP and validate the effectiveness of the proposed reduction rules.
3914: State Feedback Enhanced Graph Differential Equations for Multivariate Time Series Forecasting
Authors: Jiaxu Cui, Qipeng Wang, Yiming Zhao, Bingyi Sun, Pengfei Wang, Bo Yang
Location: Guangzhou | Day: TBD
Multivariate time series forecasting holds significant theoretical and practical importance in various fields, including web analytics and transportation. Recently, graph neural networks and graph differential equations have shown exceptional capabilities in modeling spatio-temporal features. However, existing methods often suffer from over-smoothing, hindering real-world problem-solving. In this work, we analyze the graph propagation process as a dynamical system and propose a novel feedback mechanism to enhance representation power, adaptively adjusting the representations to align with desired performance outcomes, thereby fundamentally mitigating the issue of over-smoothing. Moreover, we introduce an effective multivariate time series forecasting model called SF-GDE, based on the proposed graph propagation with the feedback mechanism. Extensive experiments are conducted on three real-world datasets from diverse fields. Results show that SF-GDE outperforms the state of the art, and the feedback mechanism can serve as a universal booster to improve performance for graph propagation models.
3926: LoD: Loss-difference OOD Detection by Intentionally Label-Noisifying Unlabeled Wild Data
Authors: Chuanxing Geng, Qifei Li, Xinrui Wang, Dong Liang, Songcan Chen, Pong C. Yuen
Location: Guangzhou | Day: TBD
Using unlabeled wild data containing both in-distribution (ID) and out-of-distribution (OOD) data to improve the safety and reliability of models has recently received increasing attention. Existing methods either design customized losses for labeled ID and unlabeled wild data and then perform joint optimization, or first filter out OOD data from the latter and then learn an OOD detector. While achieving varying degrees of success, two potential issues remain: (i) Labeled ID data typically dominates the learning of models, inevitably making models tend to fit OOD data as IDs; (ii) The selection of thresholds for identifying OOD data in unlabeled wild data usually faces a dilemma due to the unavailability of pure OOD samples. To address these issues, we propose a novel loss-difference OOD detection framework (LoD) by intentionally label-noisifying unlabeled wild data. Such operations not only enable labeled ID data and OOD data in unlabeled wild data to jointly dominate the models’ learning but also ensure the distinguishability of the losses between ID and OOD samples in unlabeled wild data, allowing a classic clustering technique (e.g., K-means) to filter these OOD samples without requiring thresholds. We also provide a theoretical foundation for LoD’s viability, and extensive experiments verify its superiority.
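The threshold-free filtering step can be illustrated with a small Python sketch of 1-D 2-means over per-sample scores. This is only an assumption-laden illustration: the abstract does not specify exactly which loss quantity is clustered or which cluster corresponds to OOD samples, so the function name, the sample values, and the "low"/"high" labels below are hypothetical.

```python
def two_means_1d(values, iters=100):
    """Split scalar values into a low (label 0) and a high (label 1) cluster with 1-D 2-means."""
    lo, hi = min(values), max(values)  # initial centroids at the extremes
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to the nearer centroid (ties go to the low cluster).
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return labels, (lo, hi)


# Made-up per-sample scores for unlabeled wild data.
losses = [0.11, 0.09, 0.14, 2.31, 2.05, 0.12, 1.98]
labels, centroids = two_means_1d(losses)
```

Because the two groups are found by clustering, no cut-off value has to be chosen by hand, which is the property the abstract highlights.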
3931: Rethinking Removal Attack and Fingerprinting Defense for Model Intellectual Property Protection: A Frequency Perspective
Authors: Cheng Zhang, Yang Xu, Tingqiao Huang, Zixing Zhang
Location: Guangzhou | Day: TBD
Training deep neural networks is resource-intensive, making it crucial to protect their intellectual property from infringement. However, current model ownership resolution (MOR) methods predominantly address general removal attacks that involve weight modifications, with limited research considering alternative attack perspectives. In this work, we propose a frequency-based model ownership removal attack, grounded in a key observation: modifying a model’s high-frequency coefficients does not significantly impact its performance but does alter its weights and decision boundary. This change invalidates the existing MOR methods. We further propose a frequency-based fingerprinting technique as a defense mechanism. By extracting frequency-domain characteristics instead of decision boundary or model weights, our fingerprinting defense effectively defends against the proposed frequency-based removal attack and demonstrates robustness against existing general removal attacks. The experimental results show that the frequency-based removal attack can easily defeat state-of-the-art white-box watermarking and fingerprinting schemes while preserving model performance, and the proposed defense method is also effective. Our code is released at: https://github.com/huangtingqiao/RRA-IJCAI25.
3935: Enabling Visual Foundation Models to Teach Compact Students via Mixture of Distillation
Authors: Xinye Yang, Shang Wang, Li Luking, Yipeng Chen
Location: Guangzhou | Day: TBD
In this paper, we present a novel Mixture of Distillation (MoD) framework for distilling lightweight student models using Visual Foundation Models (VFMs) as teachers. Knowledge distillation (KD) is a crucial training strategy for improving model performance. However, conventional KD methods face two main challenges: (1) selecting and training appropriate teacher models and (2) designing effective knowledge distillation techniques. To address the first challenge, we leverage recent VFMs like CLIP, Grounding DINO, and SAM as teachers, capitalizing on their remarkable zero-shot generalization abilities and low fine-tuning requirements for new tasks, thereby avoiding expensive retraining of teachers. For the second challenge, our MoD framework focuses on extracting and decomposing the feature and logit knowledge from VFMs into multiple knowledge experts, which capture modality-specific information across batches, channels, and instances. Each knowledge expert undergoes separate projections, reshaping, normalization, and learnable magnitude operations. Then, we employ sparse knowledge gates with a softmax function followed by a KeepTopK operation for different knowledge experts. In this way, our MoD not only bridges the distillation gap between VFMs and students but also allows the adaptive transfer of useful knowledge across different domains. Extensive experiments on various classification, detection, and medical segmentation tasks validate the effectiveness of our approach compared with other models. Moreover, our MoD framework demonstrates the potential for transferring zero-shot abilities from VFMs without relying on ground-truth labels. Notably, our MoD achieves impressive performance, attaining 72.48% for RepViT with 76.20% CLIP teacher on ImageNet-1K without annotations.
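The gating step described above (a softmax followed by a KeepTopK operation) can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the MoD implementation: the function names and logits are hypothetical, and whether the surviving weights are renormalized afterwards is not stated in the abstract, so this sketch leaves them as they are.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def keep_top_k(weights, k):
    """Zero every weight except the k largest (ties broken by lower index)."""
    top = sorted(range(len(weights)), key=lambda i: (-weights[i], i))[:k]
    keep = set(top)
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def sparse_gate(logits, k):
    """Softmax first, then KeepTopK, in the order the abstract gives."""
    return keep_top_k(softmax(logits), k)


# Hypothetical gate logits for four knowledge experts; only two stay active.
gates = sparse_gate([2.0, 0.5, 1.0, -1.0], k=2)
```

Each knowledge expert's output would then be scaled by its gate value, so experts with a zero gate contribute nothing for that input.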
3946: Priority Guided Explanation for Knowledge Tracing with Dual Ranking and Similarity Consistency
Authors: Fan Li, Tiancheng Zhang, Yifang Yin, Minghe Yu, Mengxiang Wang, Ge Yu
Location: Guangzhou | Day: TBD
Knowledge tracing plays a pivotal role in enabling personalized learning on online platforms. While deep learning-based approaches have achieved impressive predictive performance, their limited interpretability poses a significant barrier to practical adoption. Existing explanation methods primarily focus on specific model architectures and fall short in 1) explicitly prioritizing critical interactions to generate fine-grained explanations, and 2) maintaining similarity consistency across interaction importance. These limitations hinder actionable insights for improving student outcomes. To bridge the gap, we propose a model-agnostic approach that provides enhanced explanations applicable to diverse knowledge tracing methods. Specifically, we propose a novel ranking loss designed to explicitly optimize the importance ranking of past interactions by comparing their corresponding perturbed outputs. Furthermore, we introduce a similarity loss to capture temporal dependencies, ensuring consistency in the assigned importance scores for conceptually similar interactions. Extensive experiments conducted on various knowledge tracing models and benchmark datasets demonstrate substantial enhancements in explanation quality.
3958: Public Signaling in Markets with Information Asymmetry Using a Limited Number of Signals
Authors: Xu Zhao, Ren Liu, Weiran Shen
Location: Guangzhou | Day: TBD
Consider a market with a seller and many buyers. The seller has a kind of item for sale to the buyers. The items have a quality and each buyer has a private type. The quality is only known to the seller, and the buyers only have a prior belief of the quality. A third party (e.g., intermediaries or product reviewers) is able to reveal information about the actual quality by using a so-called signaling scheme. After receiving the information, buyers can update their beliefs accordingly and decide whether to buy the items.
We consider the third party’s problem of maximizing the purchasing probability by sending signals. However, the optimal signaling scheme has implementation issues, as the number of signals in the optimal scheme is the same as the number of buyer types, which can be exceedingly large or even infinite. We therefore investigate whether a finite and limited set of signals could still approximate the performance of the optimal signaling scheme. Unfortunately, our results show that with a finite number of signals, no signaling scheme can achieve a certain fraction of the performance of the optimal signaling scheme. This limitation persists even with the regularity or the monotone hazard rate assumption. Nevertheless, we identify a mild technical condition under which the third party can approximate the optimal performance within a constant factor by employing only two signals.
We also conduct extensive experiments to substantiate our theoretical results. These experiments compare the performance of using a small signal set across different value distributions. Despite the negative results, our experimental results show that using only a small number of signals can achieve fairly reasonable performance in average cases.
3973: Advancing Community Detection with Graph Convolutional Neural Networks: Bridging Topological and Attributive Cohesion
Authors: Anjali de Silva, Gang Chen, Hui Ma, Seyed Mohammad Nekooei, Xingquan Zuo
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Reinforcement learning (1/2)
Community detection, a vital technology for real-world applications, uncovers cohesive node groups (communities) by leveraging both topological and attribute similarities in social networks. However, existing Graph Convolutional Networks (GCNs) trained to maximize modularity often converge to suboptimal solutions. Additionally, directly using human-labeled communities for training can undermine topological cohesiveness by grouping disconnected nodes based solely on node attributes. We address these issues by proposing a novel Topological and Attributive Similarity-based Community detection (TAS-Com) method. TAS-Com introduces a novel loss function that exploits the highly effective and scalable Leiden algorithm to detect community structures with globally optimal modularity. Leiden is further utilized to refine human-labeled communities to ensure connectivity within each community, enabling TAS-Com to detect community structures with desirable trade-offs between modularity and compliance with human labels. Experimental results on multiple benchmark networks confirm that TAS-Com can significantly outperform several state-of-the-art algorithms.
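For readers unfamiliar with the modularity objective mentioned above, the following Python sketch computes the standard Newman modularity of a given partition. It is neither the TAS-Com loss nor the Leiden algorithm; the function name and the toy graph are assumptions made for illustration.

```python
def modularity(edges, community):
    """Newman modularity of an undirected, unweighted graph without self-loops.

    Uses Q = sum over communities c of (L_c / m - (d_c / (2 * m)) ** 2), where
    m is the number of edges, L_c the number of edges inside c, and d_c the
    total degree of the nodes in c.
    """
    m = len(edges)
    internal = {}  # edges with both endpoints in the same community
    degree = {}    # summed node degrees per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "all" for node in range(6)}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

Splitting the graph at the bridge scores higher than placing every node in one community, which is the behavior a modularity-maximizing method exploits.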
3991: Dynamic Anchor-based Ensemble Clustering via Hypergraph Reconstruction
Authors: Jiaxuan Xu, Lei Duan, Xinye Wang, Liang Du
Location: Guangzhou | Day: TBD
Ensemble clustering learns a consensus result by integrating a set of base clustering results. Recently, anchor-based methods construct an anchor similarity matrix to represent the affinity relationships among samples, significantly improving computational efficiency. However, these methods struggle with fixed anchors generated by static anchor learning strategies, which lead to a low-quality anchor similarity matrix and poor clustering accuracy. To address this issue, we propose a novel method named dynamic anchor-based ensemble clustering via hypergraph reconstruction (YACHT). Specifically, YACHT first transforms the base clustering results into a hypergraph and designs a novel hypergraph enhancement strategy to improve the reliability of the initial hypergraph. YACHT reconstructs the hypergraph through matrix factorization and introduces a mapping matrix to filter out redundant information, capturing a high-quality anchor similarity matrix. Then, YACHT attempts to incorporate the hypergraph into the optimization objective to achieve hypergraph updates. To ensure the accuracy of hypergraph updates, we impose a hypergraph regularizer and a local consensus information alignment term. The alignment term is implemented by minimizing the discrepancy between the label partition derived from the hypergraph regularizer and the local consensus information indicator matrix extracted from the base clustering results. Extensive experimental results demonstrate the outstanding performance of the proposed YACHT. The code is available at https://github.com/scu-kdde/YACHT.
4015: EVICheck: Evidence-Driven Independent Reasoning and Combined Verification Method for Fact-Checking
Authors: Lingxiao Wang, Lei Shi, Feifei Kou, Ligu Zhu, Chen Ma, Pengfei Zhang, Mingying Xu, Zeyu Li
Location: Guangzhou | Day: TBD
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) have demonstrated significant potential in automated fact-checking. However, existing methods are limited by insufficient evidence utilization and a lack of explicit verification criteria. Specifically, these approaches aggregate evidence for collective reasoning without independently analyzing each piece, hindering their ability to leverage the available information thoroughly. Additionally, they rely on simple prompts or few-shot learning for verification, which makes truthfulness judgments less reliable, especially for complex claims. To address these limitations, we propose a novel method to enhance evidence utilization and introduce explicit verification criteria, named EVICheck. Our approach reasons over each piece of evidence independently and synthesizes the results to enable more thorough exploration and enhance interpretability. Additionally, by incorporating fine-grained truthfulness criteria, we make the model’s verification process more structured and reliable, especially when handling complex claims. Experimental results on the public RAWFC dataset demonstrate that EVICheck achieves state-of-the-art performance across all evaluation metrics. Our method demonstrates strong potential in fake news verification, significantly improving the accuracy.
4054: DPMamba: Distillation Prompt Mamba for Multimodal Remote Sensing Image Classification with Missing Modalities
Authors: Yueguang Yang, Jiahui Qu, Ling Huang, Wenqian Dong
Location: Guangzhou | Day: TBD
Multimodal remote sensing image classification (RSIC) has emerged as a key focus in Earth observation, driven by its capacity to extract complementary information from diverse sources. Existing methods struggle with modality absence caused by weather or equipment failures, leading to performance degradation. As a solution, knowledge distillation-based methods train student networks (SN) using a full-modality teacher, but they usually require training a separate SN for each modality absence scenario, increasing complexity. To this end, we propose a unified Distillation Prompt Mamba (DPMamba) framework for multimodal RSIC with missing modalities. DPMamba leverages knowledge distillation in a shared text semantic space to optimize learnable prompts, transforming them from “placeholder” to “adaptation” states by enriching missing modality information with full-modality knowledge. To achieve this, we focus on two main aspects: first, we propose a new modality-aware Mamba for dynamically and hierarchically extracting cross-modality interactive features, providing richer, contextually relevant representations for backpropagation-based optimization of prompts; and second, we introduce a novel text-bridging distillation method to efficiently transfer full-modality knowledge, guiding the inclusion of missing modality information into prompts. Extensive evaluations demonstrate the effectiveness and robustness of the proposed DPMamba.
4056: 2D Gaussian Splatting for Outdoor Scene Decomposition and Relighting
Authors: Wei Feng, Kangrui Ye, Qi Zhang, Qian Zhang, Nan Li
Location: Guangzhou | Day: TBD
Gaussian splatting techniques have recently revolutionized outdoor scene decomposition and relighting through multi-view images. However, achieving high rendering quality still requires a fixed lighting condition among all input views, which is costly or even impractical to capture in outdoor scenes. In this paper, we propose outdoor scene decomposition and relighting with 2D Gaussian splatting (OSDR-GS), a novel inverse rendering strategy under changing and unknown outdoor lighting conditions. Firstly, we present a lighting-based group learning framework that categorizes input images into multiple lighting groups, to learn the separate lighting from each group individually. Secondly, OSDR-GS introduces a fine-grained outdoor lighting component to represent sun-light and sky-light, respectively, which are also adjusted adaptively via the correlative exposure factors. Finally, we construct a visibility-driven shadow module to realistically characterize the nuanced interplay of light and occlusion, eliminating the uncertainty of dark pixels in lighting-based group learning. Extensive experiments on multiple challenging outdoor datasets validate the effectiveness of OSDR-GS, which achieves state-of-the-art performance in inverse rendering of scenes under changing lighting.
4065: Multi-Source Collaborative Style Augmentation and Domain-Invariant Learning for Federated Domain Generalization
Authors: Yikang Wei
Location: Guangzhou | Day: TBD
Federated domain generalization aims to learn a generalizable model from multiple decentralized source domains for deployment on an unseen target domain. Style augmentation methods have achieved great progress on domain generalization. However, under the data decentralization scenario, existing style augmentation methods either explore the data styles within an isolated source domain or interpolate the style information across existing source domains, which leads to a limited style space. To address this issue, we propose a Multi-source Collaborative Style Augmentation and Domain-invariant learning method (MCSAD) for federated domain generalization. Specifically, we propose a multi-source collaborative style augmentation module to generate data in a broader style space. Furthermore, we conduct domain-invariant learning between the original data and augmented data by cross-domain feature alignment within the same class and classes relation ensemble distillation between different classes to learn a domain-invariant model. By alternately conducting collaborative style augmentation and domain-invariant learning, the model can generalize well on an unseen target domain. Extensive experiments on multiple domain generalization datasets indicate that our method significantly outperforms the state-of-the-art federated domain generalization methods.
4066: CAN-ST: Clustering Adaptive Normalization for Spatio-temporal OOD Learning
Authors: Min Yang, Yang An, Jinliang Deng, Xiaoyu Li, Bin Xu, Ji Zhong, Xiankai Lu, Yongshun Gong
Location: Guangzhou | Day: TBD
Spatio-temporal data mining is crucial for decision-making and planning in diverse domains. However, in real-world scenarios, training and testing data are often not independent and identically distributed due to rapid changes in data distributions over time and space, resulting in spatio-temporal out-of-distribution (OOD) challenges. This non-stationarity complicates accurate predictions and has motivated research efforts focused on mitigating non-stationarity through normalization operations. Existing methods, nonetheless, often address individual time series in isolation, neglecting correlations across series, which limits their capacity to handle complex spatio-temporal dynamics and results in suboptimal solutions. To overcome these challenges, we propose Clustering Adaptive Normalization (CAN-ST), a general and model-agnostic method that mitigates non-stationarity by capturing both localized distributional changes and shared patterns across nodes via adaptive clustering and a parameter register. As a plugin, CAN-ST can be easily integrated into various spatio-temporal prediction models. Extensive experiments on multiple datasets with diverse forecasting models demonstrate that CAN-ST consistently improves performance by over 20% on average and outperforms state-of-the-art normalization methods.
4070: Balancing Invariant and Specific Knowledge for Domain Generalization with Online Knowledge Distillation
Authors: Di Zhao, Jingfeng Zhang, Hongsheng Hu, Philippe Fournier-Viger, Gillian Dobbie, Yun Sing Koh
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: Computer Vision (3/3)
Recent research has demonstrated the effectiveness of knowledge distillation in Domain Generalization. However, existing approaches often overlook domain-specific knowledge and rely on an offline distillation strategy, limiting the effectiveness of knowledge transfer. To address these limitations, we propose Balanced Online knowLedge Distillation (BOLD). BOLD leverages a multi-domain expert teacher model, with each expert specializing in a specific source domain, enabling the student to distill both domain-invariant and domain-specific knowledge. We incorporate the Pareto optimization principle and uncertainty weighting to balance these two types of knowledge, ensuring simultaneous optimization without compromising either. Additionally, BOLD employs an online knowledge distillation strategy, allowing the teacher and student to learn concurrently. This dynamic interaction enables the teacher to adapt based on student feedback, facilitating more effective knowledge transfer. Extensive experiments on seven benchmarks demonstrate that BOLD outperforms state-of-the-art methods. Furthermore, we provide theoretical insights that highlight the importance of domain-specific knowledge and the advantages of uncertainty weighting.
4074: Dynamic Higher-Order Relations and Event-Driven Temporal Modeling for Stock Price Forecasting
Authors: Kijeong Park, Sungchul Hong, Jong-June Jeon
Location: Montreal | Day: August 20th | Time: 14:00 | Session: ML: time series, sequences and signals
In stock price forecasting, modeling the probabilistic dependence between stock prices within a time-series framework has remained a persistent and highly challenging area of research. We propose a novel model to explain the extreme co-movement in multivariate data with time-series dependencies. Our model incorporates a Hawkes process layer to capture abrupt co-movements, thereby enhancing the temporal representation of market dynamics. We introduce dynamic hypergraphs into our model adapting to higher-order (groupwise rather than pairwise) relationships within the stock market. Extensive experiments on real-world benchmarks demonstrate the robustness of our approach in predictive performance and portfolio stability.
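As background for the Hawkes process layer, the sketch below evaluates the conditional intensity of a univariate Hawkes process with an exponential kernel. The exponential kernel, the parameter names `mu`, `alpha`, and `beta`, and the sample event times are assumptions chosen for illustration; the abstract does not state which kernel or parameterization the model uses.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Intensity lambda(t) = mu + sum_i alpha * exp(-beta * (t - t_i)) over events t_i < t."""
    excitation = sum(alpha * math.exp(-beta * (t - ti))
                     for ti in event_times if ti < t)
    return mu + excitation


# Hypothetical times of abrupt price moves; each one temporarily raises the intensity.
events = [1.0, 1.2, 1.3]
calm = hawkes_intensity(0.5, events, mu=0.2, alpha=0.8, beta=2.0)
burst = hawkes_intensity(1.4, events, mu=0.2, alpha=0.8, beta=2.0)
later = hawkes_intensity(6.0, events, mu=0.2, alpha=0.8, beta=2.0)
```

The self-exciting term is what lets such a process represent clustered, abrupt events: the intensity jumps after each event and then decays back toward the baseline `mu`.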
4075: CD^2: Constrained Dataset Distillation for Few-Shot Class-Incremental Learning
Authors: Kexin Bao, Daichi Zhang, Hansong Zhang, Yong Li, Yutao Yue, Shiming Ge
Location: Guangzhou | Day: TBD
Few-shot class-incremental learning (FSCIL), which aims to perform classification continuously from only a few training samples, has received significant attention but suffers from catastrophic forgetting. Existing methods usually employ an external memory to store previous knowledge and treat it equally with incremental classes, which cannot properly preserve essential previous knowledge. To solve this problem, and inspired by recent distillation works on knowledge transfer, we propose a framework termed Constrained Dataset Distillation (CD^2) to facilitate FSCIL, which includes a dataset distillation module (DDM) and a distillation constraint module (DCM). Specifically, the DDM synthesizes highly condensed samples guided by the classifier, forcing the model to learn compact, essential class-related clues from a few incremental samples. The DCM introduces a designed loss to constrain the previously learned class distribution, which preserves distilled knowledge more fully. Extensive experiments on three public datasets show the superiority of our method against other state-of-the-art competitors.
4090: PatternCIR Benchmark and TisCIR: Advancing Zero-Shot Composed Image Retrieval in Remote Sensing
Authors: Zhechun Liang, Tao Huang, Fangfang Wu, Shiwen Xue, Zhenyu Wang, Weisheng Dong, Xin Li, Guangming Shi
Location: Guangzhou | Day: TBD
Remote sensing composed image retrieval (RSCIR) is a new vision-language task that takes a composed query of an image and text, aiming to search for a target remote sensing image satisfying two conditions from intricate remote sensing imagery. However, the existing attribute-based benchmark Patterncom in RSCIR has significant flaws, including the lack of query text sentences and paired triplets, thus making it unable to evaluate the latest methods. To address this, we propose the Zero-Shot Query Text Generator (ZS-QTG) that can generate full query text sentences based on attributes, and then, by capitalizing on ZS-QTG, we develop the PatternCIR benchmark. PatternCIR rectifies Patterncom’s deficiencies and enables the evaluation of existing methods. Additionally, we explore zero-shot composed image retrieval methods that do not rely on massive pre-collected triplets for training. Existing methods use only the text during retrieval, performing poorly in RSCIR. To improve this, we propose Text-image Sequential Training of Composed Image Retrieval (TisCIR). TisCIR undergoes sequential training of multiple self-masking projection and fine-grained image attention modules, which endows it with the capacity to filter out conflicting information between the image and text, enhancing the retrieval by utilizing both modalities in harmony. TisCIR outperforms existing methods by 12.40% to 62.03% on PatternCIR, achieving state-of-the-art performance in RSCIR. The data and code are available here.
4098: Counterfactual Strategies for Markov Decision Processes
Authors: Paul Kobialka, Lina Gerlach, Francesco Leofante, Erika Ábrahám, Silvia Lizeth Tapia Tarifa, Einar Broch Johnsen
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: ETF: Explainability and interpretability
Counterfactuals are widely used in AI to explain how minimal changes to a model’s input can lead to a different output.
However, established methods for computing counterfactuals typically focus on one-step decision-making, and are not directly applicable to sequential decision-making tasks.
This paper fills this gap by introducing counterfactual strategies for Markov Decision Processes (MDPs).
During MDP execution, a strategy decides which of the enabled actions (with known probabilistic effects) to execute next.
Given an initial strategy that reaches an undesired outcome with a probability above some limit, we identify minimal changes to the initial strategy to reduce that probability below the limit.
We encode such counterfactual strategies as solutions to non-linear optimization problems, and further extend our encoding to synthesize diverse counterfactual strategies.
We evaluate our approach on four real-world datasets and demonstrate its practical viability in sophisticated sequential decision-making tasks.
4105: Facets in Argumentation: A Formal Approach to Argument Significance
Authors: Johannes K. Fichte, Nicolas Fröhlich, Markus Hecher, Victor Lagerkvist, Yasir Mahmood, Arne Meier, Jonathan Persson
Location: Montreal | Day: August 19th | Time: 15:00 | Session: KRR: Argumentation
Argumentation is a central subarea of Artificial Intelligence (AI) for modeling and reasoning about arguments.
The semantics of abstract argumentation frameworks (AFs) is given by sets of arguments (extensions) and conditions on the relationship between arguments, such as stable or admissible.
Today’s solvers implement tasks such as finding extensions, deciding credulous or skeptical acceptance, counting, or enumerating extensions.
While these tasks are well charted, fine-grained reasoning in the area between decision and counting/enumeration has so far required expensive computation.
We introduce a novel concept (facets) for reasoning between decision and enumeration.
Facets are arguments that belong to some extensions (credulous) but not to all extensions (skeptical).
They are most natural when a user aims to navigate, filter, or comprehend specific arguments, according to their needs.
We study the complexity and show that tasks involving facets are much easier than counting extensions.
Finally, we provide an implementation, and conduct experiments to demonstrate feasibility.
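The definition of facets can be illustrated on a toy framework. The Python sketch below enumerates stable extensions by brute force and then takes the arguments that appear in some but not all of them. It is an illustrative assumption-based example (stable semantics, made-up arguments and attacks, hypothetical function names), not the authors' implementation, and its exhaustive enumeration is exactly the expensive route the paper aims to avoid.

```python
from itertools import combinations


def stable_extensions(arguments, attacks):
    """Enumerate the stable extensions of an abstract argumentation framework by brute force.

    A set S is stable if it is conflict-free and attacks every argument outside S.
    `attacks` is a set of (attacker, target) pairs.
    """
    args = sorted(arguments)
    found = []
    for size in range(len(args) + 1):
        for subset in combinations(args, size):
            s = set(subset)
            conflict_free = not any((a, b) in attacks for a in s for b in s)
            attacks_rest = all(any((a, b) in attacks for a in s)
                               for b in args if b not in s)
            if conflict_free and attacks_rest:
                found.append(s)
    return found


def facets(extensions):
    """Arguments contained in at least one extension but not in every extension."""
    if not extensions:
        return set()
    credulous = set().union(*extensions)
    skeptical = set.intersection(*extensions)
    return credulous - skeptical


# Hypothetical framework: a and b attack each other, both attack c, and c attacks d.
arguments = {"a", "b", "c", "d"}
attacks = {("a", "b"), ("b", "a"), ("a", "c"), ("b", "c"), ("c", "d")}
extensions = stable_extensions(arguments, attacks)
open_choices = facets(extensions)
```

Here `d` is in every stable extension and `c` is in none, so only `a` and `b` are facets: they are the arguments about which a user navigating the extensions still has a choice to make.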
4106: Spatio-temporal Prototype-based Hierarchical Learning for OD Demand Prediction
Authors: Shilu Yuan, Xiaoyu Li, Wenqian Mu, Ji Zhong, Meng Chen, Haoliang Sun, Yongshun Gong
Location: Guangzhou | Day: TBD
Origin-Destination (OD) demand prediction is a pivotal yet highly challenging task in intelligent transportation systems, aiming to accurately forecast cross-region ridership flows within urban networks. While previous studies have focused on modeling node-to-node relationships, most of them neglect the fact that nodes (regions/stations) exhibit similar spatio-temporal (ST) patterns, which are termed as spatio-temporal prototypes. Capturing these prototypes is crucial for understanding the unified ST dependencies across the network. To bridge this gap, we propose STPro, an ST prototype-based hierarchical model with a dual-branch structure that extracts ST features from the micro and macro perspectives. At the micro level, our model learns unified ST features of individual nodes, while at the macro level, it employs dynamic clustering to identify city-wide ST prototypes, thereby uncovering latent patterns of urban mobility. Besides, we leverage different roles of nodes as origins and destinations by constructing dual O and D branches and learn the mutual information to model their intricate interactions and correlations. Extensive experiments on two public datasets demonstrate that our STPro outperforms recent state-of-the-art baselines, achieving remarkable predictive improvements in OD demand prediction.
4129: Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency
Authors: Ruixiao Li, Fahao Chen, Peng Li
Location: Guangzhou | Day: TBD
Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated into LLM inference serving systems. However, inference requests typically exhibit uncertain execution time, which poses a significant challenge of efficiently scheduling requests in these systems. Existing work estimates execution time based solely on predicted output length, which could be inaccurate because execution time depends on both output length and token acceptance rate of verification by the LLM. In this paper, we propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). Given a number of inference requests, LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. When token acceptance rate is dynamic and execution time is difficult to estimate, LAPS-SD maintains multiple priority queues and allows request execution preemption across different queues. Once the token acceptance rate becomes stable, LAPS-SD can accurately estimate the execution time and schedule requests accordingly. Extensive experiments show that LAPS-SD reduces inference latency by approximately 39% compared to state-of-the-art scheduling methods.
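To see why execution time depends on both output length and token acceptance rate, consider the rough Python estimate below. It relies on a common simplifying assumption that is not taken from the abstract: each drafted token is accepted independently with the same probability, and every verification step yields the accepted prefix plus one token from the LLM. The function names and numbers are hypothetical, and this is not the estimator used by LAPS-SD.

```python
import math


def expected_tokens_per_step(acceptance_rate, draft_len):
    """Expected tokens produced per verification step under an i.i.d. acceptance model.

    With acceptance probability a and draft_len drafted tokens, the expectation is
    1 + a + a**2 + ... + a**draft_len = (1 - a**(draft_len + 1)) / (1 - a).
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_len + 1)
    return (1 - a ** (draft_len + 1)) / (1 - a)


def estimated_steps(remaining_tokens, acceptance_rate, draft_len):
    """Rough number of verification steps still needed for a request."""
    per_step = expected_tokens_per_step(acceptance_rate, draft_len)
    return math.ceil(remaining_tokens / per_step)


# Two hypothetical requests with the same remaining output length but different acceptance rates.
slow = estimated_steps(remaining_tokens=100, acceptance_rate=0.5, draft_len=4)
fast = estimated_steps(remaining_tokens=100, acceptance_rate=1.0, draft_len=4)
```

Two requests with identical predicted output lengths can therefore need very different numbers of verification steps, which is why length-only estimates can mislead a scheduler.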
4146: Squeezing Context into Patches: Towards Memory-Efficient Ultra-High Resolution Semantic Segmentation
Authors: Wang Liu, Puhong Duan, Xudong Kang, Shutao Li
Location: Guangzhou | Day: TBD
Show Abstract
Segmenting ultra-high-resolution (UHR) images poses a significant challenge due to constraints on GPU memory, leading to a trade-off between detailed local information and a comprehensive contextual understanding. Current UHR methods often employ a multi-branch encoder to handle local and contextual information, which can be memory-intensive. To address the need for both high accuracy and low memory usage in processing UHR images, we introduce a memory-efficient semantic segmentation approach by squeezing context information into local patches (SCPSeg). Our method integrates the processing of local and contextual information within a single-branch encoder. Specifically, we introduce a context squeezing module (CSM) designed to compress global context details into local patches, enabling segmentation networks to perceive broader image contexts. Additionally, we propose a super-resolution guided local feature alignment (LFA) technique to improve segmentation precision by aligning local feature relationships. This approach calculates similarities within sliding windows, avoiding heavy computational costs during the training phase. We evaluate the effectiveness of our proposed method on four widely used UHR segmentation benchmarks. Experimental results demonstrate that our approach enhances UHR segmentation accuracy without incurring additional memory overhead during the inference stage. The code is available at https://github.com/StuLiu/SCPSeg.
4147: Identifying Causal Mechanism Shifts Under Additive Models with Arbitrary Noise
Authors: Yewei Xia, Xueliang Cui, Hao Zhang, Yixin Ren, Feng Xie, Jihong Guan, Ruxin Wang, Shuigeng Zhou
Location: Guangzhou | Day: TBD
Show Abstract
In many real-world scenarios, the goal is to identify variables whose causal mechanisms change across related datasets. Examples include detecting abnormal root nodes in manufacturing and identifying key genes that influence cancer by analyzing differences in gene regulatory mechanisms between healthy individuals and cancer patients. This can be done by recovering the causal structure for each dataset independently and then comparing them to identify differences, but the performance is often suboptimal. Typically, existing methods directly identify causal mechanism shifts based on linear additive noise models (ANMs) or by imposing restrictive assumptions on the noise distribution. In this paper, we introduce CMSI, a novel and more general algorithm based on nonlinear ANMs that identifies variables with shifting causal mechanisms under arbitrary noise distributions. Evaluated on various synthetic datasets, CMSI consistently outperforms existing baselines in terms of F1 score. Additionally, we demonstrate CMSI’s applicability on gene expression datasets of ovarian cancer patients at different disease stages.
4158: Verifying Quantized Graph Neural Networks is PSPACE-complete
Authors: Marco Sälzer, Francois Schwarzentruber, Nicolas Troquard
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Knowledge Representation and Reasoning (3/4)
Show Abstract
In this paper, we investigate verification of quantized Graph Neural Networks (GNNs), where some fixed-width arithmetic is used to represent numbers. We introduce the linear-constrained validity (LVP) problem for verifying GNN properties, and provide an efficient translation from LVP instances into a logical language. We show that LVP is in PSPACE, for any reasonable activation functions. We provide a proof system. We also prove PSPACE-hardness, indicating that while reasoning about quantized GNNs is feasible, it remains generally computationally challenging.
4173: Breaking the Self-Evaluation Barrier: Reinforced Neuro-Symbolic Planning with Large Language Models
Authors: Jie-Jing Shao, Hong-Jie You, Guohao Cai, Quanyu Dai, Zhenhua Dong, Lan-Zhe Guo
Location: Guangzhou | Day: TBD
Show Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in language understanding and commonsense reasoning, yet they often struggle with constraint satisfaction in planning problems. Previous studies relying on test-time improvement with self-evaluation fail to address this limitation effectively. In this work, we identify this critical gap and propose a novel neuro-symbolic framework, Reinforced Neuro-Symbolic Planning, that enhances LLM-powered planning by incorporating a symbolic verifier. The verifier provides explicit feedback on constraint satisfaction, enabling iterative refinement of the state evaluation. Specifically, we utilize the outcome feedback from each logical goal to update the process value along planning paths through a reinforcement value function maximization objective. We further employ T-norms to aggregate the satisfaction levels of multiple constraints, which provides more effective guidance for the test-time search. Our framework bridges the strengths of neural and symbolic methods, leveraging the generative power of LLMs while ensuring rigorous adherence to constraints through symbolic verification. Extensive experiments demonstrate that our approach significantly improves planning accuracy and constraint satisfaction across various domains, outperforming traditional self-evaluation methods. It highlights the potential of hybrid neuro-symbolic systems to address complex constrained planning tasks.
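As a small illustration of aggregating constraint satisfaction levels with T-norms, the sketch below folds per-constraint scores in [0, 1] into a single value using three standard T-norms. The abstract does not say which T-norm the authors use, so the choice here and all function names are illustrative assumptions.

```python
from functools import reduce
from typing import Callable, Sequence

TNorm = Callable[[float, float], float]


def t_product(a: float, b: float) -> float:
    """Product T-norm."""
    return a * b


def t_godel(a: float, b: float) -> float:
    """Goedel (minimum) T-norm."""
    return min(a, b)


def t_lukasiewicz(a: float, b: float) -> float:
    """Lukasiewicz T-norm."""
    return max(0.0, a + b - 1.0)


def aggregate(levels: Sequence[float], t_norm: TNorm) -> float:
    """Combine satisfaction levels in [0, 1]; 1.0 is the identity of any T-norm."""
    return reduce(t_norm, levels, 1.0)
```

The three choices penalize partially satisfied constraints differently: the minimum only tracks the worst constraint, while the product and Lukasiewicz T-norms also accumulate smaller violations.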
4183: MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces
Authors: Shaojun E, Yuchen Yang, Jiaheng Wu, Yan Zhang, Tiejun Zhao, Ziyan Chen
Location: Guangzhou | Day: TBD
Show Abstract
In the latest advancements in multimodal learning, effectively addressing the spatial and semantic losses of visual data after encoding remains a critical challenge. This is because the performance of large multimodal models is positively correlated with the coupling between visual encoders and large language models. Existing approaches often face issues such as vector gaps or semantic disparities, resulting in information loss during the propagation process. To address these issues, we propose MAGE (Multimodal Alignment and Generation Enhancement), a novel framework that bridges the semantic spaces of vision and text through an innovative alignment mechanism. By introducing the Intelligent Alignment Network (IAN), MAGE achieves dimensional and semantic alignment. To reduce the gap between synonymous heterogeneous data, we employ a training strategy that combines cross-entropy and mean squared error, significantly enhancing the alignment effect. Moreover, to enhance MAGE’s “Any-to-Any” capability, we developed a fine-tuning dataset for multimodal tool-calling instructions to expand the model’s output capability boundaries. Finally, our proposed multimodal large model architecture, MAGE, achieved significantly better performance compared to similar works across various evaluation benchmarks, including MME, MMBench, and SEED. Complete code and appendix are available at: https://github.com/GTCOM-NLP/MAGE
4190: Capturing Individuality and Commonality Between Anchor Graphs for Multi-View Clustering
Authors: Zhoumin Lu, Yongbo Yu, Linru Ma, Feiping Nie, Rong Wang
Location: Guangzhou | Day: TBD
Show Abstract
The use of anchors often leads to better efficiency and scalability, making them highly favored. However, there is a challenge in anchor-based multi-view subspace learning. A unified anchor graph overemphasizes the commonality between views, failing to adequately capture the view-specific individuality. This has led some models to independently explore the individuality of each view before aligning and integrating them, often achieving better performance but making the process more cumbersome. Therefore, this paper proposes a new model, simultaneously capturing the individuality and commonality between anchor graphs for multi-view clustering. The model has three notable advantages: First, it allows view-specific anchor graphs to align in real-time with a common anchor graph as a reference, eliminating the need for post-alignment. Second, it enforces a cluster-wise structure among anchors and balances sample distribution among them, providing strong discriminative power. Lastly, it maintains linear complexity with respect to the numbers of samples and anchors, avoiding the significant time costs associated with their increase. Comprehensive experiments demonstrate the effectiveness and efficiency of our method compared to various state-of-the-art algorithms.
4194: Balancing User-Item Structure and Interaction with Large Language Models and Optimal Transport for Multimedia Recommendation
Authors: Haodong Li, Lianyong Qi, Weiming Liu, Xiaolong Xu, Wanchun Dou, Yang Cao, Xuyun Zhang, Amin Beheshti, Xiaokang Zhou
Location: Guangzhou | Day: TBD
Show Abstract
The rapid growth of multimedia content has driven the development of recommender systems. Most previous work focuses on uncovering latent relationships among items to learn better representations. However, this approach does not sufficiently account for user affinities, potentially leading to an imbalance in the structure modeling of users and items. Moreover, the sparsity and imbalance of user-item interactions further hinder effective representation learning. To address these challenges, we propose a framework called BLAST, which balances structures and interactions via large language models and optimal transport for multimodal recommendation. Specifically, we utilize large language models to summarize side information and generate user profiles. Based on these profiles, we design an intra- and inter-entity structure balancing module to capture item-item and user-user relationships, integrating these affinities into the final representations. Furthermore, we impose constraints on negative sample selection, augment the training data with false negative items and the optimal transport algorithm, thereby leading to smoother interactions. We evaluate BLAST on three real-world datasets, and the results demonstrate that our method significantly outperforms state-of-the-art baselines, which validates the superiority and effectiveness of BLAST.
4198: Rethinking Federated Graph Learning: A Data Condensation Perspective
Authors: Hao Zhang, Xunkai Li, Yinlin Zhu, Lianglin Hu
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Federated Learning
Show Abstract
Federated graph learning is a widely recognized technique that promotes collaborative training of graph neural networks (GNNs) by multi-client graphs. However, existing approaches heavily rely on the communication of model parameters or gradients for federated optimization and fail to adequately address the data heterogeneity introduced by intricate and diverse graph distributions. Although some methods attempt to share additional messages among the server and clients to improve federated convergence during communication, they introduce significant privacy risks and increase communication overhead. To address these issues, we introduce the concept of a condensed graph as a novel optimization carrier to address FGL data heterogeneity and propose a new FGL paradigm called FedGM. Specifically, we utilize a generalized condensation graph consensus to aggregate comprehensive knowledge from distributed graphs, while minimizing communication costs and privacy risks through a single transmission of the condensed data. Extensive experiments on six public datasets consistently demonstrate the superiority of FedGM over state-of-the-art baselines, highlighting its potential for a novel FGL paradigm.
4202: Soft Reasoning Paths for Knowledge Graph Completion
Authors: Yanning Hou, Sihang Zhou, Ke Liang, Lingyuan Meng, Xiaoshu Chen, Ke Xu, Siwei Wang, Xinwang Liu, Jian Huang
Location: Guangzhou | Day: TBD
Show Abstract
Reasoning paths are reliable information in knowledge graph completion (KGC) in which algorithms can find strong clues of the actual relation between entities. However, in real-world applications, it is difficult to guarantee that computationally affordable paths exist toward all candidate entities. According to our observation, the prediction accuracy drops significantly when paths are absent. To make the proposed algorithm more stable against the missing path circumstances, we introduce soft reasoning paths. Concretely, a specific learnable latent path embedding is concatenated to each relation to help better model the characteristics of the corresponding paths. The combination of the relation and the corresponding learnable embedding is termed a soft path in our paper. By aligning the soft paths with the reasoning paths, a learnable embedding is guided to learn a generalized path representation of the corresponding relation. In addition, we introduce a hierarchical ranking strategy to make full use of information about the entity, relation, path, and soft path to help improve both the efficiency and accuracy of the model. Extensive experimental results illustrate that our algorithm outperforms the compared state-of-the-art algorithms by a notable margin. Our code will be released at https://github.com/7HHHHH/SRP-KGC.
4210: A Neuro-Symbolic Framework for Sequence Classification with Relational and Temporal Knowledge
Authors: Luca Salvatore Lorello, Marco Lippi, Stefano Melacci
Location: Montreal | Day: August 20th | Time: 14:00 | Session: ML: time series, sequences and signals
Show Abstract
One of the goals of neuro-symbolic artificial intelligence is to exploit background knowledge to improve the performance of learning tasks. However, most of the existing frameworks focus on the simplified scenario where knowledge does not change over time and does not cover the temporal dimension. In this work we consider the much more challenging problem of knowledge-driven sequence classification where different portions of knowledge must be employed at different timesteps, and temporal relations are available. Our extensive experimental evaluation compares multi-stage neuro-symbolic and neural-only architectures, and it is conducted on a newly-introduced benchmarking framework. Results not only demonstrate the challenging nature of this novel setting, but also highlight under-explored shortcomings of neuro-symbolic methods, representing a precious reference for future research.
4221: ADPFedGNN: Adaptive Decoupling Personalized Federated Graph Neural Network
Authors: Zeli Guan, Yawen Li, Junping Du, Runqing Tang, Xiaolong Meng
Location: Guangzhou | Day: TBD
Show Abstract
Personalized federated graph neural networks (PFGNN) are an emerging technology that allows multiple graph data owners to collaboratively train personalized models without sharing raw data. However, the Non-IID nature of graph data can cause the coupling of global and local knowledge parameters, which disrupts the optimization in personalized federated learning. Additionally, node neighbors may carry global and local knowledge, and their direct inclusion in training may introduce noise, degrading federated model performance. In this work, we propose the Adaptive Decoupling Personalized Federated Graph Neural Network (ADPFedGNN), which leverages multi-party collaboration to train personalized models for classifying local client graph nodes. We use two automatically updated masks and mutual information minimization to decouple global and local parameters in FGNN. We employ reinforcement learning to adaptively select appropriate neighbors for training global or local knowledge-related parameters while filtering out irrelevant nodes. We also design a personalized federated masked parameter aggregation mechanism that efficiently updates local personalized model parameters and aggregates the masked parameters. Experimental results on three public datasets demonstrate that ADPFedGNN outperforms existing methods, achieving average improvements of 5.66 percent, 5.83 percent, and 12.45 percent in ACC, F1, and Recall, respectively.
4227: Multimodal Prior Learning with Double Constraint Alignment for Snapshot Spectral Compressive Imaging
Authors: Mingjin Zhang, Longyi Li, Fei Gao, Qiming Zhang, Jie Guo
Location: Guangzhou | Day: TBD
Show Abstract
The objective of snapshot spectral compressive imaging reconstruction is to recover the 3D hyperspectral image (HSI) from a 2D measurement. Existing methods either focus on network architecture design or simply introduce an image-level prior into the model. However, these methods lack guiding information for accurate reconstruction. Recognizing that textual descriptions contain rich semantic information that can significantly enhance details, this paper introduces a novel framework, CAMM, which integrates text information into the model to improve the performance. The framework comprises two key components: Fine-grained Alignment Module (FAM) and Multimodal Fusion Mamba (MFM). Specifically, FAM is used to reduce the knowledge gap between the RGB domain obtained by the pre-trained vision-language model and the HSI domain. Through the double constraints of distribution similarity and entropy, the adaptive alignment of different complexity features is realized, which makes the encoded features more accurate. MFM aims to identify the guiding effect of RGB features and text features on HSI in space and channel dimensions. Instead of fusing features directly, it integrates image-level and text-level priors into Mamba’s state-space equation, so that each scanning step can be accurately guided. This kind of positive feedback adjustment ensures the authenticity of the guiding information. To our knowledge, this is the first text-guided model for compressive spectral imaging. Extensive experimental results on public datasets demonstrate the superior performance of CAMM, validating the effectiveness of our proposed method.
4234: GATES: Cost-aware Dynamic Workflow Scheduling via Graph Attention Networks and Evolution Strategy
Authors: Ya Shen, Gang Chen, Hui Ma, Mengjie Zhang
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Planning and Scheduling (1/5)
Show Abstract
Cost-aware Dynamic Workflow Scheduling (CADWS) is a key challenge in cloud computing, focusing on devising an effective scheduling policy to efficiently schedule dynamically arriving workflow tasks, represented as Directed Acyclic Graphs (DAGs), to suitable virtual machines (VMs). Deep reinforcement learning (DRL) has been widely employed for automated scheduling policy design. However, the performance of DRL is heavily influenced by the design of the problem-tailored policy network and is highly sensitive to hyperparameters and the design of reward feedback. Considering the above-mentioned issues, this study proposes a novel DRL method combining a Graph Attention Networks-based policy network and Evolution Strategy, referred to as GATES. The contributions of GATES are summarized as follows: (1) GATES can capture the impact of current task scheduling on subsequent tasks by learning the topological relationships between tasks in a DAG. (2) GATES can assess the importance of each VM to the ready task, enabling it to adapt to dynamically changing VM resources. (3) Utilizing Evolution Strategy’s robustness, exploratory nature, and tolerance for delayed rewards, GATES achieves stable policy learning in CADWS. Extensive experimental results demonstrate the superiority of the proposed GATES in CADWS, outperforming several state-of-the-art algorithms. The source code is available at: https://github.com/YaShen998/GATES.
4256: Simulate, Refine and Integrate: Strategy Synthesis for Efficient SMT Solving
Authors: Bingzhe Zhou, Hannan Wang, Yuan Yao, Taolue Chen, Feng Xu, Xiaoxing Ma
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: MTA: Software engineering
Show Abstract
Satisfiability Modulo Theories (SMT) solvers are crucial in many applications, yet their performance is often a bottleneck. This paper introduces SIRISMT, a novel framework that employs machine learning techniques for the automatic synthesis of efficient SMT-solving strategies. Specifically, SIRISMT targets Z3 and consists of three key stages. First, given a set of training SMT formulas, SIRISMT simulates the solving process by leveraging reinforcement learning to guide its exploration within the strategy space. Next, SIRISMT refines the collected strategies by pruning redundant tactics and generating augmented strategies based on the subsequence structure of the learned strategies. These refined strategies are then fed back into the reinforcement learning model. Finally, the refined and optimized strategies are integrated into one strategy, which can be directly plugged into modern SMT solvers. Extensive evaluations show the superior performance of SIRISMT over the baseline methods. For example, compared to the default Z3, it solves 26.8% more formulas and achieves up to an 86.3% improvement in the Par-2 score on benchmark datasets. Additionally, we show that the synthesized strategy can improve the code coverage by up to 11.8% in a downstream symbolic execution benchmark.
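For readers unfamiliar with the Par-2 metric mentioned above, the sketch below shows the usual penalized-runtime computation: a solved instance contributes its runtime, and an unsolved one contributes twice the timeout, so lower is better. Whether the paper reports a sum or a mean, and which timeout it uses, is not stated in the abstract; the function name and the `None` convention for unsolved instances are assumptions.

```python
from typing import Optional, Sequence


def par2_score(runtimes: Sequence[Optional[float]], timeout: float) -> float:
    """Sum of runtimes, charging 2 * timeout for each unsolved instance.

    An entry of None (or a runtime above the timeout) marks an instance
    the solver did not finish in time.
    """
    total = 0.0
    for runtime in runtimes:
        if runtime is None or runtime > timeout:
            total += 2.0 * timeout  # penalty for a timeout
        else:
            total += runtime
    return total
```

Dividing the result by the number of instances gives the averaged variant that some benchmarks report instead.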
4257: Dynamic Network Discovery via Infection Tracing
Authors: Ben Bals, Michelle Döring, Nicolas Klodt, George Skretas
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Multidisciplinary Topics and Applications (1/2)
Show Abstract
Researchers, policy makers, and engineers need to make sense of data from spreading processes as diverse as rumor spreading in social networks, viral infections, and water contamination.
Classical questions include predicting infection behavior in a given network or deducing the network structure from infection data.
Most of the research on network infections studies static graphs, that is, the connections in the network are assumed to not change.
More recently, temporal graphs, in which connections change over time, have been used to more accurately represent real-world infections, which rarely occur in unchanging networks.
We propose a model for temporal graph discovery that is consistent with previous work on static graphs and embraces the greater expressiveness of temporal graphs.
For this model, we give algorithms and lower bounds which are often tight. We analyze different variations of the problem, which makes our results widely applicable and clarifies which aspects of temporal infections make graph discovery easier or harder.
We round off our analysis with an experimental evaluation of our algorithm on real-world interaction data from the Stanford Network Analysis Project and on temporal Erdős-Rényi graphs.
On Erdős-Rényi graphs, we uncover a threshold behavior, which can be explained by a novel connectivity parameter that we introduce during our theoretical analysis.
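To make the setting concrete, the sketch below simulates the forward process on a temporal graph: an infection starts at one node and spreads only along edges that are present at a given time step. The spreading rule used here (a node infected before step t infects its neighbors over edges labelled t, with no chains inside a single step) is an illustrative assumption, not the model defined in the paper, and the function name is hypothetical. Graph discovery is the inverse task of inferring the labelled edges from such infection times.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple

TemporalEdge = Tuple[Hashable, Hashable, int]  # (u, v, time label >= 1)


def simulate_infection(edges: Iterable[TemporalEdge],
                       source: Hashable) -> Dict[Hashable, int]:
    """Return {node: infection time}; the source is infected at time 0."""
    by_time = defaultdict(list)
    for u, v, t in edges:
        by_time[t].append((u, v))

    infected = {source: 0}
    for t in sorted(by_time):
        newly = {}
        for u, v in by_time[t]:
            # Edges are undirected: try both directions.
            for a, b in ((u, v), (v, u)):
                # Only nodes infected in an earlier step can transmit.
                if a in infected and b not in infected:
                    newly.setdefault(b, t)
        infected.update(newly)
    return infected
```

In this toy rule the order of edge labels matters: an edge that appears before its endpoint is infected carries nothing, which is one way temporal graphs differ from static ones.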
4263: AccCtr: Accelerating Training-Free Conditional Control For Diffusion Models
Authors: Longquan Dai, He Wang, Yiming Zhang, Shaomeng Wang, Jinhui Tang
Location: Guangzhou | Day: TBD
Show Abstract
In current training-free Conditional Diffusion Models (CDM), the sampling process is steered by the gradient, which measures the discrepancy between the guidance and the condition extracted by a pre-trained condition extraction network. These methods necessitate small guidance steps, resulting in longer sampling times.
To address the issue of slow sampling, we introduce AccCtr, a method that simplifies the conditional sampling algorithm by maximizing the sum of two objectives. The local maximum set of one objective is contained within the local maximum set of the other. Leveraging this relationship, we decompose the joint optimization into two parts, alternately maximizing each objective. By analyzing the steps involved in optimizing these objectives, we identify the most time-consuming steps and recommend retraining the condition extraction network (a relatively simple task) to reduce its computational cost.
Integrating AccCtr into current CDMs is a seamless task that does not impose a significant computational burden. Extensive testing has demonstrated that AccCtr offers superior sample quality and faster generation times.
4269: Multimodal Regression for Enzyme Turnover Rates Prediction
Authors: Bozhen Hu, Cheng Tan, Siyuan Li, Jiangbin Zheng, Jun Xia, Stan Z. Li
Location: Guangzhou | Day: TBD
Show Abstract
The enzyme turnover rate is a fundamental parameter in enzyme kinetics, reflecting the catalytic efficiency of enzymes. However, enzyme turnover rates remain scarce across most organisms due to the high cost and complexity of experimental measurements. To address this gap, we propose a multimodal framework for predicting the enzyme turnover rate by integrating enzyme sequences, substrate structures, and environmental factors. Our model combines a pre-trained language model and a convolutional neural network to extract features from protein sequences, while a graph neural network captures informative representations from substrate molecules. An attention mechanism is incorporated to enhance interactions between enzyme and substrate representations. Furthermore, we leverage symbolic regression via Kolmogorov-Arnold Networks to explicitly learn mathematical formulas that govern the enzyme turnover rate, enabling interpretable and accurate predictions. Extensive experiments demonstrate that our framework outperforms both traditional and state-of-the-art deep learning approaches. This work provides a robust tool for studying enzyme kinetics and holds promise for applications in enzyme engineering, biotechnology, and industrial biocatalysis.
4274: MASTER: A Multi-granularity Invariant Structure Clustering Scheme for Multi-view Clustering
Authors: Suixue Wang, Shilin Zhang, Qingchen Zhang, Peng Li, Weiliang Huo
Location: Guangzhou | Day: TBD
Show Abstract
Deep multi-view clustering has attracted increasing attention in the pattern mining of data. However, most existing methods perform self-learning mechanisms in a single space, ignoring the fruitful structural information hidden in different-level feature spaces. Meanwhile, they conduct the reconstruction constraint to learn generalized representations of samples, failing to explore the discriminative ability of complementary and consistent information. To address these challenges, a multi-granularity invariant structure clustering scheme (MASTER) is proposed to define a bottom-up process that extracts multi-level information in sample, neighborhood, and category granularities from the low-level, high-level, and semantics feature spaces, respectively. Specifically, it leverages the self-learning reconstruction with information-theoretic overclustering to capture invariant sample structure in the low-level feature space. Then, it models data diffusion of the clustering process in the reliable neighborhood to capture invariant local structure in the high-level feature space. Meanwhile, it defines dual divergences induced by the space geometry to capture invariant global structure in the semantics space. Finally, extensive experiments on 8 real-world datasets show that MASTER achieves state-of-the-art performance compared to 11 baselines.
4276: COGRASP: Co-Occurrence Graph Based Stock Price Forecasting
Authors: Zhengze Li, Zilin Song, Tingting Yuan, Xiaoming Fu
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Reinforcement learning (1/2)
Show Abstract
Forecasting stock prices is complex and challenging. Uncovering correlations among stocks has proven to enhance stock price forecasting. However, existing correlation discovery methods, such as concept-based methods, are slow, inaccurate, and limited by their reliance on predefined concepts and manual analysis. In this paper, we propose COGRASP, a novel approach for stock price forecasting that constructs stock co-occurrence graphs automatically by analyzing rapidly updated sources such as reports, newspapers, and social media. Besides, we aggregate forecasts across multiple timescales (i.e., long-, medium-, and short-term) to capture multi-timescale trend fluctuations, thereby enhancing price forecasting accuracy. In experiments with real-world open-source stock market data, COGRASP outperforms state-of-the-art methods.
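One simple way to picture a stock co-occurrence graph is to count how often two tickers are mentioned in the same document and use the counts as edge weights. The sketch below does exactly that with naive whitespace tokenization; it is an illustrative assumption about what such a graph could look like, not COGRASP's actual construction, and the function name, the `min_count` parameter, and the sample texts are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def cooccurrence_graph(documents: Iterable[str],
                       tickers: Set[str],
                       min_count: int = 1) -> Dict[Tuple[str, str], int]:
    """Map each ticker pair to the number of documents mentioning both.

    Tickers are expected in upper case; matching is by whole
    whitespace-separated token, which is deliberately naive.
    """
    counts: Counter = Counter()
    for text in documents:
        tokens = set(text.upper().split())
        present = sorted(t for t in tickers if t in tokens)
        counts.update(combinations(present, 2))  # one count per document
    return {pair: c for pair, c in counts.items() if c >= min_count}
```

Raising `min_count` prunes weak edges, which is one simple way to keep such a graph sparse as the document stream grows.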
4280: Towards Region-Adaptive Feature Disentanglement and Enhancement for Small Object Detection
Authors: Yanchao Bi, Yang Ning, Xiushan Nie, Xiankai Lu, Yongshun Gong, Leida Li
Location: Guangzhou | Day: TBD
Show Abstract
Current feature fusion strategies often fail to adequately account for the influence of activation intensity across different scales on small object features, which impedes the effective detection of small objects. To address this limitation, we propose the Region-Adaptive Feature Disentanglement and Enhancement (RAFDE) strategy, which improves both downsampling and feature fusion by leveraging activation intensity variations at multiple scales. First, we introduce the Boundary Transitional Region-enhanced Downsampling (BTRD) module, which enhances boundary transitional regions containing both strongly and weakly activated features, thereby mitigating the loss of crucial boundary information for small objects. Second, we present the Regional-Adaptive Feature Fusion (RAFF) module, which adaptively disentangles and fuses co-activated and uni-activated regions from adjacent levels into the current level, effectively reducing the risk of small objects being overwhelmed. Extensive experiments on several public datasets demonstrate that the RAFDE strategy is highly effective and outperforms state-of-the-art methods. The code is available at https://github.com/b-yanchao/RAFDE.git.
4295: TOTF: Missing-Aware Encoders for Clustering on Multi-View Incomplete Attributed Graphs
Authors: Mengyao Li, Xu Zhou, Jiapeng Zhang, Zhibang Yang, Cen Chen, Kenli Li
Location: Guangzhou | Day: TBD
Show Abstract
As the network data in real life become multi-modal and multi-relational, multi-view attributed graphs have garnered significant attention. Numerous methods have achieved excellent performance in multi-view attributed graph clustering; however, they cannot efficiently handle incomplete attribute scenarios, which are prevalent in many real-life applications. Inspired by this, we investigate the problem of multi-view incomplete attributed graph clustering for the first time. In particular, the TOTF (Train Once Then Freeze) framework is designed to train missing-aware encoders that capture view-specific information while ignoring the impact of incomplete attributes, and then employs frozen encoders to uncover common information driven by clustering. After that, we propose a correlation strength-aware graph neural network on the basis of the inherent relationships among attributes to enhance accuracy. It is proven theoretically that traditional Generative Adversarial Networks (GANs) are unable to generate the unique real distribution. To address this issue, we further introduce the missing-position reminder mechanism into our intra-view adversarial games for better clustering results. Extensive experimental results demonstrate that our method achieves up to a 17% improvement in accuracy over the state-of-the-art methods. The source code is available at https://anonymous.4open.science/r/TOTF-main.
4297: SecV: LLM-based Secure Verilog Generation with Clue-Guided Exploration on Hardware-CWE Knowledge Graph
Authors: Fanghao Fan, YingJie Xia, Li Kuang
Location: Guangzhou | Day: TBD
Show Abstract
Verilog is specified as the primary Register Transfer Level (RTL) hardware description language, which designs the logical functions between registers for digital circuit systems. Recently, much cutting-edge research has emerged on leveraging Large Language Models (LLMs) to generate Verilog, aiming to effectively reduce errors and costs in the logic design of chips. However, these works mainly focus on logical correctness or PPA (Power, Performance, Area) measurement of the generated results, while neglecting the security problems in Verilog. In this study, we propose SecV, a novel and unified framework to generate secure Verilog by clue-guided exploration on a Common Weakness Enumeration (CWE) knowledge graph (KG) for chips. First, the builder of the KG utilizes the instance-adapted chain of thought (COT) to extract entities and their relationships from raw Hardware-CWE corpora. Then, a fine-tuned BERT model is employed to verify the Hardware-CWE KG and collaborate with the builder iteratively to obtain a precise KG. Based on the Hardware-CWE KG, a clue-guided graph exploration paradigm is designed to facilitate collaborative inference of knowledge to generate secure Verilog by LLMs. Experiments demonstrate that SecV achieves 82.6% secure Verilog code without specified CWE in the generated functionally correct Verilog, a 21.7% improvement over the SOTA.
4308: Sketch Decompositions for Classical Planning via Deep Reinforcement Learning
Authors: Michael Aichmüller, Hector Geffner
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Planning and Scheduling (2/5)
Show Abstract
In planning and reinforcement learning, the identification of common subgoal structures across problems is important when goals are to be achieved over long horizons. Recently, it has been shown that such structures can be expressed as feature-based rules, called sketches, over a number of classical planning domains. These sketches split problems into subproblems which then become solvable in low polynomial time by a greedy sequence of IW(k) searches. Methods for learning sketches using feature pools and min-SAT solvers have been developed, yet they face two key limitations: scalability and expressivity. In this work, we address these limitations by formulating the problem of learning sketch decompositions as a deep reinforcement learning (DRL) task, where general policies are sought in a modified planning problem where the successor states of a state s are defined as those reachable from s through an IW(k) search. The sketch decompositions obtained through this method are experimentally evaluated across various domains, and problems are regarded as solved by the decomposition when the goal is reached through a greedy sequence of IW(k) searches.
While our DRL approach for learning sketch decompositions does not yield interpretable sketches in the form of rules, we demonstrate that the resulting decompositions can often be understood in a crisp manner.
4310: Not All Layers of LLMs Are Necessary During Inference
Authors: Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang
Location: Guangzhou | Day: TBD
Show Abstract
Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers. That is, not all layers of LLMs are necessary during inference. If we can predict at which layer the inferred results match the final results (produced by evaluating all layers), we could significantly reduce the inference cost. To this end, we propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance. AdaInfer relies on easily obtainable statistical features and classic classifiers like SVM. Experiments on well-known LLMs like the Llama2 series and OPT show that AdaInfer can achieve an average pruning ratio of 17.8%, and up to 43% on sentiment tasks, with nearly no performance drop (<1%). Because AdaInfer does not alter LLM parameters, the LLMs incorporated with AdaInfer maintain generalizability across tasks.
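Illustrative sketch (not taken from the paper): the early-exit idea described above can be mimicked with a toy stopping rule. The two features used here (the top probability and its gap to the runner-up), the thresholds, and all function names are assumptions made for illustration; a fixed threshold stands in for a trained classifier such as an SVM.

```python
from typing import Callable, List, Sequence, Tuple


def top_prob_and_gap(probs: Sequence[float]) -> Tuple[float, float]:
    """Return the largest probability and its margin over the runner-up."""
    ordered = sorted(probs, reverse=True)
    return ordered[0], ordered[0] - ordered[1]


def early_exit_layer(
    layer_probs: List[Sequence[float]],
    should_stop: Callable[[float, float], bool],
) -> int:
    """Index of the first layer whose features trigger the stopping rule."""
    for index, probs in enumerate(layer_probs):
        top, gap = top_prob_and_gap(probs)
        if should_stop(top, gap):
            return index
    return len(layer_probs) - 1  # no early exit: fall back to the final layer


def threshold_rule(top: float, gap: float) -> bool:
    # Hypothetical stand-in for a trained classifier (e.g., an SVM).
    return top >= 0.8 and gap >= 0.6


# Toy next-token distributions read out after each of four layers.
layer_probs = [
    [0.40, 0.35, 0.25],
    [0.60, 0.30, 0.10],
    [0.85, 0.10, 0.05],
    [0.90, 0.06, 0.04],
]
exit_index = early_exit_layer(layer_probs, threshold_rule)
# Fraction of layers skipped for this input.
pruning_ratio = 1 - (exit_index + 1) / len(layer_probs)
```

In this toy run the rule fires at the third layer, so one of four layers is skipped for this input.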
4319: High-Fidelity Road Network Generation with Latent Diffusion Models
Authors: Jinming Wang, Hongkai Wen, Geyong Min, Man Luo
Location: Guangzhou | Day: TBD
Show Abstract
Road networks are the veins of modern cities. Yet, maintaining up-to-date and accurate road network information is a persistent challenge, especially in areas with rapid urban changes or limited surveying resources. Crowdsourced trajectories, e.g., from GPS records collected by mobile devices and vehicles, have emerged as a powerful data source for continuously mapping urban areas. However, the inherent noise, irregular and often sparse sampling rates, and the vast variability in movement patterns make the problem of road network generation from trajectories a non-trivial task. Existing methods often approach this from an appearance-based perspective: they typically render trajectories as 2D density maps and then employ heuristic algorithms to extract road networks, leading to inevitable information loss and thus poor performance, especially when trajectories are sparse or ambiguities are present, e.g., flyovers. In this paper, we propose a novel approach, called GraphWalker, to generate high-fidelity road network graphs from raw trajectories in an end-to-end manner. We achieve this by designing a bespoke latent diffusion transformer, T2W-DiT, which treats input trajectories as generation conditions and gradually denoises samples from a latent space to obtain the corresponding walks on the underlying road network graph, which are then assembled into the final road network. Extensive experiments on multiple datasets demonstrate that the proposed GraphWalker can effectively generate high-quality road networks from noisy and sparse trajectories, showcasing significant improvements over the state of the art.
4332: Localizing Before Answering: A Benchmark for Grounded Medical Visual Question Answering
Authors: Dung Nguyen, Minh Khoi Ho, Huy Ta, Thanh Tam Nguyen, Qi Chen, Kumar Rav, Quy Duong Dang, Satwik Ramchandre, Son Lam Phung, Zhibin Liao, Minh-Son To, Johan Verjans, Phi Le Nguyen, Vu Minh Hieu Phan
Location: Montreal | Day: August 21st | Time: 11:30 | Session: CV: Benchmarks
Show Abstract
Medical Large Multi-modal Models (LMMs) have demonstrated remarkable capabilities in processing medical multi-modal data. However, they are prone to hallucinations, often generating content that conflicts with true sources. In this work, we reveal a critical limitation in current medical LMMs: a lack of localization reasoning, where models rely on shortcuts from language or irrelevant visual regions instead of focusing on pathological areas when answering disease-related queries. To address this, we introduce HEAL-MedVQA (Hallucination Evaluation via Localization in Medical VQA), a novel large-scale benchmark for evaluating the localization ability and hallucination robustness of LMMs. HEAL-MedVQA features (i) two innovative evaluation protocols to assess visual and textual shortcut learning, and (ii) a dataset of 67K VQA pairs, annotated by doctors with anatomical segmentation masks for pathological regions. To improve visual reasoning, we propose the Localize-before-Answer (LobA) framework, which trains LMMs to localize target regions of interest and self-prompt to emphasize segmented pathological areas, generating grounded and reliable answers. Experimental results demonstrate that our approach significantly outperforms state-of-the-art biomedical LMMs on the challenging HEAL-MedVQA benchmark, advancing robustness in medical VQA.
4334: Global Information Compensation Network for Image Denoising
Authors: Shifei Ding, Qidong Wang, Lili Guo
Location: Guangzhou | Day: TBD
Show Abstract
In image denoising research, discriminative models have achieved impressive results, owing mainly to the powerful ability of convolutional networks in local feature extraction. However, there is still room for improvement due to insufficient utilization of global information. Although using fully connected layers or increasing network depth can supplement global information, this results in a significant increase in parameters and computational cost. To address these issues, we propose a global information compensation network (GICN) for image denoising. First, in the shallow part of the network, we propose a global feature mining block that enhances the network’s ability to extract global information by combining non-local blocks and the Fourier transform while improving the interpretability of the model. Second, between the encoder and decoder, we propose a cross-scale feature aggregation block to fuse information at different scales. Finally, we employ attention blocks to improve skip connections to better capture long-distance dependencies. Extensive experimental results show that our proposed GICN effectively compensates for global information, achieves a balance between denoising efficiency and effect, and surpasses mainstream methods in multiple benchmark tests.
4336: Modular Deep Reinforcement Learning for Multi-Workload Offloading in Edge Networks
Authors: Hongchang Ke, Yan Ding, Lin Pan, Yang Chen, Jia Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Dynamic edge networks revolutionize mobile edge computing by enabling real-time applications in intelligent transportation, augmented reality, and industrial Internet of Things (IoT). Efficient workload offloading in dynamic edge networks is crucial for addressing the increasing demands of time-varying workloads while contending with limited computational and communication resources. Existing deep reinforcement learning (DRL)-based offloading decision-making schemes are inadequate for managing scenarios involving multiple workloads and edge servers, particularly when faced with time-varying workload arrivals and fluctuating channel states. To this end, we propose a flexible module weighted fusion DRL framework (DRL-MWF) for scalable and robust multi-workload offloading in edge environments. Unlike traditional monolithic networks, DRL-MWF employs a weighted fusion modular architecture that adapts flexibly to diverse workload distributions. Specifically, DRL-MWF introduces a state representation and normalization strategy to model state and workload characteristics, enabling precise and adaptive decision-making. Furthermore, we design two key mechanisms: a weighted policy correction method to stabilize learning and a prioritized experience replay with weighted importance sampling to accelerate convergence by emphasizing critical transitions. Extensive evaluations on real-world datasets demonstrate that DRL-MWF consistently outperforms state-of-the-art baselines. These results reveal DRL-MWF’s potential to transform workload offloading in next-generation edge computing systems, ensuring high performance in dynamic scenarios.
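Illustrative sketch (not the DRL-MWF implementation): the abstract mentions prioritized experience replay with importance sampling. The block below shows the commonly used proportional formulation, with sampling probability P(i) proportional to p_i^alpha and weight w_i = (N * P(i))^(-beta) normalized by the largest weight. The function names and the default alpha and beta values are assumptions for illustration.

```python
from typing import List


def sampling_probabilities(priorities: List[float], alpha: float = 0.6) -> List[float]:
    """Proportional prioritization: P(i) = p_i^alpha / sum_k p_k^alpha."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs: List[float], beta: float = 0.4) -> List[float]:
    """Importance-sampling weights (N * P(i))^(-beta), scaled so the maximum is 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    largest = max(raw)
    return [w / largest for w in raw]


# Toy priorities (e.g., absolute TD errors) for four stored transitions.
priorities = [1.0, 1.0, 2.0, 4.0]
probs = sampling_probabilities(priorities, alpha=1.0)
weights = importance_weights(probs, beta=1.0)
```

Transitions with higher priority are sampled more often, and their updates are down-weighted accordingly to offset the sampling bias.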
4362: CoderAgent: Simulating Student Behavior for Personalized Programming Learning with Large Language Models
Authors: Yi Zhan, Qi Liu, Weibo Gao, Zheng Zhang, Tianfu Wang, Shuanghong Shen, Junyu Lu, Zhenya Huang
Location: Guangzhou | Day: TBD
Show Abstract
Personalized programming tutoring, such as exercise recommendation, can enhance learners’ efficiency, motivation, and outcomes, which is increasingly important in modern digital education. However, the lack of sufficient and high-quality programming data, combined with the mismatch between offline evaluation and real-world learning, hinders the practical deployment of such systems. To address this challenge, many approaches attempt to simulate learner practice data, yet they often overlook the fine-grained, iterative nature of programming learning, resulting in a lack of interpretability and granularity. To fill this gap, we propose an LLM-based agent, CoderAgent, to simulate students’ programming processes in a fine-grained manner without relying on real data. Specifically, we equip each human learner with an intelligent agent, the core of which lies in capturing the cognitive states of the human programming practice process. Inspired by ACT-R, a cognitive architecture framework, we design the structure of CoderAgent to align with human cognitive architecture by focusing on the mastery of programming knowledge and the application of coding ability. Recognizing the inherent patterns in multi-layered cognitive reasoning, we introduce the Programming Tree of Thought (PTOT), which breaks down the process into four steps: why, how, where, and what. This approach enables a detailed analysis of iterative problem-solving strategies. Finally, experimental evaluations on real-world datasets demonstrate that CoderAgent provides interpretable insights into learning trajectories and achieves accurate simulations, paving the way for personalized programming education.
4367: Fast Second-Order Online Kernel Learning Through Incremental Matrix Sketching and Decomposition
Authors: Dongxie Wen, Xiao Zhang, Zhewei Wei, Chenping Hou, Shuai Li, Weinan Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Second-order Online Kernel Learning (OKL) has attracted considerable research interest due to its promising predictive performance in streaming environments. However, existing second-order OKL approaches suffer from at least quadratic time complexity with respect to the pre-set budget, rendering them unsuitable for large-scale datasets. Moreover, the singular value decomposition required to obtain explicit feature mapping is computationally expensive due to the complete decomposition process. To address these issues, we propose FORKS, a fast incremental matrix sketching and decomposition approach tailored for second-order OKL. FORKS constructs an incremental maintenance paradigm for second-order kernelized gradient descent, which includes incremental matrix sketching for kernel approximation and incremental matrix decomposition for explicit feature mapping construction. Theoretical analysis demonstrates that FORKS achieves a logarithmic regret guarantee on par with other second-order approaches while maintaining a linear time complexity w.r.t. the budget, significantly enhancing efficiency over existing methods. We validate the performance of our method through extensive experiments conducted on real-world datasets, demonstrating its superior scalability and robustness against adversarial attacks.
4372: Seeking Proxy Point via Stable Feature Space for Noisy Correspondence Learning
Authors: Yucheng Xie, Songyue Cai, Tao Tong, Ping Hu, Xiaofeng Zhu
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: Computer Vision (3/3)
Show Abstract
To meet the growing demand for cross-modal training data, directly collecting multimodal data from the Internet has become prevalent. However, such data inevitably suffer from Noisy Correspondence. Previous works focused on recasting soft labels to mitigate noise’s negative impact. We explore a novel perspective to solve this problem: pursuing proxy representation for noisy data to enable reliable feature learning. To this end, we propose a novel framework: Seeking Proxy Point via Stable Feature Space (SPS). This framework employs a fine-grained partitioning strategy to obtain a high-confidence reliable set. By imposing intermodal cross-transformation consistency constraints and intramodal metric consistency constraints, a stable feature space is constructed.
Building on this foundation, SPS seeks proxy points for noisy data, enabling even noisy data to be accurately embedded into appropriate positions within the feature space. Combined with partial alignment for partially matched data pairs, SPS ultimately achieves robust learning under Noisy Correspondence. Experiments on three widely used cross-modal datasets demonstrate that SPS significantly outperforms previous methods. Our code is available at https://github.com/C-TeaRanger/SPS.
4373: Balance-Aware Sequence Sampling Makes Multi-Modal Learning Better
Authors: Zhi-Hao Guan, Qing-Yuan Jiang, Yang Yang
Location: Guangzhou | Day: TBD
Show Abstract
Multi-modal learning (MML) is frequently hindered by modality imbalance, leading to suboptimal performance in real-world applications. To address this issue, existing approaches primarily focus on rebalancing MML from the perspective of optimization or architecture design. However, almost all existing methods ignore the impact of sample sequences, i.e., an inappropriate training order tends to trigger learning bias in the model, further exacerbating modality imbalance. In this paper, we propose Balance-aware Sequence Sampling (BSS) to enhance the robustness of MML. Specifically, we first define a multi-perspective measurer to evaluate the balance degree of each sample in terms of correlation and information criteria. Via this evaluation, we employ a heuristic scheduler based on curriculum learning (CL) that incrementally provides training subsets, progressing from balanced to imbalanced samples to alleviate the imbalance. Moreover, we propose a learning-based probabilistic sampling method to dynamically update the training sequence in a more fine-grained manner, further improving MML performance. Extensive experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) baselines. The code is available at https://github.com/njustkmg/IJCAI25-BSS.
4387: Hierarchy Knowledge Graph for Parameter-Efficient Entity Embedding
Authors: Hepeng Gao, Funing Yang, Yongjian Yang, Ying Wang
Location: Guangzhou | Day: TBD
Show Abstract
Traditional knowledge graphs (KGs) provide each entity with a unique embedding as a representation, which contains a lot of redundant information. Meanwhile, the space complexities of the KGs are positively related to the number of entities. In this work, we propose a hierarchical representation learning method, namely HRL, which is a parameter-efficient model where the number of model parameters is independent of dataset scales. Specifically, we propose a hierarchical model comprising a Meta Encoder and a Context Encoder to generate the representation of entities and relations. The Meta Encoder captures the common representations shared across entities, while the Context Encoder learns entity-specific representations. We further provide a theoretical analysis of model design by constructing a structural causal model (SCM) when completing a knowledge graph. The SCM outlines the relationships between nodes, where entity embeddings are conditioned on both common and entity-specific representations. Note that our model is designed to reduce model scale while maintaining competitive performance. We evaluate HRL on the knowledge graph completion task using three real-world datasets. The results demonstrate that HRL significantly outperforms existing parameter-efficient baselines, as well as traditional state-of-the-art baselines of similar scale.
4391: EF1 and EFX Orientations
Authors: Argyrios Deligkas, Eduard Eiben, Tiger-Lily Goldsmith, Viktoriia Korchemna
Location: Guangzhou | Day: TBD
Show Abstract
We study the problem of finding fair allocations (EF1 and EFX) of indivisible goods with orientations. In an orientation, every agent gets items from their own predetermined set. For EF1, we show that EF1 orientations always exist when agents have monotone valuations, via a pseudopolynomial-time algorithm. This surprisingly positive result is the main contribution of our paper. We complement this result with a comprehensive set of scenarios where our algorithm, or a slight modification of it, finds an EF1 orientation in polynomial time. For EFX, we focus on the recently proposed graph instances, where every agent corresponds to a vertex on a graph and their allowed set of items consists of the edges incident to their vertex. It was shown that finding an EFX orientation is NP-complete in general. We prove that it remains intractable even when the graph has a vertex cover of size 8, or when we have a multigraph with only 10 vertices. We essentially match these strong negative results with a fixed-parameter tractable algorithm that is virtually the best one could hope for.
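Illustrative sketch (not the paper's algorithm): the two definitions used above can be checked directly on a small instance. The code assumes additive valuations for simplicity, whereas the abstract covers general monotone valuations; the function names and the toy instance are hypothetical.

```python
from typing import Dict, Set

Bundles = Dict[str, Set[str]]
Values = Dict[str, Dict[str, float]]


def is_orientation(bundles: Bundles, allowed: Bundles) -> bool:
    """Every agent receives items only from their own predetermined set."""
    return all(bundles[agent] <= allowed[agent] for agent in bundles)


def is_ef1(bundles: Bundles, value: Values) -> bool:
    """EF1 under additive valuations: any envy vanishes after removing one item."""
    for i in bundles:
        own = sum(value[i].get(g, 0.0) for g in bundles[i])
        for j in bundles:
            if i == j or not bundles[j]:
                continue
            other = [value[i].get(g, 0.0) for g in bundles[j]]
            # Drop the item of j's bundle that agent i values most.
            if own < sum(other) - max(other):
                return False
    return True


# Toy instance: agent p may take x, y, z; agent q may take y, z.
allowed = {"p": {"x", "y", "z"}, "q": {"y", "z"}}
value = {"p": {"x": 1.0, "y": 1.0, "z": 1.0}, "q": {"y": 1.0, "z": 1.0}}

fair = {"p": {"x", "y"}, "q": {"z"}}
unfair = {"p": {"x", "y", "z"}, "q": set()}
not_oriented = {"p": {"y"}, "q": {"x", "z"}}
```

Here `fair` is an EF1 orientation, `unfair` is an orientation that violates EF1 (agent q still envies p after any single removal), and `not_oriented` gives q an item outside their allowed set.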
4424: SeqPose: An End-to-End Framework to Unify Single-frame and Video-based RGB Category-Level Pose Estimation
Authors: Yuzhu Ji, Mingshan Sun, Jianyang Shi, Xiaoke Jiang, Yiqun Zhang, Haijun Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Category-level object pose estimation is a longstanding and fundamental task crucial for augmented reality and robotic manipulation applications. Existing RGB-based approaches struggle with multi-stage settings and heavily rely on off-the-shelf techniques, such as object detectors, depth estimators, non-differentiable NOCS shape alignment, etc. Extra dependencies lead to the accumulation of errors and complicate the whole pipeline, limiting the deployment of these approaches in practical applications. This paper presents a streamlined end-to-end framework unifying single-frame and video-based category-level pose estimation. Specifically, instead of explicitly introducing extra dependencies, the DINOv2 encoder and depth decoder, as robust semantic and geometric prior extractors, are leveraged to produce intra-frame hierarchical semantic and geometric features. A spatial-temporal sparse query network is developed to model the implicit correspondence and inter-frame correlations between a set of implicit 3D query anchors and intra-frame features. Finally, a pose prediction head is employed using the bipartite matching algorithm. Experimental results demonstrate that our model achieves state-of-the-art performance compared with RGB-based categorical pose estimation methods on the REAL275 and CAMERA25 datasets. Our code is available at https://andrewchiyz.github.io/vision.3dv.seqpose/.
4433: Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning
Authors: Xudong Yan, Songhe Feng, Yang Zhang, Jian Yang, Yueguan Lin, Haojun Fei
Location: Guangzhou | Day: TBD
Show Abstract
Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attributes and objects by extracting shared and exclusive parts between the image pair sharing the same attribute (object), as well as aligning them with pretrained word embeddings to improve unseen attribute-object recognition. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) The efficacy of disentanglement is compromised due to the influence of the background and the intricate entanglement of attributes with objects in the same parts. (2) Existing word embeddings fail to capture complex multimodal semantic information. (3) Overconfidence exhibited by existing models in seen compositions hinders their generalization to novel compositions. To address these limitations, we propose a novel framework named multimodal large language model (MLLM) embeddings and attribute smoothing guided disentanglement for CZSL. First, we leverage feature adaptive aggregation modules to mitigate the impact of background, and utilize learnable condition masks to capture multi-granularity features for disentanglement. Moreover, the last hidden states of the MLLM are employed as word embeddings for their superior representation capabilities. Furthermore, we propose attribute smoothing with auxiliary attributes generated by a large language model (LLM) for seen compositions to address the overconfidence challenge. Extensive experiments demonstrate that our method achieves state-of-the-art performance on three challenging datasets. The supplementary material and source code will be available at https://github.com/xud-yan/Trident.
4437: Hybrid Local Causal Discovery
Authors: Zhaolong Ling, Honghui Peng, Yiwen Zhang, Debo Cheng, Xingyu Wu, Peng Zhou, Kui Yu
Location: Guangzhou | Day: TBD
Show Abstract
Local causal discovery aims to identify and distinguish the direct causes and effects of a target variable from observational data. Due to the inherent incompleteness of local information, popular methods from global causal discovery often face new challenges in local causal discovery tasks, such as 1) erroneous symmetry constraint tests and the resulting cascading errors in constraint-based methods, and 2) confusion within score-based approaches caused by local spurious equivalence classes. To address the above issues, we propose a Hybrid Local Causal Discovery algorithm, called HLCD. Specifically, HLCD initially utilizes a constraint-based approach with the OR rule to obtain a candidate skeleton, which is subsequently refined using a score-based method to eliminate redundant structures. Furthermore, during the local causal orientation phase, HLCD distinguishes between V-structures and equivalence classes by comparing local structure scores between the two, thereby avoiding orientation interference caused by local equivalence class ambiguities. Comprehensive experiments on 14 benchmark Bayesian networks and two real datasets validate that the proposed algorithm outperforms the existing local causal discovery methods.
4439: FreqLLM: Frequency-Aware Large Language Models for Time Series Forecasting
Authors: Shunnan Wang, Min Gao, Zongwei Wang, Yibing Bai, Feng Jiang, Guansong Pang
Location: Guangzhou | Day: TBD
Show Abstract
Large Language Models (LLMs) have recently shown promise in Time Series Forecasting (TSF) by effectively capturing intricate time-domain dependencies. However, our preliminary experiments reveal that standard LLM-based approaches often fail to capture global correlations, limiting predictive performance. We found that embedding frequency-domain signals smooths weight distributions and enhances structured correlations by clearly separating global trends (low-frequency components) from local variations (high-frequency components). Building on these insights, we propose FreqLLM, a novel framework that integrates frequency-domain semantic alignment into LLMs to refine prompts for improved time series analysis. By bridging the gap between frequency signals and textual embeddings, FreqLLM effectively captures multi-scale temporal patterns and provides more robust forecasting results. Extensive experiments on benchmark datasets demonstrate that FreqLLM outperforms state-of-the-art TSF methods in both accuracy and generalization. The code is available at https://github.com/biya0105/FreqLLM.
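Illustrative sketch (not the FreqLLM code): the separation of global trends (low-frequency components) from local variations (high-frequency components) mentioned above can be shown with a plain discrete Fourier transform. The cutoff, the function names, and the toy series are assumptions for illustration, and a naive O(n^2) transform is used to keep the example dependency-free.

```python
import cmath
import math
from typing import List, Tuple


def dft(x: List[float]) -> List[complex]:
    """Naive discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum: List[complex]) -> List[float]:
    """Naive inverse transform; returns the real part of the reconstruction."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_by_frequency(x: List[float], cutoff: int) -> Tuple[List[float], List[float]]:
    """Split x into a low-frequency trend and a high-frequency remainder."""
    n = len(x)
    spectrum = dft(x)
    # Keep bin k and its mirror n - k together so the trend stays real-valued.
    low = [spectrum[k] if min(k, n - k) <= cutoff else 0j for k in range(n)]
    high = [spectrum[k] - low[k] for k in range(n)]
    return idft(low), idft(high)


# Toy series: one slow cycle plus a weaker fast oscillation (8 cycles).
n = 32
slow = [math.sin(2 * math.pi * t / n) for t in range(n)]
fast = [0.3 * math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
series = [s + f for s, f in zip(slow, fast)]
trend, detail = split_by_frequency(series, cutoff=2)
```

With this cutoff the recovered `trend` matches the slow cycle and `detail` matches the fast oscillation, and the two parts sum back to the original series.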
4461: GCTAM: Global and Contextual Truncated Affinity Combined Maximization Model For Unsupervised Graph Anomaly Detection
Authors: Xiong Zhang, Hong Peng, Zhenli He, Cheng Xie, Xin Jin, Hua Jiang
Location: Guangzhou | Day: TBD
Show Abstract
Anomalies often occur in real-world information networks/graphs, such as malevolent users, malicious comments, banned users, and fake news in social graphs. The latest graph anomaly detection methods use a novel mechanism called truncated affinity maximization (TAM) to detect anomalous nodes without using any label information and achieve impressive results. TAM maximizes the affinities among the normal nodes while truncating the affinities of the anomalous nodes to identify the anomalies. However, existing TAM-based methods truncate suspicious nodes according to a rigid threshold that ignores the specificity and high-order affinities of different nodes. This inevitably causes inefficient truncations from both normal and anomalous nodes, limiting the effectiveness of anomaly detection. To this end, this paper proposes a novel truncation model combining contextual and global affinity to truncate the anomalous nodes. The core idea of the work is to use contextual truncation to decrease the affinity of anomalous nodes, while global truncation increases the affinity of normal nodes. Extensive experiments on massive real-world datasets show that our method surpasses peer methods in most graph anomaly detection tasks. In particular, compared with previous state-of-the-art methods, the proposed method achieves improvements of 15% to 20% on two well-known real-world datasets, Amazon and YelpChi. Notably, our method works well on the large datasets Amazon-all and YelpChi-all and achieves the best results, while most previous models cannot complete the tasks.
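Illustrative sketch (not the GCTAM model): the notion of affinity and truncation described above can be pictured on a toy graph. Affinity is taken here as the mean cosine similarity between a node and its neighbours, and truncation drops edges below a single fixed threshold, which is exactly the kind of rigid rule the abstract criticizes. All names, features, and the threshold are assumptions for illustration.

```python
import math
from typing import Dict, List, Set


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_affinity(
    features: Dict[int, List[float]], adjacency: Dict[int, Set[int]]
) -> Dict[int, float]:
    """Mean similarity of each node to its neighbours (0.0 if isolated)."""
    scores = {}
    for node, neighbours in adjacency.items():
        sims = [cosine(features[node], features[m]) for m in neighbours]
        scores[node] = sum(sims) / len(sims) if sims else 0.0
    return scores


def truncate_edges(
    features: Dict[int, List[float]],
    adjacency: Dict[int, Set[int]],
    threshold: float,
) -> Dict[int, Set[int]]:
    """Drop edges whose endpoint similarity falls below a fixed threshold."""
    return {
        node: {m for m in neighbours if cosine(features[node], features[m]) >= threshold}
        for node, neighbours in adjacency.items()
    }


# Toy graph: nodes 0-2 look alike; node 3 is attached to 0 and 1 but differs.
features = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [1.0, 0.1], 3: [0.0, 1.0]}
adjacency = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}}

before = local_affinity(features, adjacency)
truncated = truncate_edges(features, adjacency, threshold=0.5)
after = local_affinity(features, truncated)
```

In this toy graph node 3 has the lowest affinity, and removing its dissimilar edges raises the affinity of the normal nodes it was attached to.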
4479: KnowMDD: Knowledge-guided Cross Contrastive Learning for Major Depressive Disorder Diagnosis
Authors: Anchen Lin, Weikun Wang, Haijun Han, Fanwei Zhu, Qi Ma, Zengwei Zheng, Binbin Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Major Depressive Disorder (MDD) is a prevalent and severe mental disease. Functional Magnetic Resonance Imaging (fMRI)-based diagnostic methods, which analyze Functional Connectivity (FC) to identify abnormal functional connections, have shown promise as biomarker-based approaches for diagnosing depression. However, the high costs of fMRI data result in small sample sizes, hindering the effective identification of abnormal FC patterns. Moreover, existing methods often overlook the potential benefits of incorporating domain knowledge into their models. In this paper, we propose KnowMDD, a novel knowledge-guided cross contrastive learning framework for MDD diagnosis. By incorporating domain knowledge and employing data augmentation, KnowMDD addresses data sparsity while improving robustness and interpretability. Specifically, multiple atlases are used to construct complementary brain graph representations. The default mode network, closely associated with depression, is introduced into the contrastive learning paradigm for diverse subgraph augmentations, while an attention mechanism captures global semantic relationships between brain regions. Building on these components, a cross contrastive learning scheme is designed to learn robust representations for accurate diagnosis. Extensive experiments demonstrate the effectiveness, robustness, and interpretability of KnowMDD, which outperforms state-of-the-art methods. We also develop a demonstration system to show its practical application.
4485: Coming Out of the Dark: Human Pose Estimation in Low-light Conditions
Authors: Yong Su, Defang Chen, Meng Xing, Changjae Oh, Xuewei Liu, Jieyang Li
Location: Guangzhou | Day: TBD
Show Abstract
Human pose estimation in low-light conditions is vital for applications such as surveillance and autonomous systems, yet the severe visual distortions hinder both manual annotation and estimation precision. Existing approaches typically rely on additional reference information to mitigate these issues; however, customized data collection equipment limits their scalability. To alleviate this issue, we construct a Low-Light Images and Poses (LLIP) dataset, which includes only paired low-light images and pose annotations obtained using off-the-shelf motion capture devices. Furthermore, we propose a Multi-grained High-frequency Feature Consistency Learning framework (MHFCL), which does not rely on additional reference information. MHFCL employs a Retinex-inspired restoration stream to recover high-frequency details and integrates them into pose estimation using a multi-grained consistency mechanism. Experiments demonstrate that our approach achieves a new benchmark in low-light pose estimation, while maintaining competitive performance in well-lit conditions.
4488: A Fine-Grained Complexity View on Propositional Abduction – Algorithms and Lower Bounds
Authors: Victor Lagerkvist, Mohamed Maizia, Johannes Schmidt
Location: Montreal | Day: August 21st | Time: 10:00 | Session: KRR: Learning and reasoning
Show Abstract
The Boolean satisfiability problem (SAT) is a well-known example of monotonic reasoning, of intense practical interest due to fast solvers, complemented by rigorous fine-grained complexity results. However, for non-monotonic reasoning, e.g., abductive reasoning, comparatively little is known outside classic complexity theory. In this paper we take a first step toward bridging the gap between monotonic and non-monotonic reasoning by analyzing the complexity of intractable abduction problems under the seemingly overlooked but natural parameter n: the number of variables in the knowledge base. We obtain several positive results for SigmaP2- as well as NP- and coNP-complete fragments, which implies the first example of beating exhaustive search for a SigmaP2-complete problem (to the best of our knowledge). We complement this with lower bounds and for many fragments rule out improvements under the (strong) exponential-time hypothesis.
4494: A Case for Validation Buffer in Pessimistic Actor-Critic
Authors: Michał Nauman, Mateusz Ostaszewski, Marek Cygan
Location: Montreal | Day: August 21st | Time: 15:00 | Session: ML: Reinforcement Learning (2/2)
Show Abstract
In this paper, we investigate the issue of error accumulation in critic networks updated via pessimistic temporal difference objectives. We show that the critic approximation error can be approximated via a recursive fixed-point model similar to that of the Bellman value. We use this recursive definition to retrieve the conditions under which the pessimistic critic is unbiased. Building on these insights, we propose the Validation Pessimism Learning (VPL) algorithm. VPL uses a small validation buffer to adjust the level of pessimism throughout agent training, with the pessimism set such that the approximation error of the critic targets is minimized. We investigate the proposed approach on a variety of locomotion and manipulation tasks and report improvements in sample efficiency and performance.
4504: Language-Conditioned Open-Vocabulary Mobile Manipulation with Pretrained Models
Authors: Shen Tan, Dong Zhou, Xiangyu Shao, Junqiao Wang, Guanghui Sun
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Robotics
Open-vocabulary mobile manipulation (OVMM) that involves the handling of novel and unseen objects across different workspaces remains a significant challenge for real-world robotic applications. In this paper, we propose a novel Language-conditioned Open-Vocabulary Mobile Manipulation framework, named LOVMM, incorporating the large language model (LLM) and vision-language model (VLM) to tackle various mobile manipulation tasks in household environments. Our approach is capable of solving various OVMM tasks with free-form natural language instructions (e.g. "toss the food boxes on the office room desk to the trash bin in the corner", and "pack the bottles from the bed to the box in the guestroom"). Extensive experiments simulated in complex household environments show strong zero-shot generalization and multi-task learning abilities of LOVMM. Moreover, our approach can also generalize to multiple tabletop manipulation tasks and achieve better success rates compared to other state-of-the-art methods.
4505: Fusion of Granular-Ball Visual Spatial Representations for Enhanced Facial Expression Recognition
Authors: Shuaiyu Liu, Qiyao Shen, Yunxi Wang, Yazhou Ren, Guoyin Wang
Location: Guangzhou | Day: TBD
Facial Expression Recognition (FER) is a fundamental problem in computer vision. Despite recent advances, significant challenges remain. Current methods primarily focus on extracting visual representations while overlooking other valuable information. To address this limitation, we propose a novel method called Component Separation and Granular-ball Space Bootstrap Fusion (CS-GBSBF), which leverages granular balls to transform visual images to spatial graphs, thereby enlarging the spatial information embedded in images. Our method separates the face into different components and utilizes the spatial information to bootstrap the fusion. More specifically, CS-GBSBF mainly consists of three crucial networks: Represent Extraction Network (REN), Represent Separation Network (RSN) and Represent Fusion Network (RFN). First, granular balls are used to represent expression images as graphs, which are fed into REN along with images. Then, RSN separates basic visual/spatial representations extracted from REN into a set of component visual/spatial representations. Next, RFN utilizes spatial representations to bootstrap component visual integration. A significant challenge in two-stream models is feature alignment, for which we have developed an Attention Guidance Module (AGM) and a Bootstrap Alignment Loss (L_BA) in REN and RFN, respectively. Experimental results on eight databases show that CS-GBSBF consistently achieves higher recognition accuracy than several state-of-the-art methods. The code is available at https://github.com/Lsy235/CS-GBSBF.
4507: Efficient Constraint-based Window Causal Graph Discovery in Time Series with Multiple Time Lags
Authors: Yewei Xia, Yixin Ren, Hong Cheng, Hao Zhang, Jihong Guan, Minchuan Xu, Shuigeng Zhou
Location: Guangzhou | Day: TBD
We address the identification of direct causes in time series with multiple time lags, and propose a constraint-based window causal graph discovery method. A key advantage of our method is that the number of required conditional independence (CI) tests scales quadratically with the number of sub-series. The method first uses CI tests to find the minimum trek lag between two arbitrary sub-series, followed by designing an efficient CI testing strategy to identify the direct causes between them. We show that the method is both sound and complete under some graph constraints. We compare the proposed method with typical baselines on various datasets. Experimental results show that our method outperforms all the counterparts in both accuracy and running speed.
4509: Multi-view Clustering via Multi-granularity Ensemble
Authors: Jie Yang, Wei Chen, Feng Liu, Peng Zhou, Zhongli Wang, Xinyan Liang, Bingbing Jiang
Location: Guangzhou | Day: TBD
Multi-view clustering aims to integrate complementary information from multiple views to improve clustering performance. However, existing ensemble-based methods suffer from information loss due to their reliance on single-granularity labels, limiting the discriminative capability of learned representations. Meanwhile, representation and graph fusion-based approaches face challenges such as explicit view alignment and manual weight tuning, making them less effective for heterogeneous views with varying data distributions. To address these limitations, we propose a novel multi-view clustering framework via Multi-granularity Ensemble (MGE), fully using the multi-granularity information across diverse views for accurate and consistent clustering. Specifically, MGE first modifies the hierarchical clustering and then leverages it on each view (including the fused view) to achieve multi-granularity labels. Moreover, the cross-view and cross-granularity fusion strategy is designed to learn a robust co-association similarity matrix, which effectively preserves the fine-grained and coarse-grained structures of multi-view data and facilitates subsequent clustering. Therefore, MGE can provide a comprehensive representation of local and global patterns within data, eliminating the requirement for view alignment and weight tuning. Experiments demonstrate that MGE consistently outperforms state-of-the-art methods across multiple datasets, validating its effectiveness and superiority in handling heterogeneous views.
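As a rough illustration of the co-association idea mentioned in the abstract above, the following sketch builds the classic unweighted co-association matrix from several label vectors. It is not the MGE fusion strategy itself, which the abstract describes as learned; the function name `co_association` and the toy labels are hypothetical.

```python
def co_association(partitions):
    """Return the fraction of partitions in which each pair of samples shares a label.

    partitions: list of label lists, one per (view, granularity) clustering.
    This is the plain, unweighted ensemble-clustering construction, used here
    only as an assumption-laden stand-in for a learned fusion.
    """
    n = len(partitions[0])
    weight = 1.0 / len(partitions)
    matrix = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += weight
    return matrix


# Hypothetical toy input: a coarse and a fine clustering of four samples.
coarse = [0, 0, 1, 1]
fine = [0, 1, 2, 2]
similarity = co_association([coarse, fine])
```

Pairs that agree at every granularity receive similarity 1, pairs that agree only at the coarse level receive an intermediate value, which is how both coarse and fine structure end up in one matrix.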
4516: Let’s Group: A Plug-and-Play SubGraph Learning Method for Memory-Efficient Spatio-Temporal Graph Modeling
Authors: Wenchao Weng, Hanyu Jiang, Mei Wu, Xiao Han, Haidong Gao, Guojiang Shen, Xiangjie Kong
Location: Guangzhou | Day: TBD
Spatio-temporal graph modeling is widely applied to spatio-temporal data, analyzing the relationships between data to achieve accurate predictions. However, despite the excellent predictive performance of increasingly complex models, their intricate architectures result in significant memory overhead and computational complexity when handling spatio-temporal data, which limits their practical applications. To address these challenges, we propose a plug-and-play SubGraph Learning (SGL) method to reduce the memory overhead without compromising performance. Specifically, we introduce a SubGraph Partition Module (SGPM), which leverages a set of learnable memory vectors to select node groups with similar features from the graph, effectively partitioning the graph into smaller subgraphs. Note that partitioning the graph may lead to feature redundancy, as overlapping information across subgraphs can occur. To overcome this, we design a SubGraph Feature Aggregation Module (SGFAM), which mitigates redundancy by averaging node features from different subgraphs. Experiments on four traffic network datasets of various scales demonstrate that SGL can significantly reduce memory overhead, achieving up to a 56.4% reduction in average GPU memory overhead, while maintaining robust prediction performance. The source code is available at https://github.com/wengwenchao123/SubGraph-Learning.
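The averaging step attributed to SGFAM in the abstract above can be pictured with a minimal sketch, assuming each subgraph is given as a mapping from node id to a feature vector. The function `aggregate_by_averaging` and the data layout are hypothetical and are not taken from the authors' code.

```python
def aggregate_by_averaging(num_nodes, subgraphs):
    """Average each node's features over every subgraph that contains it.

    subgraphs: list of dicts mapping node id -> list of floats (assumed layout).
    Nodes that appear in no subgraph map to None.
    """
    sums = [None] * num_nodes
    counts = [0] * num_nodes
    for sub in subgraphs:
        for node, feat in sub.items():
            if sums[node] is None:
                sums[node] = [0.0] * len(feat)
            sums[node] = [s + f for s, f in zip(sums[node], feat)]
            counts[node] += 1
    return [
        [s / counts[i] for s in sums[i]] if counts[i] else None
        for i in range(num_nodes)
    ]


# Hypothetical example: node 1 is shared by both subgraphs.
sub_a = {0: [1.0, 2.0], 1: [3.0, 4.0]}
sub_b = {1: [5.0, 6.0], 2: [7.0, 8.0]}
merged = aggregate_by_averaging(3, [sub_a, sub_b])
```

Only the overlapping node is changed by the averaging; nodes that belong to a single subgraph keep their features unchanged.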
4523: Instructing Text-to-Image Diffusion Models via Classifier-Guided Semantic Optimization
Authors: Yuanyuan Chang, Yinghua Yao, Tao Qin, Mengmeng Wang, Ivor Tsang, Guang Dai
Location: Guangzhou | Day: TBD
Text-to-image diffusion models have emerged as powerful tools for high-quality image generation and editing. Many existing approaches rely on text prompts as editing guidance. However, these methods are constrained by the need for manual prompt crafting, which can be time-consuming, introduce irrelevant details, and significantly limit editing performance. In this work, we propose optimizing semantic embeddings guided by attribute classifiers to steer text-to-image models toward desired edits, without relying on text prompts or requiring any training or fine-tuning of the diffusion model. We utilize classifiers to learn precise semantic embeddings at the dataset level. The learned embeddings are theoretically justified as the optimal representation of attribute semantics, enabling disentangled and accurate edits. Experiments further demonstrate that our method achieves high levels of disentanglement and strong generalization across different domains of data. Code is available at https://github.com/Chang-yuanyuan/CASO.
4530: DERI: Cross-Modal ECG Representation Learning with Deep ECG-Report Interaction
Authors: Jian Chen, Xiaoru Dong, Wei Wang, Shaorui Zhou, Lequan Yu, Xiping Hu
Location: Guangzhou | Day: TBD
Electrocardiogram (ECG) is widely used to diagnose cardiac conditions via deep learning methods. Although existing self-supervised learning (SSL) methods have achieved great performance in learning representations for ECG-based cardiac condition classification, the clinical semantics cannot be effectively captured. To overcome this limitation, we propose to learn cross-modal ECG representations that contain more clinical semantics via a novel framework with Deep ECG-Report Interaction (DERI). Specifically, we design a novel framework combining multiple alignments and mutual feature reconstructions to learn effective representation of the ECG with the clinical report, which fuses the clinical semantics of the report. An RME-Module inspired by masked modeling is proposed to improve the ECG representation learning. Furthermore, we extend ECG representation learning to report generation with a language model, which is significant for evaluating clinical semantics in the learned representations and even clinical applications. Comprehensive experiments with various settings are conducted on various datasets to show the superior performance of our DERI. Our code is released on https://github.com/cccccj-03/DERI.
4531: Multimodal Image Matching Based on Cross-Modality Completion Pre-training
Authors: Meng Yang, Fan Fan, Jun Huang, Yong Ma, Xiaoguang Mei, Zhanchuan Cai, Jiayi Ma
Location: Guangzhou | Day: TBD
The differences in imaging devices cause multimodal images to have modal differences and geometric distortions, complicating the matching task. Deep learning-based matching methods struggle with multimodal images due to the lack of large annotated multimodal datasets. To address these challenges, we propose XCP-Match based on cross-modality completion pre-training. XCP-Match has two phases. (1) Self-supervised cross-modality completion pre-training based on a real multimodal image dataset. We develop a novel pre-training model to learn cross-modal semantic features. The pre-training uses a masked image modeling method for cross-modality completion, and introduces an attention-weighted contrastive loss to emphasize matching in overlapping areas. (2) Supervised fine-tuning for multimodal image matching based on the augmented MegaDepth dataset. XCP-Match constructs a complete matching framework to overcome geometric distortions and achieve precise matching. Two-phase training encourages the model to learn deep cross-modal semantic information, improving adaptation to modal differences without needing large annotated datasets. Experiments demonstrate that XCP-Match outperforms existing algorithms on public datasets.
4544: Resistance is Futile: Gradually Declining Immunity Retains the Exponential Duration of Immunity-Free Diffusion
Authors: Andreas Göbel, Nicolas Klodt, Martin S. Krejca, Marcus Pappik
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Agent-based and Multi-agent Systems (1/3)
Diffusion processes pervade numerous areas of AI, abstractly modeling the dynamics of exchanging, oftentimes volatile, information in networks. A central question is how long the information remains in the network, known as survival time. For the commonly studied SIS process, the expected survival time is at least super-polynomial in the network size already on star graphs, for a wide range of parameters. In contrast, the expected survival time of the SIRS process, which introduces temporary immunity, is always at most polynomial on stars and only known to be super-polynomial for far denser networks, such as expanders. However, this result relies on featuring full temporary immunity, which is not always present in actual processes. We introduce the cSIRS process, which incorporates gradually declining immunity such that the expected immunity at each point in time is identical to that of the SIRS process. We study the survival time of the cSIRS process rigorously on star graphs and expanders and show that its expected survival time is very similar to that of the SIS process, which features no immunity. This suggests that featuring gradually declining immunity is almost like having none at all.
4545: NAAST-GNN: Neighborhood Adaptive Aggregation and Spectral Tuning for Graph Anomaly Detection
Authors: Ronghui Guo, Xiaowang Zhang, Zhizhi Yu, Minghui Zou, Sai Zhang, Zhiyong Feng
Location: Guangzhou | Day: TBD
Heterophily emerges as a critical challenge in Graph Anomaly Detection (GAD). Recent studies reveal that neighborhood distributions, rather than heterophily itself, are the fundamental factor for the expressive power of Graph Neural Networks (GNNs). However, two key challenges remain unresolved. First, the overlap in neighborhood distributions between anomalous and normal nodes poses significant difficulties in distinguishing them effectively. Second, the dispersion in neighborhood distributions within the same class prevents the application of a fixed aggregation strategy to accommodate the diverse patterns within the class. To tackle the aforementioned challenges, we propose a novel Graph Neural Network model called Neighborhood Adaptive Aggregation and Spectral Tuning (NAAST-GNN). Specifically, we first design a neighborhood adaptive aggregation module that adjusts the message passing mechanism based on the predicted probabilities for different node classes, ensuring that nodes from distinct classes but with similar neighborhood distributions derive unique aggregated neighborhood information. We then present a spectral tuning module that dynamically selects and combines spectral filters based on the predicted neighborhood distribution, ensuring adaptability to the diverse neighborhood distributions of nodes within the same class. Comprehensive experimental results demonstrate that our method outperforms state-of-the-art baselines.
4551: FedCM: Client Clustering and Migration in Federated Learning via Gradient Path Similarity and Update Direction Deviation
Authors: Peng Wang, Shoupeng Lu, Hao Yin, Banglie Yang, Tianli Zhu, Cheng Dai
Location: Guangzhou | Day: TBD
Federated learning (FL) enables collaborative training among multiple clients while preserving data privacy. However, its practical application is significantly limited by two major challenges: statistical heterogeneity and data distribution drift. Statistical heterogeneity causes the direction of local model updates to deviate from the global training objective, while data distribution drift leads to a mismatch between local models and their cluster models. To address these challenges, this paper proposes an adaptive clustered federated learning framework, Fed-CM. Initially, by capturing the dynamic patterns of personalized layer parameters in clients’ models, Fed-CM effectively characterizes the correlations and distributional similarities among clients, reflecting the underlying statistical heterogeneity. Subsequently, this framework leverages client similarities to construct an undirected graph and adaptively performs effective cluster discovery with minimal dependence on hyperparameters. Furthermore, a monitoring strategy tracks the deviation between clients’ update directions and the dominant update direction of their clusters and then adaptively migrates clients experiencing data drift. Such a dynamic strategy helps maintain intra-cluster homogeneity and addresses the mismatch between local models and their cluster models. Compared to other state-of-the-art methods, experimental results on multiple datasets demonstrate that the proposed Fed-CM framework effectively addresses the challenges posed by statistical heterogeneity and data drift, significantly improving the performance and robustness of federated learning models.
4560: DiffSQL: Leveraging Diffusion Model for Zero-Shot Self-Supervised Monocular Depth Estimation
Authors: Heyuan Zheng, Yunji Liang, Lei Liu, Zhiwen Yu
Location: Guangzhou | Day: TBD
Self-supervised monocular depth estimation has attracted significant attention due to its broad applications in autonomous driving and robotics. Although significant performance improvements have been achieved by learning the relative distance of objects with the introduction of Self Query Layer (SQL), it struggles with zero-shot generalization due to the lack of geometric features and the fixed number of query sizes. To address these problems, we propose a diffusion-augmented self-supervised depth estimation framework, named DiffSQL, to learn geometric priors for feature augmentation. Additionally, we introduce a dynamic self-query layer that implicitly computes the relative distances between objects by adjusting the query size according to the feature distribution. Experimental results on the KITTI dataset show that DiffSQL outperforms SQLdepth by 1.03% in terms of AbsRel and 2.79% in terms of SqRel. Furthermore, our experiments demonstrate that DiffSQL is superior in zero-shot generalization.
4566: ILIF: Temporal Inhibitory Leaky Integrate-and-Fire Neuron for Overactivation in Spiking Neural Networks
Authors: Kai Sun, Peibo Duan, Levin Kuhlmann, Beilun Wang, Bin Zhang
Location: Montreal | Day: August 20th | Time: 10:00 | Session: ML: Spiking Neural Networks
The Spiking Neural Network (SNN) has drawn increasing attention for its energy-efficient, event-driven processing and biological plausibility. To train SNNs via backpropagation, surrogate gradients are used to approximate the non-differentiable spike function, but they only maintain nonzero derivatives within a narrow range of membrane potentials near the firing threshold—referred to as the surrogate gradient support width gamma. We identify a major challenge, termed the dilemma of gamma: a relatively large gamma leads to overactivation, characterized by excessive neuron firing, which in turn increases energy consumption, whereas a small gamma causes vanishing gradients and weakens temporal dependencies. To address this, we propose a temporal Inhibitory Leaky Integrate-and-Fire (ILIF) neuron model, inspired by biological inhibitory mechanisms. This model incorporates interconnected inhibitory units for membrane potential and current, effectively mitigating overactivation while preserving gradient propagation. Theoretical analysis demonstrates ILIF’s effectiveness in overcoming the gamma dilemma, and extensive experiments on multiple datasets show that ILIF improves energy efficiency by reducing firing rates, stabilizes training, and enhances accuracy. The code is available at github.com/kaisun1/ILIF.
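To make the support width gamma in the abstract above concrete, here is a minimal sketch of a rectangular surrogate gradient, one common choice in the SNN literature. It is not the ILIF neuron, and treating gamma as the total width of the window centered on the threshold is an assumption; the function names are hypothetical.

```python
def spike(u, threshold=1.0):
    """Forward pass: non-differentiable Heaviside spike function."""
    return 1.0 if u >= threshold else 0.0


def rectangular_surrogate_grad(u, threshold=1.0, gamma=1.0):
    """Backward pass stand-in: 1/gamma inside a window of total width gamma
    centered on the threshold (assumed convention), and 0 elsewhere."""
    return 1.0 / gamma if abs(u - threshold) < gamma / 2 else 0.0


# The same membrane potential receives a gradient under a wide window
# but none under a narrow one.
wide = rectangular_surrogate_grad(1.2, gamma=1.0)
narrow = rectangular_surrogate_grad(1.2, gamma=0.2)
```

With a small gamma, most membrane potentials fall outside the window and receive zero gradient, which is the vanishing-gradient side of the dilemma the abstract describes.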
4573: RRG-Mamba: Efficient Radiology Report Generation with State Space Model
Authors: Xiaodi Hou, Xiaobo Li, Mingyu Lu, Simiao Wang, Yijia Zhang
Location: Guangzhou | Day: TBD
Recent advancements in radiology report generation have utilized deep neural networks such as CNNs and Transformers, achieving notable improvements in generating accurate and detailed reports. However, their practical adoption is hindered by the challenge of balancing global dependency modeling with computational efficiency. The state space model, particularly its enhanced variant Mamba, offers promising linear-complexity solutions for long-range dependency modeling. Despite its strengths, Mamba’s fixed positional encoding limits its ability to effectively capture complex spatial dependencies. To address this gap, we propose RRG-Mamba, an advanced framework for efficient radiology report generation. Within RRG-Mamba, we enhance the vanilla Mamba by integrating rotary position encoding (RoPE), enabling dynamic modeling of relative positional information in visual feature sequences. Furthermore, we design a global dependency learning module to optimize long-range visual feature sequence modeling. Extensive experiments on publicly available datasets, including IU X-Ray and MIMIC-CXR, demonstrate that RRG-Mamba achieves a 3.7% improvement in BLEU-4 score over existing models, along with significant gains in computational and memory efficiency. Our code is available at https://github.com/Eleanorhxd/RRG-Mamba.
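The relative-position property of rotary position encoding referred to in the abstract above can be checked with a small sketch of the standard RoPE rotation. How RRG-Mamba wires RoPE into Mamba is not described here, so this shows only the generic encoding; the helper names `rope` and `dot` and the example vectors are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by position * base**(-i/d).

    Standard RoPE formulation; len(vec) must be even.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Both pairs of positions differ by 2, so the two scores should match.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 5), rope(k, 3))
```

Because each pair of coordinates is rotated by an angle proportional to the position, the inner product of two encoded vectors depends only on the difference of their positions.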
4574: Keypoints as Dynamic Centroids for Unified Human Pose and Segmentation
Authors: Niaz Ahmad, Jawad Khan, Kang G. Shin, Youngmoon Lee, Guanghui Wang
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Computer vision (2/3)
The dynamic movement of the human body presents a fundamental challenge for human pose estimation and body segmentation. State-of-the-art approaches primarily rely on combining keypoint heatmaps with segmentation masks, but often struggle in scenarios involving overlapping joints during pose estimation or rapidly changing poses for instance-level segmentation. To address these limitations, we leverage Keypoints as Dynamic Centroid (KDC), a new centroid-based representation for unified human pose estimation and instance-level segmentation. KDC adopts a bottom-up paradigm to generate keypoint heatmaps for easily distinguishable and complex keypoints, and improves keypoint detection and confidence scores by introducing KeyCentroids using a keypoint disk. It leverages high-confidence keypoints as dynamic centroids in the embedding space to generate MaskCentroids, allowing for the swift clustering of pixels to specific human instances during rapid changes in human body movements in a live environment. Our experimental evaluations focus on crowded and occluded cases using the CrowdPose, OCHuman, and COCO benchmarks, demonstrating KDC’s effectiveness and generalizability in challenging scenarios in terms of both accuracy and runtime performance. Our implementation is available at https://sites.google.com/view/niazahmad/projects/kdc.
4582: MSCI: Addressing CLIP’s Inherent Limitations for Compositional Zero-Shot Learning
Authors: Yue Wang, Shuai Xu, Xuelin Zhu, Yicong Li
Location: Guangzhou | Day: TBD
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies mostly rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architecture and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP’s visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. This key information is progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model’s perception capability for fine-grained local visual information. Additionally, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, as well as different elements within the same combination, allowing it to flexibly adapt to diverse scenarios. Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model. Data and code are available at https://github.com/ltpwy/MSCI.
4589: Concurrent Planning and Execution Using Dispatch-Dependent Values
Authors: Andrew Coles, Erez Karpas, Eyal Shimony, Shahaf Shperberg, Wheeler Ruml
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Planning and Scheduling (2/5)
Agents operating in the real world must cope with the fact that time passes while they plan. In some cases, such as under tight deadlines, the only way for such an agent to achieve its goal is to execute an action before a complete plan has been found. This problem is called Concurrent Planning and Execution (CoPE). Previous work on CoPE relied on a value function that assumes search will finish before actions are executed, causing the agent to be overly pessimistic in many situations. In this paper, we define a new value function that takes into account the agent’s ability to dispatch actions incrementally. This allows us to devise a much simpler algorithm for concurrent planning and execution. An experimental evaluation on problems with time pressure shows that the new method significantly outperforms the previous state-of-the-art.
4614: Instantiation-based Formalization of Logical Reasoning Tasks Using Language Models and Logical Solvers
Authors: Mohammad Raza, Natasa Milic-Frayling
Location: Montreal | Day: August 20th | Time: 14:00 | Session: KR: Logic
Robustness of reasoning remains a significant challenge for large language models, and addressing it is essential for the practical applicability of AI-driven reasoning systems. We introduce Semantic Self-Verification (SSV), a novel approach that addresses the key challenge in combining language models with the rigor of logical solvers: to accurately formulate the reasoning problem from natural language to the formal language of the solver. SSV uses a consistency-based approach to produce strong abstract formalizations of problems using concrete instantiations that are generated by the model and verified by the solver. In addition to significantly advancing the overall reasoning accuracy over the state-of-the-art, a key novelty that this approach presents is a feature of verification that has near-perfect precision over a significant coverage of cases, as we demonstrate on open reasoning benchmarks. We propose such near-certain reasoning as a new approach to reduce the need for manual verification in many cases, taking us closer to more dependable and autonomous AI reasoning systems.
4624: APIMig: A Project-Level Cross-Multi-Version API Migration Framework Based on Evolution Knowledge Graph
Authors: Li Kuang, Qi Xie, HaiYang Yang, Yang Yang, Xiang Wei, HaoYue Kang, YingJie Xia
Location: Guangzhou | Day: TBD
API migration is essential for software maintenance due to the rapid evolution of third-party libraries where API elements may change continuously through updates. There are two main challenges for API migration at the project level, especially across multiple versions: 1) lack of specific library evolution knowledge across multiple versions; 2) difficulty in identifying the chain of changes at the project level. This paper proposes a project-level cross-multi-version API migration framework, APIMig. We first construct an API evolution knowledge graph (KG) to capture changes between adjacent library versions and then derive coherent cross-version API evolution knowledge by KG reasoning. Second, we design a chain exploration algorithm to track the chain of changes and aggregate the affected code segments. Finally, a large language model is employed to complete the API migration, given the API evolution knowledge and the chain of changes. We construct an evolution KG for the Lucene library from version 4.0.0 to 10.1.0 and evaluate our approach through project migration pairs that depend on different major versions. Our framework shows improvements over the baseline in migrating projects across 7 major versions, achieving average increases of 16.52% in CodeBLEU scores and 28.49% in VCEU scores with GPT-4o.
4681: Hyper-graph Video Object Segmentation via Text-depth Collaborative Reasoning
Authors: Jiaqing Fan, Yifan Liao, Fanzhang Li
Location: Guangzhou | Day: TBD
Current video object segmentation (VOS) solutions often overlook the rich information in subtitles and depth cues among video sequences, which are crucial for effectively linking video content. Recognizing the significance of these elements, in this paper, we introduce a novel approach termed "Hyper-graph Text-Depth Collaborative Reasoning Video Object Segmentation" (HTD). It aims to leverage the synergy between textual and depth information to enhance the segmentation of objects in video sequences. The HTD framework integrates textual and depth data into a hyper-graph structure, where nodes represent objects, text, and depth features, and hyper-edges encode complex relationships among them. After capturing the multimodal context of video scenes, the proposed collaborative reasoning mechanism within the hyper-graph iteratively refines object boundaries by considering the interplay between textual cues, depth information, and spatial-temporal coherence. We demonstrate the effectiveness of HTD through extensive experiments on four benchmarks. The results show that our approach outperforms state-of-the-art VOS methods, particularly in scenarios with complex backgrounds, occlusions, and dynamic scenes. The inclusion of text and depth data not only improves segmentation accuracy but also enhances the interpretability of the segmentation process. We have released the training and testing code at https://github.com/zyaireleo/HTD.git.
4691: Online 3D Gaussian Splatting Modeling with Novel View Selection
Authors: Byeonggwon Lee, Junkyu Park, Khang Truong Giang, Soohwan Song
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Computer Vision (1/3)
This study addresses the challenge of generating online 3D Gaussian Splatting (3DGS) models from RGB-only frames. Previous studies have employed dense SLAM techniques to estimate 3D scenes from keyframes for 3DGS model construction. However, these methods are limited by their reliance solely on keyframes, which are insufficient to capture an entire scene, resulting in incomplete reconstructions. Moreover, building a generalizable model requires incorporating frames from diverse viewpoints to achieve broader scene coverage. However, online processing restricts the use of many frames or extensive training iterations. Therefore, we propose a novel method for high-quality 3DGS modeling that improves model completeness through adaptive view selection. By analyzing reconstruction quality online, our approach selects optimal non-keyframes for additional training. By integrating both keyframes and selected non-keyframes, the method refines incomplete regions from diverse viewpoints, significantly enhancing completeness. We also present a framework that incorporates an online multi-view stereo approach, ensuring consistency in 3D information throughout the 3DGS modeling process. Experimental results demonstrate that our method outperforms state-of-the-art methods, delivering exceptional performance in complex outdoor scenes.
4707: DFMU: Distribution-based Framework for Modeling Aleatoric Uncertainty in Multimodal Sentiment Analysis
Authors: Chen Tang, Tingrui Shen, Xinrong Gong, Chong Zhao, Tong Zhang
Location: Guangzhou | Day: TBD
In Multimodal Sentiment Analysis (MSA), data noise arising from various sources can lead to Aleatoric Uncertainty (AU), significantly impacting model performance. Current efforts to address AU have insufficiently explored its sources. They primarily focus on modeling noise rather than implementing targeted modeling based on its origin. Consequently, these approaches struggle to effectively mitigate the influence of AU, resulting in sustained limitations in model performance. Our research identifies that the AU primarily stems from two problems: subjective bias in the annotation process and the complex set relationships of sentiment features. To specifically address them, we propose DFMU, a Distribution-based Framework for Modeling Aleatoric Uncertainty, which incorporates an uncertainty modeling block capable of encoding uncertainty distributions and adaptively adjusting optimization objectives. Furthermore, we introduce distribution-based contrastive learning with sentiment word replacement to better capture the complex relationships among features. Extensive experiments on three public MSA datasets, i.e., MOSI, MOSEI, and SIMS, demonstrate that the proposed model maintains robust performance even under high noise conditions and achieves state-of-the-art results on these popular datasets.
4708: Reinforced In-Context Black-Box Optimization
Authors: Lei Song, Chen-Xiao Gao, Ke Xue, Chenyang Wu, Dong Li, Jianye Hao, Zongzhang Zhang, Chao Qian
Location: Guangzhou | Day: TBD
Black-Box Optimization (BBO) has found successful applications in many fields of science and engineering. Recently, there has been a growing interest in meta-learning particular components of BBO algorithms to speed up optimization and get rid of tedious hand-crafted heuristics. As an extension, learning the entire algorithm from data requires the least labor from experts and can provide the most flexibility. In this paper, we propose RIBBO, a method to reinforce-learn a BBO algorithm from offline data in an end-to-end fashion. RIBBO employs expressive sequence models to learn the optimization histories produced by multiple behavior algorithms and tasks, leveraging the in-context learning ability of large models to extract task information and make decisions accordingly. Central to our method is to augment the optimization histories with regret-to-go tokens, which are designed to represent the performance of an algorithm based on cumulative regret over the future part of the histories. The integration of regret-to-go tokens enables RIBBO to automatically generate sequences of query points that are positively correlated to the user-desired regret, verified by its universally good empirical performance on diverse problems, including BBO benchmark, hyper-parameter optimization, and robot control problems.
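The regret-to-go idea can be illustrated with a minimal sketch. It assumes a maximization problem with a known optimum, defines the regret of a query as the gap to that optimum, and lets the token at step t sum the regrets of the strictly later steps. The function name `regret_to_go` and the choice to exclude the current step are illustrative assumptions, not details taken from the abstract.

```python
from typing import List


def regret_to_go(values: List[float], optimum: float) -> List[float]:
    """Return, for each step t, the cumulative regret of all later steps.

    Assumptions (not from the abstract): maximization, a known optimum,
    and a token that covers only steps strictly after t.
    """
    tokens = [0.0] * len(values)
    running = 0.0
    # Walk the history backwards so each token is a suffix sum of regrets.
    for t in range(len(values) - 1, -1, -1):
        tokens[t] = running
        running += optimum - values[t]
    return tokens


# Objective values observed along one optimization history.
history = [1.0, 3.0, 4.0]
print(regret_to_go(history, optimum=5.0))  # [3.0, 1.0, 0.0]
```

Under this convention the token shrinks as the history improves, so conditioning a sequence model on a small token asks it for queries with low future regret.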
4713: Hypernetwork Aggregation for Decentralized Personalized Federated Learning
Authors: Weishi Li, Yong Peng, Mengyao Du, Fuhui Sun, Xiaoyan Wang, Li Shen
Location: Guangzhou | Day: TBD
Personalized Federated Learning (PFL) meets each user’s personalized needs but still faces high communication costs due to the large amount of data transmitted and frequent communication. Decentralized PFL (DPFL), as an alternative, discards the central server in PFL, which reduces communication pressure and the risk of server failure by using peer-to-peer communication. Nevertheless, DPFL still suffers from significant communication pressure due to the transmission of a large number of model parameters, especially with numerous nodes. To address these issues, we propose a novel personalized framework, DFedHP, in which each client utilizes a hypernetwork to generate the shared part of the model parameters and trains the personalized parameters separately. The number of parameters in a hypernetwork is much smaller than that in a typical local network, so hypernetwork aggregation reduces communication costs and the risk of privacy leakage. Furthermore, DFedHP can be seamlessly integrated into existing DPFL algorithms as a plugin to boost their efficacy. Finally, extensive experiments in various data-heterogeneous environments demonstrate that DFedHP can reduce communication costs, accelerate convergence, and improve generalization performance compared with state-of-the-art (SOTA) baselines.
4718: VimGeo: Efficient Cross-View Geo-Localization with Vision Mamba Architecture
Authors: Jinglin Huang, Maoqiang Wu, Peichun Li, Wen Wu, Rong Yu
Location: Guangzhou | Day: TBD
Cross-view geo-localization is a crucial task with diverse applications, yet it remains challenging due to the significant variations in viewpoints and visual appearances between images from different perspectives. While recent advancements have been made, existing methods often suffer from high model complexity, excessive resource consumption, and the impact of sample learning difficulty on optimization. To overcome these limitations, we optimize the Vision Mamba (Vim) model, built on a State Space Model (SSM) architecture, by replacing the traditional classification head with Channel Group Pooling (CGP) for efficient feature integration. This optimization reduces model parameters by 1.5% and computational complexity by 0.4%. Additionally, we propose a novel Dynamic Weighted Batch-tuple Loss (DWBL) to dynamically adjust the weighting of negative samples, improving model performance. By combining CGP and DWBL, we develop an efficient end-to-end network, VimGeo, which achieves state-of-the-art performance with enhanced computational efficiency. Specifically, VimGeo achieves a Recall@1 of 81.67% on the CVACT_test dataset, outperforming prior approaches. Extensive experiments on CVUSA, CVACT, and VIGOR datasets validate VimGeo’s effectiveness and competitiveness in cross-view geo-localization tasks, achieving the leading results among sequence modeling-based methods. The implementation is available at: https://github.com/VimGeoTeam/VimGeo.
4720: ForgDiffuser: General Image Forgery Localization with Diffusion Models
Authors: Mengxi Wang, Shaozhang Niu, Jiwei Zhang
Location: Guangzhou | Day: TBD
Current general image forgery localization (GIFL) methods confront two main challenges: decoder overconfidence, which causes misidentification of authentic regions or incomplete predicted masks, and limited accuracy in localizing forgery details. Recently, diffusion models have excelled as a dominant approach among generative models and are particularly effective in capturing complex scene details. However, their potential for GIFL remains underexplored. Therefore, we propose a GIFL framework named ForgDiffuser with diffusion models. The core of ForgDiffuser lies in leveraging diffusion models conditioned on the forgery image to efficiently generate the segmentation mask for tampered regions. Specifically, we introduce the attention-guided module (AGM) to aggregate and enhance image feature representations. Meanwhile, we design the boundary-driven module (BDM) with edge supervision to improve the localization accuracy of boundary details. Additionally, the probabilistic modeling and stochastic sampling mechanisms of diffusion models effectively alleviate the overconfidence issue commonly observed in traditional decoders. Experiments on six benchmark datasets demonstrate that ForgDiffuser outperforms existing mainstream GIFL methods in both localization accuracy and robustness, especially under challenging manipulation conditions.
4721: Discrete Budget Aggregation: Truthfulness and Proportionality
Authors: Ulrike Schmidt-Kraepelin, Warut Suksompong, Markus Utke
Location: Montreal | Day: August 21st | Time: 15:00 | Session: GTEP: Computational social choice (2/2)
We study a budget aggregation setting where voters express their preferred allocation of a fixed budget over a set of alternatives, and a mechanism aggregates these preferences into a single output allocation. Motivated by scenarios in which the budget is not perfectly divisible, we depart from the prevailing literature by restricting the mechanism to output allocations that assign integral amounts. This seemingly minor deviation has significant implications for the existence of truthful mechanisms. Specifically, when voters can propose fractional allocations, we demonstrate that the Gibbard-Satterthwaite theorem can be extended to our setting. In contrast, when voters are restricted to integral ballots, we identify a class of truthful mechanisms by adapting moving-phantom mechanisms to our context. Finally, we show that while a weak form of proportionality can be achieved alongside truthfulness, stronger proportionality notions derived from approval-based committee voting are incompatible with truthfulness.
4730: VeRecycle: Reclaiming Guarantees from Probabilistic Certificates for Stochastic Dynamical Systems after Change
Authors: Sterre Lutz, Matthijs T.J. Spaan, Anna Lukina
Location: Montreal | Day: August 21st | Time: 10:00 | Session: AI Ethics, Trust, Fairness (2/3)
Autonomous systems operating in the real world encounter a range of uncertainties. Probabilistic neural Lyapunov certification is a powerful approach to proving safety of nonlinear stochastic dynamical systems. When faced with changes beyond the modeled uncertainties, e.g., unidentified obstacles, probabilistic certificates must be transferred to the new system dynamics. However, even when the changes are localized in a known part of the state space, state-of-the-art requires complete re-certification, which is particularly costly for neural certificates. We introduce VeRecycle, the first framework to formally reclaim guarantees for discrete-time stochastic dynamical systems. VeRecycle efficiently reuses probabilistic certificates when the system dynamics deviate only in a given subset of states. We present a general theoretical justification and algorithmic implementation. Our experimental evaluation shows scenarios where VeRecycle both saves significant computational effort and achieves competitive probabilistic guarantees in compositional neural control.
Code — https://github.com/SUMI-lab/VeRecycle
Extended version — https://doi.org/10.48550/arXiv.2505.14001
4744: Deduction with Induction: Combining Knowledge Discovery and Reasoning for Interpretable Deep Reinforcement Learning
Authors: Haodi Zhang, Xiangyu Zeng, Junyang Chen, Yuanfeng Song, Rui Mao, Fangzhen Lin
Location: Guangzhou | Day: TBD
Deep reinforcement learning (DRL) has achieved remarkable success in dynamic decision-making tasks. However, its inherent opacity and cold start problem hinder transparency and training efficiency. To address these challenges, we propose HRL-ID, a neural-symbolic framework that combines automated rule discovery with logical reasoning within a hierarchical DRL structure. HRL-ID dynamically extracts first-order logic rules from environmental interactions, iteratively refines them through success-based updates, and leverages these rules to guide action execution during training. Extensive experiments on Atari benchmarks demonstrate that HRL-ID outperforms state-of-the-art methods in training efficiency and interpretability, achieving higher reward rates and successful knowledge transfer between domains.
4751: Problem-dependent Regret for Lexicographic Multi-Armed Bandits with Adversarial Corruptions
Authors: Bo Xue, Xi Lin, Yuanyu Wan, Qingfu Zhang
Location: Guangzhou | Day: TBD
This paper studies lexicographic multi-armed bandits (MAB), where after selecting an arm, the agent observes a reward vector including multiple objectives, each with a different level of importance. Although previous literature has proposed an algorithm for lexicographic MAB, it suffers from several limitations: (1) it exhibits poor adversarial robustness due to its reliance on stochastic rewards, (2) its regret bound is suboptimal compared to single-objective counterparts, and (3) its regret bound does not adapt to specific problem instances. To address these limitations, we study lexicographic MAB with adversarial corruptions, where an adversary might corrupt the stochastic rewards with a corruption budget of C. First, when the value of C is known, we propose an algorithm achieving a problem-dependent regret bound of O(∑(log T / Δⁱ(a) + C)) for the i-th objective (i ∈ [M]), where Δⁱ(a) is the reward gap for arm a on the i-th objective, and M is the number of objectives. In the purely stochastic setting (C=0), this regret bound approaches optimality. Second, we introduce another algorithm that does not require the value of C but incurs a less favorable regret bound of O(∑(γ_T / Δⁱ(a) + γ_T)) for the i-th objective, where γ_T = O((log T)² + KC(log T)²). Finally, we conduct experiments on both synthetic and real-world datasets to verify the effectiveness of our algorithms.
4772: A Symmetric Relative-Error Loss Function for Intermittent Multiscale Signal Modelling
Authors: Sergio M. Vanegas Arias, Lasse Lensu, Fredy Ruiz Palacios
Location: Montreal | Day: August 20th | Time: 14:00 | Session: ML: time series, sequences and signals
Multiscale signals represent a formidable modelling challenge in Machine Learning as the ubiquitous Mean Squared Error loss function neglects signal behaviour at smaller values. Several scale-equalizing error metrics have been devised to tackle this problem, amongst which the Mean Absolute Percentage Error (MAPE) remains the most widely used due to its simplicity and interpretability. However, by its very definition, MAPE introduces three major issues: asymptotic behaviour at zero-target values, asymptotic gradient behaviour at zero error, and accuracy loss for large signal scales. We address these limitations by proposing the Symmetric Mean Arctangent Squared Percentage Error (SMASPE), which builds up from the Mean Arctangent Absolute Percentage Error (MAAPE) and leverages a mathematically smoother definition along with user-provided signal bounds to extend its functionality. The numerical properties of SMASPE are explored, and its performance is tested in two real-life cases for deterministic and stochastic optimization. The experiments show a clear advantage of the proposed loss function, with an improvement of up to 42% with respect to MAAPE in terms of Mean Absolute Error for deep learning models when appropriate bounds are selected.
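The zero-target issue mentioned above can be made concrete with a small sketch of the two baseline metrics, using their standard textbook definitions: MAPE averages |(y − ŷ)/y|, and MAAPE averages the arctangent of that same ratio. SMASPE itself is not implemented here because the abstract does not give its formula. Treating a 0/0 ratio as zero error is an assumption made only to keep the example self-contained.

```python
import math
from typing import List, Sequence


def absolute_percentage_errors(y_true: Sequence[float],
                               y_pred: Sequence[float]) -> List[float]:
    """Per-sample |(y - y_hat) / y|, with an explicit zero-target branch."""
    errors = []
    for y, y_hat in zip(y_true, y_pred):
        if y == 0.0:
            # The ratio diverges at a zero target; 0/0 is treated as
            # zero error here (an assumption made for this sketch).
            errors.append(0.0 if y_hat == 0.0 else math.inf)
        else:
            errors.append(abs((y - y_hat) / y))
    return errors


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, as a fraction rather than a percent."""
    ape = absolute_percentage_errors(y_true, y_pred)
    return sum(ape) / len(ape)


def maape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Arctangent Absolute Percentage Error, bounded by pi / 2."""
    ape = absolute_percentage_errors(y_true, y_pred)
    return sum(math.atan(e) for e in ape) / len(ape)


targets = [0.0, 1.0]
predictions = [1.0, 1.0]
print(mape(targets, predictions))   # inf: a single zero target dominates
print(maape(targets, predictions))  # pi / 4: the arctangent keeps it finite
```

The arctangent bounds each term by π/2, which is why MAAPE stays finite where MAPE diverges.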
4777: Decentralized Online Learning by Selfish Agents in Coalition Formation
Authors: Saar Cohen, Noa Agmon
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Game Theory and Economic Paradigms
Coalition formation involves self-organized coalitions generated through strategic interactions of autonomous selfish agents. In online learning of coalition structures, agents’ preferences toward each other are initially unknown before agents interact. Coalitions are formed iteratively based on preferences that agents learn online from repeated feedback resulting from their interactions. In this paper, we introduce online learning in coalition formation through the lens of distributed decision-making, where self-interested agents operate without global coordination or information sharing, and learn only from their own experience. Under our selfish perspective, each agent seeks to maximize her own utility. Thus, we analyze the system in terms of Nash stability, where no agent can improve her utility by unilaterally deviating. We devise a sample-efficient decentralized algorithm for selfish agents that minimize their Nash regret, yielding approximately Nash stable solutions. In our algorithm, each agent uses only one utility feedback per round to update her strategy, but our algorithm still has Nash regret and sample complexity bounds that are optimal up to logarithmic factors.
4782: EFormer: An Effective Edge-based Transformer for Vehicle Routing Problems
Authors: Dian Meng, Zhiguang Cao, Yaoxin Wu, Yaqing Hou, Hongwei Ge, Qiang Zhang
Location: Guangzhou | Day: TBD
Recent neural heuristics for the Vehicle Routing Problem (VRP) primarily rely on node coordinates as input, which may be less effective in practical scenarios where real cost metrics, such as edge-based distances, are more relevant. To address this limitation, we introduce EFormer, an Edge-based Transformer model that uses edges as the sole input for VRPs. Our approach employs a precoder module with a mixed-score attention mechanism to convert edge information into temporary node embeddings. We also present a parallel encoding strategy characterized by a graph encoder and a node encoder, which process graph and node embeddings, respectively, in distinct feature spaces. This design yields a more comprehensive representation of the global relationships among edges. In the decoding phase, parallel context embedding and multi-query integration are used to compute separate attention mechanisms over the two encoded embeddings, facilitating efficient path construction. We train EFormer using reinforcement learning in an autoregressive manner. Extensive experiments on the Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) reveal that EFormer outperforms established baselines on synthetic datasets, including large-scale and diverse distributions. Moreover, EFormer demonstrates strong generalization on real-world instances from TSPLib and CVRPLib. These findings confirm the effectiveness of EFormer’s core design in solving VRPs.
4783: GraphProt: Certified Black-Box Shielding Against Backdoored Graph Models
Authors: Xiao Yang, Yuni Lai, Kai Zhou, Gaolei Li, Jianhua Li, Hang Zhang
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: AI Ethics, Trust, Fairness (3/3)
Graph learning models have been empirically proven to be vulnerable to backdoor threats, wherein adversaries submit trigger-embedded inputs to manipulate the model predictions.
Current graph backdoor defenses manifest several limitations: 1) dependence on model-related details, 2) necessitation of additional fine-tuning, and 3) reliance on extra explainability tools, all of which are infeasible under stringent privacy policies.
To address those limitations, we propose GraphProt, a certified black-box defense method to suppress backdoor attacks on GNN-based graph classifiers. Our GraphProt operates in a model-agnostic manner and solely leverages graph input.
Specifically, GraphProt first introduces designed topology-feature-filtration to mitigate graph anomalies. Subsequently, subgraphs are sampled via a formulated strategy integrating topology and features, followed by a robust model inference through a majority vote-based subgraph prediction ensemble.
Our results across benchmark attacks and datasets show GraphProt effectively reduces attack success rates while preserving regular graph classification accuracy.
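The final inference step, a majority vote over predictions on sampled subgraphs, can be sketched as follows. This is a minimal illustration only: the black-box classifier is any callable, subgraphs are drawn by uniform node sampling (a stand-in for the topology- and feature-aware strategy mentioned above, whose details the abstract does not give), and the names `sample_subgraph` and `majority_vote_predict` are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Hashable, List, Set, Tuple

Edge = Tuple[int, int]
Classifier = Callable[[Set[int], List[Edge]], Hashable]


def sample_subgraph(nodes: Set[int], edges: List[Edge], keep_ratio: float,
                    rng: random.Random) -> Tuple[Set[int], List[Edge]]:
    """Keep a uniform random subset of nodes and the edges among them."""
    k = max(1, round(keep_ratio * len(nodes)))
    kept = set(rng.sample(sorted(nodes), k))
    sub_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, sub_edges


def majority_vote_predict(classify: Classifier, nodes: Set[int],
                          edges: List[Edge], num_samples: int = 11,
                          keep_ratio: float = 0.7, seed: int = 0) -> Hashable:
    """Query the black-box classifier on sampled subgraphs and return
    the most frequent label."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(num_samples):
        sub_nodes, sub_edges = sample_subgraph(nodes, edges, keep_ratio, rng)
        votes[classify(sub_nodes, sub_edges)] += 1
    return votes.most_common(1)[0][0]
```

The intuition is that a small trigger is absent from many sampled subgraphs, so it can sway only a minority of the votes. Using an odd `num_samples` avoids ties in binary classification.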
4816: All Roads Lead to Rome: Exploring Edge Distribution Shifts for Heterophilic Graph Learning
Authors: Yi Wang, Changqin Huang, Ming Li, Tingyi Cai, Zhonglong Zheng, Xiaodi Huang
Location: Guangzhou | Day: TBD
Heterophilic graph neural networks (GNNs) have gained prominence for their ability to learn effective representations in graphs with diverse, attribute-aware relationships. While existing methods leverage attribute inference during message passing to improve performance, they often struggle with challenging heterophilic graphs. This is due to edge distribution shifts introduced by diverse connection patterns, which blur attribute distinctions and undermine message-passing stability. This paper introduces H₂OGNN, a novel framework that reframes edge attribute inference as an out-of-distribution (OOD) detection problem. H₂OGNN introduces a simple yet effective symbolic energy regularization approach for OOD learning, ensuring robust classification boundaries between homophilic and heterophilic edge attributes. This design significantly improves the stability and reliability of GNNs across diverse connectivity patterns. Through theoretical analysis, we show that H₂OGNN addresses the graph denoising problem by going beyond feature smoothing, offering deeper insights into how precise edge attribute identification boosts model performance. Extensive experiments on nine benchmark datasets demonstrate that H₂OGNN not only achieves state-of-the-art performance but also consistently outperforms other heterophilic GNN frameworks, particularly on datasets with high heterophily.
4836: PAMol: Pocket-Aware Drug Design Method with Hypergraph Representation of Protein Pocket Structure and Feature Fusion
Authors: Xiaoli Lin, Xiongwei Liao, Jun Pang, Bo Li, Xiaolong Zhang
Location: Guangzhou | Day: TBD
Efficient generation of targeted drug molecules is crucial in the field of drug discovery. Most existing methods neglect the high-order information in the structure of protein pockets, limiting the performance of generated drug molecules. This paper proposes a pocket-aware drug design framework, namely PAMol, constructing the hypergraph to represent the spatial structure of protein pockets, effectively capturing high-order relations and neighborhood information within the pocket structures. This framework also fuses different modal embeddings from proteins and molecules, to generate high-quality molecules. In addition, a conditional molecule generation module uses the high-order structural information in protein pockets as constraints to more accurately generate molecules for specific targets. The performance of PAMol has been assessed by analyzing generated molecules in terms of vina score, high affinity, QED, SA, LogP, Lipinski, diversity, and time. Experimental results demonstrate the potential of PAMol for targeted drug design. The source code is available at https://github.com/YICHUANSYQ/PAMol.git.
4845: Multi-Omics Analysis for Cancer Subtype Inference via Unrolling Graph Smoothness Priors
Authors: Jielong Lu, Zhihao Wu, Jiajun Yu, Jiajun Bu, Haishuai Wang
Location: Guangzhou | Day: TBD
Integrating multi-omics datasets through data-driven analysis offers a comprehensive understanding of the complex biological processes underlying various diseases, particularly cancer.
Graph Neural Networks (GNNs) have recently demonstrated remarkable ability to exploit relational structures in biological data, enabling advances in multi-omics integration for cancer subtype classification.
Existing approaches often neglect the intricate coupling between heterogeneous omics, limiting their capacity to resolve subtle cancer subtype heterogeneity critical for precision oncology.
To address these limitations, we propose a framework named Graph Transformer for Multi-omics Cancer Subtype Classification (GTMancer).
This framework builds upon the GNN optimization problem and extends its application to complex multi-omics data.
Specifically, our method leverages contrastive learning to embed multi-omics data into a unified semantic space.
We unroll the multiplex graph optimization problem in that unified space and introduce dual sets of attention coefficients to capture structural graph priors both within and among multi-omics data.
This approach enables global omics information to guide the refining of the representations of individual omics.
Empirical experiments on seven real-world cancer datasets demonstrate that GTMancer outperforms existing state-of-the-art algorithms.
4863: A Logic-based Framework for Decoding Enthymemes in Argument Maps Involving Implicitness in Premises and Claims
Authors: Victor David, Anthony Hunter
Location: Montreal | Day: August 19th | Time: 15:00 | Session: KRR: Argumentation
Argument mining is a natural language processing technology aimed at identifying the explicit premises and claims of arguments in text, and the support and attack relationships between them. To better understand, and automatically analyse, the argument maps that are output from argument mining, it would be desirable to instantiate the arguments in the argument map with logical arguments. However, most real-world arguments are enthymemes (i.e. some of the premises and/or claim are implicit), which need to be decoded (i.e. the implicit aspects need to be identified). A key challenge is to decode enthymemes so as to respect the support and attack relationships in the argument map, which involves identifying the missing premises and/or claim and discerning the relationships between them. To address this, we present a novel framework, based on default logic, for representing arguments including enthymemes. We show how decoding an enthymeme means identifying the default rules that are implicit in the premises and claims. We then show how choosing a decoding of the enthymemes in an argument map can be formalized as an optimization problem, and that a solution can be obtained using MaxSAT solvers.
4868: Learn Multi-task Anchor: Joint View Imputation and Label Generation for Incomplete Multi-view Clustering
Authors: Xinxin Wang, Yongshan Zhang, Yicong Zhou
Location: Guangzhou | Day: TBD
Anchor-based incomplete multi-view clustering methods utilize anchors to uncover clustering structures. However, relying on anchor graphs for producing final indicators is indirect, which can lead to information loss and suboptimal outcomes. Besides, most methods neglect the potential of anchors for imputing missing views. To address these limitations, we propose a Joint View Imputation and Label Generation (JVILG) method. JVILG comprises the Anchor-based tensorized Label Generation (ALG) module for generating clustering labels and the Anchor-based sparse regularized Subspace Correlation (ASC) module for recovering missing views. The ALG module explicitly connects data observations, the fine-grained anchor matrix, and soft label matrices within a reconstruction framework through a membership matrix, while imposing tensor Schatten p-norm regularization on the constructed label tensor to capture spatial correlations among views. Meanwhile, the ASC module directly uses fine-grained anchors to impute missing data in respective views. By integrating the ALG and ASC modules, JVILG enhances synergy between different tasks and mitigates the impact of missing information on clustering. Experimental results on six datasets demonstrate the effectiveness of JVILG compared to both shallow and deep state-of-the-art methods. The code is available at https://github.com/W-Xinxin/JVILG.
4876: In-Context Meta LoRA Generation
Authors: Yihua Shao, Minxi Yan, Yang Liu, Siyu Chen, Wenjie Chen, Xinwei Long, Ziyang Yan, Lei Li, Chenyu Zhang, Nicu Sebe, Hao Tang, Yan Wang, Hao Zhao, Mengzhu Wang, Jingcai Guo
Location: Guangzhou | Day: TBD
Low-rank Adaptation (LoRA) has demonstrated remarkable capabilities for task-specific fine-tuning. However, in scenarios that involve multiple tasks, training a separate LoRA model for each one results in considerable inefficiency in terms of storage and inference. Moreover, existing parameter generation methods fail to capture the correlations among these tasks, making multi-task LoRA parameter generation challenging. To address these limitations, we propose In-Context Meta LoRA (ICM-LoRA), a novel approach that efficiently achieves task-specific customization of large language models (LLMs). Specifically, we use training data from all tasks to train a tailored generator, a Conditional Variational Autoencoder (CVAE). The CVAE takes task descriptions as inputs and produces task-aware LoRA weights as outputs. These LoRA weights are then merged with LLMs to create task-specialized models without the need for additional fine-tuning. Furthermore, we utilize in-context meta-learning for knowledge enhancement and task mapping, to capture the relationship between tasks and parameter distributions. As a result, our method achieves more accurate LoRA parameter generation for diverse tasks using the CVAE. ICM-LoRA enables more accurate LoRA parameter reconstruction than current parameter reconstruction methods and is useful for implementing task-specific enhancements of LoRA parameters. At the same time, our method occupies 283MB, only 1% of the storage required by the original LoRA. The code is available at https://github.com/YihuaJerry/ICM-LoRA.
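For readers unfamiliar with the merge step, the sketch below shows the standard LoRA update W' = W + (α / r) · B · A applied to a single weight matrix, written with plain Python lists so it runs without dependencies. The matrices here are toy placeholders. In the approach described above, the low-rank factors would come from the trained generator, and how the authors scale or apply them is not stated in the abstract.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain dense matrix product of x (m x k) and y (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix,
               alpha: float) -> Matrix:
    """Return weight + (alpha / r) * B @ A, the standard LoRA merge.

    lora_b has shape (d_out x r) and lora_a has shape (r x d_in).
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


base = [[1.0, 0.0], [0.0, 1.0]]   # toy 2 x 2 base weight
b = [[1.0], [2.0]]                # d_out x r, with r = 1
a = [[3.0, 4.0]]                  # r x d_in
print(merge_lora(base, b, a, alpha=1.0))  # [[4.0, 4.0], [6.0, 9.0]]
```

Because the update is folded into the base weight, the merged model has the same inference cost as the original.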
4877: Inconsistency Handling in DatalogMTL
Authors: Meghyn Bienvenu, Camille Bourgaux, Atefe Khodadaditaghanaki
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: KR: ontologies
In this paper, we explore the issue of inconsistency handling in DatalogMTL, an extension of Datalog with metric temporal operators. Since facts are associated with time intervals, there are different manners to restore consistency when they contradict the rules, such as removing facts or modifying their time intervals. Our first contribution is the definition of relevant notions of conflicts (minimal explanations for inconsistency) and repairs (possible ways of restoring consistency) for this setting and the study of the properties of these notions and the associated inconsistency-tolerant semantics. Our second contribution is a data complexity analysis of the tasks of generating a single conflict / repair and query entailment under repair-based semantics.
4895: Integrating Answer Set Programming and Large Language Models for Enhanced Structured Representation of Complex Knowledge in Natural Language
Authors: Mario Alviano, Lorenzo Grillo, Fabrizio Lo Scudo, Luis Angel Rodriguez Reiners
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Knowledge Representation and Reasoning (4/4)
Answer Set Programming (ASP) and Large Language Models (LLMs) have emerged as powerful tools in Artificial Intelligence, each offering unique capabilities in knowledge representation and natural language processing, respectively.
In this paper, we combine the strengths of the two paradigms with the aim of improving the structured representation of complex knowledge encoded in natural language.
In a nutshell, the structured representation is obtained by combining syntactic structures extracted by LLMs and semantic aspects encoded in the knowledge base.
The interaction between ASP and LLMs is driven by a YAML file specifying prompt templates and domain-specific background knowledge.
The proposed approach is evaluated using a set of benchmarks based on a dataset obtained from problems of ASP Competitions.
The results of our experiment show that ASP can noticeably improve the F1-score, especially when relatively small models are used.
4897: Diffusion Guided Propagation Augmentation for Popularity Prediction
Authors: Chaozhuo Li, Tianqi Yang, Litian Zhang, Xi Zhang
Location: Guangzhou | Day: TBD
The prediction of information popularity propagation is critical for applications such as recommendation systems, targeted advertising, and social media trend analysis. Traditional approaches primarily rely on historical cascade data, often sacrificing timeliness for prediction accuracy. These methods capture aggregate diffusion patterns but fail to account for the complex temporal dynamics of early-stage propagation. In this paper, we introduce Diffusion Guided Propagation Augmentation (DGPA), a novel framework designed to improve early-stage popularity prediction. DGPA models cascade dynamics by leveraging a generative approach, where a temporal conditional interpolator serves as a noising process and forecasting as a denoising process. By iteratively generating cascade representations through a sampling procedure, DGPA effectively incorporates the evolving time steps of diffusion, significantly enhancing prediction timeliness and accuracy. Extensive experiments on benchmark datasets from Twitter, Weibo, and APS demonstrate that DGPA outperforms state-of-the-art methods in early-stage popularity prediction.
4900: Universal Backdoor Defense via Label Consistency in Vertical Federated Learning
Authors: Peng Chen, Haolong Xiang, Xin Du, Xiaolong Xu, Xuhao Jiang, Zhihui Lu, Jirui Yang, Qiang Duan, Wanchun Dou
Location: Guangzhou | Day: TBD
Backdoor attacks in vertical federated learning (VFL) are particularly concerning as they can covertly compromise VFL decision-making, posing a severe threat to critical applications of VFL. Existing defense mechanisms typically involve either label obfuscation during training or model pruning during inference. However, the inherent limitations on the defender’s access to the global model and complete training data in VFL environments fundamentally constrain the effectiveness of these conventional methods. To address these limitations, we propose the Universal Backdoor Defense (UBD) framework. UBD leverages Label Consistent Clustering (LCC) to synthesize plausible latent triggers associated with the backdoor class. This synthesized information is then utilized for mitigating backdoor threats through Linear Probing (LP), guided by a constraint on Batch Normalization (BN) statistics. Positioned within a unified VFL backdoor defense paradigm, UBD offers a generalized framework for both detection and mitigation that critically does not necessitate access to the entire model or dataset. Extensive experiments across multiple datasets rigorously demonstrate the efficacy of the UBD framework, achieving state-of-the-art performance against diverse backdoor attack types in VFL, including both dirty-label and clean-label variants.
4910: Uncertainty-guided Graph Contrastive Learning from a Unified Perspective
Authors: Zhiqiang Li, Jie Wang, Jianqing Liang, Junbiao Cui, Xingwang Zhao, Jiye Liang
Location: Guangzhou | Day: TBD
The success of current graph contrastive learning methods largely relies on the choice of data augmentation and contrastive objectives. However, most existing methods tend to optimize these two components independently, neglecting their potential interplay, which leads to suboptimal quality of the learned embeddings. To address this issue, we propose Uncertainty-guided Graph Contrastive Learning (UGCL) from a unified perspective. The core of our method is the introduction of sample uncertainty, a critical metric that quantifies the degree of class ambiguity within individual samples. On this basis, we design a novel multi-scale data augmentation strategy and a weighted graph contrastive loss function, both of which significantly enhance the quality of embeddings. Theoretically, we demonstrate that UGCL can coordinate overall optimization objectives through uncertainty, and through experiments, we show that it improves the performance of tasks such as node classification, node clustering, and link prediction, thereby verifying the effectiveness of our method.
4915: Interaction-Data-guided Conditional Instrumental Variables for Debiasing Recommender Systems
Authors: Zhirong Huang, Debo Cheng, Lin Liu, Jiuyong Li, Guangquan Lu, Shichao Zhang
Location: Guangzhou | Day: TBD
It is often challenging to identify a valid instrumental variable (IV), although IV methods have been regarded as effective tools for addressing the confounding bias introduced by latent variables. To deal with this issue, an Interaction-Data-guided Conditional IV (IDCIV) debiasing method is proposed for Recommender Systems, called IDCIV-RS. IDCIV-RS automatically generates the representations of valid CIVs and their corresponding conditioning sets directly from interaction data, significantly reducing the complexity of IV selection while effectively mitigating the confounding bias caused by latent variables in recommender systems. Specifically, IDCIV-RS leverages a variational autoencoder (VAE) to learn both the CIV representations and their conditioning sets from interaction data, followed by the application of least squares to derive causal representations for click prediction. Extensive experiments on two real-world datasets, Movielens-10M and Douban-Movie, demonstrate that IDCIV-RS successfully learns the representations of valid CIVs, effectively reduces bias, and consequently improves recommendation accuracy.
4923: Preference Elicitation for Multi-objective Combinatorial Optimization with Active Learning and Maximum Likelihood Estimation
Authors: Marianne Defresne, Jayanta Mandi, Tias Guns
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Constraint Satisfaction and Optimization (1/3)
Real-life combinatorial optimization problems often involve several conflicting objectives, such as price, product quality and sustainability. A computationally efficient way to tackle multiple objectives is to aggregate them into a single-objective function, such as a linear combination. However, defining the weights of the linear combination upfront is hard; alternatively, the use of interactive learning methods that ask users to compare candidate solutions is highly promising. The key challenges are to generate candidates quickly, to learn an objective function that leads to high-quality solutions and to do so with few user interactions. We build upon the Constructive Preference Elicitation (CPE) framework and show how each of the three properties can be improved: to increase the interaction speed, we investigate using pools of (relaxed) solutions; to improve the learning, we adopt Maximum Likelihood Estimation of a Bradley-Terry preference model; and to reduce the number of user interactions, we select the pair of candidates to compare with an ensemble-based acquisition function inspired by Active Learning. Our careful experimentation demonstrates each of these improvements: on a PC configuration task and a realistic multi-instance routing problem, our method selects queries faster, needs fewer queries and synthesizes higher-quality combinatorial solutions than previous CPE methods.
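As a rough illustration of the Maximum Likelihood Estimation step mentioned in this abstract, the sketch below fits the weights of a linear objective under a Bradley-Terry model, where the probability that solution a is preferred over solution b is sigmoid(w · (f(a) − f(b))). The function name, the plain gradient-ascent optimizer, the learning rate and the toy data are all assumptions made for illustration, as is the convention that a higher weighted sum is preferred; this is not the authors' implementation.

```python
import math

def fit_bradley_terry(pairs, n_objectives, lr=0.1, epochs=2000):
    """Fit linear weights by gradient ascent on the Bradley-Terry log-likelihood.

    pairs: list of (winner_features, loser_features); each is a list of
    objective values for the solution the user preferred / rejected.
    (Hypothetical helper, not taken from the paper.)
    """
    w = [0.0] * n_objectives
    for _ in range(epochs):
        grad = [0.0] * n_objectives
        for win, lose in pairs:
            diff = [a - b for a, b in zip(win, lose)]
            score = sum(wi * di for wi, di in zip(w, diff))
            p_win = 1.0 / (1.0 + math.exp(-score))  # modelled P(winner preferred)
            for i, di in enumerate(diff):
                grad[i] += (1.0 - p_win) * di        # d log(p_win) / d w_i
        w = [wi + lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w

# Toy data: the user preferred the solution that is strong on objective 0
# in three comparisons out of four.
toy_pairs = [([1.0, 0.0], [0.0, 1.0])] * 3 + [([0.0, 1.0], [1.0, 0.0])]
weights = fit_bradley_terry(toy_pairs, n_objectives=2)
```

With this toy data, the fitted weight on objective 0 ends up larger than the weight on objective 1, reflecting the majority of the comparisons.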
4929: Learning Probabilistic Temporal Logic Specifications for Stochastic Systems
Authors: Rajarshi Roy, Yash Pote, Dave Parker, Marta Kwiatkowska
Location: Montreal | Day: August 20th | Time: 14:00 | Session: KR: Logic
There has been substantial progress in the inference of formal behavioural specifications from sample trajectories, for example using Linear Temporal Logic (LTL). However, these techniques cannot handle specifications that correctly characterise systems with stochastic behaviour, which occur commonly in reinforcement learning and formal verification. We consider the passive learning problem of inferring a Boolean combination of probabilistic LTL (PLTL) formulas from a set of Markov chains, classified as either positive or negative. We propose a novel learning algorithm that infers concise PLTL specifications, leveraging grammar-based enumeration, search heuristics, probabilistic model checking and Boolean set-cover procedures. We demonstrate the effectiveness of our algorithm in two use cases: learning from policies induced by RL algorithms and learning from variants of a probabilistic model. In both cases, our method automatically and efficiently extracts PLTL specifications that succinctly characterize the temporal differences between the policies or model variants.
4930: InnateCoder: Learning Programmatic Options with Foundation Models
Authors: Rubens O. Moraes, Quazi Asif Sadmine, Hendrik Baier, Levi H. S. Lelis
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: LLM applications
Outside of transfer learning settings, reinforcement learning agents start their learning process from a clean slate. As a result, such agents have to go through a slow process to learn even the most obvious skills required to solve a problem. In this paper, we present InnateCoder, a system that leverages human knowledge encoded in foundation models to provide programmatic policies that encode “innate skills” in the form of temporally extended actions, or options. In contrast to existing approaches to learning options, InnateCoder learns them from the general human knowledge encoded in foundation models in a zero-shot setting, and not from the knowledge the agent gains by interacting with the environment. Then, InnateCoder searches for a programmatic policy by combining the programs encoding these options into larger and more complex programs. We hypothesized that InnateCoder’s way of learning and using options could improve the sampling efficiency of current methods for learning programmatic policies. Empirical results in MicroRTS and Karel the Robot support our hypothesis, since they show that InnateCoder is more sample efficient than versions of the system that do not use options or learn them from experience.
4935: X-KAN: Optimizing Local Kolmogorov-Arnold Networks via Evolutionary Rule-Based Machine Learning
Authors: Hiroki Shiraishi, Hisao Ishibuchi, Masaya Nakata
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: S: Evolutionary computation (2/2)
Function approximation is a critical task in various fields. However, existing neural network approaches struggle with locally complex or discontinuous functions due to their reliance on a single global model covering the entire problem space. We propose X-KAN, a novel method that optimizes multiple local Kolmogorov-Arnold Networks (KANs) through an evolutionary rule-based machine learning framework called XCSF. X-KAN combines KAN’s high expressiveness with XCSF’s adaptive partitioning capability by implementing local KAN models as rule consequents and defining local regions via rule antecedents. Our experimental results on artificial test functions and real-world datasets demonstrate that X-KAN significantly outperforms conventional methods, including XCSF, Multi-Layer Perceptron, and KAN, in terms of approximation accuracy. Notably, X-KAN effectively handles functions with locally complex or discontinuous structures that are challenging for conventional KAN, using a compact set of rules (average 7.2 rules). These results validate the effectiveness of using KAN as a local model in XCSF, which evaluates the rule fitness based on both accuracy and generality. Our X-KAN implementation and an extended version of this paper, including appendices, are available at https://doi.org/10.48550/arXiv.2505.14273.
4936: A Unifying Framework for Semiring-Based Constraint Logic Programming With Negation
Authors: Jeroen Spaans, Jesse Heyninck
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Constraint Satisfaction and Optimization (3/3)
Constraint Logic Programming (CLP) is a logic programming formalism used to solve problems requiring the consideration of constraints, like resource allocation and automated planning and scheduling.
It has previously been extended in various directions, for example to support fuzzy constraint satisfaction, uncertainty, or negation, with different notions of semiring being used as a unifying abstraction for these generalisations. None of these extensions have studied clauses with negation allowed in the body.
We investigate an extension of CLP which unifies many of these extensions and allows negation in the body. We provide semantics for such programs, using the framework of approximation fixpoint theory, and give a detailed overview of the impacts of properties of the semirings on the resulting semantics. As such, we provide a unifying framework that captures existing approaches and allows them to be extended with a more expressive language.
4940: QiMeng-TensorOp: One-Line Prompt is Enough for High-Performance Tensor Operator Generation with Hardware Primitives
Authors: Xuzhi Zhang, Shaohui Peng, Qirui Zhou, Yuanbo Wen, Qi Guo, Ruizhi Chen, Xinguo Zhu, Weiqiang Xiong, Haixin Chen, Congying Ma, Ke Gao, Chen Zhao, Yanjun Wu, Yunji Chen, Ling Li
Location: Guangzhou | Day: TBD
Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementations take months or more to develop and lack portability. LLMs excel at generating high-level language code, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators.

We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to 1291× performance improvement. Even compared with human experts, QiMeng-TensorOp could reach 251% of OpenBLAS on RISC-V CPUs, and 124% of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp reduces development costs by 200× compared with human experts.
4942: Spatial-Spectral Similarity-Guided Fusion Network for Pansharpening
Authors: Jiazhuang Xiong, Yongshan Zhang, Xinxin Wang, Lefei Zhang
Location: Guangzhou | Day: TBD
Pansharpening fuses lower-resolution multispectral (LRMS) images with high-resolution panchromatic (PAN) images to generate high-resolution multispectral (HRMS) images that preserve both spatial and spectral information. Most deep pansharpening methods face challenges in cross-modal feature extraction and fusion, as well as in exploring the similarities between the fused image and both PAN and LRMS images. In this paper, we propose a spatial-spectral similarity-guided fusion network (S3FNet) for pansharpening. This architecture is composed of three parts. Specifically, a shallow feature extraction layer learns initial spatial, spectral and fused features from PAN and LRMS images. Then, a multi-branch asymmetric encoder, consisting of spatial, spectral and fusion branches, generates corresponding high-level features at different scales. A multi-scale reconstruction decoder, equipped with a well-designed cross-feature multi-head attention fusion block, processes the intermediate feature maps to generate HRMS images. To ensure HRMS images retain maximum spatial and spectral information, a similarity-constrained loss is defined for network training. Extensive experiments demonstrate the effectiveness of our S3FNet over state-of-the-art methods. The code is released at https://github.com/ZhangYongshan/S3FNet.
4944: Improving Prediction Certainty Estimation for Reliable Early Exiting via Null Space Projection
Authors: Jianing He, Qi Zhang, Duoqian Miao, Yi Kun, Shufeng Hao, Hongyun Zhang, Zhihua Wei
Location: Guangzhou | Day: TBD
Early exiting has demonstrated great potential in accelerating the inference of pre-trained language models (PLMs) by enabling easy samples to exit at shallow layers, eliminating the need for executing deeper layers. However, existing early exiting methods primarily rely on class-relevant logits to formulate their exiting signals for estimating prediction certainty, neglecting the detrimental influence of class-irrelevant information in the features on prediction certainty. This leads to an overestimation of prediction certainty, causing premature exiting of samples with incorrect early predictions. To remedy this, we define an NSP score to estimate prediction certainty by considering the proportion of class-irrelevant information in the features. On this basis, we propose a novel early exiting method based on the Certainty-Aware Probability (CAP) score, which integrates insights from both logits and the NSP score to enhance prediction certainty estimation, thus enabling more reliable exiting decisions. The experimental results on the GLUE benchmark show that our method can achieve an average speed-up ratio of 2.19× across all tasks with negligible performance degradation, surpassing the state-of-the-art (SOTA) ConsistentEE by 28%, yielding a better trade-off between task performance and inference efficiency. The code is available at https://github.com/He-Jianing/NSP.git.
4956: Bimodal Depth-First Search for Scalable GAC for AllDifferent
Authors: Sulian Le Bozec-Chiffoleau, Nicolas Beldiceanu, Charles Prud’homme, Gilles Simonin, Xavier Lorca
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Constraint Satisfaction and Optimization (2/3)
We propose a version of DFS designed for Constraint Programming, called bimodal DFS, that scales to both sparse and dense graphs. It runs in O(n + ~m) time, where ~m is the sum, for each vertex v, of the minimum between the numbers of successors and non-successors of v.
Integrating it into Régin’s GAC algorithm for the AllDifferent constraint results in faster performance as the problem size increases, outperforming a GPU-accelerated version.
In the vast majority of our tests, GAC now performs similarly to BC in terms of speed, but is able to solve more problems.
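To make the quantity ~m from this abstract concrete, the short sketch below computes it for a directed graph given as a mapping from each vertex to its set of successors. The function name and the convention that the non-successors of v are counted as n minus its number of successors are assumptions for illustration; the paper may count them slightly differently.

```python
def m_tilde(successors):
    """Sum over vertices of min(#successors, #non-successors).

    successors: dict mapping each vertex to the set of its successors.
    Non-successors of v are counted as n - len(successors[v]) (an assumption).
    """
    n = len(successors)
    return sum(min(len(s), n - len(s)) for s in successors.values())

# Sparse example: a directed path 0 -> 1 -> 2 -> 3.
path = {0: {1}, 1: {2}, 2: {3}, 3: set()}

# Dense example: every vertex points to every other vertex.
dense = {v: {u for u in range(4) if u != v} for v in range(4)}

print(m_tilde(path), m_tilde(dense))
```

In the dense example the graph has 12 arcs, yet each vertex contributes only 1 to the sum, which is why the bound stays small on dense graphs as well as sparse ones.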
4959: Argument-based Multi-Issue Negotiation
Authors: Thalya Fossey, Jean-Guy Mailly, Pavlos Moraitis
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Agent-based and Multi-agent Systems (2/3)
Automated negotiation aims at finding agreements between agents with conflicting goals. Existing utility-based approaches guarantee agents’ satisfaction with negotiation outcomes, especially in multi-issue negotiations where concession mechanisms lead to win-win results. However, they lack explainability and do not consider agents’ beliefs. On the other hand, argument-based approaches provide reasons for accepting or rejecting offers but do not include utility modeling for offers or enable concession mechanisms in multi-issue settings. We propose a novel hybrid approach combining both types of approaches.
The utility-based component enables agents to make concessions on complex negotiation objects to achieve win-win outcomes, while the argumentation component ensures that accepted offers align with the agents’ personal argumentation theories. These theories represent their beliefs, encoding various profiles, ethical considerations, social norms, or legal principles.
4969: Exploiting Position Information in Convolutional Kernels for Structural Re-parameterization
Authors: Tianxiang Hao, Hui Chen, Guiguang Ding
Location: Guangzhou | Day: TBD
In order to boost the performance of a convolutional neural network (CNN), several approaches have shown the benefit of enhancing the spatial encoding of feature maps. However, few works have paid attention to the positional properties of convolutional kernels. In this paper, we demonstrate that different kernel positions are of different importance, depending on the task, dataset and architecture, and that adaptively emphasizing the informative parts of convolutional kernels can lead to considerable improvement. Therefore, we propose a novel structural re-parameterization, Position Boosting Convolution (PBConv), to exploit and enhance the position information in the convolutional kernel. PBConv consists of several concurrent small convolutional kernels, which can be equivalently converted to the original kernel and bring no extra inference cost. Different from existing structural re-parameterization methods, PBConv searches for the optimal re-parameterized structure with a fast heuristic algorithm based on the dispersion of kernel weights. This heuristic search is efficient yet effective, adapting well to varying kernel weight distributions. As a result, PBConv can significantly improve the representational power of a model, especially its ability to extract fine-grained low-level features. Importantly, PBConv is orthogonal to procedural re-parameterization methods and can further boost performance based on them. Code is available on GitHub.
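The claim that parallel small kernels can be converted into a single kernel at no extra inference cost follows from the linearity of convolution. The sketch below shows this for the generic case of one centered small kernel folded into a larger one, using a pure-Python zero-padded convolution. The helper names are hypothetical, and the sketch does not reproduce PBConv's heuristic search or its placement of small kernels at other positions.

```python
def conv2d_same(image, kernel):
    """Zero-padded 'same' 2-D cross-correlation with an odd-sized square kernel."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - r, j + dj - r
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * image[y][x]
            out[i][j] = acc
    return out

def merge_kernels(big, small):
    """Fold a smaller centered kernel into a larger one (zero-pad, then add)."""
    offset = (len(big) - len(small)) // 2
    merged = [row[:] for row in big]
    for i, row in enumerate(small):
        for j, value in enumerate(row):
            merged[offset + i][offset + j] += value
    return merged

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
big = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]   # 3x3 branch
small = [[3]]                                # parallel 1x1 branch

# Training-time view: two branches whose outputs are summed.
two_branch = [
    [a + b for a, b in zip(row_big, row_small)]
    for row_big, row_small in zip(conv2d_same(image, big), conv2d_same(image, small))
]
# Inference-time view: one merged 3x3 kernel.
one_branch = conv2d_same(image, merge_kernels(big, small))
```

Both views produce the same feature map, so only the merged kernel needs to be kept after training.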
4971: Dynamic and Adaptive Feature Generation with LLM
Authors: Xinhao Zhang, Jinghan Zhang, Banafsheh Rekabdar, Yuanchun Zhou, Pengfei Wang, Kunpeng Liu
Location: Guangzhou | Day: TBD
The representation of feature space is a crucial environment where data points get vectorized and embedded for subsequent modeling. Thus, the efficacy of machine learning (ML) algorithms is closely related to the quality of feature engineering. As one of the most important techniques, feature generation transforms raw data into an optimized feature space conducive to model training and further refines the space. Despite the advancements in automated feature engineering and feature generation, current methodologies often suffer from three fundamental issues: lack of explainability, limited applicability, and inflexible strategy. These shortcomings frequently hinder and limit the deployment of ML models across varied scenarios. Our research introduces a novel approach adopting large language models (LLMs) and feature-generating prompts to address these challenges. We propose a dynamic and adaptive feature generation method that enhances the interpretability of the feature generation process. Our approach broadens the applicability across various data types and tasks and offers advantages in terms of strategic flexibility. A broad range of experiments showcases that our approach is significantly superior to existing methods.
4984: Base-Detail Feature Learning Framework for Visible-Infrared Person Re-Identification
Authors: Zhihao Gong, Lian Wu, Yong Xu
Location: Guangzhou | Day: TBD
Visible-infrared person re-identification (VIReID) provides a solution for ReID tasks in 24-hour scenarios; however, significant challenges persist in achieving satisfactory performance due to the substantial discrepancies between visible (VIS) and infrared (IR) modalities. Existing methods inadequately leverage information from different modalities, primarily focusing on mining distinguishing features from modality-shared information while neglecting modality-specific details. To fully utilize differentiated minutiae, we propose a Base-Detail Feature Learning Framework (BDLF) that enhances the learning of both base and detail knowledge, thereby capitalizing on both modality-shared and modality-specific information. Specifically, the proposed BDLF mines detail and base features through a lossless detail feature extraction module and a complementary base embedding generation mechanism, respectively, supported by a novel correlation restriction method that ensures the features gained by BDLF enrich both detail and base knowledge across VIS and IR features. Comprehensive experiments conducted on the SYSU-MM01, RegDB, and LLCM datasets validate the effectiveness of BDLF.
4987: EnergyCompress: A General Case Base Learning Strategy
Authors: Fadi Badra, Esteban Marquer, Marie-Jeanne Lesot, Miguel Couceiro, David Leake
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Knowledge Representation and Reasoning (2/4)
Case-based prediction (CBP) methods do not learn a model of the target decision function but instead perform an inference process that depends on two similarity measures and a reference case base. This paper proposes a strategy, called EnergyCompress, to learn an effective case base by selecting relevant cases from an initial set. Use of EnergyCompress decreases CBP inference time, through case base compression, and also increases prediction performance, for a wide variety of CBP algorithms. EnergyCompress relies on the proposition of a general formulation of the CBP task in the framework of energy-based models, which leads to a new and valuable characterization of the notion of competence in case-based reasoning, in particular at the source case level. Extensive experimental results on 18 benchmark datasets comparing EnergyCompress to 5 reference algorithms for case base maintenance support the benefit of the proposed strategy.
4997: From End-to-end to Step-by-step: Learning to Abstract via Abductive Reinforcement Learning
Authors: Zilong Wang, Jiongda Wang, Xiaoyong Chen, Meng Wang, Ming Ma, ZhiPeng Wang, Zhenyu Zhou, Tianming Yang, Wang-Zhou Dai
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Machine Learning (3/4)
Abstraction is a critical technique in general problem-solving, allowing complex tasks to be decomposed into smaller, manageable sub-tasks. While traditional symbolic planning relies on predefined primitive symbols to construct structured abstractions, its reliance on formal representations limits applicability to real-world tasks. On the other hand, reinforcement learning excels at learning end-to-end policies directly from sensory inputs in unstructured environments but struggles with compositional generalization in complex tasks with delayed rewards. In this paper, we propose Abductive Abstract Reinforcement Learning (A2RL), a novel neuro-symbolic RL framework bridging the two paradigms based on Abductive Learning (ABL), enabling RL agents to learn abstractions directly from raw sensory inputs without predefined symbols.
A2RL induces a finite state machine to represent high-level, step-by-step procedures, where each abstract state corresponds to a sub-algebra of the original Markov Decision Process (MDP). This approach not only bridges the gap between symbolic abstraction and sub-symbolic learning but also provides a natural mechanism for the emergence of new symbols. Experiments show that A2RL can mitigate the delayed reward problem and improve the generalization capability compared to traditional end-to-end RL methods.
4998: Language-Based Bayesian Optimization Research Assistant (BORA)
Authors: Abdoulatif Cissé, Xenophon Evangelopoulos, Vladimir V. Gusev, Andrew I. Cooper
Location: Montreal | Day: August 20th | Time: 14:00 | Session: ML: Machine Learning 6/8
Many important scientific problems involve multivariate optimization coupled with slow and laborious experimental measurements. These high-dimensional searches can be defined by complex, non-convex optimization landscapes that resemble needle-in-a-haystack surfaces, leading to entrapment in local minima. Contextualizing optimizers with human domain knowledge is a powerful approach to guide searches to localized fruitful regions. However, this approach is susceptible to human confirmation bias. It is also challenging for domain experts to keep track of the rapidly expanding scientific literature. Here, we propose the use of Large Language Models (LLMs) for contextualizing Bayesian optimization (BO) via a hybrid optimization framework that intelligently and economically blends stochastic inference with domain knowledge-based insights from the LLM, which is used to suggest new, better-performing areas of the search space for exploration. Our method fosters user engagement by offering real-time commentary on the optimization progress, explaining the reasoning behind the search strategies. We validate the effectiveness of our approach on synthetic benchmarks with up to 15 variables and demonstrate the ability of LLMs to reason in four real-world experimental tasks where context-aware suggestions boost optimization performance substantially.
5005: Synthesising Minimum Cost Dynamic Norms
Authors: Natasha Alechina, Brian Logan, Giuseppe Perelli
Location: Montreal | Day: August 21st | Time: 10:00 | Session: MAS: Formal verification, validation and synthesis
A key problem in the design of normative multi-agent systems is the cost of enforcing a norm (for the system operator) or complying with the norm (for the system users). If the cost is too high, ensuring compliant behavior may be uneconomic, or users may be deterred from participating in the MAS. In this paper, we consider the problem of synthesizing minimum cost dynamic norms to satisfy a system-level objective specified in Alternating Time Temporal Logic with Strategy Contexts (ATLsc∗). We show that synthesizing a dynamic norm under a bound on the cost of any prohibited set of actions has the same complexity as synthesizing arbitrary norms. We also show that synthesizing norms that minimize the average cost of the prohibited set of actions is unsolvable; however, synthesizing ε-optimal norms is possible.
5007: Asset Pricing with Contrastive Adversarial Variational Bayes
Authors: Ruirui Liu, Huichou Huang, Johannes Ruf
Location: Guangzhou | Day: TBD
Machine learning techniques have gained considerable attention in the field of empirical asset pricing. Conditioning on a broad set of firm characteristics, one of the most popular no-arbitrage workhorses is a nonlinear conditional asset pricing model that consists of two modules within a neural network structure, i.e., factor and beta estimates, for which we propose a novel contrastive adversarial variational Bayes (CAVB) framework. To exploit the factor structure, we employ adversarial variational Bayes that transforms the maximum-likelihood problem into a zero-sum game between a variational autoencoder (VAE) and a generative adversarial network (GAN), where an auxiliary discriminative network brings in arbitrary expressiveness to the inference model. To tackle the problem of learning indistinguishable feature representations in the beta network, we introduce a contrastive loss to learn distinctive hidden features of the factor loadings in correspondence to conditional quantiles of return distributions. CAVB establishes a robust relation between the cross-section of asset returns and the common latent factors with nonlinear factor loadings. Extensive experiments show that CAVB not only significantly outperforms prominent models in the existing literature in terms of total and predictive R-squares, but also delivers superior Sharpe ratios after transaction costs for both long-only and long-short portfolios.
5010: Think Twice Before Adaptation: Improving Adaptability of DeepFake Detection via Online Test-Time Adaptation
Authors: Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
Location: Montreal | Day: August 21st | Time: 10:00 | Session: MTA: Security and privacy
Deepfake (DF) detectors face significant challenges when deployed in real-world environments, particularly when encountering test samples that deviate from the training data through either postprocessing manipulations or distribution shifts. We demonstrate that postprocessing techniques can completely obscure the generation artifacts present in DF samples, leading to performance degradation of DF detectors. To address these challenges, we propose Think Twice before Adaptation (T2A), a novel online test-time adaptation method that enhances the adaptability of detectors during inference without requiring access to source training data or labels. Our key idea is to enable the model to explore alternative options through an Uncertainty-aware Negative Learning objective rather than solely relying on its initial predictions, as is common in entropy minimization (EM)-based approaches. We also introduce an Uncertain Sample Prioritization strategy and a Gradients Masking technique to improve the adaptation by focusing on important samples and model parameters. Our theoretical analysis demonstrates that the proposed negative learning objective exhibits complementary behavior to EM, facilitating better adaptation capability. Empirically, our method achieves state-of-the-art results compared to existing test-time adaptation (TTA) approaches and significantly enhances the resilience and generalization of DF detectors during inference.
5011: Fast Explanations via Policy Gradient-Optimized Explainer
Authors: Deng Pan, Nuno Moniz, Nitesh V. Chawla
Location: Guangzhou | Day: TBD
The challenge of delivering efficient explanations is a critical barrier that prevents the adoption of model explanations in real-world applications. Existing approaches often depend on extensive model queries for sample-level explanations or rely on expert knowledge of specific model structures, trading general applicability for efficiency. To address these limitations, this paper introduces a novel framework, Fast EXplanation (FEX), that represents attribution-based explanations via probability distributions, which are optimized by leveraging the policy gradient method. The proposed framework offers a robust, scalable solution for real-time, large-scale model explanations, bridging the gap between efficiency and applicability.
We validate our framework on image and text classification tasks and the experiments demonstrate that our method reduces inference time by over 97 percent and memory usage by 70 percent compared to traditional model-agnostic approaches while maintaining high-quality explanations and broad applicability.
5018: Randomised Optimism via Competitive Co-Evolution for Matrix Games with Bandit Feedback
Authors: Shishen Lin
Location: Guangzhou | Day: TBD
Learning in games is a fundamental problem in machine learning and artificial intelligence, with numerous applications. This work investigates two-player zero-sum matrix games with an unknown payoff matrix and bandit feedback, where each player observes their actions and the corresponding noisy payoff. Prior studies have proposed algorithms for this setting, demonstrating the effectiveness of deterministic optimism (e.g., UCB for matrix games) in achieving sublinear regret. However, the potential of randomised optimism in matrix games remains theoretically unexplored.

We propose Competitive Co-evolutionary Bandit Learning (CoEBL), a novel algorithm that integrates evolutionary algorithms (EAs) into the bandit framework to implement randomised optimism through EA variation operators. We prove that CoEBL achieves sublinear regret, matching the performance of deterministic optimism-based methods. To the best of our knowledge, this is the first theoretical regret analysis of an evolutionary bandit learning algorithm in matrix games.

Empirical evaluations on diverse matrix game benchmarks demonstrate that CoEBL not only achieves sublinear regret but also consistently outperforms classical bandit algorithms, including EXP3, the variant EXP3-IX, and UCB. These results highlight the potential of evolutionary bandit learning, particularly the efficacy of randomised optimism via evolutionary algorithms in game-theoretic settings.
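For readers unfamiliar with the bandit-feedback setting, the following is a minimal sketch of EXP3, one of the classical baselines named above, run in self-play on a two-player zero-sum matrix game. It is not CoEBL and is not taken from the paper. The function names, the Bernoulli noise model, and the parameter values are assumptions chosen for illustration.

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Mix the normalised weights with uniform exploration (standard EXP3)."""
    total = sum(weights)
    k = len(weights)
    return [(1 - gamma) * w / total + gamma / k for w in weights]


def exp3_update(weights, probs, arm, reward, gamma):
    """Update the played arm with an importance-weighted reward in [0, 1]."""
    k = len(weights)
    estimate = reward / probs[arm]  # unbiased estimate of the arm's reward
    new = list(weights)
    new[arm] *= math.exp(gamma * estimate / k)
    total = sum(new)  # renormalise to avoid overflow; probabilities are unchanged
    return [w / total for w in new]


def play_matrix_game(payoff, rounds, gamma=0.1, seed=0):
    """Self-play with bandit feedback. payoff[i][j] in [0, 1] is the row
    player's mean payoff; the column player receives one minus the sample.
    The Bernoulli noise model is an assumption of this sketch."""
    rng = random.Random(seed)
    n_rows, n_cols = len(payoff), len(payoff[0])
    w_row, w_col = [1.0] * n_rows, [1.0] * n_cols
    for _ in range(rounds):
        p_row = exp3_probabilities(w_row, gamma)
        p_col = exp3_probabilities(w_col, gamma)
        i = rng.choices(range(n_rows), weights=p_row)[0]
        j = rng.choices(range(n_cols), weights=p_col)[0]
        x = 1.0 if rng.random() < payoff[i][j] else 0.0  # noisy payoff sample
        w_row = exp3_update(w_row, p_row, i, x, gamma)
        w_col = exp3_update(w_col, p_col, j, 1.0 - x, gamma)
    return exp3_probabilities(w_row, gamma), exp3_probabilities(w_col, gamma)


row_strategy, col_strategy = play_matrix_game([[0.5, 0.0], [1.0, 0.5]], rounds=200)
```

Each player sees only the action it played and the sampled payoff, never the payoff matrix itself, which is the feedback model the abstract describes.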
5026: Online Housing Market
Authors: Julien Lesca
Location: Montreal | Day: August 21st | Time: 15:00 | Session: GTEP: Computational social choice (2/2)
We study an online variant of the celebrated housing market problem, where each agent owns a single house and seeks to exchange it based on her preferences. In this online setting, agents may arrive and depart at any time, meaning not all agents are present in the housing market simultaneously. We extend the well-known serial dictatorship and top trading cycle mechanisms to the online scenario, aiming to retain their desirable properties, such as Pareto efficiency, individual rationality, and strategy-proofness. These extensions also seek to prevent agents from strategically delaying their arrivals or advancing their departures. We demonstrate that achieving all these properties simultaneously is impossible and present several variants that achieve different subsets of these properties.
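As background, the classical offline top trading cycle mechanism that the paper extends can be sketched as follows. This is the textbook procedure, not the online variant proposed by the author. The sketch assumes every agent ranks all houses, including her own, and identifies each house by its initial owner.

```python
def top_trading_cycles(prefs):
    """Classical (offline) top trading cycles.

    prefs[agent] lists houses from most to least preferred, where each house
    is named after its initial owner. Returns a dict agent -> house received.
    """
    remaining = set(prefs)
    allocation = {}
    while remaining:
        # Every remaining agent points at the owner of her favourite remaining house.
        points_to = {
            a: next(h for h in prefs[a] if h in remaining) for a in remaining
        }
        # Follow pointers from an arbitrary agent until an agent repeats;
        # the agents from the first visit of that agent onwards form a cycle.
        path = []
        agent = min(remaining)  # deterministic starting point
        while agent not in path:
            path.append(agent)
            agent = points_to[agent]
        cycle = path[path.index(agent):]
        # Agents on the cycle trade along it and leave the market.
        for a in cycle:
            allocation[a] = points_to[a]
        remaining -= set(cycle)
    return allocation


example = {
    "a": ["b", "c", "a"],
    "b": ["a", "c", "b"],
    "c": ["a", "b", "c"],
}
result = top_trading_cycles(example)  # a and b swap houses; c keeps her own
```

In the online setting described above, the set of present agents changes over time, so the pointer graph cannot be built once over all agents as it is here.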
5050: Unsupervised Feature Transformation via In-context Generation, Generator-critic LLM Agents, and Duet-play Teaming
Authors: Nanxu Gong, Xinyuan Wang, Wangyang Ying, Haoyue Bai, Sixun Dong, Haifeng Chen, Yanjie Fu
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Data Mining
Feature transformation involves generating a new set of features from the original dataset to enhance the data’s utility. In certain domains like material performance screening, dimensionality is large and collecting labels is expensive and time-consuming. This makes it necessary to transform feature spaces efficiently and without supervision to enhance data readiness and AI utility. However, existing methods fall short in efficient navigation of a vast space of feature combinations, and are mostly designed for supervised settings. To fill this gap, our unique perspective is to leverage a generator-critic duet-play teaming framework using LLM agents and in-context learning to derive pseudo-supervision from unsupervised data. The framework consists of three interconnected steps: (1) Critic agent diagnoses data to generate actionable advice, (2) Generator agent produces tokenized feature transformations guided by the critic’s advice, and (3) Iterative refinement ensures continuous improvement through feedback between agents. The generator-critic framework can be generalized to human-agent collaborative generation, by replacing the critic agent with human experts. Extensive experiments demonstrate that the proposed framework outperforms even supervised baselines in feature transformation efficiency, robustness, and practical applicability across diverse datasets. Our code is publicly available at https://github.com/NanxuGong/LPFG.
5065: Fast and Stronger Lower Bounds for Planar Euclidean Shortest Paths
Authors: Stefan Funke, Daniel Koch, Claudius Proissl, Christian Staib, Felix Weitbrecht
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Planning and Scheduling (4/5)
We consider the problem of quickly providing strong lower bounds for the planar Euclidean shortest path (ESP) problem. Such lower bounds are crucial for guiding the search in A* type approaches or for proving quality guarantees for algorithms that compute approximate solutions.
Our contributions are two-fold: we show how to simplify ESP instances such that computing and storing a visibility graph becomes feasible while distances within the simplified instance are guaranteed to constitute lower bounds for the original problem instance. Furthermore, we show how to precompute a space-efficient data structure that answers distance queries on visibility graphs within a few microseconds with negligible space overhead.
5079: MS-DPPs: Multi-Source Determinantal Point Processes for Contextual Diversity Refinement of Composite Attributes in Text to Image Retrieval
Authors: Naoya Sogi, Takashi Shibata, Makoto Terao, Masanori Suganuma, Takayuki Okatani
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Computer Vision (1/3)
Result diversification (RD) is a crucial technique in Text-to-Image Retrieval for enhancing the efficiency of a practical application. Conventional methods focus solely on increasing the diversity metric of image appearances. However, the diversity metric and its desired value vary depending on the application, which limits the applications of RD. This paper proposes a novel task called CDR-CA (Contextual Diversity Refinement of Composite Attributes). CDR-CA aims to refine the diversities of multiple attributes, according to the application’s context. To address this task, we propose Multi-Source DPPs, a simple yet strong baseline that extends the Determinantal Point Process (DPP) to multi-sources. We model MS-DPP as a single DPP model with a unified similarity matrix based on a manifold representation. We also introduce Tangent Normalization to reflect contexts.
Extensive experiments demonstrate the effectiveness of the proposed method.
5084: Multimodal Inverse Attention Network with Intrinsic Discriminant Feature Exploitation for Fake News Detection
Authors: Tianlin Zhang, En Yu, Yi Shao, Jiande Sun
Location: Guangzhou | Day: TBD
Multimodal fake news detection has garnered significant attention due to its profound implications for social security. While existing approaches have contributed to understanding cross-modal consistency, they often fail to leverage modal-specific representations and explicit discrepant features. To address these limitations, we propose a Multimodal Inverse Attention Network (MIAN), a novel framework that explores intrinsic discriminative features based on news content to advance fake news detection. Specifically, MIAN introduces a hierarchical learning module that captures diverse intra-modal relationships through local-to-global and local-to-local interactions, thereby generating enhanced unimodal representations to improve the identification of fake news at the intra-modal level. Additionally, a cross-modal interaction module employs a co-attention mechanism to establish and model dependencies between the refined unimodal representations, facilitating seamless semantic integration across modalities. To explicitly extract inconsistency features, we propose an inverse attention mechanism that effectively highlights the conflicting patterns and semantic deviations introduced by fake news in both intra- and inter-modality. Extensive experiments on benchmark datasets demonstrate that MIAN significantly outperforms state-of-the-art methods, underscoring its pivotal contribution to advancing social security through enhanced multimodal fake news detection.
5090: LLM4VKG: Leveraging Large Language Models for Virtual Knowledge Graph Construction
Authors: Guohui Xiao, Lin Ren, Guilin Qi, Haohan Xue, Marco Di Panfilo, Davide Lanti
Location: Guangzhou | Day: TBD
Virtual Knowledge Graphs (VKGs) provide an effective solution for data integration but typically require significant expertise for their construction. This process, involving ontology development, schema analysis, and mapping creation, is often hindered by naming ambiguities and matching issues, which traditional rule-based methods struggle to address. Large language models (LLMs), with their ability to process and generate contextually relevant text, offer a potential solution. In this work, we introduce LLM4VKG, a novel framework that leverages LLMs to automate VKG construction. Experimental evaluation on the RODI benchmark demonstrates that LLM4VKG surpasses state-of-the-art methods, achieving an average F1-score improvement of +17% and a peak gain of +39%. Moreover, LLM4VKG proves robust against incomplete ontologies and can handle complex mappings where current methods fail.
5099: Constrained Preferential Bayesian Optimization and Its Application in Banner Ad Design
Authors: Koki Iwai, Yusuke Kumagae, Yuki Koyama, Masahiro Hamasaki, Masataka Goto
Location: Montreal | Day: August 20th | Time: 14:00 | Session: ML: Machine Learning 6/8
Preferential Bayesian optimization (PBO) is a variant of Bayesian optimization that observes relative preferences (e.g., pairwise comparisons) instead of direct objective values, making it especially suitable for human-in-the-loop scenarios. However, real-world optimization tasks often involve inequality constraints, which existing PBO methods have not yet addressed. To fill this gap, we propose constrained preferential Bayesian optimization (CPBO), an extension of PBO that incorporates inequality constraints for the first time. Specifically, we present a novel acquisition function for this purpose. Our technical evaluation shows that our CPBO method successfully identifies optimal solutions by focusing on exploring feasible regions. As a practical application, we also present a designer-in-the-loop system for banner ad design using CPBO, where the objective is the designer’s subjective preference, and the constraint ensures a target predicted click-through rate. We conducted a user study with professional ad designers, demonstrating the potential benefits of our approach in guiding creative design under real-world constraints.
5100: Fast Guaranteed Tensor Recovery with Adaptive Tensor Nuclear Norm
Authors: Jiangjun Peng, Hailin Wang, Xiangyong Cao, Shuang Xu
Location: Guangzhou | Day: TBD
Real-world datasets like multi-spectral images and videos are naturally represented as tensors. However, limitations in data acquisition often lead to corrupted or incomplete tensor data, making tensor recovery a critical challenge. Solving this problem requires exploiting inherent structural patterns, with the low-rank property being particularly vital. An important category of existing low-rank tensor recovery methods relies on the tensor nuclear norms. However, these methods struggle with either computational inefficiency or weak theoretical guarantees for large-scale data. To address these issues, we propose a fast guaranteed tensor recovery framework based on a new tensor nuclear norm. Our approach adaptively extracts a column-orthogonal matrix from the data, reducing a large-scale tensor into a smaller subspace for efficient processing. This dimensionality reduction enhances speed without compromising accuracy. The recovery theories of two typical models are established by introducing an adjusted incoherence condition. Extensive experiments demonstrate the effectiveness of the proposed method, showing improved accuracy and speed over existing approaches. Our code and supplementary material are available at https://github.com/andrew-pengjj/adaptive_tensor_nuclear_norm.
5103: Efficient Multi-view Clustering via Reinforcement Contrastive Learning
Authors: Qianqian Wang, Haiming Xu, Zihao Zhang, Zhiqiang Tao, Quanxue Gao
Location: Guangzhou | Day: TBD
Contrastive multi-view clustering has demonstrated remarkable potential in complex data analysis, yet existing approaches face two critical challenges: difficulty in constructing high-quality positive and negative pairs and high computational overhead due to static optimization strategies. To address these challenges, we propose an innovative efficient Multi-View Clustering framework with Reinforcement Contrastive Learning (EMVCRCL). Our key innovation is developing a reinforcement contrastive learning paradigm for dynamic clustering optimization. First, we leverage multi-view contrastive learning to obtain latent features, which are then sent to the reinforcement learning module to refine low-quality features. Specifically, it selects high-confidence features to guide the positive/negative pair construction of contrastive learning. For the low-confidence features, it utilizes the prior balanced distribution to adjust their assignment. Extensive experimental results showcase the effectiveness and superiority of our proposed method on multiple benchmark datasets.
5109: Fair Incomplete Multi-View Clustering via Distribution Alignment
Authors: Qianqian Wang, Haiming Xu, Meiling Liu, Wei Feng, Xiangdong Zhang
Location: Guangzhou | Day: TBD
Incomplete multi-view clustering (IMVC) extracts consistent and complementary information from multi-source/modality data with missing views, aiming to partition the data into different clusters. It can effectively address the problem of unsupervised multi-source data analysis in complex environments and has gained considerable attention. However, the fairness of IMVC remains underexplored, particularly when data contains sensitive features (e.g., gender, marital status, and age). To tackle the problem, this work presents a novel Fair Incomplete Multi-View Clustering (FIMVC) method. The proposed FIMVC introduces fairness constraints to ensure clustering results are independent of sensitive features. Additionally, it learns consensus representations to enhance clustering performance by maximizing mutual information and aligning the distributions of different views. Experimental results on three datasets containing sensitive features demonstrate that our method improves the fairness of clustering results while outperforming state-of-the-art IMVC methods in clustering performance.
5124: MIRROR: Multi-agent Intra- and Inter-Reflection for Optimized Reasoning in Tool Learning
Authors: Zikang Guo, Benfeng Xu, Xiaorui Wang, Zhendong Mao
Location: Guangzhou | Day: TBD
Complex tasks involving tool integration pose significant challenges for Large Language Models (LLMs), leading to the emergence of multi-agent workflows as a promising solution. Reflection has emerged as an effective strategy for correcting erroneous trajectories in agentic workflows. However, existing approaches only exploit such capability in the post-action stage, where the agent observes the execution outcomes. We argue that, like humans, LLMs can also engage in reflection before action execution: the agent can anticipate undesirable outcomes from its own decisions, which not only provides a necessarily complementary perspective to evaluate the decision but also prevents the propagation of errors throughout the trajectory. In this paper, we propose MIRROR, a framework that consists of both intra-reflection, which critically assesses intended actions before execution, and inter-reflection, which further adjusts the trajectory based on observations. This design systematically leverages LLM reflection capabilities to eliminate and rectify erroneous actions on a more comprehensive scope. Evaluations on both the StableToolBench and TravelPlanner benchmarks demonstrate MIRROR’s superior performance, achieving state-of-the-art results compared to existing approaches.
5129: Learning Neural Vocoder from Range-Null Space Decomposition
Authors: Andong Li, Tong Lei, Zhihang Sun, Rilin Chen, Erwei Yin, Xiaodong Li, Chengshi Zheng
Location: Guangzhou | Day: TBD
Despite the rapid development of neural vocoders in recent years, they usually suffer from some intrinsic challenges such as opaque modeling and a parameter-performance trade-off. In this study, we propose an innovative time-frequency (T-F) domain-based neural vocoder to resolve the above-mentioned challenges. To be specific, we bridge the connection between the classical signal range-null decomposition (RND) theory and the vocoder task: the reconstruction of the target spectrogram can be decomposed into the superimposition of the range-space and null-space components, where the former is enabled by a linear domain shift from the original mel-scale domain to the target linear-scale domain, and the latter is instantiated via a learnable network for further spectral detail generation. Accordingly, we propose a novel dual-path framework, where the spectrum is hierarchically encoded/decoded, and the cross- and narrow-band modules are elaborately devised for efficient sub-band and sequential modeling. Comprehensive experiments are conducted on the LJSpeech and LibriTTS benchmarks. Quantitative and qualitative results show that while enjoying lightweight network parameters, the proposed approach yields state-of-the-art performance among existing advanced methods. Our code and the pretrained model weights are available at https://github.com/Andong-Li-speech/RNDVoC.
5135: Boosting Few-Shot Open-Set Object Detection via Prompt Learning and Robust Decision Boundary
Authors: Zhaowei Wu, Binyi Su, Qichuan Geng, Hua Zhang, Zhong Zhou
Location: Guangzhou | Day: TBD
Few-shot Open-set Object Detection (FOOD) poses a challenge in many open-world scenarios. It aims to train an open-set detector to detect known objects while rejecting unknowns with scarce training samples. Existing FOOD methods are subject to limited visual information, and often exhibit an ambiguous decision boundary between known and unknown classes. To address these limitations, we propose the first prompt-based few-shot open-set object detection framework, which exploits additional textual information and delves into constructing a robust decision boundary for unknown rejection. Specifically, as no training data are available for unknown classes, we select pseudo-unknown samples with Attribution-Gradient based Pseudo-unknown Mining (AGPM), which leverages the discrepancy in attribution gradients to quantify uncertainty. Subsequently, we propose Conditional Evidence Decoupling (CED) to decouple and extract distinct knowledge from selected pseudo-unknown samples by eliminating opposing evidence. This optimization process can enhance the discrimination between known and unknown classes. To further regularize the model and form a robust decision boundary for unknown rejection, we introduce Abnormal Distribution Calibration (ADC) to calibrate the output probability distribution of local abnormal features in pseudo-unknown samples. Our method achieves superior performance over previous state-of-the-art approaches, improving the average recall of unknown class by 7.24% across all shots in VOC10-5-5 dataset settings and 1.38% in VOC-COCO dataset settings. Our source code is available at https://gitee.com/VR_NAVE/ced-food.
5140: Why the Agent Made that Decision: Contrastive Explanation Learning for Reinforcement Learning
Authors: Rui Zuo, Simon Khan, Zifan Wang, Garrett Ethan Katz, Qinru Qiu
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: ETF: Explainability and interpretability
Reinforcement learning (RL) has demonstrated remarkable success in solving complex decision-making problems, yet its adoption in critical domains is hindered by the lack of interpretability in its decision-making processes. Existing explainable AI (xAI) approaches often fail to provide meaningful explanations for RL agents, particularly because they overlook the contrastive nature of human reasoning, which asks "why this action instead of that one?" To address this gap, we propose a novel contrastive learning framework, named VisionMask, to explain the actions selected by RL agents. VisionMask is trained in a self-supervised manner to generate explanations by explicitly contrasting the agent’s chosen action with alternative actions in a given state.
We demonstrate the efficacy of our method through experiments across diverse RL environments, evaluating it in terms of faithfulness, robustness and complexity. Our results show that VisionMask significantly improves human understanding of agent behavior while maintaining accuracy and fidelity. Furthermore, we present examples illustrating how VisionMask can be used for counterfactual analysis. This work bridges the gap between RL and xAI, paving the way for safer and more interpretable RL systems.
5141: Causal-aware Large Language Models: Enhancing Decision-Making Through Learning, Adapting and Acting
Authors: Wei Chen, Jiahao Zhang, Haipeng Zhu, Boyan Xu, Zhifeng Hao, Keli Zhang, Junjian Ye, Ruichu Cai
Location: Guangzhou | Day: TBD
Large language models (LLMs) have shown great potential in decision-making due to the vast amount of knowledge stored within the models. However, these pre-trained models are prone to lack reasoning abilities and are difficult to adapt to new environments, further hindering their application to complex real-world tasks. To address these challenges, inspired by the human cognitive process, we propose Causal-Aware LLMs, which integrate the structural causal model (SCM) into the decision-making process to model, update, and utilize structured knowledge of the environment in a "learning-adapting-acting" paradigm. Specifically, in the learning stage, we first utilize an LLM to extract the environment-specific causal entities and their causal relations to initialize a structured causal model of the environment. Subsequently, in the adapting stage, we update the structured causal model through external feedback about the environment, via an idea of causal intervention. Finally, in the acting stage, Causal-Aware LLMs exploit structured causal knowledge for more efficient policy-making through the reinforcement learning agent. The above processes are performed iteratively to learn causal knowledge, ultimately enabling the causal-aware LLM to achieve a more accurate understanding of the environment and make more efficient decisions. Experimental results across 22 diverse tasks within the open-world game "Crafter" validate the effectiveness of our proposed method.
5148: STAMImputer: Spatio-Temporal Attention MoE for Traffic Data Imputation
Authors: Yiming Wang, Hao Peng, Senzhang Wang, Haohua Du, Chunyang Liu, Jia Wu, Guanlin Wu
Location: Guangzhou | Day: TBD
Traffic data imputation is fundamentally important to support various applications in intelligent transportation systems such as traffic flow prediction. However, existing time-to-space sequential methods often fail to effectively extract features in block-wise missing data scenarios. Meanwhile, the static graph structure for spatial feature propagation significantly constrains the model’s flexibility in handling the distribution shift issue for the nonstationary traffic data. To address these issues, this paper proposes a Spatio-Temporal Attention Mixture of experts network named STAMImputer for traffic data imputation. Specifically, we introduce a Mixture of Experts (MoE) framework to capture latent spatio-temporal features and their influence weights, effectively imputing block-wise missing data. A novel Low-rank guided Sampling Graph ATtention (LrSGAT) mechanism is designed to dynamically balance the local and global correlations across road networks. The sampled attention vectors are utilized to generate dynamic graphs that capture real-time spatial correlations. Extensive experiments are conducted on four traffic datasets for evaluation. The results show that STAMImputer achieves significant performance improvements compared with existing SOTA approaches. Our codes are available at https://github.com/RingBDStack/STAMImupter.
5152: SEP: A General Lossless Compression Framework with Semantics Enhancement and Multi-Stream Pipelines
Authors: Meng Wan, Rongqiang Cao, Yanghao Li, Jue Wang, Zijian Wang, Qi Su, Lei Qiu, Peng Shi, Yangang Wang, Chong Li
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Data Mining
Deep-learning-based lossless compression is of immense importance in real-world applications, such as cold data persistence, sensor data collection, and astronomical data transmission. However, existing compressors typically model data using single-byte symbols as tokens, which makes it hard to capture the inherent correlations and prevents effective use of the parallel capabilities of GPUs and multi-core CPUs. This paper proposes SEP, a novel lossless compression framework for most time-series backbone neural networks. We first introduce a semantic enhancement module to capture the complex intra-patch relationships of binary byte streams. To improve the compression speed, we design multi-stream pipelines that dynamically assign parallel tasks to GPU streams and multi-cores. We further propose a novel GPU memory optimization strategy, which reuses GPU memory by a shared pool across streams. We conduct experiments on seven real-world datasets and the results demonstrate that our SEP framework outperforms state-of-the-art compressors with an average speed improvement of 30.0% and an average compression ratio gain of 5.1%, which is further elevated to 7.6% with the use of pre-training models. The GPU memory footprint is reduced by up to 63.1% and by an average of 36.2%. The source code is available at: https://github.com/damonwan1/SEP.
5154: A Hybrid Multi-Factor Network with Dynamic Sequence Modeling for Early Warning of Intraoperative Hypotension
Authors: Mingyue Cheng, Jintao Zhang, Zhiding Liu, Chunli Liu
Location: Guangzhou | Day: TBD
Intraoperative hypotension (IOH) prediction using past physiological signals is crucial, as IOH may lead to inadequate organ perfusion and significantly elevate the risk of severe complications and mortality. However, current methods often rely on static modeling, overlooking the complex temporal dependencies and the inherently non-stationary nature of physiological signals. We propose a Hybrid Multi-Factor (HMF) network that formulates IOH prediction as a dynamic sequence forecasting task, explicitly capturing both temporal dependencies and physiological non-stationarity. We represent signal dynamics as multivariate time series and decompose them into trend and seasonal components, enabling separate modeling of long-term and periodic variations. Each component is encoded with a patch-based Transformer to balance computational efficiency and feature representation. To address distributional drift from evolving signals, we introduce a symmetric normalization mechanism. Experiments on both public and real-world clinical datasets show that HMF significantly outperforms competitive baselines. We hope HMF offers new insights into IOH prediction and ultimately promotes safer surgical care. Our code is available at https://github.com/Mingyue-Cheng/HMF.
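To make the trend and seasonal split concrete, here is a minimal sketch of one common way to decompose a series: a centered moving average as the trend, with the residual as the seasonal part. The abstract does not state which decomposition HMF uses, so the moving-average choice, the function names, and the window size are all assumptions for illustration.

```python
def moving_average(series, window):
    """Centered moving average; edges are padded by repeating the end values."""
    if window % 2 == 0:
        raise ValueError("window must be odd so the average is centered")
    half = window // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    return [sum(padded[i:i + window]) / window for i in range(len(series))]


def decompose(series, window=5):
    """Split a series into a smooth trend and the remaining seasonal part."""
    trend = moving_average(series, window)
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal


# Hypothetical signal: a slow upward drift plus a short repeating pattern.
signal = [0.1 * t + (1.0 if t % 4 == 0 else 0.0) for t in range(20)]
trend, seasonal = decompose(signal, window=5)
```

By construction the two components sum back to the original series, so each can be encoded separately without losing information.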
5163: In-context Learning Demonstration Generation with Text Distillation
Authors: Wuyuqing Wang, Erkun Yang, Zilan Zhou, Cheng Deng
Location: Montreal | Day: August 21st | Time: 15:00 | Session: ML: Large Language Models
In-context learning (ICL), a paradigm derived from large language models (LLMs), holds significant promise but is notably sensitive to the choice of input demonstrations. While numerous methodologies have been developed to select the optimal demonstrations from existing datasets, our work alternatively proposes to generate representative demonstrations through a Distillation-based Demonstration Generation (DDG) framework.
Specifically, our approach aims to generate demonstrations that encapsulate the essential attributes of the target dataset. Rather than optimizing these demonstrations directly, we design a generative model and try to refine it by minimizing the discrepancies between the calculative models trained on generated demonstrations and the original datasets respectively. Additionally, we leverage a teacher-student framework to stabilize the training process and improve the quality of the synthesized samples. Extensive experiments conducted across ten prevalent text datasets demonstrate that our DDG method substantially outperforms existing state-of-the-art methodologies. Our code will be available at https://github.com/wwyq1/DDG.
5170: ABNet: Mitigating Sample Imbalance in Anomaly Detection Within Dynamic Graphs
Authors: Yifan Hong, Muhammad Asif Ali, Huan Wang, Junyang Chen, Di Wang
Location: Guangzhou | Day: TBD
In dynamic graphs, detecting anomalous nodes faces challenges due to sample imbalance, stemming from the scarcity of anomalous samples and feature representation bias. Existing methods often use unsupervised or semi-supervised learning to extract anomalous samples from unlabeled data, but struggle to obtain enough anomalous instances due to their low occurrence. Moreover, GNN-based approaches often prioritize normal samples, neglecting rare anomalies. To address these issues, we propose the Anomaly Balance Network (ABNet), designed to alleviate sample imbalance and enhance anomaly detection. ABNet includes three key components: a feature extractor that compares node features across time points to avoid bias, an anomaly augmenter that amplifies anomaly details and generates diverse anomalous samples, and an anomaly detector using meta-learning to adapt to graph evolution. Experimental results show that ABNet outperforms existing methods on three real-world datasets, effectively addressing sample imbalance.
5172: Understanding Visual Detail Hallucinations of Large Vision-Language Models
Authors: Xiaoxi Sun, Jianxin Liang, Yueqian Wang, Huishuai Zhang, Dongyan Zhao
Location: Guangzhou | Day: TBD
Understanding small visual objects is crucial in fields such as video surveillance, remote sensing, and autonomous driving. In this paper, we investigate the capability of advanced large vision-language models (LVLMs) to recognize and interpret small objects in visual data. To this end, we curate a specialized dataset for evaluating fine-grained visual hallucinations, incorporating two object categories and three types of hallucinations.
First, we assess 11 state-of-the-art LVLMs, yielding several key insights. As anticipated, LVLMs perform significantly worse on queries related to small objects than on those related to regular-sized ones, and performance on regular objects proves to be an unreliable predictor of performance on small objects. This finding underscores the need for dedicated research on fine-grained visual hallucinations. Second, we evaluate three training-free methods: Scaffold, Chain of Thought (CoT), and Image Resizing, all of which result in varying degrees of improvement. Furthermore, we conduct a series of detailed ablation studies on the visual encoders of Eagle-X5, examining their performance across fine-grained visual hallucination tasks. Our findings reveal that the ConvNeXt architecture is critical for object existence recognition tasks. In contrast, for mitigating other types of hallucinations, integrating information from multiple visual encoders is significantly more effective than relying on a single encoder.
These results highlight several promising directions for advancing small object recognition with LVLMs.
5186: Learning Causally Disentangled Representations for Fair Personality Detection
Authors: Yangfu Zhu, Meiling Li, Yuting Wei, Di Liu, Yuqing Li, Bin Wu
Location: Guangzhou | Day: TBD
Personality detection aims to identify the personality traits implied in social posts. Existing methods mainly focus on learning the mapping between user-generated posts and personality trait labels but inevitably suffer from potential harm caused by individual bias, as these posts are written by authors from different backgrounds. Learning such spurious associations between posts and traits may lead to the formation of stereotypes, ultimately restricting the detection of personality across different kinds of individuals. To tackle the issue, we first investigate individual bias in personality detection from the causality perspective. We propose an Interventional Personality Detection Network (IPDN) to learn implicit confounders in user-generated posts and exploit the true causal effect to train the detection model. Specifically, our IPDN disentangles the causal and biased features behind user-generated posts, and then the biased features are accumulatively clustered as confounder prototypes as the training iterations increase. In parallel, the reconstruction network is reused to approximate backdoor adjustment on raw posts, ensuring that traits see each confounder equally before detection. Extensive experiments conducted on three real-world datasets demonstrate that our IPDN outperforms state-of-the-art methods in personality detection.
5197: Conditional Causal Representation Learning for Heterogeneous Single-cell RNA Data Integration and Prediction
Authors: Jiayi Dong, Jiahao Li, Fei Wang
Location: Guangzhou | Day: TBD
Show Abstract
Single-cell sequencing technology provides deep insights into gene activity at the individual cell level, facilitating the study of gene regulatory mechanisms. However, observed gene expression is often influenced by confounding factors such as batch effects, perturbations, and spatial position, which obscure the true gene regulatory network that governs the cell’s intrinsic state. To address these challenges, we propose scConCRL, a novel conditionally causal representation learning framework designed to extract the true gene regulatory relationships independent of confounding information. By considering both fine-grained molecular gene variables and coarse-grained latent domain variables, scConCRL not only uncovers the intrinsic biological signals but also models the complex relationships between these variables. This dual function enables the separation of genuine cellular states from domain information, providing valuable insights for downstream analyses and biological discovery. We demonstrate the effectiveness of our model on multi-domain datasets from different platforms and perturbation conditions, showing its ability to accurately disentangle confounding influences and discover novel gene relationships. Extensive comparisons across various scenarios illustrate the superior performance of scConCRL in several tasks compared to existing methods.
5201: Two-stage Risk Control with Application to Ranked Retrieval
Authors: Yunpeng Xu, Mufang Ying, Wenge Guo, Zhi Wei
Location: Guangzhou | Day: TBD
Show Abstract
Practical machine learning systems often operate in multiple sequential stages, as seen in ranking and recommendation systems, which typically include a retrieval phase followed by a ranking phase. Effectively assessing prediction uncertainty and ensuring effective risk control in such systems pose significant challenges due to their inherent complexity. To address these challenges, we developed two-stage risk control methods based on the recently proposed learn-then-test (LTT) and conformal risk control (CRC) frameworks. Unlike the methods in prior work that address multiple risks, our approach leverages the sequential nature of the problem, resulting in reduced computational burden. We provide theoretical guarantees for our proposed methods and design novel loss functions tailored for ranked retrieval tasks. The effectiveness of our approach is validated through experiments on two large-scale, widely-used datasets: MSLR-Web and Yahoo LTRC.
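As background for the CRC framework mentioned in this abstract, the following is a minimal sketch of the standard single-stage conformal risk control threshold rule, not the authors' two-stage method. It assumes losses bounded by `bound` and non-increasing in the parameter; the retrieval-size example, the function name `crc_threshold`, and the calibration ranks are hypothetical illustrations.

```python
def crc_threshold(loss_table, lambdas, alpha, bound=1.0):
    """Pick the smallest lambda whose adjusted calibration risk is <= alpha.

    loss_table[i][j] is the loss of calibration example i at lambdas[j].
    Assumes losses lie in [0, bound] and are non-increasing in lambda.
    """
    n = len(loss_table)
    for j, lam in enumerate(lambdas):
        empirical_risk = sum(row[j] for row in loss_table) / n
        # Finite-sample correction used by conformal risk control.
        adjusted = (n / (n + 1)) * empirical_risk + bound / (n + 1)
        if adjusted <= alpha:
            return lam
    # No candidate meets the target: fall back to the most conservative value.
    return lambdas[-1]


# Hypothetical calibration data: rank at which the relevant item appears.
ranks = [1, 2, 1, 3, 5, 2, 1, 4, 2]
sizes = [1, 2, 3, 4, 5]  # candidate retrieval-set sizes (lambda)
# Loss is 1 when the relevant item falls outside the top-lambda results.
losses = [[1.0 if r > lam else 0.0 for lam in sizes] for r in ranks]

chosen_size = crc_threshold(losses, sizes, alpha=0.35)
```

With these illustrative numbers, the rule scans the sizes in increasing order and stops at the first one whose corrected miss rate falls below the target level.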
5217: A Cross-Modal Densely Guided Knowledge Distillation Based on Modality Rebalancing Strategy for Enhanced Unimodal Emotion Recognition
Authors: Shuang Wu, Heng Liang, Yong Zhang, Yanlin Chen, Ziyu Jia
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal emotion recognition has garnered significant attention for its ability to integrate data from multiple modalities to enhance performance. However, physiological signals like electroencephalogram are more challenging to acquire than visual data due to higher collection costs and complexity. This limits the practical application of multimodal networks. To address this issue, this paper proposes a cross-modal knowledge distillation framework for emotion recognition. The framework aims to leverage the strengths of a multimodal teacher network to enhance the performance of a unimodal student network using only the visual modality as input. Specifically, we design a prototype-based modality rebalancing strategy, which dynamically adjusts the convergence rates of different modalities to mitigate the modality imbalance issue. It enables the teacher network to better integrate multimodal information. Building upon this, we develop a Cross-Modal Densely Guided Knowledge Distillation (CDGKD) method, which effectively transfers knowledge extracted by the multimodal teacher network to the unimodal student network. Our CDGKD uses multi-level teacher assistant networks to bridge the teacher-student gap and employs dense guidance to reduce error accumulation during knowledge transfer. Experimental results demonstrate that the proposed framework outperforms existing methods on two public emotion datasets, providing an effective solution for emotion recognition in modality-constrained scenarios.
5223: Divide and Conquer: Coordinating Multiplex Mixture of Graph Learners to Handle Multi-Omics Analysis
Authors: Zhihao Wu, Jielong Lu, Jiajun Yu, Sheng Zhou, Yueyang Pi, Haishuai Wang
Location: Guangzhou | Day: TBD
Show Abstract
Graph learning has shown significant advantages in organizing and leveraging complex data, making it promising for numerous real-world applications with heterogeneous information, particularly multi-omics data analysis. Despite its potential in such scenarios, existing methods are still in their infancy, lacking architectural potential and struggling to handle such complex data. In this paper, we propose the Multiplex Mixture of Graph Learners (MMoG) framework. MMoG first conducts fine-grained processing of consensus and unique information, constructing consistent features and multiplex graph structures. Then, a macroscopically shared group of sub-GNNs with diverse orders and architectures synergistically learn representations, providing a foundation for strong interaction between different views. Inspired by the mixture of experts (MoE), each sample in different omics adaptively determines the neighborhood ranges and architectures for information aggregation, while blocking unsuitable sub-GNNs. MMoG treats the complex multi-omics analysis as a multi-view learning problem, and essentially decomposes it into multiple sub-problems, allowing each omics/view to solve intersecting yet unique sub-problem groups. Additionally, we introduce mutual information-driven orthogonal loss and balancing loss to avoid view collapse. Extensive experiments on multi-omics data across multiple cancer types highlight MMoG’s superiority.
5234: SCVBench: A Benchmark with Multi-turn Dialogues for Story-Centric Video Understanding
Authors: Sisi You, Bowen Yuan, Bing-Kun Bao
Location: Guangzhou | Day: TBD
Show Abstract
Video understanding seeks to enable machines to interpret visual content across three levels: action, event, and story. Existing models are limited in their ability to perform high-level long-term story understanding, due to (1) the oversimplified treatment of temporal information and (2) the training bias introduced by action/event-centric datasets. To address this, we introduce SCVBench, a novel benchmark for story-centric video understanding. SCVBench evaluates LVLMs through an event ordering task decomposed into sub-questions leading to a final question, quantitatively measuring historical dialogue exploration. We collected 1,253 final questions and 6,027 sub-question pairs from 925 videos, constructing continuous multi-turn dialogues. Experimental results show that while closed-source GPT-4o outperforms other models, most open-source LVLMs struggle with story-centric video understanding. Additionally, our StoryCoT model significantly surpasses open-source LVLMs on SCVBench. SCVBench aims to advance research by comprehensively analyzing LVLMs’ temporal reasoning and comprehension capabilities. Code can be accessed at https://github.com/yuanrr/SCVBench.
5240: Query-Based and Unnoticeable Graph Injection Attack from Neighborhood Perspective
Authors: Chang Liu, Hai Huang, Xingquan Zuo
Location: Guangzhou | Day: TBD
Show Abstract
The robustness of Graph Neural Networks (GNNs) has become an increasingly important topic due to their expanding range of applications. Various attack methods have been proposed to explore the vulnerabilities of GNNs, ranging from Graph Modification Attacks (GMA) to the more practical and flexible Graph Injection Attacks (GIA). However, existing methods face two key challenges: (i) their reliance on surrogate models, which often leads to reduced attack effectiveness due to structural differences and prior biases, and (ii) existing GIA methods often sacrifice attack success rates in undefended settings to bypass certain defense models, thereby limiting their overall effectiveness. To overcome these limitations, we propose QUGIA, a Query-based and Unnoticeable Graph Injection Attack. QUGIA injects nodes by first selecting edges based on victim node connections and then generating node features using a Bayesian framework. This ensures that the injected nodes are similar to the original graph nodes, implicitly preserving homophily and making the attack more unnoticeable. Unlike previous methods, QUGIA does not rely on surrogate models, thereby avoiding performance degradation and achieving better generalization. Extensive experiments on six real-world datasets with diverse characteristics demonstrate that QUGIA achieves unnoticeable attacks and outperforms state-of-the-art attackers. Our code is available at: https://anonymous.4open.science/r/QUGIA-588E/.
5246: Federated Multi-view Graph Clustering with Incomplete Attribute Imputation
Authors: Wei Feng, Zeyu Bi, Qianqian Wang, Bo Dong
Location: Guangzhou | Day: TBD
Show Abstract
Federated Multi-View Clustering (FedMVC) aims to uncover consistent clustering structures from distributed multi-view data for clustering while preserving data privacy. However, existing FedMVC methods under vertical settings either ignore the ubiquitous incomplete view issue or require uploading data features, which may lead to privacy leakage or induce high communication costs. To mitigate the view incompleteness issue and simultaneously maintain privacy and efficiency, we propose a novel Federated Multi-view Graph Clustering with Incomplete Attribute Imputation (FMVC-IAI). This method constructs a consensus graph structure through complementary multi-view data and then utilizes a non-parametric graph neural network (GNN) to impute missing features. Additionally, it utilizes the adjacency graph as the knowledge carrier to share and fuse the multi-view information. To alleviate the high communication cost due to graph sharing, we propose to share the anchor graph for global adjacency graph construction, which reduces communication cost and also helps to reduce privacy leakage risk. Extensive experiments demonstrate the superiority of our method in FedMVC tasks with incomplete views.
5250: Tensorial Multi-view Clustering with Deep Anchor Graph Projection
Authors: Wei Feng, Dongyuvan Wei, Qianqian Wang, Bo Dong
Location: Guangzhou | Day: TBD
Show Abstract
Multi-view clustering (MVC) has emerged as an important unsupervised multi-view learning method that leverages consistent and complementary information to enhance clustering performance. Recently, tensorized MVC, which processes multi-view data as a tensor to capture their cross-view information, has received considerable attention. However, existing tensorized MVC methods generally overlook deep structures within each view and rely on post-processing to derive clustering results, leading to potential information loss and degraded performance. To address these issues, we develop Tensorial Multi-view Clustering with Deep Anchor Graph Projection (TMVC-DAGP), which performs deep projection on the anchor graph, thus improving model scalability. Besides, we utilize a sparsity regularization to eliminate the redundancy and enforce the projected anchor graph to retain a clear clustering structure. Furthermore, TMVC-DAGP leverages weighted Tensor Schatten $p$-norm to exploit the consistent and complementary information. Extensive experiments on multiple datasets demonstrate TMVC-DAGP’s effectiveness and superiority.
5268: CRAFT: Time Series Forecasting with Cross-Future Behavior Awareness
Authors: Yingwei Zhang, Ke Bu, Zhuoran Zhuang, Tao Xie, Yao Yu, Dong Li, Yang Guo, Detao Lv
Location: Guangzhou | Day: TBD
Show Abstract
The past decades have witnessed significant advancements in time series forecasting (TSF) across various real-world domains, including e-commerce and disease spread prediction. However, TSF is usually constrained by the uncertainty dilemma of predicting future data with limited past observations. To address this problem, we explore the use of Cross-Future Behavior (CFB) in TSF, which occurs before the current time but takes effect in the future. We leverage CFB features and propose the CRoss-Future Behavior Awareness based Time Series Forecasting method (CRAFT). The core idea of CRAFT is to utilize the trend of cross-future behavior to mine the trend of time series data to be predicted. Specifically, to address the sparse and partial flaws of cross-future behavior, CRAFT employs the Koopman Predictor Module to extract the key trend and the Internal Trend Mining Module to supplement the unknown area of the cross-future behavior matrix. Then, we introduce the External Trend Guide Module with a hierarchical structure to acquire more representative trends from higher levels. Finally, we apply the demand-constrained loss to calibrate the distribution deviation of prediction results. We conduct experiments on a real-world dataset. Experiments on both an offline large-scale dataset and an online A/B test demonstrate the effectiveness of CRAFT. Our dataset and code are available at https://github.com/CRAFTinTSF/CRAFT.
5275: Improving Consistency Identification in Task-oriented Dialogue Through Multi-Agent Collaboration
Authors: Peng Wang, Shuo Li, Ruoxi Zhou, Qiguang Chen, Xiao Xu, Hao Fei, Dagang Li, Wanxiang Che, Libo Qin
Location: Guangzhou | Day: TBD
Show Abstract
Consistency identification in task-oriented dialog (CI-ToD) typically consists of three sub-tasks: User Query Inconsistency (QI) identification, Dialogue History Inconsistency (HI) identification, and Knowledge Base Inconsistency (KBI) identification, which aim to determine inconsistent relationships between system response and user query, dialogue history, and knowledge base. Previous approaches focus on the exploration of deep learning models for CI-ToD. While these models achieve remarkable progress, they still rely on large amounts of labeled data, which are hard to obtain in real-world scenarios. Motivated by this, in this paper, we aim to explore large language models for CI-ToD, which do not require any training data. In addition, we further introduce a multi-agent collaboration framework (MAC-CIToD) to model the interaction across three sub-tasks in CI-ToD, including (1) Full Connection paradigm, (2) Cycle Connection paradigm, and (3) Central Connection paradigm, which effectively builds interaction across QI, HI, and KBI. Experiments on the standard benchmark reveal that our framework achieves superior performance. Additionally, we compare MAC-CIToD with the most advanced trained approaches and find that its zero-shot performance on most metrics even surpasses that of models after training on the CI-ToD dataset.
5295: MMNet: Missing-Aware and Memory-Enhanced Network for Multivariate Time Series Imputation
Authors: Xiaoye Miao, Han Shi, Yi Yuan, Daozhan Pan, Yangyang Wu, Xiaohua Pan
Location: Guangzhou | Day: TBD
Show Abstract
Multivariate time series (MTS) data in real-world scenarios are often incomplete, which hinders effective data analysis. Therefore, MTS imputation has been widely studied to facilitate various MTS tasks. Existing imputation methods primarily initialize missing values with zeros in order to perform effective incomplete MTS encoding, which impedes the model’s capacity to precisely discern the missing distribution. Moreover, these methods often overlook the global similarity in time series and are limited to using local information within the sample. To this end, we propose a novel multivariate time series imputation network model, named MMNet. MMNet introduces a Missing-Aware Embedding (MAE) approach to adaptively represent incomplete MTS, allowing the model to better distinguish between missing and observed data. Furthermore, we design a Memory-Enhanced Encoder (MEE) aimed at modeling prior knowledge through a memory mechanism, enabling better utilization of the global similarity within the time series. Building upon this, MMNet incorporates a Multi-scale Mixing architecture (MSM) that leverages information from multiple scales to enhance the final imputation. Extensive experiments on four public real-world datasets demonstrate that MMNet yields a more than 25% gain in performance compared with the state-of-the-art methods.
5299: WenyanGPT: A Large Language Model for Classical Chinese Tasks
Authors: Xinyu Yao, Mengdi Wang, Bo Chen, Xiaobing Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Classical Chinese, as the core carrier of Chinese culture, plays a crucial role in the inheritance and study of ancient literature. However, existing natural language processing models primarily optimize for Modern Chinese, resulting in inadequate performance on Classical Chinese. This paper presents a comprehensive solution for Classical Chinese language processing. By continuing pre-training and instruction fine-tuning on the LLaMA3-8B-Chinese model, we construct a large language model, WenyanGPT, which is specifically designed for Classical Chinese tasks. Additionally, we develop an evaluation benchmark dataset, WenyanBENCH. Experimental results on WenyanBENCH demonstrate that WenyanGPT significantly outperforms current advanced LLMs in various Classical Chinese tasks. We make the model’s training data, instruction fine-tuning data, and evaluation benchmark dataset publicly available to promote further research and development in the field of Classical Chinese processing.
5304: Generic Adversarial Attack Framework Against Vertical Federated Learning
Authors: Yimin Liu, Peng Jiang
Location: Guangzhou | Day: TBD
Show Abstract
Vertical federated learning (VFL) enables feature-level collaboration by incorporating scattered attributes from aligned samples, and allows each party to contribute its personalized input to joint training and inference. The injection of adversarial inputs can mislead the joint inference towards the attacker’s will, forcing other benign parties to make negligible contributions and lose rewards regarding the importance of their contributions. However, most attacks require server model queries, subsets of complete test samples, or labeled auxiliary images from the training domain. These extra requirements are not practical for real-world VFL applications. In this paper, we propose PGAC, a novel and practical attack framework for crafting adversarial inputs to dominate joint inference, which does not rely on the above requirements. PGAC advances prior attacks by requiring only access to auxiliary images from non-training domains. PGAC learns generalized label-indicative embeddings and estimates class-transferable probabilities across domains to generate a proxy model that closely approximates the server model. PGAC then augments images by emphasizing salient regions with class activation maps, creating a diverse shadow input set that resembles influential test inputs. With proxy fidelity and input diversity, PGAC crafts transferable adversarial inputs. Evaluation on diverse model architectures confirms the effectiveness of PGAC.
5317: ReplayCAD: Generative Diffusion Replay for Continual Anomaly Detection
Authors: Lei Hu, Zhiyong Gan, Ling Deng, Jinglin Liang, Lingyu Liang, Shuangping Huang, Tianshui Chen
Location: Guangzhou | Day: TBD
Show Abstract
Continual Anomaly Detection (CAD) enables anomaly detection models to learn new classes while preserving knowledge of historical classes. CAD faces two key challenges: catastrophic forgetting and segmentation of small anomalous regions. Existing CAD methods store image distributions or patch features to mitigate catastrophic forgetting, but they fail to preserve pixel-level detailed features for accurate segmentation. To overcome this limitation, we propose ReplayCAD, a novel diffusion-driven generative replay framework that replays high-quality historical data, thus effectively preserving pixel-level detailed features. Specifically, we compress historical data by searching for a class semantic embedding in the conditional space of the pre-trained diffusion model, which can guide the model to replay data with fine-grained pixel details, thus improving the segmentation performance. However, relying solely on semantic features results in limited spatial diversity. Hence, we further use spatial features to guide data compression, achieving precise control of sample space, thereby generating more diverse data. Our method achieves state-of-the-art performance in both classification and segmentation, with notable improvements in segmentation: 11.5% on VisA and 8.1% on MVTec. Our source code is available at https://github.com/HULEI7/ReplayCAD.
5324: ListenNet: A Lightweight Spatio-Temporal Enhancement Nested Network for Auditory Attention Detection
Authors: Cunhang Fan, Xiaoke Yang, Hongyu Zhang, Ying Chen, Lu Li, Jian Zhou, Zhao Lv
Location: Guangzhou | Day: TBD
Show Abstract
Auditory attention detection (AAD) aims to identify the direction of the attended speaker in multi-speaker environments from brain signals, such as Electroencephalography (EEG) signals. However, existing EEG-based AAD methods overlook the spatio-temporal dependencies of EEG signals, limiting their decoding and generalization abilities. To address these issues, this paper proposes a Lightweight Spatio-Temporal Enhancement Nested Network (ListenNet) for AAD. The ListenNet has three key components: Spatio-temporal Dependency Encoder (STDE), Multi-scale Temporal Enhancement (MSTE), and Cross-Nested Attention (CNA). The STDE reconstructs dependencies between consecutive time windows across channels, improving the robustness of dynamic pattern extraction. The MSTE captures temporal features at multiple scales to represent both fine-grained and long-range temporal patterns. In addition, the CNA integrates hierarchical features more effectively through novel dynamic attention mechanisms to capture deep spatio-temporal correlations. Experimental results on three public datasets demonstrate the superiority of ListenNet over state-of-the-art methods in both subject-dependent and challenging subject-independent settings, while reducing the trainable parameter count by approximately 7 times. Code is available at: https://github.com/fchest/ListenNet.
5328: NS4S: Neighborhood Search for Scheduling Problems Via Large Language Models
Authors: Junjie Zhang, Canhui Luo, Zhouxing Su, Qingyun Zhang, Zhipeng Lü, Junwen Ding, Yan Jin
Location: Guangzhou | Day: TBD
Show Abstract
Large Language Models (LLMs) have emerged as a promising technology for solving combinatorial optimization problems. However, their direct application to scheduling problems remains limited due to the inherent complexity of these problems. This paper proposes an LLM-based neighborhood search method that leverages LLMs to tackle the job shop scheduling problem (JSP) and its variants.
The main contributions of this work are threefold. First, we introduce a novel LLM-guided neighborhood evaluation strategy that guides local search by dynamically adjusting operation weights. Second, we develop a verification evolution (VeEvo) framework to mitigate the hallucination effects of LLMs, enabling the generation of high-quality heuristics for weight updates. Third, we integrate this framework with the weighted neighborhood evaluation strategy to effectively guide the search towards promising regions.
Extensive experiments are conducted on 349 benchmark instances across three classical scheduling problems. The results demonstrate that our algorithm significantly outperforms existing state-of-the-art methods. For JSP, our algorithm reduces the average optimality gap from 10.46% to 1.35% on Taillard’s instances compared to reinforced adaptive staircase curriculum learning. For flexible JSP (FJSP), it reduces the gap from 13.24% to 0.05% on Brandimarte’s instances compared to deep reinforcement learning methods. Furthermore, for FJSP with sequence dependent setup time, our algorithm updates 9 upper bounds for benchmark instances.
5338: Federated Stochastic Bilevel Optimization with Fully First-Order Gradients
Authors: Yihan Zhang, Rohit Dhaipule, Chiu C. Tan, Haibin Ling, Hongchang Gao
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Federated Learning
Show Abstract
Federated stochastic bilevel optimization has been actively studied in recent years due to its widespread applications in machine learning. However, most existing federated stochastic bilevel optimization algorithms require the computation of second-order Hessian and Jacobian matrices, which leads to longer running times in practice. To address these challenges, we propose a novel federated stochastic variance-reduced bilevel gradient descent algorithm that relies solely on first-order oracles. Specifically, our approach does not require the computation of second-order Hessian and Jacobian matrices, significantly reducing running time. Furthermore, we introduce a novel learning rate mechanism, i.e., a constant single-time-scale learning rate, to coordinate the update of different variables. We also present a new strategy to establish the convergence rate of our algorithm. Finally, the extensive experimental results confirm the efficacy of our proposed algorithm.
5355: Explainable Graph Neural Networks via Structural Externalities
Authors: Lijun Wu, Dong Hao, Zhiyi Fan
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) have achieved outstanding performance across a wide range of graph-related tasks. However, their "black-box" nature poses significant challenges to their explainability, and existing methods often fail to effectively capture the intricate interaction patterns among nodes within the network. In this work, we propose a novel explainability framework, GraphEXT, which leverages cooperative game theory and the concept of social externalities. GraphEXT partitions graph nodes into coalitions, decomposing the original graph into independent subgraphs. By integrating graph structure as an externality and incorporating the Shapley value under externalities, GraphEXT quantifies node importance through their marginal contributions to GNN predictions as the nodes transition between coalitions. Unlike traditional Shapley value-based methods that primarily focus on node attributes, our GraphEXT places greater emphasis on the interactions among nodes and the impact of structural changes on GNN predictions. Experimental studies on both synthetic and real-world datasets show that GraphEXT outperforms existing baseline methods in terms of fidelity across diverse GNN architectures, significantly enhancing the explainability of GNN models.
5361: Witnesses for Answer Sets of Basic Logic Programs
Authors: Yisong Wang, Xianglong Wang, Zhongtao Xie, Thomas Eiter
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: KRR: Logic programming
Show Abstract
Explanation plays an important role in the decisions of both symbolic and neural network-based AI systems. Logic programs under answer set semantics (ASP) have been a typical declarative reasoning and problem-solving paradigm that has extensive applications in various AI domains. In this paper, we consider the issue of explanation for logic programs with abstract constraint atoms (c-atoms) under SPT-answer set semantics. Such c-atoms are general enough to capture complex constructors of logic programs, including aggregates, and the SPT-answer sets exclude circular justifications that other semantics have. We propose a minimal reduct for logic programs with c-atoms that yields a new semantic characterization of SPT-answer sets, and then introduce an extension of resolution for clauses with c-atoms. As we show, every atom in an SPT-answer set enjoys an extended resolution proof from the minimal reduct of its logic program. Finally, we present minimal sufficient subsets of logic programs (witnesses) to structure such an extended resolution proof for an atom in an SPT-answer set. Our results contribute to the justification of answer sets and provide a basis for explainability of ASP-based applications.
5377: CABIN: Debiasing Vision-Language Models Using Backdoor Adjustments
Authors: Bo Pang, Tingrui Qiao, Caroline Walker, Chris Cunningham, Yun Sing Koh
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: AI Ethics, Trust, Fairness (3/3)
Show Abstract
Vision-language models (VLMs) have demonstrated strong zero-shot inference capabilities but may exhibit stereotypical biases toward certain demographic groups. Consequently, downstream tasks leveraging these models may yield unbalanced performance across different target social groups, potentially reinforcing harmful stereotypes. Mitigating such biases is critical for ensuring fairness in practical applications. Existing debiasing approaches typically rely on curated face-centric datasets for fine-tuning or retraining, risking overfitting and limiting generalisability. To address this issue, we propose a novel framework, CABIN (Causal Adjustment Based INtervention). It leverages a causal framework to identify sensitive attributes in images as confounding factors. Employing a learned mapper, which is trained on general large-scale image-text pairs rather than face-centric datasets, CABIN may use text to adjust sensitive attributes in the image embedding, ensuring independence between these sensitive attributes and image embeddings. This independence enables a backdoor adjustment for unbiased inference without the drawbacks of additional fine-tuning or retraining on narrowly tailored datasets. Through comprehensive experiments and analyses, we demonstrate that CABIN effectively mitigates biases and improves fairness metrics while preserving the zero-shot strengths of VLMs. The code is available at: https://github.com/ipangbo/causal-debias
5383: App2Exa: Accelerating Exact kNN Search via Dynamic Cache-Guided Approximation
Authors: Ke Li, Leong Hou U, Shuo Shang
Location: Guangzhou | Day: TBD
Show Abstract
The k-nearest neighbor (kNN) query is a cornerstone of similarity-based applications across various domains. While prior work has enhanced kNN search efficiency, it typically focuses on approximate methods for high-dimensional data or exact methods for low-dimensional data, often assuming static query and data distributions. This creates a significant gap in accelerating exact kNN search for low-to-medium dimensional data with dynamic query distributions. To fill this gap, we propose App2Exa, a cache-guided framework that integrates approximate and exact kNN search. App2Exa utilizes a dynamically maintained cache graph index to retrieve approximate results, which subsequently guide exact search using a VP-Tree with a best-first strategy. A benefit-driven caching mechanism further optimizes performance by prioritizing vectors based on frequency, recency, and computational cost. Experimental results demonstrate that App2Exa significantly boosts efficiency, providing a robust and scalable solution for evolving query patterns and enabling exact kNN search to support higher dimensionality more effectively.
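The idea of letting approximate results guide an exact search can be illustrated with a deliberately simplified sketch. This is not App2Exa itself: it replaces the cache graph index with a plain list of cached candidate ids and the VP-Tree with a single-pivot triangle-inequality filter, and every name in it (`exact_knn_seeded`, `seed_ids`, the random data) is a hypothetical illustration. The point it shows is that a good initial k-th distance from approximate candidates lets the exact phase skip many distance computations while still returning the exact answer.

```python
import heapq
import math
import random


def exact_knn_seeded(points, pivot, pivot_dists, query, k, seed_ids):
    """Exact kNN: seed the bound with approximate candidates, then prune."""
    d_query_pivot = math.dist(query, pivot)
    heap = []      # max-heap of the best k so far, stored as (-distance, index)
    seen = set()
    computed = 0   # number of full distance computations performed

    def offer(i):
        nonlocal computed
        d = math.dist(query, points[i])
        computed += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i))

    # Phase 1: approximate candidates give an initial k-th distance bound.
    for i in seed_ids:
        if i not in seen:
            seen.add(i)
            offer(i)

    # Phase 2: exact scan; |d(q,pivot) - d(p,pivot)| lower-bounds d(q,p).
    for i in range(len(points)):
        if i in seen:
            continue
        lower_bound = abs(d_query_pivot - pivot_dists[i])
        if len(heap) == k and lower_bound > -heap[0][0]:
            continue  # cannot beat the current k-th best, skip safely
        offer(i)

    return sorted((-nd, i) for nd, i in heap), computed


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
pivot = points[0]
pivot_dists = [math.dist(p, pivot) for p in points]
query = (0.5, 0.5)

# Hypothetical cache: pretend these ids came from an approximate index.
seed_ids = sorted(range(200), key=lambda i: math.dist(query, points[i]))[:8]
result, computed = exact_knn_seeded(points, pivot, pivot_dists, query, 5, seed_ids)
```

Because pruning only discards points whose lower bound exceeds the current k-th best distance, the result matches a brute-force scan regardless of how good the seeds are; better seeds only reduce the work.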
5386: Constrained Sequential Inference in Machine Learning Using Constraint Programming
Authors: Virasone Manibod, David Saikali, Gilles Pesant
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Constraint Satisfaction and Optimization (1/3)
Show Abstract
Sequence models in machine learning often struggle to exhibit long-term structure. We consider this problem at inference time in the context of enforcing constraints that are not necessarily featured in the dataset on which the generative model was trained. The difficulty lies in imposing previously-unseen structure while staying close to the training dataset. It is particularly hard for long-term structure, which requires balancing foresight over many yet-to-be generated tokens and the immediacy of next-token predictions from the sequence model. We address this problem by introducing our neurosymbolic framework GeAI-BLAnC. The learned probabilities of the sequence model are mixed in with the marginal probabilities computed from a constraint programming / belief propagation framework applied to a constraint programming model expressing the desired structure. The next predicted token is then selected from the resulting probability distribution. Experiments in the context of molecule and music generation show that we can achieve the structure imposed post-training without straying too much from the structure of the dataset learned during training.
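The mixing step described in this abstract can be sketched as follows. The abstract does not state the mixing rule, so the convex combination, the weight `lam`, the rule that a zero marginal forbids a token, and all function names are assumptions for illustration. The constraint marginals are supplied as a plain list here rather than computed by belief propagation.

```python
import random

def mix_distributions(model_probs, constraint_marginals, lam=0.5):
    """Blend model probabilities with constraint marginals and renormalize.

    Assumption: a token whose constraint marginal is 0 is infeasible and
    receives probability 0 regardless of the model's preference.
    """
    if len(model_probs) != len(constraint_marginals):
        raise ValueError("distributions must cover the same vocabulary")
    mixed = [
        0.0 if m == 0.0 else (1.0 - lam) * p + lam * m
        for p, m in zip(model_probs, constraint_marginals)
    ]
    total = sum(mixed)
    if total == 0.0:
        raise ValueError("no feasible token")
    return [x / total for x in mixed]

def sample_token(probs, rng):
    """Draw the next token index from the mixed distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

model_probs = [0.6, 0.3, 0.1]           # hypothetical next-token probabilities
constraint_marginals = [0.0, 0.5, 0.5]  # hypothetical marginals; token 0 is infeasible
mixed = mix_distributions(model_probs, constraint_marginals, lam=0.5)
token = sample_token(mixed, random.Random(0))
```

In this toy case the model's favorite token is ruled out by the constraints, and the remaining mass is shared between the two feasible tokens.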
5401: HPDM: A Hierarchical Popularity-aware Debiased Modeling Approach for Personalized News Recommender
Authors: Xiangfu He, Qiyao Peng, Minglai Shao, Hongtao Liu
Location: Guangzhou | Day: TBD
Show Abstract
News recommender systems face inherent challenges from popularity bias, where user interactions concentrate heavily on a small subset of popular news. While existing debiasing methods have made progress in recommendation, they often overlook two critical aspects: the different granularity of news popularity (across titles, categories, etc.) and how hierarchical popularity levels distinctly influence user interest modeling. Hence, in this paper, we propose a hierarchical causal debiasing framework that effectively captures genuine user interests while mitigating popularity bias at different granularity levels. Our framework incorporates two key components during training: (1) a hierarchical popularity-aware user modeling module to capture user interests by distinguishing popular and unpopular interactions at different granularities of news content; and (2) a dual-view structure combining counterfactual reasoning for popular-view news with inverse propensity weighting for unpopular-view news to model users' genuine interests. During inference, our framework removes popularity-induced effects to predict relatedness between user and candidate news. Extensive experiments on two widely-used datasets, MIND and Adressa, demonstrate that our framework significantly outperforms existing baseline approaches in addressing the long-tail distribution challenge. Our code is available at https://github.com/hexiangfu123/HPDM.
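Inverse propensity weighting, mentioned in this abstract, can be sketched in a few lines. This is a generic textbook form, not the HPDM loss: the clipping threshold, the function name, and the idea that propensities come from item popularity are assumptions.

```python
import math

def ipw_log_loss(labels, preds, propensities, clip=0.05):
    """Binary log loss where each sample is weighted by 1 / propensity.

    Propensities are clipped from below (an assumed safeguard) so that
    rarely exposed items do not produce unbounded weights.
    """
    total = 0.0
    for y, p, q in zip(labels, preds, propensities):
        weight = 1.0 / max(q, clip)
        total += weight * -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [1, 0, 1]
preds = [0.8, 0.3, 0.6]
uniform = ipw_log_loss(labels, preds, [1.0, 1.0, 1.0])   # reduces to plain log loss
reweighted = ipw_log_loss(labels, preds, [1.0, 1.0, 0.1])  # third item is rarely exposed
```

Down-weighting nothing and up-weighting rarely exposed interactions is what lets the loss pay more attention to long-tail items.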
5402: Community-Aware Graph Transformer for Brain Disorder Identification
Authors: Shengbing Pei, Jiajun Ma, Zhao Lv, Chao Zhang, Jihong Guan
Location: Guangzhou | Day: TBD
Show Abstract
An abnormal brain functional network is an effective biomarker for brain disease diagnosis. Most existing methods focus on mining discriminative information from whole-brain connectivity patterns. However, multi-level collaboration is the foundation of efficient brain function: in addition to the whole-brain network, there are multiple sub-networks that can quickly integrate and process specific cognitive functions, forming the modular community structure of the brain. To address this gap, we propose a novel method, community-aware graph Transformer (CAGT), that integrates the community information of sub-networks and the topological information of the brain graph into the Transformer architecture for better brain disorder identification. CAGT enhances information exchange within and between functional communities through dual-scale feature fusion, capturing interactive information across various scales. Additionally, it incorporates prior knowledge to design brain region position encoding and guide the self-attention, thereby enhancing the spatial awareness of the Transformer and aligning it with the brain’s natural information transfer process. Experimental results indicate that our proposed method significantly improves performance on both large and small datasets, and can reliably capture the interactions between sub-networks, demonstrating its generalization and interpretability.
5405: Feint and Attack: Jailbreaking and Protecting LLMs via Attention Distribution Modeling
Authors: Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Zheng Liu, Lirong Qiu, Zaisheng Ye
Location: Guangzhou | Day: TBD
Show Abstract
Most jailbreak methods for large language models (LLMs) focus on superficially improving attack success through manually defined rules. However, they fail to uncover the underlying mechanisms within target LLMs that explain why an attack succeeds or fails. In this paper, we propose investigating the phenomenon of jailbreaks and defenses for LLMs from the perspective of attention distributions within the models. A preliminary experiment reveals that the success of a jailbreak is closely linked to the LLM’s attention on sensitive words. Inspired by this interesting finding, we propose incorporating critical signals derived from internal attention distributions within LLMs, namely Attention Intensity on Sensitive Words and Attention Dispersion Entropy, to guide both attacks and defenses. Drawing inspiration from the concept of "Feint and Attack", we introduce an attention-guided jailbreak model, ABA, which redirects the model’s attention to benign contexts, and an attention-based defense model, ABD, designed to detect attacks by analyzing internal attention entropy. Experimental results demonstrate the superiority of our proposal when compared to SOTA baselines.
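The two attention signals named in this abstract can be illustrated with a minimal sketch. The paper's exact definitions are not given here, so the formulas below are assumptions: intensity is taken as the share of attention mass on sensitive token positions, and dispersion entropy as the Shannon entropy of the normalized attention weights. The function names and toy weights are hypothetical.

```python
import math

def attention_intensity(attn, sensitive_idx):
    """Share of total attention mass placed on the given token positions."""
    total = sum(attn)
    return sum(attn[i] for i in sensitive_idx) / total

def attention_entropy(attn):
    """Shannon entropy (in nats) of the normalized attention weights."""
    total = sum(attn)
    probs = [a / total for a in attn if a > 0]
    return -sum(p * math.log(p) for p in probs)

focused = [0.05, 0.05, 0.85, 0.05]  # attention concentrated on position 2
spread = [0.25, 0.25, 0.25, 0.25]   # attention dispersed evenly

intensity_focused = attention_intensity(focused, [2])
entropy_focused = attention_entropy(focused)
entropy_spread = attention_entropy(spread)
```

Under these assumed definitions, a prompt that spreads attention away from the sensitive position lowers the intensity and raises the entropy.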
5413: MGCA-Net: Multi-Graph Contextual Attention Network for Two-View Correspondence Learning
Authors: Shuyuan Lin, Mengtin Lo, Haosheng Chen, Yanjie Liang, Qiangqiang Wu
Location: Guangzhou | Day: TBD
Show Abstract
Two-view correspondence learning is a key task in computer vision, which aims to establish reliable matching relationships for applications such as camera pose estimation and 3D reconstruction. However, existing methods have limitations in local geometric modeling and cross-stage information optimization, which make it difficult to accurately capture the geometric constraints of matched pairs and thus reduce the robustness of the model. To address these challenges, we propose a Multi-Graph Contextual Attention Network (MGCA-Net), which consists of a Contextual Geometric Attention (CGA) module and a Cross-Stage Multi-Graph Consensus (CSMGC) module. Specifically, CGA dynamically integrates spatial position and feature information via an adaptive attention mechanism and enhances the capability to capture both local and global geometric relationships. Meanwhile, CSMGC establishes geometric consensus via a cross-stage sparse graph network, ensuring the consistency of geometric information across different stages. Experimental results on two representative datasets, YFCC100M and SUN3D, show that MGCA-Net significantly outperforms existing SOTA methods in the outlier rejection and camera pose estimation tasks. Source code is available at http://www.linshuyuan.com.
5414: Optimal Planning to Coordinate Science Data Collection and Downlink for a Constellation of Agile Satellites with Limited Storage
Authors: Richard Levinson, Vinay Ravindra, Sreeja Roy-Singh
Location: Guangzhou | Day: TBD
Show Abstract
We present a novel Mixed Integer Linear Program formulation that produces optimal plans for a constellation of remote sensing satellites. The generalized formulation is applied to an operational NASA constellation to improve wildfire danger prediction. The planner generates integrated data collection and downlink plans for multiple agile satellites with limited storage capacity, minimum energy requirements, and temporal constraints. Observation targets and modes are associated with science rewards. The planner maximizes the aggregate rewards collected for all observations on all satellites.

Our generalized model for integrated data collection and downlink uses a novel interval-based abstraction called Data Cycles, without time-indexed variables. Data cycles organize the multitude of observation and downlink opportunities from 1 second granularity into sequences of data collection and downlink intervals. Experiments using large-scale real-world data yield optimal 24-hr plans for an eight satellite constellation, which capture 99% of the ~23,000 available targets and 99.9% of available science rewards.
5415: Counterfactual Explanations for Continuous Action Reinforcement Learning
Authors: Shuyang Dong, Shangtong Zhang, Lu Feng
Location: Montreal | Day: August 21st | Time: 10:00 | Session: ML: Explainable/Interpretable machine learning
Show Abstract
Reinforcement Learning (RL) has shown great promise in domains like healthcare and robotics but often struggles with adoption due to its lack of interpretability. Counterfactual explanations, which address “what if” scenarios, provide a promising avenue for understanding RL decisions but remain underexplored for continuous action spaces. We propose a novel approach for generating counterfactual explanations in continuous action RL by computing alternative action sequences that improve outcomes while minimizing deviations from the original sequence. Our approach leverages a distance metric for continuous actions and accounts for constraints such as adhering to predefined policies in specific states. Evaluations in two RL domains, Diabetes Control and Lunar Lander, demonstrate the effectiveness, efficiency, and generalization of our approach, enabling more interpretable and trustworthy RL applications.
5416: Consistency-Aware Padding for Incomplete Multi-Modal Alignment Clustering Based on Self-Repellent Greedy Anchor Search
Authors: Shubin Ma, Liang Zhao, Mingdong Lu, Yifan Guo, Bo Xu
Location: Guangzhou | Day: TBD
Show Abstract
Multi-modal representation is faithful and highly effective in describing the characteristics of real-world data samples by capturing their complementary information. However, the collected data often exhibits incomplete and misaligned characteristics due to factors such as inconsistent sensor frequencies and device malfunctions. Existing research has not effectively addressed the issue of filling missing data in scenarios where multi-view data are both imbalanced and misaligned. Instead, it relies on class-level alignment of the available data. As a result, some data samples are not well matched, which degrades the quality of data fusion. In this paper, we propose the Consistency-Aware Padding for Incomplete Multi-Modal Alignment Clustering Based on Self-Repellent Greedy Anchor Search (CAPIMAC) to tackle the problem of filling imbalanced and misaligned data in multi-modal datasets. Specifically, we propose a self-repellent greedy anchor search module (SRGASM), which employs a self-repellent random walk combined with a greedy algorithm to identify anchor points for re-representing incomplete and misaligned multi-modal data. Subsequently, based on noise-contrastive learning, we design a consistency-aware padding module (CAPM) to effectively interpolate and align imbalanced and misaligned data, thereby improving the quality of multi-modal data fusion. Experimental results demonstrate the superiority of our method on benchmark datasets. The code will be publicly released at https://github.com/bestow09090/-CAPIMAC.git.
5429: Electron Density-enhanced Molecular Geometry Learning
Authors: Hongxin Xiang, Jun Xia, Xin Jin, Wenjie Du, Li Zeng, Xiangxiang Zeng
Location: Guangzhou | Day: TBD
Show Abstract
Electron density (ED), which describes the probability distribution of electrons in space, is crucial for accurately understanding the energy and force distribution in molecular force fields (MFF). Existing machine learning force fields (MLFF) focus on mining appropriate physical quantities from the atom-level conformation to enhance the molecular geometry representation while ignoring the unique information from microscopic electrons. In this work, we propose an efficient Electronic Density representation framework to enhance molecular Geometric learning (called EDG), which leverages images rendered from ED to boost molecular geometric representations in MLFF. Specifically, we construct a novel image-based ED representation, which consists of 2 million 6-view images with RGB-D channels, and design an ED representation learning model, called ImageED, to learn ED-related knowledge from these images. We further propose an efficient ED-aware teacher and introduce a cross-modal distillation strategy to transfer knowledge from the image-based teacher to the geometry-based students. Extensive experiments on QM9 and rMD17 demonstrate that EDG can be directly integrated into existing geometry-based models and significantly improves the capabilities of these models (e.g., SchNet, EGNN, SphereNet, ViSNet) for geometry representation learning in MLFF with a maximum average performance increase of 33.7%. Code and appendix are available at https://github.com/HongxinXiang/EDG
5440: Denoise-then-Retrieve: Text-Conditioned Video Denoising for Video Moment Retrieval
Authors: Weijia Liu, Jiuxin Cao, Bo Miao, Zhiheng Fu, Xuelin Zhu, Jiawei Ge, Bo Liu, Mehwish Nasim, Ajmal Mian
Location: Guangzhou | Day: TBD
Show Abstract
Current text-driven Video Moment Retrieval (VMR) methods encode all video clips, including irrelevant ones, disrupting multimodal alignment and hindering optimization. To this end, we propose a denoise-then-retrieve paradigm that explicitly filters text-irrelevant clips from videos and then retrieves the target moment using purified multimodal representations. Following this paradigm, we introduce the Denoise-then-Retrieve Network (DRNet), comprising Text-Conditioned Denoising (TCD) and Text-Reconstruction Feedback (TRF) modules. TCD integrates cross-attention and structured state space blocks to dynamically identify noisy clips and produce a noise mask to purify multimodal video representations. TRF further distills a single query embedding from purified video representations and aligns it with the text embedding, serving as auxiliary supervision for denoising during training. Finally, we perform conditional retrieval using text embeddings on purified video representations for accurate VMR. Experiments on Charades-STA and QVHighlights demonstrate that our approach surpasses state-of-the-art methods on all metrics. Furthermore, our denoise-then-retrieve paradigm is adaptable and can be seamlessly integrated into advanced VMR models to boost performance.
5449: BridgeVoC: Neural Vocoder with Schrödinger Bridge
Authors: Tong Lei, Zhiyu Zhang, Rilin Chen, Meng Yu, Jing Lu, Chengshi Zheng, Dong Yu, Andong Li
Location: Guangzhou | Day: TBD
Show Abstract
While previous diffusion-based neural vocoders typically follow a noise-to-data generation pipeline, the linear-degradation prior of the mel-spectrogram is often neglected, resulting in limited generation quality. By revisiting the vocoding task and excavating its connection with the signal restoration task, this paper proposes a time-frequency (T-F) domain-based neural vocoder with the Schrödinger Bridge, called BridgeVoC, which is the first to follow the data-to-data generation paradigm. Specifically, the mel-spectrogram can be projected into the target linear-scale domain and regarded as a degraded spectral representation with a deficient rank distribution. Based on this, the Schrödinger Bridge is leveraged to establish a connection between the degraded and target data distributions. During the inference stage, starting from the degraded representation, the target spectrum can be gradually restored rather than generated from a Gaussian noise process. Quantitative experiments on LJSpeech and LibriTTS show that BridgeVoC achieves faster inference and surpasses existing diffusion-based vocoder baselines, while also matching or exceeding non-diffusion state-of-the-art methods across evaluation metrics.
5463: Test-Time Adaptation on Recommender System with Data-Centric Graph Transformation
Authors: Yating Liu, Xin Zheng, Yi Li, Yanqing Guo
Location: Guangzhou | Day: TBD
Show Abstract
Distribution shifts in recommender systems between training and testing in user-item interactions lead to inaccurate recommendations. Despite the promising performance of test-time adaptation technology in various domains, it still faces challenges in recommender systems due to the impracticality of fine-tuning models and the infeasibility of obtaining test-time labels. To address these challenges, we first propose a Test-Time Adaptation framework for Graph-based Recommender system, named TTA-GREC, to dynamically adapt user-item graphs at test time in a data-centric way, handling distribution shifts effectively. Specifically, our TTA-GREC targets KG-enhanced GNN-based recommender systems with three core components: (1) Pseudo-label guided UI graph transformation for adaptive improvement; (2) Rationale score guided KG graph revision for semantic enhancement; and (3) Sampling-based self-supervised adaptation for contrastive learning. Experiments demonstrate TTA-GREC’s superiority at test time and provide new data-centric insights on test-time adaptation for better recommender system inference.
5470: Predicting Spectral Information for Self-Supervised Signal Classification
Authors: Yi Xu, Shuang Wang, Hantong Xing, Chenxu Wang, Dou Quan, Rui Yang, Dong Zhao, Luyang Mei
Location: Guangzhou | Day: TBD
Show Abstract
Deep learning methods have demonstrated remarkable performance across various communication signal processing tasks. However, most signal classification methods require a substantial amount of labeled samples for training, posing significant challenges in the field of communication signals, as labeling necessitates expert knowledge. This paper proposes a novel self-supervised signal classification method called Spectral-Guided Self-Supervised Signal Classification (SGSSC). Specifically, to leverage frequency-domain information with modulation semantics as prior knowledge for the model, we design a previously unexplored pretext task tailored to the format of signal data. This task involves predicting spectral information from masked time-domain signals, enabling the model to learn implicit signal features through cross-domain pattern transformation. Furthermore, the pretext task in the SGSSC method is relevant to the downstream classification task, and using traditional fine-tuning strategies on the downstream task may lead to the loss of certain features associated with the pretext task. Therefore, we propose an attention mechanism-based fine-tuning strategy that adaptively integrates pre-trained features from different levels. Extensive experimental results validate the superiority of the SGSSC method. For instance, when the proportion of labeled samples is only 0.5%, our method achieves an average improvement of 2.3% in downstream classification tasks compared to the best-performing self-supervised training strategies.
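The pretext task described in this abstract, predicting spectral information from a masked time-domain signal, can be illustrated by showing how one training pair might be built. This is a sketch under assumptions: the magnitude spectrum as the target, random zero-masking, the mask ratio, and the function names are all hypothetical, and the model that would be trained on such pairs is omitted.

```python
import cmath
import math
import random

def magnitude_spectrum(signal):
    """Magnitude of the discrete Fourier transform, computed naively in O(n^2)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]

def mask_signal(signal, mask_ratio, rng):
    """Zero out a random subset of time steps and return the masked copy and indices."""
    n = len(signal)
    masked_idx = set(rng.sample(range(n), int(n * mask_ratio)))
    masked = [0.0 if i in masked_idx else x for i, x in enumerate(signal)]
    return masked, masked_idx

# Toy signal: a pure cosine with 4 cycles over 32 samples.
signal = [math.cos(2 * math.pi * 4 * t / 32) for t in range(32)]
target = magnitude_spectrum(signal)  # prediction target, computed from the clean signal
masked, masked_idx = mask_signal(signal, 0.25, random.Random(0))  # model input
```

The input lives in the time domain and the target in the frequency domain, so a model trained on such pairs has to learn a cross-domain mapping rather than simply copy visible samples.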
5476: HLMTrans: A Sim-to-Real Transfer Framework for Spatial Crowdsourcing with Human-Guided Language Models
Authors: Qingshun Wu, Yafei Li, Lulu Li, Yuanyuan Jin, Shuo He, Mingliang Xu
Location: Guangzhou | Day: TBD
Show Abstract
Reinforcement Learning (RL), trained via trial and error in simulators, has been proven to be an effective approach for addressing task assignment problems in spatial crowdsourcing. However, a performance gap still exists when transferring the simulator-trained RL Models (RLMs) to real-world settings due to the misalignment of travel time. Existing works mostly focus on using data-driven and learning-based methods to predict travel time; unfortunately, these approaches are limited in achieving accurate predictions by requiring a large amount of real-world data covering the entire state distribution. In this paper, we propose a Sim-to-Real Transfer with Human-guided Language Models framework called HLMTrans, which comprises three core modules: RLMs decision for task assignment, sim-to-real transfer with Large Language Models (LLMs), and preference learning from human feedback. HLMTrans first leverages the zero-shot chain-of-thought reasoning capability of LLMs to estimate travel time by capturing the real-world dynamics. This estimation is then input as domain knowledge into the forward model of Grounded Action Transformation (GAT) to enhance the action transformation of RLMs. Further, we design a human preference learning mechanism to fine-tune LLMs, improving their generation quality and enabling RLMs to learn a more realistic policy. We evaluate the proposed HLMTrans on two real-world datasets, and the experimental results demonstrate that HLMTrans outperforms the SOTA methods in terms of effectiveness and efficiency.
5477: DIIN: Diffusion Iterative Implicit Networks for Arbitrary-scale Super-resolution
Authors: Tao Dai, Song Wang, Hang Guo, Jianping Wang, Zexuan Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Implicit neural representation (INR) aims to represent continuous domain signals via implicit neural functions and has achieved great success in arbitrary-scale image super-resolution (SR). However, most existing INR-based SR methods focus on learning implicit features from independent coordinates, while neglecting interactions of neighborhood coordinates, thus resulting in limited contextual awareness. In this paper, we rethink the forward process of implicit neural functions as a signal diffusion process and propose a novel Diffusion Iterative Implicit Network (DIIN) for arbitrary-scale SR to promote global signal flow with neighborhood interactions. The DIIN framework mainly consists of stacked Diffusion Iteration Layers with a dictionary cross-attention block to enrich the iterative update process with supplementary information. Besides, we develop the Position-Aware Embedding Block to strengthen spatial dependencies between consecutive input samples. Extensive experiments on public datasets demonstrate that our method achieves state-of-the-art or competitive performance, highlighting its effectiveness and efficiency for arbitrary-scale SR. Our code is available at https://github.com/Song-1205/DIIN.
5482: Preference Identification by Interaction Overlap for Bundle Recommendation
Authors: Fei-Yao Liang, Wu-Dong Xi, Xing-Xing Xing, Wei Wan, Chang-Dong Wang, Hui-Yu Zhou
Location: Guangzhou | Day: TBD
Show Abstract
In the digital age, recommendation systems are crucial for enhancing user experiences, with bundle recommendations playing a key role by integrating complementary products. However, existing methods fail to accurately identify user preferences for specific items within bundles, making it difficult to design bundles containing more items of interest to users. Additionally, these methods do not leverage similar preferences among users of the same category, resulting in unstable and incomplete preference expressions. To address these issues, we propose Preference Identification by Interaction Overlap for Bundle Recommendation (PIIO). The data augmentation module analyzes the overlap between bundle-item inclusions and user-item interactions to calculate the interaction probability of non-interacted bundles, selecting the bundle with the highest probability as a positive sample to enrich user-bundle interactions and uncover user preferences for items within bundles. The preference aggregation module utilizes the overlap in user-item interactions to select similar users, aggregates preferences using an autoencoder, and constructs comprehensive preference profiles. The optimization module predicts user-bundle matching scores based on a user interest boundary loss function. The proposed PIIO model is applied to two bundle recommendation datasets, and experiments demonstrate the effectiveness of the PIIO model, surpassing state-of-the-art models.
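The data augmentation step in this abstract, scoring non-interacted bundles by their overlap with a user's item interactions and picking the top one as a positive sample, can be sketched as follows. The overlap ratio used as the "interaction probability", the function names, and the toy data are assumptions; the paper's actual formula is not given here.

```python
def overlap_score(user_items, bundle_items):
    """Fraction of a bundle's items that the user has already interacted with."""
    if not bundle_items:
        return 0.0
    return len(user_items & bundle_items) / len(bundle_items)

def pick_positive_bundle(user_items, bundles, interacted_bundles):
    """Return the non-interacted bundle with the highest overlap score, or None."""
    candidates = {
        bundle_id: overlap_score(user_items, items)
        for bundle_id, items in bundles.items()
        if bundle_id not in interacted_bundles
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Hypothetical toy data: item and bundle identifiers are made up.
user_items = {"i1", "i2", "i5"}
bundles = {
    "b1": {"i1", "i2", "i3"},  # overlap 2/3
    "b2": {"i5", "i6"},        # overlap 1/2
    "b3": {"i1", "i2", "i5"},  # overlap 3/3, but already interacted
}
augmented = pick_positive_bundle(user_items, bundles, interacted_bundles={"b3"})
```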
5484: Adaptive Graph Unlearning
Authors: Pengfei Ding, Yan Wang, Guanfeng Liu, Jiajie Zhu
Location: Guangzhou | Day: TBD
Show Abstract
Graph unlearning, which deletes graph elements such as nodes and edges from trained graph neural networks (GNNs), is crucial for real-world applications where graph data may contain outdated, inaccurate, or privacy-sensitive information. However, existing methods often suffer from (1) incomplete or over unlearning due to neglecting the distinct objectives of different unlearning tasks, and (2) inaccurate identification of neighbors affected by deleted elements across various GNN architectures. To address these limitations, we propose AGU, a novel Adaptive Graph Unlearning framework that flexibly adapts to diverse unlearning tasks and GNN architectures. AGU ensures the complete forgetting of deleted elements while preserving the integrity of the remaining graph. It also accurately identifies affected neighbors for each GNN architecture and prioritizes important ones to enhance unlearning performance. Extensive experiments on seven real-world graphs demonstrate that AGU outperforms existing methods in terms of effectiveness, efficiency, and unlearning capability.
5494: Fully Test-Time Adaptation for Feature Decrement in Tabular Data
Authors: Zi-Jian Cheng, Zi-Yi Jia, Kun-Yang Yu, Zhi Zhou, Lan-Zhe Guo
Location: Guangzhou | Day: TBD
Show Abstract
Tabular data is widely adopted in various machine learning tasks. Current tabular data learning mainly focuses on closed environments, while in real-world applications, open environments are often encountered, where distribution shifts and feature decrements occur, leading to severe performance degradation. Previous studies have primarily focused on addressing distribution shifts, while feature decrements, a unique challenge in tabular data learning, have received relatively little attention. In this paper, we present the first comprehensive study on the problem of Fully Test-Time Adaptation for Feature Decrement in Tabular Data. Through empirical analysis, we identify the suboptimality of existing missing-feature imputation methods and the limited applicability of missing-feature adaptation approaches. To address these challenges, we propose a novel method, LLM-IMPUTE, which leverages Large Language Models (LLMs) to impute missing features without relying on training data. Furthermore, we introduce Augmented-Training LLM (ATLLM), a method designed to enhance robustness to feature decrements by simulating feature-decrement scenarios during the training phase, in order to address tasks whose missing features cannot be imputed by LLM-IMPUTE. Extensive experimental results demonstrate that our proposal significantly improves both performance and robustness in missing feature imputation and adaptation scenarios.
5495: HGMP: Heterogeneous Graph Multi-Task Prompt Learning
Authors: Pengfei Jiao, Jialong Ni, Di Jin, Xuan Guo, Huan Liu, Hongjiang Chen, Yanxian Bi
Location: Guangzhou | Day: TBD
Show Abstract
The pre-training and fine-tuning methods have gained widespread attention in the field of heterogeneous graph neural networks due to their ability to leverage large amounts of unlabeled data during the pre-training phase, allowing the model to learn rich structural features. However, these methods face the issue of a mismatch between the pre-trained model and downstream tasks, leading to suboptimal performance in certain application scenarios. Prompt learning methods have emerged as a new direction in heterogeneous graph tasks, as they allow flexible adaptation of task representations to address target inconsistency. Building on this idea, this paper proposes a novel multi-task prompt framework for the heterogeneous graph domain, named HGMP. First, to bridge the gap between the pre-trained model and downstream tasks, we reformulate all downstream tasks into a unified graph-level task format. Next, we address the limitations of existing graph prompt learning methods, which struggle to integrate contrastive pre-training strategies in the heterogeneous graph domain. We design a graph-level contrastive pre-training strategy to better leverage heterogeneous information and enhance performance in multi-task scenarios. Finally, we introduce heterogeneous feature prompts, which enhance model performance by refining the representation of input graph features. Experimental results on public datasets show that our proposed method adapts well to various tasks and significantly outperforms baseline methods.
5522: Integration of Old and New Knowledge for Generalized Intent Discovery: A Consistency-driven Prototype-Prompting Framework
Authors: Xiao Wei, Xiaobao Wang, Ning Zhuang, Chenyang Wang, Longbiao Wang, Jianwu Dang
Location: Guangzhou | Day: TBD
Show Abstract
Intent detection aims to identify user intents from natural language inputs, where supervised methods rely heavily on labeled in-domain (IND) data and struggle with out-of-domain (OOD) intents, limiting their practical applicability. Generalized Intent Discovery (GID) addresses this by leveraging unlabeled OOD data to discover new intents without additional annotation. However, existing methods focus solely on clustering unsupervised data while neglecting domain adaptation. Therefore, we propose a consistency-driven prototype-prompting framework for GID from the perspective of integrating old and new knowledge, which includes a prototype-prompting framework for transferring old knowledge from external sources, and a hierarchical consistency constraint for learning new knowledge from target domains. We conducted extensive experiments and the results show that our method significantly outperforms all baseline methods, achieving state-of-the-art results, which strongly demonstrates the effectiveness and generalization of our methods. Our source code is publicly available at https://github.com/smileix/cpp.
5529: Odyssey: Empowering Minecraft Agents with Open-World Skills
Authors: Shunyu Liu, Yaoru Li, Kongcheng Zhang, Zhenyu Cui, Wenkai Fang, Yuxuan Zheng, Tongya Zheng, Mingli Song
Location: Guangzhou | Day: TBD
Show Abstract
Recent studies have delved into constructing generalist agents for open-world environments like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic programmatic tasks, e.g., material collection and tool-crafting following the Minecraft tech-tree, treating the ObtainDiamond task as the ultimate goal. This limitation stems from the narrowly defined set of actions available to agents, requiring them to learn effective long-horizon strategies from scratch. Consequently, discovering diverse gameplay opportunities in the open world becomes challenging. In this work, we introduce Odyssey, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world. Odyssey comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills. (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki. (3) A new agent capability benchmark that includes the long-term planning task, the dynamic-immediate planning task, and the autonomous exploration task. Extensive experiments demonstrate that the proposed Odyssey framework can effectively evaluate different capabilities of LLM-based agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions.
5539: MultiDreamer3D: Multi-concept 3D Customization with Concept-Aware Diffusion Guidance
Authors: Wooseok Song, Seunggyu Chang, Jaejun Yoo
Location: Montreal | Day: August 19th | Time: 15:00 | Session: CV: Diffusion models
Show Abstract
While single-concept customization has been studied in 3D, multi-concept customization remains largely unexplored. To address this, we propose MultiDreamer3D that can generate coherent multi-concept 3D content in a divide-and-conquer manner. First, we generate 3D bounding boxes using an LLM-based layout controller. Next, a selective point cloud generator creates coarse point clouds for each concept. These point clouds are placed in the 3D bounding boxes and initialized into 3D Gaussian Splatting with concept labels, enabling precise identification of concept attributions in 2D projections. Finally, we refine 3D Gaussians via concept-aware interval score matching, guided by concept-aware diffusion. Our experimental results show that MultiDreamer3D not only ensures object presence and preserves the distinct identities of each concept but also successfully handles complex cases such as property change or interaction. To the best of our knowledge, we are the first to address the multi-concept customization in 3D.
5544: Preference-based Deep Reinforcement Learning for Historical Route Estimation
Authors: Boshen Pan, Yaoxin Wu, Zhiguang Cao, Yaqing Hou, Guangyu Zou, Qiang Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Recent Deep Reinforcement Learning (DRL) techniques have advanced solutions to Vehicle Routing Problems (VRPs). However, many of these methods focus exclusively on optimizing distance-oriented objectives (i.e., minimizing route length), often overlooking drivers’ implicit route preferences. These preferences, which are crucial in practice, are challenging to model using traditional DRL approaches. To address this gap, we propose a preference-based DRL method characterized by its reward design and optimization objective, which is specialized to learn historical route preferences. Our experiments demonstrate that the method aligns generated solutions more closely with human preferences. Moreover, it exhibits strong generalization performance across a variety of instances, offering a robust solution for different VRP scenarios.
5546: On the Power of Optimism in Constrained Online Convex Optimization
Authors: Haobo Zhang, Hengquan Guo, Xin Liu
Location: Montreal | Day: August 20th | Time: 14:00 | Session: ML: Machine Learning 6/8
Show Abstract
This paper studies the constrained online convex optimization problem (COCO) where the learner makes sequential decisions within a constrained set. We present Optimistic-COCO, an adaptive gradient-based algorithm that incorporates optimistic design with the Lyapunov optimization technique. The proposed algorithm achieves strong theoretical guarantees: 1) Optimistic-COCO provides a tight gradient-variation regret bound and constant constraint violation; 2) Optimistic-COCO is environment-agnostic, utilizing adaptive learning rates that rely solely on causal information. These results resolve an open question posed in prior work regarding whether an adaptive algorithm can achieve problem-dependent regret and constant constraint violation in COCO. We establish these robust guarantees through carefully designed adaptive parameters and a refined multi-step Lyapunov drift analysis. Experimental results further validate our theoretical findings, demonstrating the practical efficacy of the proposed algorithm.
5551: A SAT-based Method for Counting All Singleton Attractors in Boolean Networks
Authors: Rei Higuchi, Takehide Soh, Daniel Le Berre, Morgan Magnin, Mutsunori Banbara, Naoyuki Tamura
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Constraint Satisfaction and Optimization (3/3)
Show Abstract
Boolean networks (BNs) are widely used to model biological regulatory networks. Attractors here hold significant meaning as they represent long-term behaviors such as homeostasis and the results of cell differentiation. As such, computing attractors is of critical importance to guarantee the validity of a model or to assess its stability and robustness. However, this problem is quite challenging when it comes to large real-world models. To overcome the limits of state-of-the-art BDD-based or ASP-based enumeration approaches, we introduce a SAT-based approach to compute fixed points (singleton attractors) of BNs and demonstrate its merits for counting the number of singleton attractors of large-scale benchmarks well established in the literature.
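To make the counted object concrete: a singleton attractor is a state that the network's update functions map to itself. The sketch below counts such states by brute-force enumeration on a hypothetical three-node network. It is not the SAT encoding described in the abstract; the network, `UPDATE_FUNCTIONS`, `step`, and `singleton_attractors` are all illustrative assumptions.

```python
from itertools import product

# Hypothetical 3-node Boolean network, chosen only for illustration.
# Each function maps the current state (a, b, c) to one node's next value.
UPDATE_FUNCTIONS = (
    lambda a, b, c: b,        # next value of a
    lambda a, b, c: a,        # next value of b
    lambda a, b, c: a and b,  # next value of c
)


def step(state):
    """Apply one synchronous update to every node."""
    return tuple(bool(f(*state)) for f in UPDATE_FUNCTIONS)


def singleton_attractors():
    """Return every state that maps to itself (a fixed point)."""
    n = len(UPDATE_FUNCTIONS)
    return [s for s in product((False, True), repeat=n) if step(s) == s]


fixed_points = singleton_attractors()
print(len(fixed_points), fixed_points)
# prints: 2 [(False, False, False), (True, True, True)]
```

Enumeration visits all 2^n states, so it is only practical for very small networks. That cost is the reason symbolic approaches such as the SAT-based one in the abstract matter for large models.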
5554: Wrapped Partial Label Dimensionality Reduction via Dependence Maximization
Authors: Xiang-Ru Yu, Deng-Bao Wang, Min-Ling Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Partial label learning induces a classifier from data with ambiguous supervision, where each instance is associated with a set of candidate labels, only one of which is valid. As a classic data preprocessing strategy, dimensionality reduction helps enhance the generalization capabilities of learning algorithms. Due to the ambiguity of supervision, existing works on partial label dimensionality reduction are confined to two separate stages: dimensionality reduction and partial label disambiguation. However, the decoupling of dimensionality reduction from partial label disambiguation can lead to severe performance degradation. In this paper, we present a novel approach called Wrapped Partial Label Dimensionality Reduction (WPLDR) to address this challenge. Specifically, WPLDR integrates the dimensionality reduction and partial label disambiguation within a unified framework, employing alternating optimization to concurrently perform dimensionality reduction and partial label disambiguation. WPLDR maximizes the interdependence between features in the embedded space and confidence-based label information, while simultaneously ensuring the manifold consistency between the embedded feature space and label space. Extensive experiments over a broad range of synthetic and real-world partial label data sets validate that the performance of well-established partial label learning algorithms can be significantly improved by the proposed WPLDR.
5567: On Definite Iterated Belief Revision with Belief Algebras
Authors: Hua Meng, Zhiguo Long, Michael Sioutis, Zhengchun Zhou
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Knowledge Representation and Reasoning (1/4)
Show Abstract
Traditional logic-based belief revision research focuses on designing rules to constrain the behavior of revision operators. Frameworks have been proposed to characterize iterated revision rules, but they are often too loose, leading to multiple revision operators that all satisfy the rules under the same belief condition. In many practical applications, such as safety critical ones, it is important to specify a definite revision operator to enable agents to iteratively revise their beliefs in a deterministic way. In this paper, we propose a novel framework for iterated belief revision by characterizing belief information through preference relations. Semantically, both beliefs and new evidence are represented as belief algebras, which provide a rich and expressive foundation for belief revision. Building on traditional revision rules, we introduce additional postulates for revision with belief algebra, including an upper-bound constraint on the outcomes of revision. We prove that the revision result is uniquely determined given the current belief state and new evidence. Furthermore, to make the framework more useful in practice, we develop a particular algorithm for performing the proposed revision process. We argue that this approach may offer a more predictable and principled method for belief revision, making it suitable for real-world applications.
5581: Navigating Social Dilemmas with LLM-based Agents via Consideration of Future Consequences
Authors: Dung Nguyen, Hung Le, Kien Do, Sunil Gupta, Svetha Venkatesh, Truyen Tran
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Agent-based and Multi-agent Systems (1/3)
Show Abstract
Artificial agents with the aid of large language models (LLMs) are effective in various real-world scenarios but struggle to cooperate in social dilemmas. When making decisions under the strain of selecting between long-term consequences and short-term benefits in commonly shared resources, LLM-based agents often exploit the environment, leading to early depletion. Inspired by the concept of consideration of future consequences (CFC), which is well-known in social psychology, we propose a framework to enable the ability to consider future consequences for LLM-based agents, which results in a new kind of agent that we term the CFC-Agent. We enable the CFC-Agent to act toward different levels of consideration for future consequences. Our first set of experiments, where the LLM is directly asked to make decisions, shows that agents considering future consequences exhibit sustainable behaviour and achieve high common rewards for the population. Extensive experiments in complex environments show that the CFC-Agent can manage a sequence of calls to the LLM for reasoning and engaging in communication to cooperate with others to resolve the common dilemma better. Finally, our analysis shows that considering future consequences not only affects the final decision but also improves the conversations between LLM-based agents toward a better resolution of social dilemmas.
5584: Instance Relation Learning Network with Label Knowledge Propagation for Few-shot Multi-label Intent Detection
Authors: Shiman Zhao, Shangyuan Li, Wei Chen, Tengjiao Wang, Jiahui Yao, Jiabin Zheng, Kam-Fai Wong
Location: Guangzhou | Day: TBD
Show Abstract
Few-shot Multi-label Intent Detection (MID) is crucial for dialogue systems, aiming to detect multiple intents of utterances in low-resource dialogue domains. Previous studies focus on a two-stage pipeline. They first learn representations of utterances with multiple labels and then use a threshold-based strategy to identify multi-label results. However, these methods rely on representation classification and ignore instance relations, leading to error propagation. To solve the above issues, we propose a multi-label joint learning method for few-shot MID in an end-to-end manner, which constructs an instance relation learning network with label knowledge propagation to eliminate error propagation. Concretely, we learn the interaction relations between instances with class information to propagate label knowledge between a few labeled (support set) and unlabeled (query set) instances. With label knowledge propagation, the relation strength between instances directly indicates whether two utterances belong to the same intent for multi-label prediction. Besides, a dual relation-enhanced loss is developed to optimize support- and query-level relation strength to improve performance. Experiments show that we outperform strong baselines by an average of 9.54% AUC and 11.19% Macro-F1 in 1-shot scenarios.
5589: Guiding Large Language Models in Modeling Optimization Problems via Question Partitioning
Authors: Xiaotian Pan, Junhao Fang, Feng Wu, Sijia Zhang, Yi-Xiang Hu, Shaoang Li, Xiang-Yang Li
Location: Guangzhou | Day: TBD
Show Abstract
Optimization problems are ubiquitous across various domains, such as resource scheduling, production planning, and sales management. Traditionally, they are modeled manually, leading to inefficiencies due to difficulties in communication and collaboration between modeling and domain experts. The emergence of Large Language Models (LLMs) has made automated modeling possible. However, real-world applications are often large-scale and have numerous variables and constraints, limiting the applicability of existing methods. To address this, we propose PaMOP, a novel modeling framework based on LLMs, to model optimization problems automatically, given only natural language descriptions. Specifically, we extract and partition the problems using a tree structure, guiding the LLMs to model each set of constraints with self-augmented prompts, thus reducing the demands on the LLM’s ability to handle large amounts of content. The mathematical model is then iteratively corrected and validated through our correction procedures. The experiments demonstrate that our method improves performance on the common benchmark dataset NLP4LP, achieving an accuracy of 62.3% and a code executability rate of 86.8% when tested on GPT-4. Additionally, we demonstrate the effectiveness of our PaMOP in handling large real-world problems.
5600: A Weighted-Based Fast Local Search for α-Neighbor p-Center Problem
Authors: Qingyun Zhang, Zhipeng Lü, Junwen Ding, Zhouxing Su
Location: Guangzhou | Day: TBD
Show Abstract
The α-neighbor p-center problem (α-pCP) is an extension of the classical p-center problem. It aims to select p centers from a set of candidate centers to minimize the maximum distance between any client and its α service centers. In this paper, we propose a weighting-based fast local search algorithm called WFLS for solving α-pCP. First, WFLS converts the complex α-pCP into a series of decision subproblems by specifying the service radius, effectively mitigating the gradient vanishing issue during the search process, and introduces a new MIP model. Then, it addresses the simplified subproblems using a fast local search procedure with a swap-based neighborhood structure. WFLS adopts an efficient weighting strategy, an incremental evaluation technique, a refined-grained penalty-based neighborhood evaluation, and two scoring functions of neighborhood evaluation to accelerate and guide the search process. Computational experiments on 154 widely used public benchmark instances demonstrate that WFLS outperforms the state-of-the-art methods in the literature. Specifically, WFLS improves 69 previous best known results and matches the best known results for all the remaining ones in less time than other competitors.
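The objective and the radius-based decision subproblem mentioned above can be illustrated with a small sketch. It assumes the usual reading of the problem, in which a client's cost is the distance to its α-th nearest selected center. The function names and the distance matrix are hypothetical, and the code shows only how a candidate solution is evaluated, not the WFLS search itself.

```python
def alpha_pcp_objective(dist, centers, alpha):
    """Largest distance from any client to its alpha-th nearest chosen center.

    dist[i][j] is the distance from client i to candidate center j;
    centers is the list of selected candidate indices.
    """
    if not 1 <= alpha <= len(centers):
        raise ValueError("alpha must be between 1 and the number of centers")
    return max(sorted(row[j] for j in centers)[alpha - 1] for row in dist)


def covers_within_radius(dist, centers, alpha, radius):
    """Decision version: does every client have at least alpha chosen
    centers within the given service radius?"""
    return all(sum(row[j] <= radius for j in centers) >= alpha for row in dist)


# Hypothetical instance: 3 clients (rows) and 3 candidate centers (columns).
dist = [
    [1, 4, 6],
    [5, 2, 3],
    [7, 8, 1],
]
print(alpha_pcp_objective(dist, [0, 2], alpha=1))            # prints: 3
print(alpha_pcp_objective(dist, [0, 1, 2], alpha=2))         # prints: 7
print(covers_within_radius(dist, [0, 2], alpha=1, radius=2)) # prints: False
```

A selection is feasible for radius r exactly when its objective value is at most r, which is what lets the optimization problem be handled as a sequence of decision subproblems over candidate radii.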
5619: Negative Metric Learning for Graphs
Authors: Yiyang Zhao, Chengpei Wu, Lilin Zhang, Ning Yang
Location: Guangzhou | Day: TBD
Show Abstract
Graph contrastive learning (GCL) often suffers from false negatives, which degrades the performance on downstream tasks. The existing methods addressing the false negative issue usually rely on human prior knowledge, still leading GCL to suboptimal results. In this paper, we propose a novel Negative Metric Learning (NML) enhanced GCL (NML-GCL). NML-GCL employs a learnable Negative Metric Network (NMN) to build a negative metric space, in which false negatives can be distinguished better from true negatives based on their distance to the anchor node. To overcome the lack of explicit supervision signals for NML, we propose a joint training scheme with a bi-level optimization objective, which implicitly utilizes the self-supervision signals to iteratively optimize the encoder and the negative metric network. The solid theoretical analysis and the extensive experiments conducted on widely used benchmarks verify the superiority of the proposed method.
5641: FedAPA: Server-side Gradient-Based Adaptive Personalized Aggregation for Federated Learning on Heterogeneous Data
Authors: Yuxia Sun, Aoxiang Sun, Siyi Pan, Zhixiao Fu, Jingcai Guo
Location: Guangzhou | Day: TBD
Show Abstract
Personalized federated learning (PFL) tailors models to clients’ unique data distributions while preserving privacy. However, existing aggregation-weight-based PFL methods often struggle with heterogeneous data, facing challenges in accuracy, computational efficiency, and communication overhead. We propose FedAPA, a novel PFL method featuring a server-side, gradient-based adaptive aggregation strategy to generate personalized models, by updating aggregation weights based on gradients of client-parameter changes with respect to the aggregation weights in a centralized manner. FedAPA guarantees theoretical convergence and achieves superior accuracy and computational efficiency compared to 10 PFL competitors across three datasets, with competitive communication overhead. The code and full proofs are available at: https://github.com/Yuxia-Sun/FL_FedAPA.
5642: Self-supervised End-to-end ToF Imaging Based on RGB-D Cross-modal Dependency
Authors: Weihang Wang, Jun Wang, Fei Wen
Location: Guangzhou | Day: TBD
Show Abstract
Time-of-Flight (ToF) imaging systems are susceptible to various noise and degradation, which can severely affect image quality. Traditional sequential imaging pipelines often suffer from error accumulation due to separate multi-stage processing. Existing end-to-end methods typically rely on noisy-clean depth image pairs for supervised learning. However, acquiring ground-truth is challenging in real-world scenarios due to factors such as Multi-Path Interference (MPI), phase wrapping, and complex noise patterns. In this paper, we propose a self-supervised learning framework for end-to-end ToF imaging, which does not require any noisy-clean pairs yet generalizes well across various off-the-shelf cameras. Our framework leverages the cross-modal dependencies between RGB and depth data as implicit supervision to effectively suppress noise and maintain image fidelity. Additionally, the loss function integrates the statistical characteristics of raw measurement data, enhancing robustness against noise and artifacts. Extensive experiments on both synthetic and real-world data demonstrate that our approach achieves performance comparable to supervised methods, without requiring paired noisy-clean data for training. Furthermore, our method consistently delivers strong performance across all evaluated cameras, highlighting its generalization capabilities. The code is available at https://github.com/WeihangWANG/RGBD_imaging.
5644: D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning
Authors: Jia Zhang, Chen-Xi Zhang, Yao Liu, Yi-Xuan Jin, Xiao-Wen Yang, Bo Zheng, Yi Liu, Lan-Zhe Guo
Location: Guangzhou | Day: TBD
Show Abstract
Recent advancements in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can effectively equip LLMs with instruction-following capabilities, outperforming large datasets often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying valuable subsets from large datasets to boost both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability, and then propose the D3 method comprising two key steps of scoring and selection. Specifically, in the scoring step, we define the diversity function to measure sample distinctiveness and introduce the uncertainty-based prediction difficulty to evaluate sample difficulty by mitigating the interference of context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate over multiple rounds, incorporating feedback to refine the selection focus adaptively. Experiments on both public datasets and the real-world Taobao Live application demonstrate the effectiveness of D3 in endowing LLMs with competitive or even superior instruction-following capabilities using less than 10% of the entire dataset.
5647: LiBOG: Lifelong Learning for Black-Box Optimizer Generation
Authors: Jiyuan Pei, Yi Mei, Jialin Liu, Mengjie Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Meta-Black-Box Optimization (MetaBBO) garners attention due to its success in automating the configuration and generation of black-box optimizers, significantly reducing the human effort required for optimizer design and discovering optimizers with higher performance than classic human-designed optimizers. However, existing MetaBBO methods conduct one-off training under the assumption that a stationary problem distribution with extensive and representative training problem samples is pre-available. This assumption is often impractical in real-world scenarios, where diverse problems following shifting distributions continually arise. Consequently, there is a pressing need for methods that can continuously learn from new problems encountered on-the-fly and progressively enhance their capabilities. In this work, we explore a novel paradigm of lifelong learning in MetaBBO and introduce LiBOG, a novel approach designed to learn from sequentially encountered problems and generate high-performance optimizers for Black-Box Optimization (BBO). LiBOG consolidates knowledge both across tasks and within tasks to mitigate catastrophic forgetting. Extensive experiments demonstrate LiBOG’s effectiveness in learning to generate high-performance optimizers in a lifelong learning manner, addressing catastrophic forgetting while maintaining plasticity to learn new tasks.
5650: SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation
Authors: Bin Xu, Yiguan Lin, Yinghao Li, Yang Gao
Location: Guangzhou | Day: TBD
Show Abstract
Large language models exhibit remarkable performance in simple code generation tasks. However, they encounter significant challenges when addressing complex problems that require reasoning and question decomposition. To tackle this, we propose a self-driven reasoning augmentation process, SRA-MCTS, which incorporates Monte Carlo Tree Search (MCTS) for reasoning data generation. SRA-MCTS enables LLMs to self-generate intermediate reasoning steps and perform iterative self-evaluation, facilitating self-improvement. Specifically, it utilizes MCTS to produce diverse intermediate reasoning steps. During each iteration, MCTS generates a step and employs self-evaluation to guide the selection of subsequent branches, ultimately forming a sufficiently diverse reasoning path referred to as “thinking”. This thinking guides the model in generating corresponding code, and both are combined as training data for supervised fine-tuning. Experimental results demonstrate that SRA-MCTS achieves consistent performance improvements across three model scales without additional supervisory assistance. Applied to the Meta-Llama-3.1-8B-Instruct model, it delivers an 11-point improvement on the MBPP-Complex dataset, underscoring the significant potential for model self-improvement. The code and data are available at https://github.com/DIRECT-BIT/SRA-MCTS.
5661: VidEvo: Evolving Video Editing through Exhaustive Temporal Modeling
Authors: Sizhe Dang, Huan Liu, Mengmeng Wang, Xin Lai, Guang Dai, Jingdong Wang
Location: Guangzhou | Day: TBD
Show Abstract
Text-guided video editing (TGVE) has become a recent hotspot due to its entertainment value and practical applications. To reduce overhead, existing methods primarily extend from text-to-image diffusion models and typically involve reconstruction and editing phases. However, challenges persist, particularly in enhancing temporal consistency of a video while adhering to textual alignment requirements. A crucial factor leading to the aforementioned issue is the inadequate and implicit tuning of the attention module within existing methods, which is specifically designed to capture temporal information. In light of this, we introduce VidEvo, a novel one-shot video editing method that leverages explicit cues derived from the original video to enhance temporal modeling. By integrating null-video embedding (NVE) and window-frame attention (WFA) components, VidEvo facilitates the smooth and coherent generation of videos from global and local perspectives simultaneously. To be specific, NVE learns a set of multi-scale temporal embeddings within the visual space during the reconstruction phase. These embeddings are subsequently directly injected into the attention module of the editing phase, explicitly augmenting the temporal consistency of the entire video. On the other hand, WFA enhances local temporal modeling by dynamically optimizing attention mechanisms between adjacent frames, which improves temporal coherence with reduced computational costs. Experimental evaluations show that VidEvo enhances frame-to-frame temporal consistency. Ablation studies confirm NVE and WFA’s effectiveness and their plug-and-play capability with other methods.
5670: New Algorithms for #2-SAT and #3-SAT
Authors: Junqiang Peng, Zimo Sheng, Mingyu Xiao
Location: Guangzhou | Day: TBD
Show Abstract
The #2-SAT and #3-SAT problems involve counting the number of satisfying assignments (also called models) for instances of 2-SAT and 3-SAT, respectively. In 2010, Zhou et al. (https://doi.org/10.1609/aaai.v24i1.7537) proposed an O*(1.1892^m)-time algorithm for #2-SAT and an efficient approach for #3-SAT, where m denotes the number of clauses. In this paper, we show that the weighted versions of #2-SAT and #3-SAT can be solved in O*(1.1082^m) and O*(1.4423^m) time, respectively. These results directly apply to the unweighted cases and achieve substantial improvements over the previous results. These advancements are enabled by the introduction of novel reduction rules, a refined analysis of branching operations, and the application of path decompositions on the primal and dual graphs of the formula.
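For readers unfamiliar with the counting problem itself, the following sketch counts the models of a small CNF formula by exhaustive enumeration. It only illustrates the definition of #SAT and is not the branching or path-decomposition algorithm of the paper. The clause encoding (signed integers, as in the DIMACS convention) and the function name `count_models` are assumptions made for the example.

```python
from itertools import product


def count_models(num_vars, clauses):
    """Count satisfying assignments of a CNF formula by brute force.

    Each clause is a tuple of non-zero integers: k means variable k is
    true, -k means variable k is false (variables are numbered from 1).
    """
    count = 0
    for assignment in product((False, True), repeat=num_vars):
        # A clause is satisfied if at least one of its literals is true.
        if all(
            any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            count += 1
    return count


# (x1 OR x2) AND (NOT x1 OR x3): a 2-SAT instance with m = 2 clauses.
print(count_models(3, [(1, 2), (-1, 3)]))  # prints: 4
```

This enumeration takes time exponential in the number of variables, whereas the bounds quoted in the abstract are measured in the number of clauses m.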
5678: Tight Runtime Guarantees From Understanding the Population Dynamics of the GSEMO Multi-Objective Evolutionary Algorithm
Authors: Benjamin Doerr, Martin S. Krejca, Andre Opris
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: S: Evolutionary computation (2/2)
Show Abstract
The global simple evolutionary multi-objective optimizer (GSEMO) is a simple, yet often effective multi-objective evolutionary algorithm (MOEA). By only maintaining non-dominated solutions, it has a variable population size that automatically adjusts to the needs of the optimization process. The downside of the dynamic population size is that the population dynamics of this algorithm are harder to understand, resulting, e.g., in the fact that only sporadic tight runtime analyses exist. In this work, we significantly enhance our understanding of the dynamics of the GSEMO, in particular, for the classic CountingOnesCountingZeros (COCZ) benchmark. From this, we prove a lower bound of order Ω(n² log n), for the first time matching the seminal upper bounds known for over twenty years. We also show that the GSEMO finds any constant fraction of the Pareto front in time O(n²), improving over the previous estimate of O(n² log n) for the time to find the first Pareto optimum. Our methods extend to other classic benchmarks and yield, e.g., the first Ω(n^(k+1)) lower bound for the OJZJ benchmark in the case that the gap parameter is k ∈ {2,3}. We are therefore optimistic that our new methods will be useful in future mathematical analyses of MOEAs.
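The algorithm analyzed here is short enough to sketch. The code below follows the common textbook description of the GSEMO: pick a parent uniformly at random, flip each bit independently with probability 1/n, and keep the offspring unless a population member strictly dominates it. The COCZ definition used (maximize the number of ones, and the ones in the first half plus the zeros in the second half) is the form usually given in the literature and is assumed here. The function names are illustrative, and nothing in the sketch reproduces the paper's analysis.

```python
import random


def cocz(x):
    """CountingOnesCountingZeros: both objectives are maximized."""
    half = len(x) // 2
    ones_first, ones_second = sum(x[:half]), sum(x[half:])
    return (ones_first + ones_second, ones_first + (len(x) - half) - ones_second)


def weakly_dominates(u, v):
    """True if u is at least as good as v in every objective."""
    return all(a >= b for a, b in zip(u, v))


def gsemo(n, iterations, rng):
    """Run the GSEMO on COCZ and return {solution: objective vector}."""
    start = tuple(rng.randint(0, 1) for _ in range(n))
    population = {start: cocz(start)}
    for _ in range(iterations):
        parent = rng.choice(list(population))
        # Standard bit mutation: flip each bit independently with prob. 1/n.
        child = tuple(bit ^ (rng.random() < 1.0 / n) for bit in parent)
        f_child = cocz(child)
        # Discard the child if some member strictly dominates it.
        if any(weakly_dominates(f, f_child) and f != f_child
               for f in population.values()):
            continue
        # Otherwise remove everything the child weakly dominates and add it.
        population = {y: f for y, f in population.items()
                      if not weakly_dominates(f_child, f)}
        population[child] = f_child
    return population


final = gsemo(n=8, iterations=2000, rng=random.Random(0))
print(sorted(final.values()))
```

Because only mutually non-dominated solutions are kept, the population size varies during the run. On COCZ it can never exceed n/2 + 1, the number of distinct points on the Pareto front.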
5681: Outstanding Orthodontist: No More Artifactual Teeth in Talking Face
Authors: Zibo Su, Ziqi Zhang, Kun Wei, Xu Yang, Cheng Deng
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Computer vision (2/3)
Show Abstract
Audio-driven talking face synthesis (TFS) enables the creation of realistic speaking videos by combining a single facial image with a speech audio clip. Unlike other facial features that naturally deform during speech, teeth represent unique rigid structures whose shape and size should remain constant throughout the video sequence. However, current methods often produce temporal inconsistencies and artifacts in the teeth region, resulting in a less realistic appearance of the generated videos. To address this, we propose OrthoNet, a plug-and-play framework designed to eliminate unrealistic teeth effects in audio-driven TFS. Our method introduces a Detail-oriented Teeth Aligner module, designed to preserve teeth details and adapt to their shape. It works with a Memory-guided Teeth Stabilizer that integrates a long-term memory bank for global teeth structure and a short-term memory module for local temporal dynamics. Through this framework, OrthoNet acts like an orthodontist for existing Audio2Video methods, ensuring that teeth maintain natural rigidity and temporal consistency even under varying degrees of teeth occlusion. Extensive experiments demonstrate that our method makes the teeth in generated videos appear more natural during speech, significantly enhancing the temporal consistency and structural stability of audio-driven video generation.
5696: HeTa: Relation-wise Heterogeneous Graph Foundation Attack Model
Authors: Yuling Wang, Zihui Chen, Pengfei Jiao, Xiao Wang
Location: Guangzhou | Day: TBD
Show Abstract
Heterogeneous Graph Neural Networks (HGNNs) are vulnerable, highlighting the need for tailored attacks to assess their robustness and ensure security. However, existing HGNN attacks often require complex retraining of parameters to generate specific perturbations for new scenarios. Recently, foundation models have opened new horizons for the generalization of graph neural networks by capturing shared semantics across various graph distributions. This leads us to ask: Can we design a foundation attack model for HGNNs that enables generalizable perturbations across different HGNNs, and quickly adapts to new heterogeneous graphs (HGs)? Empirical findings reveal that, despite significant differences in model design and parameter space, different HGNNs surprisingly share common vulnerability patterns from a relation-aware perspective. Therefore, we explore how to design foundation HGNN attack criteria by mining shared attack units. In this paper, we propose a novel relation-wise heterogeneous graph foundation attack model, HeTa. We introduce a foundation surrogate model to align heterogeneity and identify the importance of shared relation-aware attack units. Building on this, we implement a serialized relation-by-relation attack based on the identified relational weights. In this way, the perturbation can be transferred to various target HGNNs and easily fine-tuned for new HGs. Extensive experiments demonstrate the strong attack performance and generalizability of our method.
5704: DiffECG: Diffusion Model-Powered Label-Efficient and Personalized Arrhythmia Diagnosis
Authors: Tianren Zhou, Zhenge Jia, Dongxiao Yu, Zhaoyan Shen
Location: Guangzhou | Day: TBD
Show Abstract
Arrhythmia diagnosis using electrocardiogram (ECG) is critical for preventing cardiovascular risks. However, existing deep learning-based methods struggle with label scarcity and contrastive learning-based methods suffer from false-negative samples, which lead to poor model generalization. Besides, due to inter-subject variability, pre-trained models cannot achieve uniform performance across individuals. Conducting model fine-tuning for each individual is computationally expensive and does not guarantee improvement. We propose DiffECG, a diffusion-based self-supervised learning framework for label-efficient and personalized arrhythmia detection. Our method utilizes a diffusion model to extract robust ECG representations, coupled with a novel feature extractor and a multi-modal feature fusion strategy to obtain a well-generalized model. Moreover, we propose an efficient model personalization mechanism based on zeroth-order optimization. It personalizes the model by tuning the noise-adding step t in the diffusion process, significantly reducing computational costs compared to model fine-tuning. Experimental results show that our proposed method outperforms the SOTA method by 37.9% and 23.9% in generalization and personalization performance, respectively. The source code is available at: https://github.com/Auguuust/DiffEC
5720: Gradient-based Causal Feature Selection
Authors: Zhaolong Ling, Mengxiang Guo, Xingyu Wu, Debo Cheng, Peng Zhou, Tianci Li, Zhangling Duan
Location: Guangzhou | Day: TBD
Show Abstract
Causal feature selection leverages causal discovery techniques to identify critical features associated with a target variable using observational data. Traditional methodologies primarily rely on constraint-based or score-based techniques, which are fraught with limitations. For example, conditional independence tests often yield unreliable results in the presence of noise and complex data generation processes, while the computational complexity of learning directed acyclic graphs increases exponentially with the number of variables involved. In light of recent advancements in deep learning, gradient-based methods have shown promise for global causal discovery. However, significant challenges arise when focusing on the identification of local causal features, particularly in defining the local causal constraint space to achieve both minimality and completeness. To address these issues, we introduce a novel gradient-based causal feature selection method (GCFS) that leverages an AutoEncoder to simultaneously model the target variable alongside other variables, thereby capturing causal associations within a divide-and-conquer framework. Additionally, our approach incorporates a mask pruning strategy that transforms the search process into the minimization of a non-cyclic local reconstruction loss objective function. This function is then effectively optimized using a gradient-based method to accurately identify the causal features related to the target variable. Experimental results substantiate that GCFS surpasses existing methodologies across both synthetic and real datasets.
5745: SketchAgent: Generating Structured Diagrams from Hand-Drawn Sketches
Authors: Cheng Tan, Qi Chen, Jingxuan Wei, Gaowei Wu, Zhangyang Gao, Siyuan Li, Bihui Yu, Ruifeng Guo, Stan Z. Li
Location: Guangzhou | Day: TBD
Show Abstract
Hand-drawn sketches are a natural and efficient medium for capturing and conveying ideas. Despite significant advancements in controllable natural image generation, translating freehand sketches into structured, machine-readable diagrams remains a labor-intensive and predominantly manual task. The primary challenge stems from the inherent ambiguity of sketches, which lack the structural constraints and semantic precision required for automated diagram generation. To address this challenge, we introduce SketchAgent, a multi-agent system designed to automate the transformation of hand-drawn sketches into structured diagrams. SketchAgent integrates sketch recognition, symbolic reasoning, and iterative validation to produce semantically coherent and structurally accurate diagrams, significantly reducing the need for manual effort. To evaluate the effectiveness of our approach, we propose the Sketch2Diagram Benchmark, a comprehensive dataset and evaluation framework encompassing eight diverse diagram categories, such as flowcharts, directed graphs, and model architectures. The dataset comprises over 6,000 high-quality examples with token-level annotations, standardized preprocessing, and rigorous quality control. By streamlining the diagram generation process, SketchAgent holds great promise for applications in design, education, and engineering, while offering a significant step toward bridging the gap between intuitive sketching and machine-readable diagram generation.
5749: Categorical Attention: Fine-grained Language-guided Noise Filtering Network for Occluded Person Re-Identification
Authors: Minghui Chen, Dayan Wu, Chenxu Yang, Qinghang Su, Zheng Lin
Location: Guangzhou | Day: TBD
Show Abstract
Person Re-Identification (ReID) aims to match individuals across different camera views, but occlusions in real-world scenarios, such as vehicles or crowds, hinder feature extraction and matching. Current occluded ReID methodologies typically leverage visual augmentation techniques in an attempt to mitigate the disruptive effects of occlusion-induced noise. However, relying solely on visual data fails to effectively filter out occlusion noise. In this paper, we introduce the Fine-grained Language-guided Noise Filtering Network (FLaN-Net) for occluded ReID. FLaN-Net innovatively employs a categorical attention mechanism to generate adaptive tokens that capture the following three distinct types of visual information: comprehensive descriptions of individuals, detailed visible attributes, and characteristics of occluding objects. Subsequently, a cross-attention mechanism aligns these prompts with the image, guiding the model to focus on relevant regions. To generate robust and discriminative features for occluded pedestrians, we further introduce a dynamic weighting fusion module that integrates visual, textual, and cross-attention features based on their reliability. Experimental results demonstrate that FLaN-Net outperforms existing methods on occluded ReID benchmarks, offering a robust solution for challenging real-world conditions.
5761: Conditional Denoising Meets Polynomial Modeling: A Flexible Decoupled Framework for Time Series Forecasting
Authors: Jintao Zhang, Mingyue Cheng, Xiaoyu Tao, Zhiding Liu, Daoyu Wang
Location: Guangzhou | Day: TBD
Show Abstract
Time series forecasting models are becoming increasingly prevalent due to their critical role in decision-making across various domains. However, most existing approaches represent the coupled temporal patterns, often neglecting the distinction between their specific components. In particular, fluctuating patterns and smooth trends within time series exhibit distinct characteristics. In this work, to model complicated temporal patterns, we propose a Conditional Denoising Polynomial Modeling (CDPM) framework, where probabilistic diffusion models and deterministic linear models are trained end-to-end. Instead of modeling the coupled time series, CDPM decomposes it into trend and seasonal components for modeling them separately. To capture the fluctuating seasonal component, we employ a probabilistic diffusion model based on statistical properties from the historical window. For the smooth trend component, a module is proposed to enhance linear models by incorporating historical dependencies, thereby preserving underlying trends and mitigating noise distortion. Extensive experiments conducted on six benchmarks demonstrate the effectiveness of our framework, highlighting the potential of combining probabilistic and deterministic models. Our code is available at https://github.com/zjt-gpu/CDPM.
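The trend/seasonal split mentioned in this abstract can be illustrated with a centered moving average. This is only a minimal sketch: the abstract does not say which decomposition CDPM uses, so the function name `decompose`, the edge handling, and the window size are all assumptions made for illustration.

```python
def decompose(series, window=3):
    """Split a series into a smooth trend and a residual "seasonal" part.

    Assumption: a centered moving average with edge replication stands in
    for whatever decomposition the paper actually uses.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    # Replicate the edge values so the trend has the same length as the input.
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    trend = [sum(padded[i:i + window]) / window for i in range(len(series))]
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal


series = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
trend, seasonal = decompose(series, window=3)
# By construction, the two components add back up to the original series.
reconstructed = [t + s for t, s in zip(trend, seasonal)]
```

In a decoupled setup of the kind the abstract describes, the smooth `trend` would go to a deterministic model and the fluctuating `seasonal` part to a probabilistic one.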
5788: ChronoFact: Timeline-based Temporal Fact Verification
Authors: Anab Maulana Barik, Wynne Hsu, Mong Li Lee
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: Natural Language Processing (1/2)
Show Abstract
Temporal claims, often riddled with inaccuracies, are a significant challenge in the digital misinformation landscape. Fact-checking systems that can accurately verify such claims are crucial for combating misinformation. Current systems struggle with the complexities of evaluating the accuracy of these claims, especially when they include multiple, overlapping, or recurring events. We introduce a novel timeline-based fact verification framework that identifies events from both claim and evidence and organizes them into their respective chronological timelines. The framework systematically examines the relationships between the events in both claim and evidence to predict the veracity of each claim event and their chronological accuracy. This allows us to accurately determine the overall veracity of the claim. We also introduce a new dataset of complex temporal claims involving timeline-based reasoning for the training and evaluation of our proposed framework. Experimental results demonstrate the effectiveness of our approach in handling the intricacies of temporal claim verification.
5804: Beyond Fixed Length: Bucket Pre-training is All You Need
Authors: Qing Yang, Qiyao Peng, Hongtao Liu, Kai Liu, Bing Qin, Ting Liu
Location: Guangzhou | Day: TBD
Show Abstract
Large Language Models (LLMs) have demonstrated exceptional performance across various tasks, with the pre-training stage serving as the cornerstone of their capabilities. However, the conventional fixed-length data composition strategy for pre-training presents several practical challenges. When using shorter sequences, documents are often truncated, potentially leading to information loss and affecting the model’s ability to capture long-range dependencies. Conversely, longer sequences require concatenation of multiple documents, which can introduce noise, disrupt natural document boundaries and semantic coherence, and incur substantial computational overhead. To address these challenges, we first establish three quantitative metrics for evaluating data composition quality: padding ratio, truncation ratio, and concatenation ratio. Building upon these metrics, we propose a novel multi-bucket data composition method that transcends the fixed-length paradigm. Our approach adaptively organizes training data to achieve optimal composition quality as measured by the proposed metrics, offering a more flexible and efficient approach for pre-training. We conduct extensive experiments and the results demonstrate that our proposed method significantly enhances both the efficiency and effectiveness of LLM pre-training. Our proposed method has been adopted in the Du Xiaoman–XuanYuan series of financial large language models at https://github.com/Duxiaoman-DI/XuanYuan.
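To make the three composition metrics concrete, here is a small sketch that computes them for a given packing of documents into training sequences. The abstract names the metrics but does not define them, so the definitions below (padding as unused slots, truncation as dropped document tokens, concatenation as the share of sequences holding more than one document) and the name `composition_metrics` are assumptions.

```python
def composition_metrics(sequences):
    """Compute assumed padding, truncation, and concatenation ratios.

    sequences: list of (capacity, chunks), where chunks is a list of
    (doc_len, kept_len) pairs for the documents placed in that sequence.
    """
    slots = sum(capacity for capacity, _ in sequences)
    kept = sum(k for _, chunks in sequences for _, k in chunks)
    total = sum(d for _, chunks in sequences for d, _ in chunks)
    multi = sum(1 for _, chunks in sequences if len(chunks) > 1)
    return {
        "padding_ratio": (slots - kept) / slots,      # unused slots
        "truncation_ratio": (total - kept) / total,   # dropped document tokens
        "concatenation_ratio": multi / len(sequences),
    }


# Three documents of length 3, 5, and 12.
# Fixed length 8: the two short documents are concatenated, the long one is cut.
fixed = composition_metrics([(8, [(3, 3), (5, 5)]), (8, [(12, 8)])])
# Hypothetical buckets of length 4, 8, and 16: one document per sequence.
bucketed = composition_metrics([(4, [(3, 3)]), (8, [(5, 5)]), (16, [(12, 12)])])
```

In this toy case the fixed-length packing has no padding but truncates and concatenates, while the bucketed packing trades some padding for zero truncation and zero concatenation.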
5810: Iterated Belief Change as Learning
Authors: Nicolas Schwind, Katsumi Inoue, Sébastien Konieczny, Pierre Marquis
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Knowledge Representation and Reasoning (1/4)
Show Abstract
In this work, we show how the class of improvement operators — a general class of iterated belief change operators — can be used to define a learning model. Focusing on binary classification, we present learning and inference algorithms suited to this learning model and we evaluate them empirically. Our findings highlight two key insights: first, that iterated belief change can be viewed as an effective form of online learning, and second, that the well-established axiomatic foundations of belief change operators offer a promising avenue for the axiomatic study of classification tasks.
5811: Learning Robust Multi-view Representation Using Dual-masked VAEs
Authors: Jiedong Wang, Kai Guo, Peng Hu, Xi Peng, Hao Wang
Location: Guangzhou | Day: TBD
Show Abstract
Most existing multi-view representation learning methods assume view-completeness and noise-free data. However, such assumptions are strong in real-world applications. Despite advances in methods tailored to view-missing or noise problems individually, a one-size-fits-all approach that concurrently addresses both remains unavailable. To this end, we propose a holistic method, called Dual-masked Variational Autoencoders (DualVAE), which aims at learning robust multi-view representation. The DualVAE exhibits an innovative amalgamation of dual-masked prediction, mixture-of-experts learning, representation disentangling, and a joint loss function in wrapping up all components. The key novelty lies in the dual-masked (view-mask and patch-mask) mechanism to mimic missing views and noisy data. Extensive experiments on four multi-view datasets show the effectiveness of the proposed method and its superior performance in comparison to baselines. The code is available at https://github.com/XLearning-SCU/2025-IJCAI-DualVAE.
5817: Lazy Testing of Machine-Learning Models
Authors: Anastasia Isychev, Valentin Wüstholz, Maria Christakis
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: MTA: Software engineering
Show Abstract
Checking the reliability of machine-learning models is a crucial, but challenging task. Nomos is an existing, automated framework for testing general, user-provided functional properties of models, including so-called hyperproperties expressed over more than one model execution. Nomos aims to find model inputs that expose “bugs”, that is, property violations. However, performing thousands of model invocations during testing is costly both in terms of time and money (for metered APIs, such as OpenAI’s).

We present LaZ (pronounced “lazy”), an extension of Nomos that automatically minimizes the number of model invocations to boost the test throughput and thereby find bugs more efficiently. During test execution, LaZ automatically identifies redundant invocations (invocations where the model output does not affect the final test outcome) and skips them, much like lazy evaluation in certain programming languages. This optimization enables a second one that dynamically reorders model invocations to skip the more expensive ones. As a result, LaZ finds the same number of bugs as Nomos, but does so a median of 33% and up to 60% faster.
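The idea of skipping invocations whose output cannot change the test outcome can be sketched with deferred calls and short-circuit evaluation. This is not the Nomos or LaZ API: `LazyCall`, `toy_model`, and `property_holds` are hypothetical names used only to illustrate the principle.

```python
class LazyCall:
    """Defer a model invocation until its output is actually needed."""

    def __init__(self, model, x):
        self.model = model
        self.x = x
        self._done = False
        self._value = None

    def value(self):
        # Invoke the model at most once, and only on first use.
        if not self._done:
            self._value = self.model(self.x)
            self._done = True
        return self._value


calls = []


def toy_model(x):
    # Hypothetical stand-in for an expensive (for example, metered) model call.
    calls.append(x)
    return x >= 0


def property_holds(a, b):
    # Example property over two executions: at least one output is True.
    # `or` short-circuits, so b is never invoked once a decides the outcome.
    return a.value() or b.value()


result = property_holds(LazyCall(toy_model, 5), LazyCall(toy_model, -1))
```

Here only the first invocation runs, because its result already determines the property. Putting the cheaper call first in such an expression is the kind of reordering the abstract alludes to.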
5828: EchoGPT: An Interactive Cardiac Function Assessment Model for Echocardiogram Videos
Authors: Bo Xu, Quanhao Zhu, Qingchen Zhang, Mengmeng Wang, Liang Zhao, Hongfei Lin, Jing Ren, Feng Xia
Location: Guangzhou | Day: TBD
Show Abstract
With the development of wearable cardiac ultrasound devices, it is no longer sufficient to solely rely on doctors for diagnosing long-term echocardiogram videos. Automated diagnosis of echocardiogram videos has now become a research hotspot. Existing studies only analyze echocardiogram video through discriminative models, which have limited question-answering capabilities. Therefore, this study innovatively proposes a large language model with cardiac ultrasound diagnostic capabilities, EchoGPT. EchoGPT integrates the robust communication and comprehension capabilities of large language models (LLMs) with the diagnostic prowess of traditional medical models, empowering patients to obtain accurate medical indicator data and comprehend their health conditions through interactive questioning with the model. The model is capable of local deployment on personal computers, effectively safeguarding user privacy. EchoGPT operates through three main components: left ventricle segmentation, left ventricular ejection fraction (LVEF) prediction, and finetuning of video-text LLMs. Experimental results demonstrate EchoGPT’s superior accuracy in predicting LVEF compared to other models, and positive feedback from professional physicians through questionnaire surveys, validating its potential in practical applications. The demo is available at https://github.com/zhuqh19/EchoGPT.
5835: Explainable Graph Representation Learning via Graph Pattern Analysis
Authors: Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan
Location: Guangzhou | Day: TBD
Show Abstract
Explainable artificial intelligence (XAI) is an important area in the AI community, and interpretability is crucial for building robust and trustworthy AI models. While previous work has explored model-level and instance-level explainable graph learning, there has been limited investigation into explainable graph representation learning. In this paper, we focus on representation-level explainable graph learning and ask a fundamental question: What specific information about a graph is captured in graph representations? Our approach is inspired by graph kernels, which evaluate graph similarities by counting substructures within specific graph patterns. Although the pattern counting vector can serve as an explainable representation, it has limitations such as ignoring node features and being high-dimensional. To address these limitations, we introduce a framework (PXGL-GNN) for learning and explaining graph representations through graph pattern analysis. We start by sampling graph substructures of various patterns. Then, we learn the representations of these patterns and combine them using a weighted sum, where the weights indicate the importance of each graph pattern’s contribution. We also provide theoretical analyses of our methods, including robustness and generalization. In our experiments, we show how to learn and explain graph representations for real-world data using pattern analysis. Additionally, we compare our method against multiple baselines in both supervised and unsupervised learning tasks to demonstrate its effectiveness.
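The weighted combination of pattern representations can be sketched as follows. The abstract only states that a weighted sum is used and that the weights indicate each pattern's importance; normalizing the weights with a softmax, and the names `pattern_weights` and `combine`, are assumptions made for this illustration.

```python
import math


def pattern_weights(logits):
    """Turn unconstrained scores into positive weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine(representations, logits):
    """Weighted sum of per-pattern representation vectors."""
    weights = pattern_weights(logits)
    dim = len(representations[0])
    combined = [
        sum(w * rep[d] for w, rep in zip(weights, representations))
        for d in range(dim)
    ]
    return combined, weights


# Two hypothetical pattern representations (for example, paths and cycles).
reps = [[1.0, 0.0], [0.0, 1.0]]
graph_repr, weights = combine(reps, [0.0, 0.0])
```

Reading off `weights` after training is what would make such a representation explainable: a larger weight means the corresponding pattern contributes more to the graph representation.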
5859: Code-BT: A Code-Driven Approach to Behavior Tree Generation for Robot Tasks Planning with Large Language Models
Authors: Siyang Zhang, Bin Li, Jingtao Qi, Xueying Wang, Fu Li, Jianan Wang, En Zhu, Jinjing Sun
Location: Guangzhou | Day: TBD
Show Abstract
Behavior trees (BTs) provide a systematic and structured control architecture extensively employed in game AI and robotic behavior control, owing to their modularity, reactivity, and reusability. Nonetheless, manual BT design requires significant expertise and becomes inefficient as task complexity increases. Recent automation technologies have avoided manual work, but often have high application barriers and face challenges in adapting to new tasks, making it difficult to easily configure them to specific requirements. Code-BT introduces a novel approach that utilizes large language models (LLMs) to automatically generate BTs, representing the task planning process as the process of coding and organizing sequences. By retrieving control flow information from the generated code, BTs can be efficiently constructed to address the complexity and diversity of task planning challenges. Rather than relying on manual design, Code-BT uses task instructions to guide the selection of relevant APIs, and then systematically assembles these APIs into modular code to align with the BT structure. Finally, action sequences and control logic are extracted from the generated code to construct the BTs. Our approach not only ensures the automation of BT generation but also guarantees the scalability and adaptability for long-term tasks. Experimental results demonstrate that Code-BT substantially improves LLM performance in BT generation, achieving improvements ranging from 16.67% to 29.17%.
5868: StarFT: Robust Fine-tuning of Zero-shot Models via Spuriosity Alignment
Authors: Younghyun Kim, Jongheon Jeong, Sangkyung Kwak, Kyungmin Lee, Juho Lee, Jinwoo Shin
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Machine Learning (3/4)
Show Abstract
Learning robust representations from data often requires scale, which has led to the success of recent zero-shot models such as CLIP. However, the obtained robustness can easily be deteriorated when these models are fine-tuned on other downstream tasks (e.g., of smaller scales). Previous works often interpret this phenomenon in the context of domain shift, developing fine-tuning methods that aim to preserve the original domain as much as possible. However, in a different context, fine-tuned models with limited data are also prone to learning features that are spurious to humans, such as background or texture. In this paper, we propose StarFT (Spurious Textual Alignment Regularization), a novel framework for fine-tuning zero-shot models to enhance robustness by preventing them from learning spuriosity. We introduce a regularization that aligns the output distribution for spuriosity-injected labels with the original zero-shot model, ensuring that the model is not induced to extract irrelevant features further from these descriptions. We leverage recent language models to get such spuriosity-injected labels by generating alternative textual descriptions that highlight potentially confounding features. Extensive experiments validate the robust generalization of StarFT and its emerging properties: zero-shot group robustness and improved zero-shot classification. Notably, StarFT boosts both worst-group and average accuracy by 14.30% and 3.02%, respectively, in the Waterbirds group shift scenario, where other robust fine-tuning baselines show even degraded performance.
5882: Token-Level Accept or Reject: A Micro Alignment Approach for Large Language Models
Authors: Yang Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung
Location: Guangzhou | Day: TBD
Show Abstract
With the rapid development of Large Language Models (LLMs), aligning these models with human preferences and values is critical to ensuring ethical and safe applications. However, existing alignment techniques such as RLHF or DPO often require direct fine-tuning on LLMs with billions of parameters, resulting in substantial computational costs and inefficiencies. To address this, we propose the Micro token-level Accept-Reject Aligning (MARA) approach, designed to operate independently of the language models. MARA simplifies the alignment process by decomposing sentence-level preference learning into token-level binary classification, where a compact three-layer fully-connected network determines whether candidate tokens are “Accepted” or “Rejected” as part of the response. Extensive experiments across seven different LLMs and three open-source datasets show that MARA achieves significant improvements in alignment performance while reducing computational costs. The source code and implementation details are publicly available at https://github.com/IAAR-Shanghai/MARA, and the trained models are released at https://huggingface.co/IAAR-Shanghai/MARA_AGENTS.
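A token-level accept/reject loop of the kind described can be sketched with a tiny fully connected classifier. Everything concrete here is an assumption: the ReLU activations, the hand-set weights, the feature lookup, the 0.5 threshold, and the fallback to the top candidate when every token is rejected. It is not the released MARA implementation.

```python
import math


def mlp_accept_prob(features, layers):
    """Forward pass of a small fully connected net ending in a sigmoid.

    layers: list of (weights, biases); weights[j] is the weight row of unit j.
    ReLU between layers is an assumption; the abstract does not specify it.
    """
    h = features
    for idx, (weights, biases) in enumerate(layers):
        h = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            h = [max(0.0, v) for v in h]
    return 1.0 / (1.0 + math.exp(-h[0]))


def pick_token(candidates, featurize, layers, threshold=0.5):
    """Return the first accepted candidate, else fall back to the first one."""
    for token in candidates:
        if mlp_accept_prob(featurize(token), layers) >= threshold:
            return token
    return candidates[0]  # assumed fallback when every candidate is rejected


# Hypothetical two-dimensional token features and a hand-set three-layer net.
FEATURES = {"bad": [0.0, 2.0], "good": [2.0, 0.0]}
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
layers = [identity, identity, ([[1.0, -1.0]], [0.0])]
featurize = lambda token: FEATURES[token]

chosen = pick_token(["bad", "good"], featurize, layers)
```

Because the classifier only sees candidate tokens proposed by the language model, the language model itself stays frozen, which is the source of the cost savings the abstract claims.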
5893: Non-Obvious Manipulability in Additively Separable and Fractional Hedonic Games
Authors: Diodato Ferraioli, Giovanna Varricchio
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Game Theory
Show Abstract
In this work, we consider the design of Non-Obviously Manipulable (NOM) mechanisms, mechanisms that bounded rational agents may fail to recognize as manipulable, for two relevant classes of succinctly representable Hedonic Games: Additively Separable and Fractional Hedonic Games. In these classes, agents have cardinal scores towards other agents, and their preferences over coalitions are determined by aggregating such scores.
This aggregation results in a utility function for each agent, which enables the evaluation of outcomes via the utilitarian social welfare.
We first prove that, when scores can be arbitrary, every optimal mechanism is NOM; moreover, when scores are limited in a continuous interval, an optimal mechanism that is NOM exists.
Given the hardness of computing optimal outcomes in these settings, we turn our attention to efficient and NOM mechanisms. To this aim, we first prove a characterization of NOM mechanisms that simplifies the class of mechanisms of interest. Then, we design a NOM mechanism returning approximations that asymptotically match the best-known approximation achievable in polynomial time.
Finally, we focus on discrete scores, where the compatibility of NOM with optimality depends on the specific values.
Therefore, we initiate a systematic analysis to identify which discrete values support this compatibility and which do not.
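For readers unfamiliar with the two game classes, the sketch below computes agent utilities and utilitarian social welfare under the usual definitions: in an additively separable game an agent sums its scores for the other members of its coalition, and in a fractional game that sum is divided by the coalition size. The abstract does not spell these formulas out, so they are stated here as an assumption, and the function names are illustrative.

```python
def utility(agent, coalition, scores, fractional=False):
    """Utility of `agent` for `coalition` given cardinal scores."""
    total = sum(scores[agent][other] for other in coalition if other != agent)
    # Fractional hedonic games average over the coalition size.
    return total / len(coalition) if fractional else total


def social_welfare(partition, scores, fractional=False):
    """Utilitarian social welfare: the sum of all agents' utilities."""
    return sum(
        utility(agent, coalition, scores, fractional)
        for coalition in partition
        for agent in coalition
    )


# Hypothetical scores: scores[i][j] is how much agent i values agent j.
scores = {
    0: {1: 2, 2: -1},
    1: {0: 2, 2: 0},
    2: {0: -1, 1: 0},
}
pairs = [{0, 1}, {2}]
grand = [{0, 1, 2}]
welfare_as = social_welfare(pairs, scores)
welfare_fr = social_welfare(pairs, scores, fractional=True)
```

In this example, keeping agent 2 alone gives higher welfare than the grand coalition under both definitions.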
5898: Leveraging Personalized PageRank and Higher-Order Topological Structures for Heterophily Mitigation in Graph Neural Networks
Authors: Yumeng Wang, Zengyi Wo, Wenjun Wang, Xingcheng Fu, Minglai Shao
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) excel in node classification tasks but often assume homophily, where connected nodes share similar labels. This assumption does not hold in many real-world heterophilic graphs. Existing models for heterophilic graphs primarily rely on pairwise relationships, overlooking multi-scale information from higher-order structures. This leads to suboptimal performance, particularly under noise from conflicting class information across nodes. To address these challenges, we propose HPGNN, a novel model integrating Higher-order Personalized PageRank with Graph Neural Networks. HPGNN introduces an efficient high-order approximation of Personalized PageRank (PPR) to capture long-range and multiscale node interactions. This approach reduces computational complexity and mitigates noise from surrounding information. By embedding higher-order structural information into convolutional networks, HPGNN effectively models key interactions across diverse graph dimensions. Extensive experiments on benchmark datasets demonstrate HPGNN’s effectiveness. The model achieves better performance than five out of seven state-of-the-art methods on heterophilic graphs in downstream tasks while maintaining competitive performance on homophilic graphs. HPGNN’s ability to balance multi-scale information and robustness to noise makes it a versatile solution for real-world graph learning challenges. Codes are available at https://github.com/streetcorner/HPGNN.
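As background for the Personalized PageRank (PPR) component, the sketch below computes ordinary PPR by power iteration on an adjacency list. It is not the higher-order approximation that HPGNN introduces; the teleport probability `alpha`, the iteration count, and the handling of dangling nodes are assumptions.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=100):
    """Power iteration for PPR with restart at `source`.

    adj: list of neighbor lists. With probability alpha the walk teleports
    back to `source`; otherwise it moves to a uniformly random neighbor.
    """
    n = len(adj)
    pi = [0.0] * n
    pi[source] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        nxt[source] += alpha * sum(pi)
        for u in range(n):
            if not adj[u]:
                # Assumed convention: dangling nodes send their mass to source.
                nxt[source] += (1 - alpha) * pi[u]
                continue
            share = (1 - alpha) * pi[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        pi = nxt
    return pi


# Path graph 0 - 1 - 2, personalized to node 0.
path = [[1], [0, 2], [1]]
ppr = personalized_pagerank(path, source=0)
```

The resulting scores decay with distance from the source along this path, which is why PPR is a natural way to weight long-range neighbors.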
5901: MC3D-AD: A Unified Geometry-aware Reconstruction Model for Multi-category 3D Anomaly Detection
Authors: Jiayi Cheng, Can Gao, Jie Zhou, Jiajun Wen, Tao Dai, Jinbao Wang
Location: Guangzhou | Day: TBD
Show Abstract
3D Anomaly Detection (AD) is a promising means of controlling the quality of manufactured products. However, existing methods typically require carefully training a task-specific model for each category independently, leading to high cost, low efficiency, and weak generalization. This study presents a novel unified model for Multi-Category 3D Anomaly Detection (MC3D-AD) that aims to utilize both local and global geometry-aware information to reconstruct normal representations of all categories. First, to learn robust and generalized features of different categories, we propose an adaptive geometry-aware masked attention module that extracts geometry variation information to guide mask attention. Then, we introduce a local geometry-aware encoder reinforced by the improved mask attention to encode group-level feature tokens. Finally, we design a global query decoder that utilizes point cloud position embeddings to improve the decoding process and reconstruction ability. This leads to local and global geometry-aware reconstructed feature tokens for the 3D AD task. MC3D-AD is evaluated on two publicly available datasets, Real3D-AD and Anomaly-ShapeNet, and exhibits significant superiority over current state-of-the-art single-category methods, achieving 3.1% and 9.3% improvement in object-level AUROC over Real3D-AD and Anomaly-ShapeNet, respectively. The code is available at https://github.com/iCAN-SZU/MC3D-AD.
5903: Heterogeneous Temporal Hypergraph Neural Network
Authors: Huan Liu, Pengfei Jiao, Mengzhou Gao, Chaochao Chen, Di Jin
Location: Guangzhou | Day: TBD
Show Abstract
Graph representation learning (GRL) has emerged as an effective technique for modeling graph-structured data. When modeling heterogeneity and dynamics in real-world complex networks, GRL methods designed for complex heterogeneous temporal graphs (HTGs) have been proposed and have achieved successful applications in various fields. However, most existing GRL methods mainly focus on preserving the low-order topology information while ignoring higher-order group interaction relationships, which are more consistent with real-world networks. In addition, most existing hypergraph methods can only model static homogeneous graphs, limiting their ability to model high-order interactions in HTGs. Therefore, to enable the GRL model to capture high-order interaction relationships in HTGs, we first propose a formal definition of heterogeneous temporal hypergraphs and a P-uniform heterogeneous hyperedge construction algorithm that does not rely on additional information. Then, a novel Heterogeneous Temporal HyperGraph Neural network (HTHGN) is proposed to fully capture higher-order interactions in HTGs. HTHGN contains a hierarchical attention mechanism module that simultaneously performs temporal message-passing between heterogeneous nodes and hyperedges to capture rich semantics in a wider receptive field brought by hyperedges. Furthermore, HTHGN performs contrastive learning by maximizing the consistency between low-order correlated heterogeneous node pairs on HTG to avoid the low-order structural ambiguity issue. Detailed experimental results on three real-world HTG datasets verify the effectiveness of the proposed HTHGN for modeling high-order interactions in HTGs and demonstrate significant performance improvements.
5905: Performance Guaranteed Poisoning Attacks in Federated Learning: A Sliding Mode Approach
Authors: Huazi Pan, Yanjun Zhang, Leo Yu Zhang, Scott Adams, Abbas Kouzani, Suiyang Khoo
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Federated Learning
Show Abstract
Manipulation of local training data and local updates, i.e., the poisoning attack, is the main threat arising from the collaborative nature of the federated learning (FL) paradigm. Most existing poisoning attacks aim to manipulate local data/models in a way that causes denial-of-service (DoS) issues. In this paper, we introduce a novel attack method, named the Federated Learning Sliding Attack (FedSA) scheme, which aims to control the extent of poisoning precisely and subtly. It operates with a predefined objective, such as reducing the global model’s prediction accuracy by 10%.
FedSA integrates robust nonlinear control theory, namely Sliding Mode Control (SMC), with model poisoning attacks. It can manipulate the updates from malicious clients to drive the global model towards a compromised state, achieving this at a controlled and inconspicuous rate. Additionally, leveraging the robust control properties of FedSA allows precise control over the convergence bounds, enabling the attacker to set the global accuracy of the poisoned model to any desired level. Experimental results demonstrate that FedSA can accurately achieve a predefined global accuracy with fewer malicious clients while maintaining a high level of stealth and adjustable learning rates.
5912: Improving Efficiency of Answer Set Planning with Rough Solutions from Large Language Models for Robotic Task Planning
Authors: Xinrui Lin, Yangfan Wu, Huanyu Yang, Yuting Huang, Yu Zhang, Jianmin Ji, Yanyong Zhang
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: KRR: Logic programming
Show Abstract
Answer Set Programming (ASP) planning can be used to refine the rough solutions generated by Large Language Models (LLMs) to handle specific restrictions of actions, i.e., reconstruct the rough solutions to be executable, for robotic task planning. However, it is still challenging to efficiently solve ASP programs that have multiple variables with large domains, which prevents this use of ASP planning in real-world task planning problems. In this paper, we consider how to reduce the domains of variables without losing possible solutions for ASP planning, given these rough solutions from LLMs. Based on this reduction, we introduce CLMASP, an approach that couples LLMs with ASP for robotic task planning. We evaluate CLMASP on the VirtualHome platform for common indoor tasks, demonstrating a significant improvement in the executable rate from under 10% to nearly 90% and reducing average ASP planning time from over 2 hours to under 5 seconds. Code is available at https://github.com/CLMASP/CLMASP.
5914: No Regret Reinforcement Learning Algorithms for Online Scheduling with Multi-Stage Tasks
Authors: Yongxin Xu, Hengquan Guo, Ziyu Shao, Xin Liu
Location: Guangzhou | Day: TBD
Show Abstract
We study online task scheduling problems where tasks arrive sequentially and are processed by the platform or server. The service processes for tasks are multi-stage and are modeled as episodic Markov Decision Processes (MDPs). While processing a task, the system acquires rewards by consuming resources. The goal of the platform is to maximize the reward-to-cost ratio over a sequence of K tasks.
Online scheduling with multi-stage tasks faces two major challenges: intra-dependence among the different stages within a task and inter-dependence among different tasks. These challenges are further exacerbated by the unknown rewards, costs, and task arrival distribution. To address these challenges, we propose the Robbins-Monro-based Value Iteration for Ratio Maximization (RM^2VI) algorithm. Specifically, RM^2VI addresses “intra-dependence” through optimistic value iteration and handles “inter-dependence” using the Robbins-Monro method. The algorithm has a greedy structure and achieves a sub-linear regret of O(K^(3/4)), establishing the no-regret property (per-task).
We test RM^2VI in two synthetic experiments of sale promotion in E-commerce and machine learning job training in cloud computing. The results show RM^2VI achieves the best reward-to-cost ratio compared with the baselines.
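The role of the Robbins-Monro method in tracking a reward-to-cost ratio can be illustrated in isolation. The sketch below finds the root of E[r - theta * c] = 0, which is theta = E[r] / E[c], from a stream of (reward, cost) samples. It omits the value-iteration part of RM^2VI entirely, and the step size 1 / (k + 1) and the name `track_ratio` are assumptions.

```python
def track_ratio(rewards, costs, theta0=0.0):
    """Robbins-Monro iteration for the root of E[r - theta * c] = 0.

    The root is E[r] / E[c], the long-run reward-to-cost ratio.
    Assumption: step size 1 / (k + 1); the paper's schedule may differ.
    """
    theta = theta0
    for k, (r, c) in enumerate(zip(rewards, costs)):
        # Move theta in the direction of the noisy residual r - theta * c.
        theta += (r - theta * c) / (k + 1)
    return theta


# Synthetic stream: rewards alternate between 1 and 3 (mean 2), cost is 4.
rewards = [1.0, 3.0] * 1000
costs = [4.0, 4.0] * 1000
theta = track_ratio(rewards, costs)
```

On this stream the estimate settles near 0.5, the ratio of mean reward to mean cost, even though no individual task has exactly that ratio.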
5920: Logarithmic Approximations for Fair k-Set Selection
Authors: Shi Li, Chenyang Xu, Ruilong Zhang
Location: Guangzhou | Day: TBD
Show Abstract
We study the fair k-set selection problem where we aim to select k sets from a given set system such that the (weighted) numbers of times each element appears in these k selected sets are balanced, i.e., the maximum (weighted) number of occurrences is minimized. By observing that a set system can be formulated as a bipartite graph G := (L ∪ R, E), our problem is equivalent to selecting k vertices from R such that the maximum (weighted) number of selected neighbors of vertices in L is minimized. The problem arises in a wide range of applications in various fields, such as machine learning, artificial intelligence, and operations research.

We first prove that the problem is NP-hard even if the maximum degree Δ of the input bipartite graph is 3, and the problem is in P when Δ = 2. We then show that the problem is also in P when the input set system forms a laminar family. Based on an intuitive linear program, we show that two rounding algorithms achieve O(log n / log log n)-approximation on general bipartite graphs, and an independent rounding algorithm achieves O(log Δ)-approximation on bipartite graphs with maximum degree Δ. We demonstrate that our analysis is almost tight by providing a hard instance for this linear program.
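The objective of the unweighted problem can be made concrete with a brute-force search over all choices of k sets. This is only a definition-level sketch for tiny inputs: it takes exponential time and is unrelated to the LP rounding algorithms in the paper. The names `max_load` and `best_selection` are illustrative.

```python
from itertools import combinations


def max_load(sets, chosen):
    """Largest number of chosen sets that any single element appears in."""
    counts = {}
    for idx in chosen:
        for element in sets[idx]:
            counts[element] = counts.get(element, 0) + 1
    return max(counts.values(), default=0)


def best_selection(sets, k):
    """Try every k-subset of set indices and keep one with minimum max load."""
    return min(
        combinations(range(len(sets)), k),
        key=lambda chosen: max_load(sets, chosen),
    )


# Four sets over elements 1..4 arranged in a cycle.
sets = [{1, 2}, {2, 3}, {3, 4}, {1, 4}]
best = best_selection(sets, k=2)
```

Here picking two disjoint sets keeps every element's count at 1, whereas any two adjacent sets would give some element a count of 2.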
5923: AttentionDrag: Exploiting Latent Correlation Knowledge in Pre-trained Diffusion Models for Image Editing
Authors: Biao Yang, Muqi Huang, Yuhui Zhang, Yun Xiong, Kun Zhou, Xi Chen, Shiyang Zhou, Huishuai Bao, Chuan Li, Feng Shi, Hualei Liu
Location: Guangzhou | Day: TBD
Traditional point-based image editing methods rely on iterative latent optimization or geometric transformations, which are either inefficient in their processing or fail to capture the semantic relationships within the image. These methods often overlook the powerful yet underutilized image editing capabilities inherent in pre-trained diffusion models. In this work, we propose a novel one-step point-based image editing method, named AttentionDrag, which leverages the inherent latent knowledge and feature correlations within pre-trained diffusion models for image editing tasks. This framework enables semantic consistency and high-quality manipulation without the need for extensive re-optimization or retraining. Specifically, we reuse the latent correlation knowledge learned by the self-attention mechanism in the U-Net module during the DDIM inversion process to automatically identify and adjust relevant image regions, ensuring semantic validity and consistency. Additionally, AttentionDrag adaptively generates masks to guide the editing process, enabling precise and context-aware modifications with friendly interaction. Our results demonstrate a performance that surpasses most state-of-the-art methods with significantly faster speeds, showing a more efficient and semantically coherent solution for point-based image editing tasks. Code is released at: https://github.com/GPlaying/AttentionDrag.
5931: Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation
Authors: Feizhen Huang, Yu Wu, Yutian Lin, Bo Du
Location: Guangzhou | Day: TBD
Video-to-Audio (V2A) generation has achieved significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating the cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.
5933: Active Multimodal Distillation for Few-shot Action Recognition
Authors: Weijia Feng, Yichen Zhu, Ruojia Zhang, Chenyang Wang, Fei Ma, Xiaobao Wang, Xiaobai Li
Location: Guangzhou | Day: TBD
Owing to its rapid progress and broad application prospects, few-shot action recognition has attracted considerable interest. However, current methods are predominantly based on limited single-modal data, which does not fully exploit the potential of multimodal information. This paper presents a novel framework that actively identifies reliable modalities for each sample using task-specific contextual cues, thus significantly improving recognition performance. Our framework integrates an Active Sample Inference (ASI) module, which utilizes active inference to predict reliable modalities based on posterior distributions and subsequently organizes them accordingly. Unlike reinforcement learning, active inference replaces rewards with evidence-based preferences, making more stable predictions.
Additionally, we introduce an active mutual distillation module that enhances the representation learning of less reliable modalities by transferring knowledge from more reliable ones. Adaptive multimodal inference is employed during the meta-test to assign higher weights to reliable modalities. Extensive experiments across multiple benchmarks demonstrate that our method significantly outperforms existing approaches.
5956: MaskDGNN: Self-Supervised Dynamic Graph Neural Networks with Activeness-aware Temporal Masking
Authors: Yiming He, Xiang Li, Zhongying Zhao, Haobing Liu, Peilan He, Yanwei Yu
Location: Guangzhou | Day: TBD
Integrating dynamics into graph neural networks (GNNs) provides deeper insights into the evolution of dynamic graphs, thereby enhancing the temporal representation in real-world dynamic network problems. Existing methods for extracting critical information from dynamic graphs face two key challenges: they either overlook the negative impact of redundant information or struggle to address the distribution shift in dynamic graphs. To address these challenges, we propose MaskDGNN, a novel dynamic GNN architecture that consists of two modules. First, a self-supervised activeness-aware temporal masking mechanism selectively retains edges between highly active nodes while masking those with low activeness, effectively reducing redundancy. Second, an adaptive frequency-enhancing graph representation learner amplifies the frequency-domain features of nodes to capture intrinsic features under distribution shift. Experiments on five real-world dynamic graph datasets demonstrate that MaskDGNN outperforms state-of-the-art methods, achieving an average improvement of 7.07% in accuracy and 13.87% in MRR for link prediction tasks.
5957: Enhanced Unsupervised Discriminant Dimensionality Reduction for Nonlinear Data
Authors: Qianqian Wang, Mengping Jiang, Wei Feng, Zhengming Ding
Location: Guangzhou | Day: TBD
Linear Discriminant Analysis (LDA) is a classical supervised dimensionality reduction algorithm. However, LDA focuses more on global structure and overly depends on reliable data labels. For data with outliers and nonlinear structures, LDA cannot effectively capture the true structure of the data. Moreover, the subspace dimension learned by LDA must be smaller than the number of clusters, which limits its practical applications. To address these issues, we propose a novel unsupervised LDA method that combines centerless K-means and LDA. This method eliminates the need to calculate cluster centroids and improves model robustness. By fusing centerless K-means and LDA into a unified framework and deducing the connection between K-means and manifold learning, this method captures both the local manifold structure and the discriminative structure. Additionally, the dimensionality of the subspace is not restricted. This method not only overcomes the limitations of traditional LDA but also improves the model’s adaptability to complex data. Extensive experiments on seven datasets demonstrate the effectiveness of the proposed method.
5960: Solving Copyright Infringement on Short Video Platforms: Novel Datasets and an Audio Restoration Deep Learning Pipeline
Authors: Minwoo Oh, Minsu Park, Eunil Park
Location: Montreal | Day: August 21st | Time: 15:00 | Session: CV: videos
Short video platforms like YouTube Shorts and TikTok face significant copyright compliance challenges, as infringers frequently embed arbitrary background music (BGM) to obscure original soundtracks (OST) and evade content originality detection. To tackle this issue, we propose a novel pipeline that integrates Music Source Separation (MSS) and cross-modal video-music retrieval (CMVMR). Our approach effectively separates arbitrary BGM from the original OST, enabling the restoration of authentic video audio tracks. To support this work, we introduce two domain-specific datasets: OASD-20K for audio separation and OSVAR-160 for pipeline evaluation. OASD-20K contains 20,000 audio clips featuring mixed BGM and OST pairs, while OSVAR-160 is a unique benchmark dataset comprising 1,121 video and mixed-audio pairs, specifically designed for short video restoration tasks. Experimental results demonstrate that our pipeline not only removes arbitrary BGM with high accuracy but also restores OSTs, ensuring content integrity. This approach provides an ethical and scalable solution to copyright challenges in user-generated content on short video platforms.
5979: Learning to Explain: Towards Human-Aligned Explainability in Deep Reinforcement Learning via Attention Guidance
Authors: Bokai Ji, Guangxia Li, Yulong Shen, Gang Xiao
Location: Guangzhou | Day: TBD
Recent advances in explainable deep reinforcement learning (DRL) have provided insights into the reasoning behind decisions made by DRL agents. However, existing methods often overlook the subjective nature of explanations and fail to consider human cognitive styles and preferences. This omission tends to reduce the interpretability and relevance of the generated explanations from a human evaluator’s perspective. To address this issue, we introduce human cognition into the explaining procedure by integrating DRL with attention guidance in a novel manner. The proposed concept proximal policy optimization (Concept-PPO) learns to generate human-aligned explanations by jointly optimizing the DRL performance and the discrepancy between generated explanations and human annotations. Its key component is a specially designed spatial concept transformer that can enhance explaining efficiency by premasking decision-irrelevant information. Experiments on the ATARI benchmark demonstrate that Concept-PPO achieves better policies than its black-box counterparts, and user studies confirm its superiority in generating human-aligned explanations compared to existing explainable DRL methods.
5992: Fair Submodular Maximization over a Knapsack Constraint
Authors: Lijun Li, Chenyang Xu, Liuyi Yang, Ruilong Zhang
Location: Guangzhou | Day: TBD
We consider fairness in submodular maximization subject to a knapsack constraint, a fundamental problem with various applications in economics, machine learning, and data mining. In the model, we are given a set of ground elements, each associated with a cost and a color, and a monotone submodular function defined over them. The goal is to maximize the submodular function while guaranteeing that the total cost does not exceed a specified budget (the knapsack constraint) and that the number of elements selected for each color falls within a designated range (the fairness constraint).

While there exists some recent literature on this topic, the existence of a non-trivial approximation for the problem — without relaxing either the knapsack or fairness constraints — remains a challenging open question. This paper makes progress in this direction. We demonstrate that when the number of colors is constant, there exists a polynomial-time algorithm that achieves a constant approximation with high probability. Additionally, we show that if either the knapsack or fairness constraint is relaxed only to require expected satisfaction, a tight approximation ratio of (1-1/e-epsilon) can be obtained in expectation for any epsilon >0.
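The two constraints described above can be illustrated with a small feasibility check. This is only a sketch of the problem statement, not the paper's algorithm: the coverage objective is one standard example of a monotone submodular function, and all names and the toy instance below are assumptions made for illustration.

```python
from collections import Counter


def coverage_value(selected, covers):
    """Monotone submodular objective: number of items covered by the selection."""
    covered = set()
    for element in selected:
        covered |= covers[element]
    return len(covered)


def is_feasible(selected, cost, color, budget, bounds):
    """Check the knapsack constraint and the per-color range (fairness) constraint."""
    if sum(cost[e] for e in selected) > budget:
        return False  # knapsack constraint violated
    counts = Counter(color[e] for e in selected)
    # Every color must have a selected count within its [lower, upper] range.
    return all(lo <= counts[c] <= hi for c, (lo, hi) in bounds.items())


# Toy instance (hypothetical values).
covers = {"x": {1, 2}, "y": {2, 3}, "z": {4}}
cost = {"x": 2, "y": 2, "z": 1}
color = {"x": "red", "y": "red", "z": "blue"}
bounds = {"red": (1, 1), "blue": (1, 1)}
budget = 3

print(is_feasible({"x", "z"}, cost, color, budget, bounds))  # within budget, one per color
print(coverage_value({"x", "z"}, covers))
```

A solution must pass `is_feasible` exactly, with neither constraint relaxed, which is the setting the abstract identifies as the open question.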
5994: MSMAR-RL: Multi-Step Masked-Attention Recovery Reinforcement Learning for Safe Maneuver Decision in High-Speed Pursuit-Evasion Game
Authors: Yang Zhao, Wenzhe Zhao, Xuelong Li
Location: Guangzhou | Day: TBD
Ensuring the safety of high-speed agents in dynamic adversarial environments, such as pursuit-evasion games with target pursuit and obstacle avoidance, is a significant challenge. Existing reinforcement learning methods often fail to balance safety and reward under strict safety constraints and diverse environmental conditions. To address these limitations, this paper proposes a novel zero-constraint-violation recovery RL framework tailored for high-speed UAV pursuit-evasion combat games. The framework includes three key innovations. (1) An extendable multi-step reach-avoid theory: we provide a zero-constraint-violation safety guarantee for multi-strategy reinforcement learning and enable early danger detection in high-speed games. (2) A masked-attention recovery strategy: we introduce a padding-mask attention architecture to handle spatiotemporal variations in dynamic obstacles with varying threat levels. (3) Experimental validation: we validate the framework in obstacle-rich pursuit-evasion scenarios, demonstrating its superiority through comparison with other algorithms and ablation studies. Our approach also shows potential for extension to other rapid-motion tasks and more complex hazardous scenarios. Details and code can be found at https://msmar-rl.github.io.
6013: Contrastive Cross-Course Knowledge Tracing via Concept Graph Guided Knowledge Transfer
Authors: Wenkang Han, Wang Lin, Liya Hu, Zhenlong Dai, Yiyun Zhou, Mengze Li, Zemin Liu, Chang Yao, Jingyuan Chen
Location: Guangzhou | Day: TBD
Knowledge tracing (KT) aims to predict learners’ future performance based on historical learning interactions. However, existing KT models predominantly focus on data from a single course, limiting their ability to capture a comprehensive understanding of learners’ knowledge states. In this paper, we propose TransKT, a contrastive cross-course knowledge tracing method that leverages concept graph guided knowledge transfer to model the relationships between learning behaviors across different courses, thereby enhancing knowledge state estimation. Specifically, TransKT constructs a cross-course concept graph by leveraging zero-shot Large Language Model (LLM) prompts to establish implicit links between related concepts across different courses. This graph serves as the foundation for knowledge transfer, enabling the model to integrate and enhance the semantic features of learners’ interactions across courses. Furthermore, TransKT includes an LLM-to-LM pipeline for incorporating summarized semantic features, which significantly improves the performance of Graph Convolutional Networks (GCNs) used for knowledge transfer. Additionally, TransKT employs a contrastive objective that aligns single-course and cross-course knowledge states, thereby refining the model’s ability to provide a more robust and accurate representation of learners’ overall knowledge states. Our code and datasets are available at https://github.com/DQYZHWK/TransKT/.
6021: Dual Robust Unbiased Multi-View Clustering for Incomplete and Unpaired Information
Authors: Liang Zhao, Ziyue Wang, Chuanye He, Qingchen Zhang, Bo Xu
Location: Guangzhou | Day: TBD
Recently, multi-view data has gradually attracted attention. However, real-world applications often face the Partial View-aligned Problem (PVP) and the Partially Sample-missing Problem (PSP) due to data loss or corruption. Existing methods addressing PVP typically focus only on learning from the information of aligned data, while ignoring unaligned data where samples exist but lack alignment relationships. This introduces PSP, which does not inherently exist in the data, leading to biased learning of the data’s information. For PSP, due to varying degrees of missing data, incomplete spatial structures can cause a cluster-center shift problem, resulting in the model learning incorrect correspondences and biased spatial structures. To tackle these problems, we propose a novel method called Dual Robust Unbiased Multi-View Clustering for Incomplete and Unpaired Information (DRUMVC). To our knowledge, this is the first noise-robust and unbiased multi-view clustering method capable of simultaneously addressing both PVP and PSP. Specifically, DRUMVC leverages aligned and complete samples as a bridge to construct high-quality correspondences for samples lacking cross-view relationship information due to PVP or PSP. Additionally, we employ a dual noise-robust contrastive learning loss to mitigate the impact of noise potentially introduced during the pair construction. Experiments on several challenging datasets demonstrate the superiority of our proposed method.
6026: SAP: Privacy-Preserving Fine-Tuning on Language Models with Split-and-Privatize Framework
Authors: Xicong Shen, Yang Liu, Yi Liu, Peiran Wang, Huiqi Liu, Jue Hong, Bing Duan, Zirui Huang, Yunlong Mao, Ye Wu, Sheng Zhong
Location: Guangzhou | Day: TBD
Pre-trained Language Models (PLMs) have enabled a cost-effective approach to handling various downstream applications via Parameter-Efficient-Fine-Tuning (PEFT) techniques. In this context, service providers have introduced a popular fine-tuning-based product service known as Model-as-a-Service (MaaS). This service offers users access to extensive PLMs and training resources. With MaaS, users can fine-tune, deploy, and utilize their customized models seamlessly, leveraging a one-stop platform that allows them to work with their private datasets efficiently. However, this service paradigm has recently been shown to risk leaking users' private data. To this end, we identify the data privacy leakage risks in MaaS-based PEFT and propose a Split-and-Privatize (SAP) framework, mitigating the privacy leakage by integrating split learning and differential privacy into MaaS PEFT. Furthermore, we propose Contributing-Token-Identification (CTI), a novel method to balance model utility degradation and privacy leakage. The proposed framework is comprehensively evaluated, demonstrating a 65% improvement in empirical privacy with only a 1% degradation in model performance on the Stanford Sentiment Treebank dataset, outperforming existing state-of-the-art baselines.
6044: Learning Neural Jump Stochastic Differential Equations with Latent Graph for Multivariate Temporal Point Processes
Authors: Yuchen Wang, Dongpeng Hou, Chao Gao, Xianghua Li
Location: Guangzhou | Day: TBD
Multivariate Temporal Point Processes (MTPPs) play an important role in diverse domains such as social networks and finance for predicting event sequence data. In recent years, MTPPs based on Ordinary Differential Equations (ODEs) and Stochastic Differential Equations (SDEs) have demonstrated their strong modeling capabilities. However, these models have yet to thoroughly consider the underlying relationships among different event types to enhance their modeling capacity. Therefore, this paper introduces a method that uses neural SDEs with a jump process guided by the latent graph. First, our proposed method employs multi-dimensional SDEs to capture the dynamics of the intensity function for each event type. Subsequently, a latent graph structure is integrated into the jump process without any encoder, aiming to enhance the modeling and predictive capabilities for MTPPs. Theoretical analysis guarantees the existence and uniqueness of the solution for our proposed method. Experiments conducted on multiple real-world datasets show that our approach is highly competitive with state-of-the-art neural point processes. Meanwhile, the trainable parameters of the latent graph also improve the model interpretability without any prior knowledge. Our code is available at https://github.com/cgao-comp/LNJSDE.
6053: Towards Automatic Sampling of User Behaviors for Sequential Recommender Systems
Authors: Hao Zhang, Mingyue Cheng, Zhiding Liu, Junzhe Jiang
Location: Guangzhou | Day: TBD
Sequential recommender systems (SRS) have gained increasing popularity due to their remarkable proficiency in capturing dynamic user preferences. In the current setup of SRS, a common configuration is to uniformly consider each historical behavior as a positive interaction. However, this setting has the potential to yield sub-optimal performance, as each individual item often has a different impact on shaping the user’s interests. Hence, in this paper, we propose a novel automatic sampling framework for sequential recommendation, named AutoSAM, to non-uniformly treat historical behaviors. Specifically, AutoSAM extends the conventional SRS framework by integrating an extra sampler to intelligently discern the skew distribution of the raw input, and then sample informative sub-sets to build more generalizable SRS. To tackle the challenges posed by non-differentiable sampling actions and to introduce multiple decision factors for sampling, we further design a novel reinforcement learning based method to guide the training of the sampler. Furthermore, we theoretically devise multi-objective sampling rewards including Future Prediction and Sequence Perplexity, and then optimize the whole framework in an end-to-end manner by combining the policy gradient. We conduct extensive experiments on benchmark recommendation models and four real-world datasets. The experimental results demonstrate the effectiveness of the proposed AutoSAM.
6089: Requirement Patterns for Engineering Multiagent Interaction Protocols
Authors: Amit K. Chopra, Samuel H. Christie V., Munindar P. Singh
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Agent-based and Multi-agent Systems (2/3)
An interaction protocol specifies how the member agents of a decentralized multiagent system may communicate to satisfy their respective stakeholders’ requirements. We focus on information protocols, which are fully declarative specifications of interaction and support asynchronous communication. We offer Mambo, an approach for protocol design. Mambo identifies common patterns of requirements, provides a notation to express them, and supplies a verification procedure. Mambo incorporates heuristics to generate small internal representations for efficiency. Experimental results demonstrate Mambo’s effectiveness on practical protocols.
6109: Enhancing Mixture of Experts with Independent and Collaborative Learning for Long-Tail Visual Recognition
Authors: Yanhao Chen, Zhongquan Jian, Nianxin Ke, Shuhao Hu, Junjie Jiao, Qingqi Hong, Qingqiang Wu
Location: Guangzhou | Day: TBD
Deep neural networks (DNNs) face substantial challenges in Long-Tail Visual Recognition (LTVR) due to the inherent class imbalances in real-world data distributions.
The Mixture of Experts (MoE) framework has emerged as a promising approach to addressing these issues.
However, in MoE systems, experts are typically trained to optimize a collective objective, often neglecting the individual optimality of each expert. This individual optimality usually contributes to the overall performance, as the goals of different experts are not mutually exclusive.
We propose the Independent and Collaborative Learning (ICL) framework to optimize each expert independently while ensuring global optimality.
First, Diverse Optimization Learning (DOL) is introduced to enhance expert diversity and individual performance.
Then, we conceptualize experts as parallel circuit branches and introduce Competition and Collaboration Learning (CoL). Competition Learning amplifies the gradients of better-performing experts to preserve individual optimality, and Collaboration Learning encourages collaboration through mutual distillation to enhance optimal knowledge sharing.
ICL achieves state-of-the-art accuracy in experiments on CIFAR-100/10-LT, ImageNet-LT, and iNaturalist 2018. Our code is available at https://github.com/PolarisLight/ICL.
6116: Language-Guided Hybrid Representation Learning for Visual Grounding on Remote Sensing Images
Authors: Biao Liu, Xu Liu, Lingling Li, Licheng Jiao, Fang Liu, Xinyu Sun, Youlin Huang
Location: Guangzhou | Day: TBD
Visual grounding (VG) refers to detecting the specific objects in images based on linguistic expressions, and it has profound significance in the advanced interpretation of natural images. In remote sensing image interpretation, visual grounding is limited by characteristics such as the complex scenes and diverse object sizes. To solve this problem, we propose a novel remote sensing visual grounding (RSVG) framework, named language-guided hybrid representation learning Transformer (LGFormer). Specifically, we designed a multimodal dual-encoder Transformer structure called the adaptive multimodal feature fusion module. This structure innovatively integrates text and visual features as hybrid queries, enabling early-stage decoding queries to perceive the target position accurately. Then, the different modal information from the dual encoders is aggregated by hybrid queries to obtain the final object embedding for coordinate regression. Besides, a multi-scale cross-modal feature enhancement module (MSCM) is designed to enhance the self-representation of the extracted text and visual features and align them semantically. As for the hybrid queries, we use linguistic guidance to select visual features as the visual part and sentence-level features as the textual part. Finally, the LGFormer model we designed achieved the best results compared to existing models on the DIOR-RSVG and OPT-RSVG datasets.
6148: DUQ: Dual Uncertainty Quantification for Text-Video Retrieval
Authors: Xin Liu, Shibai Yin, Jun Wang, Jiaxin Zhu, Xingyang Wang, Yee-Hong Yang
Location: Guangzhou | Day: TBD
Text-video retrieval establishes accurate similarity relationships between text and video through feature enhancement and granularity alignment. However, relying solely on similarity to associate intra-pair features and distinguish inter-pair features is insufficient, e.g., when querying a multi-scene video with sparse text or selecting the most relevant video from many similar candidates. In this paper, we propose a novel Dual Uncertainty Quantification (DUQ) model that separately handles uncertainties in intra-pair interaction and inter-pair exclusion. Specifically, to enhance intra-pair interaction, we propose an intra-pair similarity uncertainty module to provide similarity-based trustworthy predictions and explicitly model this uncertainty. To increase inter-pair exclusion, we propose an inter-pair distance uncertainty module to construct a distance-based diversity probability embedding, thereby widening the gap between similar features. The two components work synergistically, jointly improving the calculation of similarity between features. We evaluate our model on six benchmark datasets: MSRVTT (51.2%), DiDeMo, MSVD, LSMDC, Charades, and VATEX, achieving state-of-the-art retrieval performance.
6149: A Fast and Accurate ANN-SNN Conversion Algorithm with Negative Spikes
Authors: Xu Wang, Dongchen Zhu, Jiamao Li
Location: Guangzhou | Day: TBD
A spiking neural network (SNN) is an event-driven neural network that can greatly reduce the power consumption of conventional artificial neural networks (ANNs). Many ANN models can be converted to SNN models when the activation function is ReLU. For ANN models with other activation functions, such as the Leaky ReLU function, the converted SNN models either suffer from serious accuracy degradation or require a large number of time steps. In this paper, we propose a fast and accurate ANN-SNN conversion algorithm for models with the Leaky ReLU function. We design a novel neuron model that supports negative spikes. To address the problem of long-tail distribution in the activation values, we propose a threshold optimization algorithm based on the variance of the activation values. To avoid the problem of error accumulation, we jointly calibrate all layers in the SNN model with adaptive weighting. Experimental results verify the effectiveness of the proposed algorithm.
6153: RobustHAR: Multi-scale Spatial-temporal Masked Self-supervised Pre-training for Robust Human Activity Recognition
Authors: Xiao Liu, Guan Yuan, Yanmei Zhang, Shang Liu, Qiuyan Yan
Location: Guangzhou | Day: TBD
Human activity recognition (HAR) is prone to performance degradation in real-world applications due to missing data in intra-sensor and inter-sensor channels. Masked modeling, as one mainstream paradigm of self-supervised pre-training, can learn robust representations across sensors in the missing-data scenario by reconstructing the masked content based on the unmasked part. However, the existing methods predominantly emphasize the temporal dynamics of human activities, which limits their ability to effectively capture the spatial interdependencies among multiple sensors. Besides, different human activities often span various spatial-temporal scales, which causes activity recognizers to fail to capture intricate spatial-temporal semantic information. To address these issues, we propose RobustHAR, a new HAR model with multi-scale spatial-temporal masked self-supervised pre-training designed to improve model performance in the missing-data setting. RobustHAR involves three main steps: (1) RobustHAR constructs location-inspired spatial-temporal 3D-variation modeling to capture spatial-temporal correlated information in human activity data. (2) RobustHAR then designs multi-scale spatial-temporal masked self-supervised pre-training with semantic-consistent multi-scale feature co-learning for learning robust features at different scales. (3) Finally, RobustHAR fine-tunes the pre-trained model with adaptive multi-scale feature fusion for human activity recognition. Extensive experiments on three public multi-sensor datasets demonstrate that RobustHAR outperforms existing state-of-the-art methods.
6154: Stackelberg vs. Nash in the Lottery Colonel Blotto Game
Authors: Yan Liu, Bonan Ni, Weiran Shen, Zihe Wang, Jie Zhang
Location: Guangzhou | Day: TBD
Resource competition problems are often modeled using Colonel Blotto games, where players take simultaneous actions. However, many real-world scenarios involve sequential decision-making rather than simultaneous moves.

To model these dynamics, we represent the Lottery Colonel Blotto game as a Stackelberg game, in which one player, the leader, commits to a strategy first, and the other player, the follower, responds. We derive the Stackelberg equilibrium for this game, formulating the leader’s strategy as a bi-level optimization problem.

To solve this, we develop a constructive method based on iterative game reductions, which allows us to efficiently compute the leader’s optimal commitment strategy in polynomial time. Additionally, we identify the conditions under which the Stackelberg equilibrium coincides with the Nash equilibrium. Specifically, this occurs when the budget ratio between the leader and the follower equals a certain threshold, which we can calculate in closed form. In some instances, we observe that when the leader’s budget exceeds this threshold, both players achieve higher utilities in the Stackelberg equilibrium compared to the Nash equilibrium.
Lastly, we show that, in the best case, the leader can achieve an infinite utility improvement by making an optimal first move compared to the Nash equilibrium.
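As a rough illustration of the payoff structure, the sketch below computes expected utilities under the standard lottery rule, in which a player wins a battlefield with probability proportional to the resources they allocate to it. This rule, the shared battlefield values, the even split when both allocations are zero, and all names and numbers are assumptions made for illustration; the abstract does not spell out the exact model, and the sketch does not compute any equilibrium.

```python
def lottery_utilities(leader_alloc, follower_alloc, values):
    """Expected utilities when each battlefield is won with probability
    proportional to the resources allocated to it (assumed lottery rule)."""
    leader_utility = 0.0
    follower_utility = 0.0
    for x, y, v in zip(leader_alloc, follower_alloc, values):
        total = x + y
        # Assumption: an uncontested battlefield (x = y = 0) is split evenly.
        leader_share = 0.5 if total == 0 else x / total
        leader_utility += v * leader_share
        follower_utility += v * (1.0 - leader_share)
    return leader_utility, follower_utility


# Toy example: two battlefields worth 3 and 2 (hypothetical numbers).
u_leader, u_follower = lottery_utilities([2, 1], [1, 1], [3, 2])
print(u_leader, u_follower)
```

In a Stackelberg setting, the leader would choose `leader_alloc` anticipating the follower's best response to it, whereas in the Nash setting both allocations are chosen simultaneously.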
6159: First-Order Coalition Logic
Authors: Davide Catta, Rustam Galimullin, Aniello Murano
Location: Montreal | Day: August 20th | Time: 14:00 | Session: KR: Logic
We introduce First-Order Coalition Logic (FOCL), which combines key intuitions behind Coalition Logic (CL) and Strategy Logic (SL). Specifically, FOCL allows for arbitrary quantification over actions of agents. FOCL is interesting for several reasons. First, we show that FOCL is strictly more expressive than existing coalition logics. Second, we provide a sound and complete axiomatisation of FOCL, which, to the best of our knowledge, is the first axiomatisation of any variant of SL in the literature. Finally, while discussing the satisfiability problem for FOCL, we reopen the question of the recursive axiomatisability of SL.
6162: DToMA: Training-free Dynamic Token MAnipulation for Long Video Understanding
Authors: Bowen Yuan, Sisi You, Bing-Kun Bao
Location: Guangzhou | Day: TBD
Video Large Language Models (VideoLLMs) often require thousands of visual tokens to process long videos, leading to substantial computational costs, further exacerbated by visual token inefficiency. Existing token reduction and alternative video representation methods improve efficiency but often compromise comprehension abilities. In this work, we analyze the reasoning processes of VideoLLMs in the multi-choice VideoQA task, identifying three reasoning stages (shallow, intermediate, and deep) that closely mimic human cognitive processing. Our analysis reveals specific inefficiencies at each stage: in shallow layers, VideoLLMs attempt to memorize all video details without prioritizing relevant content; in intermediate layers, models fail to re-examine uncertain content dynamically; and in deep layers, they continue processing video even when sufficiently confident. To bridge this gap, we propose DToMA, a training-free Dynamic Token MAnipulation method inspired by human adjustment mechanisms in three aspects: 1) Text-guided keyframe-aware reorganization to prioritize keyframes and reduce redundancy, 2) Uncertainty-based visual injection to revisit content dynamically, and 3) Early-exit pruning to halt visual tokens when confident. Experiments on 6 long video understanding benchmarks show that DToMA enhances both efficiency and comprehension, outperforming state-of-the-art methods and generalizing well across 3 VideoLLM architectures and sizes. Code is available at https://github.com/yuanrr/DToMA.
6170: Conditional Independent Test in the Presence of Measurement Error with Causal Structure Learning
Authors: Hongbin Zhang, Kezhou Chen, Nankai Lin, Aimin Yang, Zhifeng Hao, Zhengming Chen
Location: Guangzhou | Day: TBD
Show Abstract
Testing conditional independence is a critical task, particularly in causal discovery and learning in Bayesian networks. However, in many real-world scenarios, variables are often measured with errors, such as those introduced by insufficient measurement accuracy, complicating the testing process. This paper focuses on testing conditional independence in the linear non-Gaussian measurement error model, under the condition that measurement error noise follows a Gaussian distribution. By leveraging high-order cumulants, we derive rank constraints on the cumulant matrix and establish their role in effectively assessing conditional independence, even in the presence of measurement errors. Based on these theoretical results, we leverage the rank constraints of the cumulant matrix as a tool for conditional independence testing and incorporate it into the PC algorithm, resulting in the PC-ME algorithm, a method designed to learn causal structures from observed data while accounting for measurement errors. Experimental results demonstrate that the proposed method outperforms existing approaches, particularly in cases where other methods encounter difficulties.
6174: Empowering Multimodal Road Traffic Profiling with Vision Language Models and Frequency Spectrum Fusion
Authors: Haolong Xiang, Xiaolong Xu, Guangdong Wang, Xuyun Zhang, Xiaoyong Li, Qi Zhang, Amin Beheshti, Wei Fan
Location: Guangzhou | Day: TBD
Show Abstract
With the rapid urbanization in the modern era, smart traffic profiling based on multimodal sources of data has been playing a significant role in ensuring safe travel, reducing traffic congestion and optimizing urban mobility. Most existing methods for traffic profiling on the road level usually utilize single-modality data, i.e., they mainly focus on image processing with deep vision models or auxiliary analysis on the textual data. However, the joint modeling and multimodal fusion of the textual and visual modalities have been rarely studied in road traffic profiling, which largely hinders the accurate prediction or classification of traffic conditions. To address this issue, we propose a novel multimodal learning and fusion framework for road traffic profiling, named TraffiCFUS. Specifically, given the traffic images, our TraffiCFUS framework first introduces Vision Language Models (VLMs) to generate text and then creates tailored prompt instructions for refining this text according to the specific scene requirements of road traffic profiling. Next, we apply the discrete Fourier transform to convert multimodal data from the spatial domain to the frequency domain and perform a cross-modal spectrum transform to filter out irrelevant information for traffic profiling. Furthermore, the processed spatial multimodal data is combined to generate fusion loss and interaction loss with contrastive learning. Finally, extensive experiments on four real-world datasets illustrate superior performance compared with the state-of-the-art approaches.
6179: Optimal Distributed Training With Co-Adaptive Data Parallelism in Heterogeneous Environments
Authors: Lifang Chen, Zhichao Chen, Liqi Yan, Yanyu Cheng, Fangli Guan, Pan Li
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Agent-based and Multi-agent Systems (1/3)
Show Abstract
The computational power required for training deep learning models has been skyrocketing in the past decade as they scale with big data, and has become a very expensive and scarce resource. Therefore, distributed training, which can leverage distributed available computational power, is vital for efficient large-scale model training. However, most previous distributed training frameworks like DDP and DeepSpeed are primarily designed for co-located clusters under homogeneous computing and communication conditions, and hence cannot account for geo-distributed clusters with both computing and communication heterogeneity. To address this challenge, we develop a new data parallel based distributed training framework called Co-Adaptive Data Parallelism (C-ADP). First, we consider a data owner and parameter server that distributes data to and coordinates the collaborative learning across all the computing devices. We employ local training and delayed parameter synchronization to reduce communication costs. Second, we formulate a data parallel scheduling optimization problem to minimize the training time by optimizing data distribution. Third, we devise an efficient algorithm to solve this scheduling problem, and formally prove that the obtained solution is optimal in the asymptotic sense. Experiments on the ImageNet100 dataset demonstrate that C-ADP achieves fast convergence in heterogeneous distributed training environments. Compared to Distributed Data Parallel (DDP) and DeepSpeed, C-ADP achieves 21.6 times and 26.3 times improvements in FLOPS, respectively, and a reduction in training time of about 72% and 47%, respectively.
6181: DGCPL: Dual Graph Distillation for Concept Prerequisite Relation Learning
Authors: Miao Zhang, Jiawei Wang, Jinying Han, Kui Xiao, Zhifei Li, Yan Zhang, Hao Chen, Shihui Wang
Location: Guangzhou | Day: TBD
Show Abstract
Concept prerequisite relations determine the learning order of knowledge concepts in one domain, which has an important impact on teachers’ course design and students’ personalized learning. Current research usually predicts concept prerequisite relations from the perspective of knowledge, and rarely pays attention to the role of learners’ learning behavior. We propose a Dual Graph Distillation Method for Concept Prerequisite Relation Learning (DGCPL). Specifically, DGCPL constructs a dual graph structure from both the knowledge and learning behavior perspectives, and captures the high-order knowledge features and learning behavior features through the concept-resource hypergraph and the learning behavior graph respectively. In addition, we introduce a gated knowledge distillation to fuse the structural information of concept nodes in the two graphs, so as to obtain a more comprehensive concept embedding representation and achieve accurate prediction of prerequisite relations. On three public benchmark datasets, we compare DGCPL with eight graph-based baseline methods and five traditional classification baseline methods. The experimental results show that DGCPL achieves state-of-the-art performance in learning concept prerequisite relations. Our code is available at https://github.com/wisejw/DGCPL.
6184: FedCCH: Automatic Personalized Graph Federated Learning for Inter-Client and Intra-Client Heterogeneity
Authors: Pengfei Jiao, Zian Zhou, Meiting Xue, Huijun Tang, Zhidong Zhao, HuaMing Wu
Location: Guangzhou | Day: TBD
Show Abstract
Graph federated learning (GFL) is increasingly utilized in domains such as social network analysis and recommendation systems, where non-IID data exist extensively and necessitate a strong emphasis on personalized learning. However, existing methods focus only on the personality among different clients rather than the personality within a client, which widely exists in real social networks; intra-client personality addresses the heterogeneity of known data, while inter-client personality tackles client heterogeneity under privacy constraints. In this paper, we propose a novel automatic personalized graph federated learning (PGFL) scheme named FedCCH to capture both inter-client and intra-client heterogeneity. For intra-client heterogeneity, we propose the learnable Personalized Factor (PF) to automatically normalize each graph representation within clients by learnable parameters, which weakens the impact of non-IID data distribution. For inter-client heterogeneity, we propose a novel hash-based similarity clustering method to generate the hash signature for each client, and then group similar clients for joint training among different clients. Ultimately, we collaboratively train intra-client and inter-client modules to improve the effectiveness of capturing the heterogeneity of the graph data of clients. Experimental results demonstrate that FedCCH outperforms other state-of-the-art baseline methods.
6190: FedCPD: Personalized Federated Learning with Prototype-Enhanced Representation and Memory Distillation
Authors: Kaili Jin, Li Xu, Xiaoding Wang, Sun-Yuan Hsieh, Jie Wu, Limei Lin
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Federated Learning
Show Abstract
Federated learning, as a distributed learning framework, aims to develop a global model while preserving client privacy. However, heterogeneity of client data leads to fairness issues and reduced performance. Techniques like parameter decoupling and prototype learning appear promising, yet challenges such as forgetting historical data and limited generalization persist. These methods also lack local insights, with locally trained features prone to overfitting, which affects generalization in global parameter aggregation. To address these challenges, we propose FedCPD, a personalized federated learning framework. FedCPD maintains historical information, reduces information loss, and increases personalization through hierarchical feature distillation and cross-layer feature fusion. Moreover, we utilize representation techniques like prototype contrastive learning and prototype alignment to capture diverse client data features, thus improving model generalization and fairness. Experiments show FedCPD outperforms state-of-the-art models, enhancing generalization by up to 10.40% and personalization by up to 4.90%, highlighting its effectiveness and superiority.
6195: DGraFormer: Dynamic Graph Learning Guided Multi-Scale Transformer for Multivariate Time Series Forecasting
Authors: Han Yan, Dongliang Chen, Guiyuan Jiang, Bin Wang, Lei Cao, Junyu Dong, Yanwei Yu
Location: Guangzhou | Day: TBD
Show Abstract
Multivariate time series forecasting is a critical focus across many fields. Existing transformer-based models have overlooked the explicit modeling of inter-variable correlations. Similarly, graph-based methods have failed to address the dynamic nature of multivariate correlations and the noise in correlation modeling. To overcome these challenges, we propose a novel Dynamic Graph Learning Guided Multi-Scale Transformer (DGraFormer) for multivariate time series forecasting. Specifically, our method consists of two main components: Dynamic Correlation-aware Graph Learning (DCGL) and Multi-scale Temporal Transformer (MTT). The former aims to capture dynamic correlations across different time windows, filter out noise, and select key weights to guide the aggregation of relevant feature representations. The latter can effectively extract temporal patterns from patch data at varying scales. Finally, the proposed method can capture rich local correlation graph structures and multi-scale global temporal features. Experimental results demonstrate that DGraFormer significantly outperforms existing state-of-the-art models on ten real-world datasets, achieving the best performance across multiple evaluation metrics. The source code of our model is available at https://anonymous.4open.science/r/DGraFormer.
6216: Block Circulant Adapter for Large Language Models
Authors: Xinyu Ding, Meiqi Wang, Siyu Liao, Zhongfeng Wang
Location: Montreal | Day: August 21st | Time: 15:00 | Session: ML: Large Language Models
Show Abstract
Fine-tuning large language models (LLMs) is difficult due to their huge model size. Recent Fourier domain-based methods show potential for reducing fine-tuning costs.
We propose a block circulant matrix-based fine-tuning method with a stable training heuristic to leverage the properties of circulant matrices and one-dimensional Fourier transforms to reduce storage and computation costs.
Experiments show that our method uses 14× fewer parameters than VeRA, 16× fewer than LoRA, and 32× fewer FLOPs than FourierFT, while maintaining close or better task performance.
Our approach presents a promising frequency-domain way to fine-tune large models on downstream tasks.
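The property this abstract relies on can be illustrated with a minimal sketch: a circulant block times a vector is a circular convolution, so it can be computed with one-dimensional FFTs from the block's first column alone. The function names, the (p, q, b) storage layout, and the use of NumPy below are assumptions made for illustration; this is not the authors' implementation and it omits their training heuristic.

```python
import numpy as np


def circulant(c):
    """Dense circulant matrix whose first column is c (reference only)."""
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])


def dense_block_circulant(blocks):
    """Expand first-column parameters of shape (p, q, b) to a (p*b, q*b) matrix."""
    p, q, _ = blocks.shape
    return np.block(
        [[circulant(blocks[i, j]) for j in range(q)] for i in range(p)]
    )


def block_circulant_matvec(blocks, x):
    """Multiply the block circulant matrix by x without building it.

    blocks[i, j] holds the first column of the b-by-b circulant block in
    block-row i and block-column j; x has length q * b.
    """
    p, q, b = blocks.shape
    x_freq = np.fft.fft(x.reshape(q, b), axis=-1)   # FFT of each input chunk
    c_freq = np.fft.fft(blocks, axis=-1)            # FFT of each first column
    # In the frequency domain each block acts elementwise; sum over block columns.
    y_freq = np.einsum("pqb,qb->pb", c_freq, x_freq)
    return np.fft.ifft(y_freq, axis=-1).real.reshape(p * b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blocks = rng.standard_normal((2, 3, 4))         # stored parameters
    x = rng.standard_normal(12)
    dense = dense_block_circulant(blocks)
    print("stored parameters:", blocks.size, "dense entries:", dense.size)
    print("match:", np.allclose(dense @ x, block_circulant_matvec(blocks, x)))
```

In this layout each b-by-b block stores b values instead of b squared, which is where the storage saving comes from; the FFT route replaces the dense product with elementwise products in the frequency domain.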
6224: Revisiting Proportional Allocation with Subsidy: Simplification and Improvements
Authors: Xiaowei Wu, Quan Xue, Shengwei Zhou
Location: Guangzhou | Day: TBD
Show Abstract
In this paper, we revisit the problem of fair allocation with subsidy. We first consider the allocation of m indivisible chores to n agents with additive (dis)utility functions. Under the assumption that the maximum (dis)utility of an item can be compensated by one dollar, Wu et al. (WINE 2023) showed that a total of n/4 dollars suffices to guarantee a proportional allocation by rounding fractional allocations. Their subsidy guarantee is optimal when n is even. For odd n, there is still a small gap between the upper and lower bounds for the total subsidy. In this paper, we propose a much simpler algorithm for the problem, which does not require rounding fractional allocations, and achieves an optimal subsidy guarantee for all values of n. Different from existing works, our algorithm does not require the computation and rounding of fractional allocations and admits a much simpler analysis. We further show that our algorithm and analysis framework can be extended to the mixture of (subjective) goods and chores, achieving the optimal subsidy guarantee.
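To make the subsidy notion concrete, the sketch below computes, for a fixed allocation of chores, the smallest payment each agent needs so that their cost minus subsidy does not exceed their proportional share (total cost divided by n). This assumes the usual definition of proportionality with subsidy for additive costs; the function name and data layout are hypothetical, and the code only evaluates a given allocation rather than implementing the paper's algorithm.

```python
from fractions import Fraction


def min_subsidies_for_proportionality(costs, allocation):
    """Per-agent subsidy needed to make a chore allocation proportional.

    costs[i][j] is agent i's cost for chore j (ints or Fractions, each at
    most 1 under the one-dollar normalization in the abstract).
    allocation[i] is the collection of chore indices given to agent i.
    """
    n = len(costs)
    subsidies = []
    for i in range(n):
        proportional_share = Fraction(sum(costs[i]), n)
        bundle_cost = sum(costs[i][j] for j in allocation[i])
        # Pay only the amount by which the bundle exceeds the fair share.
        subsidies.append(max(Fraction(0), bundle_cost - proportional_share))
    return subsidies


if __name__ == "__main__":
    # Two agents and a single chore of cost 1: whoever takes it needs 1/2.
    costs = [[1], [1]]
    allocation = [[0], []]
    subsidies = min_subsidies_for_proportionality(costs, allocation)
    print(subsidies, "total:", sum(subsidies))
```

In this two-agent instance the total is 1/2, which equals n/4 for n = 2, so the example sits exactly at the bound quoted in the abstract.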
6226: Causality-Inspired Disentanglement for Fair Graph Neural Networks
Authors: Guixian Zhang, Debo Cheng, Guan Yuan, Shang Liu, Yanmei Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Fair graph neural networks aim to eliminate discriminatory biases in predictions. Existing approaches often rely on adversarial learning to mitigate dependencies between sensitive attributes and labels but face challenges due to optimisation difficulties. A key limitation lies in neglecting intrinsic causality, which may lead to the entanglement of sensitive and causal factors, discarding causal factors or retaining sensitive factors in the final prediction, especially on unbalanced datasets.
To address this issue, we propose a Causality-inspired Disentangled framework for Fair Graph neural networks (CDFG). In CDFG, node representations are conceptualised as a combination of causal and sensitive factors, enabling fair representation learning by only utilising the causal factors. We first use a counterfactual data generation mechanism to generate counterfactual data with similar causal factors but completely different sensitive factors. Then, we input real-world data and counterfactual data into the factor disentanglement module to achieve independence and disentanglement between the causal factors and sensitive factors. Finally, an adaptive mask module extracts the causal representation for fair and accurate graph-based predictions.
Extensive experiments on three widely used datasets demonstrate that CDFG consistently outperforms existing methods, achieving competitive utility and significantly improved fairness.
6244: Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation
Authors: Ning Wang, Zihan Yan, Weiyang Li, Chuan Ma, He Chen, Tao Xiang
Location: Guangzhou | Day: TBD
Show Abstract
Embodied agents exhibit immense potential across a multitude of domains, making the assurance of their behavioral safety a fundamental prerequisite for their widespread deployment. However, existing research predominantly concentrates on the security of general large language models, lacking specialized methodologies for establishing safety benchmarks and input moderation tailored to embodied agents. To bridge this gap, this paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents. This framework encompasses the entire pipeline, including taxonomy definition, dataset curation, moderator architecture, model training, and rigorous evaluation. Notably, we introduce EAsafetyBench, a meticulously crafted safety benchmark engineered to facilitate both the training and stringent assessment of moderators specifically designed for embodied agents. Furthermore, we propose Pinpoint, an innovative prompt-decoupled input moderation scheme that harnesses a masked attention mechanism to effectively isolate and mitigate the influence of functional prompts on moderation tasks. Extensive experiments conducted on diverse benchmark datasets and models validate the feasibility and efficacy of the proposed approach. The results demonstrate that our methodologies achieve an impressive average detection accuracy of 94.58%, surpassing the performance of existing state-of-the-art techniques, alongside an exceptional moderation processing time of merely 0.002 seconds per instance. The source code and datasets can be found at https://github.com/ZihanYan-CQU/EAsafetyBench.
6252: Enhancing Multimodal Model Robustness Under Missing Modalities via Memory-Driven Prompt Learning
Authors: Yihan Zhao, Wei Xi, Xiao Fu, Jizhong Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Existing multimodal models typically assume the availability of all modalities, leading to significant performance degradation when certain modalities are missing. Recent methods have introduced prompt learning to adapt pretrained models to incomplete data, achieving remarkable performance when the missing cases are consistent during training and inference. However, these methods rely heavily on distribution consistency and fail to compensate for missing modalities, limiting their ability to generalize to unseen missing cases. To address this issue, we propose Memory-Driven Prompt Learning, a framework that adaptively compensates for missing modalities through prompt learning. The compensation strategies are achieved by two types of prompts: generative prompts and shared prompts. Generative prompts retrieve semantically similar samples from a predefined prompt memory that stores modality-specific semantic information, while shared prompts leverage available modalities to provide cross-modal compensation. Extensive experiments demonstrate the effectiveness of the proposed model, achieving significant improvements across diverse missing-modality scenarios, with average performance increasing from 34.76% to 40.40% on MM-IMDb, 62.71% to 77.06% on Food101, and 60.40% to 62.77% on Hateful Memes. The code is available at https://github.com/zhao-yh20/MemPrompt.
6254: A Little Subsidy Ensures MMS Allocation for Three Agents
Authors: Xiaowei Wu, Quan Xue, Shengwei Zhou
Location: Guangzhou | Day: TBD
Show Abstract
We consider the problem of fair allocation of m indivisible items to a group of n agents with subsidies (money). We address scenarios where agents have general additive cost/utility functions. Our work primarily focuses on the special case of three agents. Assuming that the maximum cost/utility of an item to an agent can be compensated by one dollar, we demonstrate that a total subsidy of 1/6 dollars is sufficient to ensure the existence of Maximin Share (MMS) allocations for both goods and chores. Additionally, we provide examples to establish the lower bounds of the required subsidies.
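For readers unfamiliar with the Maximin Share, the following brute-force sketch computes it for one agent with additive valuations: for goods, the best achievable value of the worst bundle over all partitions into n bundles; for chores, the smallest achievable cost of the heaviest bundle. The function names are hypothetical and the enumeration is exponential in the number of items, so this is only meant for tiny instances, not as the method used in the paper.

```python
from itertools import product


def _bundle_sums(values, n):
    """Yield the n bundle totals for every assignment of items to bundles."""
    for assignment in product(range(n), repeat=len(values)):
        sums = [0] * n
        for item, bundle in enumerate(assignment):
            sums[bundle] += values[item]
        yield sums


def mms_goods(values, n):
    """Maximin share for goods: maximize the value of the worst bundle."""
    return max(min(sums) for sums in _bundle_sums(values, n))


def mms_chores(costs, n):
    """Maximin share for chores: minimize the cost of the heaviest bundle."""
    return min(max(sums) for sums in _bundle_sums(costs, n))


if __name__ == "__main__":
    # Three agents' view of five items, from a single agent's valuation.
    print(mms_goods([3, 2, 2, 1, 1], 3))
    print(mms_chores([3, 2, 2, 1, 1], 3))
```

Under this definition, an agent whose bundle is worth less than `mms_goods(values, n)` would need the difference as a subsidy to be MMS-satisfied; the abstract's result bounds the total of such payments for three agents.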
6260: Efficient Hi-Fi Style Transfer via Statistical Attention and Modulation
Authors: Zhirui Fang, Yi Li, Xin Xie, Chengyan Li, Yanqing Guo
Location: Guangzhou | Day: TBD
Show Abstract
Style transfer is a challenging task in computer vision, aiming to blend the stylistic features of one image with the content of another while preserving the content details. Traditional methods often face challenges in terms of computational efficiency and fine-grained content preservation. In this paper, we propose a novel feature modulation mechanism based on parameterized normalization, where the modulation parameters for content and style features are learned using a dual convolution network (BiConv). These parameters adjust the mean and standard deviation of the features, improving both the stability and quality of the style transfer process. To achieve fast inference, we introduce an efficient acceleration technique by leveraging a row and column weighted attention matrix. In addition, we incorporate a contrastive learning scheme to align the local features of the content and the stylized images, improving the fidelity of the generated output. Experimental results demonstrate that our method significantly improves the inference speed and the quality of style transfer while preserving content details, outperforming existing approaches based on both convolution and diffusion.
6264: ADFormer: Aggregation Differential Transformer for Passenger Demand Forecasting
Authors: Haichen Wang, Liu Yang, Xinyuan Zhang, Haomin Yu, Ming Li, Jilin Hu
Location: Guangzhou | Day: TBD
Show Abstract
Passenger demand forecasting helps optimize vehicle scheduling, thereby improving urban efficiency. Recently, attention-based methods have been used to adequately capture the dynamic nature of spatio-temporal data. However, existing methods that rely on heuristic masking strategies cannot fully adapt to the complex spatio-temporal correlations, hindering the model from focusing on the right context. These works also overlook the high-level correlations that exist in the real world. Effectively integrating these high-level correlations with the original correlations is crucial. To fill this gap, we propose the Aggregation Differential Transformer (ADFormer), which offers new insights for improving demand forecasting. Specifically, we utilize Differential Attention to capture the original spatial correlations and achieve attention denoising. Meanwhile, we design distinct aggregation strategies based on the nature of space and time. Then, the original correlations are unified with the high-level correlations, enabling the model to capture holistic spatio-temporal relations. Experiments conducted on taxi and bike datasets confirm the effectiveness and efficiency of our model, demonstrating its practical value. The code is available at https://github.com/decisionintelligence/ADFormer.
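The attention-denoising idea mentioned above can be sketched with the general differential attention formulation, in which two softmax attention maps are computed from separate query/key projections and one is subtracted from the other, scaled by a factor λ, before being applied to the values. The single-head layout, the fixed scalar `lam`, and all names below are assumptions for illustration; in practice λ is typically learned, and ADFormer's spatial variant may differ from this sketch.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def differential_attention(x, wq1, wk1, wq2, wk2, wv, lam):
    """Single-head differential attention over the rows of x.

    x: (tokens, model_dim); wq*, wk*: (model_dim, head_dim);
    wv: (model_dim, value_dim); lam: scalar weight of the subtracted map.
    """
    head_dim = wq1.shape[1]
    scale = np.sqrt(head_dim)
    map1 = softmax((x @ wq1) @ (x @ wk1).T / scale)
    map2 = softmax((x @ wq2) @ (x @ wk2).T / scale)
    # Attention mass shared by both maps cancels; what differs is kept.
    return (map1 - lam * map2) @ (x @ wv)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((5, 8))                 # e.g. 5 regions, 8 features
    wq1, wk1, wq2, wk2, wv = (rng.standard_normal((8, 4)) for _ in range(5))
    out = differential_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5)
    print(out.shape)
```

Setting `lam` to 0 recovers ordinary softmax attention, and giving both maps identical projections with `lam` equal to 1 cancels the output entirely, which is a quick way to check the subtraction behaves as intended.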
6276: ID-RemovalNet: Identity Removal Network for EEG Privacy Protection with Enhancing Decoding Tasks
Authors: Huabin Wang, Jie Ruan, Cunhang Fan, Yingfan Cheng, Zhao Lv
Location: Guangzhou | Day: TBD
Show Abstract
Electroencephalogram (EEG) contains not only decoding task information but also personal identity privacy information. If it is stolen or attacked, the user’s brain-computer interaction behavior may be maliciously manipulated. Existing EEG identity privacy protection generally adopts generative methods or adds tiny perturbations, which can protect the identity privacy in EEG signals to some extent. However, these methods also damage the performance of the decoding task. To solve these problems, this paper proposes an identity removal network (ID-RemovalNet) to achieve EEG privacy protection while improving the classification accuracy of the decoding task. First, an identity decorrelation separation module is constructed to accurately remove the identity features to achieve privacy protection while reducing the interference with the task decoding features. Second, a multi-domain multi-level fusion feature extraction module is designed to extract high-quality EEG time-frequency features. Finally, the feature enhancement module is used to compensate for the loss of task decoding features and excitation of dominant feature selection during identity feature removal. The experimental results show that ID-RemovalNet removes identity information to 0.43% on four EEG datasets with two different paradigms, significantly improves the EEG task decoding accuracy by 3.28%, and achieves state-of-the-art performance in cross-subject EEG experiments.
6286: GPL4SRec: Graph Multi-Level Aware Prompt Learning for Streaming Recommendation
Authors: Hao Cang, Huanhuan Yuan, Jiaqing Fan, Lei Zhao, Guanfeng Liu, Pengpeng Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Streaming Recommendation (SRec) aims to capture evolving user preferences in the streaming scenarios. Recently, Graph Prompt Learning (GPL) methods have demonstrated their effectiveness and adaptability within SRec. However, existing graph prompt solutions rarely consider the evolution of multi-hop cascading relationships between users and items, which are crucial for modeling the shifts in user preferences. To address this problem, we propose a novel Graph Multi-Level Aware Prompt Learning for Streaming Recommendation, named GPL4SRec. Specifically, a graph encoder is first pre-trained on extensive historical data to capture user long-term preferences. Then, we design three types of prompts, namely node-aware, structure-aware, and layer-aware prompts, which are used to guide the pre-trained encoder to better capture user short-term preferences. This is accomplished by accounting for both the incremental changes in users and items, as well as the cascading evolution in multi-hop relationships. Furthermore, we provide a theoretical analysis showing that our prompt templates are critical to achieving superior performance. Finally, experimental results also prove that our model significantly outperforms the state-of-the-art approaches in SRec.
6301: DualCast: A Model to Disentangle Aperiodic Events from Traffic Series
Authors: Xinyu Su, Feng Liu, Yanchuan Chang, Egemen Tanin, Majid Sarvi, Jianzhong Qi
Location: Guangzhou | Day: TBD
Show Abstract
Traffic forecasting is crucial for transportation systems optimisation. Current models minimise the mean forecasting errors, often favouring periodic events prevalent in the training data, while overlooking critical aperiodic ones like traffic incidents. To address this, we propose DualCast, a dual-branch framework that disentangles traffic signals into intrinsic spatial-temporal patterns and external environmental contexts, including aperiodic events. DualCast also employs a cross-time attention mechanism to capture high-order spatial-temporal relationships from both periodic and aperiodic patterns. DualCast is versatile. We integrate it with recent traffic forecasting models, consistently reducing their forecasting errors by up to 9.6% on multiple real datasets.
6303: On Temporal ASP with Eager Unfoldable Operators
Authors: Thomas Eiter, Davide Soldà
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Knowledge Representation and Reasoning (4/4)
Show Abstract
Temporal Equilibrium Logic (TEL) extends Answer Set Programming (ASP) with linear-time temporal operators (LTL), enabling reasoning about dynamic systems. However, TEL enforces strong minimization criteria that may preclude intuitive models. Liveness formulas, for instance, tend to fail to have infinite equilibrium models, as TEL minimization postpones satisfaction forever. We address this limitation by introducing eager temporal operators (eager Until, eager Release, etc.), and present non-disjunctive temporal programs (NDTP) as a framework for modeling dependencies, inertia, and non-determinism. The fragment of tight temporal programs (TTP), which can be recognized efficiently based on automata techniques for loop detections, guarantees polynomial encodability into LTL. Practical examples, such as request-grant protocols and user permissions in distributed systems, illustrate the applicability of our approach.
6309: DFCA: Disentangled Feature Contrastive Learning and Augmentation for Fairer Dermatological Diagnostics
Authors: Pengcheng Zhao, Xiaowei Ding
Location: Montreal | Day: August 21st | Time: 11:30 | Session: ETF: Fairness and diversity
Show Abstract
With the increasing integration of AI in medical research and applications, the issue of fairness has become as critical as diagnostic accuracy. In dermatology diagnosis, the challenge of class-imbalanced data, which is sometimes limited and contains demographic attributes, results in an imbalanced and insufficient representation within the feature space of deep learning models. Besides, feature entanglement within deep learning models confuses skin tone and disease condition information, impairing model performance among vulnerable groups. Moreover, feature entanglement often constrains the efforts to mitigate unfairness, entailing a trade-off between fairness and diagnostic accuracy. This paper introduces the Disentangled Feature Contrastive learning and Augmentation framework (DFCA), aiming to enhance fairness in dermatological diagnoses without compromising accuracy. Initially, DFCA disentangles skin images into disease related and skin-tone features. Subsequently, the two sets of features are projected into normalized spaces for contrastive learning, each modeled by a mixture of von Mises-Fisher (vMF) distributions. DFCA then samples from these vMF distributions to inversely augment the feature space. To further evaluate the fairness-accuracy balance, we propose a new metric, the Accuracy-Fairness Balance Degree (AFBD). Extensive experiments demonstrate that DFCA significantly improves both fairness and accuracy compared to state-of-the-art methods.
6311: Omni-Dimensional State Space Model-driven SAM for Pixel-level Anomaly Detection
Authors: Chao Huang, Qianyi Li, Jie Wen, Bob Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Pixel-level anomaly detection is indispensable in industrial defect detection and medical diagnosis. Recently, the Segment Anything Model (SAM) has achieved promising results in many vision tasks. However, directly applying SAM to pixel-level anomaly detection tasks yields unsatisfactory performance, and SAM requires manual prompts. Although some automatic prompt-based SAM variants have been proposed, these automated prompting approaches merely utilize partial image features as prompts and fail to incorporate crucial features such as multi-scale image features to generate more suitable prompts. In this paper, we propose a novel Omni-Dimensional State Space Model-driven SAM (ODS-SAM) for pixel-level anomaly detection. Specifically, the proposed method adopts the SAM architecture, ensuring easy implementation and avoiding the need for fine-tuning. A State-Space Model-based residual Omni-Dimensional module is designed to automatically generate suitable prompts. This module can effectively leverage multi-scale and global information, facilitating an iterative search for optimal prompts in the prompt space. The identified optimal prompts are then fed into SAM as high-dimensional tensors. Experimental results demonstrate that the proposed ODS-SAM outperforms state-of-the-art models on both industrial and medical image datasets.
6338: A Generalized Diffusion Framework with Learnable Propagation Dynamics for Source Localization
Authors: Dongpeng Hou, Yuchen Wang, Chao Gao, Xianghua Li
Location: Guangzhou | Day: TBD
Show Abstract
Source localization has been widely studied in recent years due to its crucial role in controlling the spread of harmful information. Existing methods only achieve satisfactory performance within a specific propagation model, which restricts their applicability and generalizability across different scenarios. To address this, we propose a Generalized Diffusion Framework for Source Localization (GDFSL), which enhances probabilistic diffusion models to flexibly capture the underlying dynamics of various propagation scenarios. By redefining the forward diffusion process, GDFSL ensures convergence to a real distribution of infected states that accurately represents the targeted dynamics, enabling the model to learn unbiased noise in a self-supervised manner that encodes fine-grained propagation characteristics. A closed-form reverse diffusion process is then derived to trace the propagation back to the source. The process does not rely on an explicit source label term, facilitating direct inference of sources from observed data. Experimental results show that GDFSL outperforms SOTA methods in various propagation models, particularly in scenarios where historical training data is limited or unavailable. The code is available at https://github.com/cgao-comp/GDFSL.
6341: Object-Level Backdoor Attacks in RGB-T Semantic Segmentation with Cross-Modality Trigger Optimization
Authors: Xianghao Jiao, Di Wang, Jiawei Liang, Jianjie Huang, Wei Wang, Xiaochun Cao
Location: Guangzhou | Day: TBD
Show Abstract
The escalating threat of backdoor risks in deep vision models is a pressing concern. Existing research on backdoor attacks is often confined to a single modality, neglecting the challenges posed by multi-modality scene perception. This work pioneers the study of backdoor attacks in RGB-Thermal (RGB-T) semantic segmentation. We overcome a critical limitation of current segmentation backdoor attacks: they indiscriminately compromise all objects of a victim class and fail to provide the fine-grained control adversaries need to selectively target specific objects. To address this, we introduce a novel Object-level Backdoor Attack pipeline, termed OBA. OBA first employs precise data poisoning (PDP) to lock onto a specific victim object. Specifically, PDP embeds the trigger only into the victim object and modifies the label pixels at the corresponding positions, thus enabling object-level attacks. In addition, the domain gap between static single-modality triggers and multi-modality scenarios limits PDP. We therefore introduce a Cross-Modality Trigger Generation (CMTG) method. Through style designs of triggers and cross-modality trigger co-optimization, the target domain semantics and multi-modality model perception patterns are encoded into triggers, achieving high effectiveness, stealth, and physical feasibility of triggers. Extensive experiments show that the proposed OBA enables precise manipulation of the designated object within the specific class.
6354: CoLA-Former: Graph Transformer Using Communal Linear Attention for Lightweight Sequential Recommendation
Authors: Zhongying Zhao, Jinyu Zhang, Chuanxu Jia, Chao Li, Yanwei Yu, Qingtian Zeng
Location: Guangzhou | Day: TBD
Show Abstract
Graph Transformers have shown great promise in capturing the dynamics of user preferences for sequential recommendation. However, the self-attention mechanism within their structure has quadratic complexity, posing challenges for deployment on devices with limited resources. To this end, we propose a Communal Linear Attention-enhanced Graph TransFormer for lightweight sequential recommendation, namely CoLA-Former. Specifically, we introduce a Communal Linear Attention (CoLAttention) mechanism. It utilizes low-rank yet reusable communal units to calculate the global correlations on sequential graphs. The weights from the units are also made communal across different training batches, enabling inter-batch global weighting. Moreover, we devise a low-rank approximation component. It utilizes weight distillation to reduce the number of trainable parameters in the Graph Transformer network. Extensive experimental results on three real-world datasets demonstrate that the proposed CoLA-Former significantly outperforms twelve state-of-the-art methods in accuracy and efficiency. The datasets and code are available at https://github.com/ZZY-GraphMiningLab/CoLA_Former.
6362: Automated Detection of Pre-training Text in Black-box LLMs
Authors: Ruihan Hu, Yu-Ming Shang, Jiankun Peng, Wei Luo, Yazhe Wang, Xi Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Detecting whether a given text is a member of the pre-training data of Large Language Models (LLMs) is crucial for ensuring data privacy and copyright protection. Most existing methods rely on the LLM’s hidden information (e.g., model parameters or token probabilities), making them ineffective in the black-box setting, where only input and output texts are accessible. Although some methods have been proposed for the black-box setting, they rely on extensive manual effort, such as designing complicated questions or instructions. To address these issues, we propose VeilProbe, the first framework for automatically detecting LLMs’ pre-training texts in a black-box setting without human intervention. VeilProbe utilizes a sequence-to-sequence mapping model to infer the latent mapping feature between the input text and the corresponding output suffix generated by the LLM. It then perturbs key tokens to obtain more distinguishable membership features. Additionally, considering real-world scenarios where ground-truth training text samples are limited, a prototype-based membership classifier is introduced to alleviate overfitting. Extensive evaluations on three widely used datasets demonstrate that our framework is effective and superior in the black-box setting.
6367: Good Advisor for Source Localization: Using Large Language Model to Guide the Source Inference Process
Authors: Dongpeng Hou, Wenfei Wei, Chao Gao, Xianghua Li, Zhen Wang
Location: Guangzhou | Day: TBD
Show Abstract
With the rapid development of large AI models, large language models (LLMs) offer a new solution for source localization tasks thanks to their deep linguistic understanding and generation capabilities. However, when applied directly to source localization, LLMs struggle to understand complex propagation patterns and network structures, which limits localization accuracy. Meanwhile, the high-dimensional embeddings of the textual representation introduce many redundant features, which also reduces efficiency on the source localization task to some extent. To solve these problems, this paper proposes a multi-modal fusion framework for rumor source localization based on contrastive learning, namely Contrastive Rumor Source Localization via LLM (CRSLL). Specifically, the framework constructs propagation embeddings by comprehensively capturing both propagation dynamics and user profile features; adopts a contrastive learning approach to enhance the representation ability of comment embeddings of rumor cascades by differentiating them from non-rumor cascade comments; filters out invalid features through a differentiable masking strategy; and fuses comment modality embeddings with propagation embeddings through an attention mechanism, so as to better capture the multi-modal data interactions. Notably, the framework uses the LLM as a good “advisor” to provide a rich deep semantic representation, which improves the accuracy of rumor source localization. The code is available at https://github.com/cgao-comp/CRSLL.
6368: High-Confident Local Structure Guided Consensus Graph Learning For Incomplete Multi-view Clustering
Authors: Shuping Zhao, Lunke Fei, Qi Lai, Jie Wen, Jinrong Cui, Tingting Chai
Location: Guangzhou | Day: TBD
Show Abstract
Existing clustering methods for incomplete multi-view data primarily concentrate on learning a common representation or graph from the available views, while overlooking the latent information contained in the missing views and the imbalance of information among different views. Furthermore, instances with weak discriminative features usually degrade the precision of the consistent representation or graph across all views. To address these problems, we propose a simple but efficient method, called high-confident local structure guided consensus graph learning for incomplete multi-view clustering (HLSCG_IMC). Specifically, this method can adaptively learn a strict block diagonal structure from the available samples using a block diagonal representation regularizer. Unlike existing methods that use a simple pairwise affinity graph for structure construction, we consider the influence of instances located at the edge of two clusters on the construction of the graph for each view. By harnessing the proposed high-confident strict block diagonal structures, the approach directly guides the learning of a robust consensus graph. Experiments have been conducted to verify the efficacy of our approach.
6372: RegionMatch: Pixel-Region Collaboration for Semi-Supervised Semantic Segmentation in Remote Sensing Images
Authors: Xiaoqian Zhu, Xiangrong Zhang, Tianyang Zhang, Chaowei Fang, Xu Tang, Licheng Jiao
Location: Guangzhou | Day: TBD
Show Abstract
Semi-supervised semantic segmentation (S4) has shown significant promise in reducing the burden of labor-intensive data annotation. However, existing methods mainly rely on pixel-level information, neglecting the strong region consistency inherent in remote sensing images (RSIs), which limits their effectiveness in handling the complex and diverse backgrounds of RSIs. To address this, we propose RegionMatch, a novel approach that leverages unlabeled data from a fresh object-level perspective, which is better suited to the nature of semantic segmentation. We design the Pixel-Region Synergy Pseudo-Labeling strategy, which explicitly injects object-level contextual information into the S4 pipeline and promotes knowledge collaboration between pixel and region perspectives for generating high-quality pseudo-labels. In addition, we propose the Region Structure-Aware Correlation Consistency, which models object-level relationships by establishing inter-region correlations across images and pixel correlations within regions, providing more effective supervision signals for unlabeled data. Experimental results demonstrate that RegionMatch outperforms state-of-the-art methods on multiple authoritative remote sensing datasets, highlighting its superiority on RSIs.
6376: MHANet: Multi-scale Hybrid Attention Network for Auditory Attention Detection
Authors: Lu Li, Cunhang Fan, Hongyu Zhang, Jingjing Zhang, Xiaoke Yang, Jian Zhou, Zhao Lv
Location: Guangzhou | Day: TBD
Show Abstract
Auditory attention detection (AAD), which aims to detect the target speaker in a multi-talker environment from brain signals such as electroencephalography (EEG), has made great progress. However, most AAD methods apply attention mechanisms only sequentially and overlook valuable multi-scale contextual information within EEG signals, limiting their ability to capture long- and short-range spatiotemporal dependencies simultaneously. To address these issues, this paper proposes a multi-scale hybrid attention network (MHANet) for AAD, which consists of the multi-scale hybrid attention (MHA) module and the spatiotemporal convolution (STC) module. Specifically, MHA combines channel attention with multi-scale temporal and global attention mechanisms. This effectively extracts multi-scale temporal patterns within EEG signals and captures long- and short-range spatiotemporal dependencies simultaneously. To further improve the performance of AAD, STC utilizes temporal and spatial convolutions to aggregate expressive spatiotemporal representations. Experimental results show that the proposed MHANet achieves state-of-the-art performance across three datasets with three times fewer trainable parameters than the most advanced model. Code is available at: https://github.com/fchest/MHANet.
6378: Generate or Re-Weight? A Mutual-Guidance Method for Class-Imbalanced Graphs
Authors: Zhongying Zhao, Gen Liu, Qi Meng, Chao Li, Qingtian Zeng
Location: Guangzhou | Day: TBD
Show Abstract
Class imbalance is a widespread problem in graph-structured data. Existing studies tailored for class-imbalanced graphs are typically categorized into generative and re-weighting methods. However, the former merely focus on quantity balance rather than learning balance. The latter perform fine-tuning in a majority-minority paradigm, overlooking the authentic-generative one. In fact, combining the two can relieve their respective limitations. To this end, we propose a Mutual-Guidance method for class-imbalanced graphs, namely GraphMuGu. Specifically, we first design an uncertainty-aware method to quantify the number of synthesized samples for each category. Furthermore, we devise a similarity-aware method to re-weight the importance of the authentic and generative samples. To the best of our knowledge, the proposed GraphMuGu is the first attempt to incorporate the generative and re-weighting methods into a unified framework. The experimental results on five class-imbalanced datasets demonstrate the superiority of the proposed method. The source code is available at https://github.com/ZZY-GraphMiningLab/GraphMuGu.
6381: Efficient Dynamic Graphs Learning with Refined Batch Parallel Training
Authors: Zhengzhao Feng, Rui Wang, Longjiao Zhang, Tongya Zheng, Ziqi Huang, Mingli Song
Location: Guangzhou | Day: TBD
Show Abstract
Memory-based temporal graph neural networks (MTGNN) use node memory to store historical information, enabling efficient processing of large dynamic graphs through batch parallel training, with larger batch sizes leading to increased training efficiency. However, this approach overlooks the interdependency among edges within the same batch, leading to outdated memory states and reduced training accuracy. Previous studies have attempted to mitigate this issue through methods such as measuring memory loss, overlap training, and additional compensation modules. Despite these efforts, challenges persist, including imprecise coarse-grained memory loss measurement and ineffective compensation modules. To address these challenges, we propose the Refined Batch parallel Training (RBT) framework, which accurately evaluates intra-batch information loss and optimizes batch partitioning to minimize loss, enhancing the training process’s effectiveness and efficiency. RBT also includes a precise and efficient memory compensation algorithm. Experimental results demonstrate RBT’s superior performance compared to existing MTGNN frameworks like TGL, ETC, and PRES in terms of training efficiency and accuracy across various dynamic graph datasets. Our code is made publicly available at https://github.com/fengwudi/RBT.
6383: Generalized Safe Conditional Syntax Splitting of Belief Bases
Authors: Lars-Phillip Spiegel, Jonas Haldimann, Jesse Heyninck, Gabriele Kern-Isberner, Christoph Beierle
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Knowledge Representation and Reasoning (2/4)
Show Abstract
Splitting techniques in knowledge representation help focus on relevant parts of a belief base and reduce the complexity of reasoning generally. In this paper, we propose a generalization of safe conditional syntax splittings that broadens the applicability of splitting postulates for inductive inference from belief bases. In contrast to safe conditional syntax splitting, our generalized notion supports syntax splittings of a belief base ∆ where the subbases of ∆ may share atoms and nontrivial conditionals. We illustrate how this new notion overcomes limitations of previous splitting concepts, and we identify genuine splittings, separating them from simple splittings that do not provide benefits for inductive inference from ∆. We introduce adjusted inference postulates based on our generalization of conditional syntax splitting. We evaluate several inductive inference operators with respect to these postulates, and show that generalized safe conditional syntax splitting is a strictly stronger requirement for inductive inference operators, covering more syntax splitting applications.
6392: Robustness in Single-Audience Value-based Abstract Argumentation: Complexity Results
Authors: Bettina Fazzinga, Sergio Flesca, Filippo Furfaro
Location: Montreal | Day: August 19th | Time: 15:00 | Session: KRR: Argumentation
Show Abstract
We address the context of Single-Audience Value-Based Abstract Argumentation Framework (AVAF), where the arguments are labeled with the social values that they promote and the activation/deactivation of the attacks depends on the audience profile (expressed as a set of preferences between the social values). Herein, we introduce a new notion of robustness for measuring the sensitivity of the outcome of the reasoning to the extent of changes in the audience profile. In particular, for a set of arguments S or a single argument a, we define the robustness degree of the status of S or a as the maximum number k* of deletions/insertions of preferences from/into the audience profile that are tolerable, in the sense that S remains an extension (or a non-extension) or a accepted (or unaccepted) after performing at most k* deletions/insertions. We introduce the decision problems related to the computation of the robustness degree and focus on thoroughly investigating their computational complexity.
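One possible formal reading of the robustness degree described in this abstract is sketched below. The notation is assumed here for illustration and is not taken from the paper: P denotes the audience profile as a set of preferences, P' a modified profile, the symmetric difference counts deletions plus insertions, and status(S, P) is either "extension" or "non-extension".

```latex
k^{*}(S, P) \;=\; \max \bigl\{\, k \ge 0 \;:\;
  \forall P' \text{ with } |P \,\triangle\, P'| \le k,\;\;
  \mathrm{status}(S, P') = \mathrm{status}(S, P) \,\bigr\}
```

The same form would apply to a single argument a with status "accepted" or "unaccepted". Which modified profiles P' count as admissible is left to the paper's own definitions.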
6400: UltraModel: A Modeling Paradigm for Industrial Objects
Authors: Haoran Yang, Yinan Zhang, Qunshan He, Yuqi Ye, Jing Zhao, Wenhai Wang
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Multidisciplinary Topics and Applications (1/2)
Show Abstract
As Industry 4.0 unfolds and digital twin technology rapidly advances, modeling techniques that can abstract real-world industrial objects into accurate and robust models, referred to as modeling for industrial objects (MIO) tasks, have become increasingly crucial. However, existing works still face two major limitations. First, each of these works primarily focuses on modeling a specific industrial object. When the industrial objects change, the proposed methods often struggle to adapt. Second, they fail to fully consider latent relationships within industrial data, limiting the model’s ability to leverage the data and resulting in suboptimal performance. To address these issues, we propose a novel modeling paradigm tailored for MIO tasks, named UltraModel. Specifically, a twin model graph module is designed to construct a customized graph based on the mechanisms of industrial objects and employ graph convolution to generate high-dimensional representations. Then, a multi-scale feature abstraction module and a spatial attention-based feature fusion module are proposed to complement each other in performing multi-scale feature abstraction and fusion on high-dimensional representations. Finally, the outputs are obtained by processing the fused representations through a feedforward network. Experiments on two different industrial objects demonstrate that UltraModel outperforms existing methods, offering a novel perspective for addressing industrial modeling challenges.
6402: A Prior-based Discrete Diffusion Model for Social Graph Generation
Authors: Shu Yin, Dongpeng Hou, Lianwei Wu, Xianghua Li, Chao Gao
Location: Guangzhou | Day: TBD
Show Abstract
Graph generation is essential in social network analysis, particularly for modeling information flow and user interactions. However, existing probabilistic diffusion models face challenges when applied to social propagation graphs. The continuous noise does not apply to the discrete nature of graph generation tasks, and the random Gaussian initialization in the reverse process can introduce biases that deviate from real-world propagation patterns. To address these issues, this paper introduces a Prior-based Discrete Diffusion Model (PDDM) for social graph generation. PDDM redefines the forward process as a discrete process for node denoising and edge generation, and the task of the denoising module is transformed into the connection probability learning of node-level tasks. Further, PDDM employs a new starting point of the reverse process by incorporating user similarity as the probability matrix, which can better leverage the social context. These developments mitigate reverse-starting bias and enhance model robustness. Moreover, PDDM integrates lightweight deep graph networks such as GAT, demonstrating both scalability and applicability to graph generation scenarios. Comprehensive experiments on real-world social network datasets demonstrate PDDM’s superiority in terms of the MMD metric and downstream tasks. The code is available at https://github.com/cgao-comp/PDDM.
6407: Co-Learning of Strategy and Structure Achieves Full Cooperation in Complex Networks with Dynamical Linking
Authors: Xiaoqing Fan, Chin-wing Leung, Paolo Turrini
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Agent-based and Multi-agent Systems (1/3)
Show Abstract
Social dilemmas are an important benchmark to study the emergence of cooperation among autonomous learning agents and impressive results were recently achieved in two-player games by reinforcement learning agents equipped with a partner selection module. However, the same cannot be said for games on networks. When surrounded by many other defectors, cooperators suffer harsher punishments and find it hard to replicate, making mass defection quickly take over. The frameworks studied so far for the emergence of cooperation in social dilemmas on networks have shown the key role of dynamical linking, the capacity of agents to select their own neighbours, but they have also relied on hard-wired heuristics, such as imitation dynamics, designed to favour cooperation. In this paper, we remove this constraint and study a population of agents that can autonomously learn whether to cooperate or defect with any of their neighbours in a social dilemma, as well as whether to form or sever social ties with others. Building on a seminal framework for the emergence of cooperation in complex social networks with dynamical linking, we implement our agents as Sarsa learners with Boltzmann exploration and equipped with partner selection actions. We show, for the first time, that these agents can reach a fully cooperative society without requiring ad-hoc heuristics. In doing so, we confirm the fundamental role of timescales, the relative speed at which strategy and structure updates occur, for the emergence of cooperation, highlighting the intricate interplay between network dynamics and decision-making in agent societies.
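As a minimal sketch of the two standard ingredients this abstract names, the snippet below shows Boltzmann (softmax) action selection and the tabular Sarsa update. It is a generic textbook illustration, not the authors' implementation: the function names, the dictionary-of-lists Q-table, and the parameters `temperature`, `alpha`, and `gamma` are assumptions made here, and the paper's partner selection actions and network dynamics are not modeled.

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; a higher temperature gives a flatter distribution."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def choose_action(q_values, temperature, rng=random):
    """Sample an action index according to the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


def sarsa_update(q, state, action, reward, next_state, next_action, alpha, gamma):
    """On-policy TD update: the target uses the action actually chosen next."""
    td_target = reward + gamma * q[next_state][next_action]
    q[state][action] += alpha * (td_target - q[state][action])


# Hypothetical two-state, two-action table (e.g. 0 = cooperate, 1 = defect).
q_table = {0: [0.0, 0.0], 1: [1.0, 0.0]}
a0 = choose_action(q_table[0], temperature=1.0)
sarsa_update(q_table, state=0, action=0, reward=1.0,
             next_state=1, next_action=0, alpha=0.5, gamma=0.9)
```

Because Sarsa is on-policy, the exploration temperature directly shapes the values being learned, which is one reason the choice of exploration scheme matters in settings like this one.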
6413: Universal Graph Self-Contrastive Learning
Authors: Liang Yang, Yukun Cai, Hui Ning, Jiaming Zhuo, Di Jin, Ziyi Ma, Yuanfang Guo, Chuan Wang, Zhen Wang
Location: Guangzhou | Day: TBD
Show Abstract
As a pivotal architecture in Self-Supervised Learning (SSL), Graph Contrastive Learning (GCL) has demonstrated substantial application value in scenarios with limited labeled nodes (samples). However, existing GCL methods encounter critical issues in graph augmentation and in positive and negative sampling, stemming from the lack of explicit supervision, which collectively restrict their efficiency and universality. On the one hand, the reliance on graph augmentations in existing GCL methods can lead to increased training times and memory usage, while potentially compromising semantic integrity. On the other hand, the difficulty of selecting true positive and negative samples limits their universality across both homophilic and heterophilic graphs. To address these drawbacks, this paper introduces a novel GCL framework called GRAph learning via Self-contraSt (GRASS). The core mechanism is node-attribute self-contrast, which involves increasing the feature similarities between nodes and their included attributes while decreasing the similarities between nodes and their non-included attributes. Theoretically, the self-contrast mechanism implicitly ensures accurate node-node contrast by capturing high-hop co-inclusion relationships, thereby enabling GRASS to be universally applicable to graphs with varying degrees of homophily. Evaluations on diverse benchmark datasets demonstrate the universality and efficiency of GRASS. The dataset and code are available at https://github.com/YukunCai/GRASS.
6421: Image-Enhanced Hybrid Encoding with Reinforced Contrastive Learning for Spatial Domain Identification in Spatial Transcriptomics
Authors: Daoyuan Wang, Lu Gao, Wenlan Chen, Cheng Liang, Fei Guo
Location: Guangzhou | Day: TBD
Show Abstract
Spatial transcriptomics integrates spatial, gene expression, and multichannel immunohistochemistry image data, enabling advanced insights into cellular organization. However, existing methods often struggle to effectively fuse these multimodal data, limiting their potential for accurate spatial domain identification. Here, we propose IE-HERCL (Image-Enhanced Hybrid Encoding with Reinforced Contrastive Learning), a novel framework designed to address this challenge. Specifically, IE-HERCL employs hybrid encoding to capture both the non-spatial features and spatial dependencies for both gene and image modalities via autoencoders and GraphSAGE, respectively. These features are then fused using cross-view attention mechanisms to generate the unified informative embedding. To enhance the representation learning capability, we introduce a reinforced contrastive learning strategy to mitigate the influences of false negative samples, where we detect potential positive counterparts with high-order random walks. In addition, the cluster alignment is dynamically refined through optimal transport, which ensures that the fused consensus representation is coherent and robust, enabling accurate spatial domain identification. Our approach achieves state-of-the-art performance on five image-enhanced spatial transcriptomics datasets, demonstrating its robustness and effectiveness in multimodal integration and spatial domain identification. IE-HERCL offers a powerful and innovative solution for advancing spatial transcriptomics analysis. The code is released on https://github.com/wdyi701/IE-HERCL.
6434: MCD-CLIP: Multi-view Chest Disease Diagnosis with Disentangled CLIP
Authors: Songyue Cai, Yujie Mo, Liang Peng, Yucheng Xie, Tao Tong, Xiaofeng Zhu
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Computer vision (2/3)
Show Abstract
Pre-trained methods for multi-view chest X-ray images have demonstrated impressive performance in chest disease diagnosis, but some limitations remain. First, many pre-trained methods require full fine-tuning of the pre-trained models, which incurs significant computational cost and destroys prior knowledge. Second, many pre-trained methods cannot efficiently balance consistency and complementarity among views, leading to information loss and performance degradation. To tackle these issues, we propose MCD-CLIP, a CLIP-based multi-view chest disease diagnosis method. It uses visual prompts and a Prompt-Aligner to align prompts across views, along with an additional text representation for efficient transfer. Moreover, we employ Adapters to disentangle the image representation, maintaining consistency and complementarity across different views. Experimental results on the chest X-ray dataset demonstrate that MCD-CLIP achieves comparable or better performance on a variety of tasks with 94.31% fewer tunable parameters compared to state-of-the-art methods. The source code is released at https://github.com/YuzunoKawori/MCD-CLIP.
6447: A Fast-Adaptive Cognitive Diagnosis Framework for Computerized Adaptive Testing Systems
Authors: Yuanhao Liu, Yiya You, Shuo Liu, Hong Qian, Ying Qian, Aimin Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Computerized Adaptive Testing (CAT) measures student ability by iteratively selecting informative questions; its core components are the Cognitive Diagnosis Model (CDM) and the selection strategy. Current research focuses on optimizing the selection strategy, assuming relatively accurate CDM results. However, existing static CDMs struggle with rapid and accurate diagnosis in the early stage of CAT. To this end, this paper proposes a Fast Adaptive Cognitive Diagnosis (FACD) framework, which incorporates dynamic collaborative and personalized diagnosis modules. Specifically, the collaborative module in FACD uses a dynamic response graph to quickly build student cognitive profiles, while the personalized module leverages each student’s response sequence for robust and individualized diagnosis. Extensive experiments on real-world datasets show that, compared with existing static CDMs, FACD not only achieves superior prediction performance across various selection strategies, with an improvement of roughly 5%-10% in the early stage of CAT, but also maintains a commendable inference speed.
6451: MA-RAG: Automating Role Engineering for RESTful APIs with Multi-Head Attention and Retrieval-Augmented Generation
Authors: Yang Luo, Qingni Shen, Zhonghai Wu
Location: Guangzhou | Day: TBD
Show Abstract
This paper addresses the role engineering problem for RESTful applications and proposes MA-RAG, a role engineering method based on multi-head attention and Retrieval-Augmented Generation. The method first performs fine-grained control flow analysis on the system source code to extract permission information of API handlers. Then, using basic blocks as units, it employs pre-trained code models to convert the source code into semantic vectors, which are stored in the retrieval-augmented generation model. On this basis, a call chain structure tree is constructed with permissions as the center, utilizing the multi-head attention mechanism to aggregate semantic information of different code granularities from bottom to top, with each attention head corresponding to a role engineering objective. Finally, the root vectors of each permission tree are subjected to self-supervised clustering to adaptively determine the number of roles and perform division. We evaluated MA-RAG on 284 real-world software systems, and the results show that compared with other methods, MA-RAG can significantly reduce time overhead, reduce the number of generated roles, lower the role permission overlap rate, and improve the interpretability score.
6457: Mamba-Based Graph Convolutional Networks: Tackling Over-smoothing with Selective State Space
Authors: Xin He, Yili Wang, Wenqi Fan, Xu Shen, Xin Juan, Rui Miao, Xin Wang
Location: Guangzhou | Day: TBD
Show Abstract
Graph Neural Networks (GNNs) have shown great success in various graph-based learning tasks. However, they often face the issue of over-smoothing as the model depth increases, which causes all node representations to converge to a single value and become indistinguishable. This issue stems from the inherent limitations of GNNs, which struggle to distinguish the importance of information from different neighborhoods. In this paper, we introduce MbaGCN, a novel graph convolutional architecture that draws inspiration from the Mamba paradigm, originally designed for sequence modeling. MbaGCN presents a new backbone for GNNs, consisting of three key components: the Message Aggregation Layer, the Selective State Space Transition Layer, and the Node State Prediction Layer. These components work in tandem to adaptively aggregate neighborhood information, providing greater flexibility and scalability for deep GNN models. While MbaGCN may not consistently outperform all existing methods on each dataset, it provides a foundational framework that demonstrates the effective integration of the Mamba paradigm into graph representation learning. Through extensive experiments on benchmark datasets, we demonstrate that MbaGCN paves the way for future advancements in graph neural network research. Our code is available at https://github.com/hexin5515/MbaGCN.
6494: EDGE: Efficient Data Selection for LLM Agents via Guideline Effectiveness
Authors: Yunxiao Zhang, Guanming Xiong, Haochen Li, Wen Zhao
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: Natural Language Processing (1/2)
Show Abstract
Large Language Models (LLMs) have shown remarkable capabilities as AI agents. However, existing methods for enhancing LLM-agent abilities often lack a focus on data quality, leading to inefficiencies and suboptimal results in both fine-tuning and prompt engineering. To address this issue, we introduce EDGE, a novel approach for identifying informative samples without needing golden answers. We propose the Guideline Effectiveness (GE) metric, which selects challenging samples by measuring the impact of human-provided guidelines in multi-turn interaction tasks. A low GE score indicates that the human expertise required for a sample is missing from the guideline, making the sample more informative. By selecting samples with low GE scores, we can improve the efficiency and outcomes of both prompt engineering and fine-tuning processes for LLMs. Extensive experiments validate the performance of our method. Our method achieves competitive results on the HotpotQA and WebShop datasets, requiring 75% and 50% less data, respectively, while outperforming existing methods. We also provide a fresh perspective on the data quality of LLM-agent fine-tuning.
6507: Multimodal Fake News Detection: MFND Dataset and Shallow-Deep Multitask Learning
Authors: Ye Zhu, Yunan Wang, Zitong Yu
Location: Guangzhou | Day: TBD
Show Abstract
Multimodal news contains a wealth of information and is easily affected by deepfake modeling attacks. To combat the latest image and text generation methods, we present a new Multimodal Fake News Detection dataset (MFND) containing 11 manipulated types, designed to detect and localize highly authentic fake news. Furthermore, we propose a Shallow-Deep Multitask Learning (SDML) model for fake news, which fully uses unimodal and mutual modal features to mine the intrinsic semantics of news. Under shallow inference, we propose the momentum distillation-based light punishment contrastive learning for fine-grained uniform spatial image and text semantic alignment, and an adaptive cross-modal fusion module to enhance mutual modal features. Under deep inference, we design a two-branch framework to augment the image and text unimodal features, respectively merging with mutual modalities features, for four predictions via dedicated detection and localization projections. Experiments on both mainstream and our proposed datasets demonstrate the superiority of the model. Codes and dataset are released at https://github.com/yunan-wang33/sdml.
6519: Can Large Models Teach Student Models to Solve Mathematical Problems Like Human Beings? A Reasoning Distillation Method via Multi-LoRA Interaction
Authors: Xinhe Li, Jiajun Liu, Peng Wang
Location: Guangzhou | Day: TBD
Show Abstract
Recent studies have demonstrated that Large Language Models (LLMs) have strong mathematical reasoning abilities but rely on hundreds of billions of parameters. To tackle the challenge of poor reasoning in Small Language Models (SLMs), existing methods typically leverage LLMs to generate massive amounts of data for cramming training. In psychology, they are akin to System 1 thinking, which resolves reasoning problems rapidly based on experience and intuition. However, human learning also requires System 2 thinking, where knowledge is first acquired and then reinforced through practice. Inspired by such two distinct modes of thinking, we propose a novel method based on the multi-LoRA Interaction for mathematical reasoning Distillation (LoRID). First, we input the question and reasoning of each sample into an LLM to create knowledge-enhanced datasets. Subsequently, we train a LoRA block on the student model as an Intuitive Reasoner (IR), which directly generates Chain-of-Thoughts for problem-solving. Then, to imitate System 2 thinking, we train the Knowledge Generator (KG) and Deep Reasoner (DR), respectively. The former outputs only knowledge after receiving problems, while the latter uses that knowledge to perform reasoning. Finally, to address the randomness in the generation of IR and DR, we evaluate whether their outputs are consistent, and the inference process needs to be iterated if not. This step can enhance the mathematical reasoning ability of SLMs through mutual feedback. Experimental results show that LoRID achieves state-of-the-art performance, especially on the GSM8K dataset, where it outperforms the second-best method by 2.3%, 16.1%, 2.4%, 12.3%, and 1.8% accuracy across the five base models, respectively. Meanwhile, we select four strong baselines as System 1, and after integrating them with our method, the reasoning ability of student models is consistently and significantly improved. 
The datasets and codes are available at https://github.com/Xinhe-Li/LoRID.
6523: Disentangling Multi-view Representations via Curriculum Learning with Learnable Prior
Authors: Kai Guo, Jiedong Wang, Xi Peng, Peng Hu, Hao Wang
Location: Guangzhou | Day: TBD
Show Abstract
Multi-view representation learning methods typically follow a consistent-and-specific pipeline that aims at extracting latent representations for an entity from its multiple observable views to facilitate downstream tasks. However, most of them overlook the complex underlying correlation between different views. To solve this issue, we delve into a well-known property of neural networks (NNs) that NNs tend to learn simple patterns first and then hard ones. In our case, view-consistent representations are simple patterns and view-specific representations are hard. To this end, we propose to disentangle view-consistency and view-specificity and learn them gradually. Specifically, we devise a novel curriculum learning approach that adjusts the whole model to learn view-consistent representations first and then progressively view-specific representations. Besides, we saddle each view with a learnable prior that allows each view-specific representation to appropriate its distribution. Moreover, we incorporate a mixture-of-experts layer and a disentangling module to further enhance the quality of the learned representations. Extensive experiments on five real-world datasets show that the proposed model outperforms its counterparts markedly. The code is available at https://github.com/XLearning-SCU/2025-IJCAI-CL2P.
6526: FADE: Towards Fairness-aware Data Generation for Domain Generalization via Classifier-Guided Score-based Diffusion Models
Authors: Yujie Lin, Dong Li, Minglai Shao, Guihong Wan, Chen Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Fairness-aware domain generalization (FairDG) has emerged as a critical challenge for deploying trustworthy AI systems, particularly in scenarios involving distribution shifts. Traditional methods for addressing fairness have failed in domain generalization due to their lack of consideration for distribution shifts. Although disentanglement has been used to tackle FairDG, it is limited by its strong assumptions. To overcome these limitations, we propose Fairness-aware Classifier-Guided Score-based Diffusion Models (FADE) as a novel approach to effectively address the FairDG issue. Specifically, we first pre-train a score-based diffusion model (SDM) and two classifiers to equip the model with strong generalization capabilities across different domains. Then, we guide the SDM using these pre-trained classifiers to effectively eliminate sensitive information from the generated data. Finally, the generated fair data is used to train downstream classifiers, ensuring robust performance under new data distributions. Extensive experiments on three real-world datasets demonstrate that FADE not only enhances fairness but also improves accuracy in the presence of distribution shifts. Additionally, FADE outperforms existing methods in achieving the best accuracy-fairness trade-offs.
6528: Cross-modal Collaborative Representation Learning for Text-to-Image Person Retrieval
Authors: Shuanglin Yan, Jun Liu, Neng Dong, Liyan Zhang, Jinhui Tang
Location: Guangzhou | Day: TBD
Show Abstract
Text-to-image person retrieval (TIPR) aims to find images of the same identity that match a given text description. Current TIPR methods mainly focus on mining the association between images and texts, ignoring their potential complementarity. Besides, existing matching losses treat all positive pairs from the same identity equally, leading to noisy correspondences. In this paper, we propose CoRL: a cross-modal Collaborative Representation Learning framework designed to improve TIPR by effectively leveraging the complementarity between modalities. The text typically contains identity details with less noise, which helps distinguish visually similar pedestrians. This inspires us to integrate it into the corresponding image to emphasize identity-related and modality-shared visual information. However, corresponding text for each image is not always available, especially during inference. Accordingly, we introduce a Virtual-text Embedding Synthesizer that generates high-quality virtual-text features for cross-modal collaboration, eliminating the need for actual texts. We then design a Cross-Modal Collaboration learning process, incorporating a Cross-modal Relation Consistency loss to promote interaction and fusion between image and virtual-text features for mutual enhancement. Additionally, an Identity-bounded Matching loss is proposed to handle different types of image-text pairs distinctly, leading to more accurate cross-modal correspondences. Extensive experiments on multiple benchmarks demonstrate the superiority of CoRL over existing TIPR methods.
6553: Federated Domain Generalization with Decision Insight Matrix
Authors: Tianchi Liao, Binghui Xie, Lele Fu, Sheng Huang, Bowen Deng, Chuan Chen, Zibin Zheng
Location: Guangzhou | Day: TBD
Show Abstract
Federated domain generalization addresses the crucial challenge of developing models that can generalize across diverse domains while maintaining data privacy in federated learning settings. Current approaches either compromise privacy constraints or focus narrowly on specific aspects of model invariance, often incurring significant computational overhead. We propose a novel approach FedDIM, which leverages the concept of “insight matrix” – a fine-grained representation of the model’s decision-making process derived from element-wise products between feature vectors and classifier weights. By introducing a regularization term that promotes consistency between individual sample insight matrices and their class-wise mean representations, our method effectively captures both feature and classifier invariance. This approach not only maintains strict privacy requirements but also introduces minimal computational overhead as it utilizes intermediate computations already present in the forward pass. Extensive experiments demonstrate that our method achieves superior out-of-distribution generalization compared to existing federated learning approaches while being simple to implement. Our work provides a new perspective on achieving robust generalization in federated learning settings through the lens of decision-making processes.
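The insight-matrix idea described above can be illustrated with a minimal sketch. It assumes a linear classifier with one weight row per class and uses a mean squared distance as the consistency term; the function names and the exact form of the penalty are hypothetical and are not taken from the FedDIM paper.

```python
from collections import defaultdict

def insight_matrix(feature, weights):
    """Element-wise product of one feature vector with each classifier row.

    Row c holds the per-dimension contributions to the logit of class c
    (assumption: linear classifier without bias).
    """
    return [[w * f for w, f in zip(row, feature)] for row in weights]

def class_mean_matrices(matrices, labels):
    """Average the insight matrices of all samples that share a label."""
    groups = defaultdict(list)
    for m, y in zip(matrices, labels):
        groups[y].append(m)
    means = {}
    for y, ms in groups.items():
        rows, cols = len(ms[0]), len(ms[0][0])
        means[y] = [[sum(m[r][c] for m in ms) / len(ms) for c in range(cols)]
                    for r in range(rows)]
    return means

def consistency_penalty(matrices, labels):
    """Mean squared distance between each sample's matrix and its class mean.

    Hypothetical stand-in for the regularization term in the abstract.
    """
    means = class_mean_matrices(matrices, labels)
    total = 0.0
    for m, y in zip(matrices, labels):
        total += sum((a - b) ** 2
                     for row, mean_row in zip(m, means[y])
                     for a, b in zip(row, mean_row))
    return total / len(matrices)
```

Because the products between features and classifier weights are already formed when the logits are computed, a penalty of this kind adds little work to the forward pass, which matches the low-overhead claim in the abstract.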
6558: Flexible Generalized Low-Rank Regularizer for Tensor RPCA
Authors: Zhiyang Gong, Jie Yu, Yutao Hu, Yulong Wang
Location: Guangzhou | Day: TBD
Show Abstract
Tensor Robust Principal Component Analysis (TRPCA) has emerged as a powerful technique for low-rank tensor recovery. To achieve better recovery performance, a variety of TNN (Tensor Nuclear Norm) based low-rank regularizers have been proposed case by case, lacking a general and flexible framework. In this paper, we design a novel tensor low-rank regularization framework coined FGTNN (Flexible Generalized Tensor Nuclear Norm). Equipped with FGTNN, we develop the FGTRPCA (Flexible Generalized TRPCA) framework, which has two desirable properties. 1) Generalizability: Many existing TRPCA methods can be viewed as special cases of our framework; 2) Flexibility: Using FGTRPCA as a general platform, we derive a series of new TRPCA methods by tuning a continuous parameter to improve performance. In addition, we develop another novel smooth and low-rank regularizer coined t-FGJP and the resulting SFGTRPCA (Smooth FGTRPCA) method by leveraging the low-rankness and smoothness priors simultaneously. Experimental results on various tensor denoising and recovery tasks demonstrate the superiority of our methods.
6581: Bridging Generative and Discriminative Learning: Few-Shot Relation Extraction via Two-Stage Knowledge-Guided Pre-training
Authors: Quanjiang Guo, Jinchuan Zhang, Sijie Wang, Ling Tian, Zhao Kang, Bin Yan, Weidong Xiao
Location: Guangzhou | Day: TBD
Show Abstract
Few-Shot Relation Extraction (FSRE) remains a challenging task due to the scarcity of annotated data and the limited generalization capabilities of existing models. Although large language models (LLMs) have shown potential in FSRE through in-context learning, their general-purpose training objectives often result in suboptimal performance for task-specific relation extraction. To overcome these challenges, we propose TKRE (Two-Stage Knowledge-Guided Pre-training for Relation Extraction), a novel framework that synergistically integrates LLMs with traditional relation extraction models, bridging generative and discriminative learning paradigms. TKRE introduces two key innovations: (1) leveraging LLMs to generate explanation-driven knowledge and schema-constrained synthetic data, addressing the issue of data scarcity; and (2) a two-stage pre-training strategy combining Masked Span Language Modeling (MSLM) and Span-Level Contrastive Learning (SCL) to enhance relational reasoning and generalization. Together, these components enable TKRE to effectively handle FSRE tasks. Comprehensive experiments on benchmark datasets demonstrate the efficacy of TKRE, achieving new state-of-the-art performance in FSRE and underscoring its potential for broader application in low-resource scenarios. The code and data are released on https://github.com/UESTC-GQJ/TKRE.
6586: Relation-Augmented Dueling Bayesian Optimization via Preference Propagation
Authors: Xiang Xia, Xiang Shu, Shuo Liu, Yiyi Zhu, Yijie Zhou, Weiye Wang, Bingdong Li, Hong Qian
Location: Guangzhou | Day: TBD
Show Abstract
In black-box optimization, when directly evaluating the function values of solutions is very costly or infeasible, access to the objective function is often limited to comparing pairs of solutions, which yields dueling black-box optimization. Dueling optimization is based solely on pairwise preferences, and thus notably reduces cost compared with function-value-based methods. However, the performance of dueling optimization is often limited because most existing methods do not make full use of the pairwise preferences collected. To better utilize these preferences, this paper proposes relation-augmented dueling Bayesian optimization (RADBO) via preference propagation. By considering solution similarity, RADBO aims to uncover the potential dueling relations between solutions within different preferences through the proposed preference propagation technique. Specifically, RADBO first clusters solutions using a Gaussian mixture model. After obtaining the solution set with the highest intra-cluster similarity, RADBO utilizes a directed hypergraph to model the potential dueling relations between solutions, thereby realizing relation augmentation. Extensive experiments are conducted on both synthetic functions and real-world tasks such as motion control, car cab design and spacecraft trajectory optimization. The experimental results demonstrate the satisfactory accuracy of augmented preferences in RADBO, and show the superiority of RADBO compared with existing dueling optimization methods. Notably, it is verified that, under the same evaluation cost budget, RADBO can be competitive with or even surpass function-value-based Bayesian optimization methods with respect to optimization performance.
6600: PerfSeer: An Efficient and Accurate Deep Learning Models Performance Predictor
Authors: Xinlong Zhao, Jiande Sun, Jia Zhang, Tong Liu, Ke Liu
Location: Guangzhou | Day: TBD
Show Abstract
Predicting the performance of deep learning (DL) models, such as execution time and resource utilization, is crucial for Neural Architecture Search (NAS), DL cluster schedulers, and other technologies that advance deep learning. The representation of a model is the foundation for its performance prediction. However, existing methods cannot comprehensively represent diverse model configurations, resulting in unsatisfactory accuracy. To address this, we represent a model as a graph that includes the topology, along with node, edge, and global features, all of which are crucial for effectively capturing the performance of the model. Based on this representation, we propose PerfSeer, a novel predictor that uses a Graph Neural Network (GNN)-based performance prediction model, SeerNet. SeerNet fully leverages the topology and various features, while incorporating optimizations such as Synergistic Max-Mean aggregation (SynMM) and Global-Node Perspective Boost (GNPB) to more effectively capture the critical performance information, enabling it to predict the performance of models accurately. Furthermore, SeerNet can be extended to SeerNet-Multi by using Project Conflicting Gradients (PCGrad), enabling efficient simultaneous prediction of multiple performance metrics without significantly affecting accuracy. We constructed a dataset containing performance metrics for 53k+ model configurations, including execution time, memory usage, and Streaming Multiprocessor (SM) utilization during both training and inference. The evaluation results show that PerfSeer outperforms nn-Meter, Brp-NAS, and DIPPM.
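The gradient-projection step that PCGrad is generally known for can be sketched as follows. This is a simplified illustration with one gradient vector per performance metric, not SeerNet-Multi's code: the original PCGrad procedure visits the other tasks in random order, whereas this sketch uses a fixed order so the result is reproducible, and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def pcgrad(grads):
    """Combine per-task gradients after removing pairwise conflicts.

    For each task gradient, whenever it has a negative inner product with
    another task's gradient, the conflicting component is projected out.
    The projected gradients are then summed into one update direction.
    """
    projected = []
    for i, g in enumerate(grads):
        g_pc = list(g)
        for j, other in enumerate(grads):
            if i == j:
                continue
            d = dot(g_pc, other)
            norm_sq = dot(other, other)
            if d < 0 and norm_sq > 0:
                # Subtract the projection of g_pc onto the conflicting gradient.
                g_pc = [x - (d / norm_sq) * y for x, y in zip(g_pc, other)]
        projected.append(g_pc)
    return [sum(col) for col in zip(*projected)]
```

For example, `pcgrad([[1.0, 0.0], [-1.0, 1.0]])` projects each gradient away from the other before summing, while two orthogonal gradients are simply added unchanged.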
6601: Approximation Fixpoint Theory as a Unifying Framework for Fuzzy Logic Programming Semantics
Authors: Pascal Kettmann, Jesse Heyninck, Hannes Strass
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: KRR: Logic programming
Show Abstract
Fuzzy logic programming is an established approach for reasoning under uncertainty. Several semantics from classical, two-valued logic programming have been generalized to the case of fuzzy logic programs. In this paper, we show that two of the most prominent classical semantics, namely the stable model and the well-founded semantics, can be reconstructed within the general framework of approximation fixpoint theory (AFT).

This not only widens the scope of AFT from two- to many-valued logics, but allows a wide range of existing AFT results to be applied to fuzzy logic programming. As first examples of such applications, we clarify the formal relationship between existing semantics, generalize the notion of stratification from classical to fuzzy logic programs, and devise “more precise” variants of the semantics.
6602: RPMIL: Rethinking Uncertainty-Aware Probabilistic Multiple Instance Learning for Whole Slide Pathology Diagnosis
Authors: Zhikang Zhao, Kaitao Chen, Jing Zhao
Location: Guangzhou | Day: TBD
Show Abstract
Whole slide images (WSIs) are gigapixel digital scans of traditional pathology slides, offering substantial support for cancer diagnosis. Current multiple instance learning (MIL) methods for WSIs typically extract instance features and aggregate these into a single bag feature for prediction. We observe that these MIL methods rely on point estimation, where each bag is mapped to a deterministic embedding. Such MIL methods based on point estimation fail to capture the full spectrum of data variability due to the reliance on fixed embedding, especially when the number of trainable bags is limited. In this paper, we rethink probabilistic modeling in MIL and propose RPMIL, an uncertainty-aware probabilistic MIL method for whole slide pathology diagnosis. RPMIL learns a probabilistic aggregator to consolidate instance features into dynamic bag feature distributions instead of a deterministic bag feature. Specifically, we employ a variational autoencoder approach to compress multiple instance features into a low-dimension space with probabilistic representation and obtain the bag feature distribution formulated by the mean and variance. Furthermore, we drive the prediction by jointly leveraging the instance feature distribution and bag feature distribution. We evaluate the WSI classification performance on two public datasets: Camelyon16 and TCGA-NSCLC. Extensive experiments demonstrate that our method surpasses point estimation methods in MIL, achieving state-of-the-art levels.
6603: Participatory Budgeting Project Strength via Candidate Control
Authors: Piotr Faliszewski, Łukasz Janeczko, Dušan Knop, Jan Pokorný, Šimon Schierreich, Mateusz Słuszniak, Krzysztof Sornat
Location: Montreal | Day: August 21st | Time: 15:00 | Session: GTEP: Computational social choice (2/2)
Show Abstract
We study the complexity of candidate control in participatory budgeting elections. The goal of constructive candidate control is to ensure that a given candidate wins by either adding or deleting candidates from the election (in the destructive setting, the goal is to prevent a given candidate from winning). We show that such control problems are NP-hard to solve for many participatory budgeting voting rules, including Phragmén and Equal-Shares, but there are natural cases with polynomial-time algorithms. We also argue that control by deleting candidates is a useful tool for assessing the performance (or, strength) of initially losing projects, and we support this view with experiments on real-life PB instances.
6606: SourceDetMamba: A Graph-aware State Space Model for Source Detection in Sequential Hypergraphs
Authors: Le Cheng, Peican Zhu, Yangming Guo, Chao Gao, Zhen Wang, Keke Tang
Location: Guangzhou | Day: TBD
Show Abstract
Source detection on graphs has demonstrated high efficacy in identifying rumor origins. Despite advances in machine learning-based methods, many fail to capture intrinsic dynamics of rumor propagation. In this work, we present SourceDetMamba: A Graph-aware State Space Model for Source Detection in Sequential Hypergraphs, which harnesses the recent success of the state space model Mamba, known for its superior global modeling capabilities and computational efficiency, to address this challenge. Specifically, we first employ hypergraphs to model high-order interactions within social networks. Subsequently, temporal network snapshots generated during the propagation process are sequentially fed in reverse order into Mamba to infer underlying propagation dynamics. Finally, to empower the sequential model to effectively capture propagation patterns while integrating structural information, we propose a novel graph-aware state update mechanism, wherein the state of each node is propagated and refined by both temporal dependencies and topological context. Extensive evaluations on eight datasets demonstrate that SourceDetMamba consistently outperforms state-of-the-art approaches.
6614: An Approach to Quantify Plans Robustness in Real-world Applications
Authors: Francesco Percassi, Sandra Castellanos-Paez, Romain Rombourg, Mauro Vallati
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Planning and Scheduling (3/5)
Show Abstract
Automated planning systems are increasingly deployed in real-world applications, often characterised by uncertainty and noise stemming from sensors, actuators, and environmental conditions. Under such circumstances, improving the deployability of generated plans requires assessing their robustness to varying conditions, thereby reducing the need for costly replanning. Replanning can be computationally intensive and may hinder the practical applicability of planning systems. In many domains, such as urban traffic control or underwater exploration, it is often sufficient for plans to reach an acceptable region rather than the exact goal.

A key distinction in this context lies between valid plans (which achieve the intended goal under ideal conditions) and executable plans (which remain feasible under uncertainty or perturbation). This paper formalises the notion of execution-invariant planning tasks, in which plans are robust to noise and uncertainty. To foster the adoption of automated planning in real-world settings, we propose a statistical framework for evaluating plan robustness, offering a quantifiable measure of a plan’s ability to reach a goal within a specified tolerance under diverse perturbations or uncertainty. We validate our approach in two real-world domains, demonstrating its effectiveness.
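A statistical robustness measure of this kind can be approximated by Monte Carlo sampling. The sketch below is only an illustration under strong assumptions that are not from the paper: a one-dimensional state, a plan given as a list of intended displacements, and independent Gaussian noise on each step. All names are hypothetical.

```python
import random

def execute(plan, noise_std, rng):
    """Run the plan once, adding Gaussian noise to every step (assumed noise model)."""
    position = 0.0
    for step in plan:
        position += step + rng.gauss(0.0, noise_std)
    return position

def estimate_robustness(plan, goal, tolerance, noise_std, trials=1000, seed=0):
    """Fraction of noisy executions that end within `tolerance` of `goal`."""
    rng = random.Random(seed)
    hits = sum(
        abs(execute(plan, noise_std, rng) - goal) <= tolerance
        for _ in range(trials)
    )
    return hits / trials
```

With zero noise a valid plan scores 1.0, and the score drops as the noise grows relative to the tolerance, which gives a single number for comparing plans that are all valid under ideal conditions.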
6615: Human Motion Capture from Loose and Sparse Inertial Sensors with Garment-aware Diffusion Models
Authors: Andela Ilic, Jiaxi Jiang, Paul Streli, Xintong Liu, Christian Holz
Location: Montreal | Day: August 19th | Time: 15:00 | Session: CV: Diffusion models
Show Abstract
Motion capture using sparse inertial sensors has shown great promise due to its portability and lack of occlusion issues compared to camera-based tracking. Existing approaches typically assume that IMU sensors are tightly attached to the human body. However, this assumption often does not hold in real-world scenarios. In this paper, we present a new task of full-body human pose estimation using sparse, loosely attached IMU sensors. To solve this task, we simulate IMU recordings from an existing garment-aware human motion dataset. We developed transformer-based diffusion models to synthesize loose IMU data and estimate human poses based on this challenging loose IMU data. In addition, we show that incorporating garment-related parameters while training the model on simulated loose data effectively maintains expressiveness and enhances the ability to capture variations introduced by looser or tighter garments. Experiments show that our proposed diffusion methods trained on simulated and synthetic data outperformed the state-of-the-art methods quantitatively and qualitatively, opening up a promising direction for future research.
6627: HyperDet: Source Detection in Hypergraphs via Interactive Relationship Construction and Feature-rich Attention Fusion
Authors: Le Cheng, Peican Zhu, Yangming Guo, Keke Tang, Chao Gao, Zhen Wang
Location: Guangzhou | Day: TBD
Show Abstract
Hypergraphs offer superior modeling capabilities for social networks, particularly in capturing group phenomena that extend beyond pairwise interactions in rumor propagation. Existing approaches in rumor source detection predominantly focus on dyadic interactions, which inadequately address the complexity of more intricate relational structures. In this study, we present a novel approach for Source Detection in Hypergraphs (HyperDet) via Interactive Relationship Construction and Feature-rich Attention Fusion. Specifically, our methodology employs an Interactive Relationship Construction module to accurately model both the static topology and dynamic interactions among users, followed by the Feature-rich Attention Fusion module, which autonomously learns node features and discriminates between nodes using a self-attention mechanism, thereby effectively learning node representations under the framework of accurately modeled higher-order relationships. Extensive experimental validation confirms the efficacy of our HyperDet approach, showcasing its superiority relative to current state-of-the-art methods.
6632: How to Resolve Envy by Adding Goods
Authors: Matthias Bentert, Robert Bredereck, Eva Deltl, Pallavi Jain, Leon Kellerhals
Location: Montreal | Day: August 21st | Time: 15:00 | Session: GTEP: Computational social choice (2/2)
Show Abstract
We consider the problem of resolving the envy of a given initial allocation by adding elements from a pool of goods. We give a characterization of the instances where envy can be resolved by adding an arbitrary number of copies of the items in the pool. From this characterization, we derive a polynomial-time algorithm returning a respective solution if it exists. If the number of copies or the total number of added items are bounded, the problem becomes computationally intractable even in various restricted cases. We perform a parameterized complexity analysis, focusing on the number of agents and the pool size as parameters. Notably, although not every instance admits an envy-free solution, our approach allows us to efficiently determine, in polynomial time, whether a solution exists—an aspect that is both theoretically interesting and far from trivial.
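To make the setting concrete, the sketch below checks which agents envy which others before and after goods are added. It assumes additive valuations, which the abstract does not state, and it only verifies a candidate solution; it is not the paper's characterization or algorithm. All names are hypothetical.

```python
def bundle_value(valuation, bundle):
    """Value of a bundle under an additive valuation (assumption)."""
    return sum(valuation[item] for item in bundle)

def envy_pairs(valuations, allocation):
    """Return all pairs (i, j) where agent i values j's bundle above its own."""
    pairs = []
    for i, val in enumerate(valuations):
        own = bundle_value(val, allocation[i])
        for j, other in enumerate(allocation):
            if i != j and bundle_value(val, other) > own:
                pairs.append((i, j))
    return pairs

# Agent 0 initially envies agent 1; giving agent 0 a copy of "c" from the pool resolves it.
valuations = [{"a": 3, "b": 1, "c": 2}, {"a": 2, "b": 1, "c": 0}]
before = envy_pairs(valuations, [["b"], ["a"]])
after = envy_pairs(valuations, [["b", "c"], ["a"]])
```

In this toy instance `before` contains the pair `(0, 1)` and `after` is empty, so adding one item from the pool removes the envy without creating new envy for agent 1.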
6634: A Timestep-Adaptive Frequency-Enhancement Framework for Diffusion-based Image Super-Resolution
Authors: Yueying Li, Hanbin Zhao, Jiaqing Zhou, Guozhi Xu, Tianlei Hu, Gang Chen, Haobo Wang
Location: Guangzhou | Day: TBD
Show Abstract
Image super-resolution (ISR) is a classic and challenging problem in computer vision because of complex and unknown degradation patterns in the data collection process. Leveraging powerful generative priors, diffusion-based methods have recently established new state-of-the-art ISR performance, but their characteristics in the frequency domain are still underexplored. In this paper, we innovatively investigate their frequency-domain behaviors from a sampling timestep perspective. Experimentally, we find that current diffusion-based ISR algorithms exhibit insufficiency in different frequency components in distinct groups of timesteps during the sampling. To address this, we first propose a Timestep Division Controller that is able to adaptively divide the timesteps into groups based on the performance gradient across different components. Next, we design two dedicated modules — the Amplitude and Phase Enhancement Module (APEM) and the High- and Low-Frequency Enhancement Module (HLEM), to regulate the information flow of distinct frequency-domain features. By adaptively enhancing specific frequency components at different stages of the sampling process, the two modules effectively compensate for the insufficient frequency-domain perception of diffusion-based ISR models. Extensive experiments on three benchmark datasets verify the superior ISR performance of our method, e.g., achieving an average 5.40% improvement on CLIP-IQA compared to the best diffusion-based ISR baseline.
6638: Towards Robust Incremental Learning Under Ambiguous Supervision
Authors: Rui Wang, Mingxuan Xia, Haobo Wang, Lei Feng, Junbo Zhao, Gang Chen, Chang Yao
Location: Guangzhou | Day: TBD
Show Abstract
Traditional Incremental Learning (IL) aims to handle sequential fully-supervised learning problems where novel classes emerge from time to time. However, due to inherent annotation uncertainty and ambiguity, collecting high-quality annotated data in a dynamic learning system can be extremely expensive. To mitigate this problem, we propose a novel weakly-supervised learning paradigm called Incremental Partial Label Learning (IPLL), where the sequentially arriving data are associated with a set of candidate labels rather than the ground truth. Technically, we develop the Prototype-Guided Disambiguation and Replay Algorithm (PGDR), which leverages the class prototypes as a proxy to mitigate two intertwined challenges in IPLL, i.e., label ambiguity and catastrophic forgetting. To handle the former, PGDR encapsulates a momentum-based pseudo-labeling algorithm along with prototype-guided initialization, resulting in a balanced perception of classes. To alleviate forgetting, we develop a memory replay technique that collects well-disambiguated samples while maintaining representativeness and diversity. By jointly distilling knowledge from curated memory data, our framework exhibits a great disambiguation ability for samples of new tasks and achieves less forgetting of knowledge. Extensive experiments demonstrate that PGDR achieves superior performance over the baselines in the IPLL task.
6660: Combining MORL with Restraining Bolts to Learn Normative Behaviour
Authors: Emery A. Neufeld, Agata Ciabattoni, Radu Florin Tulcan
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Knowledge Representation and Reasoning (2/4)
Show Abstract
Normative Restraining Bolts (NRBs) adapt the restraining bolt technique (originally developed for safe reinforcement learning) to ensure compliance with social, legal, and ethical norms. While effective, NRBs rely on trial-and-error weight tuning, which hinders their ability to enforce hierarchical norms; moreover, norm updates require retraining. In this paper, we reformulate learning with NRBs as a multi-objective reinforcement learning (MORL) problem, where each norm is treated as a distinct objective. This enables the introduction of Ordered Normative Restraining Bolts (ONRBs), which support algorithmic weight selection, prioritized norms, norm updates, and provide formal guarantees on minimizing norm violations. Case studies show that ONRBs offer a robust and principled foundation for RL-agents to comply with a wide range of norms while achieving their goals.
6669: NeuBM: Mitigating Model Bias in Graph Neural Networks Through Neutral Input Calibration
Authors: Jiawei Gu, Ziyue Qiao, Xiao Luo
Location: Guangzhou | Day: TBD
Graph Neural Networks (GNNs) have shown remarkable performance across various domains, yet they often struggle with model bias, particularly in the presence of class imbalance. This bias can lead to suboptimal performance and unfair predictions, especially for underrepresented classes. We introduce NeuBM (Neutral Bias Mitigation), a novel approach to mitigate model bias in GNNs through neutral input calibration. NeuBM leverages a dynamically updated neutral graph to estimate and correct the inherent biases of the model. By subtracting the logits obtained from the neutral graph from those of the input graph, NeuBM effectively recalibrates the model’s predictions, reducing bias across different classes. Our method integrates seamlessly into existing GNN architectures and training procedures, requiring minimal computational overhead. Extensive experiments on multiple benchmark datasets demonstrate that NeuBM significantly improves the balanced accuracy and recall of minority classes, while maintaining strong overall performance. The effectiveness of NeuBM is particularly pronounced in scenarios with severe class imbalance and limited labeled data, where traditional methods often struggle. We provide theoretical insights into how NeuBM achieves bias mitigation, relating it to the concept of representation balancing. Our analysis reveals that NeuBM not only adjusts the final predictions but also influences the learning of balanced feature representations throughout the network.
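The logit-subtraction step described in this abstract can be illustrated with a minimal sketch. The function names `calibrate_logits` and `predict` and the example numbers are hypothetical; how NeuBM builds and updates the neutral graph is not shown here.

```python
def calibrate_logits(input_logits, neutral_logits):
    """Subtract the neutral-graph logits from the input-graph logits, class by class."""
    if len(input_logits) != len(neutral_logits):
        raise ValueError("logit vectors must have the same number of classes")
    return [x - b for x, b in zip(input_logits, neutral_logits)]


def predict(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda c: logits[c])


# Hypothetical 3-class example: the model leans toward class 0 even on a neutral input.
raw = [2.0, 1.5, 0.5]
neutral = [1.5, 0.25, 0.25]
calibrated = calibrate_logits(raw, neutral)
```

In this toy case the uncalibrated prediction is class 0, while the calibrated logits favor class 1, because most of the score for class 0 is also present on the neutral input.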
6674: Disconfounding Fake News Video Explanation with Causal Inference
Authors: Lizhi Chen, Zhong Qian, Peifeng Li, Qiaoming Zhu
Location: Guangzhou | Day: TBD
The proliferation of fake news videos on social media has heightened the demand for credible verification systems. While existing methods focus on detecting false content, generating human-readable explanations for such predictions remains a critical challenge. Current approaches suffer from spurious correlations caused by two key confounders: 1) video object bias, where co-occurring objects entangle features leading to incorrect semantic associations; and 2) explanation aspect bias, where models over-rely on frequent aspects while neglecting rare ones. To address these issues, we propose CIFE, a causal inference framework that disentangles confounding factors to generate unbiased explanations. First, we formalize the problem through a Structural Causal Model (SCM) to identify confounding factors. We then introduce two novel modules: 1) the Interventional Video-Object Detector (IVOD), which employs backdoor adjustment to decouple object-level visual semantics; and 2) the Interventional Explanation Aspect Module (IEAM), which balances aspect selection during multimodal fusion. Extensive experiments on the FakeVE dataset demonstrate the effectiveness of CIFE, which generates more faithful explanations by mitigating object entanglement and aspect imbalance. Our code is available at https://github.com/Lieberk/CIFE.
6682: Proactive Data-driven Scheduling of Business Processes
Authors: Francesca Meneghello, Arik Senderovich, Massimiliano Ronzani, Chiara Di Francescomarino, Chiara Ghidini
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Planning and Scheduling (1/5)
Proactive scheduling creates robust offline schedules that optimize resource utilization and minimize job flow times. This work addresses scheduling challenges in business processes, often encountered in service systems, which differ from traditional applications like manufacturing due to inherent uncertainties in activity durations and human resource availability. We model the business process scheduling problem (BPSP) as a variation of stochastic resource-constrained multi-project scheduling (RCMPSP), and apply process mining to infer unknown parameter values from historical event data. To overcome the randomness in activity durations, we transform the problem into its deterministic counterpart, and prove that the latter provides a lower bound on the Makespan of the stochastic problem. Our approach integrates data-driven Monte Carlo simulation with constraint programming to generate proactive schedules that guarantee, with high probability, that the Makespan remains below a predefined threshold. We evaluate our approach using synthetic datasets with varying levels of uncertainty and size. In addition, we apply the approach to a real-world dataset from an outpatient cancer hospital, demonstrating its effectiveness in optimizing the process Makespan by an average of 5% to 14%.
6687: Theoretical Analysis of Evolutionary Algorithms with Quality Diversity for a Classical Path Planning Problem
Authors: Duc-Cuong Dang, Aneta Neumann, Frank Neumann, Andre Opris, Dirk Sudholt
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: S: Evolutionary computation (2/2)
Quality diversity (QD) algorithms, an extension of evolutionary algorithms, excel at generating diverse sets of high-quality solutions for complex problems in robotics, games, and combinatorial optimisation. Despite their success, the underlying mechanisms remain poorly understood due to a lack of a theoretical foundation. We address this gap by analysing QD algorithms on the all-pairs-shortest-paths (APSP) problem, a classical planning task that naturally seeks multiple solutions. Using Map-Elites, a prominent QD approach, we leverage its ability to evolve solutions across distinct regions of a behavioural space, which for APSP corresponds to all pairs of nodes in the graph.

Our analysis rigorously demonstrates that evolutionary algorithms using Map-Elites efficiently compute shortest paths for all node pairs in parallel by exploiting synergies in the behavioural space. By appending edges to an existing shortest path, mutation can create optimal solutions in other regions of the behavioural space. Crossover is particularly effective, as it can combine optimal paths from two regions to produce an optimal path for a third region simply by concatenating two shortest paths. Finally, refining the parent selection to facilitate successful crossovers exhibits significant speed-ups compared to standard QD approaches.
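The crossover idea in this abstract, concatenating two stored paths to fill a third cell of the behavioural space, can be sketched as follows. This is an illustrative archive update under simplifying assumptions, not the algorithm analysed in the paper: the names `path_length` and `crossover`, the toy graph, and the acceptance rule are hypothetical, and parent selection is not modelled.

```python
def path_length(path, weights):
    """Sum the edge weights along a path given as a list of nodes."""
    return sum(weights[(a, b)] for a, b in zip(path, path[1:]))


def crossover(archive, weights, u, w, v):
    """Concatenate the stored paths u->w and w->v and keep the result in
    cell (u, v) if that cell is empty or the new path is shorter."""
    first, second = archive.get((u, w)), archive.get((w, v))
    if first is None or second is None:
        return False
    candidate = first + second[1:]  # drop the duplicated middle node w
    if len(set(candidate)) != len(candidate):
        return False  # reject walks that revisit a node
    best = archive.get((u, v))
    if best is None or path_length(candidate, weights) < path_length(best, weights):
        archive[(u, v)] = candidate
        return True
    return False


# Hypothetical directed graph with three edges; one archive cell per node pair.
weights = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 5}
archive = {("a", "b"): ["a", "b"], ("b", "c"): ["b", "c"], ("a", "c"): ["a", "c"]}
improved = crossover(archive, weights, "a", "b", "c")
```

Here the direct edge a->c (length 5) is replaced by the concatenation a->b->c (length 3), which shows how optimal paths in two cells can yield an optimal path in a third.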
6688: More Efforts Towards Fixed-Parameter Approximability of Multiwinner Rules
Authors: Sushmita Gupta, Pallavi Jain, Souvik Saha, Saket Saurabh, Anannya Upasana
Location: Guangzhou | Day: TBD
Multiwinner elections have emerged as a prominent area of research with numerous practical applications. Given a set of candidates, C, a set of voters, V, each approving a subset of candidates (called the approval set of a voter), and an integer k, we consider the problem of selecting a “good” committee using Thiele rules. This problem is computationally challenging for most Thiele rules with monotone submodular satisfaction functions, as there is no (1-1/e-epsilon)-approximation algorithm running in f(k)(|C| + |V|)^(o(k)) time for any fixed epsilon > 0 and any computable function f, and no PTAS even when the size of each approval set is two. Skowron designed an approximation scheme running in FPT time parameterized by the combined parameter, size of the approval set, and k. In this paper, we consider a parameter d+k (no d voters approve the same set of d candidates), where d is upper bounded by the size of the approval set (thus, can be much smaller). With respect to this parameter, we design parameterized approximation schemes, a lossy polynomial-time preprocessing method, and show that an extra committee member suffices to achieve the desired score (i.e., 1-additive approximation). Additionally, we resolve an open question by Yang and Wang regarding the fixed-parameter tractability of the problem under the PAV rule with the total score as the parameter, demonstrating that it admits an FPT algorithm.
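As background for the PAV rule mentioned in this abstract, the following sketch computes the standard PAV score of a committee: each voter contributes 1 + 1/2 + ... + 1/j, where j is the number of committee members that voter approves. The function name `pav_score` and the example profile are hypothetical; the sketch only evaluates a given committee and does not implement any of the paper's algorithms.

```python
from fractions import Fraction


def pav_score(approvals, committee):
    """Return the PAV score of a committee as an exact fraction."""
    committee = set(committee)
    total = Fraction(0)
    for approved in approvals:
        j = len(committee & set(approved))
        # Harmonic number H(j); contributes 0 when the voter approves no member.
        total += sum(Fraction(1, i) for i in range(1, j + 1))
    return total


# Hypothetical profile: four voters, three candidates, committee size k = 2.
approvals = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
score = pav_score(approvals, {"a", "b"})
```

For this profile the committee {a, b} scores 3/2 + 1 + 1 + 0 = 7/2.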
6699: The First Theoretical Approximation Guarantees for the Non-Dominated Sorting Genetic Algorithm III (NSGA-III)
Authors: Renzhong Deng, Weijie Zheng, Benjamin Doerr
Location: Montreal | Day: August 19th | Time: 11:30 | Session: S: Evolutionary computation (1/2)
This work conducts a first theoretical analysis studying how well the NSGA-III approximates the Pareto front when the population size N is less than the Pareto front size. We show that when N is at least the number Nr of reference points, then the approximation quality, measured by the maximum empty interval (MEI) indicator, on the OneMinMax benchmark is such that there is no empty interval longer than ⌈(5-2√2)n/(Nr-1)⌉. This bound is independent of N, which suggests that further increasing the population size does not increase the quality of approximation when Nr is fixed. This is a notable difference to the NSGA-II with sequential survival selection, where increasing the population size improves the quality of the approximations. We also prove two results indicating approximation difficulties when N<Nr. These theoretical results suggest that the best setting to approximate the Pareto front is Nr=N. In our experiments, we observe that with this setting the NSGA-III computes optimal approximations, very different from the NSGA-II, for which optimal approximations have not been observed so far.
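A worked instance of the bound stated in this abstract may help gauge its size. The sketch below simply evaluates ⌈(5-2√2)n/(Nr-1)⌉; the function name `mei_upper_bound` is hypothetical, and reading n as the bit-string length of OneMinMax is an assumption, since the abstract does not define it.

```python
import math


def mei_upper_bound(n, num_ref_points):
    """Evaluate ceil((5 - 2*sqrt(2)) * n / (Nr - 1)) from the abstract."""
    if num_ref_points < 2:
        raise ValueError("the bound needs at least two reference points")
    return math.ceil((5 - 2 * math.sqrt(2)) * n / (num_ref_points - 1))


# Hypothetical setting: n = 100 and Nr = 11 reference points.
bound = mei_upper_bound(100, 11)
```

With these values the bound evaluates to 22, and it depends only on n and Nr, not on the population size N.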
6701: Minimizing Polarization and Disagreement in the Friedkin–Johnsen Model with Unknown Innate Opinions
Authors: Federico Cinus, Atsushi Miyauchi, Yuko Kuroki, Francesco Bonchi
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Multidisciplinary Topics and Applications (1/2)
The bulk of the literature on opinion optimization in social networks adopts the Friedkin–Johnsen (FJ) opinion dynamics model, in which the innate opinions of all nodes are known: this is an unrealistic assumption.
In this paper, we study opinion optimization under the FJ model without the full knowledge of innate opinions. Specifically, we borrow from the literature a series of objective functions, aimed at minimizing polarization and/or disagreement, and we tackle the budgeted optimization problem, where we can query the innate opinions of only a limited number of nodes.
Given the complexity of our problem, we propose a framework based on three steps: (1) select the limited number of nodes we query, (2) reconstruct the innate opinions of all nodes based on those queried, and (3) optimize the objective function with the reconstructed opinions. For each step of the framework, we present and systematically evaluate several effective strategies. A key contribution of our work is a rigorous error propagation analysis that quantifies how reconstruction errors in innate opinions impact the quality of the final solutions.
Our experiments on various synthetic and real-world datasets show that we can effectively minimize polarization and disagreement even if we have quite limited information about innate opinions.
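As background on the FJ model, the sketch below computes its equilibrium by fixed-point iteration: each node's expressed opinion is the weighted average of its innate opinion (weight 1) and its neighbours' expressed opinions. The function name `fj_equilibrium`, the iteration count, and the two-node example are hypothetical, and the weight matrix is assumed symmetric with a zero diagonal. The paper's query-selection and reconstruction strategies are not shown.

```python
def fj_equilibrium(weights, innate, iterations=200):
    """Iterate z_i <- (s_i + sum_j w_ij * z_j) / (1 + sum_j w_ij) to a fixed point."""
    n = len(innate)
    z = list(innate)
    for _ in range(iterations):
        z = [
            (innate[i] + sum(weights[i][j] * z[j] for j in range(n)))
            / (1 + sum(weights[i]))
            for i in range(n)
        ]
    return z


# Hypothetical network: two nodes joined by one unit-weight edge.
weights = [[0, 1], [1, 0]]
expressed = fj_equilibrium(weights, [0.0, 1.0])
```

In this example the expressed opinions settle near 1/3 and 2/3: each node is pulled toward its neighbour but stays anchored to its innate opinion.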
6719: Proven Approximation Guarantees in Multi-Objective Optimization: SPEA2 Beats NSGA-II
Authors: Yasser Alghouass, Benjamin Doerr, Martin S. Krejca, Mohammed Lagmah
Location: Montreal | Day: August 19th | Time: 11:30 | Session: S: Evolutionary computation (1/2)
Together with the NSGA-II and SMS-EMOA, the strength Pareto evolutionary algorithm 2 (SPEA2) is one of the most prominent dominance-based multi-objective evolutionary algorithms (MOEAs). Different from the NSGA-II, it does not employ the crowding distance (essentially the distance to neighboring solutions) to compare pairwise non-dominating solutions but a complex system of σ-distances that builds on the distances to all other solutions. In this work, we give a first mathematical proof showing that this more complex system of distances can be superior. More specifically, we prove that a simple steady-state SPEA2 can compute optimal approximations of the Pareto front of the OneMinMax benchmark in polynomial time. The best proven guarantee for a comparable variant of the NSGA-II only assures approximation ratios of roughly a factor of two, and both mathematical analyses and experiments indicate that optimal approximations are not found efficiently.
6729: CGI: Identifying Conditional Generative Models with Example Images
Authors: Zhi Zhou, Hao-Zhe Tan, Peng-Xiao Song, Lan-Zhe Guo
Location: Guangzhou | Day: TBD
Generative models have achieved remarkable performance recently, and thus model hubs have emerged. Existing model hubs typically assume basic text matching is sufficient to search for models. However, in reality, due to different abstractions and the large number of models in model hubs, it is not easy for users to review model descriptions and example images and choose which model best meets their needs. Therefore, it is necessary to describe model functionality wisely so that future users can efficiently search for the most suitable model for their needs. Efforts to address this issue remain limited. In this paper, we propose Conditional Generative Model Identification (CGI), which aims to provide an effective way to identify the most suitable model using user-provided example images rather than requiring users to manually review a large number of models with example images. To address this problem, we propose Prompt-Based Model Identification (PMI), which can adequately describe model functionality and precisely match requirements with specifications. To evaluate the PMI approach and promote related research, we provide a benchmark comprising 65 models and 9100 identification tasks. Extensive experimental and human evaluation results demonstrate that PMI is effective. For instance, 92% of models are correctly identified with significantly better FID scores when four example images are provided.
6742: Synthesis of Communication Policies for Multi-Agent Systems Robust to Communication Restrictions
Authors: Saleh Soudijani, Rayna Dimitrova
Location: Guangzhou | Day: TBD
We study stochastic multi-agent systems in which agents must cooperate to maximize the probability of achieving a common reach-avoid objective.
In many applications, during the execution of the system, the communication between the agents can be constrained by restrictions on the bandwidth currently available for exchanging local-state information between the agents.
In this paper, we propose a method for computing joint action and communication policies for the group of agents that aim to satisfy the communication restrictions as much as possible while achieving the optimal reach-avoid probability when communication is unconstrained. Our method synthesizes a pair of action and communication policies robust to restrictions on the number of agents allowed to communicate. To this end, we introduce a novel cost function that measures the amount of information exchanged beyond what the communication policy allows. We evaluate our approach experimentally on a range of benchmarks and demonstrate that it is capable of computing pairs of action and communication policies that satisfy the communication restrictions, if such policies exist.
6750: FedHAN: A Cache-Based Semi-Asynchronous Federated Learning Framework Defending Against Poisoning Attacks in Heterogeneous Clients
Authors: Xiaoding Wang, Bin Ye, Li Xu, Lizhao Wu, Sun-Yuan Hsieh, Jie Wu, Limei Lin
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Data Mining
Federated learning is vulnerable to model poisoning attacks in which malicious participants compromise the global model by altering the model updates. Current defense strategies are divided into three types: aggregation-based methods, validation dataset-based methods, and update distance-based methods. However, these techniques often neglect the challenges posed by device heterogeneity and asynchronous communication. Even upon identifying malicious clients, the global model may already be significantly damaged, requiring effective recovery strategies to reduce the attacker’s impact. Current recovery methods, which are based on historical update records, are limited in environments with device heterogeneity and asynchronous communication. To address these problems, we introduce FedHAN, a reliable federated learning algorithm designed for asynchronous communication and device heterogeneity. FedHAN customizes sparse models, uses historical client updates to impute missing parameters in sparse updates, dynamically assigns adaptive weights, and combines update deviation detection with update prediction-based model recovery. Theoretical analysis indicates that FedHAN achieves favorable convergence despite unbounded staleness and effectively discriminates between benign and malicious clients. Experiments reveal that FedHAN, compared to leading methods, increases the accuracy of the model by 7.86%, improves the detection accuracy of poisoning attacks by 12%, and enhances the recovery accuracy by 7.26%. As evidenced by these results, FedHAN exhibits enhanced reliability and robustness in intricate and dynamic federated learning scenarios.
6755: VideoHumanMIB: Unlocking Appearance Decoupling for Video Human Motion In-betweening
Authors: Haiwei Xue, Zhensong Zhang, Minglei Li, Zonghong Dai, Fei Yu, Fei Ma, Zhiyong Wu
Location: Guangzhou | Day: TBD
We propose VideoHumanMIB, a novel framework for Video Human Motion In-betweening that enables seamless transitions between different motion video clips, facilitating the generation of longer and more natural digital human videos. While existing video frame interpolation methods work well for similar motions in adjacent frames, they often struggle with complex human movements, resulting in artifacts and unrealistic transitions. To address these challenges, we introduce a two-stage approach: First, we design an Appearance Reconstruction AutoEncoder to decouple appearance and motion information, extracting robust appearance-invariant features. Second, we develop an enhanced diffusion pretrained network that leverages both motion optical flow and human pose as guidance conditions, enabling the model to learn comprehensive latent distributions of possible motions. Rather than operating directly in pixel space, our model works in a learned latent space, allowing it to better capture the underlying motion dynamics. The framework is optimized with a dual-frame constraint loss and a motion flow loss to ensure temporal consistency and natural movement transitions. Extensive experiments demonstrate that our approach generates highly realistic transition sequences that significantly outperform existing methods, particularly in challenging scenarios with large motion variations. The proposed VideoHumanMIB establishes a new baseline for human motion synthesis and enables more natural and controllable digital human animation.
6756: On Independence and SCC-Recursiveness in Assumption-Based Argumentation
Authors: Lydia Blümel, Anna Rapberger, Matthias Thimm, Francesca Toni
Location: Montreal | Day: August 19th | Time: 15:00 | Session: KRR: Argumentation
We introduce a notion of conditional independence in (flat) assumption-based argumentation (ABA), where independence between (sets of) assumptions amounts to the presence of information about one set of assumptions not impacting the acceptability of another. We study general properties, computational complexity, and the relation to independence in abstract argumentation. In light of the high computational complexity of deciding independence, we introduce sound methods for checking independence in polynomial time via two different routes: the first utilizes the strongly connected components (SCCs) of the instantiated abstract argumentation framework; the second exploits the structure of the ABA framework directly. Along the way, we introduce the notion of SCC-recursiveness for ABA.
6759: Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion
Authors: Qingguo Hu, Ante Wang, Jia Song, Delai Qiu, Qingsong Liu, Jinsong Su
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: CV: multimodal LLMs
Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer the masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g., GPT-4V) or human assistance. Then, LVLMs effectively self-improve through trial and error learning using these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely-used comprehensive benchmarks. Especially on specialized tasks, our method achieves an average improvement of 5.4% and 4.0% compared to the corresponding baselines when utilizing LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. Code and the supplementary file are available at https://github.com/XMUDeepLIT/CVC.
6761: A Finite-State Controller Based Offline Solver for Deterministic POMDPs
Authors: Alex Schutz, Yang You, Matías Mattamala, Ipek Caliskanelli, Bruno Lacerda, Nick Hawes
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Planning and Scheduling (5/5)
Deterministic partially observable Markov decision processes (DetPOMDPs) often arise in planning problems where the agent is uncertain about its environmental state but can act and observe deterministically. In this paper, we propose DetMCVI, an adaptation of the Monte Carlo Value Iteration (MCVI) algorithm for DetPOMDPs, which builds policies in the form of finite-state controllers (FSCs). DetMCVI solves large problems with a high success rate, outperforming existing baselines for DetPOMDPs. We also verify the performance of the algorithm in a real-world mobile robot forest mapping scenario.
6769: FSDFormer: Progressive Rain Removal Network Based on Fourier-Spatial Dual Transformer
Authors: Shuying Huang, Jiaxuan Yang, Yong Yang, Weiguo Wan
Location: Guangzhou | Day: TBD
Most rain removal methods based on deep learning typically adopt a single-stage network architecture to remove the rain streaks in rainy images by increasing the depth of the network. The increase in network depth will increase the computational complexity of the model, and the lack of guidance for intermediate features will lead to inaccurate feature learning. To address this issue, we propose a progressive rain removal network based on Fourier-spatial dual Transformer, called FSDFormer. The network consists of multiple rain removal stages, each with the same structure, which can utilize background prior features to guide the network to reconstruct rainless images with more texture information. Each stage consists of a prior extraction module (PEM), a prior attention fusion module (PAFM), and a U-Net including multiple Fourier-spatial dual Transformers (FSD-Transformers). Firstly, PEM is constructed to extract the background prior features from the input rainy image or the output of each stage. Then, a PAFM is designed to reconstruct accurate image background features by utilizing background prior features to guide the network. Finally, U-Net extracts and reconstructs features at different scales by constructing multiple FSD-Transformers to obtain rainless features at each stage. Extensive experimental results on synthetic and real datasets have shown that the proposed method outperforms some state-of-the-art (SOTA) rain removal methods in terms of visual quality and quantitative indicators. The source code is available at https://github.com/yangjiaxuan6250/FSDFormer.
6770: Tree-of-AdEditor: Heuristic Tree Reasoning for Automated Video Advertisement Editing with Large Language Model
Authors: Yuqi Zhang, Bin Guo, Nuo Li, Ying Zhang, Shijie Wang, Zhiwen Yu, Qing Li
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Planning and Scheduling (3/5)
Video advertising has become a popular marketing strategy on e-commerce platforms, requiring high-level semantic reasoning such as selling point discovery and narrative organization. Previous rule-based methods struggle with these complex tasks, and learning-based approaches demand large datasets and high training costs. Recently, Large Language Models (LLMs) have opened incredible opportunities for advancing intelligent video advertisement editing. However, Input-output (IO) prompting and Chain-of-Thought (CoT) struggle to adapt to the nonlinear thinking hierarchy of video editing, where editors iteratively select shots or revert them to explore potential editing solutions. While Tree-of-Thought (ToT) offers a conceptual structure that mirrors this hierarchy, it falls short in aligning with effective video advertising strategies and lacks robust fact-checking mechanisms. To address these issues, we propose a novel framework, Tree-of-AdEditor (ToAE), which constructs a reasoning tree to mimic human editors, and incorporates domain-specific theories and heuristic fact-checking to identify optimal editing solutions. Specifically, motivated by effective advertisement principles, we develop a "local-global" mechanism to guide the LLM in both shot-level and sequence-level decision-making. We introduce a visual incoherence pruning module to provide external heuristic fact-checking, ensuring visual attractiveness and reducing computation costs. Quantitative experiments and expert evaluation demonstrate the superiority of our method compared to baselines.
6771: Strategy-Architecture Synergy: A Multi-View Graph Contrastive Paradigm for Consistent Representations
Authors: Shuman Zhuang, Zhihao Wu, Yuhong Chen, Zihan Fang, Jiali Yin, Ximeng Liu
Location: Guangzhou | Day: TBD
Facing the growing diversity of multi-view data, multi-view graph-based models have made encouraging progress in handling multi-view data modeled as graphs. Graph Contrastive Learning (GCL) naturally fits multi-view graph data by treating their inherent views as augmentations. However, the development of GCL on multi-view graph data is still in its infancy. Challenges remain in designing strategies that coordinate preprocessing and contrastive learning, and in developing model architectures that automatically meet the needs of diverse views. To tackle these challenges, we propose a framework named CAMEL, which refines consistency learning by introducing a tailored contrastive paradigm for multi-view graphs. First, we theoretically analyze the positive effect of edge-dropping preprocessing on the consistency and quantify the factors that influence it. Paired with a learnable model architecture, the proposed adaptive edge-dropping preprocessing strategy is guided by dynamic topology, making the heterogeneity of views more controllable and better aligned with contrastive learning. Finally, we design a neighborhood consistency multi-view contrastive objective that enhances consistency information interaction by extending positive samples. Extensive experiments on downstream tasks, including node classification and clustering, validate the superiority of our proposed model.
6773: Finite-Time Analysis of Heterogeneous Federated Temporal Difference Learning
Authors: Ye Zhu, Xiaowen Gong, Shiwen Mao
Location: Guangzhou | Day: TBD
Federated Temporal Difference (FTD) learning has emerged as a promising framework for collaboratively evaluating policies without sharing raw data. Despite its potential, existing approaches often yield biased convergence results due to the inherent challenges of federated reinforcement learning, such as multiple local updates and environment heterogeneity. In response, we investigate federated temporal difference (TD) learning, focusing on collaborative policy evaluation with linear function approximation among agents operating in heterogeneous environments. We devise a heterogeneous federated temporal difference (HFTD) algorithm which iteratively aggregates agents’ local stochastic gradients for TD learning. The HFTD algorithm involves two major novel contributions: 1) it aims to find the optimal value function model for the mixture environment which is the environment randomly drawn from agents’ heterogeneous environments, using the local gradients of agents’ mean squared Bellman errors (MSBEs) for their respective environments; 2) it allows agents to perform different numbers of local iterations for TD learning based on their heterogeneous computational capabilities. We analyze the finite-time convergence of the HFTD algorithm for the scenarios of IID sampling and Markovian sampling respectively. By characterizing bounds on the convergence error, we show that the HFTD algorithm can exactly converge to the optimal model and also achieves linear speedups as the number of agents increases.
6781: Quantifying the Self-Interest Level of Markov Social Dilemmas
Authors: Richard Willis, Yali Du, Joel Z. Leibo, Michael Luck
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Agent-based and Multi-agent Systems (3/3)
This paper introduces a novel method for estimating the self-interest level of Markov social dilemmas.
We extend the concept of self-interest level from normal-form games to Markov games, providing a quantitative measure of the minimum reward exchange required to align individual and collective interests.
We demonstrate our method on three environments from the Melting Pot suite, representing either common-pool resources or public goods.
Our results illustrate how reward exchange can enable agents to transition from selfish to collective equilibria in a Markov social dilemma.
This work contributes to multi-agent reinforcement learning by providing a practical tool for analysing complex, multistep social dilemmas.
Our findings offer insights into how reward structures can promote or hinder cooperation, with potential applications in areas such as mechanism design.
6785: Avoiding Undesired Future with Sequential Decisions
Authors: Lue Tao, Tian-Zuo Wang, Yuan Jiang, Zhi-Hua Zhou
Location: Guangzhou | Day: TBD
Machine learning has advanced in predictive tasks, but practitioners often need to proactively avoid undesired outcomes rather than just predicting them. To this end, a framework called rehearsal has been introduced, which tackles the avoiding undesired future (AUF) problem by modeling how variables influence each other and searching for a decision that leads to desired results. In this paper, we propose a novel rehearsal approach for addressing the AUF problem by making a sequence of decisions, where each decision is dynamically informed by the latest observations via retrospective inference. Theoretically, we show that sequential decisions in our approach tend to achieve a higher success rate in avoiding undesired outcomes by more reliably inferring the outcome of actions compared with existing solutions. Perhaps surprisingly, our approach remains advantageous even under imprecise modeling of relations between variables, and we provide a sufficient condition under which the advantage holds. Finally, experimental results confirm the practical effectiveness of the proposed approach in both simulated and real-world tasks.
6787: Most General Explanations of Tree Ensembles
Authors: Yacine Izza, Alexey Ignatiev, Sasha Rubin, Joao Marques-Silva, Peter J. Stuckey
Location: Guangzhou | Day: TBD
Explainable Artificial Intelligence (XAI) is critical for attaining trust in the operation of AI systems. A key question of an AI system is “why was this decision made this way”. Formal approaches to XAI use a formal model of the AI system to identify abductive explanations. While abductive explanations may be applicable to a large number of inputs sharing the same concrete values, more general explanations may be preferred for numeric inputs.
So-called inflated abductive explanations give intervals for each feature, ensuring that any input whose values fall within these intervals is still guaranteed to receive the same prediction. Inflated explanations cover a larger portion of the input space, and hence are deemed more general explanations. But there can be many (inflated) abductive explanations for an instance. Which is the best? In this paper, we show how to find a most general abductive explanation for an AI decision. This explanation covers as much of the input space as possible, while still being a correct formal explanation of the model’s behaviour. Given that we only want to give a human one explanation for a decision, the most general explanation gives us the explanation with the broadest applicability, and hence the one most likely to seem sensible.
6788: Attention-based Conditional Random Field for Financial Fraud Detection
Authors: Xiaoguang Wang, Chenxu Wang, Luyue Zhang, Xiaole Wang, Mengqin Wang, Huanlong Liu, Tao Qin
Location: Guangzhou | Day: TBD
Financial fraud detection is critical for market transparency and regulatory compliance. Existing methods often ignore the temporal patterns in financial data, which are essential for understanding dynamic financial behaviors and detecting fraud. Moreover, they treat companies as independent entities, overlooking the valuable interrelationships. To address these issues, we propose ACRF-RNN, a Recurrent Neural Network (RNN) with Attention-based Conditional Random Field (CRF) for fraud detection. Specifically, we use an RNN with a sliding window to capture temporal dependencies from historical data, and an attention-based CRF feature transformer to model inter-company relationships. This transforms raw financial data into optimized features, which are fed into a multi-layer perceptron for classification. We also use the focal loss to alleviate the class imbalance problem caused by rare fraudulent cases. This work presents a novel real-world dataset to evaluate the performance of ACRF-RNN. Extensive experiments show that ACRF-RNN outperforms the state-of-the-art methods by 15.28% in KS and 4.04% in Recall.
Data and code are available at: https://github.com/XNetLab/ACRF-RNN.git.
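The focal loss mentioned in the abstract above is a standard remedy for class imbalance. It can be sketched for a single binary prediction as follows. This is a minimal illustration of the general formula, not the ACRF-RNN code: the function name and the default values of `gamma` and `alpha` are assumptions chosen for the example.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one example (illustrative helper, not from the paper).

    p: predicted probability of the positive (fraud) class
    y: true label, 0 or 1
    gamma: focusing parameter; larger values down-weight easy examples
    alpha: weight of the positive class (1 - alpha for the negative class)
    """
    # Probability the model assigns to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0).
    p_t = min(max(p_t, eps), 1.0 - eps)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction contributes far less than a wrong one.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary cross-entropy, which is a convenient sanity check.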
6797: Constrained Serial Dictatorships Can Be Fair
Authors: Sylvain Bouveret, Hugo Gilbert, Jérôme Lang, Guillaume Méroué
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Game Theory and Economic Paradigms
When allocating indivisible items to agents, it is known that the only strategyproof mechanisms that satisfy a set of rather mild conditions are constrained serial dictatorships: given a fixed order over agents, at each step the designated agent chooses a given number of items (depending on her position in the sequence).
Agents who come earlier in the sequence have a larger choice of items; however, this advantage can be compensated by a higher number of items received by those who come later. How to balance priority in the sequence and number of items received is a nontrivial question.
We use a previous model, parameterized by a mapping from ranks to scores, a social welfare functional, and a distribution over preference profiles. For several meaningful choices of parameters, we show that the optimal sequence can be computed exactly in polynomial time or approximated using sampling.
Our results hold for several probabilistic models on preference profiles, with an emphasis on the Plackett-Luce model.
We conclude with experimental results showing how the optimal sequence is impacted by various parameters.
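The mechanism described in the abstract above (a fixed order over agents, with the agent at each position choosing a given number of items) can be sketched as below. The function name, the data layout, and the toy preference profile are assumptions made for illustration; the sketch does not compute the optimal sequence studied in the paper.

```python
def constrained_serial_dictatorship(order, quotas, rankings):
    """Run a constrained serial dictatorship (illustrative sketch).

    order: list of agents, earliest picker first
    quotas: quotas[k] is the number of items picked at position k
    rankings: rankings[agent] lists items from most to least preferred
    """
    taken = set()
    allocation = {}
    for position, agent in enumerate(order):
        picks = []
        for item in rankings[agent]:
            if len(picks) == quotas[position]:
                break
            if item not in taken:
                picks.append(item)
                taken.add(item)
        allocation[agent] = picks
    return allocation

# Toy profile: the last agent is compensated for picking late with two items.
rankings = {
    "a": ["x", "y", "z", "w"],
    "b": ["x", "z", "y", "w"],
    "c": ["y", "x", "w", "z"],
}
result = constrained_serial_dictatorship(["a", "b", "c"], [1, 1, 2], rankings)
```

Here agent `a` takes `x`, agent `b` falls back to `z`, and agent `c` receives `y` and `w`. This illustrates the trade-off between picking early and receiving more items.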
6819: A Novel Sparse Active Online Learning Framework for Fast and Accurate Streaming Anomaly Detection Over Data Streams
Authors: Zhong Chen, Yi He, Di Wu, Chen Zhao, Meikang Qiu
Location: Guangzhou | Day: TBD
Online Anomaly Detection (OAD) is critical for identifying rare yet important data points in large, dynamic, and complex data streams. A key challenge lies in achieving accurate and consistent detection of anomalies while maintaining computational and memory efficiency. Conventional OAD approaches, which depend on distributional deviations and static thresholds, struggle with model update delays and catastrophic forgetting, leading to missed detections and high false positive rates. To address these limitations, we propose a novel Streaming Anomaly Detection (SAD) method, grounded in a sparse active online learning framework. Our approach uniquely integrates ℓ1,2-norm sparse online learning with CUR decomposition-based active learning, enabling simultaneous fast feature selection and dynamic instance selection. The efficient CUR decomposition further supports real-time residual analysis for anomaly scoring, eliminating the need for manual threshold settings about temporal data distributions. Extensive experiments on diverse streaming datasets demonstrate SAD’s superiority, achieving a 14.06% reduction in detection error rates compared to five state-of-the-art competitors.
6838: Multi-Organizational Scheduling: Individual Rationality, Optimality, and Complexity
Authors: Jiehua Chen, Martin Durand, Christian Hatschka
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Game Theory and Economic Paradigms
We investigate multi-organizational scheduling problems, building upon the framework introduced by Pascual et al. in 2009. In this setting, multiple organizations each own a set of identical machines and sequential jobs with distinct processing times. The challenge lies in optimally assigning jobs across organizations’ machines to minimize the overall makespan while ensuring no organization’s performance deteriorates. To formalize this fairness constraint, we introduce individual rationality, a game-theoretic concept that guarantees each organization benefits from participation. Our analysis reveals that finding an individually rational schedule with minimum makespan is Θ_2^P-hard, placing it in a complexity class strictly harder than both NP and coNP. We further extend the model by considering an alternative objective: minimizing the sum of job completion times, both within individual organizations and across the entire system. The corresponding decision variant proves to be NP-complete. Through comprehensive parameterized complexity analysis of both problems, we provide new insights into these computationally challenging multi-organizational scheduling scenarios.
6841: A Logic of General Attention Using Edge-Conditioned Event Models
Authors: Gaia Belardinelli, Thomas Bolander, Sebastian Watzl
Location: Montreal | Day: August 20th | Time: 14:00 | Session: KR: Logic
In this work, we present the first general logic of attention. Attention is a powerful cognitive ability that allows agents to focus on potentially complex information, such as logically structured propositions, higher-order beliefs, or what other agents pay attention to. This ability is a strength, as it helps to ignore what is irrelevant, but it can also introduce biases when some types of information or agents are systematically ignored. Existing dynamic epistemic logics for attention cannot model such complex attention scenarios, as they only model attention to atomic formulas. Additionally, such logics quickly become cumbersome, as their size grows exponentially in the number of agents and announced literals. Here, we introduce a logic that overcomes both limitations. First, we generalize edge-conditioned event models, which we show to be as expressive as standard event models yet exponentially more succinct (generalizing both standard event models and generalized arrow updates). Second, we extend attention to arbitrary formulas, allowing agents to also attend to other agents’ beliefs or attention. Our work treats attention as a modality, like belief or awareness. We introduce attention principles that impose closure properties on that modality and that can be used in its axiomatization. Throughout, we illustrate our framework with examples of AI agents reasoning about human attention, demonstrating how such agents can discover attentional biases.
6871: Optimizing Parameters of Quantum Circuits with Sparsity-Inducing Coordinate Descent
Authors: Rudy Raymond, Zichang He
Location: Montreal | Day: August 20th | Time: 14:00 | Session: ML: Machine Learning 6/8
Parameterized Quantum Circuits (PQCs) are a family of structured quantum circuits that consist of quantum gates whose parameters are optimized with classical computers. With the quest for a potential speedup, there is a need to run larger quantum circuits, which in turn makes parameter optimization an arduous task. In this paper, we propose a generic method, called Rotolasso, that utilizes sparsity-inducing coordinate descent (CD) to optimize parameters of a PQC for balancing its accuracy and the number of parameterized gates. The use of CD allows significant reduction in the number of quantum circuit runs, and the sparsity in the model leads to simpler and faster PQCs, both of which are important ingredients to overcome limitations of near-term quantum devices. We provide theoretical analyses and demonstrate experiments showing the effectiveness of Rotolasso to solve instances of combinatorial optimization problems.
6876: A Non-Interventionist Approach to Causal Reasoning Based on Lewisian Counterfactuals
Authors: Carlos Aguilera-Ventura, Xinghan Liu, Emiliano Lorini, Dmitry Rozplokhas
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Knowledge Representation and Reasoning (3/4)
We present a computationally grounded semantics for counterfactual conditionals in which i) the state in a model is decomposed into two elements: a propositional valuation and a causal base in propositional form that represents the causal information available at the state; and ii) the comparative similarity relation between states is computed from the states’ two components. We show that, by means of our semantics, we can elegantly formalize the notion of actual cause without resorting to the primitive notion of intervention. Furthermore, we provide a succinct formulation of the model checking problem for a language of counterfactual conditionals in our semantics. We show that this problem is PSPACE-complete and provide a reduction of it into QBF that can be used for automatic verification of causal properties.
6890: A Descent-based Method on the Duality Gap for Solving Zero-sum Games
Authors: Michail Fasoulakis, Evangelos Markakis, Georgios Roussakis, Christodoulos Santorinaios
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Game Theory
We focus on the design of algorithms for finding equilibria in 2-player zero-sum games. Although it is well known that such problems can be solved by a single linear program, there has been a surge of interest in recent years for simpler algorithms, motivated in part by applications in machine learning. Our work proposes such a method, inspired by the observation that the duality gap (a standard metric for evaluating convergence in min-max optimization problems) is a convex function for bilinear zero-sum games. To this end, we analyze a descent-based approach, variants of which have also been used as a subroutine in a series of algorithms for approximating Nash equilibria in general non-zero-sum games.
In particular, we study a steepest descent approach, by finding the direction that minimises the directional derivative of the duality gap function.
Our main theoretical result is that the derived algorithms achieve a geometric decrease in the duality gap until we reach an approximate equilibrium. Finally, we complement this with an experimental evaluation, which provides promising findings. Our algorithm is comparable with (and in some cases outperforms) some of the standard approaches for solving 0-sum games, such as OGDA (Optimistic Gradient Descent/Ascent), even with thousands of available strategies per player.
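The duality gap that the abstract above uses as its objective can be computed directly for a bilinear zero-sum game. The sketch below assumes a payoff matrix `A` in which the row player maximizes `x^T A y` and the column player minimizes it; the function name and the matching-pennies example are illustrative assumptions, and the descent algorithm itself is not reproduced here.

```python
def duality_gap(A, x, y):
    """Duality gap of mixed strategies (x, y) in the zero-sum game A.

    The gap is the row player's best-response payoff against y minus the
    column player's best-response payoff against x. It is non-negative
    and equals zero exactly at a Nash equilibrium.
    """
    rows, cols = len(A), len(A[0])
    # Payoff of each pure row strategy against y.
    row_payoffs = [sum(A[i][j] * y[j] for j in range(cols)) for i in range(rows)]
    # Payoff conceded by each pure column strategy against x.
    col_payoffs = [sum(A[i][j] * x[i] for i in range(rows)) for j in range(cols)]
    return max(row_payoffs) - min(col_payoffs)

# Matching pennies: uniform play is the unique equilibrium.
A = [[1, -1], [-1, 1]]
gap_at_equilibrium = duality_gap(A, [0.5, 0.5], [0.5, 0.5])
gap_at_pure_profile = duality_gap(A, [1.0, 0.0], [1.0, 0.0])
```

Because the gap is a maximum of linear functions of `y` plus a maximum of linear functions of `x`, it is convex in `(x, y)`, which is the property the abstract relies on.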
6896: A Sequent Calculus for Answer Set Entailment
Authors: Thomas Eiter, Tobias Geibinger
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: KRR: Logic programming
Answer Set Programming (ASP) is a popular nonmonotonic formalism used for common-sense reasoning and problem-solving based on stable model semantics. Equilibrium logic is a generalisation of ASP for arbitrary propositional theories and thus provides a logical characterisation of the nonmonotonic stable model semantics. In contrast to classical logic, which can be defined via proof or model theory, nonmonotonic reasoning formalisms are defined via their models exclusively. Equilibrium logic is no exception here, as it has no proper proof-theoretic axiomatisation. Besides being a theoretical imbalance, this also has consequences regarding notions of justification and explainability. In this work, we fill this gap by providing a sequent calculus for answer set entailment. Our calculus builds upon ideas from existing calculi for other nonmonotonic formalisms and utilises calculi for the logic of here and there, which is the underlying base logic of equilibrium logic. We show that the calculus is sound and complete and discuss pitfalls as well as alternative axiomatisations. Finally, we address how our approach can be of use for explainability in ASP.
6901: Scalable Speed-ups for the SMS-EMOA from a Simple Aging Strategy
Authors: Mingfeng Li, Weijie Zheng, Benjamin Doerr
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: S: Evolutionary computation (2/2)
Different from single-objective evolutionary algorithms, where non-elitism is an established concept, multi-objective evolutionary algorithms almost always select the next population in a greedy fashion. In the only notable exception, a stochastic selection mechanism was recently proposed for the SMS-EMOA and was proven to speed up computing the Pareto front of the bi-objective jump benchmark with problem size n and gap parameter k by a factor of max{1,2^(k/4)/n}. While this constitutes the first proven speed-up from non-elitist selection, suggesting a very interesting research direction, it has to be noted that a true speed-up only occurs for k ≥ 4log(n), where the runtime is super-polynomial, and that the advantage reduces for larger numbers of objectives as shown in a later work. In this work, we propose a different non-elitist selection mechanism based on aging, which exempts individuals younger than a certain age from a possible removal. This remedies the two shortcomings of stochastic selection: We prove a speed-up by a factor of max{1,Θ(k)^(k-1)}, regardless of the number of objectives. In particular, a positive speed-up can already be observed for constant k, the only setting for which polynomial runtimes can be witnessed. Overall, this result supports the use of non-elitist selection schemes, but suggests that aging-based mechanisms can be considerably more powerful than stochastic selection mechanisms.
6902: The Core of Approval-Based Committee Elections with Few Seats
Authors: Dominik Peters
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Game Theory and Economic Paradigms
In an approval-based committee election, the goal is to select a committee consisting of k out of m candidates, based on n voters who each approve an arbitrary number of the candidates. The core of such an election consists of all committees that satisfy a certain stability property which implies proportional representation. In particular, committees in the core cannot be "objected to" by a coalition of voters that is underrepresented. The notion of the core was proposed in 2016, but it has remained an open problem whether it is always non-empty. We prove that core committees always exist when k ≤ 8, for any number of candidates m and any number of voters n, by showing that the Proportional Approval Voting (PAV) rule, proposed by Thiele in 1895, always satisfies the core when k ≤ 7 and always selects at least one committee in the core when k = 8. We also develop an artificial rule based on recursive application of PAV, and use it to show that the core is non-empty whenever there are m ≤ 15 candidates, for any committee size k ≤ m and any number of voters n. These results are obtained with the help of computer search using linear programs.
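The PAV rule named in the abstract above gives each voter a score of 1 + 1/2 + ... + 1/t for t approved committee members and selects the committees with the highest total score. A brute-force sketch is shown below. The function names and the toy profile are assumptions for illustration, and exhaustive enumeration is only practical for small instances; it is not the linear-programming search used in the paper.

```python
from itertools import combinations

def pav_score(committee, approvals):
    """Total PAV score: each voter contributes the harmonic number H(t),
    where t is how many committee members that voter approves."""
    score = 0.0
    for ballot in approvals:
        hits = len(ballot & committee)
        score += sum(1.0 / t for t in range(1, hits + 1))
    return score

def pav_winners(candidates, approvals, k, tol=1e-12):
    """Return all size-k committees that maximize the PAV score."""
    best, winners = -1.0, []
    for combo in combinations(sorted(candidates), k):
        committee = set(combo)
        s = pav_score(committee, approvals)
        if s > best + tol:
            best, winners = s, [committee]
        elif abs(s - best) <= tol:
            winners.append(committee)
    return winners

# Toy profile: three voters approve {a, b}, one approves {a}, two approve {c}.
approvals = [{"a", "b"}] * 3 + [{"a"}] + [{"c"}] * 2
winners = pav_winners({"a", "b", "c"}, approvals, k=2)
```

In this profile the committee {a, c} scores 6, ahead of {a, b} at 5.5 and {b, c} at 5, so the minority approving only `c` is represented.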
6913: Are Large Language Models Fluent in Declarative Process Mining?
Authors: Valeria Fionda, Antonio Ielo, Francesco Ricca
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Knowledge Representation and Reasoning (4/4)
Recent advancements in AI have made LLMs valuable tools for automating the interpretation of textual descriptions of business processes and for converting formal process specifications into natural language. However, there are no practical methodologies or systematic assessments to ensure these automatic translations are faithful. This paper proposes a novel approach, based on an auxiliary bidirectional translation task, to assess LLMs’ performance quantitatively; it also empirically evaluates the performance of state-of-the-art LLMs for bidirectional translations between natural language and declarative formal process specifications. The results reveal substantial variability in performance among the LLMs, highlighting the importance of LLM selection and confirming the need for a robust method for assessing LLMs’ outputs.
6919: Heterophily-Aware Personalized PageRank for Node Classification
Authors: Giuseppe Pirrò
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Reinforcement learning (1/2)
Node classification in heterophilous graphs, where connected nodes often have different characteristics, presents a significant challenge. We introduce HAPPY, which combines heterophily-aware random walks with targeted subgraph extraction. Our approach enhances Personalized PageRank by incorporating both label and feature diversity into the random walk process. Through theoretical analysis, we demonstrate that HAPPY effectively captures both homophilous and heterophilous relationships. Comprehensive experiments validate our method’s state-of-the-art performance across challenging heterophilous benchmarks.
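For reference, the standard Personalized PageRank that the abstract above builds on can be sketched with a simple power iteration. This shows only the classical version with uniform transition probabilities; the heterophily-aware reweighting used by HAPPY is not specified in the abstract and is not reproduced. The function name, the restart probability, and the toy graph are assumptions for illustration.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=100):
    """Classical Personalized PageRank by power iteration (illustrative).

    adj: dict mapping each node to a list of out-neighbours; every
         neighbour must itself be a key of adj
    source: the node the walk restarts from
    alpha: restart probability
    """
    nodes = list(adj)
    rank = {v: 1.0 if v == source else 0.0 for v in nodes}
    for _ in range(iters):
        # Restart mass always returns to the source node.
        nxt = {v: alpha if v == source else 0.0 for v in nodes}
        for u in nodes:
            out = adj[u]
            if not out:
                # Dangling node: send its remaining mass back to the source.
                nxt[source] += (1 - alpha) * rank[u]
                continue
            share = (1 - alpha) * rank[u] / len(out)
            for v in out:
                nxt[v] += share
        rank = nxt
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = personalized_pagerank(graph, "a")
```

The scores sum to one and concentrate around the source node. A heterophily-aware variant would replace the uniform `share` with transition weights that depend on label or feature differences.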
6926: What Can We Learn From MIMO Graph Convolutions?
Authors: Andreas Roth, Thomas Liebig
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Reinforcement learning (1/2)
Most graph neural networks (GNNs) utilize approximations of the general graph convolution derived in the graph Fourier domain. While GNNs are typically applied in the multi-input multi-output (MIMO) case, the approximations are performed in the single-input single-output (SISO) case. In this work, we first derive the MIMO graph convolution through the convolution theorem and approximate it directly in the MIMO case. We find the key MIMO-specific property of the graph convolution to be operating on multiple computational graphs, or equivalently, applying distinct feature transformations for each pair of nodes. As a localized approximation, we introduce localized MIMO graph convolutions (LMGCs), which generalize many linear message-passing neural networks. For almost every choice of edge weights, we prove that LMGCs with a single computational graph are injective on multisets, and the resulting representations are linearly independent when more than one computational graph is used. Our experimental results confirm that an LMGC can combine the benefits of various methods.
6933: Curriculum Abductive Learning for Mitigating Reasoning Shortcuts
Authors: Wen-Da Wei, Xiao-Wen Yang, Jie-Jing Shao, Lan-Zhe Guo
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Machine Learning (2/4)
Abductive Learning (ABL), a prominent neural-symbolic learning algorithm, integrates perception models with logical reasoning via intermediate symbolic concepts, substantially improving the interpretability and generalization of AI systems. However, a significant challenge in this domain is the issue of reasoning shortcuts, where the system achieves high final prediction accuracy but generates incorrect intermediate concept inferences, severely undermining ABL’s interpretability and generalization capabilities. Current methods for mitigating this problem often neglect potential correlations among training samples, leading to suboptimal performance. This paper reveals that simple samples can facilitate the learning of intermediate concepts in complex samples, motivating our proposed Curriculum Abductive Learning (CurABL) technique. This approach employs a curriculum training strategy, integrating a knowledge transfer mechanism from simple to complex samples, effectively addressing the issue of reasoning shortcuts. Comprehensive experimental results demonstrate that the CurABL method substantially improves the ABL framework’s capability to extract intermediate concepts, especially in difficult tasks, and accelerates the training convergence rate, thus markedly enhancing its robustness against reasoning shortcuts.
6938: GCNT: Graph-Based Transformer Policies for Morphology-Agnostic Reinforcement Learning
Authors: Yingbo Luo, Meibao Yao, Xueming Xiao
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Robotics
Training a universal controller for robots with different morphologies is a promising research trend, since it can significantly enhance the robustness and resilience of the robotic system. However, diverse morphologies can yield different dimensions of state space and action space, making it difficult to comply with traditional policy networks. Existing methods address this issue by modularizing the robot configuration, but do not adequately extract and utilize the overall morphological information, which has been proven crucial for training a universal controller. To this end, we propose GCNT, a morphology-agnostic policy network based on an improved Graph Convolutional Network (GCN) and Transformer. It exploits the fact that GCN and Transformer can handle an arbitrary number of modules to achieve compatibility with diverse morphologies. Our key insight is that the GCN is able to efficiently extract morphology information of robots, while the Transformer ensures that it is fully utilized by allowing each node of the robot to communicate this information directly. Experimental results show that our method can generate resilient locomotion behaviors for robots with different configurations, including zero-shot generalization to robot morphologies not seen during training. In particular, GCNT achieved the best performance on 8 tasks in the 2 standard benchmarks.
6940: Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models
Authors: Guangzhi Xiong, Eric Xie, Corey Williams, Myles Kim, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: LLM applications
Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, as verifying their accuracy often requires substantial time and resources. Additionally, the hallucination problem in LLMs can lead to the generation of hypotheses that appear plausible but are ultimately incorrect, undermining their reliability. To facilitate the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the capabilities of LLMs in generating truthful scientific hypotheses, and KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at https://github.com/Teddy-XiongGZ/TruthHypo.
6948: Cap-and-Penalize: Competitive Mechanisms for Multi-Phase Regularized Online Allocation
Authors: Seyedehkimia Alaviyar, Faraz Zargari, John Tyler, Yunwei Ryan Li, Xiaoqi Tan
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Game Theory and Economic Paradigms
This paper introduces a novel mechanism for online allocation with multi-phase, non-separable regularizers, termed Cap-and-Penalize (CnP), inspired by real-world applications such as cap-and-tax policies in carbon pricing. The CnP regularizer models a multi-phase cost structure, imposing a monotone convex penalty when total allocation exceeds a predefined level (soft cap) and enforcing a strict limit (hard cap) beyond which allocation is prohibited. Our contributions are twofold: (1) we propose an online mechanism for CnP-regularized allocation without per-step resource constraints, which operates as a simple and intuitive posted-price mechanism, but achieves the best-possible guarantee among all possible online algorithms; (2) we tackle the more complex setting with per-step resource constraints by decomposing the regularizer into local components, yielding a similar mechanism with time-dependent marginal pricing functions. To establish the tightness of our results in both settings, we introduce a representative function-based approach that transforms the lower-bound proof into the problem of solving an ordinary differential equation with boundary conditions. We believe that this technique has the potential to be applied to other similar online optimization problems.
6952: Grounding Methods for Neural-Symbolic AI
Authors: Rodrigo Castellano Ontiveros, Francesco Giannini, Marco Gori, Giuseppe Marra, Michelangelo Diligenti
Location: Montreal | Day: August 21st | Time: 11:30 | Session: ML: Neurosymbolic AI
A large class of Neural-Symbolic (NeSy) methods employs a machine learner to process the input entities, while relying on a reasoner based on First-Order Logic to represent and process more complex relationships among the entities. A fundamental role for these methods is played by the process of logic grounding, which determines the relevant substitutions for the logic rules using a (sub)set of entities.
Some NeSy methods use an exhaustive derivation of all possible substitutions, preserving the full expressive power of the logic knowledge, but leading to a combinatorial explosion of the number of ground formulas to consider and, therefore, strongly limiting their scalability. Other methods rely on heuristic-based selective derivations, which are generally more computationally efficient, but lack a justification and provide no guarantees of preserving the information provided to and returned by the reasoner.
Taking inspiration from multi-hop symbolic reasoning, this paper proposes a parametrized family of grounding methods generalizing classic Backward Chaining. Different selections within this family make it possible to obtain commonly employed grounding methods as special cases, and to control the trade-off between expressiveness and scalability of the reasoner.
The experimental results show that the selection of the grounding criterion is often as important as the NeSy method itself.
6965: Efficient Algorithms for Electing Successive Committees
Authors: Pallavi Jain, Andrzej Kaczmarczyk
Location: Montreal | Day: August 21st | Time: 15:00 | Session: GTEP: Computational social choice (2/2)
In a recently introduced model of successive committee elections, for a given set of ordinal or approval preferences one aims to find a sequence of a given length of “best” same-size committees such that each candidate is a member of a limited number of consecutive committees. However, the practical usability of this model remains limited, as the described task turns out to be NP-hard for most selection criteria already for seeking committees of size three. Non-trivial or somewhat efficient algorithms for these cases are lacking too. Motivated by a desire to unlock the full potential of the described temporal model of committee elections, we devise (parameterized) algorithms that effectively solve the mentioned hard cases in realistic scenarios of a moderate number of candidates or of a limited time horizon.
6970: Beyond the Map: Learning to Navigate Unseen Urban Dynamics Using Diffusion-Guided Deep Reinforcement Learning
Authors: Monu Nagar, Debasis Das
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Robotics
Vision-based motion planning is a crucial task in Autonomous Driving (AD). Recent advancements in urban AD show that integrating Imitation Learning (IL) with Deep Reinforcement Learning (DRL) makes decision-making more human-like. However, IL methods depend on expert demonstrations to learn the optimal policy. The main drawback of this approach is the assumption that expert demonstrations are always optimal, which is not always true in real-world settings. This creates challenges in adapting to diverse weather conditions and dynamic traffic scenarios, often resulting in higher collision rates and increased risks to pedestrian safety. To address these challenges, we propose a Diffusion-Guided Deep Reinforcement Learning (DGDRL) framework that integrates a diffusion model with a Soft Actor-Critic DRL method to effectively mitigate environmental uncertainties and enable self-learning beyond the training maps for new tasks. This framework follows a novel modified partially observable Markov decision process (mPOMDP) to choose an optimal action from original and diffusion-generated observations, ensuring that the policy behavior remains consistent with the current action. We use the CARLA NoCrash benchmark to train and evaluate the proposed framework. The method is validated in diverse urban environments (e.g., empty, regular, and dense) across multiple towns. Additionally, we compare our model against state-of-the-art techniques to ensure robustness and generalizability to new environments. The project page and code are available at https://autovisionproject.github.io/project/.
6973: Optimal Metric Distortion for Matching on the Line
Authors: Aris Filos-Ratsikas, Vasilis Gkatzelis, Mohamad Latifian, Emma Rewinski, Alexandros A. Voudouris
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Game Theory and Economic Paradigms
We study the distortion of one-sided and two-sided matching problems on the line. In the one-sided case, n agents need to be matched to n items, and each agent’s cost in a matching is their distance from the item they were matched to. We propose an algorithm that is provided only with ordinal information regarding the agents’ preferences (each agent’s ranking of the items from most- to least-preferred) and returns a matching aiming to minimize the social cost with respect to the agents’ true (cardinal) costs. We prove that our algorithm simultaneously achieves the best-possible approximation of 3 (known as distortion) with respect to a variety of social cost measures which include the utilitarian and egalitarian social cost. In the two-sided case, where the agents need to be matched to n other agents and both sides report their ordinal preferences over each other, we show that it is always possible to compute an optimal matching. In fact, we show that this optimal matching can be achieved using even less information, and we provide bounds regarding the sufficient number of queries.
6975: Optimal Capacity Modification for Stable Matchings with Ties
Authors: Keshav Ranjan, Meghana Nasre, Prajakta Nimbhorkar
Location: Montreal | Day: August 21st | Time: 15:00 | Session: GTEP: Computational social choice (2/2)
Show Abstract
We consider the Hospitals/Residents (HR) problem in the presence of ties in preference lists of hospitals. Among the three notions of stability, viz. weak, strong, and super stability, we focus on strong stability. Strong stability is appealing both theoretically and practically; however, its existence is not guaranteed. In this paper, our objective is to optimally augment the quotas of hospitals to ensure that a strongly stable matching exists in the modified instance. Such an augmentation is guaranteed to exist when resident preference lists are strict. We explore two natural optimization criteria: (i) minimizing the total capacity increase across all hospitals (MINSUM) and (ii) minimizing the maximum capacity increase for any hospital (MINMAX). We show that the MINSUM problem admits a polynomial-time algorithm, whereas the MINMAX problem is NP-hard. We prove an analogue of the Rural Hospitals theorem for the MINSUM problem. When each hospital incurs a cost for a unit increase in its quota, the MINSUM problem becomes NP-hard, even for 0/1 costs. In fact, we show that the problem cannot be approximated to any multiplicative factor. We also present a polynomial-time algorithm for optimal MINSUM augmentation when a specified subset of edges is required to be included in the matching.
6978: Beyond Winning Strategies: Admissible and Admissible Winning Strategies for Quantitative Reachability Games
Authors: Karan Muvvala, Qi Heng Ho, Morteza Lahijanian
Location: Montreal | Day: August 21st | Time: 10:00 | Session: MAS: Formal verification, validation and synthesis
Show Abstract
Classical reactive synthesis approaches aim to synthesize a reactive system that always satisfies a given specification. These approaches often reduce to playing a two-player zero-sum game where the goal is to synthesize a winning strategy. However, in many pragmatic domains, such as robotics, a winning strategy does not always exist, yet it is desirable for the system to make an effort to satisfy its requirements instead of "giving up." To this end, this paper investigates the notion of admissible strategies, which formalize "doing-your-best", in quantitative reachability games. We show that, unlike the qualitative case, memoryless strategies are not sufficient to capture all admissible strategies, making synthesis a challenging task. In addition, we prove that admissible strategies always exist but may produce undesirable optimistic behaviors. To mitigate this, we propose admissible winning strategies, which enforce the best possible outcome while being admissible. We show that both strategies always exist but are not memoryless. We provide necessary and sufficient conditions for the existence of both strategies and propose synthesis algorithms. Finally, we illustrate the strategies on gridworld and robot manipulator domains.
6988: Counterfactual Explanations Under Model Multiplicity and Their Use in Computational Argumentation
Authors: Gianvincenzo Alfano, Adam Gould, Francesco Leofante, Antonio Rago, Francesca Toni
Location: Montreal | Day: August 19th | Time: 15:00 | Session: KRR: Argumentation
Show Abstract
Counterfactual explanations (CXs) are widely recognised as an essential technique for providing recourse recommendations for AI models.
However, it is not obvious how to determine CXs in model multiplicity scenarios, where equally performing but different models can be obtained for the same task.
In this paper, we propose novel qualitative and quantitative definitions of CXs based on explicit, nested quantification over (groups) of model decisions.
We also study properties of these notions and identify decision problems of interest therefor.
While our CXs are broadly applicable, in this paper we instantiate them within computational argumentation where model multiplicity naturally emerges, e.g. with incomplete and case-based argumentation frameworks.
We then illustrate the suitability of our CXs for model multiplicity in legal and healthcare contexts, before analysing the complexity of the associated decision problems.
6990: Rule-Guided Reinforcement Learning Policy Evaluation and Improvement
Authors: Martin Tappler, Ignacio D. Lopez-Miguel, Sebastian Tschiatschek, Ezio Bartocci
Location: Montreal | Day: August 21st | Time: 15:00 | Session: ML: Reinforcement Learning (2/2)
Show Abstract
We consider the challenging problem of using domain knowledge to improve deep reinforcement learning policies. To this end, we propose LEGIBLE, a novel approach, following a multi-step process, which starts by mining rules from a deep RL policy, constituting a partially symbolic representation. These rules describe which decisions the RL policy makes and which it avoids making. In the second step, we generalize the mined rules using domain knowledge expressed as metamorphic relations. We adapt these relations from software testing to RL to specify expected changes of actions in response to changes in observations. The third step is evaluating generalized rules to determine which generalizations improve performance when enforced. These improvements show weaknesses in the policy, where it has not learned the general rules and thus can be improved by rule guidance. LEGIBLE supported by metamorphic relations provides a principled way of expressing and enforcing domain knowledge about RL environments. We show the efficacy of our approach by demonstrating that it effectively finds weaknesses, accompanied by explanations of these weaknesses, in eleven RL environments and by showcasing that guiding policy execution with rules improves performance w.r.t. gained reward.
7002: SPoRt – Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL
Authors: Jacques Cloete, Nikolaus Vertovec, Alessandro Abate
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Machine Learning (3/4)
Show Abstract
To apply reinforcement learning to safety-critical applications, we ought to provide safety guarantees during both policy training and deployment. In this work we present novel theoretical results that provide a bound on the probability of violating a safety property for a new task-specific policy in a model-free, episodic setup: the bound, based on a ‘maximum policy ratio’ that is computed with respect to a ‘safe’ base policy, can also be more generally applied to temporally-extended properties (beyond safety) and to robust control problems. We thus present SPoRt, which also provides a data-driven approach for obtaining such a bound for the base policy, based on scenario theory, and which includes Projected PPO, a new projection-based approach for training the task-specific policy while maintaining a user-specified bound on property violation. Hence, SPoRt enables the user to trade off safety guarantees in exchange for task-specific performance. Accordingly, we present experimental results demonstrating this trade-off, as well as a comparison of the theoretical bound to posterior bounds based on empirical violation rates.
7005: Speeding Up Hyper-Heuristics With Markov-Chain Operator Selection and the Only-Worsening Acceptance Operator
Authors: Abderrahim Bendahi, Benjamin Doerr, Adrien Fradin, Johannes F. Lutzeyer
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Search
Show Abstract
The move-acceptance hyper-heuristic was recently shown to be able to leave local optima with astonishing efficiency (Lissovoi et al., Artificial Intelligence (2023)). In this work, we propose two modifications to this algorithm that demonstrate impressive performance on a large class of benchmarks including the classic CLIFF_d and JUMP_m function classes. (i) Instead of randomly choosing between the only-improving and any-move acceptance operator, we take this choice via a simple two-state Markov chain. This modification alone reduces the runtime on JUMP_m functions with gap parameter m from 𝛺(n²ᵐ⁻¹) to O(nᵐ⁺¹). (ii) We then replace the all-moves acceptance operators with the operator that only accepts worsenings. Such a counter-intuitive operator has not been used before in the literature. However, our proofs show that our only-worsening operator can greatly help in leaving local optima, reducing, e.g., the runtime on Jump functions to O(n³ log n) independent of the gap size. In general, we prove a remarkably good runtime of O(nᵏ⁺¹ log n) for our Markov move-acceptance hyper-heuristic on all members of a new benchmark class SEQOPT_k, which contains a large number of functions having k successive local optima, and which contains the commonly studied JUMP_m and CLIFF_d functions for k=2.
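The ingredients named in this abstract can be sketched as follows. The JUMP_m function is given in its standard textbook form. The rest is an assumption made for illustration: strict comparisons in the acceptance operators, one-bit-flip mutation, and a symmetric two-state chain with a single hypothetical parameter `switch_prob`. The paper's exact transition probabilities and operator definitions are not stated in the abstract.

```python
import random

def jump(x, m):
    """Standard JUMP_m: a fitness valley of width m just below the all-ones optimum."""
    n, ones = len(x), sum(x)
    if ones <= n - m or ones == n:
        return m + ones
    return n - ones

def accepts(operator, f_old, f_new):
    """Acceptance operators (strict comparisons are an assumption of this sketch)."""
    if operator == "only_improving":
        return f_new > f_old
    if operator == "only_worsening":
        return f_new < f_old
    if operator == "any_move":
        return True
    raise ValueError(f"unknown operator: {operator}")

def markov_hyper_heuristic(n, m, switch_prob, max_steps, rng,
                           second_operator="only_worsening"):
    """Two-state Markov chain over acceptance operators (hypothetical parameters)."""
    x = [rng.randint(0, 1) for _ in range(n)]
    state = "only_improving"
    for step in range(max_steps):
        if sum(x) == n:
            return x, step
        y = list(x)
        i = rng.randrange(n)          # one-bit-flip mutation (assumed)
        y[i] = 1 - y[i]
        if accepts(state, jump(x, m), jump(y, m)):
            x = y
        if rng.random() < switch_prob:  # leave the current state of the chain
            state = second_operator if state == "only_improving" else "only_improving"
    return x, max_steps

final_x, steps_used = markov_hyper_heuristic(
    n=12, m=3, switch_prob=0.1, max_steps=5000, rng=random.Random(0))
```

Passing `second_operator="any_move"` gives the chain over the two original operators, which corresponds to modification (i); the default corresponds to modification (ii).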
7011: A-I-RAVEN and I-RAVEN-Mesh: Two New Benchmarks for Abstract Visual Reasoning
Authors: Mikołaj Małkiński, Jacek Mańdziuk
Location: Montreal | Day: August 21st | Time: 11:30 | Session: CV: Benchmarks
Show Abstract
We study generalization and knowledge reuse capabilities of deep neural networks in the domain of abstract visual reasoning (AVR), employing Raven’s Progressive Matrices (RPMs), a recognized benchmark task for assessing AVR abilities. Two knowledge transfer scenarios referring to the I-RAVEN dataset are investigated. Firstly, inspired by generalization assessment capabilities of the PGM dataset and popularity of I-RAVEN, we introduce Attributeless-I-RAVEN (A-I-RAVEN), a benchmark with 10 generalization regimes that allow systematic testing of the generalization of abstract rules applied to held-out attributes at various levels of complexity (primary and extended regimes). In contrast to PGM, A-I-RAVEN features compositionality, a variety of figure configurations, and does not require substantial computational resources. Secondly, we construct I-RAVEN-Mesh, a dataset that enriches RPMs with a novel component structure comprising line-based patterns, facilitating assessment of progressive knowledge acquisition in a transfer learning setting. We evaluate 13 strong models from the AVR literature on the introduced datasets, revealing their specific shortcomings in generalization and knowledge transfer.
7020: Advancing Generalization Across a Variety of Abstract Visual Reasoning Tasks
Authors: Mikołaj Małkiński, Jacek Mańdziuk
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Machine Learning (2/4)
Show Abstract
The abstract visual reasoning (AVR) domain presents a diverse suite of analogy-based tasks devoted to studying model generalization. Recent years have brought dynamic progress in the field, particularly in i.i.d. scenarios, in which models are trained and evaluated on the same data distributions. Nevertheless, o.o.d. setups that assess model generalization to new test distributions remain challenging even for the most recent models. To advance generalization in AVR tasks, we present the Pathways of Normalized Group Convolution model (PoNG), a novel neural architecture that features group convolution, normalization, and a parallel design. We consider a wide set of AVR benchmarks, including Raven’s Progressive Matrices and visual analogy problems with both synthetic and real-world images. The experiments demonstrate strong generalization capabilities of the proposed model, which in several settings outperforms the existing literature methods.
7027: Dynamic Seed-GrowthCM: A Dynamic Benefit-Oriented Algorithm for Core Maximization on Large Graphs
Authors: Dongyuan Ma, Dongxiao He, Xin Huang
Location: Guangzhou | Day: TBD
Show Abstract
The k-core has garnered significant attention in recent research as an effective measure of node importance within a graph. A k-core is defined as the maximal induced subgraph where each node has a degree of at least k. This paper addresses the core maximization problem: given a graph G, an integer k, and a budget b, the objective is to insert b new distinct edges into G to maximize the size of its k-core. This problem is theoretically proven to be NP-hard and APX-hard. However, the existing heuristic methods often struggle to achieve a good balance between efficiency and answer quality. In this paper, we propose a novel dynamic approach that, for the first time, uncovers the dynamic changes in node degrees. We introduce a new concept using the contribution of edges across different λ-shell components to the final solution. Based on these findings, we present the Dynamic Seed-GrowthCM method. This method selects the λ-shell component with the largest estimated benefit as the initial seed. In each iteration, depending on complete/partial growth, either a new seed is incorporated into the solution, or an existing seed undergoes growth, becoming a larger seed by adding connected components of the λ-shell component to the solution. Experimental results on ten datasets demonstrate that our algorithm significantly outperforms state-of-the-art methods in terms of solution quality on large graphs, while achieving high computational efficiency.
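As a small illustration of the k-core definition and of why inserting edges can enlarge it, the sketch below computes the k-core with the standard peeling procedure (repeatedly delete nodes of degree below k). It does not implement the Dynamic Seed-GrowthCM method; the toy graph and function name are hypothetical.

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the node set of the k-core by repeatedly peeling nodes of degree < k."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(adj)
    queue = [v for v in alive if len(adj[v]) < k]
    while queue:
        v = queue.pop()
        if v not in alive:
            continue  # already peeled via an earlier queue entry
        alive.remove(v)
        for w in adj[v]:
            if w in alive:
                adj[w].discard(v)        # w loses one surviving neighbour
                if len(adj[w]) < k:
                    queue.append(w)
    return alive

# Triangle a-b-c with a path c-d-e hanging off it.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
core_before = k_core(edges, 2)                 # {"a", "b", "c"}
core_after = k_core(edges + [("e", "a")], 2)   # all five nodes
```

Here a single inserted edge (a budget of b = 1) grows the 2-core from three nodes to five, which is the quantity the core maximization problem seeks to maximize.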
7029: SpectralGap: Graph-Level Out-of-Distribution Detection via Laplacian Eigenvalue Gaps
Authors: Jiawei Gu, Ziyue Qiao, Zechao Li
Location: Guangzhou | Day: TBD
Show Abstract
The task of graph-level out-of-distribution (OOD) detection is crucial for deploying graph neural networks in real-world settings. In this paper, we observe a significant difference in the relationship between the largest and second-largest eigenvalues of the Laplacian matrix for in-distribution (ID) and OOD graph samples: OOD samples often exhibit anomalous spectral gaps (the difference between the largest and second-largest eigenvalues). This observation motivates us to propose SpecGap, an effective post-hoc approach for OOD detection on graphs. SpecGap adjusts features by subtracting the component associated with the second-largest eigenvalue, scaled by the spectral gap, from the high-level features (i.e., X – (λ_n – λ_{n-1}) u_{n-1} v_{n-1}^T). SpecGap achieves state-of-the-art performance across multiple benchmark datasets. We present extensive ablation studies and comprehensive theoretical analyses to support our empirical results. As a parameter-free post-hoc method, SpecGap can be easily integrated into existing graph neural network models without requiring any additional training or model modification.
7030: HGEN: Heterogeneous Graph Ensemble Networks
Authors: Jiajun Shen, Yufei Jin, Kaibu Feng, Yi He, Xingquan Zhu
Location: Montreal | Day: August 19th | Time: 15:00 | Session: ML: Reinforcement learning (1/2)
Show Abstract
This paper presents HGEN, which pioneers ensemble learning for heterogeneous graphs. We argue that the heterogeneity in node types, nodal features, and local neighborhood topology poses significant challenges for ensemble learning, particularly in accommodating diverse graph learners. Our HGEN framework ensembles multiple learners through a meta-path and transformation-based optimization pipeline to uplift classification accuracy. Specifically, HGEN uses meta-paths combined with random dropping to create Allele Graph Neural Networks (GNNs), whereby the base graph learners are trained and aligned for later ensembling. To ensure effective ensemble learning, HGEN presents two key components: 1) a residual-attention mechanism to calibrate allele GNNs of different meta-paths, thereby enforcing node embeddings to focus on more informative graphs to improve base learner accuracy, and 2) a correlation-regularization term to enlarge the disparity among embedding matrices generated from different meta-paths, thereby enriching base learner diversity. We analyze the convergence of HGEN and attest its higher regularization magnitude over simple voting. Experiments on five heterogeneous networks validate that HGEN consistently outperforms its state-of-the-art competitors by a substantial margin. Code is available at https://github.com/Chrisshen12/HGEN.
7033: On Integrating Logical Analysis of Data into Random Forests
Authors: David Ing, Said Jabbour, Lakhdar Saïs
Location: Montreal | Day: August 20th | Time: 10:00 | Session: AI Ethics, Trust, Fairness (1/3)
Show Abstract
Random Forests (RFs) are one of the most popular classifiers in machine learning. RF is an ensemble learning method that combines multiple Decision Trees (DTs), providing a more robust and accurate model than a single DT. However, one of the main steps of RFs is the random selection of many different features during the construction phase of DTs, resulting in a forest with various features, which makes it difficult to extract short and concise explanations. In this paper, we propose integrating Logical Analysis of Data (LAD) into RFs. LAD is a pattern learning framework that combines optimization, Boolean functions, and combinatorial theory. One of its main goals is to generate minimal support sets (MSSes) that discriminate between different groups of data. More precisely, we show how to enhance the classical RF algorithm by randomly choosing MSSes rather than randomly choosing feature subsets that potentially contain irrelevant features for constructing DTs. Experiments on benchmark datasets reveal that integrating LAD into classical RFs using MSSes can maintain similar performance in terms of accuracy, produce forests of similar size, reduce the set of used features, and enable the extraction of significantly shorter explanations compared to classical RFs.
7034: Assessing the Exposure to Public Knowledge in Policy-Protected Description Logic Ontologies
Authors: Gianluca Cima, Domenico Lembo, Lorenzo Marconi, Riccardo Rosati, Domenico Fabio Savo
Location: Montreal | Day: August 20th | Time: 14:00 | Session: KR: Logic
Show Abstract
We propose a general framework for assessing the exposure of sensitive knowledge in policy-protected knowledge bases (KBs), where knowledge is represented as logical theories and data protection policies are defined declaratively using epistemic dependencies. The framework models scenarios in which confidential parts of the KB may be publicly known due to security breaches. We study two fundamental decision problems: determining whether the exposed knowledge violates the data protection policy (leakage), and whether there exists a secure view of the KB that complies with the policy. We analyze the computational complexity (specifically, data complexity) of these problems, focusing on the DL-Lite_R and EL_⊥ Description Logics. Our findings show that, for DL-Lite_R with restricted forms of policy, both problems can be efficiently solved through query rewriting methods. For EL_⊥, we establish conditions for tractable computational bounds. Our results highlight the potential of this framework for practical applications in confidentiality-preserving knowledge management.
7051: Online Resource Sharing: Better Robust Guarantees via Randomized Strategies
Authors: David X. Lin, Daniel Hall, Giannis Fikioris, Siddhartha Banerjee, Éva Tardos
Location: Montreal | Day: August 21st | Time: 15:00 | Session: Agent-based and Multi-agent Systems (3/3)
Show Abstract
We study the problem of fair online resource allocation via non-monetary mechanisms, where multiple agents repeatedly share a resource without monetary transfers. Previous work has shown that every agent can guarantee 1/2 of their ideal utility (the highest achievable utility given their fair share of resources) robustly, i.e., under arbitrary behavior by the other agents. While this 1/2-robustness guarantee has now been established under very different mechanisms, including pseudo-markets and dynamic max-min allocation, improving on it has appeared difficult.

In this work, we obtain the first significant improvement on the robustness of online resource sharing. In more detail, we consider the widely-studied repeated first-price auction with artificial currencies. Our main contribution is to show that a simple randomized bidding strategy can guarantee each agent a 2 – √2 ≈ 0.59 fraction of her ideal utility, irrespective of others’ bids. Specifically, our strategy requires each agent with fair share α to use a uniformly distributed bid whenever her value is in the top α-quantile of her value distribution. Our work almost closes the gap to the known 1 – 1/e ≈ 0.63 hardness for robust resource sharing; we also show that any static (i.e., budget independent) bidding policy cannot guarantee more than a 0.6-fraction of the ideal utility, showing our technique is almost tight.
7059: Featured Argumentation Framework: Semantics and Complexity
Authors: Gianvincenzo Alfano, Sergio Greco, Francesco Parisi, Irina Trubitsyna
Location: Montreal | Day: August 19th | Time: 15:00 | Session: KRR: Argumentation
Show Abstract
Dung’s Argumentation Framework (AF) has been extended in several directions to make knowledge representation and reasoning tasks more intuitive and/or expressive. We present a novel extension of AF called Featured AF (FAF), where each argument is associated with a set of features expressed by means of unary and binary facts.
In such a context, a query is expressed by means of a conjunctive relational calculus formula which is evaluated over the extensions of the FAF.
Then, this framework is further expanded into the so-called Extended FAF (EFAF), where a first-order logic formula (FOL) is used for reasoning over ‘feasible’ subframeworks that satisfy the FOL formula and minimally differ from the original framework. We investigate the computational complexity of verification and acceptance problems under several semantics and show that incomplete AF (iAF) frameworks, including correlated iAF and constrained iAF, are special cases of EFAF.
7066: Efficient and Rigorous Model-Agnostic Explanations
Authors: Joao Marques-Silva, Jairo A. Lefebre-Lobaina, Maria Vanina Martinez
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Constraint Satisfaction and Optimization (2/3)
Show Abstract
Explainable artificial intelligence (XAI) is at the core of trustworthy AI. The best-known methods of XAI are sub-symbolic. Unfortunately, these methods do not give guarantees of rigor. Logic-based XAI addresses the lack of rigor of sub-symbolic methods, but in turn it exhibits some drawbacks. These include scalability, explanation size, but also the need to access the details of the machine learning model. Furthermore, access to the details of an ML model may reveal sensitive information. This paper builds on recent work on symbolic model-agnostic XAI, which is based on explaining samples of behavior of a blackbox ML model, and proposes efficient algorithms for the computation of explanations. The experiments confirm the scalability of the novel algorithms.
7070: Evolutionary Algorithms Are Significantly More Robust to Noise When They Ignore It
Authors: Denis Antipov, Benjamin Doerr
Location: Montreal | Day: August 19th | Time: 11:30 | Session: S: Evolutionary computation (1/2)
Show Abstract
Randomized search heuristics (RSHs) are known to have a certain robustness to noise. Mathematical analyses trying to quantify rigorously how robust RSHs are to a noisy access to the objective function typically assume that each solution is re-evaluated whenever it is compared to others. This aims to prevent a single noisy evaluation from having a lasting negative effect, but is computationally expensive and requires the user to foresee that noise is present (as in a noise-free setting, one would never re-evaluate solutions).
In this work, we conduct the first mathematical runtime analysis of an evolutionary algorithm solving a single-objective noisy problem without re-evaluations. We prove that the (1+1) evolutionary algorithm without re-evaluations can optimize the classic LeadingOnes benchmark with up to constant noise rates, in sharp contrast to the version with re-evaluations, where only noise with rates O(n⁻²log n) can be tolerated.
This result suggests that re-evaluations are much less needed than what was previously thought, and that they actually can be highly detrimental. The insights from our mathematical proofs indicate that similar results are plausible for other classic benchmarks.
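The difference between the two variants discussed above can be sketched in a few lines. The abstract does not specify the noise model, so this sketch assumes one for illustration: with probability `noise_rate`, one uniformly random bit is flipped before the evaluation. The `reevaluate` flag and all other names are hypothetical.

```python
import random

def leading_ones(x):
    """Number of consecutive ones at the start of the bit string."""
    count = 0
    for bit in x:
        if bit != 1:
            break
        count += 1
    return count

def noisy_leading_ones(x, noise_rate, rng):
    """Assumed noise model: with probability noise_rate, flip one random bit first."""
    if rng.random() < noise_rate:
        y = list(x)
        i = rng.randrange(len(y))
        y[i] = 1 - y[i]
        return leading_ones(y)
    return leading_ones(x)

def one_plus_one_ea(n, noise_rate, max_iters, rng, reevaluate=False):
    """(1+1) EA; without re-evaluation the parent's stored noisy fitness is reused."""
    x = [rng.randint(0, 1) for _ in range(n)]
    fx = noisy_leading_ones(x, noise_rate, rng)
    for t in range(max_iters):
        if leading_ones(x) == n:      # true fitness, used only to stop the run
            return x, t
        # Standard bit mutation: flip each bit independently with probability 1/n.
        y = [1 - b if rng.random() < 1.0 / n else b for b in x]
        fy = noisy_leading_ones(y, noise_rate, rng)
        if reevaluate:
            fx = noisy_leading_ones(x, noise_rate, rng)  # fresh noisy parent value
        if fy >= fx:
            x, fx = y, fy
    return x, max_iters

best, iters = one_plus_one_ea(n=10, noise_rate=0.0, max_iters=100_000,
                              rng=random.Random(1))
```

Running both settings of `reevaluate` with a positive `noise_rate` gives a quick empirical feel for the comparison; the sketch makes no claim about reproducing the proven bounds.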
7122: Rewarding Explainability in Drug Repurposing with Knowledge Graphs
Authors: Susana Nunes, Samy Badreddine, Catia Pesquita
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Humans and AI: Interpretable Models
Show Abstract
Knowledge graphs (KGs) are powerful tools for modelling complex, multi-relational data and supporting hypothesis generation, particularly in applications like drug repurposing. However, for predictive methods to gain acceptance as credible scientific tools, they must ensure not only accuracy but also the capacity to offer meaningful scientific explanations.

This paper presents a novel approach, REx, for generating scientific explanations based on link prediction in knowledge graphs. It employs reward and policy mechanisms that consider desirable properties of scientific explanation to guide a reinforcement learning agent in the identification of explanatory paths within a KG. The approach further enriches explanatory paths with domain-specific ontologies, ensuring that the explanations are both insightful and grounded in established biomedical knowledge.

We evaluate our approach in drug repurposing using three popular knowledge graph benchmarks. The results clearly demonstrate its ability to generate explanations that validate predictive insights against biomedical knowledge and that outperform the state-of-the-art approaches in predictive performance, establishing REx as a relevant contribution to advance AI-driven scientific discovery.
7124: A First Runtime Analysis of NSGA-III on a Many-Objective Multimodal Problem: Provable Exponential Speedup via Stochastic Population Update
Authors: Andre Opris
Location: Montreal | Day: August 19th | Time: 11:30 | Session: S: Evolutionary computation (1/2)
Show Abstract
The NSGA-III is a prominent algorithm in evolutionary many-objective optimization. It is well-suited for optimizing functions with more than three objectives, setting it apart from the classic NSGA-II. However, theoretical insights into when and why NSGA-III performs well are still at an early stage. This paper addresses this point and conducts a rigorous runtime analysis of NSGA-III on the many-objective OneJumpZeroJump benchmark (OJZJ for short), providing runtime bounds where the number of objectives is constant. We show that NSGA-III finds the Pareto front of OJZJ in time O(n^(k+d/2) + N n ln(n)), where n is the problem size, d is the number of objectives, and k is the gap size (a problem-specific parameter), if its population size N is in 2^(O(n)) and at least (2n/d+1)^(d/2). Notably, NSGA-III is faster than NSGA-II by a factor of N/n^(d/2) for N = ω(n^(d/2)). We also show that a stochastic population update provably guarantees a speedup of order (k/b)^(k-1) in the runtime where b>0 is a constant. Besides a paper of Wietheger and Doerr (PPSN 2024), this is the first rigorous runtime analysis of NSGA-III on OJZJ. Proving these bounds requires a much deeper understanding of the population dynamics of NSGA-III than previous papers achieved.
7142: Continuous-Time Reward Machines
Authors: Amin Falah, Shibashis Guha, Ashutosh Trivedi
Location: Montreal | Day: August 21st | Time: 15:00 | Session: ML: Reinforcement Learning (2/2)
Show Abstract
Reinforcement Learning (RL) is a sampling-based method for sequential decision-making, in which a learning agent iteratively converges toward an optimal policy by leveraging feedback from the environment in the form of scalar reward signals.
While timing information is often abstracted in discrete-time domains, time-critical learning applications—such as queuing systems, population processes, and manufacturing systems—are naturally modeled as Continuous-Time Markov Decision Processes (CTMDPs).
Since the seminal work of Bradtke and Duff, model-free RL for CTMDPs has become well-understood. However, in many practical applications, practitioners possess high-quality information about system rates derived from traditional queuing theory, which learning agents could potentially exploit to accelerate convergence. Despite this, classical RL algorithms for CTMDPs typically re-learn these parameters through sampling.
In this work, we propose continuous-time reward machines (CTRMs), a novel framework that embeds reward functions and real-time state-action dynamics into a unified structure.
CTRMs enable RL agents to effectively navigate dense-time environments while leveraging reward shaping and counterfactual experiences for accelerated learning.
Our empirical results demonstrate CTRMs’ ability to improve learning efficiency in time-critical environments.
7157: Finding Possible Winners in Spatial Voting with Incomplete Information
Authors: Hadas Shachnai, Rotem Shavitt, Andreas Wiese
Location: Montreal | Day: August 19th | Time: 11:30 | Session: GTEP: Computational social choice (1/2)
Show Abstract
We consider a spatial voting model where both candidates and voters are positioned in the d-dimensional Euclidean space, and each voter ranks candidates based on their proximity to the voter’s ideal point. We focus on the scenario where the given information about the locations of the voters’ ideal points is incomplete; for each dimension, only an interval of possible values is known. In this context, we investigate the computational complexity of determining the possible winners under positional scoring rules. Our results show that the possible winner problem in one dimension is solvable in polynomial time for all k-truncated voting rules with constant k. Moreover, for some scoring rules for which the possible winner problem is NP-complete, such as approval voting for any dimension or k-approval for two or more dimensions, we give an FPT algorithm parameterized by the number of candidates. Finally, we classify tractable and intractable settings of the weighted possible winner problem in one dimension, and resolve the computational complexity of the weighted case for all two-valued positional scoring rules when d=1.
7161: Implicitly Aligning Humans and Autonomous Agents through Shared Task Abstractions
Authors: Stéphane Aroca-Ouellette, Miguel Aroca-Ouellette, Katharina von der Wense, Alessandro Roncone
Location: Montreal | Day: August 21st | Time: 10:00 | Session: Humans and AI
Show Abstract
In collaborative tasks, autonomous agents fall short of humans in their capability to quickly adapt to new and unfamiliar teammates. We posit that a limiting factor for zero-shot coordination is the lack of shared task abstractions, a mechanism humans rely on to implicitly align with teammates. To address this gap, we introduce HA^2: Hierarchical Ad Hoc Agents, a framework leveraging hierarchical reinforcement learning to mimic the structured approach humans use in collaboration. We evaluate HA^2 in the Overcooked environment, demonstrating statistically significant improvement over existing baselines when paired with both unseen agents and humans, providing better resilience to environmental shifts, and outperforming all state-of-the-art methods.
7190: Decomposing Inconsistencies: Marginal Contributions and Pooling Techniques
Authors: Christian Straßer, Badran Raddaoui, Said Jabbour
Location: Montreal | Day: August 19th | Time: 11:30 | Session: Knowledge Representation and Reasoning (1/4)
Show Abstract
Inconsistency measures quantify the degree of conflict within a set of propositions. They can be broadly categorized into global measures, which assess the overall inconsistency of a set, and local measures, which evaluate the contribution of single formulas to the overall inconsistency. This paper investigates the relationship between these two classes of measures through the lens of marginal contributions and pooling mechanisms. We propose a systematic framework for deriving local inconsistency measures from global ones by employing notions of marginal contributions inspired by cooperative game theory, including Shapley and Banzhaf values. Conversely, we explore methods for constructing global inconsistency measures by aggregating local contributions using various pooling techniques. A key research question arises: which combinations of marginal contribution notions (maC) and pooling mechanisms (P) are compatible? Compatibility is defined such that, given a global measure I, applying (P) to the marginal contributions derived from I yields the same result as directly applying I, and vice versa. We analyze this compatibility condition and identify specific pairs of methods, (maC) and (P), that satisfy it across various inconsistency frameworks. Our findings provide a deeper understanding of the interplay between global and local inconsistency measures, providing a foundation for designing principled and interpretable inconsistency evaluation methods in logic-based systems.
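As a rough illustration of the compatibility question in the abstract above, the sketch below derives local measures from a global one via Shapley values and pools them back by summation. It is not the paper's framework: the toy knowledge base {a, not a, a and b}, the choice of the minimal-inconsistent-subset count as the global measure, and the names `consistent`, `mi_measure`, and `shapley` are all assumptions made for this example.

```python
from fractions import Fraction
from itertools import combinations, product
from math import factorial

ATOMS = ("a", "b")

# Hypothetical toy knowledge base: a, not a, a and b.
FORMULAS = (
    lambda v: v["a"],
    lambda v: not v["a"],
    lambda v: v["a"] and v["b"],
)


def consistent(indices):
    """Brute-force check: is there a truth assignment satisfying all chosen formulas?"""
    for values in product((False, True), repeat=len(ATOMS)):
        v = dict(zip(ATOMS, values))
        if all(FORMULAS[i](v) for i in indices):
            return True
    return False


def mi_measure(indices):
    """Assumed global measure: number of minimal inconsistent subsets."""
    indices = tuple(indices)
    count = 0
    for size in range(1, len(indices) + 1):
        for subset in combinations(indices, size):
            if consistent(subset):
                continue
            # Minimal if dropping any single formula restores consistency.
            if all(consistent(subset[:k] + subset[k + 1:]) for k in range(size)):
                count += 1
    return count


def shapley(measure, n):
    """Shapley value of each of n formulas with respect to the global measure."""
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = Fraction(0)
        for size in range(n):
            weight = Fraction(factorial(size) * factorial(n - size - 1), factorial(n))
            for s in combinations(others, size):
                total += weight * (measure(s + (i,)) - measure(s))
        values.append(total)
    return values


local = shapley(mi_measure, len(FORMULAS))
pooled = sum(local)  # sum pooling
print(local, pooled, mi_measure(range(len(FORMULAS))))
```

In this toy case sum pooling recovers the global value because Shapley values satisfy efficiency (they add up to the value of the full set); other pooling choices would not do so in general.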
7201: Efficient Counterexample-Guided Fairness Verification and Repair of Neural Networks Using Satisfiability Modulo Convex Programming
Authors: Arya Fayyazi, Yifeng Xiao, Pierluigi Nuzzo, Massoud Pedram
Location: Montreal | Day: August 21st | Time: 11:30 | Session: ETF: Fairness and diversity
Ensuring fairness is essential for ethical decision-making in various domains. Informally, a neural network is considered fair if and only if it treats similar individuals similarly in a given task. We introduce FaVeR (Fairness Verification and Repair), a framework for efficiently verifying and repairing pre-trained neural networks with respect to individual fairness properties. FaVeR ensures fairness via iterative search of high-sensitivity neurons and backward adjustment of their weights, guided by counterexamples generated from fairness verification using satisfiability modulo convex programming. By addressing fairness at the neuron level, FaVeR minimizes the impact of neural network repair on the overall performance. Experimental evaluations on common fairness datasets show that FaVeR achieves a 100% fairness repair rate across all models, with accuracy reduction of less than 2.27%. Moreover, its significantly lower average runtime makes it suitable for practical applications.
7204: Dynamic Replanning for Improved Public Transport Routing
Authors: Abdallah Abuaisha, Bojie Shen, Daniel D. Harabor, Peter J. Stuckey, Mark Wallace
Location: Guangzhou | Day: TBD
Delays in public transport are common, often impacting users through prolonged travel times and missed transfers. Existing solutions for handling delays remain limited; backup plans based on historical data miss opportunities for earlier arrivals, while snapshot planning accounts for current delays but not future ones. With the growing availability of live delay data, users can adjust their journeys in real-time. However, the literature lacks a framework that fully exploits this advantage for system-scale dynamic replanning. To address this, we formalise the dynamic replanning problem in public transport routing and propose two solutions: a "pull" approach, where users manually request replanning, and a novel "push" approach, where the server proactively monitors and adjusts journeys. Our experiments show that the push approach outperforms the pull approach, achieving significant speedups. The results also reveal substantial arrival time savings enabled by dynamic replanning.
7224: Coalition Obstruction Temporal Logic: A New Obstruction Logic to Reason About Demon Coalitions
Authors: Davide Catta, Jean Leneutre, Vadim Malvone, James Ortiz
Location: Montreal | Day: August 21st | Time: 10:00 | Session: MAS: Formal verification, validation and synthesis
In multi-agent systems, especially in cybersecurity, the dynamic interplay between attackers and defenders is crucial to the security and resilience of the system. Traditional methods often assume static game models and fail to account for the strategic adaptation of the environment to the actions of the players. This paper presents Coalition Obstruction Temporal Logic (COTL), a formal framework for analyzing defender coalitions in dynamic game scenarios. Within this framework, defenders, conceptualized as demons, can actively obstruct attackers by selectively disabling certain actions in response to perceived threats. We establish the formal semantics of COTL and propose a model checking algorithm to verify complex security properties in systems with evolving adversarial dynamics. The utility of the framework is demonstrated through its application to a coalition of defenders that collaboratively defend a system against coordinated attacks.
7230: Sample-Efficient Behavior Cloning Using General Domain Knowledge
Authors: Feiyu Zhu, Jean Oh, Reid Simmons
Location: Montreal | Day: August 20th | Time: 14:00 | Session: Machine Learning (3/4)
Behavior cloning has shown success in many sequential decision-making tasks by learning from expert demonstrations, yet it can be very sample inefficient and fail to generalize to unseen scenarios. One approach to these problems is to introduce general domain knowledge, such that the policy can focus on the essential features and may generalize to unseen states by applying that knowledge. Although this knowledge is easy to acquire from the experts, it is hard to combine with learning from individual examples due to the lack of semantic structure in neural networks and the time-consuming nature of feature engineering. To enable learning from both general knowledge and specific demonstration trajectories, we use a large language model’s coding capability to instantiate a policy structure based on expert domain knowledge expressed in natural language and tune the parameters in the policy with demonstrations. We name this approach the Knowledge Informed Model (KIM) as the structure reflects the semantics of expert knowledge. In our experiments with lunar lander and car racing tasks, our approach learns to solve the tasks with as few as 5 demonstrations and is robust to action noise, outperforming the baseline model without domain knowledge. This indicates that with the help of large language models, we can incorporate domain knowledge into the structure of the policy, increasing sample efficiency for behavior cloning.
7248: RetroMoE: A Mixture-of-Experts Latent Translation Framework for Single-step Retrosynthesis
Authors: Xinjie Li, Abhinav Verma
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Multidisciplinary Topics and Applications (1/2)
Single-step retrosynthesis is a crucial task in organic synthesis, where the objective is to identify the reactants needed to produce a given product. In recent years, a variety of machine learning methods have been developed to tackle retrosynthesis prediction. In our study, we introduce RetroMoE, a novel generative model designed for the single-step retrosynthesis task. We start with a non-symmetric variational autoencoder (VAE) that incorporates a graph encoder to map molecular graphs into a latent space, followed by a transformer decoder for precise prediction of molecular SMILES strings. Additionally, we implement a simple yet effective mixture-of-experts (MoE) network to translate the product latent embedding into the reactant latent embedding. To our knowledge, this is the first approach that frames single-step retrosynthesis as a latent translation problem. Extensive experiments on the USPTO-50K and USPTO-MIT datasets demonstrate the superiority of our method, which not only surpasses most semi-template-based and template-free methods but also delivers competitive results against template-based methods. Notably, under the class-known setting on the USPTO-50K, our method achieves top-1 exact match accuracy comparable to the state-of-the-art template method, RetroKNN.
7257: A Logic-Based Approach to Causal Discovery: Signal Temporal Logic Perspective
Authors: Nasim Baharisangari, Yucheng Ruan, Chengcheng Zhao, Zhe Xu
Location: Montreal | Day: August 20th | Time: 14:00 | Session: ML: time series, sequences and signals
Causal discovery in time-series datasets is critical for understanding complex systems, especially when the effectiveness of causal relationships depends on both the duration and magnitude of the cause. We introduce a novel framework for causal discovery based on Signal Temporal Logic (STL), enabling the extraction of interpretable causal diagrams (STL-CD) that explicitly capture these temporal dynamics. Our method first identifies statistically meaningful time intervals, then infers STL formulas that classify system behaviors, and finally employs transfer entropy to determine direct causal relationships among the formulas. This approach not only uncovers causal structure but also identifies the temporal persistence required for causal influence, an insight missed by existing methods. Experimental results on synthetic and real-world datasets demonstrate that our method achieves superior structural accuracy over state-of-the-art baselines, providing more informative and temporally precise causal models.
7268: Partially Observable Reference Policy Programming
Authors: Edward Kim, Hanna Kurniawati
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Planning and Scheduling (5/5)
This paper proposes Partially Observable Reference Policy Programming, a novel anytime online approximate POMDP solver which samples meaningful future histories very deeply while simultaneously forcing a gradual policy update. We provide theoretical guarantees for the algorithm’s underlying scheme which say that the performance loss is bounded by the average of the sampling approximation errors rather than the usual maximum; a crucial requirement given the sampling sparsity of online planning. Empirical evaluations on two large-scale problems with dynamically evolving environments—including a helicopter emergency scenario in the Corsica region requiring approximately 150 planning steps—corroborate the theoretical results and indicate that our solver considerably outperforms current online benchmarks.
7279: Latte: Transfering LLMs’ Latent-level Knowledge for Few-shot Tabular Learning
Authors: Ruxue Shi, Hengrui Gu, Hangting Ye, Yiwei Dai, Xu Shen, Xin Wang
Location: Guangzhou | Day: TBD
Few-shot tabular learning, in which machine learning models are trained with a limited amount of labeled data, provides a cost-effective approach to addressing real-world challenges. The advent of Large Language Models (LLMs) has sparked interest in leveraging their pre-trained knowledge for few-shot tabular learning. Despite promising results, existing approaches either rely on test-time knowledge extraction, which introduces undesirable latency, or text-level knowledge, which leads to unreliable feature engineering. To overcome these limitations, we propose Latte, a training-time knowledge extraction framework that transfers the latent prior knowledge within LLMs to optimize a more generalized downstream model. Latte enables general knowledge-guided downstream tabular learning, facilitating the weighted fusion of information across different feature values while reducing the risk of overfitting to limited labeled data. Furthermore, Latte is compatible with existing unsupervised pre-training paradigms and effectively utilizes available unlabeled samples to overcome the performance limitations imposed by an extremely small labeled dataset. Extensive experiments on various few-shot tabular learning benchmarks demonstrate the superior performance of Latte, establishing it as a state-of-the-art approach in this domain. Our code is available at https://github.com/ruxueshi/Latte.git.
7307: FairSMOE: Mitigating Multi-Attribute Fairness Problem with Sparse Mixture-of-Experts
Authors: Changdi Yang, Zheng Zhan, Ci Zhang, Yifan Gong, Yize Li, Zichong Meng, Jun Liu, Xuan Shen, Hao Tang, Geng Yuan, Pu Zhao, Xue Lin, Yanzhi Wang
Location: Montreal | Day: August 20th | Time: 10:00 | Session: AI Ethics, Trust, Fairness (1/3)
Real-world datasets usually contain multiple attributes, making it essential to ensure fairness across all of them simultaneously. However, different attributes may vary in difficulty, and no existing approaches have effectively addressed this issue. Consequently, an attribute-adaptive strategy is needed to achieve fairness for all attributes. Multi-task Learning (MTL) leverages shared information to optimize multiple tasks concurrently, while Sparsely-Gated Mixture-of-Experts (SMoE) can dynamically allocate computational resources to the tasks that need them most. In this work, we formulate the multi-attribute fairness issue as an MTL problem and employ SMoE to achieve desirable performance across all attributes simultaneously. We first analyze the feasibility and potential of formalizing the multi-attribute fairness problem as an MTL problem and mitigating it with SMoE. However, vanilla SMoE can lead to an over-utilization problem that causes sub-optimal performance. We then propose an innovative SMoE framework for multi-attribute fair image classification, which further improves multi-attribute fairness by redesigning the MoE layer and routing policy with fairness considerations. Extensive experiments demonstrate its effectiveness. Taking DeiT-Small as the backbone, we achieve 77.25% and 86.01% accuracy on the ISIC2019 and CelebA datasets, respectively, with Multi-attribute Predictive Quality Disparity (PQD) scores of 0.801 and 0.787, beating the current state-of-the-art methods Muffin, InfoFair and MultiFair.
7315: Decoupling and Reconstructing: A Multimodal Sentiment Analysis Framework Towards Robustness
Authors: Mingzheng Yang, Kai Zhang, Yuyang Ye, Yanghai Zhang, Runlong Yu, Min Hou
Location: Guangzhou | Day: TBD
Multimodal sentiment analysis (MSA) has shown promising results but often poses significant challenges in real-world applications due to its dependence on the complete and aligned multimodal sequences. While existing approaches attempt to address missing modalities through feature reconstruction, they often neglect the complex interplay between homogeneous and heterogeneous relationships in multimodal features. To address this problem, we propose Decoupled-Adaptive Reconstruction (DAR), a novel framework that explicitly addresses these limitations through two key components: (1) a mutual information-based decoupling module that decomposes features into common and independent representations, and (2) a reconstruction module that independently processes these decoupled features before fusion for downstream tasks. Extensive experiments on two benchmark datasets demonstrate that DAR significantly outperforms existing methods in both modality reconstruction and sentiment analysis tasks, particularly in scenarios with missing or unaligned modalities. Our results show improvements of 2.21% in bi-classification accuracy and 3.9% in regression error compared to state-of-the-art baselines on the MOSEI dataset.
7317: Learning Optimal Oblique Decision Trees with (Max)SAT
Authors: Florent Avellaneda
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Constraint Satisfaction and Optimization (3/3)
Decision trees are widely used in machine learning for their interpretability and effectiveness in classification tasks. Traditional axis-parallel decision trees partition data using single-feature thresholds at each node, but they often struggle to represent complex, non-axis-aligned decision boundaries efficiently. This limitation can result in unnecessarily large and less interpretable trees. Oblique decision trees address this limitation by using linear combinations of features at each node, allowing a more natural representation of complex decision boundaries while maintaining interpretability through sparse linear combinations. However, learning optimal oblique decision trees poses a significant computational challenge, as existing methods predominantly rely on suboptimal greedy heuristics. In this paper, we propose a novel approach to learning globally optimal oblique decision trees by reformulating the problem as a (Max)SAT instance. By leveraging state-of-the-art (Max)SAT solvers, our method efficiently explores the solution space to identify optimal trees. Experiments on benchmark datasets demonstrate that our approach generates optimal oblique decision trees within reasonable computational time for small to medium-sized datasets.
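The contrast between axis-parallel and oblique splits described above can be made concrete with a small sketch. It only illustrates the representational gap at a single node and does not implement the (Max)SAT encoding; the six data points, the coefficients of the oblique split, and the function names are hypothetical choices for this example.

```python
# Hypothetical 2-D points (x1, x2) with label 1 exactly when x1 > x2.
DATA = [((1, 0), 1), ((2, 1), 1), ((3, 2), 1),
        ((0, 1), 0), ((1, 2), 0), ((2, 3), 0)]


def count_correct(predict):
    """Number of points whose predicted label matches the true label."""
    return sum(1 for x, y in DATA if predict(x) == y)


def best_axis_parallel():
    """Best single-feature threshold split, trying both features and both polarities."""
    best = 0
    for feature in (0, 1):
        values = sorted({x[feature] for x, _ in DATA})
        # Candidate thresholds: midpoints between consecutive observed values.
        thresholds = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
        for t in thresholds:
            for positive_above in (True, False):
                correct = count_correct(
                    lambda x: int((x[feature] > t) == positive_above))
                best = max(best, correct)
    return best


def oblique_split(x):
    """Oblique test on a sparse linear combination: 1*x1 - 1*x2 > 0."""
    return int(1 * x[0] - 1 * x[1] > 0)


axis_correct = best_axis_parallel()
oblique_correct = count_correct(oblique_split)
print(axis_correct, oblique_correct)
```

On this toy data the best single-feature threshold classifies 4 of the 6 points correctly, while the one oblique split classifies all 6, so an axis-parallel tree would need additional nodes to match it.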
7323: Stochasticity-aware No-Reference Point Cloud Quality Assessment
Authors: Songlin Fan, Wei Gao, Zhineng Chen, Ge Li, Guoqing Liu, Qicheng Wang
Location: Guangzhou | Day: TBD
The evolution of point cloud processing algorithms necessitates an accurate assessment of their quality. Previous works consistently regard point cloud quality assessment (PCQA) as a MOS regression problem and devise a deterministic mapping, ignoring the stochasticity in generating MOS from subjective tests. This work presents the first probabilistic architecture for no-reference PCQA, motivated by the labeling process of existing datasets. The proposed method can model the quality judging stochasticity of subjects through a tailored conditional variational autoencoder (CVAE) and produces multiple intermediate quality ratings. These intermediate ratings simulate the judgments from different subjects and are then integrated into an accurate quality prediction, mimicking the generation process of a ground truth MOS. Specifically, our method incorporates a Prior Module, a Posterior Module, and a Quality Rating Generator, where the former two modules are introduced to model the judging stochasticity in subjective tests, while the latter is developed to generate diverse quality ratings. Extensive experiments indicate that our approach outperforms previous cutting-edge methods by a large margin and exhibits gratifying cross-dataset robustness. Code is available at https://git.openi.org.cn/OpenPointCloud/nrpcqa.
7330: Solving QNP and FOND+ with Generating, Testing and Forbidding
Authors: Zheyuan Shi, Hao Dong, Yongmei Liu
Location: Guangzhou | Day: TBD
Qualitative Numerical Planning (QNP) extends classical planning with numerical variables that can be changed by arbitrary amounts. FOND+ extends Fully Observable Non-Deterministic (FOND) planning by introducing explicit fairness assumptions, resulting in a more expressive model that can also capture QNP as a special case. However, existing QNP and FOND+ solvers still face significant scalability challenges. To address this, we propose a novel framework for solving QNP and FOND+ by generating strong cyclic solutions of the associated FOND problem, testing their validity, and forbidding non-solutions in conducting further searches. For this, we propose a procedure called SIEVE*, which generalizes the QNP termination testing algorithm SIEVE to determine whether a strong cyclic solution is a FOND+ solution. Additionally, we propose several optimization techniques to further improve the performance of our basic framework. We implemented our approach based on the advanced FOND solver PRP; experimental results show that our solver achieves superior scalability over the existing QNP and FOND+ solvers.
7350: Enhancing Table Recognition with Vision LLMs: A Benchmark and Neighbor-Guided Toolchain Reasoner
Authors: Yitong Zhou, Mingyue Cheng, Qingyang Mao, Jiahao Wang, Feiyang Xu, Xin Li
Location: Guangzhou | Day: TBD
Pre-trained foundation models have recently made significant progress in table-related tasks such as table understanding and reasoning. However, recognizing the structure and content of unstructured tables using Vision Large Language Models (VLLMs) remains under-explored. To bridge this gap, we propose a benchmark based on a hierarchical design philosophy to evaluate the recognition capabilities of VLLMs in training-free scenarios. Through in-depth evaluations, we find that low-quality image input is a significant bottleneck in the recognition process. Drawing inspiration from this, we propose the Neighbor-Guided Toolchain Reasoner (NGTR) framework, which is characterized by integrating diverse lightweight tools for visual operations aimed at mitigating issues with low-quality images. Specifically, we transfer a tool selection experience from a similar neighbor to the input and design a reflection module to supervise the tool invocation process. Extensive experiments on public datasets demonstrate that our approach significantly enhances the recognition capabilities of the vanilla VLLMs. We believe that the benchmark and framework could provide an alternative solution to table recognition.
7378: MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
Authors: Wei Hua, Chenlin Zhou, Jibin Wu, Yansong Chua, Yangyang Shu
Location: Guangzhou | Day: TBD
The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention due to the great potential for energy-efficient and high-performance computing paradigms. However, a substantial performance gap still exists between SNN-based and ANN-based transformer architectures. While existing methods propose spiking self-attention mechanisms that are successfully combined with SNNs, the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting features from different image scales. In this paper, we address this issue and propose MSVIT, a novel spike-driven Transformer architecture, which is the first to use multi-scale spiking attention (MSSA) to enrich the capability of spiking attention blocks. We validate our approach across various main datasets. The experimental results indicate that our MSVIT outperforms existing SNN-based models, positioning itself as a state-of-the-art solution among SNN-transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.
7387: Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning
Authors: Hao Tang, Shengfeng He, Jing Qin
Location: Guangzhou | Day: TBD
Few-shot learning (FSL) addresses the challenge of classifying novel classes with limited training samples. While some methods leverage semantic knowledge from smaller-scale models to mitigate data scarcity, these approaches often introduce noise and bias due to the data’s inherent simplicity. In this paper, we propose a novel framework, Synergistic Knowledge Transfer (SynTrans), which effectively transfers diverse and complementary knowledge from large multimodal models to empower the off-the-shelf few-shot learner. Specifically, SynTrans employs CLIP as a robust teacher and uses a few-shot vision encoder as a weak student, distilling semantic-aligned visual knowledge via an unsupervised proxy task. Subsequently, a training-free synergistic knowledge mining module facilitates collaboration among large multimodal models to extract high-quality semantic knowledge. Building upon this, a visual-semantic bridging module enables bi-directional knowledge transfer between visual and semantic spaces, transforming explicit visual and implicit semantic knowledge into category-specific classifier weights. Finally, SynTrans introduces a visual weight generator and a semantic weight reconstructor to adaptively construct optimal multimodal FSL classifiers. Experimental results on four FSL datasets demonstrate that SynTrans, even when paired with a simple few-shot vision encoder, significantly outperforms current state-of-the-art methods.
7398: RTdetector: Deep Transformer Networks for Time Series Anomaly Detection Based on Reconstruction Trend
Authors: Xinhong Liu, Xiaoliang Li, Yangfan Li, Fengxiao Tang, Ming Zhao
Location: Guangzhou | Day: TBD
Anomaly detection in multivariate time series data is critical across a variety of real-life applications. The predominant anomaly detection techniques currently rely on reconstruction-based methods. However, these methods often overfit the abnormal pattern and fail to diagnose the anomaly. Although some studies have attempted to prevent the incorrect fitting of anomalous data by enabling models to learn the trend of data variations, they fail to account for the dynamic nature of data distribution. This oversight can lead to the erroneous reconstruction of anomalies that do not exist. To address these challenges, we propose RTdetector, a Transformer-based time series anomaly detection model leveraging reconstruction trends. RTdetector employs a novel global attention mechanism based on reconstruction trends to learn distinguishable attention from the original sequence, thereby preserving the global trend information intrinsic to the time series. Additionally, it incorporates a self-conditioning transformer based on reconstruction trend enhancement to achieve superior predictive performance. Extensive experiments on four datasets demonstrate that RTdetector achieves state-of-the-art results in multivariate time series anomaly detection. Our code is available at https://github.com/CSUFUNLAB/RTdetector.
7408: Neuromorphic Sequential Arena: A Benchmark for Neuromorphic Temporal Processing
Authors: Xinyi Chen, Chenxiang Ma, Yujie Wu, Kay Chen Tan, Jibin Wu
Location: Guangzhou | Day: TBD
Temporal processing is vital for extracting meaningful information from time-varying signals. Recent advancements in Spiking Neural Networks (SNNs) have shown immense promise in efficiently processing these signals. However, progress in this field has been impeded by the lack of effective and standardized benchmarks, which complicates the consistent measurement of technological advancements and limits the practical applicability of SNNs. To bridge this gap, we introduce the Neuromorphic Sequential Arena (NSA), a comprehensive benchmark that offers an effective, versatile, and application-oriented evaluation framework for neuromorphic temporal processing. The NSA includes seven real-world temporal processing tasks from a diverse range of application scenarios, each capturing rich temporal dynamics across multiple timescales. Utilizing NSA, we conduct extensive comparisons of recently introduced spiking neuron models and neural architectures, presenting comprehensive baselines in terms of task performance, training speed, memory usage, and energy efficiency. Our findings emphasize an urgent need for efficient SNN designs that can consistently deliver high performance across tasks with varying temporal complexities while maintaining low computational costs. NSA enables systematic tracking of advancements in neuromorphic algorithm research and paves the way for developing effective and efficient neuromorphic temporal processing systems.
7431: DASS: A Dual-Branch Attention-based Framework for Trajectory Similarity Learning with Spatial and Semantic Fusion
Authors: Jiayi Li, Junhua Fang, Pingfu Chao, Jiajie Xu, Pengpeng Zhao
Location: Guangzhou | Day: TBD
Trajectory similarity aims to identify pairs of similar trajectories, serving as a crucial operation in spatial-temporal data mining. Although several approaches have been proposed, they encounter the following two issues: 1) An overemphasis on spatial similarity in road networks, while the rich semantic information embedded in trajectories is not fully exploited; 2) Dependence on Recurrent Neural Network (RNN) architectures, which struggle to capture long-term dependencies. To address these limitations, we propose a Dual-branch Attention-based framework with Spatial and Semantic information (DASS) based on self-supervised learning. Specifically, DASS comprises two core components: 1) A trajectory representation module that models spatial-temporal adjacency relationships in the form of a graph and converts semantics into numerical embeddings. 2) A backbone encoder with a co-attention module to independently process the two features before they are integrated. Extensive experiments on real-world datasets demonstrate that DASS outperforms state-of-the-art methods, establishing itself as a novel paradigm.
7432: RepObE: Representation Learning-Enhanced Obfuscation Encryption Modular Semantic Task Framework
Authors: Limei Lin, Jinpeng Xu, Xiaoding Wang, Liang Chen, Sun-Yuan Hsieh, Jie Wu
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Machine Learning (1/4)
Model inversion and adversarial attacks in semantic communication pose risks, such as content leaks, alterations, and prediction inaccuracies, which threaten security and reliability. This paper introduces, from an attacker’s viewpoint, a novel framework called RepObE (Representation Learning-Enhanced Obfuscation Encryption Modular Semantic Task Framework) to secure semantic communication. This framework employs dynamic encryption during semantic extraction and feature transmission to hinder attackers from reconstructing data through eavesdropping, thus strengthening system privacy. To combat image communication task challenges, we propose a prototype adversarial collaborative alignment training approach enhanced by representation learning. This method extracts and encodes semantic features while using dynamic perturbation and robust optimization to improve system resilience against adversarial threats. The approach ensures reliable semantic communication in complex environments, maintaining performance while countering attacks using feature obfuscation, adversarial training, and representation learning. Experimental results demonstrate that our method surpasses existing techniques by more than 2% in resisting model inversion attacks on classification tasks. Visually, our method excels with minimal decipherable images for attackers. It also shows a 3% to 5% improvement in countering adversarial attacks on classification tasks.
7459: Attribute Association Driven Multi-Task Learning for Session-based Recommendation
Authors: Xinyao Wang, Zhizhi Yu, Dongxiao He, Liang Yang, Jianguo Wei, Di Jin
Location: Guangzhou | Day: TBD
Session-based Recommendation (SBR) aims to predict users’ next interaction based on their current session without relying on long-term profiles. Despite its effectiveness in privacy-preserving and real-time scenarios, SBR remains challenging due to limited behavioral signals. Prior methods often overfit co-occurrence patterns, neglecting semantic priors like item attributes. Recent studies have attempted to incorporate item attributes (e.g., category) by assigning fixed embeddings shared across all sessions. However, such approaches suffer from two key limitations: 1) Static attribute encoding fails to reflect semantic shifts under different session contexts. 2) Semantic misalignment between attribute and item ID embeddings. To address these issues, we propose attribute association driven multi-task learning for SBR, dubbed A²D-MTL. It explicitly models item categories using cross-session context to capture user potential interests and designs an adaptive sparse attention mechanism to suppress noise. Experimental results on three public datasets demonstrate the superiority of our method in recommendation accuracy (P@20) and ranking quality (MRR@20), validating the model’s effectiveness.
7487: CSAHFL: Clustered Semi-Asynchronous Hierarchical Federated Learning for Dual-layer Non-IID in Heterogeneous Edge Computing Networks
Authors: Aijing Li, Junping Du, Dandan Liu, Yingxia Shao, Tong Zhao, Guanhua Ye
Location: Guangzhou | Day: TBD
Federated Learning (FL) enables collaborative model training across distributed devices without sharing raw data. Hierarchical Federated Learning (HFL) is a new paradigm of FL that leverages the Edge Servers (ESs) layer as an intermediary to perform partial local model aggregation in proximity, reducing core network transmission overhead. However, HFL faces new challenges: (1) The two-stage aggregation process between client-edge and edge-cloud results in a dual-layer non-IID issue, which may significantly compromise model training accuracy. (2) The heterogeneity and mobility of clients further impact model training efficiency. To address these challenges, we propose a novel Clustered Semi-Asynchronous Hierarchical Federated Learning (CSAHFL) framework that integrates adaptive semi-asynchronous intra-cluster aggregation at the client-edge layer and dynamic distribution-aware inter-cluster aggregation at the edge-cloud layer, collaboratively enhancing model performance and scalability in heterogeneous and mobile environments. We conduct experiments under varying degrees of dual-layer non-IID in both static and high-mobility scenarios. The results demonstrate significant advantages of CSAHFL over representative state-of-the-art methods.
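The two-stage aggregation that HFL relies on can be sketched as follows. This is a plain synchronous, sample-weighted average at both layers, used here as an assumed baseline; it does not include CSAHFL's clustering, semi-asynchronous updates, or distribution-aware weighting. The client models, sample counts, and the helper `weighted_average` are hypothetical.

```python
def weighted_average(models):
    """Average (weights, num_samples) pairs, weighting each by its sample count."""
    total = sum(n for _, n in models)
    dim = len(models[0][0])
    avg = [sum(w[k] * n for w, n in models) / total for k in range(dim)]
    return avg, total


# Hypothetical clients grouped by edge server: (model weights, local sample count).
edges = {
    "edge_a": [([1.0, 2.0], 10), ([3.0, 4.0], 30)],
    "edge_b": [([5.0, 6.0], 60)],
}

# Stage 1 (client-edge): each edge server aggregates its own clients.
edge_models = [weighted_average(clients) for clients in edges.values()]

# Stage 2 (edge-cloud): the cloud aggregates the edge models.
global_model, total_samples = weighted_average(edge_models)

# Reference: a single flat aggregation over all clients.
flat_model, _ = weighted_average(
    [client for clients in edges.values() for client in clients])
print(global_model, flat_model, total_samples)
```

With sample-count weights and fully synchronous rounds the two-stage result matches the flat average; the dual-layer non-IID issue described above arises once clients and edges train for multiple local steps on differently distributed data between aggregations.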
7495: Fine-Grained and Efficient Self-Unlearning with Layered Iteration
Authors: Hongyi Lyu, Xuyun Zhang, Hongsheng Hu, Shuo Wang, Chaoxiang He, Lianyong Qi
Location: Montreal | Day: August 21st | Time: 10:00 | Session: MTA: Security and privacy
As machine learning models become widely deployed in data-driven applications, ensuring compliance with the ‘right to be forgotten’ as required by many privacy regulations is vital for safeguarding user privacy. To forget the given data, existing re-labeling based unlearning methods employ a single-step adjustment scheme that revises the decision boundaries in one re-labeling phase. However, such single-step approaches lead to coarse-grained changes in decision boundaries among the remaining classes and impose adverse effects on the model utility. To address these limitations, we propose ‘Self-Unlearning with Layered Iteration (SULI),’ a novel unlearning approach that introduces a layered iteration strategy to re-label the forgetting data iteratively and refine the decision boundaries progressively. We further develop a ‘Selective Probability Adjustment (SPA)’ technique, which uses a soft-label mechanism to promote smoother decision-boundary transitions. Comprehensive experiments on three benchmark datasets demonstrate that SULI achieves superior performance in effectiveness, efficiency, and privacy compared to the state-of-the-art baselines in both class-wise and instance-wise unlearning scenarios. The source code is released at https://github.com/Hongyi-Lyu-MQ/SULI.
7502: From General Relation Patterns to Task-Specific Decision-Making in Continual Multi-Agent Coordination
Authors: Chang Yao, Youfang Lin, Shoucheng Song, Hao Wu, Yuqing Ma, Sheng Han, Kai Lv
Location: Guangzhou | Day: TBD
Continual Multi-Agent Reinforcement Learning (Co-MARL) requires agents to address catastrophic forgetting while learning new coordination policies with a dynamic team. In this paper, we delve into the core of Co-MARL, namely Relation Patterns, which refer to agents’ general understanding of interactions. In addition to generality, relation patterns exhibit task-specificity when mapped to different action spaces. To this end, we propose a novel method called General Relation Patterns-Guided Task-specific Decision-Maker (RPG). In RPG, agents extract relation patterns from dynamic observation spaces using a relation capturer. These task-agnostic relation patterns are then mapped to different action spaces via a task-specific decision-maker generated by a conditional hypernetwork. To combat forgetting, we further introduce regularization terms on both the relation capturer and the conditional hypernetwork. Results on SMAC and LBF demonstrate that RPG effectively prevents catastrophic forgetting when learning new tasks and achieves zero-shot generalization to unseen tasks.
7503: Run Like a Neural Network, Explain Like k-Nearest Neighbor
Authors: Xiaomeng Ye, David Leake, Yu Wang, David Crandall
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Machine Learning (1/4)
Deep neural networks have achieved remarkable performance across a variety of applications. However, their decision-making processes are opaque. In contrast, k-nearest neighbor (k-NN) provides interpretable predictions by relying on similar cases, but it lacks important capabilities of neural networks.
The neural network k-nearest neighbor (NN-kNN) model is designed to bridge this gap, combining the benefits of neural networks with the instance-based interpretability of k-NN. However, the initial formulation of NN-kNN had limitations including scalability issues, reliance on surface-level features, and an excessive number of parameters. This paper improves NN-kNN by enhancing its scalability, parameter efficiency, ease of integration with feature extractors, and training simplicity.
An evaluation of the revised architecture for image and language classification tasks illustrates its promise as a flexible and interpretable method.
7513: M^2LLM: Multi-view Molecular Representation Learning with Large Language Models
Authors: Jiaxin Ju, Yizhen Zheng, Huan Yee Koh, Can Wang, Shirui Pan
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: LLM applications
Accurate molecular property prediction is a critical challenge with wide-ranging applications in chemistry, materials science, and drug discovery. Molecular representation methods, including fingerprints and graph neural networks (GNNs), achieve state-of-the-art results by effectively deriving features from molecular structures. However, these methods often overlook decades of accumulated semantic and contextual knowledge. Recent advancements in large language models (LLMs) demonstrate remarkable reasoning abilities and prior knowledge across scientific domains, leading us to hypothesize that LLMs can generate rich molecular representations when guided to reason from multiple perspectives. To address these gaps, we propose M^2LLM, a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view. These views are fused dynamically to adapt to task requirements, and experiments demonstrate that M^2LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks. Moreover, we demonstrate that representations derived from LLMs achieve exceptional performance by leveraging two core functionalities: the generation of molecular embeddings through their encoding capabilities and the curation of molecular features through advanced reasoning processes.
7517: Category-aware EEG Image Generation Based on Wavelet Transform and Contrast Semantic Loss
Authors: Enshang Zhang, Zhicheng Zhang, Takashi Hanakawa
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Multidisciplinary Topics and Applications (2/2)
Reconstructing visual stimuli from EEG signals is a crucial step in realizing brain-computer interfaces. In this paper, we propose a transformer-based EEG signal encoder integrating the Discrete Wavelet Transform (DWT) and the gating mechanism. Guided by the feature alignment and category-aware fusion losses, this encoder is used to extract features related to visual stimuli from EEG signals. Subsequently, with the aid of a pre-trained diffusion model, these features are reconstructed into visual stimuli. To verify the effectiveness of the model, we conducted EEG-to-image generation and classification tasks using the THINGS-EEG dataset. To address the limitations of quantitative analysis at the semantic level, we combined WordNet-based classification and semantic similarity metrics to propose a novel semantic-based score, emphasizing the ability of our model to transfer neural activities into visual representations. Experimental results show that our model significantly improves semantic alignment and classification accuracy, which achieves a maximum single-subject accuracy of 43%, outperforming other state-of-the-art methods. The source code is available at https://github.com/zes0v0inn/DWT_EEG_Reconstruction/.
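Illustrative note on the Discrete Wavelet Transform mentioned above: the sketch below performs one level of a Haar DWT on a 1-D signal, splitting it into low-frequency (approximation) and high-frequency (detail) coefficients. The abstract does not say which wavelet family or decomposition depth the encoder uses, so the Haar choice and the function name are assumptions made only for illustration.

```python
import math


def haar_dwt(signal):
    """One-level orthonormal Haar DWT of an even-length sequence.

    Returns (approximation, detail) coefficient lists, each half the input length.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    # Pairwise scaled sums capture the slowly varying part of the signal.
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    # Pairwise scaled differences capture the rapid changes.
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail
```

Applying the function again to the approximation coefficients yields a multi-level decomposition.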
7518: Escaping Saddle Point Efficiently in Minimax and Bilevel Optimizations
Authors: Wenhan Xian, Feihu Huang, Heng Huang
Location: Montreal | Day: August 20th | Time: 14:00 | Session: ML: Machine Learning 6/8
Hierarchical optimization is attracting significant attention as it can be applied to a broad range of machine learning tasks. Recently, many algorithms have been proposed to improve the theoretical results of minimax and bilevel optimization. Among these works, a core issue that has not been well studied is how to escape saddle points and find local minima. In this paper, we therefore investigate methods to achieve second-order optimality for nonconvex minimax and bilevel optimization. Specifically, we propose a new algorithm named PRGDA that does not require computing the second-order derivative of the primal function. In nonconvex-strongly-concave minimax optimization, we prove that our algorithm can find a second-order stationary point with a gradient complexity that matches the state-of-the-art result for finding a first-order stationary point. To the best of our knowledge, PRGDA is the first stochastic algorithm that is guaranteed to obtain a second-order stationary point for nonconvex minimax problems. In nonconvex-strongly-convex bilevel optimization, our method also achieves better gradient complexity for finding a local minimum. Finally, we conduct two numerical experiments to validate the performance of our new method.
7524: CLLMRec: Contrastive Learning with LLMs-based View Augmentation for Sequential Recommendation
Authors: Fan Lu, Xiaolong Xu, Haolong Xiang, Lianyong Qi, Xiaokang Zhou, Fei Dai, Wanchun Dou
Location: Guangzhou | Day: TBD
Sequential recommendation generates embedding representations from historical user-item interactions to recommend the next potential interaction item. Due to the complexity and variability of historical user-item interactions, extracting effective user features is quite challenging. Recent studies have employed sequential networks such as time series networks and Transformers to capture the intricate dependencies and temporal patterns in historical user-item interactions, extracting more effective user features. However, limited by the scarcity and suboptimal quality of data, these methods struggle to capture subtle differences in user sequences, which results in diminished recommendation accuracy. To address the above issue, we propose a contrastive learning framework with LLMs-based view augmentation (CLLMRec), which effectively mines differences in behavioral sequences through sample generation. Specifically, CLLMRec utilizes LLMs (Large Language Models) to augment views and expand user behavior sequence representations, providing high-quality positive and negative samples. Subsequently, CLLMRec employs the augmented views for effective contrastive learning, capturing subtle differences in behavioral sequences to suppress interference from irrelevant noise. Experimental results on three public datasets demonstrate that the proposed method outperforms state-of-the-art baseline models, and significantly enhances recommendation performance.
7578: MEGAD: A Memory-Efficient Framework for Large-Scale Attributed Graph Anomaly Detection
Authors: Yifan Zhang, Haolong Xiang, Xiaolong Xu, Zishun Rui, Xiaoyong Li, Lianyong Qi, Fei Dai
Location: Guangzhou | Day: TBD
Graph anomaly detection (GAD), with its ability to accurately identify anomalous patterns in graph data, plays a vital role in areas such as network security, social media platforms, and fraud detection. Graph autoencoder-based methods are widely used for GAD due to their efficiency and effectiveness in capturing complex patterns and learning meaningful representations. However, the above methods are constrained by hardware memory, hindering the detection for large-scale graph data. In this paper, we propose a Memory-Efficient framework for large-scale attributed Graph Anomaly Detection (MEGAD). Specifically, MEGAD first generates node embeddings and then refines them through a lightweight joint optimization model, ensuring minimal memory overhead. The optimized embeddings are subsequently fed into a detector to compute anomaly scores. Extensive experiments demonstrate that our framework achieves comparable accuracy to state-of-the-art methods across multiple datasets while significantly reducing memory consumption on large-scale graphs.
7585: Attractor-based Closed List Search: Sparsifying the Closed List for Efficient Memory-Constrained Planning
Authors: Alvin Zou, Muhammad Suhail Saleem, Maxim Likhachev
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Search
Best-first search algorithms such as A* and Weighted A* are widely used tools. However, their high memory requirements often make them impractical for memory-constrained applications, such as on-board planning for interplanetary rovers, drones, and embedded systems. One popular strategy among memory-efficient approaches developed to address this challenge is to eliminate or sparsify the Closed list, a structure that tracks states explored by the search. However, such methods often incur substantial overhead in runtime, requiring recursive searches for solution reconstruction. In this work, we propose Attractor-based Closed List Search (ACLS), a novel framework that sparsely represents the Closed list using a small subset of states, termed attractors. ACLS intelligently identifies attractor states in a way that enables efficient solution reconstruction while preserving theoretical guarantees on the quality of the solution. Furthermore, we also introduce a lazy variant, Lazy-ACLS, which defers the computation of attractor states until necessary, substantially improving planning speed. We demonstrate the efficacy of ACLS used in conjunction with A*, Weighted A*, and Dijkstra’s searches across multiple domains including 2D and 3D navigation, Sliding Tiles, and Towers of Hanoi. Our experimental results demonstrate that ACLS significantly reduces memory usage, maintaining only 9% of the states typically stored in a Closed list, while achieving comparable planning times and outperforming state-of-the-art approaches. Source code can be found at github.com/alvin-ruihua-zou/ACLS.
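Illustrative note on the Closed list discussed above: the sketch below is a textbook A* search, not ACLS. It shows the structures whose memory footprint grows with every expanded state: the `closed` set and the `parent` map used to reconstruct the solution. The function signature and the returned expansion count are assumptions chosen for the example, and the heuristic is assumed to be consistent.

```python
import heapq


def astar(start, goal, neighbors, heuristic):
    """Plain A*. Returns (path, number_of_closed_states), or (None, count) if unreachable.

    neighbors(state) yields (next_state, step_cost) pairs.
    """
    open_heap = [(heuristic(start), 0, start)]  # entries are (f, g, state)
    best_g = {start: 0}
    parent = {start: None}  # back-pointers kept for solution reconstruction
    closed = set()          # every expanded state is stored here
    while open_heap:
        _, cost, state = heapq.heappop(open_heap)
        if state in closed:
            continue  # stale heap entry for an already expanded state
        closed.add(state)
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1], len(closed)
        for nxt, step in neighbors(state):
            new_cost = cost + step
            if nxt not in closed and new_cost < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_cost
                parent[nxt] = state
                heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None, len(closed)
```

Memory-efficient variants aim to keep only a fraction of `closed` and `parent` while still being able to recover the path.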
7588: BinMetric: A Comprehensive Binary Code Analysis Benchmark for Large Language Models
Authors: Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, Nenghai Yu
Location: Guangzhou | Day: TBD
Binary analysis is crucial for software security, offering insights into compiled programs without source code. As large language models (LLMs) excel in language tasks, their potential for decoding complex binary data structures is growing. However, the lack of standardized benchmarks hinders their evaluation and progress in this domain.
To bridge this gap, we introduce BinMetric, the first comprehensive benchmark designed specifically to evaluate LLMs' performance on binary analysis tasks. BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks, including decompilation, code summarization, etc., which reflect actual reverse engineering scenarios. Our empirical study on this benchmark investigates various state-of-the-art LLMs, revealing their strengths and limitations. The findings indicate that while LLMs show strong potential, challenges still exist, particularly in the areas of precise binary lifting and assembly synthesis. In summary, BinMetric makes a significant step forward in measuring the binary analysis capabilities of LLMs, establishing a new benchmark leaderboard, and our study offers valuable insights for advancing LLMs in software security.
7592: GPI-Net: Gestalt-Guided Parallel Interaction Network via Orthogonal Geometric Consistency for Robust Point Cloud Registration
Authors: Weikang Gu, Mingyue Han, Li Xue, Heng Dong, Changcai Yang, Riqing Chen, Lifang Wei
Location: Guangzhou | Day: TBD
The accurate identification of high-quality correspondences is a prerequisite task in feature-based point cloud registration. However, it is extremely challenging to handle the fusion of local and global features due to feature redundancy and complex spatial relationships. Given that Gestalt principles provide key advantages in analyzing local and global relationships, we propose a novel Gestalt-guided Parallel Interaction Network via orthogonal geometric consistency (GPI-Net) in this paper. It utilizes Gestalt principles to facilitate complementary communication between local and global information. Specifically, we introduce an orthogonal integration strategy to optimally reduce redundant information and generate a more compact global structure for high-quality correspondences. To capture geometric features in correspondences, we leverage a Gestalt Feature Attention (GFA) block through a hybrid utilization of self-attention and cross-attention mechanisms. Furthermore, to facilitate the integration of local detail information into the global structure, we design an innovative Dual-path Multi-Granularity parallel interaction aggregation (DMG) block to promote information exchange across different granularities. Extensive experiments on various challenging tasks demonstrate the superior performance of our proposed GPI-Net in comparison to existing methods. The code will be released at https://github.com/XXX/GPI-Net.
7601: Incentivizing Safer Actions in Policy Optimization for Constrained Reinforcement Learning
Authors: Somnath Hazra, Pallab Dasgupta, Soumyajit Dey
Location: Montreal | Day: August 21st | Time: 15:00 | Session: ML: Reinforcement Learning (2/2)
Constrained Reinforcement Learning (RL) aims to maximize the return while adhering to predefined constraint limits, which represent domain-specific safety requirements. In continuous control settings, where learning agents govern system actions, balancing the trade-off between reward maximization and constraint satisfaction remains a significant challenge. Policy optimization methods often exhibit instability near constraint boundaries, resulting in suboptimal training performance. To address this issue, we introduce a novel approach that integrates an adaptive incentive mechanism in addition to the reward structure to stay within the constraint bound before approaching the constraint boundary. Building on this insight, we propose Incrementally Penalized Proximal Policy Optimization (IP3O), a practical algorithm that enforces a progressively increasing penalty to stabilize training dynamics. Through empirical evaluation on benchmark environments, we demonstrate the efficacy of IP3O compared to the performance of state-of-the-art Safe RL algorithms. Furthermore, we provide theoretical guarantees by deriving a bound on the worst-case error of the optimality achieved by our algorithm.
7616: Temporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition
Authors: Ping Li, Jianan Ni, Bo Pang
Location: Guangzhou | Day: TBD
Action recognition models using deep learning are vulnerable to adversarial examples, which are transferable across other models trained on the same data modality. Existing transferable attack methods face two major challenges: 1) they heavily rely on the assumption that the decision boundaries of the surrogate (a.k.a., source) model and the target model are similar, which limits the adversarial transferability; and 2) their decision boundary difference makes the attack direction uncertain, which may result in gradient oscillation, weakening the adversarial attack. This motivates us to propose a Background Mixup-induced Temporal Consistency (BMTC) attack method for action recognition. From the input transformation perspective, we design a model-agnostic background adversarial mixup module to reduce the surrogate-target model dependency. In particular, we randomly sample one video from each category and generate its background frame, and then use reinforcement learning to select the background frame with the strongest attack ability for mixup with the clean frame. Moreover, to ensure an explicit attack direction, we leverage the background category as guidance for updating the gradient of the adversarial example, and design a temporal gradient consistency loss, which strengthens the stability of the attack direction on subsequent frames. Empirical studies on two video datasets, i.e., UCF101 and Kinetics-400, and one image dataset, i.e., ImageNet, demonstrate that our method significantly boosts the transferability of adversarial examples across several action/image recognition models.
7622: External Memory Matters: Generalizable Object-Action Memory for Retrieval-Augmented Long-Term Video Understanding
Authors: Jisheng Dang, Huicheng Zheng, Xudong Wu, Jingmei Jiao, Bimei Wang, Jun Yang, Bin Hu, Jianhuang Lai, Tat Seng Chua
Location: Guangzhou | Day: TBD
Long video understanding with Large Language Models (LLMs) enables the description of objects that are not explicitly present in the training data. However, continuous changes in known objects and the emergence of new ones require up-to-date knowledge of objects and their dynamics for effective understanding of the open world. To alleviate this, we propose an efficient Retrieval-Enhanced Video Understanding method, dubbed REVU, which leverages external knowledge to enhance the performance of open-world learning. First, REVU introduces an extensible external text-object memory with minimal text-visual mapping, involving static and dynamic multimodal information to help LLMs-based models align text and vision features. Second, REVU retrieves object information from external databases and dynamically integrates frame-specific data from videos, enabling effective knowledge aggregation to comprehend the open world. We conducted experiments on multiple benchmark datasets, and our model demonstrates strong adaptability to out-of-domain data without requiring additional fine-tuning or re-training. Experiments on benchmark video understanding datasets reveal that our model achieves state-of-the-art performance and robust generalization.
7628: Multi-Agent Communication with Information Preserving Graph Contrastive Learning
Authors: Wei Du, Shifei Ding, Wei Guo, Yuqing Sun, Guoxian Yu, Lizhen Cui
Location: Guangzhou | Day: TBD
Recent research in cooperative Multi-Agent Reinforcement Learning (MARL) has shown significant interest in utilizing Graph Neural Networks (GNNs) for communication learning due to their strong ability to process feature and topological information of agents into message representations for downstream action selection and coordination. However, GNNs generally assume network homogeneity that nodes of the same class tend to be interconnected. In real-world multi-agent systems, such assumptions are often unrealistic, as agents within the same class can be distant from each other. Furthermore, GNN-based MARL methods overlook the crucial role of feature similarity of agents in action coordination, which also restricts their performance. To overcome these limitations, we propose a Multi-Agent communication mechanism with Information preserving graph contrastive Learning (MAIL), which enhances message representation by preserving the comprehensive features of adjacent agents while integrating topological information. Specifically, MAIL considers three distinct graph views: original view, agent feature view, and global topological view. MAIL performs contrastive learning across three views to extract comprehensive information. MAIL effectively learns robust and expressive message representations for downstream tasks. Extensive experiments across various environments demonstrate that MAIL outperforms existing GNN-based MARL methods.
7631: Dirichlet Process-Based Robust Clustering Using the Median-of-Means Estimator
Authors: Supratik Basu, Jyotishka Ray Choudhury, Debolina Paul, Swagatam Das
Location: Guangzhou | Day: TBD
Clustering stands as one of the most prominent challenges in unsupervised machine learning. Among centroid-based methods, the classic $k$-means algorithm, based on Lloyd’s heuristic, is widely used. Nonetheless, it is a well-known fact that $k$-means and its variants face several challenges, including heavy reliance on initial cluster centroids, susceptibility to converging into local minima of the objective function, and sensitivity to outliers and noise in the data. When data contains noise or outliers, the Median-of-Means (MoM) estimator offers a robust alternative for stabilizing centroid-based methods. On a different note, another limitation in many commonly used clustering methods is the need to specify the number of clusters beforehand. Model-based approaches, such as Bayesian nonparametric models, address this issue by incorporating infinite mixture models, eliminating the predefined cluster count requirement. Motivated by these facts, we propose an efficient and automatic clustering technique in this article by integrating the strengths of model-based and centroid-based methodologies. Our method mitigates the effect of noise on the quality of clustering while simultaneously estimating the number of clusters. Statistical guarantees on an upper bound of clustering error and rigorous assessment through simulated and real datasets suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.
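Illustrative note on the Median-of-Means (MoM) estimator mentioned above: the sketch below shows the basic scalar estimator only, not the clustering method proposed in the paper. The data are shuffled, split into blocks, averaged per block, and the median of the block means is returned, so a few outliers can corrupt only a few blocks. The function name, the `seed` argument, and the striding used to form blocks are assumptions made for the example.

```python
import random
import statistics


def median_of_means(values, n_blocks, seed=0):
    """Median-of-Means estimate of the mean of a sequence of numbers."""
    if not 1 <= n_blocks <= len(values):
        raise ValueError("n_blocks must be between 1 and len(values)")
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)  # random assignment of points to blocks
    # Striding splits the shuffled data into n_blocks non-empty, near-equal blocks.
    blocks = [shuffled[i::n_blocks] for i in range(n_blocks)]
    return statistics.median(statistics.fmean(block) for block in blocks)
```

For example, with nine values equal to 1.0, one outlier of 1000.0, and five blocks, the outlier affects a single block mean, so the estimate stays at 1.0 while the ordinary mean is 100.9.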
7635: Diff-LMM: Diffusion Teacher-Guided Spatio-Temporal Perception for Video Large Multimodal Models
Authors: Jisheng Dang, Ligen Chen, Jingze Wu, Ronghao Lin, Bimei Wang, Yun Wang, Liting Wang, Nannan Zhu, Teng Wang
Location: Guangzhou | Day: TBD
Dynamic spatio-temporal understanding is essential for video-based multimodal tasks, yet existing methods often struggle to capture fine-grained temporal and spatial relationships in long videos. Current approaches primarily rely on pre-trained CLIP encoders, which excel in semantic understanding but lack spatially-aware visual context. This leads to hallucinated results when interpreting fine-grained objects or scenes. To address these limitations, we propose a novel framework that integrates diffusion models into multimodal video models. By employing diffusion encoders at intermediate layers, we enhance visual representations through feature alignment and knowledge distillation losses, significantly improving the model’s ability to capture spatial patterns over time. Additionally, we introduce a multi-level alignment strategy to learn robust feature correspondence from pre-trained diffusion models. Extensive experiments on benchmark datasets demonstrate our approach’s state-of-the-art performance across multiple video understanding tasks. These results establish diffusion models as a powerful tool for enhancing multimodal video models in complex, dynamic scenarios.
7647: ARMR: Adaptively Responsive Network for Medication Recommendation
Authors: Feiyue Wu, Tianxing Wu, Shenqi Jing
Location: Guangzhou | Day: TBD
Medication recommendation is a crucial task in healthcare, especially for patients with complex medical conditions. However, existing methods often struggle to effectively balance the reuse of historical medications with the introduction of new drugs in response to changing patient conditions. To address this challenge, we propose an Adaptively Responsive network for Medication Recommendation (ARMR), a new method which incorporates 1) a piecewise temporal learning component that distinguishes between recent and distant patient history, enabling more nuanced temporal understanding, and 2) an adaptively responsive mechanism that dynamically adjusts attention to new and existing drugs based on the patient’s current health state and medication history. Experiments on the MIMIC-III and MIMIC-IV datasets indicate that ARMR outperforms the state-of-the-art baselines on different evaluation metrics, which contributes to more personalized and accurate medication recommendations. The source code is publicly available at: https://github.com/seucoin/armr2.
7661: M3ANet: Multi-scale and Multi-Modal Alignment Network for Brain-Assisted Target Speaker Extraction
Authors: Cunhang Fan, Ying Chen, Jian Zhou, Zexu Pan, Jingjing Zhang, Youdian Gao, Xiaoke Yang, Zhengqi Wen, Zhao Lv
Location: Guangzhou | Day: TBD
Brain-assisted target speaker extraction (TSE) aims to extract the attended speech from mixed speech by utilizing neural activity in the brain, such as electroencephalography (EEG) signals. However, existing models overlook the issue of temporal misalignment between speech and EEG modalities, which hampers TSE performance. In addition, the speech encoder in current models typically uses basic temporal operations (e.g., one-dimensional convolution), which are unable to effectively extract target speaker information. To address these issues, this paper proposes a multi-scale and multi-modal alignment network (M3ANet) for brain-assisted TSE. Specifically, to eliminate the temporal inconsistency between EEG and speech modalities, a modal alignment module that uses a contrastive learning strategy is applied to align the temporal features of both modalities. Additionally, to fully extract speech information, multi-scale convolutions with GroupMamba modules are used as the speech encoder, which scans speech features at each scale from different directions, enabling the model to capture deep sequence information. Experimental results on three publicly available datasets show that the proposed model outperforms current state-of-the-art methods across various evaluation metrics, highlighting the effectiveness of our proposed method. The source code is available at: https://github.com/fchest/M3ANet.
7669: WDMIR: Wavelet-Driven Multimodal Intent Recognition
Authors: Weiyin Gong, Kai Zhang, Yanghai Zhang, Qi Liu, Xinjie Sun, Junyu Lu, Linbo Zhu
Location: Guangzhou | Day: TBD
Multimodal intent recognition (MIR) seeks to accurately interpret user intentions by integrating verbal and non-verbal information across video, audio and text modalities. While existing approaches prioritize text analysis, they often overlook the rich semantic content embedded in non-verbal cues. This paper presents a novel Wavelet-Driven Multimodal Intent Recognition (WDMIR) framework that enhances intent understanding through frequency-domain analysis of non-verbal information. To be more specific, we propose: (1) a wavelet-driven fusion module that performs synchronized decomposition and integration of video-audio features in the frequency domain, enabling fine-grained analysis of temporal dynamics; (2) a cross-modal interaction mechanism that facilitates progressive feature enhancement from bimodal to trimodal integration, effectively bridging the semantic gap between verbal and non-verbal information. Extensive experiments on MIntRec demonstrate that our approach achieves state-of-the-art performance, surpassing previous methods by 1.13% on accuracy. Ablation studies further verify that the wavelet-driven fusion module significantly improves the extraction of semantic information from non-verbal sources, with a 0.41% increase in recognition accuracy when analyzing subtle emotional cues.
7683: Where Does This Data Come From? Enhanced Source Inference Attacks in Federated Learning
Authors: Haiyang Chen, Xiaolong Xu, Xiang Zhu, Xiaokang Zhou, Fei Dai, Yansong Gao, Xiao Chen, Shuo Wang, Hongsheng Hu
Location: Guangzhou | Day: TBD
Federated learning (FL) enables collaborative model training without exposing raw data, offering a privacy-aware alternative to centralized learning. However, FL remains vulnerable to various privacy attacks that exploit shared model updates, including membership inference, property inference, and gradient inversion. Source inference attacks further threaten FL by identifying which client contributed a specific training sample, posing severe risks to user and institutional privacy. Existing source inference attacks mainly assume passive adversaries and overlook more realistic scenarios where the server actively manipulates the training process. In this paper, we present an enhanced source inference attack that demonstrates how a malicious server can amplify behavioral differences between clients to more accurately infer data origin. Our approach introduces active training manipulation and data augmentation to expose client-specific patterns. Experimental results across five representative FL algorithms and multiple datasets show that our method significantly outperforms prior passive attacks. These findings reveal a deeper level of privacy vulnerability in FL and call for stronger defense mechanisms under active threat models.
7685: Inferring Causal Protein Signaling Networks with Reinforcement Learning via Artificial Bee Colony Neural Architecture Search
Authors: Jihao Zhai, Junzhong Ji, Jinduo Liu
Location: Guangzhou | Day: TBD
Show Abstract
Inferring causal protein signaling networks from human immune system cellular data is an important approach to reveal underlying tissue signaling biology and dysfunction in diseased cells. In recent years, reinforcement learning (RL) methods have shown excellent performance in the field of causal protein signaling network inference. However, the complexity of RL models and the need for manual hyperparameter tuning can hinder performance. In this paper, we propose an actor-critic RL model via artificial bee colony (ABC) neural architecture search, called ABCNAS-RL. Specifically, the entire method is divided into two phases: ABC neural architecture search and actor-critic RL search. In phase one, we represent each bee as a set of hyperparameters and use the ABC algorithm to search for optimal hyperparameters of the actor-critic RL model on the training set. In phase two, we use the actor-critic RL model to infer the causal protein signaling network on the test set. The actor network consists of an encoder-decoder architecture, composed of a transformer and a bidirectional gated recurrent unit (BiGRU) with an integrated attention mechanism. The critic network consists of a fully connected neural network that estimates the output state of the actor network. By maximizing cumulative rewards, we ultimately derive the causal protein signaling network. Extensive experimental results on simulated and real datasets verify that ABCNAS-RL outperforms the comparison methods.
7700: Always Clear Depth: Robust Monocular Depth Estimation Under Adverse Weather
Authors: Kui Jiang, Jing Cao, Zhaocheng Yu, Junjun Jiang, Jingchun Zhou
Location: Guangzhou | Day: TBD
Show Abstract
Monocular depth estimation is critical for applications such as autonomous driving and scene reconstruction. While existing methods perform well under normal scenarios, their performance declines in adverse weather, due to challenging domain shifts and difficulties in extracting scene information. To address this issue, we present a robust monocular depth estimation method called ACDepth from the perspective of high-quality training data generation and domain adaptation. Specifically, we introduce a one-step diffusion model for generating samples that simulate adverse weather conditions, constructing a multi-tuple degradation dataset during training. To ensure the quality of the generated degradation samples, we employ LoRA adapters to fine-tune the generation weights of the diffusion model. Additionally, we integrate circular consistency loss and adversarial training to guarantee the fidelity and naturalness of the scene contents. Furthermore, we elaborate on a multi-granularity knowledge distillation strategy (MKD) that encourages the student network to absorb knowledge from both the teacher model and pretrained Depth Anything V2. This strategy guides the student model in learning degradation-agnostic scene information from various degradation inputs. In particular, we introduce an ordinal guidance distillation mechanism (OGD) that encourages the network to focus on uncertain regions through differential ranking, leading to a more precise depth estimation. Experimental results demonstrate that our ACDepth surpasses md4all-DD by 2.50% for night scenes and 2.61% for rainy scenes on the nuScenes dataset in terms of the absRel metric.
7706: Contamination Budget: Trade-offs Between Breadth, Depth and Difficulty
Authors: Behzad Mehrbakhsh, Fernando Martínez-Plumed, José Hernández-Orallo
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: Natural Language Processing (1/2)
Show Abstract
Contamination in large language models (LLMs), and machine learning more broadly, refers to the inclusion of identical (or very similar) examples in both training and test sets. This phenomenon usually translates into better test performance. Here we explore the case where this contamination is performed intentionally, for purposes that can be malicious (e.g., getting better scores in evaluations) or benevolent (e.g., fixing some mistakes). These interventions, usually in the form of fine-tuning memorisations, come with a budget in the size of the fine-tuning dataset. Several trade-offs appear between the breadth of the intervention (how many examples are to be memorised), its depth (how many repetitions of each example) and the difficulty of the examples. By studying several LLMs and datasets, we observe some monotonic behaviour (more difficult items require more depth to be ‘fixed’) but also some non-monotonic phenomena (very high depth levels have negative effects on non-contaminated examples). This suggests that trade-offs should be found not only in terms of the budget but also according to model specifics, the task and the item difficulty at hand.
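The breadth/depth trade-off described above can be made concrete with a little arithmetic. The sketch below assumes, as a simplification not stated in the abstract, that the budget equals breadth times depth exactly; the function names `budget_splits` and `build_finetuning_set` are hypothetical and used only for illustration.

```python
from typing import List, Tuple


def budget_splits(budget: int) -> List[Tuple[int, int]]:
    """List every (breadth, depth) pair whose product equals the budget.

    Assumption for illustration: budget = breadth * depth, where breadth is
    the number of distinct examples and depth the repetitions of each one.
    """
    return [(b, budget // b) for b in range(1, budget + 1) if budget % b == 0]


def build_finetuning_set(examples: List[str], breadth: int, depth: int) -> List[str]:
    """Repeat each of the first `breadth` examples `depth` times."""
    return [ex for ex in examples[:breadth] for _ in range(depth)]


splits = budget_splits(12)
dataset = build_finetuning_set(["q1", "q2", "q3", "q4"], breadth=3, depth=4)
```

With a budget of 12, the same dataset size can cover one example twelve times, three examples four times each, or twelve examples once each, which is the kind of choice the abstract says depends on model, task and item difficulty.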
7709: Imputation-free Incomplete Multi-view Clustering via Knowledge Distillation
Authors: Benyu Wu, Wei Du, Jun Wang, Guoxian Yu
Location: Guangzhou | Day: TBD
Show Abstract
Incomplete multi-view data presents a significant challenge for multi-view clustering (MVC). Existing incomplete MVC solutions commonly rely on data imputation to convert incomplete data into complete data. However, this paradigm suffers from the risk of error accumulation when clustering unreliable imputed data, causing suboptimal clustering performance. Moreover, using imputation to fill in missing data is inefficient, while inferring data categories based solely on the existing views is extremely challenging. To this end, we propose an Imputation-free Incomplete MVC (I2MVC) via pseudo-supervised knowledge distillation. Specifically, I2MVC decomposes the incomplete MVC problem into two tasks: an MVC task for complete data and a pseudo-supervised classification task for fully incomplete data. A self-supervised simple contrastive Teacher network is trained for clustering complete data, and its knowledge is distilled into a lightweight pseudo-supervised Student network. The Student network, unrestricted by view completeness, further guides the clustering of fully incomplete data. Finally, the clustering results from both tasks are merged to generate the final clustering outcome. Experimental results on benchmark datasets demonstrate the effectiveness of I2MVC.
7711: Aligning Contrastive Multiple Clusterings with User Interests
Authors: Shan Zhang, Liangrui Ren, Jun Wang, Yanyu Xu, Carlotta Domeniconi, Guoxian Yu
Location: Guangzhou | Day: TBD
Show Abstract
Multiple clustering approaches aim to partition complex data in different ways. These methods often exhibit a one-to-many relationship in their results, and relying solely on the data context may be insufficient to capture the patterns relevant to the user. User expectations are key to the multiple clustering task. Two main challenges exist: identifying the significant features to represent user interests and aligning those interests with the clustering results. To address these challenges, we propose Contrastive Multiple Clusterings (CMClusts), which extends contrastive learning to multiple clustering by elevating traditional instance-level contrast to clustering-level contrast. Furthermore, CMClusts integrates user expectations or interests by extracting desired features through tailored data augmentations, enabling the model to effectively capture user-relevant clustering features. Experimental results on benchmark datasets show that CMClusts can generate interpretable and high-quality clusterings, which reflect different user interests.
7733: Robult: Leveraging Redundancy and Modality-Specific Features for Robust Multimodal Learning
Authors: Duy A. Nguyen, Abhi Kamboj, Minh N. Do
Location: Montreal | Day: August 20th | Time: 10:00 | Session: Multidisciplinary Topics and Applications (2/2)
Show Abstract
Addressing missing modalities and limited labeled data is crucial for advancing robust multimodal learning. We propose Robult, a scalable framework designed to mitigate these challenges by preserving modality-specific information and leveraging redundancy through a novel information-theoretic approach. Robult optimizes two core objectives: (1) a soft Positive-Unlabeled (PU) contrastive loss that maximizes task-relevant feature alignment while effectively utilizing limited labeled data in semi-supervised settings, and (2) a latent reconstruction loss that ensures unique modality-specific information is retained. These strategies, embedded within a modular design, enhance performance across various downstream tasks and ensure resilience to incomplete modalities during inference. Experimental results across diverse datasets validate that Robult achieves superior performance over existing approaches in both semi-supervised learning and missing modality contexts. Furthermore, its lightweight design promotes scalability and seamless integration with existing architectures, making it suitable for real-world multimodal applications.
7736: Diffuse&Refine: Intrinsic Knowledge Generation and Aggregation for Incremental Object Detection
Authors: Jianzhou Wang, Yirui Wu, Lixin Yuan, Wenxiao Zhang, Jun Liu, Junyang Chen, Huan Wang, Wenhai Wang
Location: Guangzhou | Day: TBD
Show Abstract
Incremental Object Detection (IOD) aims to progressively extend the capability of object detectors to recognize new classes. However, representation confusion between old and new classes leads to catastrophic forgetting. To alleviate this problem, we propose DiffKA, in which intrinsic knowledge is generated and aggregated by forward and backward diffusion, gradually establishing rigid class boundaries. With incremental streaming data, forward diffusion spreads information to generate potential inter-class associations among new- and old-class prototypes within a hierarchical tree, named the Intrinsic Correlation Tree (ICTree), to store intrinsic knowledge. Afterwards, backward diffusion refines and aggregates the generated knowledge in the ICTree, explicitly establishing rigid class boundaries to mitigate representation confusion. To keep semantic consistency under extreme IOD settings, we reorganize the semantic relevance of old- and new-class prototypes in paradigms to adaptively and effectively update DiffKA. Experiments on the MS COCO dataset show that DiffKA achieves state-of-the-art performance on IOD tasks with significant advantages.
7737: ExpTalk: Diverse Emotional Expression via Adaptive Disentanglement and Refined Alignment for Speech-Driven 3D Facial Animation
Authors: Zhan Qu, Shengyu Zhang, Mengze Li, Zhuo Chen, Chengfei Lv, Zhou Zhao, Fei Wu
Location: Guangzhou | Day: TBD
Show Abstract
Speech-driven 3D facial animation aims to create lifelike facial expressions that synchronize accurately with speech. Despite significant progress, many existing methods focus on generating facial animation with a fixed emotional state, neglecting the diverse transformations of facial emotions under a given speech input. To solve this issue, we focus on exploring the refined alignment between speech representations and multiple domains in facial expression information. We aim to disentangle the spoken language and emotion facial priors from speech expressions, to guide the refinement of the facial vertices based on speech. To accomplish this objective, we propose ExpTalk, which first applies an Adaptive Disentanglement Variational Autoencoder (AD-VAE) to decouple facial expression aligned with spoken language and emotions of speech through contrastive learning. Then a Refined Alignment Diffusion (RAD) is employed to iteratively refine the decoupled facial expression priors through diffusion-based perturbations, producing facial animations that align with the emotional variations of the given speech. Extensive experiments demonstrate the effectiveness of ExpTalk, which surpasses state-of-the-art methods by a large margin.
7753: Cause-Effect Driven Optimization for Robust Medical Visual Question Answering with Language Biases
Authors: Huanjia Zhu, Yishu Liu, Xiaozhao Fang, Guangming Lu, Bingzhi Chen
Location: Guangzhou | Day: TBD
Show Abstract
Existing Medical Visual Question Answering (Med-VQA) models often suffer from language biases, where spurious correlations between question types and answer categories are inadvertently established. To address these issues, we propose a novel Cause-Effect Driven Optimization framework called CEDO, that incorporates three well-established mechanisms, i.e., Modality-driven Heterogeneous Optimization (MHO), Gradient-guided Modality Synergy (GMS), and Distribution-adapted Loss Rescaling (DLR), for comprehensively mitigating language biases from both causal and effectual perspectives. Specifically, MHO employs adaptive learning rates for specific modalities to achieve heterogeneous optimization, thus enhancing robust reasoning capabilities. Additionally, GMS leverages the Pareto optimization method to foster synergistic interactions between modalities and enforce gradient orthogonality to eliminate bias updates, thereby mitigating language biases from the effect side, i.e., shortcut bias. Furthermore, DLR is designed to assign adaptive weights to individual losses to ensure balanced learning across all answer categories, effectively alleviating language biases from the cause side, i.e., imbalance biases within datasets. Extensive experiments on multiple traditional and bias-sensitive benchmarks consistently demonstrate the robustness of CEDO over state-of-the-art competitors.
7755: Balancing Imbalance: Data-Scarce Urban Flow Prediction via Spatio-Temporal Balanced Transfer Learning
Authors: Xinyan Hao, Huaiyu Wan, Shengnan Guo, Youfang Lin
Location: Guangzhou | Day: TBD
Show Abstract
Advanced deep spatio-temporal networks have become the mainstream for traffic prediction, but the widespread adoption of these models is impeded by the prevalent scarcity of available data. Despite cross-city transfer learning emerging as a common strategy to address this issue, it overlooks the inherent distribution imbalances within each city, which could potentially hinder the generalization capabilities of pre-trained models. To overcome this limitation, we propose a Spatio-Temporal Balanced Transfer Learning (STBaT) framework to enhance existing spatio-temporal prediction networks, ensuring both universality and precision in predictions for new urban environments. A Regional Imbalance Acquisition Module is designed to model the regional imbalances of source cities. Besides, to promote generalizable knowledge acquisition, a Spatio-Temporal Balanced Learning Module is devised to balance the predictive learning process. Extensive experiments on real-world datasets validate the efficacy of our proposed approach compared with several state-of-the-art methods.
7756: Enhancing Nighttime Semantic Segmentation with Visual-Linguistic Priors and Wavelet Transform
Authors: Jianhou Zhou, Xiaolong Zhou, Sixian Chan, Zhaomin Chen, Xiaoqin Zhang
Location: Guangzhou | Day: TBD
Show Abstract
Nighttime semantic segmentation is a critical yet challenging task in autonomous driving. Most existing methods are designed for daytime scenarios, resulting in poor nighttime performance due to texture loss and decreased object visibility. Low-light enhancement was applied before segmentation but failed to recover nighttime-specific details, introducing noise or losing delicate structures. Recent work shows that large-scale image-text pairs can effectively leverage natural language priors to guide visual representation, achieving remarkable performance across various downstream visual tasks. However, effectively employing visual-linguistic priors for nighttime semantic segmentation remains underexplored. To address these issues, we propose Text-WaveletFormer, a novel end-to-end framework that integrates text prompts and wavelet-based texture enhancement. Specifically, to compensate for the low recognizability of objects in nighttime scenes, we design a Text-Image Fusion Module (TIFM) to incorporate textual priors to improve nighttime object recognition. In addition, to alleviate the lack of texture details in nighttime conditions, we introduce a Wavelet Guided Texture Amplifier Module (WTAM) to fuse wavelet and raw image features via cross-attention, restoring low-light details. Finally, extensive experiments on benchmarks including NightCity, NightCity-fine, BDD100K, and CityScapes demonstrate our method’s superior performance over existing approaches.
7759: Representation Learning with Mutual Influence of Modalities for Node Classification in Multi-Modal Heterogeneous Networks
Authors: Jiafan Li, Jiaqi Zhu, Liang Chang, Yilin Li, Miaomiao Li, Yang Wang, Yi Yang, Hongan Wang
Location: Guangzhou | Day: TBD
Show Abstract
Nowadays, numerous online platforms can be described as multi-modal heterogeneous networks (MMHNs), such as Douban’s movie networks and Amazon’s product review networks. Accurately categorizing nodes within these networks is crucial for analyzing the corresponding entities, which requires effective representation learning on nodes. However, existing multi-modal fusion methods often adopt either early fusion strategies which may lose the unique characteristics of individual modalities, or late fusion approaches overlooking the cross-modal guidance in GNN-based information propagation. In this paper, we propose a novel model for node classification in MMHNs, named Heterogeneous Graph Neural Network with Inter-Modal Attention (HGNN-IMA). It learns node representations by capturing the mutual influence of multiple modalities during the information propagation process, within the framework of heterogeneous graph transformer. Specifically, a nested inter-modal attention mechanism is integrated into the inter-node attention to achieve adaptive multi-modal fusion, and modality alignment is also taken into account to encourage the propagation among nodes with consistent similarities across all modalities. Moreover, an attention loss is augmented to mitigate the impact of missing modalities. Extensive experiments validate the superiority of the model in the node classification task, providing an innovative view to handle multi-modal data, especially when accompanied with network structures. The full version including Appendix is available at http://arxiv.org/abs/2505.07895.
7768: Mitigating Over-Smoothing in Graph Neural Networks via Separation Coefficient-Guided Adaptive Graph Structure Adjustment
Authors: Hanyang Meng, Jielong Yang, Li Peng
Location: Guangzhou | Day: TBD
Show Abstract
As the number of layers in Graph Neural Networks (GNNs) increases, over-smoothing becomes more severe, causing intra-class feature distances to shrink, while heterogeneous representations tend to converge. Most existing methods attempt to address this issue by employing heuristic shortcut mechanisms or optimizing objectives to constrain inter-class feature differences. However, these approaches fail to establish a theoretical connection between message passing and the variation in inter-class feature differences, making it challenging to design methods that target the key influencing factors. To address this gap, this paper first introduces the concept of the separation coefficient, which quantifies the contraction of feature distances between classes during multi-layer message passing. Based on this theory, we propose a low-complexity, pluggable, pseudo-label-based adaptive graph structure adjustment method. This approach effectively enhances the separation coefficient of inter-class features while maintaining intra-class compactness, thereby alleviating the convergence of heterogeneous representations caused by multi-layer aggregation. Experimental results demonstrate that the proposed method significantly improves the discriminability of node representations and enhances node classification performance across various datasets and foundational models.
7783: Adaptive Gradient Learning for Spiking Neural Networks by Exploiting Membrane Potential Dynamics
Authors: Jiaqiang Jiang, Lei Wang, Runhao Jiang, Jing Fan, Rui Yan
Location: Montreal | Day: August 20th | Time: 10:00 | Session: ML: Spiking Neural Networks
Show Abstract
Recent advancements have focused on directly training high-performance spiking neural networks (SNNs) by estimating the approximate gradients of spiking activity through a continuous function with constant sharpness, known as surrogate gradient (SG) learning. However, as spikes propagate within neurons and among layers, the distribution of membrane potential dynamics (MPD) will deviate from the gradient-available interval of fixed SG, hindering SNNs from searching the optimal solution space. To maintain the stability of gradient flows, SG needs to align with evolving MPD. Here, we propose a novel adaptive gradient learning for SNNs by exploiting MPD, namely MPD-AGL. It fully accounts for the underlying factors contributing to membrane potential shifts and establishes a dynamic association between SG and MPD at different timesteps to relax gradient estimation, which provides a new degree of freedom for SG learning. Experimental results demonstrate that our method achieves excellent performance at low latency. Moreover, it increases the proportion of neurons that fall into the gradient-available interval compared to fixed SG, effectively mitigating the gradient vanishing problem. Code is available at https://github.com/jqjiang1999/MPD-AGL.
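To illustrate what a "gradient-available interval" means for a fixed surrogate gradient, here is a minimal sketch. It uses a rectangular surrogate, which is one common choice and not necessarily the function used in the paper; the threshold, width and sample potentials are assumptions made for illustration. Widening the window by hand, as done here, is only a stand-in and is not the adaptive MPD-AGL rule.

```python
from typing import Sequence


def rectangular_surrogate(u: float, threshold: float = 1.0, width: float = 1.0) -> float:
    """Approximate d(spike)/du by 1/width inside a window around the threshold.

    Outside the window |u - threshold| < width / 2 the surrogate gradient is 0,
    so neurons whose membrane potential lies there receive no gradient.
    """
    return 1.0 / width if abs(u - threshold) < width / 2 else 0.0


def gradient_available_fraction(
    potentials: Sequence[float], threshold: float = 1.0, width: float = 1.0
) -> float:
    """Fraction of membrane potentials that fall inside the nonzero-gradient window."""
    inside = sum(1 for u in potentials if rectangular_surrogate(u, threshold, width) > 0)
    return inside / len(potentials)


# Hypothetical membrane potentials at one timestep.
potentials = [0.2, 0.6, 0.9, 1.1, 1.4, 2.0]
narrow = gradient_available_fraction(potentials, width=1.0)  # window (0.5, 1.5)
wide = gradient_available_fraction(potentials, width=2.0)    # window (0.0, 2.0)
```

In this toy example four of the six potentials receive a gradient with the narrow window and five with the wide one, which shows why a window that does not track the potential distribution can starve neurons of gradient.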
7793: Semantic-Space-Intervened Diffusive Alignment for Visual Classification
Authors: Zixuan Li, Lei Meng, Guoqing Chao, Wei Wu, Yimeng Yang, Xiaoshuo Yan, Zhuang Qi, Xiangxu Meng
Location: Guangzhou | Day: TBD
Show Abstract
Cross-modal alignment is an effective approach to improving visual classification. Existing studies typically enforce a one-step mapping that uses deep neural networks to project the visual features to mimic the distribution of textual features. However, they typically face difficulties in finding such a projection due to differences between the two modalities in both the distribution of class-wise samples and the range of their feature values. To address this issue, this paper proposes a novel Semantic-Space-Intervened Diffusive Alignment method, termed SeDA, which models a semantic space as a bridge in the visual-to-textual projection, considering that both types of features share the same class-level information in classification. More importantly, a bi-stage diffusion framework is developed to enable the progressive alignment between the two modalities. Specifically, SeDA first employs a Diffusion-Controlled Semantic Learner to model the semantic feature space of visual features by constraining the interactive features of the diffusion model and the category centers of visual features. In the later stage of SeDA, the Diffusion-Controlled Semantic Translator focuses on learning the distribution of textual features from the semantic space. Meanwhile, the Progressive Feature Interaction Network introduces stepwise feature interactions at each alignment step, progressively integrating textual information into mapped features. Experimental results show that SeDA achieves stronger cross-modal feature alignment, leading to superior performance over existing methods across multiple scenarios.
7802: Variety-Seeking Jump Games on Graphs
Authors: Lata Narayanan, Jaroslav Opatrny, Shanmukha Tummala, Alexandros A. Voudouris
Location: Montreal | Day: August 21st | Time: 10:00 | Session: GTEP: Noncooperative games
Show Abstract
We consider a class of jump games in which agents of different types occupy the nodes of a graph aiming to maximize the variety of types in their neighborhood. In particular, each agent derives a utility equal to the number of types different from its own in its neighborhood. We show that the jump game induced by the strategic behavior of the agents (who aim to maximize their utility) may in general have improving response cycles, but is a potential game under any of the following four conditions: there are only two types of agents; or exactly one empty node; or the graph is of degree at most 2; or the graph is 3-regular and there are two empty nodes. Additionally, we show that on trees, cylinder graphs, and tori, there is always an equilibrium. Finally, we show tight bounds on the price of anarchy with respect to two different measures of diversity: the social welfare (the total utility of the agents) and the number of colorful edges (that connect agents of different types).
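The utility definition above can be sketched in a few lines. This is an illustrative reading, assuming that "number of types" means the number of distinct types among an agent's neighbours and that a jump moves one agent to an empty node; the helper names, the path graph and the type labels are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[str]]            # adjacency lists
Placement = Dict[str, Optional[str]]    # node -> agent type, or None if empty


def utility(graph: Graph, placement: Placement, node: str) -> int:
    """Count the distinct neighbouring types that differ from the agent's own type."""
    own = placement[node]
    neighbour_types = {placement[v] for v in graph[node] if placement[v] is not None}
    return len(neighbour_types - {own})


def improving_jumps(graph: Graph, placement: Placement) -> List[Tuple[str, str]]:
    """List (from_node, to_node) jumps that strictly increase the jumping agent's utility."""
    jumps = []
    for u, agent_type in placement.items():
        if agent_type is None:
            continue
        current = utility(graph, placement, u)
        for e, occupant in placement.items():
            if occupant is not None:
                continue
            moved = dict(placement)
            moved[u], moved[e] = None, agent_type  # the agent leaves u and occupies e
            if utility(graph, moved, e) > current:
                jumps.append((u, e))
    return jumps


# Path graph a - b - c - d with one empty node (d).
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
placement = {"a": "red", "b": "red", "c": "blue", "d": None}
jumps = improving_jumps(graph, placement)
```

A placement is an equilibrium in this sketch exactly when `improving_jumps` returns an empty list. Here the red agent at `a` has utility 0 and can gain by jumping to `d`, next to the blue agent.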
7815: General Incomplete Time Series Analysis via Patch Dropping Without Imputation
Authors: Yangyang Wu, Yi Yuan, Mengying Zhu, Xiaoye Miao, Meng Xi
Location: Guangzhou | Day: TBD
Show Abstract
Missing values in multivariate time series data present significant challenges to effective analysis. Existing methods for multivariate time series analysis either ignore missing data, sacrificing performance, or follow the impute-then-analyze paradigm, which suffers from redundant training and error accumulation, leading to biased results and suboptimal performance. In this paper, we propose INTER, a novel end-to-end framework for incomplete multivariate time series analysis, which bypasses imputation by leveraging pre-trained language models to learn the distribution of incomplete time series data. INTER incorporates two novel components: the missing-rate-aware time series patch-dropping (MPD) strategy and the missing-aware Transformer block, both of which we propose to enhance model generalization, robustness, and the ability to capture underlying patterns in the observed incomplete time series. Moreover, we theoretically prove that the MPD strategy exhibits lower sample variance for time series with the same dropout rate compared to other dropping strategies. Extensive experiments on 11 public real-world time series datasets demonstrate that INTER improves accuracy by over 20% compared to state-of-the-art methods, while maintaining competitive computational efficiency.
7822: NeSyA: Neurosymbolic Automata
Authors: Nikolaos Manginas, George Paliouras, Luc De Raedt
Location: Montreal | Day: August 21st | Time: 11:30 | Session: ML: Neurosymbolic AI
Show Abstract
Neurosymbolic (NeSy) AI has emerged as a promising direction to integrate neural and symbolic reasoning. Unfortunately, little effort has been given to developing NeSy systems tailored to sequential/temporal problems. We identify symbolic automata (which combine the power of automata for temporal reasoning with that of propositional logic for static reasoning) as a suitable formalism for expressing knowledge in temporal domains. Focusing on the task of sequence classification and tagging we show that symbolic automata can be integrated with neural-based perception, under probabilistic semantics towards an end-to-end differentiable model. Our proposed hybrid model, termed NeSyA (Neuro Symbolic Automata) is shown to either scale or perform more accurately than previous NeSy systems in a synthetic benchmark and to provide benefits in terms of generalization compared to purely neural systems in a real-world event recognition task. Code is available at: https://github.com/nmanginas/nesya
7823: Do Mentioned Items Truly Matter? Enhancing Conversational Recommender Systems with Causal Intervention and Large Language Models
Authors: Lingzhi Wang, Xingshan Zeng, Kam-Fai Wong
Location: Guangzhou | Day: TBD
Show Abstract
Conversational Recommender Systems (CRS) have become increasingly important due to their ability to recommend items through interactive dialogue, adapting to user preferences in real time. Traditional CRS approaches face challenges in generating high-quality, diverse responses due to the limited availability of training data and the inherited biases from domain-specific fine-tuning. Furthermore, existing systems often overlook the impact of confounding variables during user interactions, leading to suboptimal recommendations. In this work, we propose a novel hybrid framework that integrates large language models (LLMs) with traditional recommendation techniques to address these limitations. Our approach leverages the strengths of LLMs in generating fluent, contextually appropriate responses while employing a traditional recommendation module to capture complex interaction structures. To ensure unbiased recommendations, we introduce causal interventions that disentangle confounding variables, improving recommendation accuracy. We evaluate our framework on established CRS datasets, demonstrating significant improvements in recommendation quality and response generation. Our results highlight the effectiveness of the causal intervention mechanism in producing more reliable and personalized recommendations, while the LLM-based response generation offers scalability across multiple domains.
7827: Empowering Vision Transformers with Multi-Scale Causal Intervention for Long-Tailed Image Classification
Authors: Xiaoshuo Yan, Zhaochuan Li, Lei Meng, Zhuang Qi, Wei Wu, Zixuan Li, Xiangxu Meng
Location: Guangzhou | Day: TBD
Show Abstract
Causal inference has emerged as a promising approach to long-tailed classification because it handles the biases introduced by class imbalance. However, as advanced backbone models shift from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs), existing causal models may not achieve the expected performance gain. This paper investigates the influence of existing causal models on CNNs and ViT variants, highlighting that ViT’s global feature representation makes it hard for causal methods to model associations between fine-grained features and predictions, which leads to difficulties in classifying tail classes with similar visual appearance. To address these issues, this paper proposes TSCNet, a two-stage causal modeling method to discover fine-grained causal associations through multi-scale causal interventions. Specifically, in the hierarchical causal representation learning stage (HCRL), it decouples the background and objects, applying backdoor interventions at both the patch and feature levels to prevent the model from using class-irrelevant areas to infer labels, which enhances fine-grained causal representation. In the counterfactual logits’ bias calibration stage (CLBC), it refines the optimization of the model’s decision boundary by adaptively constructing a counterfactual balanced data distribution to remove the spurious associations in the logits caused by data distribution. Extensive experiments conducted on various long-tail benchmarks demonstrate that the proposed TSCNet can eliminate multiple biases introduced by data imbalance and outperforms existing methods.
7833: AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents
Authors: Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Andrey Kravchenko, Mikhail Burtsev, Evgeny Burnaev
Location: Guangzhou | Day: TBD
Show Abstract
Advancements in the capabilities of Large Language Models (LLMs) have created a promising foundation for developing autonomous agents. With the right tools, these agents could learn to solve tasks in new environments by accumulating and updating their knowledge. Current LLM-based agents process past experiences using a full history of observations, summarization, or retrieval augmentation. However, these unstructured memory representations do not facilitate the reasoning and planning essential for complex decision-making. In our study, we introduce AriGraph, a novel method wherein the agent constructs and updates a memory graph that integrates semantic and episodic memories while exploring the environment. We demonstrate that our Ariadne LLM agent, consisting of the proposed memory architecture augmented with planning and decision-making, effectively handles complex tasks within interactive text game environments that are difficult even for human players. Results show that our approach markedly outperforms other established memory methods and strong RL baselines in a range of problems of varying complexity. Additionally, AriGraph demonstrates competitive performance compared to dedicated knowledge graph-based methods in static multi-hop question-answering.
7835: Identifying Drivers of Predictive Aleatoric Uncertainty
Authors: Pascal Iversen, Simon Witzke, Katharina Baum, Bernhard Y. Renard
Location: Montreal | Day: August 21st | Time: 10:00 | Session: ML: Explainable/Interpretable machine learning
Explainability and uncertainty quantification are key to trustable artificial intelligence. However, the reasoning behind uncertainty estimates is generally left unexplained. Identifying the drivers of uncertainty complements explanations of point predictions in recognizing model limitations and enhancing transparent decision-making. So far, explanations of uncertainties have been rarely studied. The few exceptions rely on Bayesian neural networks or technically intricate approaches, such as auxiliary generative models, thereby hindering their broad adoption. We propose a straightforward approach to explain predictive aleatoric uncertainties. We estimate uncertainty in regression as predictive variance by adapting a neural network with a Gaussian output distribution. Subsequently, we apply out-of-the-box explainers to the model’s variance output. This approach can explain uncertainty influences more reliably than complex published approaches, which we demonstrate in a synthetic setting with a known data-generating process. We substantiate our findings with a nuanced, quantitative benchmark including synthetic and real, tabular and image datasets. For this, we adapt metrics from conventional XAI research to uncertainty explanations. Overall, the proposed method explains uncertainty estimates with few modifications to the model architecture and outperforms more intricate methods in most settings.
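The two steps named in this abstract (a Gaussian output head, then an explainer applied to the variance output) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `toy_model` is a hypothetical stand-in for a trained network, and the occlusion-style attribution is a generic explainer, not necessarily one used in the paper.

```python
import math

def gaussian_nll(y, mean, log_var):
    """Negative log-likelihood of y under N(mean, exp(log_var))."""
    var = math.exp(log_var)
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (y - mean) ** 2 / var)

def toy_model(x):
    """Hypothetical stand-in for a trained network: returns (mean, log_var)."""
    mean = 2.0 * x[0]
    log_var = 1.5 * x[1]  # by construction, only feature 1 drives the noise level
    return mean, log_var

def variance_output(x):
    """The predictive variance, i.e. the quantity to be explained."""
    return math.exp(toy_model(x)[1])

def occlusion_attribution(f, x, baseline):
    """Change in f when each feature is replaced by its baseline value."""
    scores = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline[i]
        scores.append(f(x) - f(occluded))
    return scores

scores = occlusion_attribution(variance_output, [1.0, 1.0], [0.0, 0.0])
```

In this toy setting the attribution for feature 0 is zero and the attribution for feature 1 is positive, matching the known data-generating process.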
7842: Indirect Alignment and Relationship Preservation for Domain Generalization
Authors: Wei Wei, Zixiong Li, Jing Yan, Mingwen Shao, Lin Li
Location: Guangzhou | Day: TBD
Domain generalization (DG) aims to train models on multiple source domains to generalize effectively to unseen target domains, addressing performance degradation caused by domain shifts. Many existing methods rely on direct feature alignment, which disrupts natural sequence relationships, causes misalignment and feature distortion, and leads to overfitting, especially with significant domain gaps. To tackle these issues, we propose a novel DG approach with two key modules: the Sample Difference Keeping (SDK) module, which preserves natural sequence relationships to enhance feature diversity and separability, and the Sample Consistency Alignment (SCA) module, which achieves indirect alignment by modeling inter-class and inter-domain relationship consistencies. This approach mitigates overfitting and misalignment, ensuring adaptability to significant domain gaps. Extensive experiments demonstrate that our framework consistently outperforms state-of-the-art methods.
7847: Maximum Entropy Softmax Policy Gradient via Entropy Advantage Estimation
Authors: Jean Seong Bjorn Choe, Jong-kook Kim
Location: Montreal | Day: August 21st | Time: 15:00 | Session: ML: Reinforcement Learning (2/2)
Entropy Regularisation is a widely adopted technique that enhances policy optimisation performance and stability. Maximum entropy reinforcement learning (MaxEnt RL) regularises policy evaluation by augmenting the objective with an entropy term, showing theoretical benefits in policy optimisation. However, its practical application in straightforward direct policy gradient settings remains surprisingly underexplored. We hypothesise that this is due to the difficulty of managing the entropy reward in practice. This paper proposes Entropy Advantage Policy Optimisation (EAPO), a simple method that facilitates MaxEnt RL implementation by separately estimating task and entropy objectives. Our empirical evaluations demonstrate that extending Proximal Policy Optimisation (PPO) and Trust Region Policy Optimisation (TRPO) within the MaxEnt framework improves optimisation performance, generalisation, and exploration in various environments. Moreover, our method provides a stable and performant MaxEnt RL algorithm for discrete action spaces.
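As a rough illustration of estimating task and entropy objectives separately, the sketch below computes two return streams and combines them into one advantage. The function names, the separate discount `gamma_ent`, the temperature `tau`, and the use of plain discounted returns minus baselines are all assumptions made for this example; the abstract does not specify EAPO's estimator.

```python
def discounted_returns(rewards, gamma):
    """Discounted return-to-go for each time step."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

def combined_advantage(task_rewards, log_probs, task_values, entropy_values,
                       gamma=0.99, gamma_ent=0.9, tau=0.01):
    """Task advantage plus tau times a separately estimated entropy advantage."""
    # Entropy "reward" at each step: negative log-probability of the chosen action.
    entropy_rewards = [-lp for lp in log_probs]
    task_ret = discounted_returns(task_rewards, gamma)
    ent_ret = discounted_returns(entropy_rewards, gamma_ent)
    return [
        (g - v) + tau * (h - u)
        for g, v, h, u in zip(task_ret, task_values, ent_ret, entropy_values)
    ]
```

The combined advantage could then be plugged into a PPO- or TRPO-style surrogate objective in place of the usual task-only advantage.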
7876: Federated Deconfounding and Debiasing Learning for Out-of-Distribution Generalization
Authors: Zhuang Qi, Sijin Zhou, Lei Meng, Han Hu, Han Yu, Xiangxu Meng
Location: Guangzhou | Day: TBD
Attribute bias in federated learning (FL) typically leads local models to optimize inconsistently due to the learning of non-causal associations, resulting in degraded performance. Existing methods either use data augmentation for increasing sample diversity or knowledge distillation for learning invariant representations to address this problem. However, they lack a comprehensive analysis of the inference paths, and the interference from confounding factors limits their performance. To address these limitations, we propose the Federated Deconfounding and Debiasing Learning (FedDDL) method. It constructs a structured causal graph to analyze the model inference process, and performs backdoor adjustment to eliminate confounding paths. Specifically, we design an intra-client deconfounding learning module for computer vision tasks to decouple background and objects, generating counterfactual samples that establish a connection between the background and any label, which stops the model from using the background to infer the label. Moreover, we design an inter-client debiasing learning module to construct causal prototypes to reduce the proportion of the background in prototype components. Notably, it bridges the gap between heterogeneous representations via causal prototypical regularization. Extensive experiments on two benchmark datasets demonstrate that FedDDL significantly enhances the model’s capability to focus on main objects in unseen data, leading to 4.5% higher Top-1 Accuracy on average over nine state-of-the-art existing methods.
7881: Can We Translate Code Better with LLMs and Call Graph Analysis?
Authors: Yang Luo
Location: Guangzhou | Day: TBD
This paper proposes an innovative code translation method aimed at addressing the accuracy issues encountered by large language models (LLMs) in translating code of complex large-scale software projects. The method utilizes the Language Server Protocol to obtain the call graph of the entire codebase, and optimizes the input prompt of the LLM accordingly, significantly improving the correctness of translation at the compilation stage. Moreover, this method introduces the bridged debuggers technique based on the Debug Adapter Protocol and dynamic test case generation, effectively fixing runtime errors. Experiments on multiple mainstream datasets demonstrate that, compared to existing code translation methods and LLMs, this method achieves a significant improvement in translation accuracy.
7888: Approximate Verification of Strategic Abilities under Imperfect Information Using Local Models
Authors: Damian Kurpiewski, Wojciech Jamroga, Yan Kim
Location: Montreal | Day: August 21st | Time: 10:00 | Session: MAS: Formal verification, validation and synthesis
Verification of strategic ability under imperfect information is challenging, with complexity ranging from NP-complete to undecidable. This is partly because traditional fixpoint equivalences fail in this setting. Some years ago, an interesting idea of fixpoint approximation was proposed for model checking of ATL_ir, i.e., the logic of strategic ability for agents with imperfect information and imperfect recall.
In this paper, we propose a new variant of the approximation that uses the agent’s local model rather than the global model of the system. We prove the correctness of the construction and demonstrate its effectiveness through experimental results on scalable models of voting.
7893: DepthART: Monocular Depth Estimation as Autoregressive Refinement Task
Authors: Bulat Gabdullin, Nina Konovalova, Nikolay Patakin, Dmitry Senushkin, Anton Konushin
Location: Guangzhou | Day: TBD
Show Abstract
Monocular depth estimation has seen significant advances through discriminative approaches, yet their performance remains constrained by the limitations of training datasets. While generative approaches have addressed this challenge by leveraging priors from internet-scale datasets, with recent studies showing state-of-the-art results using fine-tuned text-to-image diffusion models, there is still room for improvement. Notably, autoregressive generative approaches, particularly Visual AutoRegressive modeling, have demonstrated superior results compared to diffusion models in conditioned image synthesis, while offering faster inference times.
In this work, we apply Visual Autoregressive Transformer (VAR) to the monocular depth estimation problem. However, the conventional GPT-2-style training procedure (teacher forcing) inherited by VAR yields suboptimal results for depth estimation. To address this limitation, we introduce DepthART – a novel training method formulated as a Depth Autoregressive Refinement Task. Unlike traditional VAR training with static inputs and targets, our method implements a dynamic target formulation based on model outputs, enabling self-refinement. By utilizing the model’s own predictions as inputs instead of ground truth token maps during training, we frame the objective as residual minimization, effectively reducing the discrepancy between training and inference procedures.
Our experimental results demonstrate that the proposed training approach significantly enhances the performance of VAR in depth estimation tasks. When trained on the Hypersim dataset using our approach, the model achieves superior results across multiple unseen benchmarks compared to existing generative and discriminative baselines.
7895: Abstraction Heuristics for Classical Planning Tasks with Conditional Effects
Authors: Martín Pozo, Jendrik Seipp
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Planning and Scheduling (2/5)
In planning tasks, conditional effects model action outcomes that depend on the current state of the world. Conditional effects are a crucial modeling feature since compiling them away can cause an exponential growth in task size. However, only a few admissible heuristics support them. To add abstraction heuristics to this set, we show how to compute projections, Cartesian abstractions and merge-and-shrink abstractions for tasks with conditional effects. Our experiments show that these heuristics are competitive with, and often surpass, the state-of-the-art for conditional-effect tasks.
7903: Wave-driven Graph Neural Networks with Energy Dynamics for Over-smoothing Mitigation
Authors: Peihan Wu, Hongda Qi, Sirong Huang, Dongdong An, Jie Lian, Qin Zhao
Location: Guangzhou | Day: TBD
Over-smoothing is a persistent challenge in Graph Neural Networks (GNNs), where node embeddings become indistinguishable as network depth increases, fundamentally limiting their effectiveness on tasks requiring fine-grained distinctions. This issue arises from the reliance on diffusion-based propagation mechanisms, which suppress high-frequency information essential for preserving feature diversity. To mitigate this, we propose a wave-driven GNN framework that redefines feature propagation through the wave equation. Unlike diffusion, the wave equation incorporates second-order dynamics, balancing smoothing with oscillatory behavior to retain high-frequency components while ensuring effective information flow. To enhance the stability and convergence of wave equation discretization on graphs, an energy-based mechanism inspired by kinetic and potential energy dynamics is introduced, balancing temporal evolution and structural alignment to stabilize propagation. Extensive experiments on benchmark datasets, including Cora, Citeseer, and PubMed, as well as real-world graphs, demonstrate that the proposed framework achieves state-of-the-art performance, effectively mitigating over-smoothing and enabling deeper, more expressive architectures. The code is available at https://github.com/rene0329/EWGNN/.
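The contrast between first-order diffusion and second-order wave dynamics on a graph can be seen in a small sketch. It uses a plain explicit finite-difference scheme on a three-node path graph. The function names, the step size, and the graph are assumptions for this example, and the energy-based stabilisation described in the abstract is not modelled.

```python
def laplacian_apply(adj, x):
    """Return L x for the combinatorial Laplacian L = D - A of an adjacency list."""
    return [len(nbrs) * x[i] - sum(x[j] for j in nbrs) for i, nbrs in enumerate(adj)]

def diffusion_steps(adj, x0, dt, steps):
    """Explicit Euler steps of dx/dt = -L x (first-order, purely smoothing)."""
    traj = [list(x0)]
    for _ in range(steps):
        x = traj[-1]
        lx = laplacian_apply(adj, x)
        traj.append([xi - dt * li for xi, li in zip(x, lx)])
    return traj

def wave_steps(adj, x0, dt, steps):
    """Central-difference steps of d2x/dt2 = -L x, started with zero velocity."""
    prev, cur = list(x0), list(x0)
    traj = [list(x0)]
    for _ in range(steps):
        lx = laplacian_apply(adj, cur)
        nxt = [2 * c - p - dt * dt * l for c, p, l in zip(cur, prev, lx)]
        prev, cur = cur, nxt
        traj.append(nxt)
    return traj

def spread(x):
    """Gap between the largest and smallest node feature."""
    return max(x) - min(x)

path = [[1], [0, 2], [1]]   # path graph 0 - 1 - 2
x0 = [1.0, 0.0, -1.0]       # one scalar feature per node
diff_traj = diffusion_steps(path, x0, dt=0.1, steps=50)
wave_traj = wave_steps(path, x0, dt=0.1, steps=50)
```

Under diffusion the node features collapse towards a common value, while under the wave update the same initial signal keeps oscillating, so the nodes remain distinguishable.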
7930: A Fast Neural Architecture Search Method for Multi-Modal Classification via Knowledge Sharing
Authors: Zhihua Cui, Shiwu Sun, Qian Guo, Xinyan Liang, Yuhua Qian, Zhixia Zhang
Location: Guangzhou | Day: TBD
Neural architecture search-based multi-modal classification (NAS-MMC) aims to automatically find optimal network structures for improving multi-modal classification performance. However, most current NAS-MMC methods are quite time-consuming during the training process. In this paper, we propose a knowledge sharing-based neural architecture search (KS-NAS) method for multi-modal classification. KS-NAS optimizes the search process by introducing a dynamically updated knowledge base to reduce the consumption of computational resources. Specifically, during the deep evolutionary search, individuals in the initial population acquire initial parameters from a knowledge base, and then undergo training and optimization until convergence is reached, avoiding the need for training from scratch. The knowledge base is dynamically updated by aggregating the parameters of high-quality individuals trained within the population, thus progressively improving the quality of the knowledge base. As the population evolves, the knowledge base continues to improve, ensuring that subsequent individuals can obtain higher-quality initialization parameters, which significantly accelerates the training of the population. Experimental results show that the KS-NAS method achieves state-of-the-art results in terms of classification performance and training efficiency across multiple popular multi-modal tasks.
7934: Handling Infinite Domain Parameters in Planning Through Best-First Search with Delayed Partial Expansions
Authors: Ángel Aso-Mollar, Diego Aineto, Enrico Scala, Eva Onaindia
Location: Montreal | Day: August 19th | Time: 15:00 | Session: Planning and Scheduling (2/5)
In automated planning, control parameters extend standard action representations through the introduction of continuous numeric decision variables. Existing state-of-the-art approaches have primarily handled control parameters as embedded constraints alongside other temporal and numeric restrictions, and thus have implicitly treated them as additional constraints rather than as decision points in the search space. In this paper, we propose an efficient alternative that explicitly handles control parameters as true decision points within a systematic search scheme. We develop a best-first, heuristic search algorithm that operates over infinite decision spaces defined by control parameters and prove a notion of completeness in the limit under certain conditions. Our algorithm leverages the concept of delayed partial expansion, where a state is not fully expanded but instead incrementally expands a subset of its successors. Our results demonstrate that this novel search algorithm is a competitive alternative to existing approaches for solving planning problems involving control parameters.
7957: DGL: Dynamic Global-Local Information Aggregation for Scalable VRP Generalization with Self-Improvement Learning
Authors: Yubin Xiao, Yuesong Wu, Rui Cao, Di Wang, Zhiguang Cao, Xuan Wu, Peng Zhao, Yuanshu Li, You Zhou, Yuan Jiang
Location: Guangzhou | Day: TBD
The Vehicle Routing Problem (VRP) is a critical combinatorial optimization problem with wide-reaching real-world applications, particularly in logistics and transportation. While neural network-based VRP solvers have shown impressive results on test instances similar to training data, their performance often degrades when faced with varying scales and unseen distributions, limiting their practical applicability. To overcome these limitations, we introduce DGL (Dynamic Global-Local Information Aggregation), a novel model that combines global and local information to effectively solve VRPs. DGL dynamically adjusts local node selections within a localized range, capturing local invariance across problems of different scales and distributions, thereby enhancing generalization. At the same time, DGL integrates global context into the decision-making process, providing richer information for more informed decisions. Additionally, we propose a replacement-based self-improvement learning framework that leverages data augmentation and random replacement techniques, further enhancing DGL’s robustness. Extensive experiments on synthetic datasets, benchmark datasets, and real-world country map instances demonstrate that DGL achieves state-of-the-art performance, particularly in generalizing to large-scale VRPs and real-world scenarios. These results showcase DGL’s effectiveness in solving complex, realistic optimization challenges and highlight its potential for practical applications.
7964: Improving Generalization in Meta-Learning via Meta-Gradient Augmentation
Authors: Ren Wang, Haoliang Sun, Yuxiu Lin, Xinxin Zhang, Yilong Yin
Location: Guangzhou | Day: TBD
Meta-learning methods typically follow a two-loop framework, where each loop potentially suffers from notorious overfitting, hindering rapid adaptation and generalization to new tasks. Existing methods address this by enhancing the mutual-exclusivity or diversity of training samples, but these data manipulation strategies are data-dependent and insufficiently flexible. This work proposes a data-independent Meta-Gradient Augmentation (MGAug) method from the perspective of gradient regularization. The key idea is first to break the rote memories by network pruning to address memorization overfitting in the inner loop, then use the gradients of pruned sub-networks to augment meta-gradients, alleviating overfitting in the outer loop. Specifically, we explore three pruning strategies, including random width pruning, random parameter pruning, and a newly proposed catfish pruning that measures a Meta-Memorization Carrying Amount (MMCA) score for each parameter and prunes high-score ones to break rote memories. The proposed MGAug is theoretically guaranteed by the generalization bound from the PAC-Bayes framework. Extensive experiments on multiple few-shot learning benchmarks validate MGAug’s effectiveness and significant improvement over various meta-baselines.
8000: DM-POSA: Enhancing Open-World Test-Time Adaptation with Dual-Mode Matching and Prompt-Based Open Set Adaptation
Authors: Shiji Zhao, Shao-Yuan Li, Chuanxing Geng, Sheng-Jun Huang, Songcan Chen
Location: Guangzhou | Day: TBD
The need to generalize pre-trained deep learning models to unknown test-time data distributions has spurred research into test-time adaptation (TTA). Existing studies have mainly focused on closed-set TTA with only covariate shifts, while largely overlooking open-set TTA that involves semantic shifts, i.e., unknown open-set classes. However, addressing adaptation to unknown classes is crucial for open-world safety-critical applications such as autonomous driving. In this paper, we emphasize that accurate identification of open-set samples is rather challenging in TTA. The entanglement of semantic shift and covariate shift confuses the network’s discriminative capability, and this co-interference is further exacerbated by the single-pass nature of the data and low-latency requirements. With this understanding, we propose Dual-mode Matching and Prompt-based Open Set Adaptation (DM-POSA) for open-set TTA to enhance discriminative feature learning and the distinction of unknown classes with minimal time cost. DM-POSA identifies open-set samples via dual-mode matching strategies, including model-parameter-based and feature-space-based matching. It also optimizes the model with a random pairing discrepancy loss, enhancing the distributional difference between open-set and closed-set samples, thus improving the model’s ability to recognize unknown categories. Extensive experiments show the superiority of DM-POSA over state-of-the-art baselines on both closed-set class adaptation and open-set class detection.
8016: A Datalog Rewriting Algorithm for Warded Ontologies
Authors: Davide Benedetto, Marco Calautti, Hebatalla Hammad, Emanuel Sallinger, Adriano Vlad-Starrabba
Location: Montreal | Day: August 22nd | Time: 10:00 | Session: KR: ontologies
Existential rules, a.k.a. tuple-generating dependencies (TGDs), form a well-established formalism for specifying ontologies. In particular, the warded language is a well-behaved fragment of TGD-based ontologies, striking a good balance between expressive power and computational complexity of answering Ontology-Mediated Queries (OMQs). The theoretical foundations of answering OMQs over warded ontologies are by now well-understood, but to the best of our knowledge, very few efforts exist that exploit such a rich theory for building practical query answering algorithms. Our goal is to fill this gap by designing a novel Datalog rewriting algorithm for OMQs over warded ontologies that is amenable to practical implementations, as well as providing an implementation and an experimental evaluation, with the aim of understanding how key input parameters affect the performance of this approach, and what its limits are when combined with off-the-shelf Datalog-based engines.
8018: Parameterized Approximation Algorithm for Doubly Constrained Fair Clustering
Authors: Xiaoliang Wu, Qilong Feng, Junyu Huang, Jianxin Wang
Location: Guangzhou | Day: TBD
Fair clustering has recently received considerable attention, with numerous distinct fairness notions developed. Despite being well-justified, these fairness notions are frequently studied in isolation, leaving the need to explore how they can be combined. Building on prior work, we focus on doubly constrained fair clustering, which incorporates two widely adopted demographic representation fairness notions in clustering: group fairness and data summarization fairness. Both fairness notions extend the classical clustering formulation by associating each data point with a demographic label, where group fairness requires each cluster to proportionally reflect the population-level distribution of demographic groups, and data summarization fairness ensures that the chosen facilities maintain the population-level demographic representation of each group. In this paper, we study Fixed-Parameter Tractable (FPT) approximation algorithms for doubly constrained fair clustering under the k-median objective, referred to as Df-k-Med. Previous algorithms typically enumerate different demographic groups or construct fairness coresets, and are parameterized by both the number of opened facilities and the number of demographic labels. By further leveraging the local fairness information, we propose a color-agnostic structural method that obtains a parameterized result independent of the number of demographic labels while effectively handling the combination of both fairness constraints. Specifically, we design a constant-factor approximation for the Df-k-Med problem with fairness violation by one, which runs in FPT(k)-time, where k is the number of opened facilities.
8023: Towards Safer Pretraining: Analyzing and Filtering Harmful Content in Webscale Datasets for Responsible LLMs
Authors: Sai Krishna Mendu, Harish Yenala, Aditi Gulati, Shanu Kumar, Parag Agrawal
Location: Montreal | Day: August 20th | Time: 10:00 | Session: AI Ethics, Trust, Fairness (1/3)
Large language models (LLMs) have become integral to various real-world applications, leveraging massive, web-sourced datasets like Common Crawl, C4, and FineWeb for pretraining. While these datasets provide linguistic data essential for high-quality natural language generation, they often contain harmful content, such as hate speech, misinformation, and biased narratives. Training LLMs on such unfiltered data risks perpetuating toxic behaviors, spreading misinformation, and amplifying societal biases which can undermine trust in LLM-driven applications and raise ethical concerns about their use. This paper presents a large-scale analysis of inappropriate content across these datasets, offering a comprehensive taxonomy that categorizes harmful webpages into Topical and Toxic based on their intent. We also introduce a prompt evaluation dataset, a high-accuracy Topical and Toxic Prompt (TTP), and a transformer-based model (HarmFormer) for harmful content filtering. Additionally, we create a new multi-harm open-ended toxicity benchmark (HAVOC) and provide crucial insights into how models respond to adversarial toxic inputs. Our work offers insights into ensuring safer LLM pretraining and serves as a resource for Responsible AI (RAI) compliance.
Disclaimer: This paper includes potentially offensive content due to the nature of the research.
8041: Filling the Missings: Spatiotemporal Data Imputation by Conditional Diffusion
Authors: Wenying He, Jieling Huang, Junhua Gu, Ji Zhang, Yude Bai
Location: Guangzhou | Day: TBD
Missing data in spatiotemporal systems presents a significant challenge for modern applications, ranging from environmental monitoring to urban traffic management. The integrity of spatiotemporal data often deteriorates due to hardware malfunctions and software failures in real-world deployments. Current approaches based on machine learning and deep learning struggle to model the intricate interdependencies between spatial and temporal dimensions effectively and, more importantly, suffer from cumulative errors during the data imputation process, which propagate and amplify through iterations. To address these limitations, we propose CoFILL, a novel Conditional Diffusion Model for spatiotemporal data imputation. CoFILL builds on the inherent advantages of diffusion models to generate high-quality imputations without relying on potentially error-prone prior estimates. It incorporates an innovative dual-stream architecture that processes temporal and frequency domain features in parallel. By fusing these complementary features, CoFILL captures both rapid fluctuations and underlying patterns in the data, which enables more robust imputation. The extensive experiments demonstrate that CoFILL’s noise prediction network successfully transforms random noise into meaningful values that align with the true data distribution. The results also show that CoFILL outperforms state-of-the-art methods in terms of imputation accuracy. The source code is publicly available at https://github.com/joyHJL/CoFILL.
8054: Towards VLM-based Hybrid Explainable Prompt Enhancement for Zero-Shot Industrial Anomaly Detection
Authors: Weichao Cai, Weiliang Huang, Yunkang Cao, Chao Huang, Fei Yuan, Bob Zhang, Jie Wen
Location: Guangzhou | Day: TBD
Zero-Shot Industrial Anomaly Detection (ZSIAD) aims to identify and localize anomalies in industrial images from unseen categories. Owing to their powerful generalization capabilities, Vision-Language Models (VLMs) have attracted growing interest in ZSIAD. To guide the model toward understanding and localizing semantically complex industrial anomalies, existing VLM-based methods have attempted to provide additional prompts to the model through learnable text prompt templates. However, these zero-shot methods lack detailed descriptions of specific anomalies, making it difficult to classify and segment the diverse range of industrial anomalies accurately. To address this issue, we first propose a multi-stage prompt generation agent for ZSIAD. Specifically, we leverage a Multi-modal Large Language Model (MLLM) to articulate the detailed differential information between normal and test samples, which can provide detailed text prompts to the model through further refinement and an anti-false-alarm constraint. Moreover, we introduce a Visual Fundamental Model (VFM) to generate anomaly-related attention prompts for more accurate localization of anomalies with varying sizes and shapes. Extensive experiments on seven real-world industrial anomaly detection datasets show that the proposed method not only outperforms recent SOTA methods but also, through its explainable prompts, provides the model with a more intuitive basis for anomaly identification.
8064: DGExplainer: Explaining Dynamic Graph Neural Networks via Relevance Back-propagation
Authors: Yezi Liu, Jiaxuan Xie, Yanning Shen
Location: Montreal | Day: August 21st | Time: 15:00 | Session: DM: Graph Data Mining
Dynamic graph neural networks (dynamic GNNs) have demonstrated remarkable effectiveness in analyzing time-varying graph-structured data. However, their black-box nature often hinders users from understanding their predictions, which can limit their applications. In recent years, there has been a surge in research aimed at explaining GNNs, but most studies have focused on static graphs, leaving the explanation of dynamic GNNs relatively unexplored. Explaining dynamic GNNs presents a unique challenge due to their complex spatial and temporal structures. As a result, existing approaches designed for explaining static graphs are not directly applicable to dynamic graphs because they ignore temporal dependencies among graph snapshots. To address this issue, we propose DGExplainer, which offers a reliable explanation of dynamic GNN predictions. DGExplainer utilizes the relevance back-propagation technique both time-wise and layer-wise. Specifically, it incorporates temporal information by computing the relevance of node representations along the inverse of the time evolution. Additionally, for each time step, it calculates layer-wise relevance from a graph-based module by redistributing the relevance of node representations along the back-propagation path. Quantitative and qualitative experimental results on six real-world datasets demonstrate the effectiveness of DGExplainer in identifying important nodes for link prediction and node regression in dynamic GNNs.
8068: What Makes You Special? Contrastive Heuristics Based on Qualified Dominance
Authors: Rasmus G. Tollund, Kim G. Larsen, Alvaro Torralba
Location: Montreal | Day: August 21st | Time: 11:30 | Session: Planning and Scheduling (4/5)
In cost-optimal planning, dominance pruning methods discard states during the search that are dominated by others. However, the binary nature of pruning fails to exploit information when we cannot prove that a state is fully dominated. To this end, we introduce qualified dominance, an automatic method that, given a pair of states s, t, synthesizes a finite state automaton that represents a language of plans from s that are dominated by t. This not only explains why s cannot be pruned, but can also be used to improve the heuristic function that guides the search. This results in a new type of heuristic, which we call contrastive heuristics, that depends on the search performed so far. We provide the theoretical foundation for showing that contrastive heuristics can be used to find optimal plans even when their more informative estimates are not admissible.
8088: Maximin Share Guarantees for Few Agents with Subadditive Valuations
Authors: George Christodoulou, Vasilis Christoforidis, Symeon Mastrakoulis, Alkmini Sgouritsa
Location: Montreal | Day: August 21st | Time: 11:30 | Session: GTEP: Fair division
We study the problem of fairly allocating a set of indivisible items among a set of agents. We consider the notion of (approximate) maximin share (MMS) and we provide an improved lower bound of 1/2 (which is tight) for the case of subadditive valuations when the number of agents is at most four. We also provide a tight lower bound for the case of multiple agents, when they are equipped with one of two possible types of valuations. Moreover, we propose a new model that extends previously studied models in the area of fair division, which will hopefully give rise to further research. We demonstrate the usefulness of this model by employing it as a technical tool to derive our main result, and we provide a thorough analysis for this model for the case of three agents. Finally, we provide an improved impossibility result for the case of three submodular agents.
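For readers unfamiliar with the maximin share, the following brute-force sketch computes it for a tiny instance: an agent's MMS is the best worst-bundle value it can secure by partitioning the items into as many bundles as there are agents. The function name and the example valuation are assumptions for illustration. The valuation used is additive (a special case of subadditive), and the enumeration is exponential, so this is only suitable for toy inputs.

```python
from itertools import product

def maximin_share(items, n_agents, value):
    """Max over all partitions into n_agents bundles of the minimum bundle value."""
    best = None
    for assignment in product(range(n_agents), repeat=len(items)):
        bundles = [[] for _ in range(n_agents)]
        for item, b in zip(items, assignment):
            bundles[b].append(item)
        worst = min(value(frozenset(b)) for b in bundles)
        if best is None or worst > best:
            best = worst
    return best

# Hypothetical additive valuation over four indivisible items.
weights = {"a": 4, "b": 3, "c": 2, "d": 1}

def additive_value(bundle):
    return sum(weights[i] for i in bundle)

mms_two_agents = maximin_share(list(weights), 2, additive_value)    # 5: {a, d} vs {b, c}
mms_three_agents = maximin_share(list(weights), 3, additive_value)  # 3: {a}, {b}, {c, d}
```

An allocation is then an α-approximate MMS allocation if every agent receives a bundle worth at least α times its own maximin share.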
8098: Formal Synthesis of Safe Kolmogorov-Arnold Network Controllers with Barrier Certificates
Authors: Xiongqi Zhang, Ning Lv, Wang Lin, Zuohua Ding
Location: Guangzhou | Day: TBD
Control barrier certificate generation is an efficient and powerful technique for the safe control of cyber-physical systems. Feed-forward neural networks (FNNs) are commonly used to synthesize control barrier certificates and safe controllers, but they struggle to effectively address the challenges posed by high-dimensional complex systems. In this paper, we propose a novel method for generating control barrier certificates and controllers using Kolmogorov-Arnold Networks (KANs). Specifically, it utilizes KANs to replace FNNs as the template of control barrier certificates and controllers. Since KANs have learnable activation functions, they can efficiently improve the representation power. Then, it leverages the pruning and symbolization properties of KANs, which significantly simplify the network structure, allowing for more efficient formal verification of the simplified candidate KAN control barrier certificates and controllers using Satisfiability Modulo Theories. We implement the tool KAN4CBC, and evaluate its performance over a set of benchmarks. The experimental results demonstrate that our method addresses the issue of system dimension expansion and improves solution efficiency.
8107: LPDetective: Dusting the LLM Chats for Prompt Template Abusers
Authors: Yang Luo, Qingni Shen, Zhonghai Wu
Location: Guangzhou | Day: TBD
The abuse of LLM Chatbot interfaces by web robots leads to a significant waste of GPU and server resources, posing a serious security challenge. To address this issue, we propose LPDetective, an unsupervised method for detecting robot prompt templates. This method is based on the assumption that robot-generated text repeatedly uses the same or highly similar phrases and sentence structures across multiple sessions, differing from human natural conversations. We design a multi-stage workflow, including message grouping, text similarity measurement, hierarchical clustering analysis, and regular expression extraction, to automatically extract potential robot behavior patterns from chat logs. LPDetective does not require predefined templates or rely on training data, enabling it to adaptively discover new, unknown patterns. We conduct systematic experiments on three large-scale real-world datasets: Bing Copilot, Wildchat, and ChatLog. The results show that LPDetective can efficiently and accurately detect robot prompt templates in various scenarios, achieving a 7.5% improvement in F1 score compared to the state-of-the-art XLNet method and reducing detection latency by 178 times on the Bing Copilot dataset.
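The later stages of such a workflow (similarity measurement, clustering, regular expression extraction) can be sketched as follows. This is a simplified stand-in under stated assumptions, not the LPDetective implementation: it uses token-level Jaccard similarity, single-linkage clustering cut at a fixed threshold, and a prefix/suffix template, and it omits the message grouping stage. All function names, the threshold, and the sample chat log are hypothetical.

```python
import re

def jaccard(a, b):
    """Token-set Jaccard similarity between two messages."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def cluster(messages, threshold):
    """Single-linkage clustering cut at `threshold`, implemented with union-find."""
    parent = list(range(len(messages)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(messages)):
        for j in range(i + 1, len(messages)):
            if jaccard(messages[i], messages[j]) >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i, message in enumerate(messages):
        groups.setdefault(find(i), []).append(message)
    return list(groups.values())

def common_prefix(token_lists):
    """Longest run of leading tokens shared by every list."""
    shared = []
    for column in zip(*token_lists):
        if len(set(column)) != 1:
            break
        shared.append(column[0])
    return shared

def template_regex(group):
    """Build a regex from the shared leading and trailing tokens of a cluster."""
    tokens = [m.split() for m in group]
    prefix = common_prefix(tokens)
    room = min(len(t) for t in tokens) - len(prefix)  # avoid prefix/suffix overlap
    suffix = common_prefix([t[::-1] for t in tokens])[:room][::-1]
    parts = []
    if prefix:
        parts.append(re.escape(" ".join(prefix)))
    if any(len(t) > len(prefix) + len(suffix) for t in tokens):
        parts.append(r".+")  # variable slot filled in by the robot
    if suffix:
        parts.append(re.escape(" ".join(suffix)))
    return r"\s+".join(parts)

def find_templates(messages, threshold=0.5, min_size=2):
    """Return one regex per cluster that contains at least `min_size` messages."""
    return [template_regex(g) for g in cluster(messages, threshold) if len(g) >= min_size]

# Hypothetical chat log: three templated prompts and two ordinary messages.
chat_log = [
    "Translate the following text to French: good morning",
    "what is the weather like in Paris today",
    "Translate the following text to French: see you later",
    "can you help me debug this python script",
    "Translate the following text to French: thank you",
]
print(find_templates(chat_log))
```

The pairwise comparison is quadratic in the number of messages, so a sketch like this only suits small logs; a real system would need a scalable similarity search.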
8147: Understanding Matters: Semantic-Structural Determined Visual Relocalization for Large Scenes
Authors: Jingyi Nie, Liangliang Cai, Qichuan Geng, Zhong Zhou
Location: Guangzhou | Day: TBD
Scene Coordinate Regression (SCR) estimates 3D scene coordinates from 2D images, and has become an important approach in visual relocalization. Existing methods exhibit high localization accuracy in small scenes, but still face substantial challenges in large-scale scenes, which usually have significant variations in depth, scale, and occlusion. Although structure-guided scene partitioning is commonly adopted, the over-partitioned elements and large feature variances within subscenes impede the estimation of the 3D coordinates, introducing misleading information for subsequent processing. To address these issues, we propose the Semantic-Structural Determined Visual Relocalization method for SCR, which leverages semantic-structural partition learning and partition-determined pose refinement to better understand the semantic and structural information of large scenes. First, we partition the scene into small subscenes with label assignments, ensuring semantic consistency and structural continuity within each subscene. A classifier is then trained with sampling-based learning to predict these labels. Second, the partition predictions are encoded into embeddings and integrated with local features for intra-class compactness and inter-class separation, producing partition-aware features. To further decrease feature variances, we employ a discriminability metric and suppress ambiguous points, improving subsequent computations. Experimental results on the Cambridge Landmarks dataset demonstrate that the proposed method achieves significant improvements with lower training cost on large-scale scenes, reducing the median error by 38% compared to the state-of-the-art SCR method DSAC*. Code is available at https://gitee.com/VR_NAVE/ss-dvr.
8176: Q-Detection: A Quantum-Classical Hybrid Poisoning Attack Detection Method
Authors: Haoqi He, Xiaokai Lin, Jiancai Chen, Yan Xiao
Location: Guangzhou | Day: TBD
Data poisoning attacks pose significant threats to machine learning models by introducing malicious data into the training process, thereby degrading model performance or manipulating predictions. Detecting and sifting out poisoned data is an important method to prevent data poisoning attacks. Limited by classical computation frameworks, upcoming larger-scale and more complex datasets may pose difficulties for detection. We are the first to apply the speedup of quantum computing to the task of detecting data poisoning. We present Q-Detection, a quantum-classical hybrid defense method for detecting poisoning attacks. Q-Detection also introduces the Quantum Weight-Assigning Network, which is optimized using quantum computing devices. Experimental results using multiple quantum simulation libraries show that Q-Detection effectively defends against label manipulation and backdoor attacks. The metrics demonstrate that Q-Detection consistently outperforms the baseline methods and is comparable to the state-of-the-art. Theoretical analysis shows that Q-Detection is expected to achieve more than a 20% speedup using quantum computing power.
8227: Evolvable Conditional Diffusion
Authors: Zhao Wei, Chin Chun Ooi, Abhishek Gupta, Jian Cheng Wong, Pao-Hsiung Chiu, Sheares Xue Wen Toh, Yew-Soon Ong
Location: Montreal | Day: August 19th | Time: 11:30 | Session: ML: Diffusion Models
This paper presents an evolvable conditional diffusion method such that black-box, non-differentiable multi-physics models, as are common in domains like computational fluid dynamics and electromagnetics, can be effectively used for guiding the generative process to facilitate autonomous scientific discovery. We formulate the guidance as an optimization problem where one optimizes for a desired fitness function through updates to the descriptive statistic for the denoising distribution, and derive an evolution-guided approach from first principles through the lens of probabilistic evolution. Interestingly, the final derived update algorithm is analogous to the update as per common gradient-based guided diffusion models, but without ever having to compute any derivatives. We validate our proposed evolvable diffusion algorithm in two AI for Science scenarios: the automated design of fluidic topology and meta-surface. Results demonstrate that this method effectively generates designs that better satisfy specific optimization objectives without reliance on differentiable proxies, providing an effective means of guidance-based diffusion that can capitalize on the wealth of black-box, non-differentiable multi-physics numerical models common across Science.
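The idea of steering a distribution's mean with a black-box fitness function, without computing derivatives, can be illustrated with a generic evolution-strategy-style update. This is only a sketch under stated assumptions and is not the update derived in the paper: it perturbs the mean of a Gaussian, scores each candidate with the fitness function, and moves the mean to the softmax-weighted average. The function name `evolve_mean`, the parameters `sigma`, `population`, and `temperature`, and the toy quadratic fitness are all hypothetical.

```python
import math
import random

def evolve_mean(mean, fitness, sigma=0.5, population=64, temperature=1.0, rng=random):
    """One gradient-free update of a Gaussian mean using fitness-weighted samples."""
    # Sample candidates around the current mean.
    candidates = [[m + rng.gauss(0.0, sigma) for m in mean] for _ in range(population)]
    # The fitness function is only evaluated, never differentiated.
    scores = [fitness(c) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    # New mean is the fitness-weighted average of the candidates.
    return [
        sum(w * c[d] for w, c in zip(weights, candidates)) / total
        for d in range(len(mean))
    ]

# Toy black-box objective: negative squared distance to a hidden target.
TARGET = [3.0, -2.0]

def black_box_fitness(x):
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET))

rng = random.Random(0)
mean = [0.0, 0.0]
for _ in range(50):
    mean = evolve_mean(mean, black_box_fitness, rng=rng)
print(mean)
```

In a guided diffusion setting, an update of this kind would be applied to the statistic of each denoising step; here it is shown in isolation on a two-dimensional toy problem.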