Special Track on AI, the Arts and Creativity Papers

386: Synthesizing Composite Hierarchical Structure from Symbolic Music Corpora
Authors: Ilana Shapiro, Ruanqianqian (Lisa) Huang, Zachary Novack, Cheng-i Wang, Hao-Wen Dong, Taylor Berg-Kirkpatrick, Shlomo Dubnov, Sorin Lerner
Location: Montreal | Day: August 20th | Time: 14:00 | Session: AI and Arts (2/2)
Western music is an innately hierarchical system of interacting levels of structure, from fine-grained melody to high-level form. In order to analyze music compositions holistically and at multiple granularities, we propose a unified, hierarchical meta-representation of musical structure called the structural temporal graph (STG). For a single piece, the STG is a data structure that defines a hierarchy of progressively finer structural musical features and the temporal relationships between them. We use the STG to enable a novel approach for deriving a representative structural summary of a music corpus, which we formalize as a dually NP-hard combinatorial optimization problem. Our approach first applies simulated annealing to develop a measure of structural distance between two music pieces rooted in graph isomorphism. Our approach then combines the formal guarantees of SMT solvers with nested simulated annealing over structural distances to produce a structurally sound, representative centroid STG for an entire corpus of STGs from individual pieces. To evaluate our approach, we conduct experiments verifying that structural distance accurately differentiates between music pieces, and that derived centroids accurately structurally characterize their corpora.
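As a rough illustration of the kind of data structure and search the abstract describes, the sketch below defines a minimal hierarchical graph of timed structural features and a generic simulated-annealing loop of the sort that could drive a structural-distance search. All names and fields are illustrative assumptions, not the authors' STG implementation.

```python
# Hypothetical sketch: a structural temporal graph as levels of timed feature
# nodes, plus a generic simulated-annealing loop for minimizing some cost
# (e.g. a graph-alignment-based structural distance, not implemented here).
import math
import random
from dataclasses import dataclass, field

@dataclass
class STGNode:
    level: int     # 0 = coarsest (form), larger = finer (e.g. melody)
    label: str     # structural feature label at this level
    span: tuple    # (start_beat, end_beat) covered by the feature

@dataclass
class STG:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (parent_idx, child_idx) across levels

def anneal(initial_state, neighbour, cost, steps=5000, t0=1.0, t_min=1e-3):
    """Generic simulated annealing: accept worse states with Boltzmann probability."""
    state, best = initial_state, initial_state
    for k in range(steps):
        t = max(t_min, t0 * (1 - k / steps))
        cand = neighbour(state)
        delta = cost(cand) - cost(state)
        if delta < 0 or random.random() < math.exp(-delta / t):
            state = cand
        if cost(state) < cost(best):
            best = state
    return best

# Tiny example graph: one form-level node with one finer child node.
piece = STG(nodes=[STGNode(0, "A-section", (0, 32)), STGNode(1, "motif-a", (0, 4))],
            edges=[(0, 1)])
```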
2855: SynthRL: Cross-domain Synthesizer Sound Matching via Reinforcement Learning
Authors: Wonchul Shin, Kyogu Lee
Location: Montreal | Day: August 20th | Time: 14:00 | Session: AI and Arts (2/2)
Generalization of synthesizer sound matching to external instrument sounds is highly challenging due to the non-differentiability of the sound synthesis process, which prohibits the use of out-of-domain sounds for training with a synthesis parameter loss. We propose SynthRL, a novel reinforcement learning (RL)-based approach for cross-domain synthesizer sound matching. By incorporating sound similarity into the reward function, SynthRL effectively optimizes synthesis parameters without ground-truth labels, allowing fine-tuning on out-of-domain sounds. Furthermore, we introduce a transformer-based model architecture and reward-based prioritized experience replay to enhance RL training efficiency, considering the unique characteristics of the task. Experimental results demonstrate that SynthRL outperforms state-of-the-art methods on both in-domain and out-of-domain tasks. Further experimental analysis validates the effectiveness of our reward design, showing a strong correlation with human perception of sound similarity.
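A minimal sketch of a similarity-based reward of the kind described, assuming the rendered and target audio have the same length: the candidate parameters are scored by spectral closeness to the target, so no ground-truth parameters are needed. The `render` callable and the log-magnitude distance are placeholders, not the paper's reward.

```python
# Hedged sketch: reward = negative log-spectral distance between the sound
# rendered from candidate parameters and the target recording.
import numpy as np

def log_spectrum(audio, n_fft=1024, hop=256):
    frames = [audio[i:i + n_fft] for i in range(0, len(audio) - n_fft, hop)]
    mags = np.abs(np.fft.rfft(np.stack(frames) * np.hanning(n_fft), axis=1))
    return np.log1p(mags)

def similarity_reward(render, params, target_audio):
    """Higher reward = candidate sound is spectrally closer to the target.
    Assumes render(params) returns audio of the same length as target_audio."""
    candidate = render(params)   # hypothetical, non-differentiable synth call
    return -float(np.mean(np.abs(log_spectrum(candidate) - log_spectrum(target_audio))))
```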
8369: ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning
Authors: Zhongjie Duan, Hong Zhang, Wenmeng Zhou, Cen Chen, Yaliang Li, Yu Zhang, Yingda Chen
Location: Montreal | Day: August 19th | Time: 15:00 | Session: CV: Diffusion models
Recently, advancements in video synthesis have attracted significant attention. Video synthesis models have demonstrated the practical applicability of diffusion models in creating dynamic visual content. Despite these advancements, the extension of video lengths remains constrained by computational resources. Most existing video synthesis models are limited to generating short video clips. In this paper, we propose a novel post-tuning methodology for video synthesis models, called ExVideo. This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations while incurring lower training expenditures. In particular, we design extension strategies for common temporal model components, including 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of our proposed post-tuning approach, we trained ExSVD, an extended model based on the Stable Video Diffusion model. Our approach enhances the model’s capacity to generate up to 5x its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos. Importantly, the substantial increase in video length does not compromise the model’s innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We have released the source code and the enhanced model publicly.
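One plausible reading of the positional-embedding extension strategy is illustrated below: a learned temporal embedding table is stretched to a longer frame count by linear interpolation so the post-tuned model can cover more frames. This is a generic sketch under that assumption, not the released ExVideo code.

```python
# Illustrative only: interpolate a learned temporal positional-embedding table
# from its original number of frames to a longer target length.
import numpy as np

def extend_temporal_embedding(pos_emb, target_frames):
    """pos_emb: (orig_frames, dim) learned table -> (target_frames, dim)."""
    orig_frames, dim = pos_emb.shape
    old_pos = np.linspace(0.0, 1.0, orig_frames)
    new_pos = np.linspace(0.0, 1.0, target_frames)
    return np.stack([np.interp(new_pos, old_pos, pos_emb[:, d]) for d in range(dim)], axis=1)

table = np.random.randn(25, 320)                 # hypothetical 25-frame table
longer = extend_temporal_embedding(table, 125)   # 5x longer: shape (125, 320)
```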
8370: FastBlend: Enhancing Video Stylization Consistency via Model-Free Patch Blending
Authors: Zhongjie Duan, Chengyu Wang, Cen Chen, Weining Qian, Jun Huang, Mingyi Jin
Location: Montreal | Day: August 21st | Time: 15:00 | Session: CV: videos
With the emergence of diffusion models and the rapid development of image processing, generating artistic images in style transfer tasks has become effortless. However, these impressive image processing approaches face consistency issues in video processing due to the independent processing of each frame. In this paper, we propose a powerful, model-free approach called FastBlend to address the consistency problem in video stylization. FastBlend functions as a post-processor and can be seamlessly integrated with diffusion models to create a robust video stylization pipeline. Based on a patch-matching algorithm, we remap and blend the aligned content across multiple frames, thus compensating for inconsistent content with neighboring frames. Moreover, we propose a tree-like data structure and a specialized loss function, aiming to optimize computational efficiency and visual quality for different application scenarios. Extensive experiments have demonstrated the effectiveness of FastBlend. Compared with both independent video deflickering algorithms and diffusion-based video processing methods, FastBlend is capable of synthesizing more coherent and realistic videos.
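A heavily simplified sketch of the blending idea, assuming the patch-matching alignment step is available as a separate function: neighbouring stylized frames are remapped into the current frame's coordinates and averaged to suppress flicker. `remap_to` is a hypothetical placeholder, and the tree structure and loss from the paper are omitted.

```python
# Sketch: average each frame with its aligned neighbours within a window.
import numpy as np

def blend_frames(stylized, remap_to, radius=2):
    """stylized: list of HxWx3 float arrays; remap_to(frame, target_index=i)
    is a hypothetical patch-matching remap into frame i's coordinates."""
    out = []
    for i in range(len(stylized)):
        lo, hi = max(0, i - radius), min(len(stylized), i + radius + 1)
        aligned = [remap_to(stylized[j], target_index=i) for j in range(lo, hi)]
        out.append(np.mean(aligned, axis=0))   # blended, temporally smoother frame
    return out
```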
8393: Large Language Model Meets Constraint Propagation
Authors: Alexandre Bonlarron, Florian Régin, Elisabetta De Maria, Jean-Charles Régin
Location: Montreal | Day: August 20th | Time: 10:00 | Session: AI and Arts (1/2)
Large Language Models (LLMs) excel at generating fluent text but struggle to enforce external constraints because they generate tokens sequentially without explicit control mechanisms. GenCP addresses this limitation by combining LLM predictions with Constraint Programming (CP) reasoning, formulating text generation as a Constraint Satisfaction Problem (CSP). In this paper, we improve GenCP by integrating Masked Language Models (MLMs) for domain generation, which allows bidirectional constraint propagation that leverages both past and future tokens. This integration bridges the gap between token-level prediction and structured constraint enforcement, leading to more reliable and constraint-aware text generation. Our evaluation on COLLIE benchmarks demonstrates that incorporating domain preview via MLM calls significantly improves GenCP’s performance. Although this approach incurs additional MLM calls and, in some cases, increased backtracking, the overall effect is a more efficient use of LLM inferences and an enhanced ability to generate feasible and meaningful solutions, particularly in tasks with strict content constraints.
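The CSP view of constrained generation can be made concrete with a toy sketch: each position is a variable whose domain is a set of candidate tokens proposed by a language model, constraints prune the domain before a value is committed, and an empty domain triggers backtracking. `propose_tokens` is a hypothetical model call and the length constraint is only an example; this is not the GenCP system.

```python
# Toy backtracking search with domain filtering ("propagation") per position.
def generate_with_constraints(propose_tokens, n_words, max_len=6):
    sentence, tried = [], [set() for _ in range(n_words)]
    while len(sentence) < n_words:
        pos = len(sentence)
        domain = [w for w in propose_tokens(sentence, pos)        # LM/MLM candidates
                  if len(w) <= max_len and w not in tried[pos]]   # constraint filtering
        if domain:
            sentence.append(domain[0])                            # commit a value
        else:                                                     # dead end -> backtrack
            if not sentence:
                raise ValueError("no feasible sentence under the constraints")
            tried[pos].clear()
            tried[pos - 1].add(sentence.pop())                    # forbid the failed choice
    return " ".join(sentence)
```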
8437: Precarity and Solidarity: Preliminary Results on a Study of Queer and Disabled Fiction Writers’ Experiences with Generative AI
Authors: Carolyn Lamb, Daniel G. Brown, Maura R. Grossman
Location: Montreal | Day: August 20th | Time: 10:00 | Session: AI and Arts (1/2)
We present a mixed-methods study of professional fiction writers’ experiences with generative AI (genAI), primarily focused on queer and disabled writers. Queer and disabled writers are markedly more pessimistic than others about the impact of genAI on their industry, although pessimism is the majority attitude for all groups. We explore how genAI exacerbates existing causes of precarity for writers, reasons why writers are opposed to its use, and strategies used by marginalized fiction writers to safeguard their industry.
8503: METEOR: Melody-aware Texture-controllable Symbolic Music Re-Orchestration via Transformer VAE
Authors: Dinh-Viet-Toan Le, Yi-Hsuan Yang
Location: Montreal | Day: August 20th | Time: 14:00 | Session: AI and Arts (2/2)
Re-orchestration is the process of adapting a music piece for a different set of instruments. By altering the original instrumentation, the orchestrator often modifies the musical texture while preserving a recognizable melodic line and ensuring that each part is playable within the technical and expressive capabilities of the chosen instruments.

In this work, we propose METEOR, a model for generating Melody-aware Texture-controllable re-Orchestration with a Transformer-based variational auto-encoder (VAE). This model performs symbolic instrumental and textural music style transfers with a focus on melodic fidelity and controllability. We allow bar- and track-level controllability of the accompaniment with various textural attributes while keeping a homophonic texture. With both subjective and objective evaluations, we show that our model outperforms style transfer models on a re-orchestration task in terms of generation quality and controllability. Moreover, it can be adapted for a lead sheet orchestration task as a zero-shot learning model, achieving performance comparable to a model specifically trained for this task.
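As a rough sketch of how bar- or track-level texture controls can be exposed in a VAE-style decoder, the snippet below concatenates the latent code with an embedding of a discretized texture attribute (here, a hypothetical note-density bin) before projection. Dimensions and attribute names are assumptions, not METEOR's architecture.

```python
# Illustrative attribute-conditioned decoder input for a Transformer VAE.
import torch
import torch.nn as nn

class ConditionedDecoderInput(nn.Module):
    def __init__(self, z_dim=128, n_density_bins=8, attr_dim=16, model_dim=256):
        super().__init__()
        self.density_emb = nn.Embedding(n_density_bins, attr_dim)  # bar-level control
        self.proj = nn.Linear(z_dim + attr_dim, model_dim)          # fuse latent + control

    def forward(self, z, density_bin):
        cond = torch.cat([z, self.density_emb(density_bin)], dim=-1)
        return self.proj(cond)   # would feed the Transformer decoder (not shown)

x = ConditionedDecoderInput()(torch.randn(4, 128), torch.tensor([0, 3, 5, 7]))
print(x.shape)   # torch.Size([4, 256])
```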
8615: NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms
Authors: Yashan Wang, Shangda Wu, Jianhuai Hu, Xingjian Du, Yueqi Peng, Yongxin Huang, Shuai Fan, Xiaobing Li, Feng Yu, Maosong Sun
Location: Montreal | Day: August 20th | Time: 14:00 | Session: AI and Arts (2/2)
We introduce NotaGen, a symbolic music generation model aiming to explore the potential of producing high-quality classical sheet music. Inspired by the success of Large Language Models (LLMs), NotaGen adopts pre-training, fine-tuning, and reinforcement learning paradigms (henceforth referred to as the LLM training paradigms). It is pre-trained on 1.6M pieces of music in ABC notation, and then fine-tuned on approximately 9K high-quality classical compositions conditioned on "period-composer-instrumentation" prompts. For reinforcement learning, we propose the CLaMP-DPO method, which further enhances generation quality and controllability without requiring human annotations or predefined rewards. Our experiments demonstrate the efficacy of CLaMP-DPO in symbolic music generation models with different architectures and encoding schemes. Furthermore, subjective A/B tests against human compositions show that NotaGen outperforms baseline models, greatly advancing musical aesthetics in symbolic music generation.
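For context, the sketch below shows a standard DPO objective of the kind such a method builds on: preferred versus rejected generations (which, per the abstract, could be ranked by an automatic scorer rather than by humans) are compared via policy and reference log-probabilities. The tensors are placeholders; this is the generic DPO loss, not the paper's CLaMP-DPO pipeline.

```python
# Standard DPO loss over sequence-level log-probabilities (illustrative values).
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Each argument: summed log-probability of a whole sequence under a model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

loss = dpo_loss(torch.tensor([-42.0]), torch.tensor([-57.0]),
                torch.tensor([-45.0]), torch.tensor([-55.0]))
```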
8642: Leveraging Large Language Models for Active Merchant Non-player Characters
Authors: Byungjun Kim, Minju Kim, Dayeon Seo, Bugeun Kim
Location: Montreal | Day: August 22nd | Time: 11:30 | Session: Game Theory and Economic Paradigms
We highlight two significant issues leading to the passivity of current merchant non-player characters (NPCs): pricing and communication. While immersive interactions with active NPCs have been a focus, price negotiations between merchant NPCs and players remain underexplored. First, passive pricing refers to the limited ability of merchants to modify predefined item prices. Second, passive communication means that merchants can only interact with players in a scripted manner. To tackle these issues and create an active merchant NPC, we propose a merchant framework based on large language models (LLMs), called MART, which consists of an appraiser module and a negotiator module. We conducted two experiments to explore various implementation options under different training methods and LLM sizes, considering a range of possible game environments. Our findings indicate that finetuning methods, such as supervised finetuning (SFT) and knowledge distillation (KD), are effective in using smaller LLMs to implement active merchant NPCs. Additionally, we found three irregular cases arising from the responses of LLMs.
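A hypothetical sketch of the two-module layout the abstract describes: an appraiser proposes a price range for an item and a negotiator produces a reply and counter-offer inside that range. `llm` stands in for any chat-style model call; the prompts, parsing, and clamping rule are illustrative assumptions, not MART's actual design.

```python
# Illustrative appraiser + negotiator turn for a merchant NPC.
def merchant_turn(llm, item, player_offer, history):
    appraisal = llm(f"Estimate a fair price range in gold for: {item}. "
                    f"Reply only as 'low,high'.")
    low, high = (int(x) for x in appraisal.split(","))   # appraiser module output
    counter = max(low, min(high, player_offer))          # keep the deal inside the range
    reply = llm(f"You are a merchant selling {item}. The buyer offered "
                f"{player_offer} gold; counter with {counter} gold. "
                f"Conversation so far: {history}")       # negotiator module output
    return counter, reply
```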
8676: SmartSpatial: Enhancing 3D Spatial Awareness in Stable Diffusion with a Novel Evaluation Framework
Authors: Mao Xun Huang, Brian J Chan, Hen-Hsen Huang
Location: Montreal | Day: August 19th | Time: 15:00 | Session: CV: Diffusion models
Stable Diffusion models have made remarkable strides in generating photorealistic images from text prompts but often falter when tasked with accurately representing complex spatial arrangements, particularly involving intricate 3D relationships.
To address this limitation, we introduce SmartSpatial, an innovative approach that not only enhances the spatial arrangement capabilities of Stable Diffusion but also fosters AI-assisted creative workflows through 3D-aware conditioning and attention-guided mechanisms.
SmartSpatial incorporates depth information injection and cross-attention control to ensure precise object placement, delivering notable improvements in spatial accuracy metrics.
In conjunction with SmartSpatial, we present SmartSpatialEval, a comprehensive evaluation framework that bridges computational spatial accuracy with qualitative artistic assessments.
Experimental results show that SmartSpatial significantly outperforms existing methods, setting new benchmarks for spatial fidelity in AI-driven art and creativity.
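An illustrative sketch of attention-guided placement in general: a binary region mask for a target object is turned into an additive bias on text-to-image cross-attention logits, encouraging the object's token to attend inside its intended region. Shapes, the bias scale, and the mechanism details are assumptions, not SmartSpatial's implementation.

```python
# Generic sketch: bias one text token's cross-attention toward a spatial region.
import torch

def biased_cross_attention(q, k, v, region_mask, obj_token_idx, scale=2.0):
    """q: (hw, d) image queries; k, v: (n_tok, d) text keys/values;
    region_mask: (hw,) with 1 inside the target region, 0 outside."""
    logits = q @ k.T / (q.shape[-1] ** 0.5)           # (hw, n_tok)
    logits[:, obj_token_idx] += scale * region_mask    # push the object token into the region
    return torch.softmax(logits, dim=-1) @ v

out = biased_cross_attention(torch.randn(64, 32), torch.randn(8, 32),
                             torch.randn(8, 32), (torch.rand(64) > 0.5).float(), 3)
```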
8700: Intoner: For Chinese Poetry Intoning Synthesis
Authors: Heda Zuo, Liyao Sun, Zeyu Lai, Weitao You, Pei Chen, Lingyun Sun
Location: Montreal | Day: August 20th | Time: 10:00 | Session: AI and Arts (1/2)
Chinese Poetry Intoning, with improvised melodies devoid of fixed musical scores, is crucial for emotional expression and prosodic rendition. However, this cultural heritage faces challenges in propagation due to scant audio records and a scarcity of domain experts. Existing text-to-speech models lack the ability to generate melodious audio, while singing-voice-synthesis models rely on predetermined musical scores, which makes both unsuitable for intoning synthesis. Hence, we introduce Chinese Poetry Intoning Synthesis (PIS) as a novel task to reproduce intoning audio and preserve this age-old cultural art. Corresponding to this task, we summarize three-level principles from poetry metrical patterns and construct a diffusion PIS model, Intoner, based on them. We also collect a multi-style Chinese poetry intoning dataset of text-audio pairs accompanied by feature annotations. Experimental results show that our model effectively learns diverse intoning styles and contents and can synthesize more melodious and vibrant intoning audio. To the best of our knowledge, we are the first to work on the poetry intoning synthesis task.
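A hedged sketch of one way metrical information could condition such a model: map each syllable's tone class (the level "ping" versus oblique "ze" distinction used in classical Chinese metrics) to a coarse target contour supplied as an extra conditioning track. The mapping and values are illustrative assumptions, not the paper's actual principles or feature annotations.

```python
# Illustrative metrical-pattern feature: ping/ze labels -> coarse contour values.
def metrical_contour(tone_classes, high=1.0, low=-1.0):
    """tone_classes: sequence of 'ping' / 'ze' labels, one per syllable."""
    return [high if t == "ping" else low for t in tone_classes]

print(metrical_contour(["ping", "ping", "ze", "ze", "ping"]))
# [1.0, 1.0, -1.0, -1.0, 1.0]
```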
8860: Towards a Practical Tool for Music Composition: Using Constraint Programming to Model Chord Progressions and Modulations
Authors: Damien Sprockeels, Peter Van Roy
Location: Montreal | Day: August 20th | Time: 14:00 | Session: AI and Arts (2/2)
The Harmoniser project aims to provide a practical tool to aid music composers in creating complete musical works. In this paper, we present a formal model of its second layer, tonal chord progressions and modulations to neighbouring tonalities, and a practical implementation using the Gecode constraint solver. Since music composition is too complex to formalize in its entirety, the Harmoniser project makes two assumptions for tractability: first, it focuses on tonal music (the basis of Western classical and popular music); second, it defines a simplified four-layer composition process that is relevant for a significant number of composers. Previous work on using constraint programming for music composition was limited to exploring the formalisation of different musical aspects and did not address the overall problem of building a practical composer tool. Harmoniser’s four layers are global structure (tonal development of the whole piece), chord progressions (diatonic and chromatic) and modulations, voicing (four-voice chord layout), and ornaments (e.g., passing notes, appoggiaturas), all allowing iterative refinement by the composer. This paper builds on Diatony, prior work on layer 3 (voicing), and presents a model for layer 2 (chord progressions and modulations). The results of the present paper can be used as input to Diatony to generate voicing. Future work will define models for the remaining layers, and combine all layers together with a graphical user interface as a plug-in for a digital audio workstation (DAW).
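To give a flavour of a constraint model over chord progressions, the toy sketch below enumerates scale-degree sequences that respect a small table of allowed tonal moves and a cadence constraint (start and end on the tonic). The transition table is a drastic simplification for illustration and this is plain Python search, not the paper's Gecode model.

```python
# Toy progression model: allowed successors per degree + end-on-tonic constraint.
ALLOWED = {
    "I": {"ii", "IV", "V", "vi"}, "ii": {"V", "vii"}, "IV": {"V", "ii", "I"},
    "V": {"I", "vi"}, "vi": {"ii", "IV"}, "vii": {"I"},
}

def progressions(n_bars, seq=("I",)):
    if len(seq) == n_bars:
        if seq[-1] == "I":                  # cadence constraint: end on the tonic
            yield seq
        return
    for nxt in sorted(ALLOWED[seq[-1]]):    # only allowed successors are explored
        yield from progressions(n_bars, seq + (nxt,))

print(next(progressions(4)))    # first solution found: ('I', 'IV', 'V', 'I')
```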
8971: Algorithmic Composition Using Narrative Structure and Tension
Authors: Francisco Braga, Gilberto Bernardes, Roger B. Dannenberg, Nuno Correia
Location: Montreal | Day: August 20th | Time: 14:00 | Session: AI and Arts (2/2)
This paper describes an approach to algorithmic music composition that takes narrative structures as input, allowing composers to create music directly from narrative elements.
Creating narrative development in music remains a challenging task in algorithmic composition.
Our system addresses this by combining leitmotifs to represent characters, generative grammars for harmonic coherence, and evolutionary algorithms to align musical tension with narrative progression.
The system operates at different scales, from overall plot structure to individual motifs, enabling both autonomous composition and co-creation with varying degrees of user control.
Evaluation with compositions based on tales demonstrated the system’s ability to compose music that supports narrative listening and aligns with its source narratives, while being perceived as familiar and enjoyable.
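A minimal sketch of the tension-alignment idea, under the assumption that a candidate section plan can be scored section by section: fitness is how closely the plan's tension profile follows a target narrative curve, and a simple (1+1) evolutionary loop mutates the plan to improve the fit. `tension_of` and `mutate` are placeholders for the paper's music-analysis and variation operators.

```python
# Illustrative evolutionary loop aligning musical tension with a narrative curve.
def fitness(plan, target, tension_of):
    """Negative total deviation between per-section tension and the target curve."""
    return -sum(abs(tension_of(section) - t) for section, t in zip(plan, target))

def evolve(initial_plan, target, tension_of, mutate, generations=200):
    best = initial_plan
    for _ in range(generations):
        cand = mutate(best)                                   # hypothetical variation operator
        if fitness(cand, target, tension_of) > fitness(best, target, tension_of):
            best = cand                                       # keep the better-aligned plan
    return best
```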
9025: A Picture is Worth a Thousand Prompts? Efficacy of Iterative Human-Driven Prompt Refinement in Image Regeneration Tasks
Authors: Khoi Trinh, Scott Seidenberger, Raveen Wijewickrama, Murtuza Jadliwala, Anindya Maiti
Location: Montreal | Day: August 20th | Time: 10:00 | Session: AI and Arts (1/2)
With AI-generated content becoming widespread across digital platforms, it is important to understand how such content is inspired and produced. This study explores the underexamined task of image regeneration, where a human operator iteratively refines prompts to recreate a specific target image. Unlike typical image generation, regeneration begins with a visual reference. A key challenge is whether existing image similarity metrics (ISMs) align with human judgments and can serve as useful feedback in this process. We conduct a structured user study to evaluate how iterative prompt refinement affects similarity to target images and whether ISMs reflect the improvements perceived by human observers. Our results show that prompt adjustments significantly improve alignment, both subjectively and quantitatively, highlighting the potential of iterative workflows in enhancing generative image quality.
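As a simple illustration of comparing a regenerated image to its target, the sketch below computes two common low-level similarity scores (pixel MSE and histogram correlation). Perceptual metrics such as SSIM or LPIPS would typically be used as well; this is a generic example, not the specific ISMs evaluated in the paper.

```python
# Two basic image-similarity scores between a target image and a regeneration.
import numpy as np

def mse(a, b):
    return float(np.mean((a.astype(float) - b.astype(float)) ** 2))

def histogram_correlation(a, b, bins=64):
    ha, _ = np.histogram(a, bins=bins, range=(0, 255), density=True)
    hb, _ = np.histogram(b, bins=bins, range=(0, 255), density=True)
    return float(np.corrcoef(ha, hb)[0, 1])

target = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
regen = np.clip(target + np.random.randint(-10, 11, target.shape), 0, 255)
print(mse(target, regen), histogram_correlation(target, regen))
```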
9218: Pay Attention to the Keys: Visual Piano Transcription Using Transformers
Authors: Uros Zivanovic, Ivan Pilkov, Carlos Cancino-Chacón
Location: Montreal | Day: August 20th | Time: 14:00 | Session: AI and Arts (2/2)
Visual piano transcription (VPT) is the task of obtaining a symbolic representation of a piano performance from visual information alone (e.g., from a top-down video of the piano keyboard). In this work we propose a VPT system based on the vision transformer (ViT), which surpasses previous methods based on convolutional neural networks (CNNs). Our system is trained on the newly introduced R3 dataset, consisting of ca. 31 hours of synchronized video and MIDI recordings of piano performances. We additionally introduce an approach to predict note offsets, which has not been previously explored in this context. We show that our system outperforms the state-of-the-art on the PianoYT dataset for onset prediction and on the R3 dataset for both onsets and offsets.
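A schematic sketch of a ViT-style frame classifier for this kind of task: a keyboard image is split into patches, encoded with a Transformer, and two per-key heads emit onset and offset probabilities for the 88 piano keys. All sizes are illustrative assumptions; this is not the authors' architecture.

```python
# Toy ViT-style model: patchify a keyboard frame, encode, predict per-key events.
import torch
import torch.nn as nn

class TinyKeyViT(nn.Module):
    def __init__(self, patch=16, dim=128, n_keys=88):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.onset_head = nn.Linear(dim, n_keys)
        self.offset_head = nn.Linear(dim, n_keys)

    def forward(self, frames):                                  # frames: (B, 3, H, W)
        x = self.patchify(frames).flatten(2).transpose(1, 2)    # (B, n_patches, dim)
        pooled = self.encoder(x).mean(dim=1)                    # (B, dim)
        return (torch.sigmoid(self.onset_head(pooled)),
                torch.sigmoid(self.offset_head(pooled)))        # per-key probabilities

onsets, offsets = TinyKeyViT()(torch.rand(2, 3, 64, 912))       # two 64x912 keyboard crops
```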