
System Architecture for High-Fidelity Symbolic-Conditioned Audio Synthesis: A Framework for Rapid Prototyping and Validation

1. Introduction: The Divergence of Generative Audio

The landscape of neural audio synthesis has undergone a radical bifurcation in the mid-2020s. On one trajectory lies the domain of loose text-conditioned generation, typified by large-scale autoregressive models like Meta's MusicGen 1 and latent diffusion models such as Stability AI's Stable Audio.3 These systems excel at capturing high-level semantic abstractions—generating a "lo-fi hip hop beat" or a "cinematic orchestral swell" from natural language prompts. They operate on the principle of creative hallucination, where the model fills in the vast structural voids left by vague textual descriptions with statistically plausible acoustic patterns. While impressive for ideation and background ambience, these architectures fundamentally fail in professional production workflows that demand rigorous adherence to specific musical compositions.

On the opposing trajectory lies symbolic-conditioned generation, a domain that treats audio synthesis not as a creative completion task, but as a precise rendering task. Here, the input is not a vague prompt but a strict structural definition: a MIDI score, a MusicXML file, or a phoneme-duration sequence. The objective is to translate these discrete symbols into continuous waveforms with high fidelity, preserving the exact timing, pitch, and timbral characteristics dictated by the input. This is the domain of DiffSinger 5, MIDI-DDSP 7, and emerging architectures like Music ControlNet.9

This report serves as a comprehensive architectural guide for rapid prototyping within this second domain. The primary engineering challenge—and the "riskiest assumption" identified for validation—is the capability of current neural architectures to maintain strict symbolic alignment (the precise synchronization of linguistic and musical content) while achieving state-of-the-art (SOTA) acoustic naturalness at interactive inference latencies. The analysis suggests that while text-to-music models have captured the public imagination, the true frontier for controllable media lies in the convergence of diffusion probabilistic models and differentiable digital signal processing (DDSP), specifically applied to the rigorous constraints of Singing Voice Synthesis (SVS). By dissecting over 300 research artifacts, we delineate a reference architecture combining Montreal Forced Aligner (MFA) for data preparation, DiffSinger for acoustic modeling, and StreamDiffusion for real-time inference, offering a definitive pathway to validate the feasibility of highly controllable neural audio.

2. The Riskiest Assumption: Alignment, Fidelity, and Latency

In the context of rapid prototyping for symbolic audio generation, the "riskiest assumption" is a composite hypothesis. It posits that a neural system can simultaneously satisfy three competing constraints that have historically required trade-offs:

  1. Temporal and Pitch Precision (Alignment): The model must adhere to the input score with frame-level accuracy. If a MIDI note dictates a duration of 450ms and a pitch of 440Hz (A4), the generated audio must reflect this exactly. In vocal synthesis, this extends to the phoneme level; the /s/ in "sing" must align with the note onset, and the vowel /i/ must sustain for the specified duration. Failures here manifest as "glitches," slurring, or rhythmic drift, which render the output unusable for musical production.11
  2. Acoustic Fidelity (Naturalness): The output must be indistinguishable from a human performance. This requires modeling not just the fundamental frequency and gross spectral envelope, but also the micro-dynamics, breath noises, and complex timbral evolutions that characterize real instruments and voices. Early parametric models (like concatenative synthesis) achieved alignment but sounded robotic; early neural models (like WaveNet) sounded natural but were uncontrollable.11
  3. Interactive Latency (Real-Time Utility): To be useful in a creative loop, the generation must occur at speeds approaching real-time. Diffusion models, while offering SOTA quality and alignment, historically suffer from slow inference due to their iterative denoising nature (often requiring 100+ steps). The assumption is that modern acceleration techniques (Consistency Distillation, StreamDiffusion) can bridge this gap without catastrophic quality loss.14

The validation of this composite assumption requires a prototype that pushes the boundaries of all three constraints. The analysis indicates that Singing Voice Synthesis (SVS) is the optimal test case. Unlike instrumental generation, where slight timbral deviations are acceptable, the human ear is hyper-sensitive to vocal artifacts. Furthermore, SVS requires the synchronization of two independent symbolic streams—linguistic content (lyrics) and musical content (melody)—making it the most rigorous stress test for any symbolic conditioning architecture.5

---

3. Landscape Analysis of Generative Architectures

To substantiate the selection of specific components for the prototype, we must analyze the current state of generative architectures, categorizing them by their handling of symbolic control.

3.1 Text-Conditioned Autoregressive Models: The Control Gap

Models like MusicGen 1 and AudioLM 16 utilize autoregressive Transformer architectures. They operate over compressed discrete audio tokens (derived from neural audio codecs like EnCodec), and MusicGen conditions generation on text embeddings from T5 or CLAP.

While MusicGen represents a significant leap in audio quality and coherence, it is fundamentally ill-suited for strict symbolic control. The autoregressive nature of the model means it generates audio token-by-token based on probability distributions. While it can be "steered" by a melody (via chroma features in MusicGen-Melody) 17, it lacks a mechanism to enforce precise start/stop times for specific notes or to map specific phonemes to specific time windows. The analysis of benchmark data indicates that while MusicGen Large (3.3B parameters) achieves impressive Fréchet Audio Distance (FAD) scores of ~5.48 18, its inference speed on consumer hardware is often slower than real-time (e.g., generating 10 seconds of audio takes ~10-30 seconds on older GPUs) 19, and it offers no native support for detailed MIDI or lyric alignment. Fine-tuning efforts, such as MusiConGen 20, attempt to inject chord and rhythm conditioning, but these remain "soft" controls rather than the "hard" constraints required for score rendering.

3.2 Differentiable Digital Signal Processing (DDSP): The Interpretability Approach

MIDI-DDSP 7 represents a paradigm shift away from "black box" neural networks. Instead of predicting audio samples or spectrograms directly, the neural network predicts the parameters (amplitude, harmonic distribution, noise filter coefficients) for a Digital Signal Processor (DSP).

This approach has distinct advantages for prototyping. Because the signal generation is handled by a DSP (additive synthesis + subtractive noise), the audio is alias-free and highly controllable. The hierarchical structure—predicting "Performance" controls (vibrato, dynamics) from "Score" data (MIDI notes)—allows for explicit disentanglement of musical intent and expression. However, benchmarks suggest that DDSP models can sound "thin" or "buzzy" compared to diffusion models, as the DSP oscillators struggle to capture the full transient complexity of sounds like breath or pick attack, essentially placing a ceiling on acoustic fidelity.22 While excellent for solo instruments (violin, flute), DDSP struggles with the complex, non-harmonic textures of the human voice or percussion compared to the raw modeling power of diffusion.

3.3 Score-Conditioned Diffusion: The Convergence Point

The architecture of choice for high-fidelity symbolic generation is Score-Conditioned Diffusion, exemplified by DiffSinger.6 Unlike autoregressive models, which generate audio sequentially, diffusion models generate audio (or mel-spectrograms) by iteratively refining a noisy signal. This non-autoregressive parallel generation capability is crucial for three reasons:

  1. Global Coherence: The model "sees" the entire utterance or musical phrase at once during training, allowing it to plan breath and intonation curves that span multiple notes.
  2. Explicit Conditioning: DiffSinger incorporates a "Variance Adaptor" module (borrowed from FastSpeech 2) that explicitly adds pitch, duration, and energy embeddings to the phoneme sequence before the diffusion process begins. This essentially "locks" the generation to the symbolic grid.
  3. Quality Ceiling: Diffusion models currently hold the SOTA for audio quality, capable of generating the subtle stochastic details (phase incoherence, breathiness) that DDSP misses and the structural coherence that GANs struggle with.25

The trade-off has historically been speed. A standard diffusion process might require 100 or 1000 denoising steps. However, recent advances in Consistency Distillation 27 and StreamDiffusion 14 have reduced this to 1-4 steps or enabled streaming inference, effectively neutralizing the latency penalty.

3.4 Comparative Architecture Analysis

The following table summarizes the capabilities of the leading architectures relative to the prototyping goals.

Feature | MusicGen (Text-to-Audio) | MIDI-DDSP (DSP-Based) | DiffSinger (Diffusion-Based)
Input Modality | Text / Loose Melody | MIDI (Notes) | Phonemes + Pitch + Duration
Control Precision | Low (Global Style) | High (Note-wise) | Extreme (Frame-wise)
Acoustic Fidelity | High (but hallucinations) | Medium (Synthetic texture) | Very High (SOTA)
Inference Mechanism | Autoregressive (Slow) | DSP (Fast) | Diffusion (Variable Speed)
Alignment Stability | Poor (No explicit duration) | Excellent (Deterministic) | Excellent (Explicit Predictor)
Prototype Viability | Low (Hard to align) | Medium (Limited textures) | High (Best balance)

Table 1: Comparative analysis of generative audio architectures based on control, fidelity, and viability for symbolic conditioning.

---

4. The Symbolic Front-End: Data Engineering for Control

The success of a symbolic-conditioned prototype is determined upstream of the generative model. The "garbage in, garbage out" principle is acute here; if the alignment between the symbolic representation (text/MIDI) and the ground truth audio is flawed during training, the model will fail to learn the association, leading to the "glitching" artifacts identified as a key risk.

4.1 Linguistic Processing and Phonetization

For the recommended SVS prototype, raw text is insufficient. It must be converted into a phonetic representation that the model can acoustically map. The International Phonetic Alphabet (IPA) is emerging as the standard for multilingual synthesis, allowing models to generalize across languages (e.g., Visinger2 5 and Transinger 5 use IPA to bridge Chinese and English).

However, for a rapid prototype focused on a single language (likely English or Mandarin), specific Grapheme-to-Phoneme (G2P) tools are more efficient.

  • English: The g2p_en library 28 is the industry standard. It converts "Hello world" into HH AH0 L OW1 W ER1 L D. Crucially, it handles lexical stress (0, 1, 2), which is vital for natural prosody (see the sketch after this list).
  • Mandarin: pypinyin is the standard, often converting characters to Pinyin with tone markers, which are then mapped to phonemes.
  • Handling Slurs and Melisma: A unique challenge in SVS is the "slur" or "melisma," where one syllable spans multiple notes. The data preprocessing pipeline must explicitly flag these events. The Opencpop dataset 30 solves this by annotating "slur" as a binary flag for each note, a feature the prototype should replicate to prevent the model from re-articulating the consonant at every pitch change.
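To make the English G2P step concrete, here is a minimal sketch assuming the g2p_en package is installed; the output follows the ARPAbet convention with trailing stress digits:

```python
# Minimal G2P sketch; g2p_en downloads its CMUdict/NLTK resources on first use.
from g2p_en import G2p

g2p = G2p()
phonemes = g2p("Hello world")                  # word boundaries appear as ' ' tokens
phonemes = [p for p in phonemes if p.strip()]  # drop the separators
print(phonemes)  # ['HH', 'AH0', 'L', 'OW1', 'W', 'ER1', 'L', 'D']
```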

4.2 The Alignment Engine: Montreal Forced Aligner (MFA)

The critical component for data preparation is the Montreal Forced Aligner (MFA).32 MFA uses Hidden Markov Models (HMMs) to align audio with text. While deep learning aligners exist, MFA remains the robust workhorse for generating the "ground truth" durations required to train the DiffSinger duration predictor.

Mechanism of Action:

  1. Acoustic Model: MFA uses a pre-trained acoustic model (usually trained on thousands of hours of speech, like Librispeech) to recognize phonemes in the audio.
  2. Lexicon: It uses a pronunciation dictionary to expand the text transcript into a phoneme graph.
  3. Viterbi Alignment: It finds the optimal path through the HMM states that matches the audio features (MFCCs), producing exact start and end times for every phoneme.
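MFA writes its output as one Praat TextGrid per utterance, with "words" and "phones" tiers. A minimal sketch for recovering per-phoneme boundaries, assuming the third-party textgrid package and MFA's default tier names:

```python
# Sketch: read an MFA alignment and collect (phoneme, start, end) tuples.
import textgrid  # pip install textgrid (an assumption of this example)

tg = textgrid.TextGrid.fromFile("aligned/utt_0001.TextGrid")
phone_tier = tg.getFirst("phones")            # MFA's default phone tier name

alignment = [
    (interval.mark or "sil", interval.minTime, interval.maxTime)  # empty mark = silence
    for interval in phone_tier
]
for phone, start, end in alignment[:5]:
    print(f"{phone:>6s}  {start:7.3f}s  {end:7.3f}s  ({(end - start) * 1000:.0f} ms)")
```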

Prototyping Risk & Mitigation:
MFA is optimized for speech. Singing voice contains elongated vowels and vibrato that can confuse speech-trained HMMs, leading to alignment errors (e.g., cutting a vowel short).

  • Mitigation 1: Use a "singing-adapted" acoustic model if available (rare).
  • Mitigation 2: Iterative Realignment. Train a small acoustic model on the specific singing dataset (if >1 hour) and realign. This "bootstrapping" significantly improves alignment accuracy for singing.34
  • Mitigation 3: Spectral verification. The prototype pipeline should include a sanity check where the average energy of the "silence" segments is calculated. If a segment marked "silence" has high energy, the alignment is likely shifted.
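Mitigation 3 can be automated in a few lines. A minimal sketch, assuming librosa and the (phoneme, start, end) alignment tuples from the previous snippet; the RMS threshold is an illustrative placeholder to be calibrated on a known-good subset:

```python
# Sketch: flag utterances whose "silence" intervals carry suspicious energy.
import librosa
import numpy as np

SILENCE_LABELS = {"sil", "sp", "spn", ""}     # typical non-speech marks
RMS_THRESHOLD = 0.02                          # illustrative; calibrate per dataset

def alignment_is_suspicious(wav_path, alignment, threshold=RMS_THRESHOLD):
    y, sr = librosa.load(wav_path, sr=None)
    for phone, start, end in alignment:
        if phone not in SILENCE_LABELS:
            continue
        segment = y[int(start * sr):int(end * sr)]
        if segment.size == 0:
            continue
        if float(np.sqrt(np.mean(segment ** 2))) > threshold:
            return True                       # "silence" with high energy -> likely shifted
    return False
```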

4.3 Pitch and Feature Extraction

Symbolic conditioning requires explicit pitch information. Relying on MIDI alone is often insufficient for training because human singers drift in pitch, use vibrato, and slide between notes (portamento).

  • Pitch Extraction (F0): The analysis strongly recommends RMVPE (Robust Model for Vocal Pitch Estimation) 35 over older algorithms like CREPE or Harvest. RMVPE is a deep learning-based pitch tracker trained specifically on vocal data with noise and reverb, making it far more robust for "in-the-wild" data processing.
  • Mel-Spectrograms: The target output for the diffusion model is the Mel-spectrogram. Standard parameters for 44.1 kHz audio are: 1024 FFT size, 1024 window size, and a hop size of 512 (≈11.6 ms per frame). The hop size is a critical hyperparameter; smaller hops (e.g., 256) offer better temporal resolution for fast melodic runs but roughly double the computational load.35
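A minimal sketch of this feature-extraction stage using librosa with the parameters above (the 80 Mel bins are an illustrative choice); RMVPE has no single canonical package, so the F0 call is left as a stub to be backed by whichever implementation is adopted:

```python
# Sketch: log-Mel extraction at 44.1 kHz with the parameters above.
import librosa
import numpy as np

SR, N_FFT, WIN, HOP, N_MELS = 44100, 1024, 1024, 512, 80   # 512 / 44100 ≈ 11.6 ms per frame

def extract_mel(wav_path):
    y, _ = librosa.load(wav_path, sr=SR)
    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=N_FFT, win_length=WIN, hop_length=HOP, n_mels=N_MELS
    )
    return np.log(np.clip(mel, 1e-5, None))   # shape: (n_mels, n_frames)

def extract_f0(wav_path):
    # Placeholder: call the chosen RMVPE implementation here; its frame rate
    # must match HOP so that F0 and Mel frames line up one-to-one.
    raise NotImplementedError("plug in an RMVPE checkpoint")
```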

---

5. Acoustic Model Architecture: Deep Dive into DiffSinger

The core of the proposed prototype is the DiffSinger architecture. Understanding its internal mechanism is crucial for effective implementation and debugging.

5.1 The Shallow Diffusion Mechanism

Standard diffusion models (DDPM) learn to reverse a noise process starting from pure Gaussian noise \(x_T \sim \mathcal{N}(0, I)\). DiffSinger introduces the concept of Shallow Diffusion.6

Instead of starting from pure noise, the model starts from a "coarse" Mel-spectrogram generated by a simple, fast auxiliary model (like a simple regression model or a low-quality TTS output). The diffusion process then performs only \(K\) steps (where \(K < T\), e.g., 100 steps instead of 1000) to "refine" this coarse Mel into a high-fidelity one.

  • Implication for Prototyping: This drastically reduces the "riskiest assumption" of latency. By needing fewer steps to reach high quality, the model is inherently faster. It also eases the modeling burden; the diffusion model focuses on texture and detail rather than basic structure (which the auxiliary model handles).
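To make the mechanism concrete, the following sketch shows shallow-diffusion inference under standard DDPM notation: the coarse Mel from the auxiliary decoder is diffused forward to step \(K\) and then denoised for only \(K\) reverse steps. The denoise_fn callable and the noise-schedule values are stand-ins, not the DiffSinger defaults.

```python
# Sketch: shallow diffusion inference (DDPM, epsilon-prediction parameterization).
# `denoise_fn(x_t, t, cond)` stands in for the trained conditional denoiser.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.06, T)         # illustrative linear schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def shallow_diffusion(coarse_mel, cond, denoise_fn, K=100):
    # 1) Diffuse the coarse Mel forward to step K: q(x_K | coarse_mel).
    eps = torch.randn_like(coarse_mel)
    x = alpha_bars[K - 1].sqrt() * coarse_mel + (1 - alpha_bars[K - 1]).sqrt() * eps

    # 2) Run only K reverse steps instead of T.
    for t in reversed(range(K)):
        eps_hat = denoise_fn(x, torch.tensor([t]), cond)
        coef = betas[t] / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps_hat) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x  # refined Mel-spectrogram
```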

5.2 The Variance Adaptor

To ensure the "Control" requirement, DiffSinger does not rely on the diffusion process to hallucinate duration or pitch. It uses a Variance Adaptor module 12, placed between the phoneme encoder and the diffusion decoder.

  1. Duration Predictor: A small convolutional network that takes the phoneme embedding and predicts the duration (in frames). During inference, this dictates exactly how many frames each phoneme occupies. This is where the symbolic control is enforced.
  2. Length Regulator: Expands the phoneme sequence based on the predicted (or provided) durations.
  3. Pitch Predictor: Predicts the F0 curve.
  4. Feature Fusion: The expanded phoneme embeddings are summed with the pitch embeddings and position embeddings. This combined vector \(c\) serves as the condition for the diffusion process \(p_\theta(x_{t-1} \mid x_t, c)\).

This architecture provides the "knobs" for the prototype. If the user wants to stretch a note, they simply modify the input duration. If they want to change the melody, they modify the F0 curve. The diffusion model is mathematically constrained to generate texture around this skeletal structure.
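The Length Regulator is the point where symbolic durations become hard constraints on the frame grid. A minimal sketch using torch.repeat_interleave; the names and shapes are illustrative:

```python
# Sketch: expand phoneme-level embeddings to frame level using integer durations.
import torch

def length_regulate(phoneme_emb, durations):
    """
    phoneme_emb: (n_phonemes, hidden) phoneme encoder outputs
    durations:   (n_phonemes,) integer frame counts (predicted or taken from the score)
    returns:     (n_frames, hidden) frame-level conditioning sequence
    """
    return torch.repeat_interleave(phoneme_emb, durations, dim=0)

# Example: 3 phonemes held for 4, 10, and 6 frames -> a 20-frame condition.
emb = torch.randn(3, 256)
frames = length_regulate(emb, torch.tensor([4, 10, 6]))
print(frames.shape)  # torch.Size([20, 256])
```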

---

6. Inference Acceleration: Real-Time Strategies

The third pillar of the riskiest assumption is latency. A diffusion model that takes 5 seconds to generate 1 second of audio is a failure for interactive prototyping.

6.1 Consistency Distillation

Consistency Models 15 represent a breakthrough in accelerating diffusion. The core idea is to train a model to map any point on the probability flow trajectory directly to the origin (the clean data), enforcing a "consistency" property: \(f(x_t, t) = f(x_{t'}, t')\).

  • ConsistencyTTA (Text-to-Audio): Research shows that distilling a diffusion model into a consistency model can reduce inference steps to just 1 or 2, achieving a 400x speedup with negligible quality loss (FAD scores remain competitive).15
  • Prototyping Path: Start with a standard DiffSinger. Once quality is verified, apply Consistency Distillation (using the pretrained DiffSinger as the "Teacher") to compress the inference loop.
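As a sketch of how the distilled model is used at inference time, the following shows one- and two-step consistency sampling; consistency_fn stands in for the distilled acoustic decoder, and the noise levels are illustrative:

```python
# Sketch: 1-2 step sampling with a distilled consistency model.
# `consistency_fn(x_t, t, cond)` returns an estimate of the clean Mel, f(x_t, t) ≈ x_0.
import torch

@torch.no_grad()
def consistency_sample(shape, cond, consistency_fn, sigmas=(80.0, 2.0)):
    # Step 1: map pure noise at the largest noise level straight to a clean estimate.
    x = torch.randn(shape) * sigmas[0]
    x0 = consistency_fn(x, torch.tensor(sigmas[0]), cond)

    # Optional step 2: re-noise to an intermediate level and map again,
    # trading a little latency for quality.
    for sigma in sigmas[1:]:
        x = x0 + sigma * torch.randn_like(x0)
        x0 = consistency_fn(x, torch.tensor(sigma), cond)
    return x0
```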

6.2 StreamDiffusion

For applications requiring continuous generation (e.g., a virtual avatar singing live), StreamDiffusion 14 offers a pipeline-level solution.

  • Stream Batching: Instead of a simple queue, StreamDiffusion processes a batch of frames where each element in the batch corresponds to a different denoising step. Frame 1 is at step \(T\), Frame 2 is at step \(T-1\), etc. In one forward pass of the GPU, the model advances all frames by one step.
  • Residual Classifier-Free Guidance (R-CFG): CFG usually doubles computation (one pass conditional, one unconditional). R-CFG approximates the negative residual, reducing this cost significantly.
  • Performance: Benchmarks on an NVIDIA RTX 4090 show StreamDiffusion achieving up to 91 FPS for image generation. Given that audio spectrogram chunks are far smaller than 1024x1024 images, this approach should, in principle, support high-sample-rate audio generation substantially faster than real time.40
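A schematic sketch of Stream Batching for a 4-step distilled denoiser: each slot of a rolling batch sits at a different denoising step, so a single batched forward pass advances every in-flight chunk by one step and, once the pipeline is full, emits one finished chunk per pass. The step_fn callable and chunk shape are stand-ins.

```python
# Sketch: pipeline-style stream batching for a 4-step distilled denoiser.
import torch

NUM_STEPS = 4            # e.g., a consistency-distilled model
CHUNK_SHAPE = (80, 64)   # illustrative Mel chunk: 80 bins x 64 frames

def stream_generate(step_fn, num_chunks):
    # steps[i] is the denoising step slot i takes on the next pass; a slot is
    # finished after it has taken step 0. Slot 0 always holds the oldest chunk.
    batch = [torch.randn(CHUNK_SHAPE) for _ in range(NUM_STEPS)]
    steps = list(range(NUM_STEPS))                             # staggered: [0, 1, 2, 3]
    outputs = []
    while len(outputs) < num_chunks:
        x = step_fn(torch.stack(batch), torch.tensor(steps))   # one pass advances all slots
        batch, steps = list(x), [s - 1 for s in steps]
        if steps[0] < 0:                                       # oldest slot took its final step
            outputs.append(batch.pop(0))
            steps.pop(0)
            batch.append(torch.randn(CHUNK_SHAPE))             # admit a new noisy chunk
            steps.append(NUM_STEPS - 1)
    # Note: the first NUM_STEPS - 1 outputs are warm-up and would be discarded live.
    return outputs
```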

6.3 Hardware Considerations: A100 vs. RTX 4090

For the prototype, hardware selection is critical.

Metric | NVIDIA A100 (80GB) | NVIDIA RTX 4090 (24GB) | Analysis for Audio
VRAM | 80 GB | 24 GB | A100 wins for training: large batches of long audio require massive VRAM.
Memory Bandwidth | 2039 GB/s | 1008 GB/s | A100 wins for massive parameter models.
Clock Speed | ~1.4 GHz | ~2.5 GHz | 4090 wins for inference: at small batch sizes (real-time interaction), the raw clock speed allows for lower latency.41
Cost | ~$15,000 | ~$1,800 | 4090 is the clear winner for prototyping stations.

Conclusion: Train on cloud A100s. Deploy and demonstrate the prototype on a local RTX 4090.

---

7. Implementation Roadmap: The Prototyping Sprint

This section outlines the step-by-step engineering process to build the validation prototype.

Phase 1: Environment and Data Preparation (Days 1-3)

Objective: Create a verified, aligned dataset from Opencpop.

  1. Environment Setup:
     • OS: Linux (Ubuntu 20.04/22.04) is mandatory for MFA and efficient CUDA handling.
     • Python: 3.8 or 3.9 (compatibility with espnet and fairseq tools).
     • GPU: Minimum 1x RTX 3090/4090.
  2. Dataset Acquisition: Download Opencpop.30
  3. Preprocessing Pipeline (Python Script):
     • Step A: Parse transcriptions.txt to get Pinyin/phonemes.
     • Step B: Parse .TextGrid files to get durations.
     • Step C: Parse .wav files to get Mel-spectrograms (using librosa or torchaudio) and F0 (using RMVPE).
     • Step D: Crucial Sanity Check: Visualize the alignment. Overlay the F0 curve and the phoneme boundaries on the spectrogram. If they do not visually align, stop: re-run MFA or fix sample rates (see the sketch after this list).
  4. Format Export: Save serialized dictionaries (or .tfrecord / .npy files) containing [mel, f0, duration, phoneme_id] for each utterance.
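Step D can be scripted as a quick visual check. A minimal sketch, assuming the Mel/F0 helpers from Section 4 and matplotlib: mel is (n_mels, n_frames), f0 holds one value per frame (0 for unvoiced), and alignment is the list of (phoneme, start, end) tuples in seconds.

```python
# Sketch: overlay F0 and phoneme boundaries on the Mel-spectrogram (Step D).
import matplotlib.pyplot as plt
import numpy as np

SR, HOP = 44100, 512

def plot_alignment(mel, f0, alignment, out_png="sanity_check.png"):
    n_mels, n_frames = mel.shape
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.imshow(mel, origin="lower", aspect="auto", cmap="magma")

    # F0 rescaled onto the Mel-bin axis purely for visual inspection.
    f0 = np.asarray(f0, dtype=float)[:n_frames]
    f0_vis = np.where(f0 > 0, f0 / max(f0.max(), 1e-5) * (n_mels - 1), np.nan)
    ax.plot(f0_vis, color="cyan", linewidth=1.0, label="F0 (scaled)")

    # Phoneme boundaries converted from seconds to frame indices.
    for phone, start, _end in alignment:
        frame = int(round(start * SR / HOP))
        ax.axvline(frame, color="white", alpha=0.4, linewidth=0.5)
        ax.text(frame, n_mels - 5, phone, color="white", fontsize=6)

    ax.legend(loc="upper right")
    fig.savefig(out_png, dpi=150)
    plt.close(fig)
```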

Phase 2: Model Training (Days 4-7)

Objective: Train the DiffSinger acoustic model to convergence.

  1. Configuration:
     • Encoder: Transformer (4-6 layers).
     • Decoder: Diffusion (20-40 steps for the prototype).
     • Variance Adaptor: 2-layer Conv1D for duration/pitch prediction.
  2. Loss Balancing:
     • Assign a higher weight to the duration loss initially; if the model cannot predict durations, the diffusion step has no structural frame to work with (see the sketch after this list).
     • Metric: Watch duration_loss and f0_loss; they should plateau relatively quickly, while diffusion_loss decreases more slowly.
  3. Vocoder Training:
     • Do not train from scratch. Fine-tune a pre-trained HiFi-GAN (trained on universal data such as LibriTTS) on the Opencpop wavs. This adaptation takes only ~10k steps and dramatically improves the "wetness" and presence of the voice.43
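A minimal sketch of the loss-balancing idea: weight the duration term heavily at the start and let the diffusion term take over as structure stabilizes. The weights and warmup schedule are illustrative placeholders, not values taken from the DiffSinger configuration.

```python
# Sketch: weighted multi-task loss with an early emphasis on the duration term.
def total_loss(dur_loss, f0_loss, diff_loss, step, warmup_steps=20_000):
    progress = min(step / warmup_steps, 1.0)
    w_dur = 3.0 - 2.0 * progress      # decays from 3.0 to 1.0 over the warmup window
    return w_dur * dur_loss + 1.0 * f0_loss + 1.0 * diff_loss

# Usage inside the training loop:
#   loss = total_loss(dur_loss, f0_loss, diff_loss, global_step)
#   loss.backward()
```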

Phase 3: Inference and Validation (Days 8-10)

Objective: Run the prototype and measure against the "Riskiest Assumptions."

  1. Inference Pipeline:
     • Input: MusicXML file (an unseen song).
     • Process: MusicXML -> music21 -> Phonemes/Note Durations -> DiffSinger -> Mel-Spec -> HiFi-GAN -> Audio.
  2. Metric Calculation (see the sketch after this list):
     • F0 RMSE: Extract pitch from the generated audio and compare it to the input MIDI pitch. Target: < 15 cents.
     • Duration Error: Run MFA on the generated audio to recover phoneme boundaries and compare them to the input durations. Target: < 20 ms drift.
     • RTF: Measure wall-clock generation time divided by audio duration. Target: < 1.0 (faster than real time).
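The three targets can be scripted directly. A minimal sketch, assuming frame-aligned F0 arrays (0 for unvoiced frames), phoneme boundary lists in seconds, and a wall-clock measurement of the synthesis call; voiced/unvoiced handling is simplified.

```python
# Sketch: validation metrics for the riskiest-assumption checks.
import numpy as np

def f0_rmse_cents(f0_gen, f0_ref):
    """RMSE in cents over frames that are voiced in both sequences."""
    f0_gen, f0_ref = np.asarray(f0_gen), np.asarray(f0_ref)
    n = min(len(f0_gen), len(f0_ref))
    voiced = (f0_gen[:n] > 0) & (f0_ref[:n] > 0)
    cents = 1200.0 * np.log2(f0_gen[:n][voiced] / f0_ref[:n][voiced])
    return float(np.sqrt(np.mean(cents ** 2)))           # target: < 15 cents

def duration_drift_ms(bounds_gen, bounds_ref):
    """Mean absolute phoneme-boundary drift in milliseconds."""
    n = min(len(bounds_gen), len(bounds_ref))
    diff = np.abs(np.asarray(bounds_gen[:n]) - np.asarray(bounds_ref[:n]))
    return float(diff.mean() * 1000.0)                    # target: < 20 ms

def real_time_factor(wall_seconds, audio_seconds):
    return wall_seconds / audio_seconds                   # target: < 1.0
```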

---

8. Datasets and Resources

A prototyping effort is defined by its data. The following datasets are identified as the most high-value resources for this specific architecture.

Dataset | Type | Content | License | Utility for Prototype
Opencpop 30 | SVS | 5.2 hrs Mandarin, professional singer | CC-BY-NC | Critical. The gold standard for aligned SVS data.
M4Singer 5 | SVS | Multi-singer, multi-style | CC-BY-NC | Good for testing generalization across timbres.
GTSinger 5 | SVS | 4 languages, controlled techniques | Research | Excellent for multilingual IPA testing.
Lakh MIDI (LMD) 45 | Symbolic | 176k MIDI files | Creative Commons | Useful for instrumental pre-training (Music ControlNet).
MusicNet 46 | Audio/Symbolic | Classical, aligned notes | CC-BY | Good for validating polyphonic piano/string alignment.

---

9. Conclusion

The "riskiest assumption"—that neural models can achieve strict symbolic alignment with SOTA fidelity at low latency—is robustly testable using the architecture defined in this report. The convergence of DiffSinger for precise acoustic modeling, MFA for ground-truth alignment, and StreamDiffusion/Consistency Distillation for inference acceleration creates a complete technical stack that addresses the historical trade-offs of the field.

While text-conditioned models like MusicGen garner headlines for their generative breadth, they currently fail the precision test required for professional media production. The proposed prototype, focusing on the rigorous constraints of Singing Voice Synthesis, serves as the ultimate "canary in the coal mine." If a system can align the complex spectral textures of the human voice to a millisecond-precise phoneme grid in real-time, it has effectively solved the core challenge of symbolic audio generation, opening the door to a new era of highly controllable, AI-assisted music production.

The research strongly suggests that the bottleneck is no longer theoretical but engineering-focused: specifically, the quality of the data preprocessing (alignment) and the optimization of the inference pipeline (distillation). By following the implementation roadmap outlined above, the prototype will directly attack these bottlenecks, providing definitive confirmation of the technology's viability.

Works cited

  1. MusicGen: Simple and Controllable Music Generation - AudioCraft, accessed on December 6, 2025, https://audiocraft.metademolab.com/musicgen.html
  2. MusicGen - Hugging Face, accessed on December 6, 2025, https://huggingface.co/docs/transformers/en/model_doc/musicgen
  3. Fast Timing-Conditioned Latent Audio Diffusion - arXiv, accessed on December 6, 2025, https://arxiv.org/html/2402.04825v3
  4. Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion - Stability AI, accessed on December 6, 2025, https://stability.ai/research/stable-audio-efficient-timing-latent-diffusion
  5. Transinger: Cross-Lingual Singing Voice Synthesis via IPA-Based Phonetic Alignment, accessed on December 6, 2025, https://www.mdpi.com/1424-8220/25/13/3973
  6. Visinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation - Semantic Scholar, accessed on December 6, 2025, https://www.semanticscholar.org/paper/VISinger2%2B%3A-End-to-End-Singing-Voice-Synthesis-by-Yu-Shi/afb8cb54b4a8dca27eadee19afbc04aeb518dfee
  7. "MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling", accessed on December 6, 2025, https://midi-ddsp.github.io/
  8. MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling, accessed on December 6, 2025, https://magenta.withgoogle.com/midi-ddsp
  9. Music ControlNet: Multiple Time-Varying Controls for Music Generation - Semantic Scholar, accessed on December 6, 2025, https://www.semanticscholar.org/paper/Music-ControlNet%3A-Multiple-Time-Varying-Controls-Wu-Donahue/42239e71a712d70cd24e06ffc0cf0d22fc628a36
  10. Music ControlNet: Multiple Time-Varying Controls for Music Generation, accessed on December 6, 2025, https://gclef-cmu.org/static/pdfs/2024musiccontrolnet.pdf
  11. (PDF) AI-Enabled Text-to-Music Generation: A Comprehensive Review of Methods, Frameworks, and Future Directions - ResearchGate, accessed on December 6, 2025, https://www.researchgate.net/publication/389965213_AI-Enabled_Text-to-Music_Generation_A_Comprehensive_Review_of_Methods_Frameworks_and_Future_Directions
  12. arXiv:2303.08607v1 [cs.SD] 15 Mar 2023, accessed on December 6, 2025, https://arxiv.org/pdf/2303.08607
  13. arXiv:2406.08761v2 [cs.SD] 14 Dec 2024, accessed on December 6, 2025, https://arxiv.org/pdf/2406.08761
  14. StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation - arXiv, accessed on December 6, 2025, https://arxiv.org/html/2312.12491v2
  15. ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation - Microsoft, accessed on December 6, 2025, https://www.microsoft.com/applied-sciences/uploads/publications/128/consistencytta.pdf
  16. AI-Enabled Text-to-Music Generation: A Comprehensive Review of Methods, Frameworks, and Future Directions - MDPI, accessed on December 6, 2025, https://www.mdpi.com/2079-9292/14/6/1197
  17. audiocraft/docs/MUSICGEN.md at main - GitHub, accessed on December 6, 2025, https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md
  18. facebook/musicgen-large - Hugging Face, accessed on December 6, 2025, https://huggingface.co/facebook/musicgen-large
  19. Low-latency Music Generation Using AI - DiVA portal, accessed on December 6, 2025, http://www.diva-portal.org/smash/get/diva2:1887231/FULLTEXT01.pdf
  20. MuseControlLite: Multifunctional Music Generation with Lightweight Conditioners - arXiv, accessed on December 6, 2025, https://arxiv.org/html/2506.18729v1
  21. MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation, accessed on December 6, 2025, https://arxiv.org/html/2407.15060v1
  22. Designing Neural Synthesizers for Low-Latency Interaction - arXiv, accessed on December 6, 2025, https://arxiv.org/html/2503.11562v2
  23. Neural timbre transfer effects for neutone, accessed on December 6, 2025, https://neutone.ai/blog/neural-timbre-transfer-effects-for-neutone
  24. LAPS-Diff: A Diffusion-Based Framework for Singing Voice Synthesis With Language Aware Prosody-Style Guided Learning - arXiv, accessed on December 6, 2025, https://arxiv.org/html/2507.04966v1
  25. A Review on Score-based Generative Models for Audio Applications - arXiv, accessed on December 6, 2025, https://arxiv.org/html/2506.08457v1
  26. arXiv:2410.21641v1 [cs.SD] 29 Oct 2024, accessed on December 6, 2025, https://arxiv.org/pdf/2410.21641
  27. ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation - GitHub, accessed on December 6, 2025, https://github.com/Bai-YT/ConsistencyTTA
  28. [2509.01391] MixedG2P-T5: G2P-free Speech Synthesis for Mixed-script texts using Speech Self-Supervised Learning and Language Model - arXiv, accessed on December 6, 2025, https://arxiv.org/abs/2509.01391
  29. Kyubyong/g2p: g2p: English Grapheme To Phoneme Conversion - GitHub, accessed on December 6, 2025, https://github.com/Kyubyong/g2p
  30. Amphion/egs/datasets/README.md at main - GitHub, accessed on December 6, 2025, https://github.com/open-mmlab/Amphion/blob/main/egs/datasets/README.md
  31. Opencpop - Xinsheng Wang, accessed on December 6, 2025, https://xinshengwang.github.io/opencpop/
  32. User Guide — Montreal Forced Aligner 3.X documentation, accessed on December 6, 2025, https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/index.html
  33. trainable text-speech alignment using Kaldi, accessed on December 6, 2025, https://montreal-forced-aligner.readthedocs.io/en/v3.3.6/_downloads/998b0c31eadaf048e8e3de805b9ef8e6/MFA_paper_Interspeech2017.pdf
  34. (PDF) Research on the Recognition and Application of Montreal Forced Aligner for Singing Audio - ResearchGate, accessed on December 6, 2025, https://www.researchgate.net/publication/381530211_Research_on_the_Recognition_and_Application_of_Montreal_Forced_Aligner_for_Singing_Audio
  35. STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation - ACL Anthology, accessed on December 6, 2025, https://aclanthology.org/2025.findings-acl.781.pdf
  36. PyTorch implementation of DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (focused on DiffSpeech) - GitHub, accessed on December 6, 2025, https://github.com/keonlee9420/DiffSinger
  37. DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism - The Association for the Advancement of Artificial Intelligence, accessed on December 6, 2025, https://cdn.aaai.org/ojs/21350/21350-13-25363-1-2-20220628.pdf
  38. Simplifying, stabilizing, and scaling continuous-time consistency models - OpenAI, accessed on December 6, 2025, https://openai.com/index/simplifying-stabilizing-and-scaling-continuous-time-consistency-models/
  39. StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation - arXiv, accessed on December 6, 2025, https://arxiv.org/pdf/2312.12491
  40. StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation - CVF Open Access, accessed on December 6, 2025, https://openaccess.thecvf.com/content/ICCV2025/papers/Kodaira_StreamDiffusion_A_Pipeline-level_Solution_for_Real-Time_Interactive_Generation_ICCV_2025_paper.pdf
  41. GPU Benchmarks NVIDIA A100 80 GB (PCIe) vs. NVIDIA RTX 4090 - Bizon-tech, accessed on December 6, 2025, https://bizon-tech.com/gpu-benchmarks/NVIDIA-A100-80-GB-(PCIe)-vs-NVIDIA-RTX-4090/624vs637
  42. Why more developers are choosing RTX 4090 over A100 for AI workloads | Hivenet, accessed on December 6, 2025, https://compute.hivenet.com/post/why-more-developers-are-choosing-rtx-4090-over-a100
  43. Silly DiffSinger Training Guide | Part 1: Local Env Setup, Acoustic Config, Training & TensorBoard - YouTube, accessed on December 6, 2025, https://www.youtube.com/watch?v=Sxt11TAflV0
  44. GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks - arXiv, accessed on December 6, 2025, https://arxiv.org/html/2409.13832v2
  45. BUILDING THE METAMIDI DATASET: LINKING SYMBOLIC AND AUDIO MUSICAL DATA - ISMIR, accessed on December 6, 2025, https://archives.ismir.net/ismir2021/paper/000022.pdf
  46. MusicNet-16k + EM for YourMT3 - Zenodo, accessed on December 6, 2025, https://zenodo.org/record/7811639