
Architecture A: Single-Instrument Text-to-Music Pipeline (Solo Piano Prototype)

Overview and Scope

We propose a text-to-symbolic-to-audio pipeline focused on a single instrument (here we choose expressive solo piano) that can generate short musical clips (15–60 seconds) from user text prompts. The system will map a natural-language prompt to a structured symbolic composition (MIDI-like representation), then render it into realistic audio using a dedicated instrument model. This Architecture A approach emphasizes symbolic control first, ensuring the output adheres to musical structure (notes, rhythm, etc.) before synthesis. The prototype targets a narrow style range (e.g. “emotional cinematic piano” – slow tempo, emotive dynamics) for manageability. Every major stage of the music generation process is implemented end-to-end (text parsing → symbolic composition → performance humanization → piano rendering → audio output), albeit at prototype quality. The goal is a “walking skeleton” system that demonstrates the full pipeline integration, even if some components are minimal. Non-musician users should be able to enter a simple prompt and receive a coherent piano performance matching the mood/description. We constrain scope to one instrument (solo piano), single-track output (no accompaniment layers), and simple prompts (genre/mood, tempo, a few style hints). This avoids multi-instrument arrangement complexity and licensing issues with multi-track data. The pipeline is designed for interactive use (like an “Udio-style” app where the user types prompts, listens, and refines), so we also consider latency and user experience in the design.

Target Instrument & Style: We choose solo piano as the prototype instrument, focusing on a limited style family such as “emotive cinematic piano” (think sparse, expressive contemporary piano pieces) or “lo-fi piano”. Piano is ideal due to abundant training data (e.g. the MAESTRO dataset of piano performances) and its expressive range (dynamics, pedal, etc.). Focusing on a narrow style (e.g. gentle, ambient piano) helps the model specialize and match prompts (for example “melancholic ¾ piano waltz, lots of space”) without needing to cover all genres. This choice also aligns with available datasets that are either public-domain or licensable for commercial use (classical and pop piano pieces – see Data Plan). Piano’s rich expressive data (timing, velocity variations) supports the performance modeling stage. The overall clip length is kept short (e.g. 8–32 bars, ~30 seconds) to ease generation and evaluation. With these constraints, we can design a concrete pipeline and incrementally improve quality.


Pipeline Design and Modules

Text Prompt Conditioning

The pipeline begins with the user’s text prompt (e.g. “A melancholic solo piano in ¾ time with a sparse left hand”). This prompt is parsed and encoded into a conditioning representation that will guide music generation. In the minimal design, we use a manually designed control schema (Option B in prompt encoding): the system extracts structured controls from the text, such as:

  • Mood/Style tags: e.g. melancholic, cinematic, lo-fi. We maintain a small set of mood/genre tokens.
  • Tempo and Meter: e.g. “¾” → time signature = ¾, “slow” → BPM ≈ 60. If BPM is explicitly given (numeric or adjective), we map it to a tempo value.
  • Texture/Density hints: e.g. “sparse left-hand” implies low note density in bass register; “build up gradually” suggests a dynamic curve.
  • Key or scale (optional): If the user specifies “in C minor”, we set a key signature token or bias the note selection accordingly.

These controls are encoded into a conditioning vector or token set that the symbolic generator can understand. For example, we may prepend special tokens to the music sequence like:

  • <MOOD_SAD> <TEMPO_60> <METER_3/4>


or use a separate embedding that is fed into the model as global context. This rule-based prompt mapping avoids needing large paired text-music data initially. As an advanced option, we could incorporate a pre-trained text encoder to generate a richer embedding (Option A), but for v0 we favor transparency and controllability.

We can bootstrap the text-conditioning by using metadata from music pieces (e.g. labeling dataset pieces with mood tags and tempo), ensuring the model learns to respond to those control tokens even without natural-language training. In summary, the text prompt is distilled into a small set of musical parameters (tempo, meter, mood/style, length, etc.), which form the input controls for the next stage.
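
To make this mapping concrete, the following is a minimal sketch of the rule-based parser; the keyword tables, default values, and the exact ControlDict fields (mood, tempo_bpm, meter, density, bars) are illustrative assumptions rather than a fixed schema.

import re

MOOD_KEYWORDS = {
    "melancholic": "SAD", "sad": "SAD", "calm": "CALM", "peaceful": "CALM",
    "cinematic": "CINEMATIC", "lo-fi": "LOFI", "lofi": "LOFI",
}
TEMPO_WORDS = {"slow": 60, "moderate": 90, "fast": 120}

def generate_controls(prompt_text: str) -> dict:
    """Distill a free-text prompt into the discrete controls used by the symbolic generator."""
    text = prompt_text.lower()
    controls = {"mood": "CALM", "tempo_bpm": 90, "meter": "4/4", "density": "MEDIUM", "bars": 16}

    for keyword, mood in MOOD_KEYWORDS.items():
        if keyword in text:
            controls["mood"] = mood
            break

    explicit_bpm = re.search(r"(\d{2,3})\s*bpm", text)     # numeric tempo, e.g. "70 bpm"
    if explicit_bpm:
        controls["tempo_bpm"] = int(explicit_bpm.group(1))
    else:
        for word, bpm in TEMPO_WORDS.items():              # adjective-based tempo
            if word in text:
                controls["tempo_bpm"] = bpm
                break

    if "3/4" in text or "¾" in text or "waltz" in text:
        controls["meter"] = "3/4"
    if "sparse" in text:
        controls["density"] = "LOW"
    elif "busy" in text or "dense" in text:
        controls["density"] = "HIGH"
    return controls

For the example prompt above, this returns mood SAD, meter 3/4, and density LOW, with the tempo left at its default because no tempo cue is present.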

Symbolic Structure & Composition Planning

Next, a symbolic composition module produces the musical content in a MIDI-like symbolic format. For short clips, we can generate the notes in one sequence without a separate high-level planner, but we ensure the model is aware of structural markers like bars and possibly sections. In the minimal design, the composition model directly outputs a linear sequence of notes over time (with bar divisions) given the prompt conditions.

In an advanced design, we consider a hierarchical approach: a high-level structure planner could first outline a form (e.g. 4 bars intro, 8 bars main theme) or a chord progression, and then a lower-level model fills in actual notes and rhythms for each section. However, for 30-second single-instrument pieces, a single-stage generator with embedded bar tokens is likely sufficient initially.

We will include structural cues in the representation (explained in the next section) so the model can learn patterns like phrase lengths or repetition. For example, if a prompt says “start softly and build up,” the generator might internally plan increasing note density or velocity over the bars.

The output of this stage is a quantized symbolic score: essentially notes with pitches, timings (aligned to a metrical grid), and basic dynamics (e.g. loudness markings or velocities). At this point the music is a “clean” score – rhythmically perfect and mechanical, as if typed into MIDI notation software.

Symbolic Music Generation Core

The core generation engine is a machine learning model that produces the sequence of musical events given the encoded prompt. We propose at least one viable model architecture for this symbolic core:

Minimal Model – Autoregressive Transformer

A Transformer decoder model (in the spirit of GPT or Music Transformer) that generates music token-by-token. It takes the prompt conditioning as input (either appended as tokens or via encoder–decoder attention). Concretely:

  • Represent the music as a sequence of discrete tokens (see Representation section),
  • Train the Transformer to predict the next token.

At generation time, we initialize with tokens encoding the desired style/tempo and possibly a “start-of-music” token, then let the model sample a sequence of notes and other events. This model would learn musical structure from training MIDI data. It can be relatively small (e.g. 8–12 Transformer layers, ~50M parameters) to allow fast inference.

The conditioning can be injected by prefix tokens (e.g. a <STYLE_JAZZ> token in the sequence influences generation) and/or by adding the prompt embedding to the model’s input at each step (similar to classifier-free guidance, but here as context).

This design is implementable now with standard libraries. It uses teacher-forcing training on real MIDI sequences, conditioned on metadata-derived tokens for style and tempo. The primary objective is maximum likelihood (cross-entropy loss) on the next token, possibly with auxiliary losses or conditioning dropout to ensure it listens to the control tokens.

This minimal model should capture local musical syntax (rhythms, melodies) and some global coherence over ~30 seconds.
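
As a sketch of how conditioning and sampling fit together, the snippet below prepends control tokens and samples autoregressively; the model interface (a decoder returning per-position logits over the vocabulary), the vocab mapping, and the token names are assumptions for illustration.

import torch

def sample_sequence(model, vocab, controls, max_tokens=2048, temperature=1.0):
    """Prepend prompt-derived control tokens, then sample music tokens one at a time."""
    prefix = [vocab[f"<MOOD_{controls['mood']}>"],
              vocab[f"<TEMPO_{controls['tempo_bpm']}>"],
              vocab[f"<METER_{controls['meter']}>"],
              vocab["<BOS>"]]
    tokens = torch.tensor([prefix], dtype=torch.long)

    for _ in range(max_tokens):
        with torch.no_grad():
            logits = model(tokens)[:, -1, :]          # next-token logits from the decoder
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
        if next_token.item() == vocab["<END>"]:
            break
    return tokens[0].tolist()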

Advanced Model – Hierarchical / Two-Stage Generator (Optional)

To better handle long-term structure and complex prompt constraints, we consider a more ambitious architecture that plans music in stages:

  • A bar-level Transformer first generates a sequence of high-level embeddings or tokens for each bar (encoding chord or tonal center, intensity, etc.).
  • A second model generates fine-grained notes within each bar conditioned on that bar’s embedding.

This hierarchical Transformer has an internal representation of musical structure (each bar/section is summarized), allowing it to enforce repetition or development over time more easily than a flat model.

Another variant could be a transformer-VAE hybrid: sample a latent representing the entire piece’s style/structure, then decode to tokens – ensuring global consistency.

However, such approaches add complexity in training and might need more data. For the prototype, a simpler structured approach might be to explicitly guide the generator with chord progression input or to split the generation: e.g., have the model generate a chord sequence first, then melody on top (especially relevant if we later target accompaniment guitar).

Because these advanced structures require chord/section annotations and more engineering, we treat them as second-phase enhancements. The v0 prototype will rely on the simpler flat Transformer with bar tokens, but the rest of the system is designed so we can drop in a hierarchical model later without changing downstream modules (it still outputs MIDI-like notes).

Continuation / Partial Regeneration

The symbolic core should support prefix-based continuation so users can regenerate only parts of a clip. This naturally fits the autoregressive Transformer: we feed in the tokens for bars 1–4 and resume sampling from bar 5 onward. The interface will support:

  • Whole-clip regeneration (no prefix, just new sequence),
  • Tail regeneration (prefix fixed, new continuation),
  • Future extension to mid-section replacement (treat earlier + later bars as constraints, generate middle – more advanced).

The main requirement is that the model conditions strongly on the prefix and doesn’t diverge into unrelated keys or tempi, which we’ll check during evaluation.
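
A sketch of tail regeneration under these requirements, reusing the sampling loop above; it assumes the token list contains explicit BarStart tokens and that the model/vocab interface matches the previous sketch.

import torch

def regenerate_tail(model, vocab, tokens, keep_bars, target_bars, temperature=1.0, max_new_tokens=4096):
    """Keep the first `keep_bars` bars as a fixed prefix and resample the remaining bars."""
    bar_token = vocab["BarStart"]
    bar_positions = [i for i, t in enumerate(tokens) if t == bar_token]
    cut = bar_positions[keep_bars]                    # index where bar keep_bars+1 begins
    prefix = torch.tensor([tokens[:cut]], dtype=torch.long)

    bars_started = keep_bars                          # BarStart tokens already present in the prefix
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(prefix)[:, -1, :]
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1)
        if nxt.item() == bar_token:
            bars_started += 1
            if bars_started > target_bars:            # target bar count complete; do not start another bar
                break
        prefix = torch.cat([prefix, nxt], dim=1)
        if nxt.item() == vocab["<END>"]:
            break
    return prefix[0].tolist()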

Performance Humanization Module

The raw symbolic output from the generator is quantized and lacks human feel – e.g., all notes align perfectly on the beat, velocities might be uniform, and for piano, no pedal or phrasing information is present. The performance rendering stage bridges this gap by converting the “score” into a more human-like performance before audio rendering.

This is a distinct module that takes a MIDI sequence and outputs a modified MIDI (or enhanced event list) with nuanced timing, dynamics, and articulations.

Rule-Based Humanization (Minimal Baseline)

We define a set of deterministic or random perturbation rules inspired by musical performance practice:

  • Timing swings: small, bounded random offsets to note onsets (e.g. ±10–20 ms), with patterns like:
    • Slight anticipation or delay for expressive emphasis,
    • Strong beats kept closer to the grid, weak beats allowed more deviation.
  • Velocity shaping:
    • Emphasize downbeats and melody notes with higher velocity,
    • Apply phrase arcs: increasing velocity through a phrase (crescendo) and then dropping (decrescendo),
    • Respect “soft” vs “intense” prompt cues by scaling entire velocity ranges.
  • Articulation & pedal (piano):
    • Lengthen notes slightly for legato lines, shorten for staccato patterns,
    • Insert sustain pedal on/off events to connect harmonically related notes and enforce legato over chords.

We expose a “humanization amount” control that scales these effects. At 0%, the MIDI passes through unchanged (robotic). At 100%, max jitter and expressive variation are applied.

To support looping and clean segment joins, we enforce:

  • Bar-level alignment: total duration per bar remains consistent with the nominal tempo; deviations are zero-mean inside the bar so the bar boundary falls at the expected time.
  • Stable bar starts: the first beat of each bar has minimal offset to ensure good loop points.

This rule-based approach is fast and data-free and should significantly reduce the “MIDI-robotic” feel.
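
A minimal sketch of these rules, operating on the note-dict format used elsewhere in this document (pitch/start/duration/velocity, with start and duration in beats); the jitter range and accent amounts are illustrative defaults.

import random

def humanize(notes, beats_per_bar=3, amount=0.5, seed=0):
    """Add bounded, per-bar-centred timing jitter and simple velocity accents to quantized notes."""
    rng = random.Random(seed)
    out = [dict(n) for n in notes]

    bars = {}
    for note in out:                                       # group notes by bar index
        bars.setdefault(int(note["start"] // beats_per_bar), []).append(note)

    for bar_idx, bar_notes in bars.items():
        offsets = [rng.uniform(-0.02, 0.02) * amount for _ in bar_notes]   # roughly ±20 ms in beats at 60 BPM
        mean = sum(offsets) / len(offsets)
        for note, offset in zip(bar_notes, offsets):
            beat_in_bar = note["start"] - bar_idx * beats_per_bar
            shift = 0.0 if beat_in_bar == 0 else offset - mean             # keep downbeats on the grid
            note["start"] += shift
            accent = 8 if beat_in_bar == 0 else 0                          # emphasize the downbeat
            jitter = rng.randint(-6, 6)
            note["velocity"] = int(min(127, max(1, note["velocity"] + amount * (accent + jitter))))
    return out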

Optional Learned Performance Model

Later, we can train a small model to map from clean score to expressive performance:

  • Inputs: quantized note events (pitch, nominal onset, duration) + bar/beat position + optional style/humanization controls.
  • Outputs: timing offsets, velocity adjustments, articulation flags, pedal curves.

Training data:

  • Derived from expressive MIDI like MAESTRO: quantize to get “score”, align with original to extract deviations.

A small Transformer or BiLSTM could learn patterns such as:

  • Phrase-level rubato (slowing at cadences),
  • Relative accents based on melodic contour,
  • Style-specific timing (e.g., more swing for certain tags).

Given this adds complexity and requires careful training, we start with rules but define a performance representation that can later be driven by a learned model without changing other modules.

Instrument Rendering (Audio Synthesis)

The humanized MIDI performance is fed into a piano rendering engine to produce audio.

Sample-Based Piano (Chosen for v0)

We will use a realistic virtual piano sampler:

  • Input: MIDI (notes, velocities, pedal),
  • Output: audio waveform (e.g. 44.1 kHz stereo).

Options include:

  • SFZ/SF2 soundfonts (e.g. Salamander Grand, other high-quality open libraries),
  • Lightweight sample-based VSTs if licensing permits.

Advantages:

  • High timbral quality without training an audio model,
  • Predictable response to velocity and pedal,
  • Straightforward integration.

Trade-offs:

  • Limited timbral flexibility (one piano sound unless multiple libraries are loaded),
  • Production “flavor” (lo-fi, tape, etc.) must be handled by simple post-FX (EQ, reverb, saturation) rather than learned.

Given the prototype scope, this is the most pragmatic renderer: low risk and easy to drop in.
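
One possible integration, shown as a sketch: rendering the humanized MIDI file through FluidSynth via the midi2audio wrapper. The soundfont path is a placeholder, and midi2audio is just one convenient option, not a committed dependency.

from midi2audio import FluidSynth

SOUNDFONT_PATH = "assets/grand_piano.sf2"    # placeholder: any licensed/open piano soundfont

def render_midi_to_wav(midi_path: str, wav_path: str, sample_rate: int = 44100) -> str:
    """Render a humanized MIDI file to a WAV file with a sample-based piano."""
    FluidSynth(SOUNDFONT_PATH, sample_rate=sample_rate).midi_to_audio(midi_path, wav_path)
    return wav_path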

Neural or Hybrid Synth (Future Option)

For later R&D:

  • MIDI-DDSP–style models can convert MIDI to audio via differentiable oscillators,
  • Score-conditioned diffusion or neural vocoders can generate more detailed and flexible timbres, at the cost of training and inference complexity.

Our architecture keeps the renderer modular so we can swap a neural backend later without touching symbolic or performance layers.

UX Considerations

The user interface is Udio-style:

  • A text box for prompts:
    • e.g. “Slow ¾ melancholic solo piano, sparse left hand, builds slightly at the end.”
  • Basic controls:
    • Length (bars or seconds),
    • Tempo (or let text decide),
    • Density (sparse ↔ busy),
    • Humanization (robotic ↔ expressive).

Key UX features:

  • One-click generation: Prompt → preview in a few seconds.
  • Scoped regeneration:
    • User selects last N bars; system regenerates that section only, keeping earlier bars intact.
  • Export:
    • Audio (WAV),
    • MIDI for further editing in a DAW.

Latency targets:

  • Preview: < 5 seconds for ~30s clip,
  • Full render: < 15 seconds, acceptable for prototype.

We support caching:

  • Store the symbolic composition so we can:
    • Re-humanize with different settings without re-composing,
    • Re-render with different instruments or effects without rerunning the Transformer.

Symbolic Representation and Module Interfaces

A well-defined symbolic representation is central to the pipeline, as it is the interface between modules. We use an event-based, bar-structured representation inspired by REMI / Magenta.

Timeline & Rhythm

  • Bars:
    • BarStart token marks each bar start.
  • Positions:
    • Inside a bar, discrete positions (e.g. 24 ticks per quarter note → 96 ticks per 4/4 bar).
    • Encode as Position_k tokens to indicate where notes occur.
  • Tempo & Meter:
    • Global Tempo_X and Meter_3/4 tokens at the start, derived from prompt or metadata.

Example (conceptual):

  • BarStart, Position_0, NoteOn_C4, Duration_quarter, Velocity_80, Position_24, NoteOn_E4, ...

This ensures the model is aware of bar/beat positions and can respect the desired meter.

Pitch, Duration, Dynamics

We use compound note events via a short sequence:

  • NoteOn_{pitch},
  • Duration_{value},
  • Velocity_{bin}.

Duration values:

  • Quantized rhythmic values (eighth, quarter, dotted quarter, etc.),
  • Mapped from MIDI ticks.

Velocity:

  • Binned into e.g. 8–16 discrete levels to reduce vocab size.

The composition model can produce basic dynamics (pp–ff) via these velocity bins, but the performance module refines them further.
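
A sketch of this tokenization for quantized notes (start and duration in beats); the tick resolution and the 16-bin velocity scheme follow the choices above, while the exact token spellings are assumptions.

TICKS_PER_QUARTER = 24

def tokenize(notes, beats_per_bar=3):
    """Convert quantized notes into BarStart / Position / NoteOn / Duration / Velocity tokens."""
    tokens = []
    current_bar = -1
    for note in sorted(notes, key=lambda n: n["start"]):
        bar = int(note["start"] // beats_per_bar)
        while current_bar < bar:                           # emit BarStart for every new (possibly empty) bar
            tokens.append("BarStart")
            current_bar += 1
        position = int(round((note["start"] - bar * beats_per_bar) * TICKS_PER_QUARTER))
        tokens += [
            f"Position_{position}",
            f"NoteOn_{note['pitch']}",
            f"Duration_{int(round(note['duration'] * TICKS_PER_QUARTER))}",
            f"Velocity_{min(note['velocity'] // 8, 15)}",  # 16 velocity bins
        ]
    tokens.append("<END>")
    return tokens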

Expressive Controls (Performance Layer)

Expressive controls are primarily added after composition:

  • Timing offsets:
    • Stored as extra attributes on notes (e.g. offset_ms).
  • Articulation:
    • Implied by note overlaps / shortened durations.
  • Pedal:
    • Represented as control change events (MIDI CC64 on/off).

These are not generated by the composition model; they are added by the humanization/performance module.

Prompt Condition Encoding

We encode prompt-derived controls as special tokens at sequence start:

  • Mood: <MOOD_SAD>, <MOOD_CALM>, etc.
  • Density: <DENSITY_LOW>, <DENSITY_HIGH>.
  • Tempo/meter: <TEMPO_60>, <METER_3/4>.

During training:

  • We synthesize these labels from dataset metadata (e.g. EMOPIA emotion tags, MIDI tempo, time signature).
  • The model learns correlations between these tokens and the following music.

At inference, text prompts are parsed into these discrete controls.

Data Structures & APIs

Internally, we convert between:

  • Token sequences (for the Transformer),
  • Structured MIDI-like object (for performance & rendering).

Example (pseudo-JSON):

composition = {
  "tempo": 60,
  "time_sig": "3/4",
  "mood": "sad",
  "notes": [
      {"pitch": 60, "start": 0.0, "duration": 0.5, "velocity": 80},
      {"pitch": 64, "start": 0.5, "duration": 0.5, "velocity": 75}
  ]
}

These map 1:1 to a standard MIDI representation, so we can use existing MIDI libraries and renderers.
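
A sketch of that mapping using the pretty_midi library (one reasonable choice, not a fixed dependency); it assumes note start and duration are expressed in beats and converts them to seconds using the clip tempo.

import pretty_midi

def composition_to_midi(composition: dict, path: str) -> str:
    """Write the composition dict to a standard MIDI file."""
    seconds_per_beat = 60.0 / composition["tempo"]
    pm = pretty_midi.PrettyMIDI(initial_tempo=composition["tempo"])
    numerator, denominator = (int(x) for x in composition["time_sig"].split("/"))
    pm.time_signature_changes.append(pretty_midi.TimeSignature(numerator, denominator, 0))

    piano = pretty_midi.Instrument(program=0)              # program 0 = Acoustic Grand Piano
    for note in composition["notes"]:
        start = note["start"] * seconds_per_beat
        end = start + note["duration"] * seconds_per_beat
        piano.notes.append(pretty_midi.Note(velocity=note["velocity"], pitch=note["pitch"], start=start, end=end))
    pm.instruments.append(piano)
    pm.write(path)
    return path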

Module APIs

  • generate_controls(prompt_text: str) -> ControlDict
    Parse text into {mood, tempo, meter, density, length}.

  • generate_composition(controls: ControlDict, prefix: Optional[MIDI]) -> MIDI
    Use the Transformer to generate a new composition, optionally conditioned on a prefix for continuation.

  • humanize_performance(midi: MIDI, intensity: float, style: str) -> MIDI
    Apply rule-based or learned humanization.

  • synthesize_audio(midi: MIDI, quality: str = "high") -> AudioBuffer
    Use a sampler or neural renderer to produce audio.

These clear boundaries make each component testable and swappable.
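
As a sketch, the whole pipeline reduces to chaining these calls; argument names mirror the signatures above, and the returned structure is an assumption.

def generate_clip(prompt_text: str, humanization: float = 0.5, prefix=None) -> dict:
    """Run the full text → symbolic → performance → audio flow once."""
    controls = generate_controls(prompt_text)                          # text → {mood, tempo, meter, density, length}
    score = generate_composition(controls, prefix=prefix)              # symbolic composition (MIDI-like)
    performance = humanize_performance(score, intensity=humanization, style="default")
    audio = synthesize_audio(performance, quality="high")
    return {"controls": controls, "score": score, "performance": performance, "audio": audio}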


Model Architecture Details

Symbolic Composer Model

  • Architecture:
    • Transformer decoder,
    • Embedding size ~512, 8–12 layers, multi-head self-attention,
    • Relative positional encoding for long-range structure.

  • Input:
    • Sequence of tokens:
      • Prompt tokens (mood, tempo, meter, density),
      • Music tokens (BarStart, Position, NoteOn, Duration, Velocity, END).

  • Training:
    • Cross-entropy on next-token prediction,
    • Teacher forcing,
    • Datasets: MAESTRO, EMOPIA, others (see Data Plan).

  • Conditioning:
    • Prompt tokens prepended; model learns their effect,
    • Optionally, prompt text embedding added as extra context later.

  • Generation:
    • Autoregressive sampling,
    • Stopping when the desired number of bars or the END token is reached,
    • Constrained decoding to respect meter (e.g., enforce the correct number of beats per bar where possible).

Control over:

  • Length:
    • Specify a target bar count; stop after reaching that many BarStart tokens.
  • Density:
    • Use density tokens and/or adjust sampling temperature for note-on vs rest decisions.
  • Register:
    • Implicit via training data; optional future explicit token if needed.

Advanced Hierarchical Model (Optional)

As an upgrade path:

  • High-level planner:
    • Predict bar-level embeddings (chords, intensity per bar).
  • Low-level note generator:
    • Conditioned on bar embeddings.

Benefits:

  • Better long-range structure (phrase repetition, development),
  • Easier satisfaction of constraints like chord progressions.

Costs:

  • Needs aligned chord/structure annotations or robust automatic extraction,
  • Additional complexity in training and inference.

Given prototype scope, we stick with the flat Transformer but keep data and representation compatible with a hierarchical extension.

Performance Rendering Model

Initial implementation:

  • Purely rule-based humanization (no training required).

Future improvement:

  • Small performance model:
    • Input: quantized notes + position features,
    • Output: timing offset, velocity delta, articulation flags,
    • Trained on MAESTRO (score vs performance pairs).

Architecture candidates:

  • 1–2 layer BiLSTM over sequences,
  • Small Transformer with limited context.

We keep the performance representation stable so that upgrading from rules to a learned model is seamless.

Audio Renderer

  • v0:
    • Sample-based piano via soundfont/VST,
    • Minimal configuration: 1 instrument, 1 reverb send.

  • Future:
    • MIDI-DDSP piano for more expressive timbral control,
    • Score-conditioned diffusion as an experimental path for richer production.

Data and Training Plan

Symbolic Composition Data

Candidate datasets:

  • MAESTRO:
    • ~200 hours of aligned piano MIDI + audio,
    • Classical & romantic repertoire,
    • License: CC BY 4.0 (commercial use with attribution),
    • Used for:
      • Composition training (after quantization),
      • Performance training (score vs performance).

  • EMOPIA:
    • 1,087 piano clips (~35 s each) with emotion labels,
    • Modern/pop/hybrid styles,
    • Excellent for mood conditioning (emotion tags → MOOD tokens),
    • Licensing: research dataset; commercial use needs verification.

  • GiantMIDI-Piano:
    • 10k+ transcribed classical piano pieces,
    • Public-domain compositions; transcription licensing less clear but generally research-friendly.

  • Others:
    • Potentially AILabs1k7 (if licensing is clear),
    • Custom in-house MIDI in the desired style.

We combine these where licensing allows, balancing classical and modern material.

Performance Data

For training a learned humanization model (optional):

  • Use MAESTRO performance MIDI:
    • Extract tempo and beat grid,
    • Quantize to the nearest grid → “score”,
    • Differences vs the original → training targets for timing and velocity.

This yields a large set of note-level examples with realistic expressive deviations.
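
A sketch of the extraction step, assuming the beat grid has already been estimated and is expressed as a fixed grid spacing in seconds; the note dicts carry onset, pitch, and velocity.

def extract_deviations(notes, grid_seconds=0.25):
    """Quantize each onset to the nearest grid point and record the residual timing deviation as a training target."""
    examples = []
    for note in notes:
        quantized_onset = round(note["start"] / grid_seconds) * grid_seconds
        examples.append({
            "pitch": note["pitch"],
            "score_onset": quantized_onset,                        # model input: clean score position
            "timing_offset": note["start"] - quantized_onset,      # model target: expressive deviation
            "velocity": note["velocity"],                          # can be binned to form a score-level velocity
        })
    return examples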

Licensing Strategy

We follow a two-tier model:

  • Research-only models:
    • May use datasets with unclear/non-commercial licensing (e.g., Lakh MIDI, EMOPIA if needed),
    • Used for internal evaluation and iteration only.

  • Production-eligible models:
    • Use only:
      • Public-domain or CC-BY(-SA) datasets (no NC),
      • Explicitly licensed catalogs with commercial rights,
      • User-provided data with explicit opt-in.
    • MAESTRO (CC-BY) is fine with attribution,
    • Lakh MIDI and other ambiguous sets are excluded.

We maintain a dataset registry documenting source, license, and allowed uses, and track which models used which datasets.

Preprocessing & Augmentation

Steps:

  1. Instrument filtering:
     • Keep only solo piano tracks; drop multitrack-band MIDIs unless we extract the piano track only.

  2. Quantization for composition:
     • Align expressive performances to a regular grid (e.g. 16th notes),
     • Preserve meter and notated durations.

  3. Tokenization:
     • Convert MIDI to our token sequence (BarStart, Position, NoteOn, Duration, Velocity).

  4. Label synthesis:
     • Derive prompt-like controls:
       • Emotion from EMOPIA labels,
       • Tempo from MIDI meta,
       • Meter from time signature,
       • Density approximated from notes-per-bar.

  5. Augmentation (see the sketch after this list):
     • Transposition: shift MIDI by ±n semitones (within piano range),
     • Tempo scaling: slightly adjust tempo and update the Tempo token,
     • Optionally random dynamic scaling.

  6. Splits:
     • Train/val/test split by piece or composer (avoid leakage).
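
A sketch of the transposition augmentation from step 5; the keyboard range bounds (21–108, i.e. A0–C8) are standard piano limits, and the drop-on-overflow policy is an assumption.

def transpose(notes, semitones, low=21, high=108):
    """Shift all pitches by a fixed number of semitones; return None if any note would leave the piano range."""
    shifted = [dict(note, pitch=note["pitch"] + semitones) for note in notes]
    if all(low <= note["pitch"] <= high for note in shifted):
        return shifted
    return None    # caller drops this augmentation or retries with a smaller shift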

Training

  • Symbolic model:
    • Train on a mixed corpus (MAESTRO + EMOPIA),
    • Objective: next-token prediction,
    • Monitor:
      • Validation loss,
      • Structural metrics (bar integrity, correct meter),
      • Conditional behavior on held-out prompts.

  • Performance model (if implemented):
    • Train on (score → performance) pairs,
    • Loss: L1/MSE on timing and velocity deviations,
    • Evaluate on a separate set of performances.

Model sizes and training durations are adjusted to fit available compute (small prototypes first, scaling up if promising).


Evaluation Plan

Automatic Symbolic Metrics

We evaluate generated MIDI:

  • Tonal consistency:
    • Key detection over time (does it stay in one key?),
    • If a key is specified in the prompt, measure the match.

  • Rhythmic integrity:
    • Bars contain the correct total beats for the given meter,
    • Onset distributions per beat.

  • Density and texture (see the sketch after this list):
    • Notes-per-bar vs intended DENSITY control,
    • Inter-onset interval distribution.

  • Motivic structure:
    • Detect repeated melodic patterns or n-grams,
    • Avoid degenerate looping (e.g., the same bar repeated too often).

  • Prompt controllability:
    • For mood tags, use an emotion classifier (like EMOPIA’s baseline) on the generated MIDI and check correlation with the given MOOD token.
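
A sketch of two of these checks (bar count and notes-per-bar density) computed directly on the generated token sequence; token spellings follow the representation section.

def bar_metrics(tokens, expected_bars):
    """Count bars and notes per bar in a generated token sequence."""
    bars, notes_in_bar, densities = 0, 0, []
    for token in tokens:
        if token == "BarStart":
            if bars > 0:
                densities.append(notes_in_bar)
            bars += 1
            notes_in_bar = 0
        elif token.startswith("NoteOn_"):
            notes_in_bar += 1
    if bars > 0:
        densities.append(notes_in_bar)                     # close the final bar
    return {
        "bar_count_ok": bars == expected_bars,
        "mean_notes_per_bar": sum(densities) / max(len(densities), 1),
    }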

Automatic Audio Metrics

Especially relevant if we introduce neural renderers:

  • Audio fidelity:
    • Check for clipping, noise, artifacts.
  • Loudness normalization:
    • Output around a target LUFS.
  • Structural alignment:
    • Ensure loop boundaries are seamless (no timing drift).

For sample-based rendering, audio metrics are mainly sanity checks.

Human Evaluation

We design listening tests:

  1. Musicality & naturalness:
     • Listeners rate:
       • Overall musical coherence,
       • Naturalness / human-likeness (especially with vs without humanization).

  2. Prompt match:
     • Provide the original prompt,
     • Listeners rate how well the music fits the description.

  3. Baseline comparisons:
     • Compare:
       • Our system vs quantized playback,
       • Our system vs a naïve template-based generator,
       • Optionally vs a pure audio text-to-music model on the same prompt.

  4. Preference tests:
     • A/B comparisons:
       • “Which clip better matches: sad slow piano?”
       • “Which sounds more human/expressive?”

Success criteria for the prototype:

  • Majority of listeners prefer humanized outputs over quantized baseline,
  • Most generations rated as at least moderately musical and reasonably prompt-aligned,
  • Clear preference vs trivial baselines.

Integration and System Implementation

End-to-End Flow

  1. User prompt → generate_controls → ControlDict.
  2. Symbolic generation → generate_composition → MIDI (score).
  3. Humanization → humanize_performance → MIDI (performance).
  4. Rendering → synthesize_audio → WAV/OGG.
  5. UI plays audio; user can:
     • Regenerate all/part,
     • Change controls (length, tempo, mood, density, humanization),
     • Export audio/MIDI.

Caching & Partial Regeneration

  • Store the symbolic composition:
    • If the user only tweaks humanization or the renderer, reuse the composition.
  • For partial regeneration:
    • Identify the bar range to replace,
    • Use earlier bars as a prefix for the Transformer,
    • Generate a new continuation of the desired bar count,
    • Replace that section in the composition,
    • Re-run humanization + render.

Given clip lengths are short, it may be simplest to re-humanize and re-render the entire updated piece; performance and rendering are fast enough.

Latency and Scaling

  • Aim for end-to-end ~2–5 s on a single GPU/modern CPU for 30 s clips.
  • For multiple users:
    • Run multiple instances of the Transformer service,
    • Use a job queue for heavy renders if needed.

The system is microservice-friendly but can initially be implemented as a monolith for speed of development.

Error Handling

  • If the Transformer produces invalid structure (e.g. bar anomalies), apply post-hoc fixes or trigger a re-generation.
  • If the text parser finds no usable controls, fall back to the default mood/tempo and show the user a gentle notice.
  • Log failures and odd outputs for later analysis.

Risks and Experiment Plan

Key Risks

  1. Prompt-Conditioned Generation Quality
     • Assumption: simple control tokens (mood, tempo, meter, density) are enough for useful prompt control without text–music pairs.
     • Experiment:
       • Train a small conditioned model,
       • Generate examples for different MOOD/TEMPO/METER tags,
       • Qualitatively and quantitatively check whether outputs differ in the expected ways (e.g. “sad” pieces use more minor keys and slower tempos).

  2. Symbolic Model Musicality
     • Assumption: a flat Transformer with bar tokens can produce coherent, stylistically appropriate 15–60 s piano clips.
     • Experiment:
       • Train a prototype on MAESTRO + EMOPIA,
       • Evaluate on random prompts and unconditional generation,
       • Manually inspect for issues (excessive repetition, dissonance, meter violations).

  3. Data Sufficiency & Licensing
     • Assumption: MAESTRO, EMOPIA, and a few others provide enough coverage while being legally usable for at least prototyping, and eventually for production with clean subsets.
     • Experiment:
       • Perform a licensing audit,
       • Run an ablation: train models with/without EMOPIA and evaluate mood controllability,
       • If EMOPIA is critical but not commercial-safe, plan to replace it with in-house recorded/emotive data.

  4. Humanization Effectiveness
     • Assumption: simple rule-based humanization significantly improves perceived naturalness without breaking alignment.
     • Experiment:
       • Apply humanization with varying intensity to the same compositions,
       • A/B test with listeners: quantized vs humanized,
       • Adjust parameters based on feedback (e.g. reduce timing jitter if it is perceived as sloppy).

  5. Renderer Quality & Latency
     • Assumption: an open sample-based piano can deliver acceptable realism at interactive speeds.
     • Experiment:
       • Integrate a chosen soundfont/VST early,
       • Render example MIDIs (real and generated),
       • Check sound quality and time-to-render for typical clip lengths.

  6. UX & Prompt Handling in Practice
     • Assumption: users can phrase prompts within the supported space without confusion, and latency feels acceptable.
     • Experiment:
       • Internal pilot with a small set of users,
       • Observe how they phrase prompts,
       • Gather feedback on response time, control clarity, and output quality.

Prioritized Minimal Experiments

  1. Small conditional Transformer prototype on a subset of MAESTRO + EMOPIA:
     • Validate prompt response and musicality (Risks 1 & 2).

  2. Rule-based humanization tuning:
     • Quick internal listening tests to set humanization defaults (Risk 4).

  3. Renderer integration test:
     • Hook up the sampler, verify audio quality and performance (Risk 5).

  4. Dataset/legal review & data ablation:
     • Ensure we understand licensing and the impact of using/removing certain datasets (Risk 3).

  5. End-to-end UX pilot:
     • Simple UI prototype to test the full flow with real human prompts (Risk 6).

Summary

This design executes the Architecture A single-instrument prototype vision:

  • A symbolic-first pipeline enables:
    • Strong control over musical form, meter, and density,
    • Partial regeneration and editing,
    • Reuse of symbolic artifacts (MIDI, chords, motifs).

  • A performance layer adds human expressiveness without breaking structure.

  • A sample-based renderer ensures high audio fidelity with minimal engineering risk.

The system is:

  • Buildable now with existing tools and datasets,
  • Legally mindful (with a clear separation between research and production data),
  • Architected for future evolution:
    • Hierarchical symbolic models,
    • Learned performance models,
    • Neural instrument renderers / refiners.

This prototype provides a concrete, end-to-end demonstration of a text→symbolic→audio music system, and a foundation for scaling to multi-instrument, multi-section, and vocal-enabled architectures.


References for “Architecture A: Single-Instrument Text-to-Music Pipeline”

1. Internal design docs & research prompts

[33] Text-to-symbolic-to-audio system (design doc)
- File: Text-to-symbolic-to-audio music system.md
- Title: Text-to-symbolic-to-audio music system: Udio-style UX with a Magenta-style core
- Used for: overall product framing, licensing stance, and multi-architecture context.

[RP-T2S] Research prompt – full text-to-symbolic-to-audio system
- File: research prompt - Text-to-symbolic-to-audio music system.md
- Title: Text-to-symbolic-to-audio music system: Udio-style UX with a Magenta-style core
- Used for: scope/constraints of the full system and expectations on symbolic core vs audio renderers.

[35] Research prompt – Architecture A single-instrument prototype
- File: research prompt - architecture A - single-instrument prototype.md
- Title: Architecture A single-instrument prototype: text-to-symbolic-to-audio piano/guitar pipeline
- Used for: requirements on the single-instrument pipeline, UX, data plan, evaluation, and success criteria.

[36] Research prompt – Humanization & performance modeling
- File: research prompt - humanization and performance modeling for a single instrument.md
- Title: Humanization and performance modeling for a single instrument
- Used for: performance layer design, expressive controls, and humanization baselines.

[34] Research prompt – Reference-guided continuation
- File: research prompt - reference-guided continuation for a single instrument.md
- Title: Reference-guided continuation for a single instrument
- Used for: continuation/partial regeneration behavior and prefix-based generation assumptions.

[13]/[17]/[20]/[26]/[12] Music generation system architecture review
- File: Music Generation System Design Review.md
- Title: System Architecture for High-Fidelity Symbolic-Conditioned Audio Synthesis: A Framework for Rapid Prototyping and Validation
- Used for: high-level architecture tradeoffs, symbolic-vs-audio approaches, latency/fidelity considerations, and some dataset/licensing discussion.

Note: The numeric IDs in brackets (e.g. [33], [34], [35], [36], [13], [17], [20], [26], [12]) correspond to the internal citation markers used in the research write-up. All of them map to one of the six markdown files listed above.


2. External datasets referenced

These are standard public datasets referenced conceptually in the write-up (for training and evaluation examples):

MAESTRO (piano performance dataset)
- Hawthorne, C. et al. “Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset.”
- Large set of aligned piano MIDI + audio, used as example data source for composition and performance modeling.

EMOPIA (emotion-labeled pop piano dataset)
- Hung, H. et al. “EMOPIA: A Multi-Modal Pop Piano Dataset for Emotion Recognition and Emotion-based Music Generation.”
- Used as an example of emotion-labeled piano clips for mood conditioning.

GiantMIDI-Piano
- Kong, Q. et al. “GiantMIDI-Piano: A Large-scale MIDI Dataset of Classical Piano Music.”
- Referenced as an additional public-domain piano MIDI source.

AILabs1k7 (optional)
- AILabs1k7 multistyle piano dataset, mentioned as a potential additional corpus if licensing is clarified.

Lakh MIDI / Slakh
- Used only as examples of ambiguous-licensing datasets that should be treated as research-only unless legal clears them.


3. External models / methods mentioned

These are not required assets for the prototype, but were referenced as examples or future options:

MIDI-DDSP
- Score-/control-conditioned differentiable synthesis approach, cited as an option for neural instrument rendering.

Score-conditioned diffusion / Music ControlNet-style models
- Referenced generically as future, more complex audio renderers or control-networks for symbolic-to-audio mapping.


All citations in the research doc point to one of the six internal markdown files above or to these well-known external datasets/method families. No ghost sources lurking in the footnotes.

Sources consulted for the Architecture A research write-up

Each entry lists the internal reference ID (where applicable), type, whether it was used in the write-up, a mini summary, and a link to the source.

  • [33] (Internal doc, used in write-up)
    • Title: Text-to-symbolic-to-audio music system
    • Summary: High-level design for a text→symbolic→audio system with Udio-style UX and Magenta-style symbolic core; covers UX, licensing stance, and architecture options.
    • Source: Text-to-symbolic-to-audio music system.md (local)

  • [RP-T2S] (Internal doc, used in write-up)
    • Title: research prompt – Text-to-symbolic-to-audio music system
    • Summary: Research prompt specifying requirements, constraints, and evaluation goals for the full text-to-symbolic-to-audio system.
    • Source: research prompt - Text-to-symbolic-to-audio music system.md (local)

  • [35] (Internal doc, used in write-up)
    • Title: research prompt – architecture A – single-instrument prototype
    • Summary: Core spec for Architecture A: single-instrument (piano/guitar) text→symbolic→audio prototype, scope, UX, and success criteria.
    • Source: research prompt - architecture A - single-instrument prototype.md (local)

  • [36] (Internal doc, used in write-up)
    • Title: research prompt – humanization and performance modeling for a single instrument
    • Summary: Prompt detailing goals and design options for humanization/performance modeling (timing, velocity, pedal, articulation) for one instrument.
    • Source: research prompt - humanization and performance modeling for a single instrument.md (local)

  • [34] (Internal doc, used in write-up)
    • Title: research prompt – reference-guided continuation for a single instrument
    • Summary: Prompt describing reference-based / prefix-based continuation and partial regeneration behavior for symbolic music.
    • Source: research prompt - reference-guided continuation for a single instrument.md (local)

  • [13], [17], [20], [26], [12] (Internal doc, used in write-up)
    • Title: Music Generation System Design Review
    • Summary: System-architecture review for symbolic-conditioned audio synthesis, including design tradeoffs, latency/fidelity concerns, and dataset/licensing discussion.
    • Source: Music Generation System Design Review.md (local)

  • MAESTRO: MIDI and Audio Edited for Synchronous TRacks and Organization (Dataset / paper, used in write-up)
    • Summary: ~200 hours of aligned piano MIDI + audio from competition performances; used conceptually as the main example for composition & performance training data.
    • Source: https://magenta.withgoogle.com/datasets/maestro

  • EMOPIA: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation (Dataset / paper, used in write-up)
    • Summary: Emotion-labeled pop piano dataset (audio + MIDI, 1,087 clips); used conceptually for mood conditioning and as an example of emotion tags.
    • Source: https://annahung31.github.io/EMOPIA/

  • GiantMIDI-Piano: A Large-Scale MIDI Dataset for Classical Piano Music (Dataset / paper, used in write-up)
    • Summary: Large symbolic classical piano dataset (10k+ pieces); referenced as an additional public-domain-like source for training symbolic models.
    • Source: https://github.com/bytedance/GiantMIDI-Piano

  • AILabs.tw Pop1K7 (Ailabs1k7) (Dataset / paper, used in write-up)
    • Summary: Transcribed pop piano dataset (~1.7k pieces, ~108 h); mentioned as a potential corpus for expressive pop piano and pedal modeling, subject to licensing.
    • Source: https://zenodo.org/records/13167761

  • Lakh MIDI / similar mixed-license MIDI corpora (Dataset family, referenced conceptually)
    • Summary: Large general MIDI collections; referenced only as examples of licensing-ambiguous data that should be research-only unless cleared.
    • Source: Example: https://colinraffel.com/projects/lmd.html (not directly queried this session)

  • MIDI-DDSP / differentiable synthesis models (Method family, referenced conceptually)
    • Summary: Score-/control-conditioned differentiable audio synthesis, cited as a future option for neural instrument rendering.
    • Source: Example: https://magenta.tensorflow.org/ddsp (not directly queried this session)

  • Score-conditioned diffusion / ControlNet-style text-to-music models (Method family, referenced conceptually)
    • Summary: Family of diffusion-based audio generators controllable by symbolic score or other conditions; mentioned as future upgrade paths for the renderer.
    • Source: Representative examples in current literature (no single canonical URL consulted)

Notes
- “Used in write-up” means the information from that source directly informed wording, assumptions, or examples in the research text; “referenced conceptually” entries informed examples only.
- Internal docs are the six markdown files you uploaded; paths are shown as their filenames since they’re local to your environment.
- External datasets/methods were referenced conceptually (based on their published descriptions) and, for this table, their canonical URLs were retrieved via web search.

Addendum: Alignment with Solo Piano Humanization / Performance Module

This addendum aligns the Architecture A pipeline with the “Humanization and Performance Modeling for Solo Piano (Architecture A Pipeline)” design, without modifying existing sections.


1. Canonical Performance Module Contract

Architecture A adopts the following canonical contract for the performance module:

  • Input: MidiLikeScore (quantized symbolic score).
  • Input: PerformanceSettings (humanization and style controls).
  • Output: MidiLikePerformance (expressive performance, ready for rendering).

Formally:

  • humanize_performance(score: MidiLikeScore, settings: PerformanceSettings) -> MidiLikePerformance

The earlier informal style: str parameter is to be treated as a preset key that is expanded into a PerformanceSettings instance (see §3).


2. Shared Performance Representation

Architecture A and the humanization design share a MIDI-like representation:

  • MidiLikeScore (composer → performance module):
    • tempo_bpm
    • time_signature
    • list of note events:
      • id
      • pitch
      • start_beats (quantized grid position)
      • duration_beats
      • velocity (base score velocity)

  • MidiLikePerformance (performance module → renderer):
    • tempo_bpm
    • time_signature
    • list of note events:
      • same fields as above, but with adjusted timing, duration, and velocity
    • list of control events:
      • time_seconds
      • type (e.g. pedal, cc)
      • value (e.g. 0.0–1.0 or MIDI CC value)

Responsibility boundaries:

  • Composer: produces MidiLikeScore.
  • Performance module: converts MidiLikeScore to MidiLikePerformance.
  • Renderer: consumes MidiLikePerformance without further structural changes.

This matches the extended MIDI representation (timing deviations, dynamics, articulation and pedal) assumed in both documents.
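
A sketch of these shared types as Python dataclasses; the field names follow this section, while the concrete types and defaults are assumptions.

from dataclasses import dataclass, field
from typing import List

@dataclass
class NoteEvent:
    id: int
    pitch: int
    start_beats: float
    duration_beats: float
    velocity: int

@dataclass
class ControlEvent:
    time_seconds: float
    type: str                # e.g. "pedal", "cc"
    value: float             # 0.0–1.0 or a raw MIDI CC value

@dataclass
class MidiLikeScore:
    tempo_bpm: float
    time_signature: str
    notes: List[NoteEvent] = field(default_factory=list)

@dataclass
class MidiLikePerformance:
    tempo_bpm: float
    time_signature: str
    notes: List[NoteEvent] = field(default_factory=list)        # timing, duration, and velocity already adjusted
    controls: List[ControlEvent] = field(default_factory=list)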


3. PerformanceSettings and UX Mapping

Architecture A standardizes on a shared settings object used by the humanization module:

  • PerformanceSettings fields:
    • humanization_amount: float (0.0 = robotic, 1.0 = full effect)
    • swing: bool | None (None = default / straight)
    • tight_vs_loose: "tight" | "medium" | "loose" | None
    • pedal_style: "none" | "light" | "medium" | "heavy" | None
    • rubato_style: "none" | "subtle" | "romantic" | None
    • seed: int | None (controls deterministic randomness)

Interpretation:

  • humanization_amount is the primary global intensity control.
  • Other fields are style dimensions that modulate specific rule/model behaviours (swing feel, tight vs loose timing, pedal usage, rubato).
  • seed defines deterministic randomness for any stochastic elements.

MVP UX (non-musician flow):

  • UI exposes only a single “Humanization” slider (0–100).
  • For v0, Architecture A maps this to:
    • humanization_amount = slider_value / 100
    • swing = None
    • tight_vs_loose = "medium"
    • pedal_style = "medium"
    • rubato_style = "none"

Style presets (optional / future):

  • The previous style: str parameter is now treated as a preset label that expands to a PerformanceSettings instance.
  • Example: "swing_loose" might map to swing = True, tight_vs_loose = "loose", pedal_style = "medium", rubato_style = "subtle".
  • Presets are not required for v0 but are compatible with this schema.
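
Tying §3 together, here is a sketch of PerformanceSettings plus the v0 slider mapping and an example preset; the preset values are illustrative.

from dataclasses import dataclass
from typing import Optional

@dataclass
class PerformanceSettings:
    humanization_amount: float = 0.5
    swing: Optional[bool] = None
    tight_vs_loose: Optional[str] = "medium"
    pedal_style: Optional[str] = "medium"
    rubato_style: Optional[str] = "none"
    seed: Optional[int] = None

def settings_from_slider(slider_value: int, seed: Optional[int] = None) -> PerformanceSettings:
    """v0 UX: a single 0–100 slider drives humanization_amount; all other fields keep their defaults."""
    return PerformanceSettings(humanization_amount=slider_value / 100, seed=seed)

PRESETS = {
    "swing_loose": PerformanceSettings(humanization_amount=0.7, swing=True,
                                       tight_vs_loose="loose", pedal_style="medium",
                                       rubato_style="subtle"),
}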

4. Structural Constraints and Regeneration Behaviour

Architecture A explicitly adopts the structural constraints assumed by the humanization design:

  • Loop and bar alignment:
    • Humanization must preserve bar boundaries as seen by the rest of the system.
    • Timing deviations within each bar are local and approximately zero-sum so that bar start times remain aligned to the tempo grid.
    • Loop points at bar boundaries must remain seamless.

  • Partial regeneration:
    • For scoped edits (e.g. selected bars), the system may:
      • Slice MidiLikeScore to the edited region,
      • Call humanize_performance on that region,
      • Splice the resulting MidiLikePerformance segment back into the cached performance.
    • The performance module must avoid introducing tempo or timing discontinuities at region borders.

  • Determinism:
    • For a fixed (score, settings) pair (including seed), the module behaves deterministically.
    • Any random jitter or stochastic elements must be governed by PerformanceSettings.seed (or a documented default if seed is None) to support reproducible renders and replays.

These constraints apply equally to rule-based and learned implementations of the performance module and are assumed by all downstream components.


5. Rule-Based, Learned, and Hybrid Implementations

All performance implementations are interchangeable realizations of the same contract:

  • MidiLikeScore + PerformanceSettings -> MidiLikePerformance

Specifically:

  • Rule-based implementation:
    • Applies the described timing/velocity/articulation/pedal rules conditioned on PerformanceSettings.
    • Enforces the structural constraints in §4.

  • Learned implementation:
    • Reads MidiLikeScore and PerformanceSettings, predicts expressive parameters, and outputs a MidiLikePerformance.
    • Must respect humanization_amount as a global intensity scale, as well as the structural constraints in §4.

  • Hybrid implementation:
    • Combines rules and learned outputs internally but preserves the same external contract and constraints.

This addendum makes explicit the shared assumptions and interfaces already implicit in both the Architecture A and humanization documents.