
Reference-Guided Continuation for Solo Piano (Architecture A Integration)

1. Overview

This document specifies a reference-guided symbolic continuation system for solo piano, integrated into the Architecture A text-to-symbolic-to-audio pipeline. The goal is to let users provide a short piano snippet (symbolic) as a reference, optionally with a text prompt, and generate a continuation that:

  • Preserves key, tempo, meter, and overall feel of the reference.
  • Respects stylistic features (rhythm, density, register, articulation tendencies).
  • Avoids trivial copy-paste or near-duplicate continuation.

The continuation operates at the symbolic level, then passes through the existing performance / humanization module and instrument renderer as in Architecture A.

High-level properties:

  • Input: symbolic reference (MIDI / internal tokens) + optional text prompt + continuation controls.
  • Output: symbolic continuation appended to the reference, then humanized and rendered to audio.
  • Models:
    • Minimal: autoregressive Transformer in prefix-continuation mode.
    • Advanced: optional style-aware / motif-aware variant with a reference encoder.
  • UX: select segment → set length/similarity/energy/mood → preview multiple continuation options.
  • Evaluation: symbolic coherence metrics + human listening tests.

This design is aligned with:

  • The Architecture A single-instrument pipeline for solo piano.
  • The Humanization and Performance Modeling for Solo Piano module for expressive rendering.
  • The reference-guided continuation research prompt and its scope/constraints.

2. Input & Representation

2.1 Reference input

Primary input is a symbolic reference segment for solo piano:

  • Format: same event-based, bar-structured tokenization as Architecture A:
    • BarStart tokens for bar boundaries.
    • Position_k tokens for positions within a bar.
    • NoteOn_pitch, Duration_x, Velocity_bin tokens for notes.
    • Global context tokens (tempo, meter, optional key) at sequence start.
  • Source:
    • Imported MIDI (user upload).
    • A segment of a previously generated clip (internal MIDI).
  • Length:
    • Target v0: 2–16 bars, typically 4–8 bars, aligned to bar boundaries for simplicity.

2.2 Global musical context

The reference carries or induces:

  • Tempo (<TEMPO_XXX> token).
  • Meter (<METER_4/4>, <METER_3/4>, etc.).
  • Key / scale (optional explicit key token, or inferred via key detection on reference notes).
  • Clip scope: short local continuation (4–16 bars) rather than whole-song structure.

These are encoded as special tokens at the start:

<TEMPO_80> <METER_4/4> <KEY_CMINOR> <MOOD_SAD> BarStart ...

If the reference lacks metadata, a lightweight analyzer infers tempo, meter, and key; defaults are used if inference fails.
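
As one possible shape for that lightweight analyzer, the sketch below infers a key token by correlating a duration-weighted pitch-class histogram against Krumhansl-style major/minor profiles. The note-tuple input, helper name, profile constants, and token spellings beyond those shown in this document are illustrative assumptions, not existing pipeline code.

# Minimal key-inference sketch for the fallback analyzer.
import numpy as np

MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR_PROFILE = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                          2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def infer_key(notes, default="<KEY_CMAJOR>"):
    """notes: list of (midi_pitch, duration_beats). Returns a key token or a default."""
    if not notes:
        return default
    hist = np.zeros(12)
    for pitch, dur in notes:                      # weight pitch classes by duration
        hist[pitch % 12] += dur
    if hist.std() == 0:
        return default                            # degenerate reference, keep default
    names = ["C", "CSHARP", "D", "DSHARP", "E", "F",
             "FSHARP", "G", "GSHARP", "A", "ASHARP", "B"]
    best, best_score = default, -np.inf
    for tonic in range(12):
        for profile, mode in ((MAJOR_PROFILE, "MAJOR"), (MINOR_PROFILE, "MINOR")):
            score = np.corrcoef(np.roll(profile, tonic), hist)[0, 1]
            if score > best_score:
                best, best_score = f"<KEY_{names[tonic]}{mode}>", score
    return best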

2.3 Text prompt

An optional text prompt refines high-level intent, consistent with Architecture A’s prompt-to-controls mapping.

Examples:

  • “Continue this for 8 bars in the same style.”
  • “Make this pattern evolve into something more dramatic over the next 8 bars.”
  • “Keep the rhythm but make it calmer and sparser.”

Parsed into control fields:

  • mood (e.g. calm, dramatic, melancholic).
  • energy_change (down / same / up).
  • density_change (sparser / similar / denser).
  • similarity_level (low / medium / high adherence to reference style).
  • length_bars (4–16).

Mapped to control tokens:

<MOOD_DRAMATIC> <ENERGY_UP> <DENSITY_HIGHER> <SIMILARITY_HIGH> <LEN_8BARS>
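
A minimal sketch of this field-to-token mapping is shown below. Token spellings not listed in this document (e.g. <ENERGY_SAME>, <DENSITY_SAME>, <SIMILARITY_MED>) and the function name are illustrative assumptions.

# Illustrative mapping from parsed control fields to control tokens.
def controls_to_tokens(controls: dict) -> list[str]:
    tokens = []
    if controls.get("mood"):
        tokens.append(f"<MOOD_{controls['mood'].upper()}>")          # e.g. <MOOD_DRAMATIC>
    energy = {"down": "<ENERGY_DOWN>", "same": "<ENERGY_SAME>", "up": "<ENERGY_UP>"}
    density = {"sparser": "<DENSITY_LOWER>", "similar": "<DENSITY_SAME>",
               "denser": "<DENSITY_HIGHER>"}
    similarity = {"low": "<SIMILARITY_LOW>", "medium": "<SIMILARITY_MED>",
                  "high": "<SIMILARITY_HIGH>"}
    tokens.append(energy[controls.get("energy_change", "same")])
    tokens.append(density[controls.get("density_change", "similar")])
    tokens.append(similarity[controls.get("similarity_level", "medium")])
    tokens.append(f"<LEN_{int(controls.get('length_bars', 8))}BARS>")
    return tokens

# controls_to_tokens({"mood": "dramatic", "energy_change": "up",
#                     "density_change": "denser", "similarity_level": "high"})
# -> ['<MOOD_DRAMATIC>', '<ENERGY_UP>', '<DENSITY_HIGHER>', '<SIMILARITY_HIGH>', '<LEN_8BARS>']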

2.4 User controls (explicit sliders/toggles)

In the UX, these map to simple controls:

  • Length: number of bars / seconds.
  • Similarity to reference: low → high.
  • Energy / density change: down / same / up.
  • Mood change: optional (keep / switch mood).

Controls are encoded as tokens and/or scalar parameters used during sampling.

2.5 Reference / continuation boundary

Symbolically:

  • The reference is a prefix sequence R.
  • The continuation is a generated suffix C.
  • Full sequence is R || C.

Internally, we may optionally insert a marker token at the boundary:

... (last reference bar tokens) <CONTINUE> BarStart (first continuation bar)

but the minimal design can simply treat “end of prefix tokens” as the boundary.

The continuation always starts at a bar boundary to ensure audio-level seamlessness and to align with the performance module’s bar-based constraints.


3. Continuation Model Designs

3.1 Minimal model: autoregressive Transformer continuation

3.1.1 Model family

Use Architecture A’s symbolic core: an autoregressive Transformer decoder trained on solo piano token sequences.

  • Architecture:
    • Transformer decoder, ~8–12 layers, hidden size ~512, multi-head self-attention.
    • Relative positional encodings.
  • Training:
    • Next-token prediction on piano token sequences (e.g. MAESTRO-style symbolic data).
    • Control tokens (tempo, meter, mood, density, length) prepended as in Architecture A.

3.1.2 Training for continuation

We train the same model to handle prefix→continuation tasks implicitly:

  • For each training piece:
    • Sample a random cut point in bars: prefix R (e.g. first 4–16 bars), continuation C (next 4–16 bars).
    • Input: tokens for R and any control tokens derived from metadata.
    • Target: tokens for R || C (the model is teacher-forced over the full sequence).
  • Because the Transformer is autoregressive over the entire sequence, it learns to:
    • Use earlier bars as context for later bars.
    • Maintain key/tempo/meter and local style across the cut.

We do not require a separate “continuation-only” model. We leverage the same core that also supports prompt-only generation, as in Architecture A.
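
A sketch of how one such prefix/continuation training example could be assembled is given below, assuming the tokenizer exposes per-bar token spans; the helper names and the example format are assumptions for illustration.

# Build one prefix/continuation example from a tokenized piece.
import random

def make_continuation_example(tokens, bar_token_spans, control_tokens,
                              min_bars=4, max_bars=16):
    """bar_token_spans: list of (start_idx, end_idx) token spans, one per bar."""
    n_bars = len(bar_token_spans)
    if n_bars < 2 * min_bars:
        return None                                    # piece too short to split
    prefix_bars = random.randint(min_bars, min(max_bars, n_bars - min_bars))
    cont_bars = random.randint(min_bars, min(max_bars, n_bars - prefix_bars))
    cut = bar_token_spans[prefix_bars][0]              # first token of the continuation
    end = bar_token_spans[prefix_bars + cont_bars - 1][1]
    # Teacher forcing runs over the whole sequence R || C; the loss can optionally
    # be masked so that only continuation tokens contribute.
    input_ids = control_tokens + tokens[:end]
    return {"input_ids": input_ids, "continuation_start": len(control_tokens) + cut}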

3.1.3 Inference in continuation mode

Given reference tokens R and controls:

  1. Build prefix:

    BarStart ... R

  2. Truncate R to last N bars (e.g. N = 8) if it exceeds the context length.

  3. Call the Transformer in inference mode, feeding the prefix and sampling until:
    • We have generated length_bars new BarStart tokens after the boundary; or
    • An <END> token appears.

  4. Decode tokens into a structured MIDI-like representation.

Coherence arises because the model is conditioned on the actual reference bars, including:

  • Harmony (pitch patterns).
  • Rhythm and note density.
  • Register and contour.

The model learns to “continue a story” rather than start afresh.
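
The decoding loop described in steps 1–4 could look roughly like the sketch below; model.sample_next and keep_last_bars are assumed helpers for illustration, not an existing API.

# Continuation-mode decoding: sample until enough new bars (or <END>) appear.
def generate_continuation(model, prefix_tokens, length_bars,
                          max_prefix_bars=8, max_tokens=2048):
    prefix = keep_last_bars(prefix_tokens, max_prefix_bars)   # hypothetical truncation helper
    seq = list(prefix)
    new_bars = 0
    while len(seq) < max_tokens:
        tok = model.sample_next(seq)              # top-p / top-k sampling happens inside
        if tok == "<END>":
            break
        seq.append(tok)
        if tok == "BarStart":
            new_bars += 1
            if new_bars > length_bars:            # requested bars are complete
                seq.pop()                         # drop the extra BarStart
                break
    return seq[len(prefix):]                      # continuation tokens only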

3.1.4 Novelty vs repetition controls

To prevent trivial repetition:

  • Decoding constraints:
    • Penalize re-emission of long n-grams seen in reference R.
    • Limit exact 1:1 reuse of whole bars from R (except for musically natural repetition).
  • Sampling:
    • Use top-p / top-k sampling with temperature.
    • For high similarity:
      • Lower temperature (more conservative outputs).
      • Weak n-gram penalties (allow moderate motif reuse).
    • For low similarity:
      • Higher temperature, stronger penalties (more novel direction).

These controls are wired to the Similarity slider.
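
One way to implement the reference n-gram penalty is sketched below: index the (n-1)-token contexts seen in the reference and push down the logits of any token that would reproduce a full reference n-gram. The value of n, the penalty strength, and the names are illustrative.

# Reference n-gram penalty for decoding (sketch).
def build_ngram_index(ref_tokens, n=8):
    """Map every (n-1)-token context in the reference to the tokens that follow it."""
    index = {}
    for i in range(len(ref_tokens) - n + 1):
        ctx = tuple(ref_tokens[i:i + n - 1])
        index.setdefault(ctx, set()).add(ref_tokens[i + n - 1])
    return index

def apply_ngram_penalty(logits, generated, ngram_index, vocab, n=8, penalty=3.0):
    """Subtract `penalty` from any token that would complete a reference n-gram."""
    if len(generated) < n - 1:
        return logits
    ctx = tuple(generated[-(n - 1):])
    for tok in ngram_index.get(ctx, ()):          # tokens the reference used after this context
        logits[vocab[tok]] -= penalty
    return logits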

3.2 Advanced model: style embedding & motif-aware continuation (optional)

For higher-quality v1+:

3.2.1 Reference encoder

Add a reference encoder (small Transformer or BiLSTM) that processes the reference tokens R and outputs:

  • Global style embedding s:
    • Encodes key center, texture, rhythmic profile, phrase length tendencies.
  • Local motif descriptors:
    • Summaries of important 1–4 bar motifs (melodic/rhythmic).

The continuation decoder conditions on:

  • The prefix tokens (last N bars).
  • The style embedding s (concatenated or injected as conditioning at each step).

This provides robustness when:

  • The prefix is long (beyond main context window).
  • We want to dial similarity up/down by interpolating between s and a neutral embedding.
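
A minimal sketch of that similarity-driven interpolation, assuming the neutral embedding is a dataset-mean style vector and the slider is normalized to [0, 1]:

# Similarity slider as interpolation between reference style and a neutral embedding.
import torch

def blend_style(style_emb: torch.Tensor, neutral_emb: torch.Tensor,
                similarity: float) -> torch.Tensor:
    """similarity = 1.0 keeps the reference style; 0.0 falls back to the neutral embedding."""
    similarity = float(max(0.0, min(1.0, similarity)))
    return similarity * style_emb + (1.0 - similarity) * neutral_emb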

3.2.2 Motif-aware continuation

We can add a simple copy-with-variation strategy:

  • Detect 1–2 high-salience motifs in the reference using:
    • Pitch contour patterns.
    • Rhythmic n-gram frequency.
  • During generation:
    • Encourage re-use of motifs with transformations:
      • Transposition within key.
      • Rhythmic displacement.
      • Inversion / augmentation for development.
  • Mechanism:
    • Either integrated into the model (e.g. via attention bias to motif positions),
    • Or via post-processing that searches the continuation; if no motif from the reference appears, we bias sampling or patch in motif variations.

This produces thematic development rather than arbitrary continuation or pure repetition.
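
A simple version of the motif detector based on melodic-interval and rhythmic n-gram frequency could look like the sketch below; the note representation, n-gram length, and thresholds are assumptions.

# Extract the most frequent melodic/rhythmic n-grams as candidate motifs (sketch).
from collections import Counter

def salient_motifs(notes, n=4, top_k=2):
    """notes: list of (onset_beats, midi_pitch) sorted by onset. Returns up to top_k motifs."""
    grams = Counter()
    for i in range(len(notes) - n):
        window = notes[i:i + n + 1]
        intervals = tuple(b[1] - a[1] for a, b in zip(window, window[1:]))      # pitch steps
        iois = tuple(round(b[0] - a[0], 3) for a, b in zip(window, window[1:])) # rhythm
        grams[(intervals, iois)] += 1
    return [motif for motif, count in grams.most_common(top_k) if count > 1]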

3.2.3 Hierarchical bar-level planning (optional)

Borrowing from hierarchical design ideas in Architecture A:

  • Stage 1: bar-level planner:
    • Given the reference summary + controls, generate a sequence of bar-level embeddings for the continuation:
      • Per-bar: intensity, harmonic region, density.
  • Stage 2: note-level generator:
    • Generate notes per bar conditioned on that bar’s embedding and the prefix.

Benefits:

  • Stronger control over:
    • Gradual energy ramp (“build over 8 bars”).
    • Harmonic direction (e.g. remain in key, optional modulation).
  • Clear mapping from user controls (energy / mood) to the bar-level plan.

This is a more complex future variant and not necessary for v0.


4. Integration with Architecture A Pipeline

4.1 Symbolic generation stage

Architecture A’s symbolic generator is already structured as:

generate_composition(controls: ControlDict, prefix: Optional[MIDI]) -> MIDI

For continuation:

  • controls comes from:
    • Text prompt parsing.
    • UI sliders for length, similarity, energy, mood.
  • prefix is the reference MIDI, converted to Architecture A’s token format.

Flow:

  1. Parse reference into tokens.
  2. Infer / confirm tempo, meter, key; prepend tokens.
  3. Add control tokens for continuation (length, similarity, energy change).
  4. Call generate_composition with prefix set to reference.
  5. Decode tokens to MIDI: combined reference + continuation.

The symbolic output is structurally identical to any other Architecture A composition.
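
Putting the flow together, a continuation entry point on top of generate_composition could look like the sketch below; parse_prompt, infer_global_context, humanize, and render_audio are stand-ins for existing pipeline pieces, not confirmed function names.

# Continuation entry point wired onto generate_composition(controls, prefix) (sketch).
def continue_reference(reference_midi, prompt_text, ui_controls):
    controls = {**parse_prompt(prompt_text), **ui_controls}       # merged ControlDict
    controls.update(infer_global_context(reference_midi))         # tempo / meter / key if missing
    combined_midi = generate_composition(controls=controls, prefix=reference_midi)
    performance_midi = humanize(combined_midi)                    # existing performance module
    return render_audio(performance_midi), combined_midi          # audio + reference||continuation MIDI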

4.2 Performance / humanization stage

Use the existing performance module for solo piano:

  • Input: clean quantized MIDI (reference + continuation).
  • Output: expressive performance MIDI with:
    • Timing deviations.
    • Velocity shaping.
    • Articulation changes.
    • Pedal events.

Key integration points:

  • Boundary handling:
    • The continuation starts on a bar boundary.
    • Humanization rules are designed to keep bar length invariant (zero-mean timing deviations per bar) to preserve alignment and looping.
    • At the reference→continuation boundary:
      • Ensure no cumulative tempo drift in the last reference bar.
      • Optionally run performance rules with awareness of the boundary to avoid abrupt changes.
  • Pedal continuity:
    • If sustain is down at the end of the reference:
      • Option 1: treat the boundary as a phrase break and lift the pedal.
      • Option 2: keep the pedal if the harmony doesn’t change and the style suggests legato.
    • The existing performance design already covers bar-level and phrase-level pedal handling and can be extended with a boundary-aware heuristic.

We do not need a separate performance model for the continuation; the same module runs on the full sequence.
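
The boundary-aware pedal heuristic described under “Pedal continuity” could be sketched roughly as below; the event format, timing offsets, and the pitch_classes_in_bar helper are assumptions for illustration.

# Boundary pedal heuristic (sketch): lift the pedal at the join unless harmony carries over.
def resolve_boundary_pedal(performance_events, boundary_time,
                           last_ref_bar, first_cont_bar):
    same_harmony = pitch_classes_in_bar(last_ref_bar) == pitch_classes_in_bar(first_cont_bar)
    if same_harmony:
        return performance_events                     # Option 2: keep pedal for legato
    # Option 1: treat the boundary as a phrase break and re-pedal shortly after it.
    performance_events.append({"type": "pedal_off", "time": boundary_time - 0.02})
    performance_events.append({"type": "pedal_on",  "time": boundary_time + 0.05})
    return performance_events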

4.3 Audio rendering stage

The piano renderer (sample-based instrument in v0) remains unchanged:

  • Input: performance MIDI.
  • Output: stereo audio.

Because the continuation is a direct extension of the same symbolic/performance representation:

  • Timbre is consistent (same virtual piano, same FX chain).
  • The piece sounds like one continuous performance.

4.4 Caching and partial regeneration

Caching strategy:

  • Reference:
    • Keep the reference MIDI and its performance version cached across continuation attempts.
  • Continuation:
    • Each generated continuation has its own:
      • Symbolic MIDI.
      • Performance MIDI.
      • Rendered audio.

For a new continuation with the same reference:

  • Option A (simplest for v0):
    • Re-run performance on the full reference + continuation.
  • Option B (optimization):
    • Reuse the cached performance for the reference bars; humanize only the continuation, plus possibly a 1-bar overlap to ensure continuity.

Given short clip lengths and low cost of humanization, v0 can start with Option A.
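
For Option A, a simple cache key over the reference bytes, control settings, and sampling seed is enough to reuse cached artifacts across attempts with the same reference; the sketch below is illustrative only.

# Cache key for continuation artifacts (sketch).
import hashlib, json

def cache_key(reference_midi_bytes: bytes, controls: dict, seed: int) -> str:
    payload = json.dumps(controls, sort_keys=True).encode() + reference_midi_bytes
    return hashlib.sha256(payload + str(seed).encode()).hexdigest()

# cache[cache_key(ref_bytes, controls, seed)] = {"symbolic": ..., "performance": ..., "audio": ...}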


5. UX Flows

5.1 Primary continuation flow

  1. Select / import reference:
    • The user either:
      • Uploads a short solo-piano MIDI, or
      • Selects bars from an existing generated clip (e.g. the last 4–8 bars).
    • The UI visualizes the reference as a piano roll or basic notation.

  2. Configure continuation:
    • Controls:
      • Length: e.g. 4, 8, or 16 bars.
      • Similarity: e.g. “very close”, “medium”, “more exploratory”.
      • Energy: “down”, “same”, “up”.
      • Optional mood change: “keep mood” / “more hopeful” / “darker”, etc.
    • Optional text prompt field:
      • “Make this pattern evolve into something more dramatic.”

  3. Generate:
    • When the user presses “Continue”, the system:
      • Parses controls and prompt into tokens.
      • Runs the symbolic continuation model with the reference as prefix.
      • Applies performance humanization.
      • Renders audio.
    • The UI then:
      • Appends the continuation visually to the reference in the timeline.
      • Auto-plays the combined reference + continuation.

  4. Refine:
    • The user can:
      • Regenerate the continuation (different random seed).
      • Adjust similarity/energy sliders and regenerate.
      • Change length and re-run.

  5. Export:
    • Export options:
      • Full audio (WAV/OGG).
      • Combined MIDI (reference + continuation).
    • Same export UX as baseline Architecture A generation.

5.2 Variants & advanced flows

  • Alternative endings:
    • The user keeps the same reference and generates multiple continuations.
    • The UI allows A/B/C listening and selection of the favorite.
  • Chain continuation:
    • After one continuation is accepted, the user can use the last few bars of the new piece as the next reference, chaining sections.
  • Scoped variation:
    • Select only the last N bars of the continuation and regenerate them, leaving both the initial reference and the earlier continuation intact (reuses Architecture A’s partial-regen semantics).

6. Evaluation Plan

6.1 Automatic symbolic metrics

We assess both coherence and novelty between reference and continuation.

  1. Key and scale consistency:
    • Run key detection separately on the reference and the continuation.
    • Measure:
      • Key match rate.
      • Degree of scale overlap.
    • Expect the same key/mode unless the user requested modulation.

  2. Meter & tempo consistency:
    • Verify:
      • All continuation bars sum to the correct number of beats for the meter (e.g. 4 quarters in 4/4).
      • Tempo tokens remain consistent (or follow user-requested changes).
    • Flag any anomalies as model / decoding bugs.

  3. Density & rhythmic profile:
    • Compute notes per bar and IOI (inter-onset interval) distributions, comparing reference vs continuation, adjusted for the user’s energy/density controls.
    • Expectations:
      • With “same energy/density”: similar ranges.
      • With “energy up”: increased density and more active rhythms, but still musically plausible.

  4. Motif similarity & structural continuity:
    • Extract n-gram patterns (e.g. melodic 3–5 note sequences) from the reference.
    • Compute:
      • Presence of these motifs (or transformed versions) in the continuation.
      • Proportion of the continuation covered by references to these motifs.
    • We want:
      • Some motif reuse (for thematic coherence).
      • No long exact clones of reference bars (avoid trivial repetition).
    • Also check transitions: the start of the continuation should harmonically and rhythmically “answer” or extend the last bar of the reference, not contradict it abruptly.

  5. Repetition / looping diagnosis (see the sketch after this list):
    • Detect:
      • Exact bar repeats in the continuation.
      • Short repeated cycles (e.g. 1–2 bars looped > 3 times).
    • Apply thresholds: repetition within reason is allowed (e.g. a repeated ostinato), but degenerate loops should be rare.
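
A minimal sketch of two of these checks (exact-bar repetition and motif coverage), assuming bars and melodies are available as simple tuples; the representations and thresholds are assumptions.

# Symbolic coherence/novelty checks (sketch).
def exact_bar_repeats(cont_bars):
    """Fraction of continuation bars that exactly copy an earlier continuation bar."""
    seen, repeats = set(), 0
    for bar in cont_bars:
        key = tuple(bar)                       # bar = tuple of note events
        repeats += key in seen
        seen.add(key)
    return repeats / max(1, len(cont_bars))

def motif_coverage(ref_pitches, cont_pitches, n=4):
    """Share of continuation n-grams (by melodic interval) that also occur in the reference."""
    def grams(pitches):
        return {tuple(b - a for a, b in zip(pitches[i:i + n], pitches[i + 1:i + n + 1]))
                for i in range(len(pitches) - n)}
    ref, cont = grams(ref_pitches), grams(cont_pitches)
    return len(ref & cont) / max(1, len(cont))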

6.2 Automatic performance / audio checks

Given the existing performance module:

  • Confirm:
    • No excessive timing drift across the full reference + continuation (bar boundaries remain stable).
    • Loudness and dynamic range are comparable between reference and continuation, or follow the energy control.
  • Audio:
    • No clipping, obvious clicks, or artifacts around the boundary (see the sketch below).
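
The boundary audio check can be as simple as the sketch below, which looks for clipping and large sample-to-sample jumps in a short window around the join; mono audio and the exact thresholds are assumptions.

# Boundary audio check (sketch).
import numpy as np

def boundary_artifacts(audio: np.ndarray, sample_rate: int, boundary_sec: float,
                       window_sec: float = 0.25):
    start = max(0, int((boundary_sec - window_sec) * sample_rate))
    end = min(len(audio), int((boundary_sec + window_sec) * sample_rate))
    window = audio[start:end]
    return {
        "clipping": bool(np.any(np.abs(window) >= 0.999)),
        "max_sample_jump": float(np.max(np.abs(np.diff(window)))) if len(window) > 1 else 0.0,
    }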

6.3 Human listening tests

  1. Coherence rating:
    • Present listeners with the reference segment, followed by the continuation.
    • Ask: “How well does the second part follow the first?” (1–5 scale).
    • Success criterion: the majority of continuations are rated as at least moderately coherent.

  2. Style match rating:
    • Ask: “How similar is the style of the continuation to the reference?” (1–5).
    • Compare across high vs medium vs low similarity settings.

  3. Instruction compliance:
    • Provide the text or slider instructions used (e.g. “more energetic, 8 bars”).
    • Ask: “Did the continuation match these instructions?” (1–5).
    • Evaluate the accuracy of the energy/mood change controls.

  4. Baseline comparisons:
    • Conditions: our continuation vs:
      • A naive baseline (random clip appended).
      • Trivial repetition (copy the last bar N times).
    • Ask: “Which continuation feels more musical and coherent?”
    • Expect a strong preference for our model.

  5. Expert review (optional):
    • Have pianists/composers annotate:
      • Harmony continuity.
      • Voice-leading quality.
      • Phrase structure.
7. Risks & Minimal Experiments

7.1 Key risks

  1. Style drift vs over-adhesion:
    • The model might:
      • Drift away from the reference style.
      • Or overfit and copy the reference verbatim.

  2. Weak effect of user controls:
    • Sliders for similarity/energy might not translate into clear musical differences.

  3. Boundary artifacts:
    • Audible discontinuities at the reference→continuation join despite symbolic coherence.

  4. Out-of-distribution references:
    • The user reference may be outside the model’s style (e.g. jazz when the model is trained on cinematic piano).

  5. Latency:
    • Long references may increase generation time beyond interactive thresholds.

7.2 Minimal experiments

Experiment 1: Prefix–continuation fidelity

  • Take held-out pieces from training data.
  • Use first N bars as reference, ask model to generate next N bars.
  • Evaluate:
    • Symbolic metrics (key, meter, density, motif reuse).
    • Human ratings of coherence and style match.
  • If drift is frequent:
    • Strengthen training with more prefix/continuation examples.
    • Adjust decoding to be more conservative near the prefix boundary.

Experiment 2: Control responsiveness

  • Fix a reference and generate multiple continuations with different control settings:
    • Similarity: low / medium / high.
    • Energy: down / same / up.
  • Measure:
    • Changes in density, dynamic range, register, and motif reuse.
  • Run a quick listening test:
    • “Which one sounds more energetic?”
  • If differences are weak:
    • Revisit the mapping from controls → tokens/parameters.
    • Consider additional conditioning channels (e.g. bar-level intensity embeddings).

Experiment 3: Boundary smoothness

  • Generate continuations for a variety of references.
  • Listen focusing on the join.
  • Check:
    • Timing alignment.
    • Pedal continuity.
    • Sudden changes in texture/dynamics.
  • If artifacts appear:
    • Add boundary-specific rules in the performance module (e.g. phrase-end smoothing).
    • Possibly generate 1–2 overlapping bars and cross-fade symbolically or at the audio level.

Experiment 4: OOD reference robustness

  • Use references from:
    • User-supplied MIDI outside the training style.
    • Extreme tempos, keys, densities.
  • Observe:
    • Does the model still produce musically coherent continuations?
  • If behavior is poor:
    • Clarify the v0 style scope in the product (e.g. “optimized for cinematic / emotive solo piano”).
    • Add style detection to warn when references are far outside the supported domain.

Experiment 5: Latency profiling

  • Benchmark continuation generation across:
    • Different prefix lengths (4, 8, 16 bars).
    • Different continuation lengths.
  • Measure:
    • End-to-end latency (symbolic + performance + render).
  • If latency is too high:
    • Limit the maximum number of prefix bars considered.
    • Cache prefix representations for reuse (KV caches in the Transformer).
    • Optionally provide a “preview” mode with a shorter continuation or less humanization.

8. References

8.1 Internal design docs & research prompts

  • Ref 43 (research prompt), “Reference-guided continuation for a single instrument”: defines the product framing, objectives, questions, and scope for single-instrument continuation.
  • Ref 48 (design doc), “Architecture A: Single-Instrument Text-to-Music Pipeline (Solo Piano Prototype)”: provides the overall text→symbolic→audio pipeline, representation, and UX for solo piano.
  • Ref 46 (design doc), “Humanization and Performance Modeling for Solo Piano (Architecture A Pipeline)”: defines the performance representation and humanization module integrated into Architecture A.
  • Ref 44 (research prompt), “Architecture A single-instrument prototype: text-to-symbolic-to-audio piano/guitar pipeline”: supplies the original questions and constraints for Architecture A’s symbolic core and evaluation.
  • Ref 45 (research prompt), “Humanization and performance modeling for a single instrument”: higher-level performance modeling goals that the solo-piano doc specializes and implements.

8.2 External datasets & methods (conceptual references)

These are conceptual references used for training/inspiration; details are described in the Architecture A and humanization docs.

  • MAESTRO (piano dataset): source of expressive solo-piano performances for symbolic and performance modeling.
  • EMOPIA (emotion-labeled piano dataset): provides emotion tags and modern piano clips for mood-conditioned generation.
  • GiantMIDI-Piano (piano dataset): large classical piano corpus for additional symbolic training.
  • MIDI-DDSP (method family): example of a possible future neural renderer; not required for continuation v0.

All continuation design choices are constructed to be compatible with these underlying architecture and performance modules, so the continuation feature can be added as a thin extension over the existing pipeline, reusing as much infrastructure and representation as possible.