
Reference-Guided Continuation for Solo Piano (Architecture A Integration)

1. Overview

This document specifies a reference-guided symbolic continuation system for solo piano, integrated into the Architecture A text-to-symbolic-to-audio pipeline. The goal is to let users provide a short piano snippet (symbolic) as a reference, optionally with a text prompt, and generate a continuation that:

  • Preserves key, tempo, meter, and overall feel of the reference.
  • Respects stylistic features (rhythm, density, register, articulation tendencies).
  • Avoids trivial copy-paste or near-duplicate continuation.

The continuation operates at the symbolic level, then passes through the existing performance / humanization module and instrument renderer as in Architecture A.

High-level properties:

  • Input: symbolic reference (MIDI / internal tokens) + optional text prompt + continuation controls.
  • Output: symbolic continuation appended to the reference, then humanized and rendered to audio.
  • Models:
    • Minimal: autoregressive Transformer in prefix-continuation mode.
    • Advanced: optional style-aware / motif-aware variant with a reference encoder.
  • UX: select segment → set length/similarity/energy/mood → preview multiple continuation options.
  • Evaluation: symbolic coherence metrics + human listening tests.

This design is aligned with:

  • The Architecture A single-instrument pipeline for solo piano.
  • The Humanization and Performance Modeling for Solo Piano module for expressive rendering.
  • The reference-guided continuation research prompt and its scope/constraints.

2. Input & Representation

2.1 Reference input

Primary input is a symbolic reference segment for solo piano:

  • Format: same event-based, bar-structured tokenization as Architecture A:
    • BarStart tokens for bar boundaries.
    • Position_k tokens for positions within a bar.
    • NoteOn_pitch, Duration_x, Velocity_bin tokens for notes.
    • Global context tokens (tempo, meter, optional key) at sequence start.
  • Source:
    • Imported MIDI (user upload).
    • A segment of a previously generated clip (internal MIDI).
  • Length:
    • Target v0: 2–16 bars, typically 4–8 bars, aligned to bar boundaries for simplicity.

2.2 Global musical context

The reference carries or induces:

  • Tempo (<TEMPO_XXX> token).
  • Meter (<METER_4/4>, <METER_3/4>, etc.).
  • Key / scale (optional explicit key token, or inferred via key detection on reference notes).
  • Clip scope: short local continuation (4–16 bars) rather than whole-song structure.

These are encoded as special tokens at the start:

<TEMPO_80> <METER_4/4> <KEY_CMINOR> <MOOD_SAD> BarStart ...

If the reference lacks metadata, a lightweight analyzer infers tempo, meter, and key; defaults are used if inference fails.
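
As one possible shape for that lightweight analyzer, the sketch below infers a key token by correlating a duration-weighted pitch-class histogram against Krumhansl-style major/minor profiles. The note-tuple input, helper name, profile constants, and token spellings beyond those shown in this document are illustrative assumptions, not existing pipeline code.

# Minimal key-inference sketch for the fallback analyzer.
import numpy as np

MAJOR_PROFILE = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                          2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR_PROFILE = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                          2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def infer_key(notes, default="<KEY_CMAJOR>"):
    """notes: list of (midi_pitch, duration_beats). Returns a key token or a default."""
    if not notes:
        return default
    hist = np.zeros(12)
    for pitch, dur in notes:                      # weight pitch classes by duration
        hist[pitch % 12] += dur
    if hist.std() == 0:
        return default                            # degenerate reference, keep default
    names = ["C", "CSHARP", "D", "DSHARP", "E", "F",
             "FSHARP", "G", "GSHARP", "A", "ASHARP", "B"]
    best, best_score = default, -np.inf
    for tonic in range(12):
        for profile, mode in ((MAJOR_PROFILE, "MAJOR"), (MINOR_PROFILE, "MINOR")):
            score = np.corrcoef(np.roll(profile, tonic), hist)[0, 1]
            if score > best_score:
                best, best_score = f"<KEY_{names[tonic]}{mode}>", score
    return best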

2.3 Text prompt

An optional text prompt refines high-level intent, consistent with Architecture A’s prompt-to-controls mapping.

Examples:

  • “Continue this for 8 bars in the same style.”
  • “Make this pattern evolve into something more dramatic over the next 8 bars.”
  • “Keep the rhythm but make it calmer and sparser.”

Parsed into control fields:

  • mood (e.g. calm, dramatic, melancholic).
  • energy_change (down / same / up).
  • density_change (sparser / similar / denser).
  • similarity_level (low / medium / high adherence to reference style).
  • length_bars (4–16).

Mapped to control tokens:

<MOOD_DRAMATIC> <ENERGY_UP> <DENSITY_HIGHER> <SIMILARITY_HIGH> <LEN_8BARS>
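
A minimal sketch of this field-to-token mapping is shown below. Token spellings not listed in this document (e.g. <ENERGY_SAME>, <DENSITY_SAME>, <SIMILARITY_MED>) and the function name are illustrative assumptions.

# Illustrative mapping from parsed control fields to control tokens.
def controls_to_tokens(controls: dict) -> list[str]:
    tokens = []
    if controls.get("mood"):
        tokens.append(f"<MOOD_{controls['mood'].upper()}>")          # e.g. <MOOD_DRAMATIC>
    energy = {"down": "<ENERGY_DOWN>", "same": "<ENERGY_SAME>", "up": "<ENERGY_UP>"}
    density = {"sparser": "<DENSITY_LOWER>", "similar": "<DENSITY_SAME>",
               "denser": "<DENSITY_HIGHER>"}
    similarity = {"low": "<SIMILARITY_LOW>", "medium": "<SIMILARITY_MED>",
                  "high": "<SIMILARITY_HIGH>"}
    tokens.append(energy[controls.get("energy_change", "same")])
    tokens.append(density[controls.get("density_change", "similar")])
    tokens.append(similarity[controls.get("similarity_level", "medium")])
    tokens.append(f"<LEN_{int(controls.get('length_bars', 8))}BARS>")
    return tokens

# controls_to_tokens({"mood": "dramatic", "energy_change": "up",
#                     "density_change": "denser", "similarity_level": "high"})
# -> ['<MOOD_DRAMATIC>', '<ENERGY_UP>', '<DENSITY_HIGHER>', '<SIMILARITY_HIGH>', '<LEN_8BARS>']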

2.4 User controls (explicit sliders/toggles)

In the UX, these map to simple controls:

  • Length: number of bars / seconds.
  • Similarity to reference: low → high.
  • Energy / density change: down / same / up.
  • Mood change: optional (keep / switch mood).

Controls are encoded as tokens and/or scalar parameters used during sampling.

2.5 Reference / continuation boundary

Symbolically:

  • The reference is a prefix sequence R.
  • The continuation is a generated suffix C.
  • Full sequence is R || C.

Internally, we may optionally insert a marker token at the boundary:

... (last reference bar tokens) <CONTINUE> BarStart (first continuation bar)

but the minimal design can simply treat “end of prefix tokens” as the boundary.

The continuation always starts at a bar boundary to ensure audio-level seamlessness and to align with the performance module’s bar-based constraints.


3. Continuation Model Designs

3.1 Minimal model: autoregressive Transformer continuation

3.1.1 Model family

Use Architecture A’s symbolic core: an autoregressive Transformer decoder trained on solo piano token sequences.

  • Architecture:
    • Transformer decoder, ~8–12 layers, hidden size ~512, multi-head self-attention.
    • Relative positional encodings.
  • Training:
    • Next-token prediction on piano token sequences (e.g. MAESTRO-style symbolic data).
    • Control tokens (tempo, meter, mood, density, length) prepended as in Architecture A.

3.1.2 Training for continuation

We train the same model to handle prefix→continuation tasks implicitly:

  • For each training piece:
    • Sample a random cut point in bars: prefix R (e.g. first 4–16 bars), continuation C (next 4–16 bars).
    • Input: tokens for R and any control tokens derived from metadata.
    • Target: tokens for R || C (the model is teacher-forced over the full sequence).
  • Because the Transformer is autoregressive over the entire sequence, it learns to:
    • Use earlier bars as context for later bars.
    • Maintain key/tempo/meter and local style across the cut.

We do not require a separate “continuation-only” model. We leverage the same core that also supports prompt-only generation, as in Architecture A.
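
A sketch of how one such prefix/continuation training example could be assembled is given below, assuming the tokenizer exposes per-bar token spans; the helper names and the example format are assumptions for illustration.

# Build one prefix/continuation example from a tokenized piece.
import random

def make_continuation_example(tokens, bar_token_spans, control_tokens,
                              min_bars=4, max_bars=16):
    """bar_token_spans: list of (start_idx, end_idx) token spans, one per bar."""
    n_bars = len(bar_token_spans)
    if n_bars < 2 * min_bars:
        return None                                    # piece too short to split
    prefix_bars = random.randint(min_bars, min(max_bars, n_bars - min_bars))
    cont_bars = random.randint(min_bars, min(max_bars, n_bars - prefix_bars))
    cut = bar_token_spans[prefix_bars][0]              # first token of the continuation
    end = bar_token_spans[prefix_bars + cont_bars - 1][1]
    # Teacher forcing runs over the whole sequence R || C; the loss can optionally
    # be masked so that only continuation tokens contribute.
    input_ids = control_tokens + tokens[:end]
    return {"input_ids": input_ids, "continuation_start": len(control_tokens) + cut}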

3.1.3 Inference in continuation mode

Given reference tokens R and controls:

  1. Build prefix:

    BarStart ... R

  2. Truncate R to last N bars (e.g. N = 8) if it exceeds the context length.

  3. Call the Transformer in inference mode, feeding the prefix and sampling until:
    • We have generated length_bars new BarStart tokens after the boundary; or
    • An <END> token appears.

  4. Decode tokens into a structured MIDI-like representation.

Coherence arises because the model is conditioned on the actual reference bars, including:

  • Harmony (pitch patterns).
  • Rhythm and note density.
  • Register and contour.

The model learns to “continue a story” rather than start afresh.
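
The decoding loop described in steps 1–4 could look roughly like the sketch below; model.sample_next and keep_last_bars are assumed helpers for illustration, not an existing API.

# Continuation-mode decoding: sample until enough new bars (or <END>) appear.
def generate_continuation(model, prefix_tokens, length_bars,
                          max_prefix_bars=8, max_tokens=2048):
    prefix = keep_last_bars(prefix_tokens, max_prefix_bars)   # hypothetical truncation helper
    seq = list(prefix)
    new_bars = 0
    while len(seq) < max_tokens:
        tok = model.sample_next(seq)              # top-p / top-k sampling happens inside
        if tok == "<END>":
            break
        seq.append(tok)
        if tok == "BarStart":
            new_bars += 1
            if new_bars > length_bars:            # requested bars are complete
                seq.pop()                         # drop the extra BarStart
                break
    return seq[len(prefix):]                      # continuation tokens only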

3.1.4 Novelty vs repetition controls

To prevent trivial repetition:

  • Decoding constraints:
    • Penalize re-emission of long n-grams seen in reference R.
    • Limit exact 1:1 reuse of whole bars from R (except for musically natural repetition).
  • Sampling:
    • Use top-p / top-k sampling with temperature.
    • For high similarity:
      • Lower temperature (more conservative outputs).
      • Weak n-gram penalties (allow moderate motif reuse).
    • For low similarity:
      • Higher temperature, stronger penalties (more novel direction).

These controls are wired to the Similarity slider.
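
One way to implement the reference n-gram penalty is sketched below: index the (n-1)-token contexts seen in the reference and push down the logits of any token that would reproduce a full reference n-gram. The value of n, the penalty strength, and the names are illustrative.

# Reference n-gram penalty for decoding (sketch).
def build_ngram_index(ref_tokens, n=8):
    """Map every (n-1)-token context in the reference to the tokens that follow it."""
    index = {}
    for i in range(len(ref_tokens) - n + 1):
        ctx = tuple(ref_tokens[i:i + n - 1])
        index.setdefault(ctx, set()).add(ref_tokens[i + n - 1])
    return index

def apply_ngram_penalty(logits, generated, ngram_index, vocab, n=8, penalty=3.0):
    """Subtract `penalty` from any token that would complete a reference n-gram."""
    if len(generated) < n - 1:
        return logits
    ctx = tuple(generated[-(n - 1):])
    for tok in ngram_index.get(ctx, ()):          # tokens the reference used after this context
        logits[vocab[tok]] -= penalty
    return logits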

3.2 Advanced model: style embedding & motif-aware continuation (optional)

For higher-quality v1+:

3.2.1 Reference encoder

Add a reference encoder (small Transformer or BiLSTM) that processes the reference tokens R and outputs:

  • Global style embedding s:
    • Encodes key center, texture, rhythmic profile, phrase length tendencies.
  • Local motif descriptors:
    • Summaries of important 1–4 bar motifs (melodic/rhythmic).

The continuation decoder conditions on:

  • The prefix tokens (last N bars).
  • The style embedding s (concatenated or injected as conditioning at each step).

This provides robustness when:

  • The prefix is long (beyond main context window).
  • We want to dial similarity up/down by interpolating between s and a neutral embedding.
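
A minimal sketch of that similarity-driven interpolation, assuming the neutral embedding is a dataset-mean style vector and the slider is normalized to [0, 1]:

# Similarity slider as interpolation between reference style and a neutral embedding.
import torch

def blend_style(style_emb: torch.Tensor, neutral_emb: torch.Tensor,
                similarity: float) -> torch.Tensor:
    """similarity = 1.0 keeps the reference style; 0.0 falls back to the neutral embedding."""
    similarity = float(max(0.0, min(1.0, similarity)))
    return similarity * style_emb + (1.0 - similarity) * neutral_emb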

3.2.2 Motif-aware continuation

We can add a simple copy-with-variation strategy:

  • Detect 1–2 high-salience motifs in the reference using:
    • Pitch contour patterns.
    • Rhythmic n-gram frequency.
  • During generation:
    • Encourage re-use of motifs with transformations:
      • Transposition within key.
      • Rhythmic displacement.
      • Inversion / augmentation for development.
  • Mechanism:
    • Either integrated into the model (e.g. via attention bias to motif positions),
    • Or via post-processing that searches the continuation; if no motif from the reference appears, we bias sampling or patch in motif variations.

This produces thematic development rather than arbitrary continuation or pure repetition.
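
A simple version of the motif detector based on melodic-interval and rhythmic n-gram frequency could look like the sketch below; the note representation, n-gram length, and thresholds are assumptions.

# Extract the most frequent melodic/rhythmic n-grams as candidate motifs (sketch).
from collections import Counter

def salient_motifs(notes, n=4, top_k=2):
    """notes: list of (onset_beats, midi_pitch) sorted by onset. Returns up to top_k motifs."""
    grams = Counter()
    for i in range(len(notes) - n):
        window = notes[i:i + n + 1]
        intervals = tuple(b[1] - a[1] for a, b in zip(window, window[1:]))      # pitch steps
        iois = tuple(round(b[0] - a[0], 3) for a, b in zip(window, window[1:])) # rhythm
        grams[(intervals, iois)] += 1
    return [motif for motif, count in grams.most_common(top_k) if count > 1]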

3.2.3 Hierarchical bar-level planning (optional)

Borrowing from hierarchical design ideas in Architecture A:

  • Stage 1: bar-level planner:
    • Given the reference summary + controls, generate a sequence of bar-level embeddings for the continuation:
      • Per-bar: intensity, harmonic region, density.
  • Stage 2: note-level generator:
    • Generate notes per bar conditioned on that bar’s embedding and the prefix.

Benefits:

  • Stronger control over:
    • Gradual energy ramp (“build over 8 bars”).
    • Harmonic direction (e.g. remain in key, optional modulation).
  • Clear mapping from user controls (energy / mood) to the bar-level plan.

This is a more complex future variant and not necessary for v0.


4. Integration with Architecture A Pipeline

4.1 Symbolic generation stage

Architecture A’s symbolic generator is already structured as:

generate_composition(controls: ControlDict, prefix: Optional[MIDI]) -> MIDI

For continuation:

  • controls comes from:
    • Text prompt parsing.
    • UI sliders for length, similarity, energy, mood.
  • prefix is the reference MIDI, converted to Architecture A’s token format.

Flow:

  1. Parse reference into tokens.
  2. Infer / confirm tempo, meter, key; prepend tokens.
  3. Add control tokens for continuation (length, similarity, energy change).
  4. Call generate_composition with prefix set to reference.
  5. Decode tokens to MIDI: combined reference + continuation.

The symbolic output is structurally identical to any other Architecture A composition.
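
Putting the flow together, a continuation entry point on top of generate_composition could look like the sketch below; parse_prompt, infer_global_context, humanize, and render_audio are stand-ins for existing pipeline pieces, not confirmed function names.

# Continuation entry point wired onto generate_composition(controls, prefix) (sketch).
def continue_reference(reference_midi, prompt_text, ui_controls):
    controls = {**parse_prompt(prompt_text), **ui_controls}       # merged ControlDict
    controls.update(infer_global_context(reference_midi))         # tempo / meter / key if missing
    combined_midi = generate_composition(controls=controls, prefix=reference_midi)
    performance_midi = humanize(combined_midi)                    # existing performance module
    return render_audio(performance_midi), combined_midi          # audio + reference||continuation MIDI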

4.2 Performance / humanization stage

Use the existing performance module for solo piano:

  • Input: clean quantized MIDI (reference + continuation).
  • Output: expressive performance MIDI with:
    • Timing deviations.
    • Velocity shaping.
    • Articulation changes.
    • Pedal events.

Key integration points:

  • Boundary handling:
    • The continuation starts on a bar boundary.
    • Humanization rules are designed to keep bar length invariant (zero-mean timing deviations per bar) to preserve alignment and looping.
    • At the reference→continuation boundary:
      • Ensure no cumulative tempo drift in the last reference bar.
      • Optionally run performance rules with awareness of the boundary to avoid abrupt changes.
  • Pedal continuity:
    • If sustain is down at the end of the reference:
      • Option 1: treat the boundary as a phrase break and lift the pedal.
      • Option 2: keep the pedal if the harmony doesn’t change and the style suggests legato.
    • The existing performance design already covers bar-level and phrase-level pedal handling and can be extended with a boundary-aware heuristic.

We do not need a separate performance model for the continuation; the same module runs on the full sequence.
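
The boundary-aware pedal heuristic described under “Pedal continuity” could be sketched roughly as below; the event format, timing offsets, and the pitch_classes_in_bar helper are assumptions for illustration.

# Boundary pedal heuristic (sketch): lift the pedal at the join unless harmony carries over.
def resolve_boundary_pedal(performance_events, boundary_time,
                           last_ref_bar, first_cont_bar):
    same_harmony = pitch_classes_in_bar(last_ref_bar) == pitch_classes_in_bar(first_cont_bar)
    if same_harmony:
        return performance_events                     # Option 2: keep pedal for legato
    # Option 1: treat the boundary as a phrase break and re-pedal shortly after it.
    performance_events.append({"type": "pedal_off", "time": boundary_time - 0.02})
    performance_events.append({"type": "pedal_on",  "time": boundary_time + 0.05})
    return performance_events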

4.3 Audio rendering stage

The piano renderer (sample-based instrument in v0) remains unchanged:

  • Input: performance MIDI.
  • Output: stereo audio.

Because the continuation is a direct extension of the same symbolic/performance representation:

  • Timbre is consistent (same virtual piano, same FX chain).
  • The piece sounds like one continuous performance.

4.4 Caching and partial regeneration

Caching strategy:

  • Reference:
    • Keep the reference MIDI and its performance version cached across continuation attempts.
  • Continuation:
    • Each generated continuation has its own:
      • Symbolic MIDI.
      • Performance MIDI.
      • Rendered audio.

For a new continuation with the same reference:

  • Option A (simplest for v0):
    • Re-run performance on the full reference + continuation.
  • Option B (optimization):
    • Reuse the cached performance for the reference bars; humanize only the continuation, plus possibly a 1-bar overlap to ensure continuity.

Given short clip lengths and low cost of humanization, v0 can start with Option A.
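
For Option A, a simple cache key over the reference bytes, control settings, and sampling seed is enough to reuse cached artifacts across attempts with the same reference; the sketch below is illustrative only.

# Cache key for continuation artifacts (sketch).
import hashlib, json

def cache_key(reference_midi_bytes: bytes, controls: dict, seed: int) -> str:
    payload = json.dumps(controls, sort_keys=True).encode() + reference_midi_bytes
    return hashlib.sha256(payload + str(seed).encode()).hexdigest()

# cache[cache_key(ref_bytes, controls, seed)] = {"symbolic": ..., "performance": ..., "audio": ...}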


5. UX Flows

5.1 Primary continuation flow

  1. Select / import reference:
    • The user either:
      • Uploads a short solo-piano MIDI, or
      • Selects bars from an existing generated clip (e.g. the last 4–8 bars).
    • The UI visualizes the reference as a piano roll or basic notation.

  2. Configure continuation:
    • Controls:
      • Length: e.g. 4, 8, or 16 bars.
      • Similarity: e.g. “very close”, “medium”, “more exploratory”.
      • Energy: “down”, “same”, “up”.
      • Optional mood change: “keep mood” / “more hopeful” / “darker”, etc.
    • Optional text prompt field:
      • “Make this pattern evolve into something more dramatic.”

  3. Generate:
    • When the user presses “Continue”, the system:
      • Parses controls and prompt into tokens.
      • Runs the symbolic continuation model with the reference as prefix.
      • Applies performance humanization.
      • Renders audio.
    • The UI then:
      • Appends the continuation visually to the reference in the timeline.
      • Auto-plays the combined reference + continuation.

  4. Refine:
    • The user can:
      • Regenerate the continuation (different random seed).
      • Adjust similarity/energy sliders and regenerate.
      • Change length and re-run.

  5. Export:
    • Export options:
      • Full audio (WAV/OGG).
      • Combined MIDI (reference + continuation).
    • Same export UX as baseline Architecture A generation.

5.2 Variants & advanced flows

  • Alternative endings:
    • The user keeps the same reference and generates multiple continuations.
    • The UI allows A/B/C listening and selection of the favorite.
  • Chain continuation:
    • After one continuation is accepted, the user can use the last few bars of the new piece as the next reference, chaining sections.
  • Scoped variation:
    • Select only the last N bars of the continuation and regenerate them, leaving both the initial reference and the earlier continuation intact (reuses Architecture A’s partial-regen semantics).

6. Evaluation Plan

6.1 Automatic symbolic metrics

We assess both coherence and novelty between reference and continuation.

  1. Key and scale consistency:
    • Run key detection separately on the reference and the continuation.
    • Measure:
      • Key match rate.
      • Degree of scale overlap.
    • Expect the same key/mode unless the user requested modulation.

  2. Meter & tempo consistency:
    • Verify:
      • All continuation bars sum to the correct number of beats for the meter (e.g. 4 quarters in 4/4).
      • Tempo tokens remain consistent (or follow user-requested changes).
    • Flag any anomalies as model / decoding bugs.

  3. Density & rhythmic profile:
    • Compute notes per bar and IOI (inter-onset interval) distributions, comparing reference vs continuation, adjusted for the user’s energy/density controls.
    • Expectations:
      • With “same energy/density”: similar ranges.
      • With “energy up”: increased density and more active rhythms, but still musically plausible.

  4. Motif similarity & structural continuity:
    • Extract n-gram patterns (e.g. melodic 3–5 note sequences) from the reference.
    • Compute:
      • Presence of these motifs (or transformed versions) in the continuation.
      • Proportion of the continuation covered by references to these motifs.
    • We want:
      • Some motif reuse (for thematic coherence).
      • No long exact clones of reference bars (avoid trivial repetition).
    • Also check transitions: the start of the continuation should harmonically and rhythmically “answer” or extend the last bar of the reference, not contradict it abruptly.

  5. Repetition / looping diagnosis (see the sketch after this list):
    • Detect:
      • Exact bar repeats in the continuation.
      • Short repeated cycles (e.g. 1–2 bars looped > 3 times).
    • Apply thresholds: repetition within reason is allowed (e.g. a repeated ostinato), but degenerate loops should be rare.
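
A minimal sketch of two of these checks (exact-bar repetition and motif coverage), assuming bars and melodies are available as simple tuples; the representations and thresholds are assumptions.

# Symbolic coherence/novelty checks (sketch).
def exact_bar_repeats(cont_bars):
    """Fraction of continuation bars that exactly copy an earlier continuation bar."""
    seen, repeats = set(), 0
    for bar in cont_bars:
        key = tuple(bar)                       # bar = tuple of note events
        repeats += key in seen
        seen.add(key)
    return repeats / max(1, len(cont_bars))

def motif_coverage(ref_pitches, cont_pitches, n=4):
    """Share of continuation n-grams (by melodic interval) that also occur in the reference."""
    def grams(pitches):
        return {tuple(b - a for a, b in zip(pitches[i:i + n], pitches[i + 1:i + n + 1]))
                for i in range(len(pitches) - n)}
    ref, cont = grams(ref_pitches), grams(cont_pitches)
    return len(ref & cont) / max(1, len(cont))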

6.2 Automatic performance / audio checks

Given the existing performance module:

  • Confirm:
    • No excessive timing drift across the full reference + continuation (bar boundaries remain stable).
    • Loudness and dynamic range are comparable between reference and continuation, or follow the energy control.
  • Audio:
    • No clipping, obvious clicks, or artifacts around the boundary (see the sketch below).
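
The boundary audio check can be as simple as the sketch below, which looks for clipping and large sample-to-sample jumps in a short window around the join; mono audio and the exact thresholds are assumptions.

# Boundary audio check (sketch).
import numpy as np

def boundary_artifacts(audio: np.ndarray, sample_rate: int, boundary_sec: float,
                       window_sec: float = 0.25):
    start = max(0, int((boundary_sec - window_sec) * sample_rate))
    end = min(len(audio), int((boundary_sec + window_sec) * sample_rate))
    window = audio[start:end]
    return {
        "clipping": bool(np.any(np.abs(window) >= 0.999)),
        "max_sample_jump": float(np.max(np.abs(np.diff(window)))) if len(window) > 1 else 0.0,
    }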

6.3 Human listening tests

  1. Coherence rating:
    • Present listeners with the reference segment, followed by the continuation.
    • Ask: “How well does the second part follow the first?” (1–5 scale).
    • Success criterion: the majority of continuations are rated as at least moderately coherent.

  2. Style match rating:
    • Ask: “How similar is the style of the continuation to the reference?” (1–5).
    • Compare across high vs medium vs low similarity settings.

  3. Instruction compliance:
    • Provide the text or slider instructions used (e.g. “more energetic, 8 bars”).
    • Ask: “Did the continuation match these instructions?” (1–5).
    • Evaluate the accuracy of the energy/mood change controls.

  4. Baseline comparisons:
    • Conditions: our continuation vs:
      • A naive baseline (random clip appended).
      • Trivial repetition (copy the last bar N times).
    • Ask: “Which continuation feels more musical and coherent?”
    • Expect a strong preference for our model.

  5. Expert review (optional):
    • Have pianists/composers annotate:
      • Harmony continuity.
      • Voice-leading quality.
      • Phrase structure.
7. Risks & Minimal Experiments

7.1 Key risks

  1. Style drift vs over-adhesion:
    • The model might:
      • Drift away from the reference style.
      • Or overfit and copy the reference verbatim.

  2. Weak effect of user controls:
    • Sliders for similarity/energy might not translate into clear musical differences.

  3. Boundary artifacts:
    • Audible discontinuities at the reference→continuation join despite symbolic coherence.

  4. Out-of-distribution references:
    • The user reference may be outside the model’s style (e.g. jazz when the model is trained on cinematic piano).

  5. Latency:
    • Long references may increase generation time beyond interactive thresholds.

7.2 Minimal experiments

Experiment 1: Prefix–continuation fidelity

  • Take held-out pieces from training data.
  • Use first N bars as reference, ask model to generate next N bars.
  • Evaluate:
    • Symbolic metrics (key, meter, density, motif reuse).
    • Human ratings of coherence and style match.
  • If drift is frequent:
    • Strengthen training with more prefix/continuation examples.
    • Adjust decoding to be more conservative near the prefix boundary.

Experiment 2: Control responsiveness

  • Fix a reference and generate multiple continuations with different control settings:
    • Similarity: low / medium / high.
    • Energy: down / same / up.
  • Measure:
    • Changes in density, dynamic range, register, and motif reuse.
  • Run a quick listening test:
    • “Which one sounds more energetic?”
  • If differences are weak:
    • Revisit the mapping from controls → tokens/parameters.
    • Consider additional conditioning channels (e.g. bar-level intensity embeddings).

Experiment 3: Boundary smoothness

  • Generate continuations for a variety of references.
  • Listen focusing on the join.
  • Check:
    • Timing alignment.
    • Pedal continuity.
    • Sudden changes in texture/dynamics.
  • If artifacts appear:
    • Add boundary-specific rules in the performance module (e.g. phrase-end smoothing).
    • Possibly generate 1–2 overlapping bars and cross-fade symbolically or at the audio level.

Experiment 4: OOD reference robustness

  • Use references from:
    • User-supplied MIDI outside the training style.
    • Extreme tempos, keys, densities.
  • Observe:
    • Does the model still produce musically coherent continuations?
  • If behavior is poor:
    • Clarify the v0 style scope in the product (e.g. “optimized for cinematic / emotive solo piano”).
    • Add style detection to warn when references are far outside the supported domain.

Experiment 5: Latency profiling

  • Benchmark continuation generation across:
    • Different prefix lengths (4, 8, 16 bars).
    • Different continuation lengths.
  • Measure:
    • End-to-end latency (symbolic + performance + render).
  • If latency is too high:
    • Limit the maximum number of prefix bars considered.
    • Cache prefix representations for reuse (KV caches in the Transformer).
    • Optionally provide a “preview” mode with a shorter continuation or less humanization.

8. References

8.1 Internal design docs & research prompts

  • Ref 43 (research prompt), “Reference-guided continuation for a single instrument”: defines the product framing, objectives, questions, and scope for single-instrument continuation.
  • Ref 48 (design doc), “Architecture A: Single-Instrument Text-to-Music Pipeline (Solo Piano Prototype)”: provides the overall text→symbolic→audio pipeline, representation, and UX for solo piano.
  • Ref 46 (design doc), “Humanization and Performance Modeling for Solo Piano (Architecture A Pipeline)”: defines the performance representation and humanization module integrated into Architecture A.
  • Ref 44 (research prompt), “Architecture A single-instrument prototype: text-to-symbolic-to-audio piano/guitar pipeline”: supplies the original questions and constraints for Architecture A’s symbolic core and evaluation.
  • Ref 45 (research prompt), “Humanization and performance modeling for a single instrument”: higher-level performance modeling goals that the solo-piano doc specializes and implements.

8.2 External datasets & methods (conceptual references)

These are conceptual references used for training/inspiration; details are described in the Architecture A and humanization docs.

  • MAESTRO (piano dataset): source of expressive solo-piano performances for symbolic and performance modeling.
  • EMOPIA (emotion-labeled piano dataset): provides emotion tags and modern piano clips for mood-conditioned generation.
  • GiantMIDI-Piano (piano dataset): large classical piano corpus for additional symbolic training.
  • MIDI-DDSP (method family): example of a possible future neural renderer; not required for continuation v0.

All continuation design choices are constructed to be compatible with these underlying architecture and performance modules, so the continuation feature can be added as a thin extension over the existing pipeline, reusing as much infrastructure and representation as possible.