Reference-Guided Continuation for Solo Piano (Architecture A Integration)¶
1. Overview¶
This document specifies a reference-guided symbolic continuation system for solo piano, integrated into the Architecture A text-to-symbolic-to-audio pipeline. The goal is to let users provide a short piano snippet (symbolic) as a reference, optionally with a text prompt, and generate a continuation that:
- Preserves key, tempo, meter, and overall feel of the reference.
- Respects stylistic features (rhythm, density, register, articulation tendencies).
- Avoids trivial copy-paste or near-duplicate continuation.
The continuation operates at the symbolic level, then passes through the existing performance / humanization module and instrument renderer as in Architecture A.
High-level properties:
- Input: symbolic reference (MIDI / internal tokens) + optional text prompt + continuation controls.
- Output: symbolic continuation appended to the reference, then humanized and rendered to audio.
- Models:
- Minimal: autoregressive Transformer in prefix-continuation mode.
- Advanced: optional style-aware / motif-aware variant with a reference encoder.
- UX: select segment → set length/similarity/energy/mood → preview multiple continuation options.
- Evaluation: symbolic coherence metrics + human listening tests.
This design is aligned with:
- The Architecture A single-instrument pipeline for solo piano.
- The Humanization and Performance Modeling for Solo Piano module for expressive rendering.
- The reference-guided continuation research prompt and its scope/constraints.
2. Input & Representation¶
2.1 Reference input¶
Primary input is a symbolic reference segment for solo piano:
- Format: the same event-based, bar-structured tokenization as Architecture A (a token-layout sketch follows this list):
  - `BarStart` tokens for bar boundaries.
  - `Position_k` tokens for positions within a bar.
  - `NoteOn_pitch`, `Duration_x`, `Velocity_bin` tokens for notes.
  - Global context tokens (tempo, meter, optional key) at the start of the sequence.
- Source:
- Imported MIDI (user upload).
- A segment of a previously generated clip (internal MIDI).
- Length:
- Target v0: 2–16 bars, typically 4–8 bars, aligned to bar boundaries for simplicity.
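For concreteness, here is a minimal sketch of how one reference bar could be flattened into this token vocabulary. The token names mirror the examples in this document; the `Note` fields, grid resolution, and velocity binning are illustrative assumptions rather than the exact Architecture A schema.

```python
# Illustrative only: flatten one bar of reference notes into the event-token
# vocabulary sketched above (BarStart, Position_k, NoteOn_pitch, Duration_x,
# Velocity_bin). Field names and bin sizes are assumptions, not the exact schema.
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int        # MIDI pitch number
    start_pos: int    # position index within the bar (e.g. 16th-note grid)
    duration: int     # duration in grid steps
    velocity: int     # MIDI velocity 0-127

def tokenize_bar(notes: list[Note], velocity_bins: int = 8) -> list[str]:
    """Convert one bar of notes into event tokens, sorted by position."""
    tokens = ["BarStart"]
    for note in sorted(notes, key=lambda n: (n.start_pos, n.pitch)):
        vel_bin = min(velocity_bins - 1, note.velocity * velocity_bins // 128)
        tokens += [
            f"Position_{note.start_pos}",
            f"NoteOn_{note.pitch}",
            f"Duration_{note.duration}",
            f"Velocity_{vel_bin}",
        ]
    return tokens

# Example: a C minor arpeggio in the first half of a 4/4 bar.
bar = [Note(48, 0, 4, 70), Note(55, 4, 4, 64), Note(60, 8, 8, 58)]
print(tokenize_bar(bar))
```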
2.2 Global musical context¶
The reference carries or induces:
- Tempo (`<TEMPO_XXX>` token).
- Meter (`<METER_4/4>`, `<METER_3/4>`, etc.).
- Key / scale (optional explicit key token, or inferred via key detection on reference notes).
- Clip scope: short local continuation (4–16 bars) rather than whole-song structure.
These are encoded as special tokens at the start:
`<TEMPO_80> <METER_4/4> <KEY_CMINOR> <MOOD_SAD> BarStart ...`
If the reference lacks metadata, a lightweight analyzer infers tempo, meter, and key; defaults are used if inference fails.
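A minimal sketch of this fallback chain, assuming the analyzer exposes explicit and inferred metadata as plain dictionaries; the default values below are placeholders.

```python
# Sketch of the metadata fallback chain: prefer explicit metadata carried by the
# reference, then inferred values, then hard defaults. The analyzer outputs and
# default values are placeholders.
DEFAULTS = {"tempo": 90, "meter": "4/4", "key": "C_MAJOR"}

def resolve_metadata(explicit: dict, inferred: dict) -> dict:
    """Merge explicit, inferred, and default metadata, in that priority order."""
    resolved = {}
    for field, default in DEFAULTS.items():
        resolved[field] = explicit.get(field) or inferred.get(field) or default
    return resolved

# Example: the reference carries a tempo but no meter or key; inference succeeded.
print(resolve_metadata(explicit={"tempo": 80},
                       inferred={"meter": "4/4", "key": "C_MINOR"}))
# {'tempo': 80, 'meter': '4/4', 'key': 'C_MINOR'}
```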
2.3 Text prompt¶
An optional text prompt refines high-level intent, consistent with Architecture A’s prompt-to-controls mapping:
Examples:
- “Continue this for 8 bars in the same style.”
- “Make this pattern evolve into something more dramatic over the next 8 bars.”
- “Keep the rhythm but make it calmer and sparser.”
Parsed into control fields:
- `mood` (e.g. calm, dramatic, melancholic).
- `energy_change` (down / same / up).
- `density_change` (sparser / similar / denser).
- `similarity_level` (low / medium / high adherence to reference style).
- `length_bars` (4–16).
Mapped to control tokens:
`<MOOD_DRAMATIC> <ENERGY_UP> <DENSITY_HIGHER> <SIMILARITY_HIGH> <LEN_8BARS>`
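A minimal sketch of the mapping from parsed control fields to control tokens. The field and token names follow this document; the parsing of free text into fields is out of scope here.

```python
# Illustrative mapping from parsed control fields to the control tokens shown above.
ENERGY = {"down": "DOWN", "same": "SAME", "up": "UP"}
DENSITY = {"sparser": "LOWER", "similar": "SIMILAR", "denser": "HIGHER"}

def controls_to_tokens(controls: dict) -> list[str]:
    """Turn parsed control fields into prefix control tokens."""
    tokens = []
    if "mood" in controls:
        tokens.append(f"<MOOD_{controls['mood'].upper()}>")
    if "energy_change" in controls:
        tokens.append(f"<ENERGY_{ENERGY[controls['energy_change']]}>")
    if "density_change" in controls:
        tokens.append(f"<DENSITY_{DENSITY[controls['density_change']]}>")
    if "similarity_level" in controls:
        tokens.append(f"<SIMILARITY_{controls['similarity_level'].upper()}>")
    if "length_bars" in controls:
        tokens.append(f"<LEN_{controls['length_bars']}BARS>")
    return tokens

controls = {"mood": "dramatic", "energy_change": "up", "density_change": "denser",
            "similarity_level": "high", "length_bars": 8}
print(controls_to_tokens(controls))
# ['<MOOD_DRAMATIC>', '<ENERGY_UP>', '<DENSITY_HIGHER>', '<SIMILARITY_HIGH>', '<LEN_8BARS>']
```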
2.4 User controls (explicit sliders/toggles)¶
In the UX, these map to simple controls:
- Length: number of bars / seconds.
- Similarity to reference: low → high.
- Energy / density change: down / same / up.
- Mood change: optional (keep / switch mood).
Controls are encoded as tokens and/or scalar parameters used during sampling.
2.5 Reference / continuation boundary¶
Symbolically:
- The reference is a prefix sequence `R`.
- The continuation is a generated suffix `C`.
- The full sequence is `R || C`.
Internally, we may optionally insert a marker token at the boundary:
`... (last reference bar tokens) <CONTINUE> BarStart (first continuation bar)`
but the minimal design can simply treat “end of prefix tokens” as the boundary.
The continuation always starts at a bar boundary to ensure audio-level seamlessness and to align with the performance module’s bar-based constraints.
3. Continuation Model Designs¶
3.1 Minimal model: autoregressive Transformer continuation¶
3.1.1 Model family¶
Use Architecture A’s symbolic core: an autoregressive Transformer decoder trained on solo piano token sequences.
- Architecture:
- Transformer decoder, ~8–12 layers, hidden size ~512, multi-head self-attention.
- Relative positional encodings.
- Training:
- Next-token prediction on piano token sequences (e.g. MAESTRO-style symbolic data).
- Control tokens (tempo, meter, mood, density, length) prepended as in Architecture A.
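As a reference point, the hyperparameters above can be captured in a small configuration object; the concrete values below are illustrative midpoints, not tuned settings.

```python
# Sketch of the decoder hyperparameters mentioned above, captured as a config
# object. Values are approximate midpoints of the ranges in this section.
from dataclasses import dataclass

@dataclass
class ContinuationModelConfig:
    num_layers: int = 10             # ~8-12 decoder layers
    hidden_size: int = 512           # model width
    num_heads: int = 8               # multi-head self-attention
    relative_positions: bool = True  # relative positional encodings
    max_context_tokens: int = 2048   # assumed context budget for prefix + continuation
    vocab_size: int = 512            # assumed event/control token vocabulary size

config = ContinuationModelConfig()
```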
3.1.2 Training for continuation¶
We train the same model to handle prefix→continuation tasks implicitly:
- For each training piece:
  - Sample a random cut point in bars: prefix `R` (e.g. first 4–16 bars), continuation `C` (next 4–16 bars).
  - Input: tokens for `R` and any control tokens derived from metadata.
  - Target: tokens for `R || C` (the model is teacher-forced over the full sequence).
- Because the Transformer is autoregressive over the entire sequence, it learns to:
- Use earlier bars as context for later bars.
- Maintain key/tempo/meter and local style across the cut.
We do not require a separate “continuation-only” model. We leverage the same core that also supports prompt-only generation, as in Architecture A.
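A minimal sketch of how prefix/continuation training examples could be cut from a bar-segmented token sequence, assuming `bars` is a list of per-bar token lists; the cut-point ranges follow the 4–16 bar figures above.

```python
# Sketch of constructing one prefix/continuation training example from a
# bar-segmented token sequence. `bars` is a list of per-bar token lists.
import random

def make_continuation_example(bars: list[list[str]],
                              control_tokens: list[str],
                              min_bars: int = 4, max_bars: int = 16) -> list[str]:
    """Return one teacher-forcing sequence: controls + prefix R + continuation C."""
    if len(bars) < 2 * min_bars:
        raise ValueError("piece too short for a prefix/continuation split")
    prefix_len = random.randint(min_bars, min(max_bars, len(bars) - min_bars))
    cont_len = random.randint(min_bars, min(max_bars, len(bars) - prefix_len))
    prefix = [tok for bar in bars[:prefix_len] for tok in bar]
    continuation = [tok for bar in bars[prefix_len:prefix_len + cont_len] for tok in bar]
    # The model is trained with next-token prediction over the whole sequence,
    # so the continuation task is implicit: later bars are predicted from earlier ones.
    return control_tokens + prefix + continuation
```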
3.1.3 Inference in continuation mode¶
Given reference tokens R and controls:
- Build the prefix: `BarStart ... R`.
- Truncate `R` to the last N bars (e.g. N = 8) if it exceeds the context length.
- Call the Transformer in inference mode, feeding the prefix and sampling until:
  - We have generated `length_bars` new `BarStart` tokens after the boundary; or
  - An `<END>` token appears.
- Decode the tokens into a structured MIDI-like representation (a decoding-loop sketch follows this list).
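A minimal sketch of the decoding loop, under one interpretation of the stopping rule (stop once the bar after the requested `length_bars` bars would begin, or on `<END>`); `sample_next_token` is a placeholder for the real model call with top-p/top-k sampling applied.

```python
# Minimal decoding loop for continuation mode. `sample_next_token` stands in
# for the real model call (with sampling and any n-gram penalties applied).
def generate_continuation(prefix_tokens: list[str],
                          sample_next_token,
                          length_bars: int,
                          max_tokens: int = 4096) -> list[str]:
    """Sample tokens after the prefix until `length_bars` new bars or <END>."""
    generated: list[str] = []
    new_bars = 0
    while len(generated) < max_tokens:
        token = sample_next_token(prefix_tokens + generated)
        if token == "<END>":
            break
        generated.append(token)
        if token == "BarStart":
            new_bars += 1
            if new_bars > length_bars:
                # The bar after the requested length has started; drop that
                # extra BarStart and stop, leaving exactly length_bars full bars.
                generated.pop()
                break
    return generated
```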
Coherence arises because the model is conditioned on the actual reference bars, including:
- Harmony (pitch patterns).
- Rhythm and note density.
- Register and contour.
The model learns to “continue a story” rather than start afresh.
3.1.4 Novelty vs repetition controls¶
To prevent trivial repetition:
- Decoding constraints:
  - Penalize re-emission of long n-grams seen in reference `R`.
  - Limit exact 1:1 reuse of whole bars from `R` (except for musically natural repetition).
- Sampling:
- Use top-p / top-k sampling with temperature.
- For high similarity:
- Lower temperature (more conservative outputs).
- Weak n-gram penalties (allow moderate motif reuse).
- For low similarity:
- Higher temperature, stronger penalties (more novel direction).
These controls are wired to the Similarity slider.
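A minimal sketch of how the Similarity slider could map to sampling parameters and to an n-gram penalty against the reference; the numeric presets are placeholders that only encode the directions described above.

```python
# Sketch of wiring the Similarity slider to sampling parameters and to a
# penalty on n-grams copied from the reference. Numbers are placeholders.
from dataclasses import dataclass

@dataclass
class SamplingParams:
    temperature: float
    top_p: float
    ngram_penalty: float   # logit penalty for tokens that would extend a long
                           # n-gram already present in the reference
    max_copied_ngram: int  # n-gram length above which the penalty applies

SIMILARITY_PRESETS = {
    "high":   SamplingParams(temperature=0.8,  top_p=0.90, ngram_penalty=0.5, max_copied_ngram=16),
    "medium": SamplingParams(temperature=1.0,  top_p=0.92, ngram_penalty=1.5, max_copied_ngram=12),
    "low":    SamplingParams(temperature=1.15, top_p=0.95, ngram_penalty=3.0, max_copied_ngram=8),
}

def extends_reference_ngram(generated: list[str], reference: list[str], n: int) -> bool:
    """True if the last n generated tokens appear verbatim in the reference."""
    if len(generated) < n:
        return False
    window = tuple(generated[-n:])
    ref_ngrams = {tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)}
    return window in ref_ngrams
```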
3.2 Advanced model: style embedding & motif-aware continuation (optional)¶
For higher-quality v1+:
3.2.1 Reference encoder¶
Add a reference encoder (small Transformer or BiLSTM) that processes the reference tokens R and outputs:
- A global style embedding `s`:
  - Encodes key center, texture, rhythmic profile, and phrase-length tendencies.
- Local motif descriptors:
  - Summaries of important 1–4 bar motifs (melodic/rhythmic).
The continuation decoder conditions on:
- The prefix tokens (last N bars).
- The style embedding `s` (concatenated or injected as conditioning at each step).
This provides robustness when:
- The prefix is long (beyond main context window).
- We want to dial similarity up/down by interpolating between `s` and a neutral embedding (see the interpolation sketch below).
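A minimal sketch of the similarity dial via embedding interpolation, using plain lists instead of tensors; `s_neutral` is assumed to be something like a dataset-mean style embedding.

```python
# Sketch of dialing similarity via the style embedding: interpolate between the
# reference embedding and a neutral (e.g. dataset-mean) embedding.
def blend_style(s_reference: list[float], s_neutral: list[float], similarity: float) -> list[float]:
    """similarity in [0, 1]: 1.0 keeps the reference style, 0.0 the neutral one."""
    return [similarity * r + (1.0 - similarity) * n
            for r, n in zip(s_reference, s_neutral)]

# Example: a "medium" similarity setting at 0.5.
print(blend_style([0.2, -1.0, 0.7], [0.0, 0.0, 0.0], similarity=0.5))
```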
3.2.2 Motif-aware continuation¶
We can add a simple copy-with-variation strategy:
- Detect 1–2 high-salience motifs in reference using:
- Pitch contour patterns.
- Rhythmic n-gram frequency.
- During generation:
- Encourage re-use of motifs with transformations:
- Transposition within key.
- Rhythmic displacement.
- Inversion / augmentation for development.
- Mechanism:
- Either integrated into the model (e.g. via attention bias to motif positions),
- Or via post-processing that searches the continuation; if no motif from reference appears, we bias sampling or patch in motif variations.
This produces thematic development rather than arbitrary continuation or pure repetition.
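A minimal sketch of motif detection by n-gram frequency, using (onset, pitch) pairs and treating frequency as a stand-in for salience; the transformation-aware matching described above is not shown.

```python
# Sketch of picking 1-2 high-salience motifs from the reference by counting
# melodic/rhythmic n-grams. Notes are (onset_step, pitch) pairs; salience here
# is simplified to frequency.
from collections import Counter

def top_motifs(notes: list[tuple[int, int]], n: int = 4, k: int = 2) -> list[tuple]:
    """Return the k most frequent n-note (interval, gap) patterns."""
    notes = sorted(notes)
    patterns = []
    for i in range(len(notes) - n + 1):
        window = notes[i:i + n]
        # Encode as pitch intervals and onset gaps so the motif is
        # transposition- and position-invariant.
        intervals = tuple(b[1] - a[1] for a, b in zip(window, window[1:]))
        gaps = tuple(b[0] - a[0] for a, b in zip(window, window[1:]))
        patterns.append((intervals, gaps))
    return [p for p, _ in Counter(patterns).most_common(k)]

# Example: a repeated three-note figure dominates the count.
notes = [(0, 60), (2, 63), (4, 67), (8, 60), (10, 63), (12, 67), (16, 65)]
print(top_motifs(notes, n=3, k=1))
```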
3.2.3 Hierarchical bar-level planning (optional)¶
Borrowing from hierarchical design ideas in Architecture A:
- Stage 1: bar-level planner:
- Given reference summary + controls, generate a sequence of bar-level embeddings for continuation:
- Per-bar: intensity, harmonic region, density.
- Stage 2: note-level generator:
- Generate notes per bar conditioned on that bar’s embedding and prefix.
Benefits:
- Stronger control over:
- Gradual energy ramp (“build over 8 bars”).
- Harmonic direction (e.g. remain in key, optional modulation).
- Clear mapping from user controls (energy / mood) to bar-level plan.
This is a more complex future variant and not necessary for v0.
4. Integration with Architecture A Pipeline¶
4.1 Symbolic generation stage¶
Architecture A’s symbolic generator is already structured as:
`generate_composition(controls: ControlDict, prefix: Optional[MIDI]) -> MIDI`
For continuation:
- `controls` comes from:
  - Text prompt parsing.
  - UI sliders for length, similarity, energy, mood.
- `prefix` is the reference MIDI, converted to Architecture A’s token format.
Flow:
- Parse reference into tokens.
- Infer / confirm tempo, meter, key; prepend tokens.
- Add control tokens for continuation (length, similarity, energy change).
- Call `generate_composition` with `prefix` set to the reference.
- Decode tokens to MIDI: the combined `reference + continuation` (an orchestration sketch follows this list).
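A minimal orchestration sketch of this flow. The helper callables (`parse_reference`, `resolve_metadata`, `controls_to_tokens`, `tokens_to_midi`) are placeholders for existing pipeline functions and are passed in explicitly so the sketch stays self-contained; note also that it passes a token prefix directly, whereas the real `generate_composition` signature above accepts a MIDI prefix and tokenizes internally.

```python
# Orchestration sketch of the continuation flow (symbolic stage only). All
# helper callables are placeholders for real pipeline functions; only the call
# order matters here.
def continue_reference(reference_midi, controls,
                       parse_reference, resolve_metadata, controls_to_tokens,
                       generate_composition, tokens_to_midi):
    """Run the symbolic stage of reference-guided continuation end to end."""
    # 1. Reference -> tokens plus any metadata it carries or implies.
    ref_tokens, explicit_meta, inferred_meta = parse_reference(reference_midi)
    metadata = resolve_metadata(explicit_meta, inferred_meta)
    # 2. Global context tokens + continuation control tokens + reference tokens.
    prefix = (
        [f"<TEMPO_{metadata['tempo']}>",
         f"<METER_{metadata['meter']}>",
         f"<KEY_{metadata['key']}>"]
        + controls_to_tokens(controls)
        + ref_tokens
    )
    # 3. Symbolic generation with the reference as prefix, then decode to MIDI.
    full_tokens = generate_composition(controls=controls, prefix=prefix)
    return tokens_to_midi(full_tokens)  # combined reference + continuation
```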
The symbolic output is structurally identical to any other Architecture A composition.
4.2 Performance / humanization stage¶
Use the existing performance module for solo piano:
- Input: clean quantized MIDI (reference + continuation).
- Output: expressive performance MIDI with:
- Timing deviations.
- Velocity shaping.
- Articulation changes.
- Pedal events.
Key integration points:
- Boundary handling:
- Continuation starts on a bar boundary.
- Humanization rules are designed to keep bar length invariant (zero-mean timing deviations per bar) to preserve alignment and looping.
- At the reference→continuation boundary:
- Ensure no cumulative tempo drift in the last reference bar.
- Optionally run performance rules with awareness of boundary to avoid abrupt changes.
- Pedal continuity:
- If sustain is down at the end of reference:
- Option 1: treat the boundary as a phrase break and lift pedal.
- Option 2: keep pedal if harmony doesn’t change and style suggests legato.
- The existing performance design already covers bar-level and phrase-level pedal handling and can be extended with a boundary-aware heuristic (sketched below).
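A minimal sketch of the boundary-aware pedal heuristic corresponding to Options 1 and 2 above, assuming harmony labels for the bars on either side of the join are available.

```python
# Sketch of a boundary-aware pedal heuristic: keep the sustain pedal across the
# join only if harmony is unchanged and the style is legato; otherwise lift it
# at the bar boundary.
def pedal_action_at_boundary(pedal_down: bool,
                             last_ref_harmony: str,
                             first_cont_harmony: str,
                             legato_style: bool) -> str:
    """Return 'keep', 'lift', or 'none' for the reference->continuation boundary."""
    if not pedal_down:
        return "none"
    if last_ref_harmony == first_cont_harmony and legato_style:
        return "keep"      # Option 2: sustain through the join
    return "lift"          # Option 1: treat the boundary as a phrase break

print(pedal_action_at_boundary(True, "Cmin", "Cmin", legato_style=True))   # keep
print(pedal_action_at_boundary(True, "Cmin", "Abmaj", legato_style=True))  # lift
```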
We do not need a separate performance model for the continuation; the same module runs on the full sequence.
4.3 Audio rendering stage¶
The piano renderer (sample-based instrument in v0) remains unchanged:
- Input: performance MIDI.
- Output: stereo audio.
Because the continuation is a direct extension of the same symbolic/performance representation:
- Timbre is consistent (same virtual piano, same FX chain).
- The piece sounds like one continuous performance.
4.4 Caching and partial regeneration¶
Caching strategy:
- Reference:
- Keep the reference MIDI and its performance version cached across continuation attempts.
- Continuation:
- Each generated continuation stores its own:
- Symbolic MIDI.
- Performance MIDI.
- Rendered audio.
For a new continuation with the same reference:
- Option A (simplest for v0):
  - Re-run performance on the full `reference + continuation`.
- Option B (optimization):
  - Reuse the cached performance for the reference bars and humanize only the continuation (plus, optionally, a 1-bar overlap to ensure continuity).
Given short clip lengths and low cost of humanization, v0 can start with Option A.
5. UX Flows¶
5.1 Primary continuation flow¶
- Select / import reference:
- User:
- Uploads a short solo-piano MIDI, or
- Selects bars from an existing generated clip (e.g. last 4–8 bars).
  - The UI visualizes the reference as a piano roll or basic notation.
- Configure continuation:
- Controls:
- Length: e.g. 4, 8, or 16 bars.
- Similarity: e.g. “very close”, “medium”, “more exploratory”.
- Energy: “down”, “same”, “up”.
- Optional mood change: “keep mood” / “more hopeful” / “darker”, etc.
  - Optional text prompt field:
    - “Make this pattern evolve into something more dramatic.”
- Generate:
- When the user presses “Continue”:
- System:
- Parses controls and prompt into tokens.
- Runs symbolic continuation model with reference as prefix.
- Applies performance humanization.
- Renders audio.
  - UI:
    - Appends the continuation visually to the reference in the timeline.
    - Auto-plays the combined reference + continuation.
- Refine:
  - User can:
    - Regenerate the continuation (different random seed).
    - Adjust similarity/energy sliders and regenerate.
    - Change length and re-run.
- Export:
- Export options:
- Full audio (WAV/OGG).
- Combined MIDI (reference + continuation).
- Same export UX as baseline Architecture A generation.
5.2 Variants & advanced flows¶
- Alternative endings:
- User keeps the same reference and generates multiple continuations.
- UI allows A/B/C listening and selection of the favorite.
- Chain continuation:
- After one continuation is accepted, user can use the last few bars of the new piece as the next reference, chaining sections.
- Scoped variation:
- Select only the last N bars of the continuation and regenerate them, leaving both the initial reference and earlier continuation intact (reuses Architecture A’s partial-regen semantics).
6. Evaluation Plan¶
6.1 Automatic symbolic metrics¶
We assess both coherence and novelty between reference and continuation.
- Key and scale consistency:
- Run key detection separately on reference and continuation.
- Measure:
- Key match rate.
- Degree of scale overlap.
  - Expect:
    - Same key/mode unless the user requested modulation.
- Meter & tempo consistency:
- Verify:
- All continuation bars sum to the correct number of beats for meter (e.g. 4 quarters in 4/4).
- Tempo tokens remain consistent (or follow user-requested changes).
  - Flag any anomalies as model / decoding bugs.
- Density & rhythmic profile:
- Compute notes per bar and IOI (inter-onset interval) distributions:
- Compare reference vs continuation, adjusted for user’s energy/density controls.
  - Expectations:
    - With “same energy/density”: similar ranges.
    - With “energy up”: increased density and more active rhythms, but still musically plausible.
- Motif similarity & structural continuity:
- Extract n-gram patterns (e.g. melodic 3–5 note sequences) from reference.
- Compute:
- Presence of these motifs (or transformed versions) in continuation.
- Proportion of continuation covered by references to these motifs.
- We want:
- Some motif reuse (for thematic coherence).
- No long exact clones of reference bars (avoid trivial repetition).
  - Also check transitions:
    - The start of the continuation should harmonically and rhythmically “answer” or extend the last bar of the reference, not contradict it abruptly.
- Repetition / looping diagnosis (see the metrics sketch after this list):
- Detect:
- Exact bar repeats in continuation.
- Short repeated cycles (e.g. 1–2 bars looped > 3 times).
- Apply thresholds:
- Repetition within reason is allowed (e.g. repeated ostinato), but degenerate loops should be rare.
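A minimal sketch of two of these checks: the fraction of continuation bars copied verbatim from the reference, and detection of short degenerate loops; bars are represented as token tuples and the thresholds are placeholders.

```python
# Sketch of two repetition checks: exact bar copies from the reference, and
# short degenerate loops inside the continuation. Bars are tuples of tokens.
def copied_bar_ratio(reference_bars: list[tuple], continuation_bars: list[tuple]) -> float:
    """Fraction of continuation bars that are exact copies of some reference bar."""
    ref_set = set(reference_bars)
    if not continuation_bars:
        return 0.0
    return sum(bar in ref_set for bar in continuation_bars) / len(continuation_bars)

def has_degenerate_loop(continuation_bars: list[tuple],
                        cycle_len: int = 2, max_repeats: int = 3) -> bool:
    """True if any 1-2 bar cycle repeats more than `max_repeats` times in a row."""
    for length in range(1, cycle_len + 1):
        run = 1
        for i in range(length, len(continuation_bars), length):
            if continuation_bars[i:i + length] == continuation_bars[i - length:i]:
                run += 1
                if run > max_repeats:
                    return True
            else:
                run = 1
    return False
```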
6.2 Automatic performance / audio checks¶
Given the existing performance module:
- Confirm:
  - No excessive timing drift across the full `reference + continuation` (bar boundaries remain stable).
  - Loudness and dynamic range are comparable between reference and continuation, or follow the energy control.
- Audio:
- No clipping, obvious clicks, or artifacts around the boundary.
6.3 Human listening tests¶
- Coherence rating:
- Present listeners with:
- Reference segment, followed by continuation.
- Ask:
- “How well does the second part follow the first?” (1–5 scale).
  - Success criterion:
    - Majority of continuations rated as at least moderately coherent.
- Style match rating:
- Ask:
- “How similar is the style of the continuation to the reference?” (1–5).
  - Compare across:
    - High vs medium vs low similarity settings.
- Instruction compliance:
- Provide the text or slider instructions used (e.g. “more energetic, 8 bars”).
- Ask:
- “Did the continuation match these instructions?” (1–5).
  - Evaluate:
    - Accuracy of energy/mood change controls.
- Baseline comparisons:
- Conditions:
- Our continuation vs:
- Naive baseline (random clip appended).
- Trivial repetition (copy last bar N times).
- Ask:
- “Which continuation feels more musical and coherent?”
  - Expect:
    - Strong preference for our model.
- Expert review (optional):
- Have pianists/composers annotate:
- Harmony continuity.
- Voice-leading quality.
- Phrase structure.
7. Risks & Minimal Experiments¶
7.1 Key risks¶
- Style drift vs over-adhesion:
  - The model might:
    - Drift away from the reference style.
    - Or overfit and copy the reference verbatim.
- Weak effect of user controls:
  - Sliders for similarity/energy might not translate into clear musical differences.
- Boundary artifacts:
  - Audible discontinuities at the reference→continuation join despite symbolic coherence.
- Out-of-distribution references:
  - The user’s reference may be outside the model’s style (e.g. jazz when the model is trained on cinematic piano).
- Latency:
  - Long references may increase generation time beyond interactive thresholds.
7.2 Minimal experiments¶
Experiment 1: Prefix–continuation fidelity¶
- Take held-out pieces from training data.
- Use first N bars as reference, ask model to generate next N bars.
- Evaluate:
- Symbolic metrics (key, meter, density, motif reuse).
- Human ratings of coherence and style match.
- If drift is frequent:
- Strengthen training with more prefix/continuation examples.
- Adjust decoding to be more conservative near prefix boundary.
Experiment 2: Control responsiveness¶
- Fix a reference and generate multiple continuations with different control settings:
- Similarity: low / medium / high.
- Energy: down / same / up.
- Measure:
- Changes in density, dynamic range, register, and motif reuse.
- Run quick listening test:
- “Which one sounds more energetic?”
- If differences are weak:
- Revisit mapping from controls → tokens/parameters.
- Consider additional conditioning channels (e.g. bar-level intensity embeddings).
Experiment 3: Boundary smoothness¶
- Generate continuations for a variety of references.
- Listen focusing on the join.
- Check:
- Timing alignment.
- Pedal continuity.
- Sudden changes in texture/dynamics.
- If artifacts appear:
- Add boundary-specific rules in performance module (e.g. phrase-end smoothing).
- Possibly generate 1–2 overlapping bars and cross-fade symbolically or at audio level.
Experiment 4: OOD reference robustness¶
- Use references from:
- User-supplied MIDI outside the training style.
- Extreme tempos, keys, densities.
- Observe:
- Does the model still produce musically coherent continuations?
- If behavior is poor:
- Clarify v0 style scope in product (e.g. “optimized for cinematic / emotive solo piano”).
- Add style detection to warn when references are far outside supported domain.
Experiment 5: Latency profiling¶
- Benchmark continuation generation:
- Different prefix lengths (4, 8, 16 bars).
- Different continuation lengths.
- Measure:
- End-to-end latency (symbolic + performance + render).
- If latency is too high:
- Limit maximum prefix bars considered.
- Cache prefix representations for reuse (KV caches in Transformer).
- Optionally provide “preview” mode with shorter continuation or less humanization.
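A minimal sketch of the profiling harness, where `run_pipeline` stands in for the full symbolic + performance + render path; only the measurement loop is shown.

```python
# Sketch of a latency profiling harness over prefix/continuation lengths.
import time

def profile_latency(run_pipeline, prefix_lengths=(4, 8, 16), continuation_lengths=(4, 8, 16)):
    """Return {(prefix_bars, continuation_bars): seconds} for one run of each setting."""
    results = {}
    for prefix_bars in prefix_lengths:
        for cont_bars in continuation_lengths:
            start = time.perf_counter()
            run_pipeline(prefix_bars=prefix_bars, continuation_bars=cont_bars)
            results[(prefix_bars, cont_bars)] = time.perf_counter() - start
    return results
```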
8. References¶
8.1 Internal design docs & research prompts¶
| Ref ID | Type | Title / Name | How it’s used in this design |
|---|---|---|---|
| 43 | Research prompt | “Reference-guided continuation for a single instrument” | Defines product framing, objectives, questions, and scope for single-instrument continuation. |
| 48 | Design doc | “Architecture A: Single-Instrument Text-to-Music Pipeline (Solo Piano Prototype)” | Provides overall text→symbolic→audio pipeline, representation, and UX for solo piano. |
| 46 | Design doc | “Humanization and Performance Modeling for Solo Piano (Architecture A Pipeline)” | Defines performance representation and humanization module integrated into Architecture A. |
| 44 | Research prompt | “Architecture A single-instrument prototype: text-to-symbolic-to-audio piano/guitar pipeline” | Supplies original questions and constraints for Architecture A’s symbolic core and evaluation. |
| 45 | Research prompt | “Humanization and performance modeling for a single instrument” | Higher-level performance modeling goals that the solo-piano doc specializes and implements. |
8.2 External datasets & methods (conceptual references)¶
These are conceptual references used for training/inspiration; details are described in the Architecture A and humanization docs.
| Name | Type | Role in system |
|---|---|---|
| MAESTRO | Piano dataset | Source of expressive solo-piano performances for symbolic and performance modeling. |
| EMOPIA | Emotion-labeled piano dataset | Provides emotion tags and modern piano clips for mood-conditioned generation. |
| GiantMIDI-Piano | Piano dataset | Large classical piano corpus for additional symbolic training. |
| MIDI-DDSP | Method family | Example of future neural renderer; not required for continuation v0. |
All continuation design choices are compatible with the underlying architecture and performance modules, so the continuation feature can be added as a thin extension over the existing pipeline, reusing as much infrastructure and representation as possible.