Research prompt
Title
Reference-guided continuation for a single instrument
1. Context and assumptions
We assume:
- Architecture A can generate single-instrument clips from text prompts alone.
- Symbolic and performance layers exist (clean score → humanized performance → audio).
- Users also want to continue or extend an existing musical idea for the same instrument.
This research focuses on reference-guided continuation:
- Input: a short reference snippet for the target instrument (symbolic, and optionally audio in later phases), plus optional prompt text.
- Output: a continuation that:
  - Maintains local musical coherence (key, tempo, feel).
  - Respects the style and basic patterns of the reference (rhythm, density, register).
  - Is not a trivial copy/paste or near-duplicate.
For this research, assume the main, practical v0 path uses symbolic reference input (e.g. MIDI). Audio-to-symbolic extraction can be treated as a separate future extension.
2. Objectives
- Define what “continuation” means operationally
  - Temporal scope (e.g. 4–16 bars).
  - Degree of adherence to reference style vs freedom to evolve.
- Specify input and conditioning formats
  - How symbolic reference segments are represented.
  - How optional text prompts interact with the reference.
- Design 1–2 continuation approaches
  - At least one minimal approach, implementable now.
  - Optionally one more advanced approach (e.g. style embedding, motif-aware model).
- Establish constraints for coherence and diversity
  - How to prevent near-duplication or looping artifacts.
  - How to avoid abrupt changes in feel/key/tempo unless requested.
- Define evaluation and success criteria
  - Human and automatic methods to assess coherence, style match, and novelty.
- Identify the riskiest assumptions and minimal experiments
  - Especially around style conditioning and avoiding overfitting to the reference.
3. Questions to answer
3.1 Product and UX framing
- What are the key user flows for continuation?
Examples:
  - User records or imports a short MIDI sketch and asks: “Continue this for 8 bars in the same style.”
  - User selects the last 4 bars of a generated clip and asks: “Give me an alternative continuation that is more energetic.”
  - User gives text prompt + reference: “Make this pattern evolve into something more dramatic over the next 8 bars.”
- What controls does the user have during continuation?
Potential controls (a request sketch follows this list):
  - Length of continuation (bars/seconds).
  - “Similarity to reference” (low → high).
  - “Energy / density” change (down, same, up).
  - Optional mood change.
- What constraints must be respected?
  - Keep tempo and meter consistent unless asked otherwise.
  - Maintain key/scale unless the user explicitly asks to modulate.
  - Avoid copying full bars verbatim beyond what is musically natural.
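To make the controls above concrete, here is a minimal sketch of how a single continuation request could be structured. The field names (`length_bars`, `similarity`, `energy_change`, etc.) and defaults are illustrative assumptions, not a committed API.

```python
# Hypothetical request schema for a continuation call; field names and
# defaults are illustrative, not a fixed API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContinuationRequest:
    reference_midi_path: str            # user-selected symbolic reference (MIDI)
    length_bars: int = 8                # length of the continuation to generate
    similarity: float = 0.7             # 0.0 = free evolution, 1.0 = stay very close
    energy_change: str = "same"         # "down" | "same" | "up"
    mood: Optional[str] = None          # optional mood change, e.g. "more dramatic"
    prompt_text: Optional[str] = None   # optional free-text intent
    keep_key: bool = True               # maintain key/scale unless user asks to modulate
    keep_tempo_meter: bool = True       # keep tempo and meter unless overridden

# Example: "continue this sketch for 8 bars, a bit more energetic"
request = ContinuationRequest(
    reference_midi_path="sketch.mid",
    length_bars=8,
    energy_change="up",
    prompt_text="more energetic",
)
```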
3.2 Representation and segmentation
- How is the reference segment represented?
  - Same symbolic format as Architecture A (tokens/notes with bar/beat positions, velocities, etc.); a minimal sketch follows this list.
  - Any additional structural tags needed? (e.g. phrase boundaries.)
- How is the continuation segment represented?
  - Same representation, but with explicit boundary between reference and continuation.
- How do you segment the reference and continuation?
  - Fixed-length window (e.g. last N bars as conditioning).
  - Or variable length depending on user selection.
- How do you handle partial bars or non-aligned references?
  - Round to nearest bar?
  - Allow half-bar or beat-level alignment?
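The sketch below illustrates one possible note-level representation with an explicit reference/continuation boundary and a bar-snapping helper for non-aligned references. `NoteEvent`, `ContinuationExample`, and `snap_to_bar` are hypothetical names; the real format should follow Architecture A's tokenization.

```python
# Minimal sketch of a note-level symbolic representation with an explicit
# reference/continuation boundary. Names are illustrative, not a fixed schema.
from dataclasses import dataclass
from typing import List

@dataclass
class NoteEvent:
    bar: int          # bar index from the start of the segment
    beat: float       # onset position within the bar, in beats
    pitch: int        # MIDI pitch
    velocity: int     # MIDI velocity
    duration: float   # duration in beats

@dataclass
class ContinuationExample:
    reference: List[NoteEvent]     # last N bars selected as conditioning
    continuation: List[NoteEvent]  # bars to be generated (empty at inference time)
    boundary_bar: int              # first bar index that belongs to the continuation

def snap_to_bar(notes: List[NoteEvent], round_up: bool = True) -> int:
    """Handle partial or non-aligned references by snapping the boundary to a
    full bar (here: the bar after the last note onset when rounding up)."""
    if not notes:
        return 0
    last_bar = max(n.bar for n in notes)
    return last_bar + 1 if round_up else last_bar
```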
3.3 Conditioning on the reference
- What information from the reference is used for conditioning?
Candidates:
  - Key/scale and tonal center.
  - Tempo and meter.
  - Register (pitch range).
  - Density (notes per bar).
  - Rhythmic motifs or patterns.
  - Harmonic information (for chords/arpeggios).
- What conditioning approaches are possible?
Approach A (minimal):
  - Concatenate reference tokens and have the model autoregressively continue.
  - Possibly clip the context length to the last N bars.
Approach B (structured):
  - Extract features/summary from the reference (e.g. density, contour, rhythm patterns); a feature-extraction sketch follows this list.
  - Condition the continuation model on both tokens and these features.
- How do we combine text prompts with reference conditioning?
  - Text for high-level intent (“more energetic”, “sparser”, “modulate to G major”).
  - Reference for detailed style (groove, voicing habits).
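As a sketch of the Approach B direction, the snippet below summarizes a reference into a few coarse features (density, register, and a quantized onset pattern) that could be turned into control tokens or an embedding fed alongside the raw reference tokens. The tuple layout, feature names, and quantization grid are assumptions for illustration.

```python
# Sketch of Approach B style conditioning: summarize the reference into coarse
# features that the continuation model can be conditioned on. Feature names
# and the (bar, beat, pitch) tuple layout are assumptions.
from typing import Dict, List, Tuple

Note = Tuple[int, float, int]  # (bar index, onset beat within the bar, MIDI pitch)

def reference_features(notes: List[Note], beats_per_bar: int = 4,
                       steps_per_beat: int = 4) -> Dict[str, object]:
    if not notes:
        return {"notes_per_bar": 0.0, "pitch_low": None, "pitch_high": None,
                "onset_pattern": []}
    n_bars = max(bar for bar, _, _ in notes) + 1
    pitches = [p for _, _, p in notes]
    # Quantize onsets of the last bar onto a 16th-note grid as a crude rhythm motif.
    steps = beats_per_bar * steps_per_beat
    last_bar = n_bars - 1
    onset_pattern = [0] * steps
    for bar, beat, _ in notes:
        if bar == last_bar:
            onset_pattern[min(int(round(beat * steps_per_beat)), steps - 1)] = 1
    return {
        "notes_per_bar": len(notes) / n_bars,  # density
        "pitch_low": min(pitches),             # register, low end
        "pitch_high": max(pitches),            # register, high end
        "onset_pattern": onset_pattern,        # rhythmic skeleton of the last bar
    }

# Example: three notes across two bars
features = reference_features([(0, 0.0, 60), (0, 2.0, 64), (1, 0.5, 67)])
```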
3.4 Continuation model design
- What model architectures are viable for continuation?
Minimal option:
  - Same model family as Architecture A’s symbolic generator, used in “conditional continuation” mode:
    - Input: reference sequence (and optional prompt encoding).
    - Output: new tokens for continuation.
More advanced option:
  - Model trained explicitly on continuation tasks:
    - Input: (prefix, desired continuation length, control tokens).
    - Output: continuation segment.
- How do you enforce:
  - Temporal coherence (no jumps in tempo or meter)?
  - Tonal coherence (no random key jumps unless requested)?
  - Style coherence (feel similar but not identical)?
- How do you ensure novelty vs overfitting?
  - Penalize exact repetition of long n-grams from the reference? (A simple overlap check is sketched after this list.)
  - Use sampling strategies that avoid trivial looping?
  - Encourage controlled variation (e.g. motif transformation).
- What is the minimal training setup to get a useful continuation model?
  - Can we train on generic single-instrument data by:
    - Splitting pieces into (prefix, continuation) pairs?
    - Adding random crop positions?
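The sketch below illustrates the two mechanical pieces mentioned above: building (prefix, continuation) training pairs from tokenized pieces with random crop positions, and a naive check for the longest token n-gram a continuation copies verbatim from its reference. Window sizes, token types, and function names are assumptions, not a fixed pipeline.

```python
# Sketch of a minimal data setup and a naive novelty check for continuation.
# Assumes pieces are already tokenized into flat symbolic token sequences.
import random
from typing import List, Sequence, Tuple

def make_pairs(tokens: Sequence[int], prefix_len: int, cont_len: int,
               n_pairs: int, seed: int = 0) -> List[Tuple[Sequence[int], Sequence[int]]]:
    """Split one piece into (prefix, continuation) pairs at random crop positions."""
    rng = random.Random(seed)
    pairs = []
    max_start = len(tokens) - (prefix_len + cont_len)
    for _ in range(n_pairs):
        if max_start <= 0:
            break
        start = rng.randint(0, max_start)
        prefix = tokens[start:start + prefix_len]
        continuation = tokens[start + prefix_len:start + prefix_len + cont_len]
        pairs.append((prefix, continuation))
    return pairs

def longest_copied_ngram(reference: Sequence[int], continuation: Sequence[int]) -> int:
    """Length of the longest token n-gram the continuation copies verbatim from
    the reference; can flag near-duplicates or feed a repetition penalty."""
    best = 0
    for n in range(1, len(continuation) + 1):
        ref_ngrams = {tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)}
        if any(tuple(continuation[i:i + n]) in ref_ngrams
               for i in range(len(continuation) - n + 1)):
            best = n
        else:
            break
    return best
```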
3.5 Integration with performance and rendering
- At what stage does continuation operate?
  - Symbolic level:
    - Reference symbolic → continuation symbolic.
  - Performance level:
    - Apply performance modeling after the full sequence (reference + continuation) is decided.
  - Audio level:
    - Continuation is rendered with the same instrument and performance style.
- How do we ensure seamless audio joins?
  - Align on bar boundaries.
  - Use consistent performance parameters across the boundary.
  - Avoid sudden changes in loudness or timbre.
- How do we handle multiple continuation attempts?
  - Preserve the reference as immutable.
  - Cache different continuation variants (symbolic and/or audio).
  - Support A/B listening in UX.
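One possible shape for the variant handling above is sketched below: the reference is never mutated, and each regeneration only adds a variant keyed by the reference content plus the control settings, so A/B listening can enumerate everything generated so far. Class and field names are illustrative.

```python
# Sketch of caching multiple continuation attempts while keeping the reference
# immutable. The cache key combines reference content and control settings.
import hashlib
import json
from typing import Dict, List, Tuple

class ContinuationCache:
    def __init__(self) -> None:
        self._variants: Dict[str, List[dict]] = {}

    @staticmethod
    def key(reference_tokens: Tuple[int, ...], controls: dict) -> str:
        payload = json.dumps({"ref": list(reference_tokens), "controls": controls},
                             sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def add_variant(self, key: str, variant: dict) -> None:
        """Store another generated continuation (symbolic tokens and/or audio path)."""
        self._variants.setdefault(key, []).append(variant)

    def variants(self, key: str) -> List[dict]:
        """All cached variants for the same reference + controls, for A/B listening."""
        return list(self._variants.get(key, []))

# Usage: the reference tokens are never mutated; regeneration only adds variants.
cache = ContinuationCache()
k = ContinuationCache.key((60, 62, 64), {"length_bars": 8, "energy_change": "up"})
cache.add_variant(k, {"tokens": [65, 67], "audio_path": "variant_01.wav"})
```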
3.6 Evaluation
- What automatic metrics can approximate good continuation?
Symbolic-level (a metric sketch follows this list):
  - Key and scale consistency between reference and continuation.
  - Similar density and rhythmic complexity, unless controls say otherwise.
  - Motif similarity measures (e.g. pattern reuse with variation).
- What human evaluation setup is needed?
Examples:
  - Blind tests where listeners rate:
    - How coherent the continuation feels with the reference.
    - How well it matches a target instruction (e.g. “more energetic”).
  - Comparisons:
    - Continuation vs naive baseline (e.g. unrelated generated clip).
    - Multiple continuation options for the same reference.
- What minimum threshold defines success?
  - A majority of listeners rate continuations as coherent and stylistically similar.
  - Few cases of abrupt or jarring transitions (quantified via user feedback).
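Two of the symbolic-level checks listed above could be approximated as follows: pitch-class profile overlap between reference and continuation as a rough proxy for key/scale consistency, and a note-density ratio. The representation (lists of (bar, pitch) pairs) and the flagging thresholds in the comments are assumptions, not validated values.

```python
# Sketch of two symbolic-level automatic checks: pitch-class profile overlap
# (rough key/scale consistency) and note-density similarity. Thresholds and
# the (bar, pitch) representation are assumptions for illustration.
from collections import Counter
from typing import List, Tuple

Note = Tuple[int, int]  # (bar index, MIDI pitch)

def pitch_class_profile(notes: List[Note]) -> List[float]:
    counts = Counter(pitch % 12 for _, pitch in notes)
    total = sum(counts.values()) or 1
    return [counts.get(pc, 0) / total for pc in range(12)]

def profile_overlap(a: List[float], b: List[float]) -> float:
    """Histogram intersection in [0, 1]; higher means more similar tonal content."""
    return sum(min(x, y) for x, y in zip(a, b))

def density(notes: List[Note]) -> float:
    if not notes:
        return 0.0
    n_bars = max(bar for bar, _ in notes) + 1
    return len(notes) / n_bars

def continuation_checks(reference: List[Note], continuation: List[Note]) -> dict:
    key_consistency = profile_overlap(pitch_class_profile(reference),
                                      pitch_class_profile(continuation))
    dens_ref, dens_cont = density(reference), density(continuation)
    return {
        "key_consistency": key_consistency,                      # e.g. flag below ~0.7 (assumed)
        "density_ratio": dens_cont / dens_ref if dens_ref else 0.0,  # e.g. flag outside ~[0.5, 2.0]
    }
```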
4. Scope and constraints
- Single instrument (same as Architecture A and the performance module).
- Primary input modality: symbolic reference (MIDI or internal symbolic representation).
- Audio reference support can be treated as:
  - Out of scope for v0, or
  - A separate future research track (audio-to-symbolic extraction for this instrument).
- Continuations limited to short spans (e.g. up to 8–16 bars) for this research.
- Must preserve:
  - Tempo and meter unless explicitly overridden.
  - Overall key unless explicitly overridden.
5. Artifacts and deliverables
The research should produce:
- Continuation problem definition
  - Precise statement of what continuation means in v0.
  - Constraints and UX expectations.
- Input/output and conditioning spec
  - How reference, prompt, and control parameters are represented.
  - How they feed into the continuation model.
- Model design(s)
  - At least one minimal approach, including:
    - Architecture,
    - Training procedure,
    - Inference strategy.
- Integration design
  - Where continuation sits in the full pipeline.
  - How it interfaces with performance modeling and rendering.
  - Caching and UX behavior for multiple continuations.
- Evaluation plan
  - Automatic metrics.
  - Human listening test designs.
- Risk and experiment plan
  - Key assumptions (e.g. “simple autoregressive continuation with trimming is enough for v0”).
  - Small experiments to test these assumptions before full implementation.
6. Process guidance
- Start by defining the user stories and constraints for continuation.
- Specify the symbolic representation and segmentation strategy for references.
- Design the simplest possible continuation approach consistent with these constraints.
- Prototype on existing single-instrument data (prefix → continuation) and evaluate qualitatively.
- Iterate on conditioning and sampling to balance coherence and novelty.
- Document what works well enough for v0 and what should be deferred.
7. Non-goals
This research does not need to:
- Handle multi-instrument or full-band continuations.
- Perform robust audio-to-symbolic transcription for arbitrary user audio.
- Guarantee global song-level structure beyond the local continuation window.
- Incorporate advanced user editing tools beyond basic selection and regeneration.