Research prompt
Title
Reference-guided continuation for a single instrument
1. Context and assumptions
We assume:
- Architecture A can generate single-instrument clips from text prompts alone.
- Symbolic and performance layers exist (clean score → humanized performance → audio).
- Users also want to continue or extend an existing musical idea for the same instrument.
This research focuses on reference-guided continuation:
- Input: a short reference snippet for the target instrument (symbolic, and optionally audio in later phases), plus optional prompt text.
- Output: a continuation that:
  - Maintains local musical coherence (key, tempo, feel).
  - Respects the style and basic patterns of the reference (rhythm, density, register).
  - Is not a trivial copy/paste or near-duplicate.
For this research, assume the main, practical v0 path uses symbolic reference input (e.g. MIDI). Audio-to-symbolic extraction can be treated as a separate future extension.
2. Objectives
- Define what “continuation” means operationally
  - Temporal scope (e.g. 4–16 bars).
  - Degree of adherence to reference style vs freedom to evolve.
- Specify input and conditioning formats
  - How symbolic reference segments are represented.
  - How optional text prompts interact with the reference.
- Design 1–2 continuation approaches
  - At least one minimal approach, implementable now.
  - Optionally one more advanced approach (e.g. style embedding, motif-aware model).
- Establish constraints for coherence and diversity
  - How to prevent near-duplication or looping artifacts.
  - How to avoid abrupt changes in feel/key/tempo unless requested.
- Define evaluation and success criteria
  - Human and automatic methods to assess coherence, style match, and novelty.
- Identify the riskiest assumptions and minimal experiments
  - Especially around style conditioning and avoiding overfitting to the reference.
3. Questions to answer
3.1 Product and UX framing
- What are the key user flows for continuation?
Examples:
  - User records or imports a short MIDI sketch and asks: “Continue this for 8 bars in the same style.”
  - User selects the last 4 bars of a generated clip and asks: “Give me an alternative continuation that is more energetic.”
  - User gives text prompt + reference: “Make this pattern evolve into something more dramatic over the next 8 bars.”
- What controls does the user have during continuation?
Potential controls (a request sketch follows this list):
  - Length of continuation (bars/seconds).
  - “Similarity to reference” (low → high).
  - “Energy / density” change (down, same, up).
  - Optional mood change.
- What constraints must be respected?
  - Keep tempo and meter consistent unless asked otherwise.
  - Maintain key/scale unless the user explicitly asks to modulate.
  - Avoid copying full bars verbatim beyond what is musically natural.
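To make the controls above concrete, here is a minimal sketch of how a single continuation request could be structured. The field names (`length_bars`, `similarity`, `energy_change`, etc.) and defaults are illustrative assumptions, not a committed API.

```python
# Hypothetical request schema for a continuation call; field names and
# defaults are illustrative, not a fixed API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ContinuationRequest:
    reference_midi_path: str            # user-selected symbolic reference (MIDI)
    length_bars: int = 8                # length of the continuation to generate
    similarity: float = 0.7             # 0.0 = free evolution, 1.0 = stay very close
    energy_change: str = "same"         # "down" | "same" | "up"
    mood: Optional[str] = None          # optional mood change, e.g. "more dramatic"
    prompt_text: Optional[str] = None   # optional free-text intent
    keep_key: bool = True               # maintain key/scale unless user asks to modulate
    keep_tempo_meter: bool = True       # keep tempo and meter unless overridden

# Example: "continue this sketch for 8 bars, a bit more energetic"
request = ContinuationRequest(
    reference_midi_path="sketch.mid",
    length_bars=8,
    energy_change="up",
    prompt_text="more energetic",
)
```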
3.2 Representation and segmentation
- How is the reference segment represented?
  - Same symbolic format as Architecture A (tokens/notes with bar/beat positions, velocities, etc.); a minimal sketch follows this list.
  - Any additional structural tags needed? (e.g. phrase boundaries.)
- How is the continuation segment represented?
  - Same representation, but with explicit boundary between reference and continuation.
- How do you segment the reference and continuation?
  - Fixed-length window (e.g. last N bars as conditioning).
  - Or variable length depending on user selection.
- How do you handle partial bars or non-aligned references?
  - Round to nearest bar?
  - Allow half-bar or beat-level alignment?
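The sketch below illustrates one possible note-level representation with an explicit reference/continuation boundary and a bar-snapping helper for non-aligned references. `NoteEvent`, `ContinuationExample`, and `snap_to_bar` are hypothetical names; the real format should follow Architecture A's tokenization.

```python
# Minimal sketch of a note-level symbolic representation with an explicit
# reference/continuation boundary. Names are illustrative, not a fixed schema.
from dataclasses import dataclass
from typing import List

@dataclass
class NoteEvent:
    bar: int          # bar index from the start of the segment
    beat: float       # onset position within the bar, in beats
    pitch: int        # MIDI pitch
    velocity: int     # MIDI velocity
    duration: float   # duration in beats

@dataclass
class ContinuationExample:
    reference: List[NoteEvent]     # last N bars selected as conditioning
    continuation: List[NoteEvent]  # bars to be generated (empty at inference time)
    boundary_bar: int              # first bar index that belongs to the continuation

def snap_to_bar(notes: List[NoteEvent], round_up: bool = True) -> int:
    """Handle partial or non-aligned references by snapping the boundary to a
    full bar (here: the bar after the last note onset when rounding up)."""
    if not notes:
        return 0
    last_bar = max(n.bar for n in notes)
    return last_bar + 1 if round_up else last_bar
```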
3.3 Conditioning on the reference
- What information from the reference is used for conditioning?
Candidates:
  - Key/scale and tonal center.
  - Tempo and meter.
  - Register (pitch range).
  - Density (notes per bar).
  - Rhythmic motifs or patterns.
  - Harmonic information (for chords/arpeggios).
- What conditioning approaches are possible?
Approach A (minimal):
  - Concatenate reference tokens and have the model autoregressively continue.
  - Possibly clip the context length to the last N bars.
Approach B (structured):
  - Extract features/summary from the reference (e.g. density, contour, rhythm patterns); a feature-extraction sketch follows this list.
  - Condition the continuation model on both tokens and these features.
- How do we combine text prompts with reference conditioning?
  - Text for high-level intent (“more energetic”, “sparser”, “modulate to G major”).
  - Reference for detailed style (groove, voicing habits).
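As a sketch of the Approach B direction, the snippet below summarizes a reference into a few coarse features (density, register, and a quantized onset pattern) that could be turned into control tokens or an embedding fed alongside the raw reference tokens. The tuple layout, feature names, and quantization grid are assumptions for illustration.

```python
# Sketch of Approach B style conditioning: summarize the reference into coarse
# features that the continuation model can be conditioned on. Feature names
# and the (bar, beat, pitch) tuple layout are assumptions.
from typing import Dict, List, Tuple

Note = Tuple[int, float, int]  # (bar index, onset beat within the bar, MIDI pitch)

def reference_features(notes: List[Note], beats_per_bar: int = 4,
                       steps_per_beat: int = 4) -> Dict[str, object]:
    if not notes:
        return {"notes_per_bar": 0.0, "pitch_low": None, "pitch_high": None,
                "onset_pattern": []}
    n_bars = max(bar for bar, _, _ in notes) + 1
    pitches = [p for _, _, p in notes]
    # Quantize onsets of the last bar onto a 16th-note grid as a crude rhythm motif.
    steps = beats_per_bar * steps_per_beat
    last_bar = n_bars - 1
    onset_pattern = [0] * steps
    for bar, beat, _ in notes:
        if bar == last_bar:
            onset_pattern[min(int(round(beat * steps_per_beat)), steps - 1)] = 1
    return {
        "notes_per_bar": len(notes) / n_bars,  # density
        "pitch_low": min(pitches),             # register, low end
        "pitch_high": max(pitches),            # register, high end
        "onset_pattern": onset_pattern,        # rhythmic skeleton of the last bar
    }

# Example: three notes across two bars
features = reference_features([(0, 0.0, 60), (0, 2.0, 64), (1, 0.5, 67)])
```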
3.4 Continuation model design
- What model architectures are viable for continuation?
Minimal option:
  - Same model family as Architecture A’s symbolic generator, used in “conditional continuation” mode:
    - Input: reference sequence (and optional prompt encoding).
    - Output: new tokens for continuation.
More advanced option:
  - Model trained explicitly on continuation tasks:
    - Input: (prefix, desired continuation length, control tokens).
    - Output: continuation segment.
- How do you enforce:
  - Temporal coherence (no jumps in tempo or meter)?
  - Tonal coherence (no random key jumps unless requested)?
  - Style coherence (feel similar but not identical)?
- How do you ensure novelty vs overfitting?
  - Penalize exact repetition of long n-grams from the reference? (A simple overlap check is sketched after this list.)
  - Use sampling strategies that avoid trivial looping?
  - Encourage controlled variation (e.g. motif transformation).
- What is the minimal training setup to get a useful continuation model?
  - Can we train on generic single-instrument data by:
    - Splitting pieces into (prefix, continuation) pairs?
    - Adding random crop positions?
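The sketch below illustrates the two mechanical pieces mentioned above: building (prefix, continuation) training pairs from tokenized pieces with random crop positions, and a naive check for the longest token n-gram a continuation copies verbatim from its reference. Window sizes, token types, and function names are assumptions, not a fixed pipeline.

```python
# Sketch of a minimal data setup and a naive novelty check for continuation.
# Assumes pieces are already tokenized into flat symbolic token sequences.
import random
from typing import List, Sequence, Tuple

def make_pairs(tokens: Sequence[int], prefix_len: int, cont_len: int,
               n_pairs: int, seed: int = 0) -> List[Tuple[Sequence[int], Sequence[int]]]:
    """Split one piece into (prefix, continuation) pairs at random crop positions."""
    rng = random.Random(seed)
    pairs = []
    max_start = len(tokens) - (prefix_len + cont_len)
    for _ in range(n_pairs):
        if max_start <= 0:
            break
        start = rng.randint(0, max_start)
        prefix = tokens[start:start + prefix_len]
        continuation = tokens[start + prefix_len:start + prefix_len + cont_len]
        pairs.append((prefix, continuation))
    return pairs

def longest_copied_ngram(reference: Sequence[int], continuation: Sequence[int]) -> int:
    """Length of the longest token n-gram the continuation copies verbatim from
    the reference; can flag near-duplicates or feed a repetition penalty."""
    best = 0
    for n in range(1, len(continuation) + 1):
        ref_ngrams = {tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)}
        if any(tuple(continuation[i:i + n]) in ref_ngrams
               for i in range(len(continuation) - n + 1)):
            best = n
        else:
            break
    return best
```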
3.5 Integration with performance and rendering
- At what stage does continuation operate?
  - Symbolic level:
    - Reference symbolic → continuation symbolic.
  - Performance level:
    - Apply performance modeling after the full sequence (reference + continuation) is decided.
  - Audio level:
    - Continuation is rendered with the same instrument and performance style.
- How do we ensure seamless audio joins?
  - Align on bar boundaries.
  - Use consistent performance parameters across the boundary.
  - Avoid sudden changes in loudness or timbre.
- How do we handle multiple continuation attempts?
  - Preserve the reference as immutable.
  - Cache different continuation variants (symbolic and/or audio).
  - Support A/B listening in UX.
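One possible shape for the variant handling above is sketched below: the reference is never mutated, and each regeneration only adds a variant keyed by the reference content plus the control settings, so A/B listening can enumerate everything generated so far. Class and field names are illustrative.

```python
# Sketch of caching multiple continuation attempts while keeping the reference
# immutable. The cache key combines reference content and control settings.
import hashlib
import json
from typing import Dict, List, Tuple

class ContinuationCache:
    def __init__(self) -> None:
        self._variants: Dict[str, List[dict]] = {}

    @staticmethod
    def key(reference_tokens: Tuple[int, ...], controls: dict) -> str:
        payload = json.dumps({"ref": list(reference_tokens), "controls": controls},
                             sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def add_variant(self, key: str, variant: dict) -> None:
        """Store another generated continuation (symbolic tokens and/or audio path)."""
        self._variants.setdefault(key, []).append(variant)

    def variants(self, key: str) -> List[dict]:
        """All cached variants for the same reference + controls, for A/B listening."""
        return list(self._variants.get(key, []))

# Usage: the reference tokens are never mutated; regeneration only adds variants.
cache = ContinuationCache()
k = ContinuationCache.key((60, 62, 64), {"length_bars": 8, "energy_change": "up"})
cache.add_variant(k, {"tokens": [65, 67], "audio_path": "variant_01.wav"})
```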
3.6 Evaluation
- What automatic metrics can approximate good continuation?
Symbolic-level (a metric sketch follows this list):
  - Key and scale consistency between reference and continuation.
  - Similar density and rhythmic complexity, unless controls say otherwise.
  - Motif similarity measures (e.g. pattern reuse with variation).
- What human evaluation setup is needed?
Examples:
  - Blind tests where listeners rate:
    - How coherent the continuation feels with the reference.
    - How well it matches a target instruction (e.g. “more energetic”).
  - Comparisons:
    - Continuation vs naive baseline (e.g. unrelated generated clip).
    - Multiple continuation options for the same reference.
- What minimum threshold defines success?
  - A majority of listeners rate continuations as coherent and stylistically similar.
  - Few cases of abrupt or jarring transitions (quantified via user feedback).
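Two of the symbolic-level checks listed above could be approximated as follows: pitch-class profile overlap between reference and continuation as a rough proxy for key/scale consistency, and a note-density ratio. The representation (lists of (bar, pitch) pairs) and the flagging thresholds in the comments are assumptions, not validated values.

```python
# Sketch of two symbolic-level automatic checks: pitch-class profile overlap
# (rough key/scale consistency) and note-density similarity. Thresholds and
# the (bar, pitch) representation are assumptions for illustration.
from collections import Counter
from typing import List, Tuple

Note = Tuple[int, int]  # (bar index, MIDI pitch)

def pitch_class_profile(notes: List[Note]) -> List[float]:
    counts = Counter(pitch % 12 for _, pitch in notes)
    total = sum(counts.values()) or 1
    return [counts.get(pc, 0) / total for pc in range(12)]

def profile_overlap(a: List[float], b: List[float]) -> float:
    """Histogram intersection in [0, 1]; higher means more similar tonal content."""
    return sum(min(x, y) for x, y in zip(a, b))

def density(notes: List[Note]) -> float:
    if not notes:
        return 0.0
    n_bars = max(bar for bar, _ in notes) + 1
    return len(notes) / n_bars

def continuation_checks(reference: List[Note], continuation: List[Note]) -> dict:
    key_consistency = profile_overlap(pitch_class_profile(reference),
                                      pitch_class_profile(continuation))
    dens_ref, dens_cont = density(reference), density(continuation)
    return {
        "key_consistency": key_consistency,                      # e.g. flag below ~0.7 (assumed)
        "density_ratio": dens_cont / dens_ref if dens_ref else 0.0,  # e.g. flag outside ~[0.5, 2.0]
    }
```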
4. Scope and constraints
- Single instrument (same as Architecture A and the performance module).
- Primary input modality: symbolic reference (MIDI or internal symbolic representation).
- Audio reference support can be treated as:
  - Out of scope for v0, or
  - A separate future research track (audio-to-symbolic extraction for this instrument).
- Continuations limited to short spans (e.g. up to 8–16 bars) for this research.
- Must preserve:
  - Tempo and meter unless explicitly overridden.
  - Overall key unless explicitly overridden.
5. Artifacts and deliverables
The research should produce:
- Continuation problem definition
  - Precise statement of what continuation means in v0.
  - Constraints and UX expectations.
- Input/output and conditioning spec
  - How reference, prompt, and control parameters are represented.
  - How they feed into the continuation model.
- Model design(s)
  - At least one minimal approach, including:
    - Architecture,
    - Training procedure,
    - Inference strategy.
- Integration design
  - Where continuation sits in the full pipeline.
  - How it interfaces with performance modeling and rendering.
  - Caching and UX behavior for multiple continuations.
- Evaluation plan
  - Automatic metrics.
  - Human listening test designs.
- Risk and experiment plan
  - Key assumptions (e.g. “simple autoregressive continuation with trimming is enough for v0”).
  - Small experiments to test these assumptions before full implementation.
6. Process guidance
- Start by defining the user stories and constraints for continuation.
- Specify the symbolic representation and segmentation strategy for references.
- Design the simplest possible continuation approach consistent with these constraints.
- Prototype on existing single-instrument data (prefix → continuation) and evaluate qualitatively.
- Iterate on conditioning and sampling to balance coherence and novelty.
- Document what works well enough for v0 and what should be deferred.
7. Non-goals
This research does not need to:
- Handle multi-instrument or full-band continuations.
- Perform robust audio-to-symbolic transcription for arbitrary user audio.
- Guarantee global song-level structure beyond the local continuation window.
- Incorporate advanced user editing tools beyond basic selection and regeneration.