Research prompt
Title
Architecture A single-instrument prototype: a text-to-symbolic-to-audio piano/guitar pipeline
1. Context and assumption
We assume Architecture A is the “symbolic-first + sampler / MIDI-DDSP” path within a broader text-to-symbolic-to-audio system:
- Text prompts (and optionally simple style tags) are mapped to a conditioning representation.
- A symbolic composition core generates a single-track, instrument-specific performance (e.g. piano or guitar MIDI).
- A high-quality virtual instrument (sampler, physical model, or lightweight neural renderer) converts that performance into audio.
- Users interact via a Udio-style UX: they type prompts, get audio, and can refine or regenerate parts.
We now want to design Architecture A as a buildable, end-to-end prototype focused on a single instrument, short clips (e.g. 15–60 seconds), and a constrained set of styles.
Treat this as the minimal “walking skeleton” of the entire product: every key step exists and is wired together, even if quality is not yet production-grade.
2. Objectives
- Define a concrete single-instrument pipeline
  - Specify, in engineering-level detail, how to go from text prompt → conditioning → symbolic structure → symbolic performance → instrument render → audio file for one instrument (piano or guitar).
- Choose and justify a target instrument and style scope
  - Pick one: e.g. expressive solo piano or strummed/accompaniment guitar.
  - Define a narrow initial style range (e.g. “emotional cinematic piano,” “lo-fi piano,” “acoustic rhythm guitar pop”).
- Specify symbolic representations and interfaces
  - Event/timeline formats for structure, notes, timing, dynamics, articulations.
  - How modules pass information between each other (APIs, schemas); a minimal interface sketch follows this list.
- Propose 1–2 viable model architectures for the symbolic core
  - At least one “minimal, implementable now” option.
  - Optionally one more ambitious but still realistic option.
- Define a concrete training and data plan
  - Instrument-specific datasets, including licensing considerations.
  - Preprocessing, tokenization, augmentation.
- Define evaluation methods (automatic + human)
  - What “good enough” means for this prototype.
  - How to measure musical quality, style match, and prompt controllability.
- Identify the riskiest assumptions and minimal experiments
  - For each major assumption, specify a small, concrete experiment to validate or kill it.
  - Prioritize by impact and ease.
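For concreteness, a minimal sketch of the module boundaries these objectives imply is given below. The dataclass names and fields are illustrative assumptions made in this prompt, not a fixed API; the real schema is one of the deliverables in section 5.

```python
# Hypothetical interface sketch: the three payloads that cross module
# boundaries in Architecture A. All names and fields are illustrative.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PromptControls:
    """Output of the prompt-conditioning module."""
    mood: str                    # e.g. "melancholic"
    tempo_bpm: float             # e.g. 72.0
    meter: str = "4/4"           # e.g. "3/4"
    length_bars: int = 16
    density: float = 0.5         # 0..1, rough notes-per-bar proxy
    key: Optional[str] = None    # e.g. "C minor"; None = model's choice

@dataclass
class NoteEvent:
    """One note in the symbolic performance (single instrument)."""
    pitch: int                   # MIDI pitch 0-127
    start_beats: float           # onset in beats from clip start
    duration_beats: float
    velocity: int                # 1-127

@dataclass
class SymbolicClip:
    """Output of the symbolic core; input to the performance/render stages."""
    controls: PromptControls
    notes: List[NoteEvent] = field(default_factory=list)
```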
3. Questions to answer
3.1 Product and UX framing
- What exactly does the prototype need to do for a non-musician user?
  - Example: “User types: ‘melancholic solo piano in 3/4, sparse left-hand, lots of space’ and gets 30 seconds of reasonably coherent piano that matches the mood and meter.”
- What kinds of prompts will be supported in v0? (A candidate prompt-parsing sketch follows this list.)
  - Mood/genre (melancholic, jazz, cinematic, lo-fi).
  - Tempo and meter (BPM, 4/4, 3/4, etc.).
  - Simple structural hints (“start soft, build up”, “left hand arpeggios, right hand melody”).
  - Hard musical constraints (optional, e.g. “in C minor,” “8 bars”).
- What editing operations must be supported in v0?
  - Regenerate the entire clip?
  - Regenerate the last N bars?
  - Fix tempo or length while preserving style?
  - Change mood but keep the rough rhythm?
- What latency is acceptable for:
  - A “fast preview” render?
  - A “higher quality” render?
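As a placeholder for whichever prompt-encoding option is chosen in 3.3, v0 could start from keyword/regex heuristics over the prompt text. The sketch below is an illustration only: the mood vocabulary, defaults, and field names are assumptions, not the proposed final mapping.

```python
# Illustrative v0 prompt parsing: heuristics that map free text onto the
# PromptControls fields sketched after section 2. Vocabulary and defaults
# here are assumptions to be replaced by a learned or richer mapping later.
import re

MOOD_WORDS = {"melancholic", "cinematic", "lo-fi", "jazzy", "uplifting"}

def parse_prompt(prompt: str) -> dict:
    text = prompt.lower()
    controls = {"mood": "neutral", "tempo_bpm": 90.0, "meter": "4/4", "length_bars": 16}
    for word in MOOD_WORDS:                      # first mood keyword wins
        if word in text:
            controls["mood"] = word
            break
    bpm = re.search(r"(\d{2,3})\s*bpm", text)    # explicit tempo, e.g. "70 bpm"
    if bpm:
        controls["tempo_bpm"] = float(bpm.group(1))
    meter = re.search(r"\b([1-9]?\d)/(2|4|8)\b", text)   # e.g. "3/4"
    if meter:
        controls["meter"] = meter.group(0)
    key = re.search(r"in ([a-g](?:#|b)?) (major|minor)", text)
    if key:
        controls["key"] = f"{key.group(1).upper()} {key.group(2)}"
    return controls

# Example:
# parse_prompt("melancholic solo piano in 3/4, around 70 bpm")
# -> {'mood': 'melancholic', 'tempo_bpm': 70.0, 'meter': '3/4', 'length_bars': 16}
```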
3.2 Data and licensing
- Which single-instrument corpora are realistic candidates for training?
  - Focus on corpora that are:
    - Clearly licensed for commercial use, or
    - Clearly separated as research-only for prototyping (with a plan to replace later).
- For each candidate dataset:
  - What is the size (number of pieces, hours, bars)?
  - What metadata exists (composer, style, tempo, key, time signature, annotations)?
  - How well does it match the desired style(s)?
- How will you handle the licensing stance?
  - Separate “fast research” models trained on permissive or NC datasets from “production-eligible” models trained only on commercially clean data.
  - Document a path from the former to the latter (a minimal manifest sketch follows this list).
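One way to keep the licensing stance auditable is a small manifest that tags every corpus with a tier. The entries below are placeholders: dataset names, licenses, sizes, and fields must come from the actual survey, not from this sketch.

```python
# Illustrative dataset manifest separating research-only from
# production-eligible corpora. All values are placeholders to be filled in.
DATASETS = [
    {
        "name": "example-solo-piano-corpus",      # placeholder name
        "license": "CC BY-NC 4.0",                # assumed; verify per dataset
        "tier": "research-only",                  # throwaway prototype models only
        "hours": 100,                             # fill in from the survey
        "metadata": ["composer", "tempo", "key"],
        "replacement_plan": "swap for a commercially licensed corpus before production",
    },
    {
        "name": "example-licensed-guitar-corpus", # placeholder name
        "license": "commercial agreement",        # assumed
        "tier": "production-eligible",
        "hours": 40,
        "metadata": ["style", "tempo"],
        "replacement_plan": None,
    },
]

def production_eligible(datasets):
    """Filter to corpora allowed for production-bound training runs."""
    return [d for d in datasets if d["tier"] == "production-eligible"]
```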
3.3 Symbolic representation
- What is the symbolic representation for the single instrument?
  - Token-based (NOTE_ON, NOTE_OFF, TIME_SHIFT, VELOCITY, etc.) vs. quantized piano roll vs. continuous event streams (a token-vocabulary sketch follows this list).
  - How are bars, beats, and tempo represented?
- How will you represent:
  - Meter and tempo changes?
  - Phrases and sections (intro, A, B, outro), even for short pieces?
  - Dynamics (pp–ff), pedaling (for piano), or strumming patterns (for guitar)?
- How do you encode the text prompt into a conditioning representation?
  - Option A: use an off-the-shelf text encoder and map to control tokens.
  - Option B: manually designed control channels (e.g. mood id, density level, tempo, complexity).
- How will symbolic structure be encoded?
  - Do you need a separate “structure planner” (bars, sections) before note-level generation, or is one flat sequence enough for short clips?
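To make the token-based option concrete, here is one possible REMI-style vocabulary sketch for a single instrument, assuming a 16-steps-per-bar grid and 8 velocity bins. All token names and resolutions are illustrative choices to be revisited during the actual design.

```python
# Sketch of an event-token vocabulary with global control tokens.
# Grid resolution, pitch range, and bucket sizes are assumptions.
STEPS_PER_BAR = 16
VELOCITY_BINS = 8

def build_vocab():
    vocab = ["PAD", "BOS", "EOS", "BAR"]
    vocab += [f"POS_{i}" for i in range(STEPS_PER_BAR)]          # position within bar
    vocab += [f"PITCH_{p}" for p in range(21, 109)]              # piano range A0-C8
    vocab += [f"DUR_{d}" for d in range(1, STEPS_PER_BAR + 1)]   # duration in grid steps
    vocab += [f"VEL_{v}" for v in range(VELOCITY_BINS)]
    # global control tokens injected from the prompt -> control mapping
    vocab += [f"MOOD_{m}" for m in ("melancholic", "cinematic", "lofi", "neutral")]
    vocab += [f"TEMPO_{t}" for t in range(40, 201, 10)]          # coarse BPM buckets
    vocab += [f"DENSITY_{d}" for d in range(4)]
    return {tok: i for i, tok in enumerate(vocab)}

def encode_note(position, pitch, duration, velocity_bin):
    """One note becomes a fixed 4-token group: POS, PITCH, DUR, VEL."""
    return [f"POS_{position}", f"PITCH_{pitch}", f"DUR_{duration}", f"VEL_{velocity_bin}"]

# A 2-note bar: ["BAR", "POS_0", "PITCH_60", "DUR_4", "VEL_5",
#                "POS_8", "PITCH_67", "DUR_8", "VEL_4"]
```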
3.4 Model architectures (symbolic core)
- Propose at least one minimal model architecture for the symbolic generator (a minimal sketch follows this list):
  - Example template: a Transformer decoder that predicts token sequences conditioned on:
    - Encoded prompt,
    - Global control tokens (tempo, style),
    - Optional previous bars (for continuation).
- If you propose a second, more advanced variant:
  - What additional structure does it exploit? (e.g. bar-level or phrase-level models, hierarchical transformers.)
  - What extra complexity does it add, and why is it worth it?
- How will you handle:
  - Fixed-length vs. variable-length outputs?
  - Control over density (notes per bar) and register (range of pitches)?
  - For guitar: voicings and playability constraints.
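One possible shape for the “minimal, implementable now” option, sketched in PyTorch: a decoder-only Transformer over the event tokens from 3.3, with global control tokens simply prepended to the sequence. All hyperparameters here are placeholders, and the class name is invented for this sketch.

```python
# Minimal sketch of the symbolic core: a causal Transformer over event tokens.
# Control tokens (mood, tempo, density) are prepended to the token sequence.
import torch
import torch.nn as nn

class SymbolicGenerator(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, n_layers: int = 6,
                 n_heads: int = 8, max_len: int = 2048):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) = [control tokens ..., BAR, POS, PITCH, DUR, VEL, ...]
        b, t = tokens.shape
        positions = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(positions)
        causal = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        h = self.backbone(x, mask=causal)        # causal self-attention only
        return self.head(h)                      # next-token logits

# Training: teacher forcing with cross-entropy between logits[:, :-1] and tokens[:, 1:].
```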
3.5 Performance rendering
- How do you turn a “clean” symbolic sequence into a human-like performance?
  - Humanization of timing and velocity.
  - Articulation and pedal (piano), or strum direction and picking patterns (guitar).
- Is performance modeling:
  - Learned (e.g. a model that maps a clean score to an expressive performance)?
  - Rule-based (heuristics + noise)?
  - Hybrid?
- How do you ensure the performance layer does not destroy the structural alignment (bars, beats) needed for looping or concatenation? (A rule-based sketch follows this list.)
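If the first pass is rule-based, a sketch like the one below (operating on the hypothetical NoteEvent objects from the section 2 sketch) illustrates one way to jitter timing and velocity while clamping onsets inside their bar, so bar/beat alignment survives. Parameter values are arbitrary starting points, not tuned settings.

```python
# Rule-based humanization sketch: timing jitter + velocity variation,
# with note onsets kept inside their original bar to preserve alignment.
import random

def humanize(notes, beats_per_bar=4, timing_jitter=0.02, vel_jitter=6, seed=0):
    rng = random.Random(seed)
    out = []
    for n in notes:
        bar_start = (n.start_beats // beats_per_bar) * beats_per_bar
        bar_end = bar_start + beats_per_bar
        start = n.start_beats + rng.uniform(-timing_jitter, timing_jitter)
        start = min(max(start, bar_start), bar_end - 1e-3)      # stay inside the bar
        velocity = int(min(127, max(1, n.velocity + rng.randint(-vel_jitter, vel_jitter))))
        out.append(type(n)(pitch=n.pitch, start_beats=start,
                           duration_beats=n.duration_beats, velocity=velocity))
    return out
```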
3.6 Instrument render
- What rendering strategy will you use?
  - High-quality sample library?
  - Synth/physical model?
  - Lightweight neural renderer?
- What are the trade-offs between:
  - Sound quality,
  - Latency,
  - Resource usage (CPU/GPU),
  - Ease of deployment?
- What technical interface do you define between performance and renderer? (A MIDI-based sketch follows this list.)
  - MIDI?
  - Custom event API?
  - Control signals (e.g. continuous controllers)?
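Assuming the interface is plain MIDI, the hand-off could look like the sketch below: the performed NoteEvent list is written to a standard MIDI file via the pretty_midi package (assumed available), and the renderer backend, whether sampler, physical model, or neural, only needs to consume that file.

```python
# Sketch of a MIDI hand-off between the performance layer and the renderer.
# Assumes a single constant tempo for the clip; NoteEvent fields follow the
# section 2 sketch.
import pretty_midi

def write_midi(notes, tempo_bpm: float, path: str, program: int = 0):
    seconds_per_beat = 60.0 / tempo_bpm
    pm = pretty_midi.PrettyMIDI(initial_tempo=tempo_bpm)
    inst = pretty_midi.Instrument(program=program)     # 0 = acoustic grand piano
    for n in notes:
        start = n.start_beats * seconds_per_beat
        end = start + n.duration_beats * seconds_per_beat
        inst.notes.append(pretty_midi.Note(velocity=n.velocity, pitch=n.pitch,
                                           start=start, end=end))
    pm.instruments.append(inst)
    pm.write(path)
    return path

# The renderer side then only needs: render(midi_path) -> wav_path, whatever the backend.
```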
3.7 Training & inference workflows¶
-
How will you train the symbolic generator?
- Teacher forcing on token sequences?
- Additional losses for style or density control?
- How will you incorporate prompt-like conditioning if you do not have real prompts?
-
How do you bootstrap prompt → control mappings?
- Synthetic prompts derived from metadata?
- Manually labeled subsets?
- Clustering of pieces into style buckets and assigning tokens?
-
How will inference work end-to-end?
- Step-by-step: from text prompt to an audio file.
- Where do you cache intermediate results (e.g. symbolic sequences) for fast regeneration?
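A possible end-to-end inference flow, caching the symbolic clip keyed by the resolved controls so that re-rendering or re-humanizing does not re-run the symbolic model. Here generate_symbolic and render_audio are hypothetical wrappers around the model sampling loop and the chosen renderer backend; the other calls refer to the earlier sketches in this prompt.

```python
# End-to-end inference sketch: prompt -> controls -> symbolic clip (cached)
# -> humanized performance -> MIDI -> audio. Paths and helpers are illustrative.
import hashlib, json, os

CACHE_DIR = "cache/symbolic"

def prompt_to_audio(prompt: str, out_wav: str):
    controls = parse_prompt(prompt)                       # 3.1 sketch
    key = hashlib.sha256(json.dumps(controls, sort_keys=True).encode()).hexdigest()[:16]
    cache_path = os.path.join(CACHE_DIR, f"{key}.json")

    if os.path.exists(cache_path):                        # reuse cached symbolic clip
        with open(cache_path) as f:
            notes = [NoteEvent(**n) for n in json.load(f)]
    else:
        notes = generate_symbolic(controls)               # hypothetical sampling wrapper
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(cache_path, "w") as f:
            json.dump([vars(n) for n in notes], f)

    performed = humanize(notes)                           # 3.5 sketch
    midi_path = write_midi(performed, controls["tempo_bpm"], out_wav + ".mid")  # 3.6 sketch
    return render_audio(midi_path, out_wav)               # hypothetical renderer backend
```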
3.8 Evaluation
- What automatic metrics can you use? (A symbolic-metric sketch follows this list.)
  - Symbolic-level:
    - Pitch and rhythm distributions,
    - Note density per bar,
    - Key/mode detection,
    - Repetition and motif statistics.
  - Audio-level:
    - Loudness normalization,
    - Spectral properties, basic perceptual proxies.
- What human evaluation protocol will you define?
  - Listening tests comparing:
    - Generated vs. real pieces from the target corpus.
    - Generated pieces under different prompt conditions.
  - Rating dimensions:
    - Musicality,
    - Style match to prompt,
    - Plausibility as a real human performance,
    - Prompt controllability.
- What is a realistic baseline?
  - For example, a simple rule-based generator or a naive Markov/LM baseline to beat.
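Two of the symbolic-level metrics above are cheap to compute directly on the NoteEvent list; a sketch follows (assuming the section 2 data structures). What thresholds count as “good enough” for these numbers is exactly what the evaluation plan needs to decide.

```python
# Symbolic-metric sketch: note density per bar and a pitch-class histogram
# (a cheap proxy for key/mode checks against the target corpus).
from collections import Counter

def notes_per_bar(notes, beats_per_bar=4, n_bars=None):
    counts = Counter(int(n.start_beats // beats_per_bar) for n in notes)
    if n_bars is None:
        n_bars = (max(counts) + 1) if counts else 0
    return [counts.get(i, 0) for i in range(n_bars)]

def pitch_class_histogram(notes):
    hist = [0.0] * 12
    for n in notes:
        hist[n.pitch % 12] += 1
    total = sum(hist) or 1.0
    return [h / total for h in hist]   # compare against the target corpus distribution
```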
4. Scope and constraints
- Single instrument (choose one: piano or guitar) for the core of this research.
- Short clips only: e.g. 15–60 seconds, no full songs.
- No multi-track arrangements yet.
- No vocals in this phase.
- UX oriented to:
  - Simple text prompts,
  - A small number of sliders/toggles (length, tempo, density, mood).
- Licensing stance:
  - Clean separation between research-only resources and anything intended for production.
  - The architecture itself must not rely on inherently non-commercial-only components.
5. Artifacts and deliverables
The research should produce:
- A written system design for the single-instrument Architecture A prototype:
  - Module diagram and data flow.
  - Interface definitions between modules (including schema snippets or pseudo-APIs).
- A concrete data plan:
  - Enumerated datasets with licensing notes.
  - Preprocessing pipeline design.
- Model architecture specs:
  - Symbolic generator (at least one minimal, one optional advanced).
  - Performance renderer design (rules, learned, or hybrid).
  - Instrument render integration.
- Training and inference recipes:
  - Data splits, tokenization, training objectives.
  - Inference flow, caching strategy, latency targets.
- Evaluation plan:
  - Automatic metrics and their limitations.
  - Human listening test protocol.
- Risk register and experiment plan:
  - List of key assumptions.
  - Minimal experiments to validate or kill each assumption.
  - Prioritized list of experiments for a first implementation cycle.
6. Process guidance
- Start by fixing:
  - Target instrument,
  - Target styles,
  - Clip length,
  - Latency expectations.
- Map prompts to control parameters:
  - Decide which aspects of the music are prompt-controlled in v0.
  - Specify a schema for those controls.
- Design the symbolic representation and tokenization:
  - Ensure it can express all the controls you want.
- Propose and compare model variants:
  - At least one that is implementable now with modest resources.
- Design the performance and render stages to be:
  - Pluggable (swappable instrument backends),
  - Fast enough for interactive preview.
- Keep a running list of risks and questions:
  - Periodically re-rank them as you learn more.
7. Non-goals
This research does not need to:
- Solve multi-instrument arrangements.
- Handle advanced editing like per-note piano roll editing in the UI.
- Implement full mixing/mastering chains.
- Address advanced audio source separation or reference-audio extraction beyond simple future hooks.
8. Success criteria
We consider this research successful if it yields:
- One or two clear, buildable designs for a single-instrument Architecture A prototype.
- A realistic data and training plan that respects licensing constraints.
- A set of minimal experiments that can be executed to validate feasibility within weeks, not months.
- A clear explanation of what we get from this prototype in terms of:
  - Demonstrated capabilities,
  - Known limitations,
  - Next steps towards multi-track and vocals.
Only after this synthesis and experiment plan should we commit significant engineering time to building the prototype.