Research prompt

Title
Architecture A single-instrument prototype: text-to-symbolic-to-audio piano/guitar pipeline


1. Context and assumptions

We assume Architecture A is the “symbolic-first + sampler / MIDI-DDSP” path within a broader text-to-symbolic-to-audio system:

  • Text prompts (and optionally simple style tags) are mapped to a conditioning representation.
  • A symbolic composition core generates a single-track, instrument-specific performance (e.g. piano or guitar MIDI).
  • A high-quality virtual instrument (sampler, physical model, or lightweight neural renderer) converts that performance into audio.
  • Users interact via a Udio-style UX: they type prompts, get audio, and can refine or regenerate parts.

We now want to design Architecture A as a buildable, end-to-end prototype focused on a single instrument, short clips (e.g. 15–60 seconds), and a constrained set of styles.

Treat this as the minimal “walking skeleton” of the entire product: every key step exists and is wired together, even if quality is not yet production-grade.
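
To make “every key step exists and is wired together” concrete, the sketch below shows one possible way to express the four stage boundaries in Python. All names (PromptControls, SymbolicClip, NoteEvent, and the four functions) are hypothetical placeholders for this prototype design, not an existing API.

```python
# Hypothetical stage boundaries for the walking skeleton; all names are illustrative only.
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptControls:
    """Conditioning derived from the text prompt (and optional style tags)."""
    mood: str = "neutral"       # e.g. "melancholic"
    tempo_bpm: float = 90.0
    meter: str = "4/4"
    density: str = "medium"     # e.g. "sparse" | "medium" | "dense"
    length_bars: int = 16


@dataclass
class NoteEvent:
    pitch: int                  # MIDI pitch number, 0-127
    start_beats: float
    duration_beats: float
    velocity: int               # 1-127


@dataclass
class SymbolicClip:
    controls: PromptControls
    notes: List[NoteEvent] = field(default_factory=list)


# The four wired-together steps, as signatures only.
def parse_prompt(text: str) -> PromptControls: ...                # text -> conditioning
def generate_symbolic(ctrl: PromptControls) -> SymbolicClip: ...  # conditioning -> symbolic score
def render_performance(clip: SymbolicClip) -> SymbolicClip: ...   # score -> expressive performance
def render_audio(clip: SymbolicClip, out_path: str) -> str: ...   # performance -> audio file path
```

Keeping the stage boundaries this narrow is also what makes the performance and render backends pluggable later.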


2. Objectives

  1. Define a concrete single-instrument pipeline
    Specify, in engineering-level detail, how to go from text prompt → conditioning → symbolic structure → symbolic performance → instrument render → audio file for one instrument (piano or guitar).

  2. Choose and justify a target instrument and style scope
    • Pick one: e.g. expressive solo piano or strummed/accompaniment guitar.
    • Define a narrow initial style range (e.g. “emotional cinematic piano,” “lofi piano,” “acoustic rhythm guitar pop”).

  3. Specify symbolic representations and interfaces
    • Event/timeline formats for structure, notes, timing, dynamics, articulations.
    • How modules pass information between each other (APIs, schemas).

  4. Propose 1–2 viable model architectures for the symbolic core
    • At least one “minimal, implementable now” option.
    • Optionally one more ambitious but still realistic option.

  5. Define a concrete training and data plan
    • Instrument-specific datasets, including licensing considerations.
    • Preprocessing, tokenization, augmentation.

  6. Define evaluation methods (automatic + human)
    • What “good enough” means for this prototype.
    • How to measure musical quality, style match, and prompt controllability.

  7. Identify the riskiest assumptions and minimal experiments
    • For each major assumption, specify a small, concrete experiment to validate or kill it.
    • Prioritize by impact and ease.

3. Questions to answer

3.1 Product and UX framing

  1. What exactly does the prototype need to do for a non-musician user?
    • Example: “User types: ‘melancholic solo piano in 3/4, sparse left hand, lots of space’ and gets 30 seconds of reasonably coherent piano that matches the mood and meter.”

  2. What kinds of prompts will be supported in v0? (A sketch of a possible v0 control schema follows at the end of this subsection.)
    • Mood/genre (melancholic, jazz, cinematic, lo-fi).
    • Tempo and meter (BPM, 4/4, 3/4, etc.).
    • Simple structural hints (“start soft, build up”, “left hand arpeggios, right hand melody”).
    • Hard musical constraints (optional, e.g. “in C minor,” “8 bars”).

  3. What editing operations must be supported in v0?
    • Regenerate the entire clip?
    • Regenerate the last N bars?
    • Fix tempo or length while preserving style?
    • Change mood but keep the rough rhythm?

  4. What latency is acceptable for:
    • A “fast preview” render?
    • A “higher-quality” render?
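
As referenced above, here is one possible shape for the v0 controls derived from the example prompt, plus candidate editing operations as bare signatures. The keys, defaults, and operation set are assumptions to be pinned down by this research, not a fixed API.

```python
# Hypothetical mapping of the example prompt from item 1 to v0 controls.
example_prompt = "melancholic solo piano in 3/4, sparse left hand, lots of space"

v0_controls = {
    "instrument": "piano",
    "mood": "melancholic",
    "meter": "3/4",
    "tempo_bpm": 72,            # not stated in the prompt; fall back to a mood-dependent default
    "density": "sparse",
    "length_seconds": 30,
    "key": None,                # hard constraints stay optional in v0
}

# Candidate v0 editing operations, as signatures only.
def regenerate_clip(controls, seed=None): ...                     # new clip, same controls
def regenerate_bars(symbolic_clip, start_bar, end_bar): ...       # redo the last N bars
def change_mood(symbolic_clip, new_mood, keep_rhythm=True): ...   # new mood, rough rhythm kept
```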

3.2 Data and licensing

  1. Which single-instrument corpora are realistic candidates for training?
    Focus on corpora that are:
    • Clearly licensed for commercial use, or
    • Clearly separated as research-only for prototyping (with a plan to replace later).

  2. For each candidate dataset (see the manifest sketch after this list):
    • What is the size (number of pieces, hours, bars)?
    • What metadata exists (composer, style, tempo, key, time signature, annotations)?
    • How well does it match the desired style(s)?

  3. How will you handle the licensing stance?
    • Separate “fast research” models trained on permissive or NC datasets from “production-eligible” models trained only on commercially clean data.
    • Document a path from the former to the latter.
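
A minimal sketch, assuming a hand-maintained manifest, of how each candidate dataset could be recorded so that the research-only vs. production-eligible split stays auditable. The dataset name and all field values below are placeholders, not real corpora or license claims.

```python
# Hypothetical per-dataset manifest entry; every value here is a placeholder.
dataset_entry = {
    "name": "example_solo_piano_corpus",      # placeholder, not a real dataset name
    "license": "CC-BY-4.0",                   # record the exact license text/URL in practice
    "commercial_ok": True,                    # gate for "production-eligible" training runs
    "research_only": False,
    "size": {"pieces": 1200, "hours": 85.0},
    "metadata_fields": ["composer", "style", "tempo", "key", "time_signature"],
    "style_match_notes": "mostly solo piano; plausible fit for 'cinematic piano'",
    "replacement_plan": None,                 # required if research_only is True
}
```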

3.3 Symbolic representation

  1. What is the symbolic representation for the single instrument? (A tokenization sketch follows after this list.)
    • Token-based (NOTE_ON, NOTE_OFF, TIME_SHIFT, VELOCITY, etc.) vs. quantized piano roll vs. continuous event streams.
    • How are bars, beats, and tempo represented?

  2. How will you represent:
    • Meter and tempo changes?
    • Phrases and sections (intro, A, B, outro), even for short pieces?
    • Dynamics (pp–ff), pedaling (for piano), or strumming patterns (for guitar)?

  3. How do you encode the text prompt into a conditioning representation?
    • Option A: use an off-the-shelf text encoder and map its output to control tokens.
    • Option B: manually designed control channels (e.g. mood id, density level, tempo, complexity).

  4. How will symbolic structure be encoded?
    • Do you need a separate “structure planner” (bars, sections) before note-level generation, or is one flat sequence enough for short clips?
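
To ground the token-based option, here is a minimal bar/position-style tokenization sketch, loosely in the spirit of REMI-like event vocabularies. The grid resolution, velocity binning, and token names are assumptions, not a fixed spec.

```python
# Minimal sketch of a bar/beat-aware token vocabulary; quantization choices are assumptions.
def tokenize_notes(notes, steps_per_bar=16, velocity_bins=8):
    """notes: list of (bar, step, pitch, duration_steps, velocity 1-127) tuples."""
    tokens = []
    current_bar = -1
    for bar, step, pitch, dur, vel in sorted(notes):
        if bar != current_bar:
            tokens.append("BAR")
            current_bar = bar
        tokens.append(f"POS_{step}")                        # position within the bar
        tokens.append(f"PITCH_{pitch}")                     # MIDI pitch 0-127
        tokens.append(f"DUR_{min(dur, steps_per_bar)}")     # clipped duration in grid steps
        tokens.append(f"VEL_{min(vel * velocity_bins // 128, velocity_bins - 1)}")
    return tokens


# Example: one bar with two notes.
print(tokenize_notes([(0, 0, 60, 4, 80), (0, 8, 64, 8, 60)]))
# ['BAR', 'POS_0', 'PITCH_60', 'DUR_4', 'VEL_5', 'POS_8', 'PITCH_64', 'DUR_8', 'VEL_3']
```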

3.4 Model architectures (symbolic core)

  1. Propose at least one minimal model architecture for the symbolic generator (a code sketch follows after this list):

    • Example template: a Transformer decoder that predicts token sequences conditioned on:
      • The encoded prompt,
      • Global control tokens (tempo, style),
      • Optional previous bars (for continuation).
  2. If you propose a second, more advanced variant:

    • What additional structure does it exploit? (e.g. bar-level or phrase-level models, hierarchical transformers.)
    • What extra complexity does it add, and why is it worth it?
  3. How will you handle:

    • Fixed-length vs variable-length outputs?
    • Control over density (notes per bar) and register (range of pitches)?
    • For guitar: voicings and playability constraints.
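
A minimal sketch of the “implementable now” option, assuming PyTorch: a decoder-only Transformer where control tokens (tempo/style/density buckets) are prepended as a prefix. Model sizes, the control vocabulary, and the prefix scheme are placeholder assumptions.

```python
import torch
import torch.nn as nn


class ControlledMusicLM(nn.Module):
    def __init__(self, vocab_size, n_controls, d_model=512, n_layers=6, n_heads=8, max_len=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.control_emb = nn.Embedding(n_controls, d_model)      # tempo/style/density buckets
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)    # run causally via the mask below
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, control_ids, token_ids):
        # Prepend control embeddings as a prefix, then apply a causal (autoregressive) mask.
        ctrl = self.control_emb(control_ids)                      # (B, C, D)
        toks = self.token_emb(token_ids)                          # (B, T, D)
        x = torch.cat([ctrl, toks], dim=1)
        x = x + self.pos_emb(torch.arange(x.size(1), device=x.device))
        seq_len = x.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf"), device=x.device),
                            diagonal=1)
        h = self.backbone(x, mask=causal)
        return self.head(h[:, control_ids.size(1):])              # logits over music tokens only
```

Trained with teacher forcing, this same prefix slot can later carry encoded free-text prompts instead of bucketed controls without changing the backbone.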

3.5 Performance rendering

  1. How do you turn a “clean” symbolic sequence into a human-like performance? (A rule-based sketch follows after this list.)

    • Humanization of timing and velocity.
    • Articulation and pedal (piano) or strum direction, picking patterns (guitar).
  2. Is performance modeling:

    • Learned (e.g. model that maps clean score to expressive performance)?
    • Rule-based (heuristics + noise)?
    • Hybrid?
  3. How do you ensure the performance layer does not destroy structural alignment (bars, beats) needed for looping or concatenation?
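
A minimal sketch of the rule-based end of the spectrum: Gaussian jitter on timing and velocity that keeps the quantized values alongside the humanized ones, so bar/beat alignment survives for looping and concatenation. The jitter magnitudes are assumptions to tune by ear.

```python
import random


def humanize(notes, timing_jitter_beats=0.02, velocity_jitter=6, seed=None):
    """notes: list of dicts with 'start_beats', 'duration_beats', 'velocity' (1-127).
    Returns new notes with 'quantized_start_beats' preserved for bar/beat alignment."""
    rng = random.Random(seed)
    out = []
    for n in notes:
        h = dict(n)
        h["quantized_start_beats"] = n["start_beats"]   # keep structure for looping/concatenation
        h["start_beats"] = max(0.0, n["start_beats"] + rng.gauss(0.0, timing_jitter_beats))
        h["velocity"] = int(min(127, max(1, n["velocity"] + rng.gauss(0.0, velocity_jitter))))
        out.append(h)
    return out
```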

3.6 Instrument render

  1. What rendering strategy will you use?

    • High-quality sample library?
    • Synth/physical model?
    • Lightweight neural renderer?
  2. What are the trade-offs between:

    • Sound quality,
    • Latency,
    • Resource usage (CPU/GPU),
    • Ease of deployment?
  3. What technical interface do you define between the performance layer and the renderer? (See the MIDI-based sketch after this list.)

    • MIDI?
    • Custom event API?
    • Control signals (e.g. continuous controllers)?
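
One plausible answer for the piano case, shown as a sketch: standard MIDI as the interface (written here with pretty_midi) and an offline SoundFont render via the FluidSynth command line. The soundfont path is a placeholder, and the exact flags should be verified against the installed FluidSynth version.

```python
import subprocess

import pretty_midi


def write_midi(notes, out_path="clip.mid", program=0):
    """notes: list of (pitch, start_sec, end_sec, velocity); program 0 = acoustic grand piano."""
    pm = pretty_midi.PrettyMIDI()
    inst = pretty_midi.Instrument(program=program)
    for pitch, start, end, vel in notes:
        inst.notes.append(pretty_midi.Note(velocity=vel, pitch=pitch, start=start, end=end))
    pm.instruments.append(inst)
    pm.write(out_path)
    return out_path


def render_with_fluidsynth(midi_path, soundfont="piano.sf2", wav_path="clip.wav"):
    # Offline render, no MIDI input: fluidsynth -ni <soundfont> <midi> -F <wav> -r 44100
    subprocess.run(["fluidsynth", "-ni", soundfont, midi_path, "-F", wav_path, "-r", "44100"],
                   check=True)
    return wav_path
```

Because the interface is plain MIDI, the FluidSynth step can later be swapped for a higher-quality sample library or a lightweight neural renderer without touching the upstream stages.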

3.7 Training & inference workflows

  1. How will you train the symbolic generator?

    • Teacher forcing on token sequences?
    • Additional losses for style or density control?
    • How will you incorporate prompt-like conditioning if you do not have real prompts?
  2. How do you bootstrap prompt → control mappings? (A sketch, together with a caching example, follows after this list.)

    • Synthetic prompts derived from metadata?
    • Manually labeled subsets?
    • Clustering of pieces into style buckets and assigning tokens?
  3. How will inference work end-to-end?

    • Step-by-step: from text prompt to an audio file.
    • Where do you cache intermediate results (e.g. symbolic sequences) for fast regeneration?
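
Two small sketches for this subsection: synthetic prompts templated from corpus metadata, and a cache keyed on the control set so that re-rendering audio does not re-run the symbolic model. Template wording, the cache policy, and the function arguments are assumptions.

```python
import hashlib
import json
import random


def synthetic_prompt(meta, rng=random):
    """meta: dict with keys like 'mood', 'tempo_bpm', 'meter' derived from corpus metadata."""
    templates = [
        "{mood} solo piano in {meter} around {tempo_bpm} BPM",
        "a {mood} piano piece, {meter}, about {tempo_bpm} BPM",
    ]
    return rng.choice(templates).format(**meta)


_symbolic_cache = {}   # hash of controls -> symbolic sequence, so edits can reuse the score


def generate_clip(controls, generate_symbolic, render_audio, force_new_score=False):
    """controls: JSON-serializable dict of v0 controls."""
    key = hashlib.sha256(json.dumps(controls, sort_keys=True).encode()).hexdigest()
    if force_new_score or key not in _symbolic_cache:
        _symbolic_cache[key] = generate_symbolic(controls)    # slow step: symbolic model
    return render_audio(_symbolic_cache[key])                 # faster step: instrument render
```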

3.8 Evaluation

  1. What automatic metrics can you use? (A sketch of two symbolic metrics follows after this list.)

    • Symbolic-level:
      • Pitch and rhythm distributions,
      • Note density per bar,
      • Key/mode detection,
      • Repetition and motif statistics.
    • Audio-level:
      • Loudness normalization,
      • Spectral properties and basic perceptual proxies.
  2. What human evaluation protocol will you define?

    • Listening tests comparing:
      • Generated vs. real pieces from the target corpus.
      • Generated pieces under different prompt conditions.
    • Rating dimensions:
      • Musicality,
      • Style match to prompt,
      • Plausibility as a real human performance,
      • Prompt controllability.
  3. What is a realistic baseline?

    • For example, a simple rule-based generator or a naive Markov/LM baseline to beat.
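
As referenced above, a sketch of two of the cheaper symbolic metrics: note density per bar and a pitch-class histogram (a rough input to key/mode checks). The 4-beats-per-bar default is illustrative only.

```python
from collections import Counter


def notes_per_bar(notes, beats_per_bar=4):
    """notes: list of dicts with 'start_beats' and 'pitch'; returns note count for each bar."""
    counts = Counter(int(n["start_beats"] // beats_per_bar) for n in notes)
    n_bars = max(counts) + 1 if counts else 0
    return [counts.get(b, 0) for b in range(n_bars)]


def pitch_class_histogram(notes):
    """Normalized distribution over the 12 pitch classes."""
    counts = Counter(n["pitch"] % 12 for n in notes)
    total = sum(counts.values()) or 1
    return [counts.get(pc, 0) / total for pc in range(12)]
```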

4. Scope and constraints

  • Single instrument (choose one: piano or guitar) for the core of this research.
  • Short clips only: e.g. 15–60 seconds, no full songs.
  • No multi-track arrangements yet.
  • No vocals in this phase.
  • UX oriented to:
    • Simple text prompts,
    • A small number of sliders/toggles (length, tempo, density, mood).
  • Licensing stance:
    • Clean separation between research-only resources and anything intended for production.
    • The architecture itself must not rely on inherently non-commercial-only components.

5. Artifacts and deliverables

The research should produce:

  1. A written system design for the single-instrument Architecture A prototype:
    • Module diagram and data flow.
    • Interface definitions between modules (including schema snippets or pseudo-APIs).

  2. A concrete data plan:
    • Enumerated datasets with licensing notes.
    • Preprocessing pipeline design.

  3. Model architecture specs:
    • Symbolic generator (at least one minimal, one optional advanced).
    • Performance renderer design (rules, learned, or hybrid).
    • Instrument render integration.

  4. Training and inference recipes:
    • Data splits, tokenization, training objectives.
    • Inference flow, caching strategy, latency targets.

  5. Evaluation plan:
    • Automatic metrics and their limitations.
    • Human listening test protocol.

  6. Risk register and experiment plan:
    • List of key assumptions.
    • Minimal experiments to validate or kill each assumption.
    • Prioritized list of experiments for a first implementation cycle.

6. Process guidance

  1. Start by fixing:
    • Target instrument,
    • Target styles,
    • Clip length,
    • Latency expectations.

  2. Map prompts to control parameters:
    • Decide which aspects of the music are prompt-controlled in v0.
    • Specify a schema for those controls.

  3. Design symbolic representation and tokenization:
    • Ensure it can express all controls you want.

  4. Propose and compare model variants:
    • At least one that is implementable with modest resources now.

  5. Design the performance and render stages to be:
    • Pluggable (swappable instrument backends),
    • Fast enough for interactive preview.

  6. Keep a running list of risks and questions:
    • Periodically re-rank them as you learn more.

7. Non-goals

This research does not need to:

  • Solve multi-instrument arrangements.
  • Handle advanced editing like per-note piano roll editing in the UI.
  • Implement full mixing/mastering chains.
  • Address advanced audio source separation or reference-audio extraction beyond simple future hooks.

8. Success criteria

We consider this research successful if it yields:

  • One or two clear, buildable designs for a single-instrument Architecture A prototype.
  • A realistic data and training plan that respects licensing constraints.
  • A set of minimal experiments that can be executed to validate feasibility within weeks, not months.
  • A clear explanation of what we get from this prototype in terms of:
    • Demonstrated capabilities,
    • Known limitations,
    • Next steps towards multi-track and vocals.

Only after this synthesis and experiment plan should we commit significant engineering time to building the prototype.