Research prompt
Title: Humanization and performance modeling for a single instrument
1. Context and assumptions
We assume:
- Architecture A provides a clean, quantized symbolic output for a single target instrument (e.g. piano or guitar): correct notes, approximate durations, basic dynamics, and bar/beat structure.
- A separate instrument renderer (sampler / synth / lightweight neural) converts symbolic performance to audio.
- Current naive playback (direct quantized output → renderer) sounds robotic, mechanical, or “MIDI-ish”.
This research focuses on the performance layer between “clean symbolic” and “renderable performance”:
- For one instrument (same as Architecture A’s target, e.g. solo piano or strummed acoustic guitar).
- Adding timing deviations, dynamic shaping, articulation (and instrument-specific nuances) to make output feel human and musical.
- Maintaining structural alignment (bars/beats) so the UX can still support looping, continuation, and scoped edits.
Treat this as an independent module that can be plugged into Architecture A.
2. Objectives
- Define a performance representation
  - Decide how to represent expressive timing, dynamics, articulation, and other performance controls for the chosen instrument.
- Map from “clean score” to expressive performance
  - Design one or more approaches (rule-based, learned, hybrid) that accept clean symbolic sequences and output performance-augmented sequences.
- Specify instrument-specific expressive controls
  - For example:
    - Piano: micro-timing, velocity curves, pedal, voicing emphasis.
    - Guitar: strum patterns, picking direction, chord voicings, string noise.
- Design a minimal, implementable humanization baseline
  - A rule-based or lightweight model that can be implemented quickly and evaluated against a quantized baseline.
- Define evaluation and success criteria
  - Human and automatic methods to measure the improvement over robotic playback without breaking alignment.
- Identify the riskiest assumptions and minimal experiments
  - Especially around:
    - Complexity needed for noticeable quality gains.
    - Data requirements for learned models (if any).
    - Maintaining bar/beat integrity.
3. Questions to answer
3.1 Product and UX framing
- What does “human enough” mean for this instrument and prototype?
  - Is the goal:
    - Subtle realism for background listening?
    - Strong stylization (e.g. “jazzy swing”, “lo-fi wonkiness”)?
  - For v0, which of these is essential, and which can be postponed?
- Where in the UX does performance modeling show up? (See the settings sketch at the end of this section.)
  - Global setting per clip (e.g. “humanization level” slider)?
  - Style presets (e.g. “straight”, “swing”, “rubato”, “tight” vs “loose”)?
  - Per-section overrides (intro more rubato, main section tighter)?
- What constraints must performance modeling respect so that:
  - Clips can still be looped at bar boundaries?
  - Regenerating a segment doesn’t cause jarring timing mismatches with neighboring segments?
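To make the UX questions above concrete, here is a minimal sketch of how clip-level controls and per-section overrides could be carried as one settings object. All names and defaults are illustrative assumptions, not decided parameters.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical UX-facing settings carried from the UI down to the performance module.
@dataclass
class HumanizationSettings:
    amount: float = 0.5                 # global "humanization level" slider: 0.0 (quantized) .. 1.0 (loose)
    style_preset: str = "straight"      # e.g. "straight", "swing", "rubato", "tight", "loose"
    seed: Optional[int] = None          # fixed seed -> reproducible "feel"; None -> fresh randomness
    # Per-section overrides keyed by section id, e.g. {"intro": {"style_preset": "rubato"}}
    section_overrides: Dict[str, dict] = field(default_factory=dict)
```

A “tight vs loose” preset could then simply remap `amount` and `style_preset` before they reach the performance module, rather than exposing low-level rule parameters in the UI.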
3.2 Performance representation
- How do we represent performance deviations?
  - For timing:
    - Per-note onset offsets relative to the grid?
    - Groove templates per bar/beat?
    - Global tempo curves (rubato, accelerando, ritardando)?
  - For dynamics:
    - Per-note velocity modifications?
    - Phrase-level envelopes (crescendo/decrescendo)?
  - For articulation:
    - Note length overrides (staccato, legato).
    - For piano: pedal on/off, half-pedal abstractions.
    - For guitar: strum directions, strum speed, palm mute, slides (at least in a simplified form).
- What is the internal data structure? (See the overlay sketch at the end of this section.)
  - Do we modify the existing symbolic representation (add attributes to tokens)?
  - Or build a separate performance “overlay” on top of the clean score?
- How do we ensure the representation is:
  - Expressive enough to capture key human nuances?
  - Simple enough to implement and inspect?
  - Stable under small edits (e.g. adding one note doesn’t scramble performance for the whole bar)?
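As one way to ground the “overlay” option, here is a minimal sketch of a per-note performance overlay kept separate from the clean score. Field names and units are assumptions; keying by stable note ids is one possible answer to the “stable under small edits” question.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class NotePerformance:
    onset_offset_beats: float = 0.0      # signed deviation from the quantized grid, in beats
    velocity_delta: int = 0              # added to the score velocity (clamped to 1..127 at render time)
    duration_scale: float = 1.0          # <1.0 leans staccato, >1.0 leans legato
    articulation: Optional[str] = None   # e.g. "staccato", "legato", "palm_mute" (instrument-specific)

@dataclass
class PerformanceOverlay:
    # Keyed by a stable note id from the clean score, so editing one bar
    # only invalidates the overlay entries for the notes that changed.
    notes: Dict[str, NotePerformance] = field(default_factory=dict)
    # Optional instrument-level controls, e.g. piano pedal events as (beat, is_down) pairs.
    pedal_events: List[Tuple[float, bool]] = field(default_factory=list)
```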
3.3 Approaches: rule-based vs learned vs hybrid
- Rule-based baselines:
  - What simple heuristics can immediately improve realism? (See the sketch at the end of this section.)
    - Slight random timing jitter within a controlled range.
    - Velocity shaping according to:
      - Position in bar (strong vs weak beats),
      - Phrase direction (melodic contour),
      - Dynamic markings implied by the prompt (“soft”, “intense”, etc.).
    - Simple pedal rules (for piano) or strum patterns (for guitar).
- Learned models:
  - If using a model, what is the input and output?
    - Inputs:
      - Clean score (notes, durations, bar/beat positions).
      - Optional style controls (e.g. “swing”, “rubato”, density).
      - Optional prompt encodings.
    - Outputs:
      - Timing deviations, velocities, articulations for each note/event.
  - Which architectures are plausible (e.g. small transformer, RNN, or feedforward over local windows)?
  - How large does such a model need to be, given the scope?
- Hybrid strategies:
  - Use rules for coarse structure (groove, phrase dynamics) and a small learned model for fine-grained micro-timing.
  - Or vice versa.
- Which approach is the minimal viable path for the prototype?
  - What could be built in days/weeks to yield a noticeable improvement?
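A minimal sketch of the rule-based baseline discussed above: random timing jitter within a controlled range plus velocity accents by beat position. All thresholds and ranges are placeholder assumptions to be tuned by listening, and the output shapes follow the hypothetical overlay fields sketched in 3.2.

```python
import random
from typing import Dict, List, Tuple

def humanize_notes(
    notes: List[dict],                 # each note: {"id": str, "onset_beats": float, "velocity": int}
    beats_per_bar: int = 4,
    max_jitter_beats: float = 0.03,    # controlled random jitter (about +/-15 ms at 120 BPM)
    downbeat_boost: int = 10,          # velocity accent on beat 1
    strong_beat_boost: int = 5,        # smaller accent on the secondary strong beat (beat 3 in 4/4)
    seed: int = 0,
) -> Dict[str, Tuple[float, int]]:
    """Return {note_id: (onset_offset_beats, velocity_delta)} for each note."""
    rng = random.Random(seed)          # fixed seed -> reproducible "feel"
    result: Dict[str, Tuple[float, int]] = {}
    for note in notes:
        offset = rng.uniform(-max_jitter_beats, max_jitter_beats)
        beat_in_bar = note["onset_beats"] % beats_per_bar
        on_the_beat = abs(beat_in_bar - round(beat_in_bar)) < 1e-6
        if on_the_beat and round(beat_in_bar) % beats_per_bar == 0:
            velocity_delta = downbeat_boost
        elif on_the_beat and round(beat_in_bar) == beats_per_bar // 2:
            velocity_delta = strong_beat_boost
        elif on_the_beat:
            velocity_delta = 0
        else:
            velocity_delta = -3        # soften off-beat notes slightly
        result[note["id"]] = (offset, velocity_delta)
    return result
```

Because it only emits bounded per-note deviations, a baseline like this cannot move notes across bar boundaries, which keeps looping and scoped regeneration intact.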
3.4 Data and training (if using learned models)
- What data is needed to train a performance model?
  - Pairs of (clean score, human performance) for the target instrument.
  - How can we derive a “clean score” from performance data for training (e.g. grid-quantized version as input, real performance as target)?
- How do we extract performance parameters from human performances? (See the extraction sketch at the end of this section.)
  - Align the performance to a quantized grid.
  - Compute offsets, velocities, articulations.
  - For piano: pedal automation.
  - For guitar: approximate strum patterns and micro-timing.
- How much data is realistically available, and with what licensing constraints?
  - Can we start with a small curated set just for evaluation and prototyping?
  - How do we separate any research-only datasets from production plans?
- If data is limited:
  - Are there simple augmentation strategies (tempo changes, transposition, time-stretching, etc.) that preserve performance nuances?
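To ground the extraction step, here is a minimal sketch that grid-quantizes performed onsets and records the resulting deviations as training targets. The grid resolution and field names are assumptions; real data would also need tempo/beat alignment before this step.

```python
from typing import Dict, List

def extract_performance_params(
    performed_notes: List[dict],        # each note: {"id": str, "onset_beats": float, "velocity": int}
    grid_divisions_per_beat: int = 4,   # e.g. a 16th-note grid in 4/4
) -> Dict[str, dict]:
    """Quantize each performed onset to the grid and record its deviation.

    The quantized version doubles as the "clean score" input; the recorded
    offsets and velocities become the training targets for a learned model.
    """
    step = 1.0 / grid_divisions_per_beat
    params: Dict[str, dict] = {}
    for note in performed_notes:
        quantized_onset = round(note["onset_beats"] / step) * step
        params[note["id"]] = {
            "quantized_onset_beats": quantized_onset,
            "onset_offset_beats": note["onset_beats"] - quantized_onset,
            "velocity": note["velocity"],
        }
    return params
```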
3.5 Integration with Architecture A
- Where exactly does the performance module sit in the pipeline?
  - Example:
    - Text prompt → Architecture A symbolic generator → clean score.
    - Clean score → performance module → expressive performance.
    - Performance → instrument renderer → audio.
- What is the API between (see the interface sketch at the end of this section):
  - Symbolic generator and performance module?
  - Performance module and renderer?
- How do we handle:
  - Partial regeneration (e.g. re-humanize only bars 5–8)?
  - Determinism vs randomness (e.g. a random seed to get different “feels” from the same score)?
  - Caching: reuse performance for unchanged segments?
- How do we propagate UX controls down to the performance module?
  - E.g. “humanization amount” and “tight vs loose” as parameters that modulate rule strengths or model outputs.
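To make the API questions concrete, here is a sketch of the two boundaries as Python protocols, threading bar-scoped regeneration, the seed-carrying settings object from 3.1, and the overlay from 3.2 through the calls. All type names and signatures are hypothetical, not an agreed interface.

```python
from typing import Optional, Protocol, Tuple

class PerformanceModule(Protocol):
    def humanize(
        self,
        clean_score: "Score",                          # quantized symbolic output from Architecture A
        settings: "HumanizationSettings",              # UX controls: amount, preset, seed, overrides
        bar_range: Optional[Tuple[int, int]] = None,   # e.g. (5, 8) to re-humanize only those bars
    ) -> "PerformanceOverlay":
        """Return a performance overlay; untouched bars can be served from a
        cache keyed by (bar content, settings, seed)."""
        ...

class Renderer(Protocol):
    def render(self, clean_score: "Score", overlay: "PerformanceOverlay") -> bytes:
        """Return audio (e.g. WAV bytes) for the score with the overlay applied."""
        ...
```

Keeping the overlay separate from the clean score in this API is what allows partial regeneration and caching: only the overlay entries for the regenerated bars change, while the score and the rest of the overlay stay stable.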
3.6 Evaluation
- What automatic metrics can track improvement over quantized playback? (See the metrics sketch at the end of this section.)
  - Symbolic-level:
    - Distributions of timing offsets, velocities, and note lengths compared to human performances.
    - Groove statistics (e.g. a consistent swing ratio).
  - Audio-level:
    - Simple loudness and dynamic-range analyses.
    - Stability of bar-level tempo.
- What human listening tests should be run?
  - At minimum, A/B tests:
    - Quantized vs humanized on the same generated scores.
    - Humanized vs real human performances on similar material, if available.
  - Rating axes:
    - Naturalness / human-likeness.
    - Musicality / expressiveness.
    - Tightness / sloppiness (do users perceive it as too sloppy?).
- What is the minimal bar for “worth shipping into a prototype”?
  - For example: a majority of listeners rate humanized versions as more natural than the quantized baseline, without a significant drop in perceived tightness.
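As an example of the symbolic-level metrics above, here is a minimal sketch that summarizes the timing-offset distribution and estimates a swing ratio from eighth-note pairs. The pairing heuristic (two onsets inside the same beat) is a simplifying assumption.

```python
import statistics
from typing import Dict, List

def timing_offset_stats(offsets_beats: List[float]) -> Dict[str, float]:
    """Summarize per-note deviations from the grid (in beats). Assumes a non-empty list."""
    return {
        "mean": statistics.mean(offsets_beats),
        "stdev": statistics.pstdev(offsets_beats),
        "max_abs": max(abs(o) for o in offsets_beats),
    }

def swing_ratio(onsets_beats: List[float]) -> float:
    """Estimate swing as the mean long/short ratio of eighth-note pairs.

    Assumes onsets are sorted and that two consecutive onsets within one beat
    form an eighth-note pair; a ratio near 1.0 means "straight", around 2.0
    means triplet swing.
    """
    ratios = []
    for a, b in zip(onsets_beats, onsets_beats[1:]):
        beat = int(a)
        if int(b) == beat and b > a:       # both onsets fall inside the same beat
            first = b - a                  # first half of the beat
            second = (beat + 1) - b        # remainder of the beat
            if second > 0:
                ratios.append(first / second)
    return statistics.mean(ratios) if ratios else 1.0
```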
4. Scope and constraints
- Single instrument (same as Architecture A’s initial target).
- No multi-instrument interplay (no cross-instrument timing dependencies).
- Keep the performance module:
  - Lightweight enough to run on typical inference hardware.
  - Stable enough to support scoped regeneration and looping.
- Clear separation between:
  - Experimenting with any available performance datasets.
  - The long-term plan for production-eligible training data.
5. Artifacts and deliverables
The research should produce:
- Performance representation spec
  - Data structures (or token extensions) for timing, dynamics, articulation, and instrument-specific features.
- Humanization baseline design
  - A rule-based approach documented in enough detail for direct implementation.
- Optional learned model spec
  - Architecture, input/output formats, training objective, and expected resource needs.
- Integration design
  - Clear API boundaries between the symbolic generator, performance module, and renderer.
  - Handling of UX parameters and partial regeneration.
- Evaluation plan
  - Automatic metrics.
  - Human listening test protocol and criteria.
- Risk and experiment plan
  - List of key assumptions (e.g. “simple rules are enough for noticeable gains”).
  - Minimal experiments to test each assumption, prioritized for early implementation.
6. Process guidance
- Fix the target instrument and usage goals for humanization (subtle vs stylized).
- Define the performance representation and controls first.
- Design a rule-based humanization baseline.
- Only then decide if a learned model is necessary and justified.
- Prototype on a small set of scores and run quick listening tests.
- Iterate on representation and rules based on findings.
7. Non-goals
This research does not need to:
- Generate new notes or musical structure (that is Architecture A’s job).
- Handle multi-instrument ensembles or cross-instrument expressive alignment.
- Solve full mixing/mastering or effects.
- Implement complex audio-level expressivity beyond what can be controlled via symbolic performance.