Research prompt
Title: Humanization and performance modeling for a single instrument
1. Context and assumptions
We assume:
- Architecture A provides a clean, quantized symbolic output for a single target instrument (e.g. piano or guitar): correct notes, approximate durations, basic dynamics, and bar/beat structure.
- A separate instrument renderer (sampler / synth / lightweight neural) converts symbolic performance to audio.
- Current naive playback (direct quantized output → renderer) sounds robotic, mechanical, or “MIDI-ish”.
This research focuses on the performance layer between “clean symbolic” and “renderable performance”:
- For one instrument (same as Architecture A’s target, e.g. solo piano or strummed acoustic guitar).
- Adding timing deviations, dynamic shaping, articulation (and instrument-specific nuances) to make output feel human and musical.
- Maintaining structural alignment (bars/beats) so the UX can still support looping, continuation, and scoped edits.
Treat this as an independent module that can be plugged into Architecture A.
2. Objectives
- Define a performance representation
  - Decide how to represent expressive timing, dynamics, articulation, and other performance controls for the chosen instrument.
- Map from “clean score” to expressive performance
  - Design one or more approaches (rule-based, learned, hybrid) that accept clean symbolic sequences and output performance-augmented sequences.
- Specify instrument-specific expressive controls
  - For example:
    - Piano: micro-timing, velocity curves, pedal, voicing emphasis.
    - Guitar: strum patterns, picking direction, chord voicings, string noise.
- Design a minimal, implementable humanization baseline
  - A rule-based or lightweight model that can be implemented quickly and evaluated against a quantized baseline.
- Define evaluation and success criteria
  - Human and automatic methods to measure the improvement over robotic playback without breaking alignment.
- Identify the riskiest assumptions and minimal experiments
  - Especially around:
    - Complexity needed for noticeable quality gains.
    - Data requirements for learned models (if any).
    - Maintaining bar/beat integrity.
3. Questions to answer
3.1 Product and UX framing
- What does “human enough” mean for this instrument and prototype?
  - Is the goal:
    - Subtle realism for background listening?
    - Strong stylization (e.g. “jazzy swing”, “lo-fi wonkiness”)?
  - For v0, which of these is essential, and which can be postponed?
- Where in the UX does performance modeling show up? (See the settings sketch at the end of this section.)
  - Global setting per clip (e.g. “humanization level” slider)?
  - Style presets (e.g. “straight”, “swing”, “rubato”, “tight” vs “loose”)?
  - Per-section overrides (intro more rubato, main section tighter)?
- What constraints must performance modeling respect so that:
  - Clips can still be looped at bar boundaries?
  - Regenerating a segment doesn’t cause jarring timing mismatches with neighboring segments?
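To make the UX questions above concrete, here is a minimal sketch of how clip-level controls and per-section overrides could be carried as one settings object. All names and defaults are illustrative assumptions, not decided parameters.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical UX-facing settings carried from the UI down to the performance module.
@dataclass
class HumanizationSettings:
    amount: float = 0.5                 # global "humanization level" slider: 0.0 (quantized) .. 1.0 (loose)
    style_preset: str = "straight"      # e.g. "straight", "swing", "rubato", "tight", "loose"
    seed: Optional[int] = None          # fixed seed -> reproducible "feel"; None -> fresh randomness
    # Per-section overrides keyed by section id, e.g. {"intro": {"style_preset": "rubato"}}
    section_overrides: Dict[str, dict] = field(default_factory=dict)
```

A “tight vs loose” preset could then simply remap `amount` and `style_preset` before they reach the performance module, rather than exposing low-level rule parameters in the UI.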
3.2 Performance representation
- How do we represent performance deviations?
  - For timing:
    - Per-note onset offsets relative to the grid?
    - Groove templates per bar/beat?
    - Global tempo curves (rubato, accelerando, ritardando)?
  - For dynamics:
    - Per-note velocity modifications?
    - Phrase-level envelopes (crescendo/decrescendo)?
  - For articulation:
    - Note length overrides (staccato, legato).
    - For piano: pedal on/off, half-pedal abstractions.
    - For guitar: strum directions, strum speed, palm mute, slides (at least in a simplified form).
- What is the internal data structure? (See the overlay sketch at the end of this section.)
  - Do we modify the existing symbolic representation (add attributes to tokens)?
  - Or build a separate performance “overlay” on top of the clean score?
- How do we ensure the representation is:
  - Expressive enough to capture key human nuances?
  - Simple enough to implement and inspect?
  - Stable under small edits (e.g. adding one note doesn’t scramble performance for the whole bar)?
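As one way to ground the “overlay” option, here is a minimal sketch of a per-note performance overlay kept separate from the clean score. Field names and units are assumptions; keying by stable note ids is one possible answer to the “stable under small edits” question.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class NotePerformance:
    onset_offset_beats: float = 0.0      # signed deviation from the quantized grid, in beats
    velocity_delta: int = 0              # added to the score velocity (clamped to 1..127 at render time)
    duration_scale: float = 1.0          # <1.0 leans staccato, >1.0 leans legato
    articulation: Optional[str] = None   # e.g. "staccato", "legato", "palm_mute" (instrument-specific)

@dataclass
class PerformanceOverlay:
    # Keyed by a stable note id from the clean score, so editing one bar
    # only invalidates the overlay entries for the notes that changed.
    notes: Dict[str, NotePerformance] = field(default_factory=dict)
    # Optional instrument-level controls, e.g. piano pedal events as (beat, is_down) pairs.
    pedal_events: List[Tuple[float, bool]] = field(default_factory=list)
```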
3.3 Approaches: rule-based vs learned vs hybrid
- Rule-based baselines:
  - What simple heuristics can immediately improve realism? (See the sketch at the end of this section.)
    - Slight random timing jitter within a controlled range.
    - Velocity shaping according to:
      - Position in bar (strong vs weak beats),
      - Phrase direction (melodic contour),
      - Dynamic markings implied by the prompt (“soft”, “intense”, etc.).
    - Simple pedal rules (for piano) or strum patterns (for guitar).
- Learned models:
  - If using a model, what is the input and output?
    - Inputs:
      - Clean score (notes, durations, bar/beat positions).
      - Optional style controls (e.g. “swing”, “rubato”, density).
      - Optional prompt encodings.
    - Outputs:
      - Timing deviations, velocities, articulations for each note/event.
  - Which architectures are plausible (e.g. small transformer, RNN, or feedforward over local windows)?
  - How large does such a model need to be, given the scope?
- Hybrid strategies:
  - Use rules for coarse structure (groove, phrase dynamics) and a small learned model for fine-grained micro-timing.
  - Or vice versa.
- Which approach is the minimal viable path for the prototype?
  - What could be built in days/weeks to yield a noticeable improvement?
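A minimal sketch of the rule-based baseline discussed above: random timing jitter within a controlled range plus velocity accents by beat position. All thresholds and ranges are placeholder assumptions to be tuned by listening, and the output shapes follow the hypothetical overlay fields sketched in 3.2.

```python
import random
from typing import Dict, List, Tuple

def humanize_notes(
    notes: List[dict],                 # each note: {"id": str, "onset_beats": float, "velocity": int}
    beats_per_bar: int = 4,
    max_jitter_beats: float = 0.03,    # controlled random jitter (about +/-15 ms at 120 BPM)
    downbeat_boost: int = 10,          # velocity accent on beat 1
    strong_beat_boost: int = 5,        # smaller accent on the secondary strong beat (beat 3 in 4/4)
    seed: int = 0,
) -> Dict[str, Tuple[float, int]]:
    """Return {note_id: (onset_offset_beats, velocity_delta)} for each note."""
    rng = random.Random(seed)          # fixed seed -> reproducible "feel"
    result: Dict[str, Tuple[float, int]] = {}
    for note in notes:
        offset = rng.uniform(-max_jitter_beats, max_jitter_beats)
        beat_in_bar = note["onset_beats"] % beats_per_bar
        on_the_beat = abs(beat_in_bar - round(beat_in_bar)) < 1e-6
        if on_the_beat and round(beat_in_bar) % beats_per_bar == 0:
            velocity_delta = downbeat_boost
        elif on_the_beat and round(beat_in_bar) == beats_per_bar // 2:
            velocity_delta = strong_beat_boost
        elif on_the_beat:
            velocity_delta = 0
        else:
            velocity_delta = -3        # soften off-beat notes slightly
        result[note["id"]] = (offset, velocity_delta)
    return result
```

Because it only emits bounded per-note deviations, a baseline like this cannot move notes across bar boundaries, which keeps looping and scoped regeneration intact.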
3.4 Data and training (if using learned models)
- What data is needed to train a performance model?
  - Pairs of (clean score, human performance) for the target instrument.
  - How can we derive a “clean score” from performance data for training (e.g. grid-quantized version as input, real performance as target)?
- How do we extract performance parameters from human performances? (See the extraction sketch at the end of this section.)
  - Align the performance to a quantized grid.
  - Compute offsets, velocities, articulations.
  - For piano: pedal automation.
  - For guitar: approximate strum patterns and micro-timing.
- How much data is realistically available, and with what licensing constraints?
  - Can we start with a small curated set just for evaluation and prototyping?
  - How do we separate any research-only datasets from production plans?
- If data is limited:
  - Are there simple augmentation strategies (tempo changes, transposition, time-stretching, etc.) that preserve performance nuances?
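To ground the extraction step, here is a minimal sketch that grid-quantizes performed onsets and records the resulting deviations as training targets. The grid resolution and field names are assumptions; real data would also need tempo/beat alignment before this step.

```python
from typing import Dict, List

def extract_performance_params(
    performed_notes: List[dict],        # each note: {"id": str, "onset_beats": float, "velocity": int}
    grid_divisions_per_beat: int = 4,   # e.g. a 16th-note grid in 4/4
) -> Dict[str, dict]:
    """Quantize each performed onset to the grid and record its deviation.

    The quantized version doubles as the "clean score" input; the recorded
    offsets and velocities become the training targets for a learned model.
    """
    step = 1.0 / grid_divisions_per_beat
    params: Dict[str, dict] = {}
    for note in performed_notes:
        quantized_onset = round(note["onset_beats"] / step) * step
        params[note["id"]] = {
            "quantized_onset_beats": quantized_onset,
            "onset_offset_beats": note["onset_beats"] - quantized_onset,
            "velocity": note["velocity"],
        }
    return params
```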
3.5 Integration with Architecture A
- Where exactly does the performance module sit in the pipeline?
  - Example:
    - Text prompt → Architecture A symbolic generator → clean score.
    - Clean score → performance module → expressive performance.
    - Performance → instrument renderer → audio.
- What is the API between (see the interface sketch at the end of this section):
  - Symbolic generator and performance module?
  - Performance module and renderer?
- How do we handle:
  - Partial regeneration (e.g. re-humanize only bars 5–8)?
  - Determinism vs randomness (e.g. a random seed to get different “feels” from the same score)?
  - Caching: reuse performance for unchanged segments?
- How do we propagate UX controls down to the performance module?
  - E.g. “humanization amount” and “tight vs loose” as parameters that modulate rule strengths or model outputs.
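To make the API questions concrete, here is a sketch of the two boundaries as Python protocols, threading bar-scoped regeneration, the seed-carrying settings object from 3.1, and the overlay from 3.2 through the calls. All type names and signatures are hypothetical, not an agreed interface.

```python
from typing import Optional, Protocol, Tuple

class PerformanceModule(Protocol):
    def humanize(
        self,
        clean_score: "Score",                          # quantized symbolic output from Architecture A
        settings: "HumanizationSettings",              # UX controls: amount, preset, seed, overrides
        bar_range: Optional[Tuple[int, int]] = None,   # e.g. (5, 8) to re-humanize only those bars
    ) -> "PerformanceOverlay":
        """Return a performance overlay; untouched bars can be served from a
        cache keyed by (bar content, settings, seed)."""
        ...

class Renderer(Protocol):
    def render(self, clean_score: "Score", overlay: "PerformanceOverlay") -> bytes:
        """Return audio (e.g. WAV bytes) for the score with the overlay applied."""
        ...
```

Keeping the overlay separate from the clean score in this API is what allows partial regeneration and caching: only the overlay entries for the regenerated bars change, while the score and the rest of the overlay stay stable.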
3.6 Evaluation
- What automatic metrics can track improvement over quantized playback? (See the metrics sketch at the end of this section.)
  - Symbolic-level:
    - Distributions of timing offsets, velocities, and note lengths compared to human performances.
    - Groove statistics (e.g. a consistent swing ratio).
  - Audio-level:
    - Simple loudness and dynamic-range analyses.
    - Stability of bar-level tempo.
- What human listening tests should be run?
  - At minimum, A/B tests:
    - Quantized vs humanized on the same generated scores.
    - Humanized vs real human performances on similar material, if available.
  - Rating axes:
    - Naturalness / human-likeness.
    - Musicality / expressiveness.
    - Tightness / sloppiness (do users perceive it as too sloppy?).
- What is the minimal bar for “worth shipping into a prototype”?
  - For example: a majority of listeners rate humanized versions as more natural than the quantized baseline, without a significant drop in perceived tightness.
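As an example of the symbolic-level metrics above, here is a minimal sketch that summarizes the timing-offset distribution and estimates a swing ratio from eighth-note pairs. The pairing heuristic (two onsets inside the same beat) is a simplifying assumption.

```python
import statistics
from typing import Dict, List

def timing_offset_stats(offsets_beats: List[float]) -> Dict[str, float]:
    """Summarize per-note deviations from the grid (in beats). Assumes a non-empty list."""
    return {
        "mean": statistics.mean(offsets_beats),
        "stdev": statistics.pstdev(offsets_beats),
        "max_abs": max(abs(o) for o in offsets_beats),
    }

def swing_ratio(onsets_beats: List[float]) -> float:
    """Estimate swing as the mean long/short ratio of eighth-note pairs.

    Assumes onsets are sorted and that two consecutive onsets within one beat
    form an eighth-note pair; a ratio near 1.0 means "straight", around 2.0
    means triplet swing.
    """
    ratios = []
    for a, b in zip(onsets_beats, onsets_beats[1:]):
        beat = int(a)
        if int(b) == beat and b > a:       # both onsets fall inside the same beat
            first = b - a                  # first half of the beat
            second = (beat + 1) - b        # remainder of the beat
            if second > 0:
                ratios.append(first / second)
    return statistics.mean(ratios) if ratios else 1.0
```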
4. Scope and constraints
- Single instrument (same as Architecture A’s initial target).
- No multi-instrument interplay (no cross-instrument timing dependencies).
- Keep the performance module:
  - Lightweight enough to run on typical inference hardware.
  - Stable enough to support scoped regeneration and looping.
- Clear separation between:
  - Experimenting with any available performance datasets.
  - The long-term plan for production-eligible training data.
5. Artifacts and deliverables
The research should produce:
- Performance representation spec
  - Data structures (or token extensions) for timing, dynamics, articulation, and instrument-specific features.
- Humanization baseline design
  - A rule-based approach documented in enough detail for direct implementation.
- Optional learned model spec
  - Architecture, input/output formats, training objective, and expected resource needs.
- Integration design
  - Clear API boundaries between the symbolic generator, performance module, and renderer.
  - Handling of UX parameters and partial regeneration.
- Evaluation plan
  - Automatic metrics.
  - Human listening test protocol and criteria.
- Risk and experiment plan
  - List of key assumptions (e.g. “simple rules are enough for noticeable gains”).
  - Minimal experiments to test each assumption, prioritized for early implementation.
6. Process guidance
- Fix the target instrument and usage goals for humanization (subtle vs stylized).
- Define the performance representation and controls first.
- Design a rule-based humanization baseline.
- Only then decide if a learned model is necessary and justified.
- Prototype on a small set of scores and run quick listening tests.
- Iterate on representation and rules based on findings.
7. Non-goals
This research does not need to:
- Generate new notes or musical structure (that is Architecture A’s job).
- Handle multi-instrument ensembles or cross-instrument expressive alignment.
- Solve full mixing/mastering or effects.
- Implement complex audio-level expressivity beyond what can be controlled via symbolic performance.