
Humanization and Performance Modeling for Solo Piano (Architecture A Pipeline)

Introduction: In a symbolic-to-audio generation pipeline (Architecture A), the performance layer is responsible for converting a “clean” symbolic piano score (quantized notes with ideal timing and default dynamics) into an expressive, human-like performance. This stage adds nuance – timing deviations, dynamics (velocities), articulation variations, and pedal control – while ensuring the music’s structural grid (bars/beats) remains intact. Preserving alignment is critical so that features like looping, sectional regeneration, and user edits remain seamless. We focus exclusively on solo piano, designing the performance module as a pluggable component in Architecture A. Below, we address the key aspects: performance representation, a minimal viable humanization approach, an optional learned model design, integration details, evaluation plan, and a risk register with experimental validation steps.


1. Performance Representation

Expressive Parameters: To capture a piano performance’s expressivity, we extend the symbolic representation with parameters for: (a) Timing, (b) Velocity (Dynamics), (c) Articulation (Note Length), and (d) Pedal usage. In essence, each note in the score may carry overlays specifying how its performance deviates from the quantized score:

  • Timing Deviations: A per-note onset offset (±) relative to the metrical grid indicates playing a note slightly before or after its exact beat position. This allows micro-timing adjustments (for “feel” or rubato) without altering the notated order of events. We can also represent larger-scale tempo variations (e.g. ritardando) as a tempo curve affecting many notes, but for simplicity, per-note or per-beat offsets suffice for local timing “groove.” Each note event can carry an onset_offset in milliseconds or ticks.

  • Velocity Dynamics: Instead of a flat MIDI velocity, each note has an expressive velocity value capturing its played loudness. Dynamics can be encoded directly as the note’s velocity in the performance MIDI. Additionally, for phrase-level shaping, we may overlay velocity envelopes or tags (e.g. crescendo or accent markings) that influence a series of notes. At minimum, storing a velocity value per note is enough to capture dynamic nuance.

  • Articulation (Note Length): We represent how long each note is held relative to its notated length. For piano, articulation often distinguishes staccato (shortened) vs. legato (overlapping) playing. We can attach an articulation factor per note (e.g. a percentage of notated duration for the actual performed duration). In MIDI terms, this means adjusting the note-off time. For example, a staccato might be 50% of the written length, while legato might overlap into the next note. Representing articulation explicitly per note (or via a categorical label like “staccato/tenuto”) allows the performance renderer to shorten or lengthen notes accordingly.

  • Pedal Controls: Piano sustain pedal (and optionally una corda or sostenuto) dramatically affects sound. We include pedal events (on/off and continuous values) in the performance representation timeline. Sustain pedal (CC64 in MIDI) can be represented as a sequence of events (pedal down at time X, up at time Y). For simplicity, we can treat pedal as a separate control track parallel to notes. The representation might mark each note with a pedal state (e.g. whether sustain is down during the note), or simply include raw pedal events in the event list.

Embedded vs. Overlay: We need to decide if these expressive parameters are embedded into the primary symbolic data structure (e.g. extending a note event to include offset/velocity etc.) or kept as a separate overlay.

  • A combined representation (like an extended MIDI event) is straightforward for rendering: e.g. output a MIDI file where each note’s timestamp, velocity, and length are already humanized. This is essentially treating the expressive performance as just another MIDI sequence, but with careful timing.

  • The alternative is to maintain a parallel data structure that links back to the quantized “score” reference. For instance, we might keep the original score events and a separate list of performance adjustments (deltas) that can be applied or toggled.

Overlay Approach: Using a separate overlay has advantages for non-destructive editing – the base score remains intact for structural reference, and humanization can be enabled/disabled or scaled easily. For example, an overlay could say: Note #42: onset_offset = +15ms, velocity = 64 (instead of default 80), length_factor = 0.8, pedal = on. This preserves the knowledge that Note #42 falls on beat 3 of bar 5 even if its actual playback time is slightly shifted. It aids looping: since bar boundaries in the base score remain clean, we can ensure any tempo rubato averages out by each bar’s end (discussed below). The overlay can be stored as a parallel structure keyed by note IDs or positions.

Embedded Approach: Alternatively, we can directly adjust the timing in the note list (essentially converting the score into a performance MIDI in one structure). This is simpler for immediate rendering (just feed to a synth) but harder to preserve alignment information. A practical hybrid is to use a high-resolution time grid (e.g. ticks) where the score is quantized (e.g. to nearest 480 ticks per beat) and the performance just changes the tick values slightly – thus the bar lines (at multiples of 480 × beats) remain reference points.

Chosen Representation: For a minimal design, we will extend the existing symbolic format with additional attributes per note and add pedal events as needed. Concretely, if the symbolic generator outputs a MIDI-like structure (notes with start tick, duration, velocity), the performance module will output a modified MIDI file (or in-memory sequence) with: updated note on/off timings, updated velocities, and inserted control changes for pedal. This approach leverages the well-understood MIDI performance encoding and easily interfaces with existing renderers. Each note event can carry its original quantized position as metadata (for alignment) and its adjusted performance timestamp for playback. This ensures we capture key human nuances (timing, dynamics, articulation, pedal) while keeping the data model relatively simple and inspectable (essentially a MIDI sequence with some annotations). The representation is also stable under small edits: because each performance adjustment is local to a note or bar, adding a note in one bar need not affect performance of other bars (except perhaps slight tempo curvature, which we can constrain within that bar).
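
To make this concrete, here is a minimal Python sketch of the overlay-plus-score representation. The class and field names (ScoreNote, NoteOverlay, PedalEvent, Performance) are illustrative assumptions for this document, not an existing schema:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ScoreNote:
    """A quantized score note; ticks stay on the bar/beat grid."""
    note_id: int
    pitch: int            # MIDI pitch, 0-127
    start_tick: int       # quantized onset (e.g. 480 ticks per beat)
    duration_ticks: int   # notated duration
    velocity: int = 80    # default/score velocity

@dataclass
class NoteOverlay:
    """Per-note performance deviations, keyed back to the score note."""
    note_id: int
    onset_offset_ticks: int = 0      # +/- deviation from the grid
    velocity: Optional[int] = None   # played velocity; None = keep score value
    length_factor: float = 1.0       # performed duration / notated duration

@dataclass
class PedalEvent:
    """Sustain pedal (CC64-style) event on the performance timeline."""
    tick: int
    value: int  # 0 = up, 127 = down

@dataclass
class Performance:
    score: List[ScoreNote]
    overlays: List[NoteOverlay] = field(default_factory=list)
    pedal: List[PedalEvent] = field(default_factory=list)

    def rendered_note(self, note: ScoreNote) -> dict:
        """Apply the overlay (if any) to one score note for playback."""
        ov = next((o for o in self.overlays if o.note_id == note.note_id),
                  NoteOverlay(note.note_id))
        return {
            "pitch": note.pitch,
            "start_tick": note.start_tick + ov.onset_offset_ticks,
            "duration_ticks": round(note.duration_ticks * ov.length_factor),
            "velocity": ov.velocity if ov.velocity is not None else note.velocity,
        }

Because the base score notes are never mutated, the overlay can be toggled, scaled, or regenerated per bar without losing the structural grid.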

Rationale: This design balances expressivity and simplicity. Prior research on expressive performance has identified these parameters (timing deviation, velocity, articulation, pedal) as crucial features. For example, Jeong et al. (2019) align score notes to performance to extract precisely these features as training data. Our representation follows that precedent, ensuring we have the needed degrees of freedom for humanization.


2. Minimal Viable Humanization Approach (Rule-Based Baseline)

As a starting point, we design a rule-based humanization module that applies a set of simple, musically motivated heuristics to the quantized score. The goal is to achieve a perceptible improvement in naturalness with minimal complexity – essentially creating a “good enough” fake human performance (often called “deadpan” to expressive conversion). Key rules for solo piano include:

2.1 Timing Jitter & Groove

Introduce slight randomness and groove patterns in note timing. Rather than every note falling exactly on the grid, we offset some onsets by a few milliseconds. We use a controlled range so that the music doesn’t lose its rhythm or structural tempo. For instance, we might offset off-beat notes by up to ±10 ms and on-beat notes by a smaller amount. This random looseness simulates the imperfection of human timing and avoids robotic, machine-like exactness.

We can also incorporate a beat-strength dependent offset: e.g. in 4/4 time, let beat 1 (strong beat) remain closer to exact, but beat 2 or 4 might be laid back by +10 ms to create a subtle swing feel. If a “swing” or shuffle style is desired, we can systematically delay every second eighth-note by a fixed ratio (e.g. making the first eighth longer than the second) – essentially a groove template. The KTH rule system (Friberg et al.) implemented such timing deviations, even adding 1/f noise to simulate a human’s internal tempo drift.

For our baseline, a small random timing jitter (white noise within ~±5–15 ms) combined with optional style-based patterns (straight vs. swing) yields an immediate realism boost.
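
A minimal sketch of such a jitter-plus-swing pass is shown below. It assumes notes are dicts carrying a quantized start_tick on a 480-tick-per-beat grid and a fixed 90 BPM for the millisecond-to-tick conversion; all parameter values are illustrative starting points, not tuned results:

import random

TICKS_PER_BEAT = 480
MS_PER_TICK = 60_000 / (90 * TICKS_PER_BEAT)  # assumes 90 BPM for ms <-> tick conversion

def ms_to_ticks(ms: float) -> int:
    return round(ms / MS_PER_TICK)

def timing_jitter(notes, rng, jitter_ms=10.0, swing_ratio=0.0, beats_per_bar=4):
    """Add small random onset offsets plus an optional swing delay on off-beat eighths.

    Notes are dicts with a quantized 'start_tick'; offsets are written into
    'onset_offset_ticks' so the underlying grid stays untouched.
    """
    for note in notes:
        beat_pos = (note["start_tick"] / TICKS_PER_BEAT) % beats_per_bar
        on_strong_beat = beat_pos in (0.0, 2.0)  # beats 1 and 3 in 4/4

        # Strong beats stay tighter; other positions get the full jitter range.
        scale = 0.5 if on_strong_beat else 1.0
        offset_ms = rng.uniform(-jitter_ms, jitter_ms) * scale

        # Swing: delay the second eighth of each beat by a fraction of the beat length.
        eighth_index = (note["start_tick"] % TICKS_PER_BEAT) // (TICKS_PER_BEAT // 2)
        if swing_ratio > 0 and eighth_index == 1:
            offset_ms += swing_ratio * (60_000 / 90) / 2  # half a beat at the assumed 90 BPM

        note["onset_offset_ticks"] = note.get("onset_offset_ticks", 0) + ms_to_ticks(offset_ms)
    return notes

# Example usage (deterministic via a seeded RNG):
# notes = timing_jitter(notes, random.Random(42), jitter_ms=8, swing_ratio=0.12)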

2.2 Beat and Phrase Emphasis (Velocity Shaping)

Human players naturally emphasize certain beats (e.g. downbeats) and shape phrases dynamically (crescendo, decrescendo). We implement a simple velocity curve:

  • Notes on strong beats (e.g. beat 1 of each measure, or the first beat of a 4-beat grouping) get a slightly higher velocity, while weak beats are slightly softer.
  • For example, in 4/4, we could scale velocities by: beat 1 = +10% volume, beat 3 = +5%, beats 2 & 4 = normal or −5%. This mimics a basic accent pattern.

Additionally, for phrase-level shaping, if we have knowledge of phrase boundaries (or we assume every 4 bars is a phrase), we can apply a gentle swell: start slightly softer, rise to a peak, then fall at the end of the phrase. A concrete rule: “If a note is near the middle of a phrase, increase velocity by a few units (to create a mini-crescendo), and if it’s the final note of a phrase, maybe soften and shorten it slightly.” These are inspired by common performance practice (the “phrase arch” dynamics rule).

Even without explicit phrase detection, we can fake an “expression arc” over each bar: e.g. gradually increase velocity through the bar and drop at the end, to avoid flatness. We also ensure repeated notes are not all the same velocity – introducing small variations (±5 MIDI velocity units) to avoid machine-gun effect.
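
The accent-plus-arch idea could look roughly like this, again assuming dict-based notes on a 480-tick grid; the specific percentages are tunable defaults rather than validated values:

def shape_velocities(notes, rng, beats_per_bar=4, ticks_per_beat=480):
    """Apply a metrical accent pattern plus a gentle per-bar arch.

    Accent pattern (illustrative): beat 1 +10%, beat 3 +5%, beats 2/4 -5%.
    The arch raises velocity toward the middle of the bar and relaxes at the end.
    """
    accent = {0: 1.10, 1: 0.95, 2: 1.05, 3: 0.95}  # per beat index in 4/4
    bar_ticks = beats_per_bar * ticks_per_beat

    for note in notes:
        pos_in_bar = note["start_tick"] % bar_ticks
        beat_index = int(pos_in_bar // ticks_per_beat) % beats_per_bar
        frac = pos_in_bar / bar_ticks                     # 0..1 position in the bar

        arch = 1.0 + 0.05 * (1.0 - abs(2 * frac - 1.0))  # peaks mid-bar at +5%
        jitter = rng.randint(-5, 5)                       # avoid identical repeated velocities

        v = note["velocity"] * accent.get(beat_index, 1.0) * arch + jitter
        note["velocity"] = int(max(1, min(127, round(v))))
    return notes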

2.3 Melodic Voicing and Balance

In piano, often one hand carries the melody. A simple heuristic is to emphasize the melody voice slightly in both timing and velocity:

  • If the right-hand (treble) has a melodic line and the left-hand plays accompaniment chords, we can boost the right-hand velocities by a constant factor (or +5 velocity) and perhaps play melody notes slightly earlier than accompaniment (a technique pianists use called “melody lead”).
  • Identifying the melody can be done crudely by pitch range (higher notes often melody) or by note density (single-note line vs block chords).

As a baseline, if notes occur simultaneously (e.g. a chord), we can skew their onset: play the highest note ~5–10 ms earlier than lower ones, simulating how a pianist might arpeggiate a chord to bring out the top voice. This voicing rule adds a subtle human touch and prevents uniformly flat chord attacks.
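
A possible implementation of this chord-skew/melody-lead rule, under the same dict-based note assumption (the lead time, velocity boost, and the 90 BPM ms-per-tick constant are illustrative):

from collections import defaultdict

def melody_lead(notes, lead_ms=7.0, boost=5, ms_per_tick=1.39):
    """Push the top note of each simultaneous chord slightly earlier and louder.

    Notes sharing the same quantized 'start_tick' are treated as a chord; the
    highest pitch is assumed to carry the melody (a crude but cheap heuristic).
    """
    chords = defaultdict(list)
    for note in notes:
        chords[note["start_tick"]].append(note)

    for start, group in chords.items():
        if len(group) < 2:
            continue
        top = max(group, key=lambda n: n["pitch"])
        top["onset_offset_ticks"] = top.get("onset_offset_ticks", 0) - round(lead_ms / ms_per_tick)
        top["velocity"] = min(127, top["velocity"] + boost)
    return notes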

2.4 Articulation Variation

To avoid mechanical uniformity, we adjust note lengths in a simple way:

  • For legato passages (notes in sequence), we might slightly overlap or at least not leave gaps – accomplished by extending note-offs to exactly the next note’s onset or using sustain pedal.
  • For sharper, detached feel or fast passages, we might shorten notes a bit (e.g. 90% of notated length) to ensure clarity.

One easy rule: if a rest is written, enforce a clear separation (don’t overlap into the rest); if notes are consecutive without rests, give a tiny overlap or use pedal to connect. Another rule from classical performance: repetition articulation – when the same note repeats, pianists often separate them with a tiny silence or a lower velocity on the second to avoid a machine-like sound. We can implement a small gap (say 10–20 ms) before a repeated note or slightly shorten repeated notes.
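
A sketch of these articulation rules for a single voice follows; it assumes notes are sorted, monophonic events (real polyphony would run this per voice), and the thresholds are illustrative:

def apply_articulation(notes, ticks_per_beat=480, repeat_gap_ticks=15, detache_factor=0.9):
    """Connect stepwise lines, lightly detach fast notes, and separate repeated pitches."""
    notes = sorted(notes, key=lambda n: n["start_tick"])
    for cur, nxt in zip(notes, notes[1:]):
        gap = nxt["start_tick"] - (cur["start_tick"] + cur["duration_ticks"])
        if gap > 0:
            continue  # a written rest: keep the separation, never fill it

        if nxt["pitch"] == cur["pitch"]:
            # Repeated note: leave a small silence before the repetition.
            cur["duration_ticks"] = max(1, nxt["start_tick"] - cur["start_tick"] - repeat_gap_ticks)
        elif cur["duration_ticks"] <= ticks_per_beat // 4:
            # Fast passagework: shorten slightly for clarity.
            cur["duration_ticks"] = max(1, int(cur["duration_ticks"] * detache_factor))
        else:
            # Legato: extend exactly to the next onset (no gap, no overlap).
            cur["duration_ticks"] = nxt["start_tick"] - cur["start_tick"]
    return notes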

2.5 Pedal Application Heuristics

Sustain pedal use dramatically changes piano sound. For a minimal approach, we add pedal in a generic but musical way:

  • One heuristic: if the music has chords or slow harmonic changes, depress the sustain pedal at the start of the chord and release just before the next chord change – this will blend the notes within that harmony.
  • Concretely, we could pedal each measure: Pedal down on the first beat of a bar, release on the last beat, so that the chord or notes in that bar ring together but clear at the bar line to avoid blur between chords.
  • For passagework (many fast notes), continuous pedal might muddy, so perhaps pedal every two beats or every arpeggio.

Another simple rule: if notes are primarily legato and there are no large leaps, we might not need pedal (finger legato suffices), but if there are big leaps or the notation indicates “ped.”, then use it. Default baseline: apply sustain pedal in sections marked “legato” or where the harmony changes slowly, holding it for the duration of the chord or bar; avoid pedal in very fast runs or staccato sections. We ensure to lift the pedal at least briefly at chord changes to prevent dissonant overlap.
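
A bar-level pedal heuristic might be sketched as follows; the CC-style values and the pre-barline lift duration are assumptions:

def pedal_per_bar(notes, num_bars, beats_per_bar=4, ticks_per_beat=480, lift_ticks=30):
    """One sustain-pedal stroke per bar: down at the bar line, up shortly before
    the next bar line so harmonies do not blur across bars.

    Returns a list of CC64-style events; bars with no notes get no pedal.
    """
    bar_ticks = beats_per_bar * ticks_per_beat
    occupied = {int(n["start_tick"] // bar_ticks) for n in notes}

    events = []
    for bar in range(num_bars):
        if bar not in occupied:
            continue
        down = bar * bar_ticks
        up = (bar + 1) * bar_ticks - lift_ticks
        events.append({"tick": down, "type": "pedal", "value": 127})
        events.append({"tick": up, "type": "pedal", "value": 0})
    return events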

2.6 Rubato and Tempo Fluctuations

While complex expressive tempo curves are advanced, a minimal rubato can be simulated:

  • Slightly slow down at the end of a phrase (last bar before a section break) by increasing gaps between notes progressively, then resume tempo.
  • Since our system must preserve overall bar timing for looping, any rubato should “net out” by bar boundaries.

One compromise is local tempo modulation: allow small timing deviations inside a bar (rush then slow within the bar) that cancel out by the bar’s end. For instance, within a 4-beat bar, play beats 1–3 slightly faster (ahead of metronome) then stretch beat 4 a little longer so that downbeat of next bar is on time. This gives an illusion of expressive timing while each bar still occupies the correct total duration. Our baseline rules can include a gentle “phrase-final ritard” spanning the last few beats of a phrase, which can be implemented if we detect phrase ends or simply at the end of the entire piece.
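
One way to realize such a bar-local rubato is sketched below. The sinusoidal offset shape and the push_ms depth are arbitrary choices whose only requirement is that the curve returns to zero at each bar line:

import math

def bar_rubato(notes, push_ms=6.0, beats_per_bar=4, ticks_per_beat=480, ms_per_tick=1.39):
    """Play the first part of each bar slightly early and the last part slightly late,
    so that the deviation is zero at every bar line and the bar length is preserved."""
    bar_ticks = beats_per_bar * ticks_per_beat
    for note in notes:
        frac = (note["start_tick"] % bar_ticks) / bar_ticks   # 0..1 within the bar
        # One sine cycle: negative (early) in the first half, positive (late) in the
        # second half, exactly zero at the bar boundaries.
        offset_ms = -push_ms * math.sin(2 * math.pi * frac)
        note["onset_offset_ticks"] = note.get("onset_offset_ticks", 0) + round(offset_ms / ms_per_tick)
    return notes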

2.7 Preserving Structural Alignment

Crucially, we design the rules to not disturb bar and beat boundaries needed for looping:

  • All timing tweaks are small (tens of milliseconds) – the cumulative error over a bar is kept near zero.
  • One method is to include a compensatory adjustment: if a note is played earlier, a later note in the bar is slightly delayed to compensate, keeping the bar length constant.
  • Velocity and pedal changes have no effect on alignment (they don’t shift time).
  • Articulation (note length) could, if overextended beyond the next note, cause overlap issues, but we will avoid extending notes past their notated next note onset unless pedal is intended to handle that overlap.

By following the score’s bar/beat grid as a skeleton and layering deviations around it, we achieve humanization without breaking the timeline. The output performance, when rendered, should still start each measure on the correct timing, enabling seamless loops and easy splicing of regenerated sections.
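
A simple compensation pass along these lines might look like the following; mean-centering the offsets within each bar is one possible strategy, not the only one:

from collections import defaultdict

def renormalize_bar_offsets(notes, beats_per_bar=4, ticks_per_beat=480, tolerance_ticks=1):
    """Keep per-bar timing deviations netting out to ~zero, so bar lines stay on the grid.

    For each bar, if the mean onset offset of its notes drifts beyond the tolerance,
    the residual is subtracted from every note in that bar.
    """
    bar_ticks = beats_per_bar * ticks_per_beat
    by_bar = defaultdict(list)
    for note in notes:
        by_bar[int(note["start_tick"] // bar_ticks)].append(note)

    for bar, group in by_bar.items():
        residual = sum(n.get("onset_offset_ticks", 0) for n in group) / len(group)
        if abs(residual) > tolerance_ticks:
            for n in group:
                n["onset_offset_ticks"] = n.get("onset_offset_ticks", 0) - round(residual)
    return notes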

These heuristics can be implemented with relatively little code and align with known principles from music performance research. They are parameterized, allowing us to dial their intensity (for instance, a “humanization amount” could scale all random jitters and velocity variances). We will expose such a parameter so UX can set “tight” (small deviations) vs “loose” (larger deviations).


3. Optional Learned Model Design (Score-to-Performance Model)

While the rule-based approach provides a baseline, a learned model could refine the performance with more subtlety and automatically adapt to style nuances. We propose a lightweight model that takes the quantized score (symbolic sequence) as input and outputs an expressive performance (the same representation defined above: timing, velocity, articulation, pedal for each note). This essentially performs a sequence-to-sequence mapping from “deadpan” MIDI to “humanized” MIDI.

3.1 Model Architecture

A feasible architecture is a sequence model that can handle polyphonic music and timing context:

  • Options include a Transformer encoder-decoder or a Bi-directional LSTM/GRU that processes the score sequence and predicts performance parameters for each note.
  • For efficiency and simplicity, a non-autoregressive mapping is preferable: we’re not generating new content, just modifying existing notes, so the model can output all deviations in parallel.

One design is a Transformer encoder that takes a sequence of note events (with their quantized timing info and other features) and outputs a sequence of performance annotations of the same length. This is akin to a feed-forward translation where each input note yields an output tuple (Δtime, new velocity, Δduration, pedal state). A multi-layer bidirectional Transformer or a Graph Neural Network (to account for polyphonic relationships) has been used in literature for expressive performance rendering.

For MVP, a simpler model like a single-stream Transformer or an RNN that linearizes the music (sorted by time) should suffice given short clip lengths.
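
As an illustration, a small non-autoregressive encoder of this kind could be sketched in PyTorch roughly as follows; the layer sizes and the four-value output head are assumptions, not tuned choices:

import torch
import torch.nn as nn

class PerformanceRenderer(nn.Module):
    """Non-autoregressive score-to-performance model (sketch).

    Input: a sequence of per-note feature vectors (pitch, beat position, duration, ...).
    Output: per-note (onset offset, velocity, articulation factor, pedal logit).
    """
    def __init__(self, in_features=16, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.input_proj = nn.Linear(in_features, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 4)  # Δtime, velocity, articulation, pedal logit

    def forward(self, note_features, padding_mask=None):
        # note_features: (batch, num_notes, in_features)
        x = self.input_proj(note_features)
        x = self.encoder(x, src_key_padding_mask=padding_mask)
        return self.head(x)  # (batch, num_notes, 4)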

3.2 Inputs

The model input must provide musical context to inform expression:

  • Note events with structural features: Each note can be represented by (pitch, quantized onset time, notated duration). We enrich this with features like:
      • position in bar (beat index),
      • position in phrase or section if known,
      • any dynamic or phrasing markings from the generator (e.g. “this section is soft”),
      • the relative position in the piece (normalized time), so the model knows whether a note is near the beginning or end of the clip.
  • Optional style controls: If the user or prompt specifies a style (e.g. “swing” or “romantic rubato”), we can feed a style embedding or flags. For instance, a boolean input for “swing feel” could tell the model to introduce swing-like delays on certain beats. A “humanization amount” scalar could be an input that the model uses to scale its output magnitudes (for controllability).
  • Prompt embedding (optional): In a text-to-music context, the text prompt may imply an expression style (“gentle and expressive” vs “robotic”). If available, a vector from the text-to-symbolic module describing intended mood or intensity could be passed in. For the performance model, this is optional initially.
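
To make the input concrete, a per-note feature vector might be assembled like this; the exact feature set, normalizations, and the 4/4 assumption are illustrative only:

def note_features(note, bar_ticks, total_ticks, ticks_per_beat=480,
                  swing_flag=0.0, humanization_amount=1.0):
    """Build one per-note input vector for the model (illustrative feature set)."""
    pos_in_bar = (note["start_tick"] % bar_ticks) / bar_ticks
    return [
        note["pitch"] / 127.0,                              # normalized pitch
        note["start_tick"] / max(total_ticks, 1),           # position in the clip
        pos_in_bar,                                         # position within the bar
        ((note["start_tick"] // ticks_per_beat) % 4) / 4.0, # beat index (assumes 4/4)
        note["duration_ticks"] / ticks_per_beat,            # duration in beats
        note.get("velocity", 80) / 127.0,                   # score velocity, if any
        swing_flag,                                         # style control
        humanization_amount,                                # global intensity control
    ]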

3.3 Outputs

The model should output expressive parameters aligned with each note (or produce events including pedal). A convenient formulation is to output for each note:

  • Δtime (onset offset in milliseconds or ticks),
  • Δvelocity (or absolute velocity value),
  • Δduration or articulation fraction,
  • and possibly a pedal state indicator.

However, pedal tends to operate over spans rather than per-note. Another output scheme is to output a pedal control curve or labels per bar (e.g. a binary decision per bar whether pedal is down, and maybe a suggested release point). For MVP, we might leave pedal to heuristic rules or treat it separately. So the core outputs are timing and velocity for each note, which are the most crucial for humanization.

3.4 Model Size and Complexity

Given we target short clips (15–60s, maybe a few hundred notes maximum) and a single instrument, the model can be relatively small:

  • A transformer with a few layers (e.g. 4 layers, hidden size ~256) or an LSTM with a similar hidden size could capture necessary patterns without overkill.
  • The aim is to run in real-time or faster on typical hardware, so a lightweight model (<1M parameters) should be sufficient.

This is plausible because the problem is constrained (the output is a bounded deviation, not unconstrained generation).

3.5 Training Data

Training a score-to-performance model requires paired data: a clean score and the corresponding human-performed rendition. We can leverage existing datasets of piano performances:

  • MAESTRO dataset: MIDI & Audio recordings of human piano performances. MAESTRO provides MIDI files that are essentially expressive performances of classical pieces. To get training pairs, we need the aligned “score” version. We can derive a synthetic “score” by quantizing the performance MIDI to a strict grid (e.g. aligning to nearest 16th notes for rhythms, using notated tempo if known) – this yields an approximate score which, when compared to the original performance MIDI, gives us the ground-truth deviations.
  • ASAP dataset (Aligned Scores and Performances): Contains piano pieces with both written scores (MusicXML) and one or more human performances aligned note-by-note. For example, the ASAP dataset has 222 scores aligned with 1,068 performances (over 92 hours) of classical piano.

Such data is ideal as it directly provides pairs of quantized note values and expressive timing/velocities.

Other sources include:

  • Yamaha e-Piano Competition dataset (largely included in MAESTRO),
  • Possibly the Chopin Competition dataset.

Licensing for these is mostly non-commercial/research, but for prototype experimentation they are available. We assume we can use them for model training in a research setting, while planning to later obtain properly licensed data or record our own performance data for production.

3.6 Feature Extraction for Training

From each performance MIDI, we extract the features described in Section 1 and use them as training targets. Specifically, for every note (aligned to a score note), compute:

  • Onset deviation = (performance onset time) – (score onset time),
  • Velocity target = performance velocity (or deviation from a base),
  • Articulation = (performance duration) / (score notated duration),
  • Pedal usage = e.g. whether sustain pedal was on during this note.

If working from raw performance MIDI without a written score, quantizing the MIDI by removing all timing and setting uniform velocity gives us an approximate “clean” input. This approach has been used to generate training pairs from unaligned data. Tang et al. (2023) for example used automatic transcription to create score–performance pairs for training a Transformer on expressive piano performance.
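
A sketch of the per-note target extraction, assuming score and performance notes are already aligned and sustain-pedal spans have been collected from CC64 events; expressing offsets in beats is one normalization choice for tempo independence:

def extract_targets(score_note, perf_note, pedal_intervals, ticks_per_beat=480):
    """Compute per-note training targets from one aligned score/performance note pair.

    pedal_intervals is a list of (down_tick, up_tick) spans for the sustain pedal
    in the performance.
    """
    onset_dev_beats = (perf_note["start_tick"] - score_note["start_tick"]) / ticks_per_beat
    velocity = perf_note["velocity"] / 127.0
    articulation = perf_note["duration_ticks"] / max(score_note["duration_ticks"], 1)
    pedal_on = any(down <= perf_note["start_tick"] < up for down, up in pedal_intervals)
    return {
        "onset_deviation": onset_dev_beats,
        "velocity": velocity,
        "articulation": articulation,
        "pedal": 1.0 if pedal_on else 0.0,
    }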

3.7 Training Objective

We train the model as a regression or classification problem on each output parameter:

  • A simple loss is mean squared error (MSE) for continuous deviations (timing in milliseconds, velocity differences).
  • We might also add a term for classification for pedal on/off (binary cross-entropy).

If using a Transformer, we could also treat it as a sequence prediction and use teacher-forcing with a suitable loss per time step. It might be beneficial to normalize the targets (e.g. divide timing offsets by beat duration to predict a fractional timing, and scale velocities to 0–1) for easier learning.
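
The combined objective could be expressed roughly as follows; the output ordering and the 0.5 pedal weight are arbitrary assumptions to be tuned:

import torch
import torch.nn.functional as F

def performance_loss(pred, target):
    """Combined regression + pedal classification loss (sketch).

    pred and target are (batch, num_notes, 4) tensors ordered as
    (onset_deviation, velocity, articulation, pedal); the first three are
    regressed with MSE, pedal with binary cross-entropy on the logit.
    """
    mse = F.mse_loss(pred[..., :3], target[..., :3])
    pedal_bce = F.binary_cross_entropy_with_logits(pred[..., 3], target[..., 3])
    return mse + 0.5 * pedal_bce  # weighting is an arbitrary starting point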

During training, we want the model’s output to match the human performance as closely as possible. One challenge is the one-to-many nature of performance: there’s no single “correct” way to play a piece. So we won’t overfit to exactly reproducing one performance; rather, we want the model to capture general expressive patterns. Data augmentation can help: we can transpose pieces, or slightly alter tempos to increase variety, as long as we keep the performance deviations consistent (transposition doesn’t affect timing deviations, only pitch context).

3.8 Inference Usage

At inference time, given a new generated score (from the symbolic composer), the learned model would output a set of deviations. We could then either:

  • (a) directly apply them to get the performance MIDI, or
  • (b) use them as suggestions combined with the rule-based system (a hybrid approach).

For example, a hybrid could apply the learned model’s fine-grained adjustments but still enforce our rule constraints about alignment at bar boundaries. If the model is deterministic, running it twice on the same input yields the same output, which is good for consistency. If we introduce randomness (like sampling from a distribution of possible performances), we need to seed it for reproducibility or only use that for optional variation.

In summary, a learned model offers a path to refinement beyond the simple rules, potentially capturing expressive micro-timing that our heuristics miss. It is feasible to implement a small Transformer or RNN that, given a sequence of notes with positions, predicts note-by-note adjustments. Such a model would require a dataset of paired score-performance (e.g. using MAESTRO/ASAP data) and could be trained with standard regression objectives. The result would be a module that “translates” a mechanical performance into a human-like one, in a way that could generalize across different pieces. We emphasize that this model is optional for the prototype – we would first try the rule-based system, and only if that proves insufficient or if more expressivity is desired would we integrate the learned model.


4. Integration with Architecture A Pipeline

In the end-to-end Architecture A (text → symbolic → performance → audio) pipeline, the performance module sits squarely between the symbolic generator and the audio renderer.

Conceptual flow:

  1. Text prompt → symbolic generator → clean piano score (quantized MIDI-like representation).
  2. Clean score + performance settings → performance module (rule-based or learned) → expressive performance MIDI (timing, velocity, pedal).
  3. Expressive performance → instrument renderer (sampler / synth / neural) → audio.

The interfaces between components are:

4.1 Input to Performance Module

The symbolic generator outputs a “clean score” representation, likely as a MIDI or similar event list:

  • Note pitches,
  • Quantized start times (e.g. in beats or seconds according to a base tempo),
  • Durations,
  • Possibly initial velocities or markings,
  • Structural info like bar lines / tempo / meter.

The API here could be:

PerformanceInput = {
  "notes": [
    {"id": 0, "beat_time": 0.0, "pitch": 60, "duration_beats": 1.0, "velocity": 80},
    ...
  ],
  "tempo": 90,
  "time_signature": "4/4"
}

The performance module exposes something like:

performance = humanize_performance(score=PerformanceInput, settings=PerformanceSettings)

4.2 Output of Performance Module

The performance module returns an Expressive Performance in a format ready for rendering. If the renderer is MIDI-based, the natural choice is to output a MIDI (with real-time timestamps, all the controls embedded).

For example:

PerformanceOutput = {
  "notes": [
    {
      "id": 0,
      "time_seconds": 0.012,
      "pitch": 60,
      "velocity": 87,
      "duration_seconds": 0.95
    },
    ...
  ],
  "controls": [
    {"time_seconds": 0.0, "type": "pedal", "value": 1.0},
    {"time_seconds": 0.95, "type": "pedal", "value": 0.0},
  ]
}

The renderer can then use this to drive a piano sampler or synth.

4.3 Renderer Integration

In v1 of Arch A, suppose the renderer is a high-quality piano sample library or physical modeling synth. We feed it a MIDI stream (our performance output). For future upgrade to a neural renderer (MIDI-DDSP or similar), we might instead feed note events with additional features — precisely what the performance module is producing.

4.4 UX Controls (Humanization Level & Styles)

We plan user-facing controls that influence the performance rendering.

Humanization Amount Slider:

  • Range: 0% (completely quantized, “mechanical”) to 100% (full humanization).
  • Implementation: scale the magnitude of all deviations:
      • Timing offsets: offset_scaled = amount * offset_full.
      • Velocities: interpolate between original and humanized values.
      • Articulation factors: interpolate toward 1.0 (original length) as amount → 0.
  • At 0%: the performance is identical to the input score.
  • At 100%: the full effect of the rules or learned model is applied.
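
A minimal sketch of this interpolation, assuming the quantized score note and its fully humanized counterpart are both available as dicts:

def apply_humanization_amount(score_note, full_perf_note, amount):
    """Interpolate between the quantized score note (amount=0) and the fully
    humanized note (amount=1); intermediate values scale every deviation linearly."""
    lerp = lambda a, b: a + amount * (b - a)
    return {
        "pitch": score_note["pitch"],
        "start_tick": round(lerp(score_note["start_tick"], full_perf_note["start_tick"])),
        "duration_ticks": round(lerp(score_note["duration_ticks"], full_perf_note["duration_ticks"])),
        "velocity": round(lerp(score_note["velocity"], full_perf_note["velocity"])),
    }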

Style Presets (“Swing”, “Rubato”, etc.):

  • “Swing”: apply a specific groove template for eighth notes (fixed swing ratio).
  • “Straight”: disable swing jitter, use only small random offsets.
  • “Loose”: larger velocity variance and timing jitter.
  • “Tight”: minimal jitter and smaller dynamic contrast.
  • “Pedal heavy / light / none”: control pedal heuristics.

These presets are mapped to the underlying rule parameters and, if using a learned model, potentially to style conditioning flags.

API-wise:

PerformanceSettings = {
  "humanization_amount": 0.7,
  "swing": True,
  "pedal_style": "medium",
  "tight_vs_loose": "loose"
}

4.5 Determinism and Caching

Determinism:

  • Given the same score and settings, the performance module should produce the same output unless explicitly randomized.
  • Rule-based components are deterministic; for random jitter, we can seed the RNG with a function of the score (e.g. hash of note IDs) so output is consistent.
  • For the learned model, inference is deterministic unless we deliberately sample from a probabilistic output.
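
For example, a reproducible seed could be derived from the score content like this; the hashing scheme and the choice of fields to hash are illustrative:

import hashlib
import random

def rng_for_score(score_notes, settings_fingerprint=""):
    """Derive a reproducible RNG from the score content (and optionally the settings),
    so re-running humanization on the same input yields the same 'random' jitter."""
    digest = hashlib.sha256()
    for n in sorted(score_notes, key=lambda n: (n["start_tick"], n["pitch"])):
        digest.update(f'{n["start_tick"]}:{n["pitch"]}:{n["duration_ticks"]};'.encode())
    digest.update(settings_fingerprint.encode())
    seed = int.from_bytes(digest.digest()[:8], "big")
    return random.Random(seed)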

Caching:

  • We cache performance at the segment/bar level.
  • If the user edits only bar 4:
      • re-run the performance module for that bar (and possibly its immediate neighbors);
      • keep bars 1–3 and 5–8 from cache.
  • Because our deviations are local and bar boundaries stay fixed, we can splice cached performance segments without creating timing glitches.

4.6 Partial Regeneration

We support scoped regeneration:

  • User selects bars 5–8 → “Re-humanize this section”.
  • The performance module accepts either:
      • a sub-score for bars 5–8, or
      • the full score with a mask of bars to recompute.
  • Outside the selected region, cached performance is reused.
  • At region boundaries, we ensure continuity:
      • respect bar alignment so there is no tempo jump;
      • handle pedal continuity, possibly re-evaluating pedal events for a one-bar overlap at each end.
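
A simplified splice along these lines is sketched below; bar membership is computed from the quantized start_tick (offsets are stored separately), and humanize_fn is a hypothetical stand-in for whichever rule-based or learned pass is being re-run:

def rehumanize_bars(full_perf, score, bars_to_redo, humanize_fn,
                    beats_per_bar=4, ticks_per_beat=480):
    """Re-run humanization only for the selected bars and splice the result back.

    full_perf and score are lists of note dicts; because deviations are bar-local
    and bar lines stay fixed, untouched bars can be kept verbatim from cache.
    """
    bar_ticks = beats_per_bar * ticks_per_beat
    bar_of = lambda n: int(n["start_tick"] // bar_ticks)

    kept = [n for n in full_perf if bar_of(n) not in bars_to_redo]
    redo_input = [n for n in score if bar_of(n) in bars_to_redo]
    redone = humanize_fn(redo_input)
    return sorted(kept + redone, key=lambda n: n["start_tick"])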

In summary, the performance module is cleanly separated and pluggable. Upstream, it receives a fully specified score (no composition), and downstream it outputs a performance in a standard format. This matches Architecture A’s philosophy: a symbolic core for composition, a controllable performance renderer, then a sound generator.


5. Evaluation Plan

We use both automatic metrics and human listening tests.

5.1 Automatic Metrics

Timing Offset Distribution:

  • Compute statistics of onset deviations (mean, std, distribution).
  • Compare to human performance data (e.g. typical deviations ~10–30 ms).
  • Check bar-length deviation: difference between intended bar duration and actual. Should be ≈0 on average to preserve loops.
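
A small measurement helper for these checks might look like the following; the 90 BPM ms-per-tick constant is an assumption carried over from the earlier sketches:

import statistics
from collections import defaultdict

def bar_timing_report(notes, beats_per_bar=4, ticks_per_beat=480, ms_per_tick=1.39):
    """Summarize onset deviations overall and per bar; the per-bar mean should stay
    near zero if structural alignment is preserved."""
    offsets_ms = [n.get("onset_offset_ticks", 0) * ms_per_tick for n in notes]
    bar_ticks = beats_per_bar * ticks_per_beat
    per_bar = defaultdict(list)
    for n in notes:
        per_bar[int(n["start_tick"] // bar_ticks)].append(
            n.get("onset_offset_ticks", 0) * ms_per_tick)

    return {
        "mean_offset_ms": statistics.fmean(offsets_ms) if offsets_ms else 0.0,
        "stdev_offset_ms": statistics.pstdev(offsets_ms) if offsets_ms else 0.0,
        "max_bar_mean_offset_ms": max(
            (abs(statistics.fmean(v)) for v in per_bar.values()), default=0.0),
    }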

Velocity and Dynamics Profiles:

  • Velocity variance: humanized > quantized (which might be flat).
  • Autocorrelation of velocity vs. meter: for 4/4, emphasis on beat 1 shows up as a periodic pattern in the autocorrelation.
  • Dynamic range: check that the range or standard deviation of velocities increases appropriately.
  • Phrase shape (if available): average velocity arcs within phrases.

Articulation / Note Length Stats:

  • Measure proportion of shortened vs extended notes.
  • Ensure we do not create large overlaps where not intended.
  • Check repetition articulation (e.g. repeated notes show some variation).

Groove Consistency:

  • If using swing:
      • measure the swing ratio of eighth-note pairs;
      • check that it is consistent across bars.
  • Group notes by metrical position and analyze offset variance; ensure a structured pattern plus small randomness.

Audio-Level Metrics:

  • Loudness curves: humanized version should show more dynamic variation.
  • Tempo stability: derive tempo from audio and verify it’s stable except where intended.

These metrics help tune parameters and verify we don’t break alignment. However, they do not guarantee musical quality.

5.2 Human Evaluation

A/B Preference Tests:

  • Present pairs, order randomized:
      • (A) quantized playback,
      • (B) humanized playback of the same score.
  • Ask: “Which sounds more natural / musical / like a human performance?”
  • Run across multiple excerpts and listeners (experts and non-experts).
  • Success target: clear majority (e.g. >70%) prefer humanized over quantized.

Rating Scales:

Ask listeners to rate each clip (1–5) on:

  • Naturalness / Human-likeness.
  • Expressiveness / Musicality.
  • Tightness / Rhythmic accuracy.

We want:

  • Humanized significantly higher on naturalness and expressiveness,
  • Quantized higher or equal on tightness, but humanized still acceptable (not perceived as sloppy).

Comparison to Real Human Performance (optional):

  • Use a piece with a known human performance.
  • Inputs: score → our system → humanized performance.
  • Compare:
      • quantized vs. humanized vs. real performance;
      • ABX test: can listeners distinguish our output from the real one, and which is closer to the real performance?
  • Goal is not to match human exactly, but to be closer than quantized.

Success Criteria:

  • Majority of listeners prefer humanized to quantized.
  • Humanized rated clearly more natural/expressive, with no major complaints about timing.
  • Automatic metrics confirm structural alignment and reasonable deviation magnitudes.

5.3 Performance & Latency Evaluation

  • Rule-based module should be essentially instantaneous.
  • Learned model inference:
      • test on a typical CPU;
      • target <50 ms for a 30-second clip;
      • if slower, the learned model may be reserved for “HQ render” while the rule-based module handles “preview”.

6. Risk Register and Minimal Experiments

We list key assumptions and test each with minimal experiments.

Assumption 1: Simple Rules Yield a Significant Perceptual Improvement

Risk: Rule-based approach might still sound “MIDI-ish” or make performance sloppy.

Experiment:

  • Implement rule set for a few 15–30s excerpts (some generated, some from existing MIDI).
  • Render quantized vs rule-humanized versions.
  • Run a quick internal listening test (team musicians / non-musicians):
      • preference: which sounds better?
      • qualitative feedback: “more musical”, “too sloppy”, etc.

Outcome:

  • If majority prefer humanized and describe clear benefit → rules validated for MVP.
  • If improvement marginal or negative → refine rules or prioritize learned model/hybrid.

Assumption 2: Maintaining Bar/Beat Alignment Does Not Overly Limit Expressiveness

Risk: Strict per-bar alignment might make tempo feel unnaturally rigid.

Experiment:

  • Create two versions:
      • A: bar-aligned humanized (our default);
      • B: a version with a slight global ritardando across 4 bars (breaking alignment).
  • Subjective comparison:
      • compare perceived expressiveness between A and B;
      • check loop behavior (A loops seamlessly; B shows an obvious tempo jump).
  • If A is acceptable and loops cleanly, the constraint holds for MVP.

Assumption 3: Small Dataset and Model Can Learn Performance Refinement

Risk: Learned model might underfit/overfit; require more data than expected.

Experiment:

  • Take ~50 pieces from MAESTRO/ASAP and generate score–performance pairs (quantized vs. expressive).
  • Train a small model (e.g. a 1-layer BiLSTM) just for velocity deviations.
  • Compare rule-based velocities vs. model-predicted velocities vs. the real performance on held-out pieces.
  • Listening: for a sample piece, generate rule-based and model-based performances and have an expert judge which is closer to the real one.

Outcome:

  • If model is clearly better, justifies building full score-to-performance model.
  • If gains are marginal, rule-based may suffice for v0; model becomes optional enhancement.

Assumption 4: Performance Model Is Fast Enough for Interactive Use

Risk: Learned model is too slow on typical hardware.

Experiment:

  • Benchmark inference on CPU:
      • ~30 s piano clip (a few hundred notes);
      • measure wall-clock time;
      • target: <50 ms for mapping score → performance.
  • If slower, options:
      • optimize the model (smaller architecture, pruning);
      • use it only for HQ offline rendering;
      • use the rule-based module for real-time preview.

Assumption 5: Users Want Control Over Humanization & Style

Risk: Controls may be unused or confusing.

Experiment:

  • In a simple UI prototype, provide a “humanization” slider and style toggles (swing, pedal).
  • Ask test users to:
      • generate a clip;
      • adjust controls until it “feels right”.
  • Observe:
      • Do they use the slider?
      • Do they understand its effects?
      • Are any controls missing (e.g. a desire for per-section variation)?

Outcome:

  • If controls used meaningfully and understood → keep.
  • If rarely used or confusing → simplify (e.g. just a couple of presets).

Assumption 6: Added Variability Doesn’t Hurt Downstream Renderer or Prompt Coherence

Risk: Performance layer might conflict with prompt-level expectations (e.g. “tight rhythm” vs loose humanization).

Experiment:

  • Take prompts like:
      • “Very tight, precise piano arpeggios”;
      • “Laid-back, lazy swing piano”.
  • Generate score and performance with different settings: low vs. high humanization, swing vs. straight.
  • Listening:
      • Does the “tight” setting sound noticeably more precise?
      • Does the “swing” setting actually swing?

Outcome:

  • If mismatch between prompt intent and performance behavior, adjust mapping between prompts and performance settings.

Assumption 7: Data & Licensing Are Sufficient for Learned Model

Risk: Large clean score–performance dataset with commercial rights may not be available.

Mitigation & Experiment:

  • Use MAESTRO/ASAP for research prototyping (non-commercial).
  • Verify we can:
      • align performances and extract expressive parameters;
      • train a working model.
  • For production, plan to record or obtain proprietary performance data (e.g. hire pianists to create MIDI performances of royalty-free pieces).
  • Keep a clear separation:
      • the rule-based module is always safe to use commercially;
      • a learned model trained on non-commercial data remains research-only until properly licensed data is available.

7. Priority Plan for First Implementation Cycle

  1. Implement the rule-based performance module with timing, velocity, articulation, and pedal heuristics.
  2. Run internal A/B listening tests for quantized vs. humanized outputs; adjust rules based on feedback.
  3. Verify alignment and looping via automatic metrics and manual loop tests.
  4. Integrate with the renderer and UX:
      • define the PerformanceInput, PerformanceOutput, and PerformanceSettings schemas;
      • add the humanization slider and basic presets.
  5. Benchmark latency on target hardware.
  6. Prototype a small learned model on limited data (optional, parallel track).
  7. Refine based on early user and team feedback; log results into the risk register and adjust the roadmap (e.g., invest more in the learned model if ROI appears high).

8. Summary

This design defines a modular, lightweight performance layer for solo piano within Architecture A:

  • A clear performance representation capturing timing, velocity, articulation, and pedal in an extended MIDI-like format, with clean alignment to score.
  • A rule-based baseline humanization that is easy to implement and tune, and that captures the most important expressive aspects while preserving structural alignment for looping and partial regeneration.
  • An optional learned model that can refine performance in a data-driven way, trained from score–performance pairs (e.g., MAESTRO/ASAP), with a modest model size and standard regression objectives.
  • A well-defined integration into the symbolic-to-audio pipeline, with explicit APIs, UX controls for humanization level and style, determinism, and caching.
  • A robust evaluation plan combining automatic metrics and human listening tests to ensure the humanized output is clearly preferred over quantized baselines and remains structurally coherent.
  • A risk register with minimal experiments to validate or falsify key assumptions quickly, guiding iterative refinement.

This module, once implemented and validated, will significantly increase the perceived quality and musicality of solo piano outputs in the LEMM / Architecture A pipeline, while remaining small, interpretable, and pluggable for future extensions (e.g., multi-instrument performance modeling).


9. Sources

  1. Jeong et al., Graph Neural Network for Music Score Data and Modeling Expressive Piano Performance, ICML 2019 – note-level graph and hierarchical model for piano performance.
  2. Jeong & Kim, Score and Performance Features for Rendering Expressive Music, MEC 2019 – extraction of timing, velocity, articulation, pedal from performance data.
  3. Friberg et al., KTH Rule System / Director Musices, 1991–1995 – rule-based expressive performance (micro-timing, accents, phrase shaping, noise).
  4. Cancino-Chacón et al., Computational Models of Expressive Performance: A Review, Front. Digit. Hum. 2018 – discusses evaluation and KTH model noise and limitations.
  5. ASAP Dataset paper – aligned scores and performances for piano, providing score–performance pairs.
Reference details (all cited in this document, with links):

  1. LEMM internal prompt: “Research prompt – humanization and performance modeling for a single instrument (piano option)”. Internal LEMM document that defines the scope, constraints, and questions for the piano-only performance modeling and humanization research; specifies that we focus on a single-instrument (piano) performance layer sitting between symbolic generation and audio rendering. (Internal project file, not publicly accessible.)
  2. Jeong et al., “Graph Neural Network for Music Score Data and Modeling Expressive Piano Performance” (ICML 2019). Proposes a graph neural network plus hierarchical architecture that takes symbolic scores as input and predicts expressive performance features (timing, dynamics, pedal, etc.), demonstrating that structured score representations help render human-like piano performances. https://proceedings.mlr.press/v97/jeong19a/jeong19a.pdf
  3. Jeong et al., “Score and Performance Features for Rendering Expressive Music Performances” (MEC 2019). Describes how to extract detailed note-level score and performance features (timing deviations, velocities, articulation, pedal) from MusicXML and MIDI, providing a feature set used for expressive performance rendering and related MIR tasks. https://music-encoding.org/conference/abstracts/abstracts_mec2019/Dasaem%20Jeong%20Music%20Encoding%20Conference%202019.pdf
  4. Friberg et al., “Director Musices: The KTH Performance Rules System”. Classic description of the KTH rule-based system (Director Musices) that turns notated scores into expressive performances via hand-designed rules for timing, dynamics, articulation, phrasing, and final ritardando, with adjustable rule strengths. https://www.researchgate.net/publication/228714230_Director_musices_The_KTH_performance_rules_system
  5. Cancino-Chacón et al., “Computational Models of Expressive Music Performance: A Comprehensive and Critical Review” (Frontiers in Digital Humanities, 2018). Broad survey of computational models for expressive performance (rule-based, basis-function, and data-driven), discussing input representations, output parameters (tempo, timing, dynamics, articulation), and challenges in evaluating generative models. https://www.frontiersin.org/articles/10.3389/fdigh.2018.00025/full
  6. Foscarin et al., “ASAP: A Dataset of Aligned Scores and Performances for Piano Transcription” (ISMIR 2020). Introduces the ASAP dataset: 222 piano scores with 1,068 aligned performances (≈92 hours), providing MusicXML scores, quantized MIDI scores, performance MIDI, and annotations (beats, time signatures), enabling precise score–performance alignment for training expressive models. https://program.ismir2020.net/static/final_papers/127.pdf
  7. ASAP dataset repository. GitHub repository hosting the ASAP dataset’s MusicXML, MIDI, and annotations, plus tooling for accessing and processing the aligned score–performance pairs used here as example training data for score-to-performance models. https://github.com/fosfrancesco/asap-dataset
  8. MAESTRO dataset (MIDI and Audio Edited for Synchronous TRacks and Organization). Roughly 200 hours of virtuosic piano performances with tightly aligned audio and MIDI, used here as a prime example of large-scale paired performance data that can be quantized into “scores” and compared back to the expressive originals. https://magenta.withgoogle.com/datasets/maestro
8 Yes MAESTRO dataset: “MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization)” Dataset of ≈200 hours of virtuosic piano performances with tightly aligned audio and MIDI, used in the research text as a prime example of large-scale paired performance data that can be quantized into “scores” and compared back to the expressive originals. :contentReference[oaicite:6]{index=6} https://magenta.withgoogle.com/datasets/maestro