classicalML
itsnick.co / piano
research · v.2026.04
project: tidal-horizon
kind: research
state: live · sweeping
repo: private
a research project / nick friesen / 2026

The first transformer composer trained on 27.5M tokens of voice-preserved classical piano scores. Most piano ML sounds chord-y because the data is chord-y. This dataset isn't.

26,934 pieces  ·  27.5M tokens  ·  59M params (d768, L8)  ·  1.1123 best val_loss

The cleanest score-level piano data at scale.

Existing piano ML datasets are either too small (a few hundred MIDI files from a single composer) or compressed at the wrong layer. The largest open piano corpora pass scores through music21's chordify(), which collapses all simultaneously sounding notes, across voices and staves, into a single stream of block chords.

That step throws away everything a pianist actually does: counterpoint, voice leading, hand independence, phrasing across staves. A model trained on chordified data can only learn harmonic averages. It will sound like piano for a few bars and then dissolve into mush.

classicalML preserves voice and staff structure as separate token streams. Each note carries its voice index, its staff (left vs right hand), and its onset / offset / duration. The tokenizer was rewritten from scratch in March 2026 specifically to fix the chordify problem, after the first version of the model showed the symptom.
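As a rough illustration of what one voice-preserving note record might look like, here is a minimal Python sketch. The `NoteEvent` class, its field names, the `DURATION_SYMBOLS` table, and the `to_tokens` helper are all hypothetical; the project's tokenizer is private, and the token layout below only mimics the `VOICE … STAFF …` example shown on this page.

```python
from dataclasses import dataclass
from fractions import Fraction

# Hypothetical duration symbols, keyed by length in quarter notes.
DURATION_SYMBOLS = {
    Fraction(4): "w",
    Fraction(2): "h",
    Fraction(1): "q",
    Fraction(1, 2): "e",
}


@dataclass(frozen=True)
class NoteEvent:
    pitch: str          # scientific pitch name, e.g. "D5"
    voice: int          # voice index within its staff
    staff: str          # "R" (right hand) or "L" (left hand)
    onset: Fraction     # start time, in quarter notes from the start
    duration: Fraction  # length, in quarter notes

    @property
    def offset(self) -> Fraction:
        """End time of the note, derived from onset and duration."""
        return self.onset + self.duration


def to_tokens(events):
    """Serialise notes in onset order, right hand first, then by voice index."""
    ordered = sorted(events, key=lambda e: (e.onset, e.staff != "R", e.voice))
    return [
        f"VOICE {e.voice} STAFF {e.staff} {e.pitch} {DURATION_SYMBOLS[e.duration]}"
        for e in ordered
    ]


# Toy input: melody and inner voice in the right hand, a held bass note in the left.
events = [
    NoteEvent("G3", 0, "L", Fraction(0), Fraction(2)),
    NoteEvent("B4", 1, "R", Fraction(0), Fraction(1)),
    NoteEvent("D5", 0, "R", Fraction(0), Fraction(1)),
    NoteEvent("E5", 0, "R", Fraction(1), Fraction(1)),
]
tokens = to_tokens(events)
print("\n".join(tokens))
```

The point of the sketch is only that voice and staff travel with every note, so the melody, inner voice, and bass line can each be followed through the sequence.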

Voice-preserving tokenization.

The single change that made this dataset useful. A two-line replacement of the tokenizer that took roughly four months of failed sweeps to find.

before · chordified

// every simultaneous note collapsed
// into a single block chord
CHORD [G4 B4 D5] q
CHORD [A4 C5 E5] q
CHORD [F4 A4 D5] q
// no melody, no bass line,
// no hand information.
Loss is fine. Output is harmonic mush.

after · voice-preserving

VOICE 0 STAFF R D5 q
VOICE 1 STAFF R B4 q
VOICE 0 STAFF L G3 h
VOICE 0 STAFF R E5 q
// melody · inner voice · bass
// each tracked independently.
Same loss number, vastly different musical structure.
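To make the information loss concrete, the sketch below collapses two toy passages into block chords. It is a simplified stand-in for a chordify-style step, not music21's implementation; `collapse_to_chords`, `midi_number`, and both passages are made up for illustration, and every note is assumed to be a quarter note with no accidentals.

```python
from collections import defaultdict

PITCH_CLASSES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def midi_number(pitch):
    """Convert a natural-note name such as 'G4' to a MIDI number."""
    return 12 * (int(pitch[1:]) + 1) + PITCH_CLASSES[pitch[0]]


def collapse_to_chords(notes):
    """Group notes by onset and drop voice and staff, keeping only the pitches."""
    by_onset = defaultdict(list)
    for onset, pitch, _voice, _staff in notes:
        by_onset[onset].append(pitch)
    return [
        "CHORD [" + " ".join(sorted(pitches, key=midi_number)) + "] q"
        for _onset, pitches in sorted(by_onset.items())
    ]


# Each note is (onset in quarter notes, pitch, voice, staff).
# Passage A: voice 0 carries the top line D5 -> E5, voice 1 moves B4 -> C5.
passage_a = [
    (0, "G4", 0, "L"), (0, "B4", 1, "R"), (0, "D5", 0, "R"),
    (1, "A4", 0, "L"), (1, "C5", 1, "R"), (1, "E5", 0, "R"),
]
# Passage B: same pitches, but the two right-hand voices cross on the second beat.
passage_b = [
    (0, "G4", 0, "L"), (0, "B4", 1, "R"), (0, "D5", 0, "R"),
    (1, "A4", 0, "L"), (1, "C5", 0, "R"), (1, "E5", 1, "R"),
]

chords_a = collapse_to_chords(passage_a)
chords_b = collapse_to_chords(passage_b)
print(chords_a)
print(chords_a == chords_b)
```

Both passages produce the same two chord tokens even though their voice leading differs, which is the distinction a model trained on collapsed data never sees.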

A 59M-param transformer, swept Karpathy-style.

Greedy hill-climb autoresearch over the hyperparameter space. Each trial is a real training run, not a surrogate. The best settings lowered val_loss from a 1.17 baseline to 1.1123: small in number, audibly different in output.
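For readers unfamiliar with the search strategy, here is a generic greedy hill-climb in Python. It is a sketch under stated assumptions, not the project's sweep code: the grid values in `SEARCH_SPACE` are invented, and `toy_val_loss` is a cheap placeholder for what the project does with a full training run per trial.

```python
import random

# Hypothetical grid; the real search ranges are not published.
SEARCH_SPACE = {
    "lr": [1e-4, 3e-4, 1e-3],
    "dropout": [0.1, 0.25, 0.4],
    "d_model": [512, 768, 1024],
    "layers": [6, 8, 12],
    "seq_len": [512, 1024, 2048],
}
# Arbitrary optimum for the toy objective only.
TARGET = {"lr": 3e-4, "dropout": 0.25, "d_model": 768, "layers": 8, "seq_len": 1024}


def toy_val_loss(config):
    """Placeholder for a training run: grows with grid distance from TARGET."""
    steps = sum(
        abs(values.index(config[name]) - values.index(TARGET[name]))
        for name, values in SEARCH_SPACE.items()
    )
    return 1.0 + 0.01 * steps


def neighbours(config):
    """Yield configs that differ by one grid step in exactly one hyperparameter."""
    for name, values in SEARCH_SPACE.items():
        i = values.index(config[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                yield {**config, name: values[j]}


def greedy_hill_climb(start, evaluate, max_trials=50, seed=0):
    """Move to the first neighbour that improves the loss; stop when none does."""
    rng = random.Random(seed)
    best, best_loss = start, evaluate(start)
    trials = 1
    improved = True
    while improved and trials < max_trials:
        improved = False
        candidates = list(neighbours(best))
        rng.shuffle(candidates)
        for candidate in candidates:
            if trials >= max_trials:
                break
            loss = evaluate(candidate)
            trials += 1
            if loss < best_loss:
                best, best_loss = candidate, loss
                improved = True
                break  # greedy: take the first improving neighbour
    return best, best_loss, trials


start = {name: values[0] for name, values in SEARCH_SPACE.items()}
best, best_loss, trials = greedy_hill_climb(start, toy_val_loss)
print(best, best_loss, trials)
```

Because every trial in the real setting costs a full training run, the trial budget (`max_trials` here, roughly 50 per pass according to the spec list) is the main constraint on how far a single pass can climb.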

architecture: d_model = 768  ·  layers = 8  ·  59M params
training: PyTorch  ·  AdamW  ·  dropout = 0.25  ·  cosine schedule  ·  mixed precision
data: 26,934 pieces  ·  27.5M tokens  ·  voice + staff preserved  ·  90 / 5 / 5 split
search: Karpathy-style greedy hill-climb  ·  ~50 trials per pass  ·  lr · dropout · width · depth · seq_len
best val_loss: 1.1123  ·  from 1.17 baseline  ·  5% improvement, audible quality lift
status: Phase 2 sweep ongoing  ·  targeting sub-1.10
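A quick back-of-envelope check of two numbers in this spec. The 12·d² per-layer rule assumes a standard transformer block (4·d² attention weights plus 8·d² for a 4x-wide MLP), which this page does not confirm. Under that assumption the 8-layer stack holds about 56.6M of the 59M parameters, with the remainder going to embeddings and other small tensors.

```python
d_model, layers = 768, 8

# Assumption: 4*d^2 attention weights + 8*d^2 MLP weights per block, biases ignored.
block_params = 12 * d_model**2
stack_params = layers * block_params

baseline, best = 1.17, 1.1123
relative_improvement = (baseline - best) / baseline

print(f"{stack_params / 1e6:.1f}M parameters in the block stack")  # 56.6M
print(f"{relative_improvement:.1%} relative val_loss improvement")  # 4.9%
```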

A generation.

Cherry-picked from a recent sweep, generated unconditionally (no prompt, no key, no style hint), then rendered with a baseline piano sound. I transposed it down a step or two to make it playable under my hands. The score below is what came out, after the key change.

A-01  ·  joplin-coded ragtime
Generation #10, key-shifted for playability.
unconditional · transposed by hand
Score for the Joplin-coded generation, transposed for playability.

Open questions, prior art, what's next.

open questions

Does voice-preserving tokenization help on smaller models, or is it only legible at d768+? Does adding a global-key conditioning token narrow the stylistic variance without flattening voice independence? Where does this dataset plateau on val_loss before scale stops mattering?

prior art

MAESTRO and GiantMIDI-Piano are the closest predecessors, but both consist of performance MIDI rather than score MIDI. Music Transformer (Huang et al.) and Performance RNN frame the tokenization problem; this dataset proposes a different answer.

what's next

Phase 2 hyperparameter sweep on H100. Conditional generation experiments (key, era, composer-as-LoRA). A clean Hugging Face release of the dataset and tokenizer if the licensing checks all come back clear. Possibly a paper.

Curious? Get in touch.

Happy to chat with researchers, music-tech founders, or anyone working on score-level music ML. The dataset and weights are private for now; collaborations and serious requests welcome.