The first transformer composer trained on 27.5M tokens of voice-preserved classical piano scores. Most piano ML sounds chord-y because the data is chord-y. This dataset isn't.
Existing piano ML datasets are either too small (a few hundred MIDI files of a single composer) or compressed at the wrong layer. The largest open piano corpora pass scores through music21's chordify(), which collapses every set of simultaneous notes into a single block chord.
That step throws away everything a pianist actually does: counterpoint, voice leading, hand independence, phrasing across staves. A model trained on chordified data can only learn harmonic averages. It will sound like piano for a few bars and then dissolve into mush.
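To make the loss concrete, here is a minimal sketch of a chordify-style collapse in plain Python. It is a simplified stand-in for the behaviour described above, not music21's implementation. The names Note, midi_number and collapse_to_chords are hypothetical, and the onsets in the example are assumed for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Note:
    pitch: str       # e.g. "D5"
    onset: float     # start time in quarter notes (assumed unit)
    duration: float  # length in quarter notes
    voice: int       # voice index within the staff
    staff: str       # "R" (right hand) or "L" (left hand)

_STEP = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def midi_number(pitch):
    """Convert a name like 'G3' or 'F#4' to a MIDI number, used only for sorting."""
    rest = pitch[1:]
    alter = rest.count("#") - rest.count("b")
    octave = int(rest.lstrip("#b"))
    return 12 * (octave + 1) + _STEP[pitch[0]] + alter

def collapse_to_chords(notes):
    """Cut time at every onset and offset, then list the pitches sounding in each slice.

    Voice and staff are dropped on purpose: that is the information
    a chordified corpus no longer carries.
    """
    boundaries = sorted({n.onset for n in notes} | {n.onset + n.duration for n in notes})
    slices = []
    for start, end in zip(boundaries, boundaries[1:]):
        sounding = [n.pitch for n in notes
                    if n.onset <= start and n.onset + n.duration >= end]
        if sounding:
            slices.append((start, end - start, sorted(sounding, key=midi_number)))
    return slices

# Hypothetical two-beat fragment: melody and inner voice over a held bass note.
fragment = [
    Note("D5", 0.0, 1.0, voice=0, staff="R"),
    Note("B4", 0.0, 1.0, voice=1, staff="R"),
    Note("G3", 0.0, 2.0, voice=0, staff="L"),
    Note("E5", 1.0, 1.0, voice=0, staff="R"),
]
chords = collapse_to_chords(fragment)
```

In this sketch the held G3 reappears in both slices as if it were struck twice, and nothing in the output says which pitch was melody, inner voice, or bass.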
ClassicalML preserves voice and staff structure as separate token streams. Each note carries its voice index, its staff (left vs right hand), and its onset / offset / duration. The tokenizer was rewritten from scratch in March 2026 specifically to fix the chordify problem, after the first version of the model showed the symptom.
The single change that made this dataset useful: a two-line replacement of the tokenizer that took roughly four months of failed sweeps to find.
// every simultaneous note collapsed
// into a single block chord
CHORD [G4 B4 D5] q
CHORD [A4 C5 E5] q
CHORD [F4 A4 D5] q
// no melody, no bass line,
// no hand information.
VOICE 0 STAFF R D5 q
VOICE 1 STAFF R B4 q
VOICE 0 STAFF L G3 h
VOICE 0 STAFF R E5 q
// melody · inner voice · bass
// each tracked independently.
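A voice-preserving stream in the layout shown above could be emitted by something like the following. This is a sketch under assumptions, not the ClassicalML tokenizer: the Note fields, the tokenize and voice_line helpers, the sort order, and the "w" and "e" duration codes are all hypothetical. Only "q" and "h" appear in the example. How onsets and offsets are encoded is not shown on this page, so the sketch leaves them out of the tokens.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Note:
    pitch: str       # e.g. "D5"
    onset: float     # start time in quarter notes (assumed unit)
    duration: float  # length in quarter notes
    voice: int       # voice index within the staff
    staff: str       # "R" (right hand) or "L" (left hand)

# Duration in quarter notes -> code. "q" and "h" follow the example; the rest are assumed.
DURATION_CODES = {4.0: "w", 2.0: "h", 1.0: "q", 0.5: "e"}

def tokenize(notes):
    """Emit one token string per note, keeping voice and staff attached to each pitch."""
    # Assumed ordering: by onset, right hand before left hand, then by voice index.
    ordered = sorted(notes, key=lambda n: (n.onset, n.staff != "R", n.voice))
    return [
        f"VOICE {n.voice} STAFF {n.staff} {n.pitch} {DURATION_CODES[n.duration]}"
        for n in ordered
    ]

def voice_line(notes, voice, staff):
    """Recover a single melodic line, which a block-chord stream cannot provide."""
    line = [n for n in notes if n.voice == voice and n.staff == staff]
    return [n.pitch for n in sorted(line, key=lambda n: n.onset)]

# Hypothetical onsets chosen so the output matches the four example events.
fragment = [
    Note("G3", 0.0, 2.0, voice=0, staff="L"),
    Note("E5", 1.0, 1.0, voice=0, staff="R"),
    Note("B4", 0.0, 1.0, voice=1, staff="R"),
    Note("D5", 0.0, 1.0, voice=0, staff="R"),
]
tokens = tokenize(fragment)
melody = voice_line(fragment, voice=0, staff="R")
```

Because every event keeps its voice and staff, the melody (D5 then E5) and the held bass note can each be read back out of the stream as their own line.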
Greedy hill-climb autoresearch over the hyperparameter space. Each trial is a real training run, not a surrogate. The best settings lowered val_loss from a baseline of 1.17 to 1.1123: small in number, audibly different in output.
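For readers unfamiliar with the approach, a greedy hill climb over a discrete hyperparameter grid can be sketched as below. This is an illustration, not the actual sweep code. GRID, TOY_OPTIMUM, toy_val_loss, neighbors and hill_climb are hypothetical names, the grid values are made up, and the toy objective stands in for a real training run so the sketch executes without a GPU.

```python
# Hypothetical search grid; only the hyperparameter names come from the sweep description.
GRID = {
    "lr": [1e-4, 3e-4, 1e-3],
    "dropout": [0.1, 0.15, 0.2, 0.25, 0.3],
    "d_model": [256, 512, 768],
    "layers": [4, 6, 8],
    "seq_len": [512, 1024, 2048],
}

# Arbitrary optimum for the toy objective; it says nothing about real results.
TOY_OPTIMUM = {"lr": 3e-4, "dropout": 0.25, "d_model": 768, "layers": 8, "seq_len": 1024}

def toy_val_loss(config):
    """Stand-in for a real training run: grid distance from TOY_OPTIMUM."""
    return float(sum(
        abs(GRID[name].index(config[name]) - GRID[name].index(TOY_OPTIMUM[name]))
        for name in GRID
    ))

def neighbors(config):
    """Yield configs that differ from `config` by one grid step in one hyperparameter."""
    for name, values in GRID.items():
        i = values.index(config[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                yield {**config, name: values[j]}

def hill_climb(start, run_trial, max_trials=50):
    """Greedy first-improvement search: move as soon as any neighbor lowers the loss."""
    best, best_loss = start, run_trial(start)
    trials = 1
    improved = True
    while improved and trials < max_trials:
        improved = False
        for candidate in neighbors(best):
            if trials >= max_trials:
                break  # trial budget for this pass is spent
            loss = run_trial(candidate)
            trials += 1
            if loss < best_loss:
                best, best_loss, improved = candidate, loss, True
                break  # restart the neighbor scan from the new best config
    return best, best_loss, trials

baseline = {name: values[0] for name, values in GRID.items()}
best_config, best_loss, trials_used = hill_climb(baseline, toy_val_loss, max_trials=50)
```

In a real pass, run_trial would launch a training run and return its val_loss. The search only ever moves to a strictly better neighbor, so it can stall in a local minimum, which is one reason to repeat passes from different starting points.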
d_model = 768 · layers = 8 · 59M params
dropout = 0.25 · cosine schedule · mixed precision
~50 trials per pass · lr · dropout · width · depth · seq_len
Cherry-pick from a recent sweep, generated unconditionally (no prompt, no key, no style hint), then rendered with a baseline piano sound. I transposed the key down a step or two to make it actually playable on my hands. The score below is what came out, post key-change.
Does voice-preserving tokenization help on smaller models, or is it only legible at d768+? Does adding a global-key conditioning token narrow the stylistic variance without flattening voice independence? Where does this dataset plateau on val_loss before scale stops mattering?
MAESTRO and GiantMIDI are the closest predecessors but compress through chordify or focus on performance MIDI rather than score MIDI. Music Transformer (Huang et al.) and Performance RNN frame the tokenization problem; this dataset proposes a different answer.
Phase 2 hyperparameter sweep on H100. Conditional generation experiments (key, era, composer-as-LoRA). A clean Hugging Face release of the dataset and tokenizer if the licensing checks all clear. Possibly a paper.
Happy to chat with researchers, music-tech founders, or anyone working on score-level music ML. The dataset and weights are private for now; collaborations and serious requests welcome.