project tidal-horizon

kind research

state live · sweeping

repo private

a research project / nick friesen / 2026

ClassicalML

The first transformer composer trained on 27.5M tokens of voice-preserved classical piano scores. Most piano ML sounds chord-y because the data is chord-y. This dataset isn't.

26,934 pieces

27.5M tokens

59M params · d768 L8

1.1123 best val_loss

§ 01 — the dataset

The cleanest score-level piano data at scale.

Existing piano ML datasets are either too small (a few hundred MIDI files of a single composer) or compressed at the wrong layer. The largest open piano corpora pass scores through music21's chordify(), which collapses all simultaneous notes, across voices and staves, into a single stream of block chords.

That step throws away everything a pianist actually does: counterpoint, voice leading, hand independence, phrasing across staves. A model trained on chordified data can only learn harmonic averages. It will sound like piano for a few bars and then dissolve into mush.
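
To make the loss concrete, here is a minimal pure-Python sketch of a chordify-style collapse. It does not call music21, and it simplifies by grouping only notes that share an onset; the real chordify() is more involved. The `Note` tuple and `chordify_like` function are hypothetical names for illustration, not part of the project.

```python
from collections import defaultdict
from typing import NamedTuple


class Note(NamedTuple):
    pitch: str    # e.g. "D5"
    onset: float  # position in quarter notes from the start
    voice: int    # voice index within the staff
    staff: str    # "R" (right hand) or "L" (left hand)


def chordify_like(notes):
    """Group notes by onset, discarding voice and staff (simplified assumption)."""
    by_onset = defaultdict(list)
    for n in notes:
        by_onset[n.onset].append(n.pitch)
    # Pitches are sorted alphabetically only to make the output deterministic.
    return [(onset, sorted(pitches)) for onset, pitches in sorted(by_onset.items())]


notes = [
    Note("D5", 0.0, 0, "R"),  # melody
    Note("B4", 0.0, 1, "R"),  # inner voice
    Note("G3", 0.0, 0, "L"),  # bass
    Note("E5", 1.0, 0, "R"),  # melody continues
]
chords = chordify_like(notes)
print(chords)
```

After the collapse, nothing in `chords` says which pitch was melody, which was bass, or which hand played it.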

ClassicalML preserves voice and staff structure as separate token streams. Each note carries its voice index, its staff (left vs right hand), and its onset / offset / duration. The tokenizer was rewritten from scratch in March 2026 specifically to fix the chordify problem, after the first version of the model showed the symptom.

§ 02 — the fix

Voice-preserving tokenization.

The single change that made this dataset useful. A two-line replacement of the tokenizer that took roughly four months of failed sweeps to find.

before · chordified

// every simultaneous note collapsed
// into a single block chord
CHORD [G4 B4 D5] q
CHORD [A4 C5 E5] q
CHORD [F4 A4 D5] q
// no melody, no bass line,
// no hand information.

Loss is fine. Output is harmonic mush.

after · voice-preserving

VOICE 0 STAFF R D5 q
VOICE 1 STAFF R B4 q
VOICE 0 STAFF L G3 h
VOICE 0 STAFF R E5 q
// melody · inner voice · bass
// each tracked independently.

Same loss number, vastly different musical structure.
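
A rough sketch of how tokens in the "after" format could be emitted. This is not the project's tokenizer: `Note`, `DURATION_SYMBOLS`, `STAFF_ORDER`, and `tokenize` are hypothetical, the ordering rule (onset, then right staff before left, then voice) is inferred from the example above, and onset/offset encoding is left out.

```python
from dataclasses import dataclass

# Only the two duration symbols that appear in the example (assumed mapping
# from quarter-note length to symbol).
DURATION_SYMBOLS = {1.0: "q", 2.0: "h"}
# Assumed ordering: right-hand staff before left-hand staff at the same onset.
STAFF_ORDER = {"R": 0, "L": 1}


@dataclass(frozen=True)
class Note:
    pitch: str       # e.g. "D5"
    onset: float     # position in quarter notes from the start
    duration: float  # length in quarter notes
    voice: int       # voice index within the staff
    staff: str       # "R" or "L"


def tokenize(notes):
    """Emit one token per note, keeping voice and staff on every token."""
    ordered = sorted(notes, key=lambda n: (n.onset, STAFF_ORDER[n.staff], n.voice))
    return [
        f"VOICE {n.voice} STAFF {n.staff} {n.pitch} {DURATION_SYMBOLS[n.duration]}"
        for n in ordered
    ]


notes = [
    Note("G3", 0.0, 2.0, 0, "L"),
    Note("E5", 1.0, 1.0, 0, "R"),
    Note("B4", 0.0, 1.0, 1, "R"),
    Note("D5", 0.0, 1.0, 0, "R"),
]
tokens = tokenize(notes)
print("\n".join(tokens))
```

The point of the design is that voice and staff travel with every note, so the model can track a melody or bass line as its own thread instead of reading it out of a stack of pitches.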

§ 03 — the model

A 59M-param transformer, swept Karpathy-style.

Greedy hill-climb autoresearch over the hyperparameter space. Each trial is a real training run, not a surrogate. The best settings lowered val_loss from a 1.17 baseline to 1.1123: small in number, audibly different in output.
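
For readers unfamiliar with the approach, a generic greedy hill-climb over a discrete grid might look like the sketch below. It is not the project's sweep code: the `SPACE` values are illustrative, and `toy_val_loss` is a made-up stand-in for a real training run so the example executes instantly.

```python
import random

# Hypothetical search grid; the values are illustrative, not the project's.
SPACE = {
    "lr": [1e-4, 3e-4, 6e-4, 1e-3],
    "dropout": [0.1, 0.2, 0.25, 0.3],
    "d_model": [512, 640, 768],
    "layers": [6, 8, 10],
}


def neighbors(config):
    """Yield configs that differ from `config` by one grid step on one axis."""
    for key, values in SPACE.items():
        i = values.index(config[key])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                yield {**config, key: values[j]}


def hill_climb(evaluate, start, max_trials=50, seed=0):
    """Greedy search: move to the first neighbor that lowers the loss."""
    rng = random.Random(seed)
    best, best_loss = start, evaluate(start)
    trials = 1
    improved = True
    while improved and trials < max_trials:
        improved = False
        candidates = list(neighbors(best))
        rng.shuffle(candidates)
        for cand in candidates:
            if trials >= max_trials:
                break
            loss = evaluate(cand)  # in a real sweep: one full training run
            trials += 1
            if loss < best_loss:
                best, best_loss = cand, loss
                improved = True
                break  # restart the search from the improved config
    return best, best_loss, trials


# Toy objective: loss grows with grid distance from an arbitrary target config.
TARGET = {"lr": 3e-4, "dropout": 0.25, "d_model": 768, "layers": 8}


def toy_val_loss(config):
    dist = sum(
        abs(SPACE[k].index(config[k]) - SPACE[k].index(TARGET[k])) for k in SPACE
    )
    return 1.10 + 0.01 * dist


start = {"lr": 1e-3, "dropout": 0.1, "d_model": 512, "layers": 6}
best, best_loss, trials = hill_climb(toy_val_loss, start)
print(best, round(best_loss, 4), trials)
```

On a real loss surface a greedy climb can stall in a local minimum, which is one reason to run more than one pass.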

architecture

d_model = 768 · layers = 8 · 59M params

training

PyTorch · AdamW · dropout = 0.25 · cosine schedule · mixed precision

data

26,934 pieces · 27.5M tokens · voice + staff preserved · 90 / 5 / 5 split

Karpathy-style greedy hill-climb · ~50 trials per pass · lr · dropout · width · depth · seq_len

best val_loss

1.1123 · from 1.17 baseline · ~4.9% improvement, audible quality lift

status

Phase 2 sweep ongoing · targeting sub-1.10
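
As a sanity check on the 59M figure, the usual rule of thumb for a GPT-style stack can be worked out in a few lines. This assumes a standard block (four d×d attention projections plus an MLP with 4x hidden width), which the spec above does not state; `approx_block_params` is a hypothetical helper.

```python
def approx_block_params(d_model: int, layers: int) -> int:
    """Rule-of-thumb non-embedding parameter count for a GPT-style stack.

    Per layer: 4*d^2 for attention (Q, K, V, output projections)
    plus 8*d^2 for an MLP with 4x hidden width. Biases and layer
    norms are ignored.
    """
    return 12 * layers * d_model ** 2


non_embedding = approx_block_params(d_model=768, layers=8)
print(f"{non_embedding / 1e6:.1f}M")  # 56.6M
```

Under those assumptions the stack accounts for roughly 56.6M parameters, which is consistent with 59M once embeddings and smaller terms are added.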

§ 04 · listen

A generation.

Cherry-picked from a recent sweep, generated unconditionally (no prompt, no key, no style hint), then rendered with a baseline piano sound. I transposed it down a step or two to make it actually playable with my hands. The score below is what came out, after the key change.

A-01 · joplin-coded ragtime

Generation #10, key-shifted for playability.

unconditional · transposed by hand

Figure: score for the Joplin-coded generation, transposed for playability.

§ 05 — notes

Open questions, prior art, what's next.

open questions

Does voice-preserving tokenization help on smaller models, or is it only legible at d768+? Does adding a global-key conditioning token narrow the stylistic variance without flattening voice independence? Where does this dataset plateau on val_loss before scale stops mattering?

prior art

MAESTRO and GiantMIDI-Piano are the closest predecessors, but they focus on performance MIDI rather than score MIDI; other large corpora compress through chordify. Music Transformer (Huang et al.) and Performance RNN frame the tokenization problem; this dataset proposes a different answer.

what's next

Phase 2 hyperparameter sweep on H100. Conditional generation experiments (key, era, composer-as-LoRA). A clean Hugging Face release of the dataset and tokenizer if the licensing checks all clear. Possibly a paper.

§ 06 — reach out

Curious? Get in touch.

Happy to chat with researchers, music-tech founders, or anyone working on score-level music ML. The dataset and weights are private for now; collaborations and serious requests welcome.

nick@thematchartist.com