lakh-pianoroll-dataset

A collection of 174,154 multitrack pianorolls

Source code for deriving the Lakh Pianoroll Dataset (LPD)

The derived dataset using the default settings is available here.

  1. Download the Lakh MIDI Dataset (LMD) with the following script.

    ./scripts/download_lmd.sh
    

    (Or, download it manually here.)

  2. Set LMD_ROOT and LPD_ROOT in run.sh, and the variables in config.py, to values that match your setup.

  3. Derive all subsets and versions of the LPD, along with matched_ids.txt and cleansed_ids.txt, with the following script.

    ./scripts/derive_lpd.sh
    
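Before starting the derivation, it can help to confirm that the paths set in step 2 are usable. The following is a minimal sketch, not part of the repository: check_dir is a hypothetical helper, and it assumes that LMD_ROOT should already exist as a directory once the download in step 1 has finished.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not part of the repository).
# Usage: check_dir NAME PATH
# Succeeds if PATH is a non-empty string naming an existing directory.
check_dir() {
  local name="$1" path="$2"
  if [ -z "$path" ]; then
    echo "$name is not set" >&2
    return 1
  fi
  if [ ! -d "$path" ]; then
    echo "$name does not point to a directory: $path" >&2
    return 1
  fi
  echo "$name OK: $path"
}
```

For example, with LMD_ROOT set in the current shell, `check_dir LMD_ROOT "$LMD_ROOT" && ./scripts/derive_lpd.sh` would run the derivation only if the input directory is present.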

Derive the labels for the LPD

The derived labels can be found at data/labels.tar.gz.

  1. Download the labels with the following script.

    ./scripts/download_labels.sh
    
  2. Derive the labels with the following script.

    ./scripts/derive_labels.sh
    

Synthesize audio files for the LPD

  1. Install GNU Parallel to run the synthesizer in parallel mode.

  2. Synthesize audio files from multitrack pianorolls with the following script.

    ./scripts/batch_synthesize.sh ./data/lpd/lpd/lpd_cleansed/ \
      ./data/synthesized/lpd_cleansed 20
    

    (The above command will synthesize all the multitrack pianorolls in the LPD-cleansed subset with 20 parallel jobs.)
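
Since step 1 requires GNU Parallel, a short guard can fail early when it is missing. This is a sketch only: require_cmd is a hypothetical helper, not something shipped in scripts/, and it checks only that a command named parallel is on the PATH, not which implementation it is.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository).
# Usage: require_cmd COMMAND
# Succeeds if COMMAND can be found by the shell, fails with a message otherwise.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "missing required command: $1" >&2
    return 1
  fi
}
```

One possible use is `require_cmd parallel && ./scripts/batch_synthesize.sh ./data/lpd/lpd/lpd_cleansed/ ./data/synthesized/lpd_cleansed 20`, which skips the synthesis step when parallel is not installed.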