MidiTok
Slow Performance of `tokenize_midi_dataset` Function
I have noticed a significant performance gap between two different scripts I am using to tokenize my dataset. The first script, which filters MIDI files and handles saving/loading manually, runs at approximately 300 iter/s. In comparison, the second script, which uses the `tokenize_midi_dataset` function, runs about 15x slower (around 20 iter/s).
I think the `tokenize_midi_dataset` function doesn't take advantage of all cores.
Note that the filter MIDI script saves MIDI files, while `tokenize_midi_dataset` saves JSON files. Also, the filter MIDI script utilizes all available cores.
Filter Midi Script
```python
import os
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

from jsonargparse import CLI
from miditok import REMI, TokenizerConfig
from symusic import Score
from tqdm.auto import tqdm


def process_midi(file_path, input_dir, output_dir):
    """
    Process a single MIDI file: tokenize it and save it to the output directory,
    maintaining the directory structure.

    :param file_path: Path to the MIDI file.
    :param input_dir: Base input directory.
    :param output_dir: Base output directory.
    """
    try:
        # Read the MIDI file
        midi_obj = Score.from_file(file_path)

        # Initialize the tokenizer
        tokenizer = REMI(
            TokenizerConfig(
                use_tempos=True,
                use_programs=True,
                use_time_signatures=True,
                use_chords=True,
                use_rests=True,
                one_token_stream_for_programs=True,
                special_tokens=["PAD", "BOS", "EOS"],
            )
        )

        # Tokenization (example usage, can be adjusted based on how you want to use the tokenizer)
        _ = tokenizer(midi_obj)

        # Construct the new path
        relative_path = os.path.relpath(file_path, input_dir)
        new_path = os.path.join(output_dir, relative_path)

        # Create directories if they don't exist
        os.makedirs(os.path.dirname(new_path), exist_ok=True)

        # Save the file (here as a MIDI object; adjust if the format is different)
        midi_obj.dump_midi(new_path)
    except Exception as e:
        print(f"Error processing {file_path}: {e}")


def main(input_dir: str, output_dir: str, max_workers: int | None = None):
    """
    Process all MIDI files in the input directory in parallel, and save them in the output directory.
    """
    # List all MIDI files in the directory tree
    midi_files = [str(p) for p in Path(input_dir).rglob("*.mid")]

    # Process each MIDI file in parallel
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # Map each future to its MIDI file
        future_to_midi = {
            executor.submit(process_midi, midi_file, input_dir, output_dir): midi_file
            for midi_file in midi_files
        }
        # Iterate through the futures as they complete; tqdm updates the progress bar
        for future in tqdm(
            as_completed(future_to_midi),
            total=len(midi_files),
            desc="Processing MIDI files",
        ):
            file = future_to_midi[future]
            try:
                future.result()
            except Exception as exc:
                print(f"{file} generated an exception: {exc}")


if __name__ == "__main__":
    CLI(main, as_positional=False)
```
Tokenize Midi Script
```python
from dataclasses import dataclass
from pathlib import Path

from jsonargparse import CLI
from miditok import REMI, TokenizerConfig


@dataclass
class Config:
    data_dir: str
    output_dir: str


def cli_main(config: Config) -> None:
    # Initialize the tokenizer
    tokenizer = REMI(
        TokenizerConfig(
            use_tempos=True,
            use_programs=True,
            use_time_signatures=True,
            use_chords=True,
            use_rests=True,
            one_token_stream_for_programs=True,
            special_tokens=["PAD", "BOS", "EOS"],
        )
    )

    # Tokenize a whole dataset and save it as JSON files
    nobpe_output_dir = Path(f"{config.output_dir}/tokens_noBPE")
    midi_paths = list(Path(config.data_dir).glob("**/*.mid"))
    tokenizer.tokenize_midi_dataset(midi_paths, nobpe_output_dir)


if __name__ == "__main__":
    config = CLI(Config, as_positional=False)
    cli_main(config)
```
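Until `tokenize_midi_dataset` handles multiprocessing itself, one possible workaround is to shard the file list and call it from several processes. This is only a sketch, not part of miditok's API: `tokenize_shard` and `tokenize_in_parallel` are hypothetical helper names, and each worker builds its own tokenizer so nothing tokenizer-related has to be pickled.

```python
# Hypothetical workaround (not part of miditok): shard the MIDI paths and let
# each process run tokenize_midi_dataset on its own shard.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from miditok import REMI, TokenizerConfig


def tokenize_shard(midi_paths: list[Path], output_dir: Path) -> None:
    # Build the tokenizer inside the worker process (use your actual config here).
    tokenizer = REMI(TokenizerConfig(use_tempos=True, use_programs=True))
    tokenizer.tokenize_midi_dataset(midi_paths, output_dir)


def tokenize_in_parallel(data_dir: str, output_dir: str, num_workers: int = 8) -> None:
    midi_paths = list(Path(data_dir).glob("**/*.mid"))
    # Round-robin split of the file list into one shard per worker.
    shards = [midi_paths[i::num_workers] for i in range(num_workers)]
    with ProcessPoolExecutor(max_workers=num_workers) as executor:
        futures = [
            executor.submit(tokenize_shard, shard, Path(output_dir))
            for shard in shards
            if shard
        ]
        for future in futures:
            future.result()  # re-raise any exception from the workers
```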
Hi @Kinyugo, thank you for the report! :)
I'll inspect the script very soon, tomorrow at the latest.
Indeed, `tokenize_midi_dataset` doesn't handle multiprocessing, which was actually already discussed in #102. That's definitely something worth implementing.
Also, since v3.0.0 and thanks to symusic, the overall tokenization is now much faster, so I think it is safe to use the tokenizer directly on the raw MIDI files when training a model (tokenizing on the fly), without having to "pre-tokenize" them as JSON files :)
Thanks @Natooz.
You are right, miditok v3.0 is much faster; I have seen about a 10x improvement. I'll consider tokenizing on the fly, that would use PyTorch dataset multiprocessing, right? Have you run experiments comparing tokenizing on the fly vs pre-tokenization? How much of a speed-up during training does pre-tokenization offer?
I'll consider tokenizing on the fly, that would use PyTorch dataset multiprocessing, right?
If the tokenization is handled by the `Dataset` used with a `DataLoader`, yes! Now I just realised that miditok actually doesn't implement such a method in the dataset classes; I'll add it to the list and do it soon!
I wrote an example in the docs, where the data is first pre-tokenized by the dataset (no multiprocessing, and stored in RAM).
Have you run experiments comparing tokenizing on the fly vs pre-tokenization? How much of a speed-up during training does pre-tokenization offer?
I haven't, so I can't answer with numbers. However, as the collator has multiple workers running on CPU and tokenizing is pretty fast, I can't guarantee it, but we can expect little to no speed-up from pre-tokenizing.
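To make the on-the-fly idea concrete in the meantime, a minimal sketch could look like the following; `OnTheFlyMIDIDataset` is a hypothetical class (not the one shipped with miditok), and it assumes `one_token_stream_for_programs=True` so that `tokenizer(score)` returns a single `TokSequence`.

```python
# Minimal sketch of on-the-fly tokenization in a PyTorch Dataset (hypothetical
# class, not the dataset implementation shipped with miditok).
from pathlib import Path

import torch
from symusic import Score
from torch.utils.data import DataLoader, Dataset


class OnTheFlyMIDIDataset(Dataset):
    def __init__(self, midi_paths: list[Path], tokenizer, max_seq_len: int):
        self.midi_paths = midi_paths
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __len__(self) -> int:
        return len(self.midi_paths)

    def __getitem__(self, idx: int) -> torch.Tensor:
        # Loading and tokenizing happen here, i.e. inside the DataLoader workers.
        score = Score.from_file(str(self.midi_paths[idx]))
        ids = self.tokenizer(score).ids[: self.max_seq_len]  # single token stream assumed
        return torch.tensor(ids, dtype=torch.long)


# The DataLoader workers then tokenize in parallel on CPU, e.g.:
# loader = DataLoader(OnTheFlyMIDIDataset(paths, tokenizer, 1024),
#                     batch_size=16, num_workers=4, collate_fn=my_collate_fn)
```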
Interesting. If I get some time I'll test this and share the results with you.
That will help reduce complexity a lot. But I assume for bpe you will want to pre-train the tokenizer first.
Do you have an idea of how one might extract the time, say, for the start and end of a segment? I do think this would be quite possible when tokenizing on the fly, as we have access to the MIDI file itself.
BTW awesome job with miditok, it has been quite instrumental in my work.
I just realised that I mixed-up collator and data loader in my last comment. 😅
I'll chalk it up to it being late.
The `DataLoader` has multiple workers running concurrently, accessing a common `Dataset`. The collator only collates several data entries returned by the dataset into a batch ready to be fed to a model.
Working on it right now!
Interesting. If I get some time I'll test this and share the results with you.
That would be awesome! They could be integrated in the docs!
That will help reduce complexity a lot. But I assume for bpe you will want to pre-train the tokenizer first.
Yes indeed. Note that BPE training is also done by tokenizing MIDIs on the fly. The big interest is to not have to rely on pre-tokenised files, but to operate directly on the MIDI dataset itself.
Do you have an idea of how one might extract the time, say, for the start and end of a segment? I do think this would be quite possible when tokenizing on the fly, as we have access to the MIDI file itself.
I'm not sure I understand exactly what you mean by the start and end of a segment. Do you refer to the segments of tokens created by the dataset, and to time in beats?
BTW awesome job with miditok, it has been quite instrumental in my work.
Thank you! Such comments are what motivates me the most to contribute to open source! 🫶
As I thought about how to implement a good `Dataset` class tokenizing MIDIs on the fly, I realised that splitting token sequences on the fly wouldn't be possible the way it is currently done when pre-tokenizing and storing the tokens in memory. Splitting token sequences is not a good solution anyway, as the first tokens of a split might not be "valid" and might miss critical information (tempo, time signature...).
Hence the right way would be to have one MIDI --> one token sequence. Now some MIDI files might be very long, resulting in very long token sequences. I think a good way to mitigate this and maximise the amount of training/validation/testing data would be to split the MIDIs themselves. For this, a user would typically have a pre-defined maximum token sequence length and would need to know the average number of tokens per beat, in order to split the MIDIs into portions covering the right number of beats, hence your previous question I suppose. This could be solved by measuring the tokens/beat of the dataset and using the average to split the MIDIs.
The data flow would be something like: original MIDI dataset --> split MIDIs --> tokens (one sequence per MIDI)
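A rough way to measure that tokens/beat ratio could look like the sketch below. It is only an approximation: `estimate_tokens_per_beat` is a hypothetical helper, a beat is approximated as a quarter note (so non-4/4 time signatures are off), a single token stream per MIDI is assumed, and it relies on symusic's `Score.end()` and `ticks_per_quarter`.

```python
# Hypothetical helper: estimate the average number of tokens per beat over a
# sample of files, approximating one beat as one quarter note.
import random

from symusic import Score


def estimate_tokens_per_beat(midi_paths, tokenizer, sample_size: int = 100) -> float:
    ratios = []
    for path in random.sample(midi_paths, min(sample_size, len(midi_paths))):
        score = Score.from_file(str(path))
        num_beats = max(score.end() / score.ticks_per_quarter, 1)
        num_tokens = len(tokenizer(score).ids)  # single token stream assumed
        ratios.append(num_tokens / num_beats)
    return sum(ratios) / len(ratios)


# With e.g. a 1024-token budget per sequence:
# beats_per_chunk = int(1024 / estimate_tokens_per_beat(midi_paths, tokenizer))
```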
The MIDI splitting could be achieved in several ways:
1. Before running the training, with the split MIDIs stored in a directory provided by the user. Pros: the MIDI splitting is done once. Cons: this adds a step to the code pipeline, permanently increases disk space usage, and requires specifying the right data directory in the code;
2. When initializing the `Dataset`, then saving everything in a temporary directory that would be re-accessed by the dataset when loading the MIDIs to be tokenized on the fly. Pros: no intermediate directory. Cons: the MIDI splitting has to be done at each `Dataset` initialization, but fortunately this would be a fast step;
3. Same as 2, but also pre-tokenizing the MIDIs. Pros: faster loading during training (but are there any real speed-ups?). Cons: slower `Dataset` initialization;
4. Same as 2 and 3, but storing the tokens/MIDIs in memory. Pros: no disk space needed, nothing to read or write. Cons: depending on the amount of data, this can require a large amount of memory.
It's probably best to support all these options, to cover the various cases of users with different hardware configurations and usages.
Ultimately, these methods could be implemented in a common `Dataset` class with `midi_splitting`, `tmp_dir` and `pre_tokenizing` arguments.
What do you think of this?
We could also do the MIDI splitting in the `Dataset` initialization and save the MIDIs in a permanent directory (as in 1.) along with a config file, which would allow skipping the splitting on later runs by automatically detecting that it was already done before.
This way the user wouldn't have to do the extra step of doing it manually as in 1.
@Kinyugo in #148 I added the `get_num_tokens_per_beat_distribution` and `get_num_beats_for_token_seq_len` methods that should somehow address your start/end segment problem, by finding a number of beats at which to split a MIDI into sections of appropriate length.
Hello. Thanks for your apt replies. I have a few questions though.
Yes indeed. Note that BPE training is also done by tokenizing MIDIs on the fly. The big interest is to not have to rely on pre-tokenised files, but to operate directly on the MIDI dataset itself.
Do you mean that I won't have to pretrain the tokenizer before starting training?
As I thought about how to implement a good `Dataset` class tokenizing MIDIs on the fly, I realised that splitting token sequences on the fly wouldn't be possible the way it is currently done when pre-tokenizing and storing the tokens in memory. Splitting token sequences is not a good solution anyway, as the first tokens of a split might not be "valid" and might miss critical information (tempo, time signature...).
Thanks so much for the insights on splitting midi. I have been splitting the tokens directly as we do in language modelling tasks and haven't taken into account the midi structure. I'll definitely try your approach.
@Kinyugo in https://github.com/Natooz/MidiTok/pull/148 I added the `get_num_tokens_per_beat_distribution` and `get_num_beats_for_token_seq_len` methods that should somehow address your start/end segment problem, by finding a number of beats at which to split a MIDI into sections of appropriate length
I appreciate all the good work you are doing with miditok. I'll be experimenting with these over the weekend and hopefully share my insights too.
Do you mean that I won't have to pretrain the tokenizer before starting training?
No I just meant that when training the tokenizer, the training data (MIDIs) is tokenized on the fly, there is no need to pre-tokenise it. :)
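For reference, around miditok v3.0 that looks roughly like the snippet below; the `learn_bpe` method, its `files_paths` argument and `save_params` are given from memory and are an assumption here, so check them against the version you use.

```python
# Train BPE directly on the raw MIDI files, no pre-tokenised JSON needed.
# Method and argument names follow miditok around v3.0 and may differ elsewhere.
from pathlib import Path

midi_paths = list(Path("path/to/midis").glob("**/*.mid"))
tokenizer.learn_bpe(vocab_size=20_000, files_paths=midi_paths)
tokenizer.save_params(Path("tokenizer_with_bpe.json"))
```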
Thanks so much for the insights on splitting midi. I have been splitting the tokens directly as we do in language modelling tasks and haven't taken into account the midi structure. I'll definitely try your approach.
Thank you! The drawback, however, will be that if the number of tokens/beat has a great variance, we will either lose a portion of the training tokens (cut by the maximum token sequence length) and/or train with potentially short sequences (that would be padded), increasing the overall training time. I assume a quick analysis of the tokens/beat of the training data is necessary in order to split it the best way.
Thanks so much for the insights on splitting midi. I have been splitting the tokens directly as we do in language modelling tasks and haven't taken into account the midi structure. I'll definitely try your approach.
🫶
No I just meant that when training the tokenizer, the training data (MIDIs) is tokenized on the fly, there is no need to pre-tokenise it. :)
Nice. I had misunderstood that.
... we will either lose a portion of the training tokens (cut by the maximum token sequence length) and/or train with potential short sequences (that would be padded), increasing the overall training time ...
I am also not sure how we will teach the model to generate full samples. Previously, even with splitting, we could just append BOS and EOS tokens. How do we handle that when splitting at the MIDI level? Would we have to split into overlapping sequences?
I am also not sure how we will teach the model to generate full samples.
About full samples: I am currently experimenting with a TSD tokenizer, trained with BPE on the MetaMIDI Dataset. I will have to measure it more accurately, but the number of beats per sequence (1k tokens) is quite large, often ranging from 40 to 80 bars depending on the note density and number of instruments.
Previously, even with splitting, we could just append BOS and EOS tokens. How do we handle that when splitting at the MIDI level? Would we have to split into overlapping sequences?
The MIDIs themselves are split into chunks ($n$ smaller MIDIs), each one tokenized independently. The resulting tokens wouldn't be split, as the segments beginning from the second one might begin with a non-coherent token and miss global information like tempo or time signature, so the tokens will just be clipped to the maximum sequence length. And finally the BOS and EOS tokens can be added by the collator.
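As a simplified illustration of that flow (clip each sequence, add the special tokens, then pad the batch), a hand-rolled collate function could look like this; it is only a sketch, not miditok's DataCollator.

```python
# Simplified sketch of a collate function: clip to the maximum length, add
# BOS/EOS when their ids are given, and pad the batch. This is an
# illustration of the flow described above, not miditok's DataCollator.
import torch


def collate(token_ids_list, max_len, pad_id, bos_id=None, eos_id=None):
    batch = []
    for ids in token_ids_list:
        # Leave room for the special tokens within the length budget.
        budget = max_len - (bos_id is not None) - (eos_id is not None)
        ids = list(ids)[:budget]
        if bos_id is not None:
            ids = [bos_id] + ids
        if eos_id is not None:
            ids = ids + [eos_id]
        batch.append(torch.tensor(ids, dtype=torch.long))
    # Right-pad so every sequence in the batch has the same length.
    return torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=pad_id)
```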
About full samples: I am currently experimenting with a TSD tokenizer, trained with BPE on the MetaMIDI Dataset.
Nice! Looking forward to your findings.
The resulting tokens wouldn't be split as the segments beginning from the second one might begin with a non-coherent token and miss global information like tempo or time signature
I now understand why splitting at the MIDI level makes sense. In that case it might make sense to split dynamically during training; that way we can also easily figure out where and when to add the BOS and EOS tokens. Also, we can split at different points, reducing the need for padding/trimming.
I now understand why splitting at the MIDI level makes sense. In that case it might make sense to split dynamically during training; that way we can also easily figure out where and when to add the BOS and EOS tokens. Also, we can split at different points, reducing the need for padding/trimming.
Absolutely. We will have to find a way to design such a dynamic trimming method. I currently have a lot on my plate, so I'll probably work on the open PR later when I have more time.
No problem. If I do get time, maybe I can contribute that. I'm currently held up as well.
This issue is stale because it has been open for 30 days with no activity.
Hi @Kinyugo 👋 I finally got some time to get back at the task :)
I ended up making a "dynamic" splitting solution based on the note density of each bar, in order to reduce padding. The MIDI is split at bars, in order to get chunks that start at relevant times, and the number of bars in each chunk is determined by the number of notes they contain.
You can take a look at https://github.com/Natooz/MidiTok/pull/148 and here is the docs preview.
I still have a few things to adjust, but that part is almost done! 🙌 Would you be interested in giving feedback or suggestions before merging it? No obligation/pressure at all!
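To make the idea concrete, a much-simplified version of that grouping could look like the sketch below. The real implementation in #148 works on the MIDI itself and handles ticks and time signatures; here `group_bars_by_notes` is a hypothetical helper that takes precomputed per-bar note counts.

```python
# Simplified illustration (not the PR's code): group consecutive bars into
# chunks so that each chunk stays under a target number of notes.
def group_bars_by_notes(notes_per_bar: list[int], max_notes_per_chunk: int) -> list[tuple[int, int]]:
    chunks, start, count = [], 0, 0
    for bar_idx, num_notes in enumerate(notes_per_bar):
        if count and count + num_notes > max_notes_per_chunk:
            chunks.append((start, bar_idx))  # chunk covers bars [start, bar_idx)
            start, count = bar_idx, 0
        count += num_notes
    chunks.append((start, len(notes_per_bar)))
    return chunks


# e.g. group_bars_by_notes([30, 45, 10, 60, 5], 80) -> [(0, 2), (2, 5)]
```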
Hello @Natooz
Thanks for looking into the issue.
I am currently running tests for this. However, I do notice an `IndexError: list index out of range` error with some files on this line. Do you know why this might happen?
Here is my tokenizer config:
```python
tokenizer = REMI(
    TokenizerConfig(
        use_tempos=True,
        use_programs=True,
        use_time_signatures=True,
        use_chords=True,
        use_rests=True,
        one_token_stream_for_programs=True,
        special_tokens=["PAD", "BOS", "EOS"],
    )
)
```
Thanks for taking the time to test it, and for reporting this bug!
The error comes from the `bi` index, which exceeds the number of bars; I'm working on a fix.
Just pushed it, I think it should do it 🙌
Thanks for the quick response.
In a typical sequence modelling scenario, wouldn't it make sense to have overlapping sequences? Or what's your recommended approach?
By overlapping sequences, are you referring to chunks of music that would cover several portions of the same bars?
E.g. `[b1, b2, b3]`, `[b2, b3, b4, b5]`, `[b4, b5, b6]`
Not necessarily an N-1 overlap, but just a way to control the overlap between the sequences. I reckon this would help the model learn continuation better. Also, do you pass the type of chunk for each sequence, or how would one go about adding BOS and EOS tokens?
That's a good idea, I'll add an `n_overlap` option to allow MIDI chunks to overlap by a few bars; this could enforce the "causality" throughout the entire MIDI.
For BOS and EOS tokens, that's actually the last thing remaining to do. Right now BOS and EOS tokens are added for each chunk, but I intend to add markers in each chunk to indicate their "number" in the original MIDI, and add a BOS only to the first chunk and an EOS only to the last. EOS is intended to indicate to the model the end of a data sample. Adding EOS to each chunk would not really make sense and could end up training it to end sequences that shouldn't end. Adding BOS to each chunk would be less detrimental I think, but it would "break" the causality chain with overlapping chunks, so it's probably better to only add it to the first chunk.
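For example, generating the overlapping chunk boundaries could be as simple as the sketch below; `overlapping_bar_ranges` is a hypothetical helper, independent from the PR's actual `n_overlap` implementation.

```python
# Hypothetical illustration: bar ranges of `bars_per_chunk` bars where each
# chunk overlaps the previous one by `n_overlap` bars.
def overlapping_bar_ranges(num_bars: int, bars_per_chunk: int, n_overlap: int) -> list[tuple[int, int]]:
    step = bars_per_chunk - n_overlap
    assert step > 0, "the overlap must be smaller than the chunk size"
    return [
        (start, min(start + bars_per_chunk, num_bars))
        for start in range(0, num_bars, step)
        # Skip a trailing chunk that would only repeat already-covered bars.
        if start == 0 or start + n_overlap < num_bars
    ]


# e.g. overlapping_bar_ranges(10, 4, 1) -> [(0, 4), (3, 7), (6, 10)]
```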
That will be awesome.
Do you have an idea about how you will go about adding the markers, such that when someone is using a custom dataset/dataloader they can replicate this functionality easily?
I think a good way is to add a MIDI marker at the first tick of each chunk. Here is how I implemented it, and here is how they are detected in the `DatasetMIDI`. I think it shouldn't be hard to reuse in a custom `Dataset`.
Could you link the exact lines for how they are detected?
FYI I have verified that the fix for the IndexError works. Thanks for fixing that.
It's in `DatasetMIDI._tokenize_midi`, but linking the exact line might change with future commits, so here is the code:
```python
# If this file is a chunk (split_midis_for_training), determine its id.
# By default, we add BOS and EOS tokens following the values of
# self.bos_token_id and self.eos_token_id (that may be None), except when the
# file is identified as a chunk.
add_bos_token = add_eos_token = True
for marker in midi.markers:
    if marker.time != 0:
        break
    if marker.text.startswith("miditok: chunk"):
        chunk_id, chunk_id_last = map(
            int, marker.text.split(" ")[-1].split("/")
        )
        add_bos_token = chunk_id == 0
        add_eos_token = chunk_id == chunk_id_last

# Adds BOS/EOS if necessary...
```
Chunk overlap is pushed. I'll have a few tests to do, then I'll update the docs, and we'll be good to merge! 🙌
That's a nice way to implement it. You should consider adding it somewhere in the docs/tutorials that's easier to access.
Thanks for the updates. I can't test them atm but will let you know once I do.