Nathan Fradet
> I'll consider tokenizing on the fly, that would use PyTorch dataset multiprocessing, right?

If the tokenization is handled by the `Dataset` with a `DataLoader`, yes! Now I just realised...
I just realised that I mixed up the collator and the data loader in my last comment. 😅 I'll chalk that up to it being late. The `DataLoader` has multiple workers...
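To illustrate what I mean, here is a minimal sketch (not a final implementation): a `Dataset` that tokenizes MIDI files lazily in `__getitem__`, so the `DataLoader`'s worker processes run the tokenization in parallel. It assumes the tokenizer can be called directly on a MIDI path and returns a token sequence with an `ids` attribute.

```python
# Minimal sketch: tokenization happens in __getitem__, so the DataLoader's
# workers (not the collator) parallelize it across processes.
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset


class MidiDatasetOnTheFly(Dataset):
    def __init__(self, midi_paths: list[Path], tokenizer, max_seq_len: int = 1024):
        self.midi_paths = midi_paths
        self.tokenizer = tokenizer  # assumed: callable on a MIDI path, returns a token sequence
        self.max_seq_len = max_seq_len

    def __len__(self) -> int:
        return len(self.midi_paths)

    def __getitem__(self, idx: int) -> torch.LongTensor:
        # Tokenization happens here, inside the worker process.
        tok_seq = self.tokenizer(self.midi_paths[idx])
        if isinstance(tok_seq, list):  # some tokenizers return one sequence per track
            tok_seq = tok_seq[0]
        return torch.LongTensor(tok_seq.ids[: self.max_seq_len])


# Usage sketch: the workers run __getitem__ in parallel, the collator only batches.
# loader = DataLoader(dataset, batch_size=16, num_workers=4, collate_fn=my_collator)
```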
While thinking about how to implement a good `Dataset` class that tokenizes MIDIs on the fly, I realised that splitting token sequences on the fly wouldn't be possible, as what's...
We could also do the MIDI splitting in the `Dataset` initialization and save the MIDIs in a permanent directory (as in 1.) with a config file, which would allow us to not...
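Something like this rough sketch (none of this is an existing MidiTok API, and the actual split function would be passed in): split the MIDIs once at `Dataset` init, save the chunks to a permanent directory next to a small config file, and skip the work on later runs when the config matches.

```python
# Rough sketch of the cached-split idea: split once, cache to disk with a config
# file, and reuse the cache when the same split configuration is requested again.
import json
from pathlib import Path


def prepare_split_dir(midi_paths, split_dir: Path, split_fn, split_config: dict) -> list:
    """Return paths of the split chunks, reusing the cache when the config matches."""
    config_path = split_dir / "split_config.json"

    # Cache hit: the split was already done with the same parameters.
    if config_path.is_file() and json.loads(config_path.read_text()) == split_config:
        return sorted(split_dir.glob("*.mid"))

    split_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    for midi_path in midi_paths:
        # split_fn is supplied by the caller and yields (chunk_name, chunk_bytes) pairs.
        for chunk_name, chunk_bytes in split_fn(midi_path, **split_config):
            chunk_path = split_dir / f"{Path(midi_path).stem}_{chunk_name}.mid"
            chunk_path.write_bytes(chunk_bytes)
            chunk_paths.append(chunk_path)

    config_path.write_text(json.dumps(split_config))
    return chunk_paths
```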
@Kinyugo in #148 I added the `get_num_tokens_per_beat_distribution` and `get_num_beats_for_token_seq_len` methods, which should address your start/end segment problem by finding a number of beats to split a MIDI into...
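To give an idea of the logic (this is not the code from #148, just a standalone sketch of the idea): count how many tokens each beat produces, then find how many beats fit within a target token sequence length.

```python
# Sketch: per-beat token counts, then the number of beats fitting a token budget.
import numpy as np


def num_tokens_per_beat(token_beat_indices: list[int]) -> np.ndarray:
    """Count the tokens falling in each beat, given the beat index of every token."""
    return np.bincount(token_beat_indices)


def num_beats_for_seq_len(tokens_per_beat: np.ndarray, target_seq_len: int) -> int:
    """Number of beats whose cumulative token count stays within target_seq_len."""
    cumulative = np.cumsum(tokens_per_beat)
    return int(np.searchsorted(cumulative, target_seq_len, side="right"))


# Example: beats producing 30, 42, 25 and 38 tokens; a 100-token budget
# covers the first 3 beats (97 tokens).
print(num_beats_for_seq_len(np.array([30, 42, 25, 38]), 100))  # -> 3
```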
> Do you mean that I won't have to pretrain the tokenizer before starting training?

No, I just meant that when training the tokenizer, the training data (MIDIs) is tokenized...
> I am also not sure how we will teach the model to generate full samples.

About full samples: I am currently experimenting with a `TSD` tokenizer, trained with BPE...
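For reference, the kind of setup I mean looks roughly like this. Method and parameter names are from memory and may differ between MidiTok versions (BPE training is exposed as `learn_bpe` in older versions and `train` in newer ones), so check the docs of the version you use.

```python
# Hedged sketch: a TSD tokenizer whose vocabulary is then learned with BPE.
# The MIDIs are tokenized internally as part of the BPE training step.
from pathlib import Path

from miditok import TSD, TokenizerConfig

midi_paths = list(Path("dataset/midis").glob("**/*.mid"))  # placeholder path

tokenizer = TSD(TokenizerConfig(use_tempos=True, use_time_signatures=True))

# Newer MidiTok versions: tokenizer.train(...); older ones: tokenizer.learn_bpe(...).
tokenizer.train(vocab_size=10_000, files_paths=midi_paths)
```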
> I now understand why splitting at MIDI level makes sense. In that case it might make sense to split dynamically during training, that way we can also easily figure...
Hi @Kinyugo 👋 I finally got some time to get back to the task :) I ended up making a "dynamic" splitting solution based on the note densities of each...
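Roughly, the idea is the following (a simplified sketch, not the merged code): estimate each chunk's token count from the bars' note densities and cut whenever the running estimate would exceed the target sequence length.

```python
# Sketch of density-based splitting: group consecutive bars into chunks so that
# each chunk's estimated token count stays below a target sequence length.


def split_bars_by_density(
    notes_per_bar: list[int],
    tokens_per_note: float,
    max_seq_len: int,
) -> list[list[int]]:
    """Return lists of bar indices, each inner list forming one chunk."""
    chunks, current, budget = [], [], 0.0
    for bar_idx, num_notes in enumerate(notes_per_bar):
        cost = num_notes * tokens_per_note
        if current and budget + cost > max_seq_len:
            chunks.append(current)
            current, budget = [], 0.0
        current.append(bar_idx)
        budget += cost
    if current:
        chunks.append(current)
    return chunks


# Dense bars shorten the chunks, sparse bars lengthen them. Bar 2 exceeds the
# budget on its own, so it forms its own chunk (a bar can't be split further here).
print(split_bars_by_density([8, 8, 40, 4, 4, 4], tokens_per_note=3.0, max_seq_len=100))
# -> [[0, 1], [2], [3, 4, 5]]
```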
Thanks for taking the time to test it, and for reporting this bug! The error comes from the `bi` index exceeding the number of bars; I'm working on a...