MidiTok
Error in training tokenizer "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte"
The error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte
The stack trace:
[00:05:34] Pre-processing sequences ████████████████████ 93583 / 93583
[00:00:04] Tokenize words ████████████████████ 4365 / 4365
[00:00:01] Count pairs ████████████████████ 4365 / 4365
[00:06:09] Compute merges ████████████████████ 15386 / 15386
Traceback (most recent call last):
File "/home/surya/Documents/Projects/MusicGPT/data.py", line 45, in <module>
tokenizer.learn_bpe(vocab_size=16000, files_paths=midi_files)
File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/midi_tokenizer.py", line 2380, in learn_bpe
self._bpe_model.train_from_iterator(
File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/bpe_iterator.py", line 87, in __next__
return self[self.__iter_count - 1]
~~~~^^^^^^^^^^^^^^^^^^^^^^^
File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/bpe_iterator.py", line 76, in __getitem__
return self.load_file(self.files_paths[idx])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/bpe_iterator.py", line 45, in load_file
token_ids = self.tokenizer(midi)
^^^^^^^^^^^^^^^^^^^^
File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/midi_tokenizer.py", line 3064, in __call__
return self.midi_to_tokens(obj, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/midi_tokenizer.py", line 1330, in midi_to_tokens
midi = self.preprocess_midi(midi)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/midi_tokenizer.py", line 342, in preprocess_midi
merge_same_program_tracks(midi.tracks)
File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/utils/utils.py", line 490, in merge_same_program_tracks
new_track = merge_tracks([tracks[idx] for idx in idx_group], effects=effects)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/utils/utils.py", line 400, in merge_tracks
tracks_[0].name += "".join([" / " + t.name for t in tracks_[1:]])
^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte
Code:
from pathlib import Path
from miditok import MIDILike, TokenizerConfig
import os
import fnmatch

# Our parameters
TOKENIZER_PARAMS = {
    "pitch_range": (21, 109),
    "beat_res": {(0, 4): 8, (4, 12): 4},
    "num_velocities": 32,
    "special_tokens": ["PAD", "BOS", "EOS", "MASK"],
    "use_chords": True,
    "use_rests": True,
    "use_tempos": True,
    "use_time_signatures": True,
    "use_programs": True,
    "num_tempos": 32,  # number of tempo bins
    "one_token_stream_for_programs": True,
    "tempo_range": (40, 250),  # (min, max)
}
config = TokenizerConfig(**TOKENIZER_PARAMS)

# Creates the tokenizer
tokenizer = MIDILike(config)

root1 = '/home/surya/Documents/datasets/midi1'
root2 = '/home/surya/Documents/Projects/midi2'

def find_midi_files(directory):
    midi_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            # lowercasing the name covers .mid/.MID/.midi/.MIDI etc.
            if fnmatch.fnmatch(file.lower(), '*.mid') or fnmatch.fnmatch(file.lower(), '*.midi'):
                midi_files.append(Path(os.path.join(root, file)))
    return midi_files

midi_files = find_midi_files(root1) + find_midi_files(root2)
print(len(midi_files))

# Builds the vocabulary with BPE
tokenizer.learn_bpe(vocab_size=16000, files_paths=midi_files)
tokenizer.save_pretrained('./tokenizer')
I don't understand what this error means. Can someone help me fix it?
Hi,
The error seems to occur while preprocessing a certain file (more exactly, while merging its tracks). Would you be able to either identify this specific file (by printing its path in the source code) and share it, or share your datasets so that I can reproduce the error?
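One way to identify the offending file is to run the loading/tokenizing step over the files one by one and catch the exception before calling learn_bpe. This is a hypothetical sketch, not part of MidiTok: the `process` callable would wrap whatever per-file call fails (e.g. `lambda p: tokenizer(Score(p))`):

```python
from pathlib import Path

def find_failing_files(process, paths):
    """Run `process` on each path; print and collect the paths that raise
    a UnicodeDecodeError so the broken files can be shared or removed."""
    failing = []
    for path in paths:
        try:
            process(path)
        except UnicodeDecodeError as err:
            print(f"Failed on {path}: {err}")
            failing.append(Path(path))
    return failing
```

Keeping the per-file call injected as a parameter makes the helper easy to reuse with any tokenizer or loader.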
Hi! I am using a combination of this dataset and a custom dataset. The error occurs only when using the bread-midi dataset but not with my custom dataset.
Thank you! It's quite large, so I hope to catch it quickly and see how I'll handle this.
In the meantime, if you want to fix the issue right now, you can clone the repo and edit the problematic line in the merge_tracks method in the miditok/utils/utils.py file to something like:
try:
    # Change name
    tracks_[0].name += "".join([" / " + t.name for t in tracks_[1:]])
except UnicodeDecodeError:
    pass
This will just skip the track name concatenation when the MIDI's track names are encoded with something other than UTF-8.
@Yikai-Liao @lzqlzzq 👋
Sorry to ping you with another bug 😅 I'm not sure what the exact cause of the problem is here, only that symusic fails to bind a few track names that seem to be encoded with something other than UTF-8. (But which codec, then? Latin is already supported.)
To reproduce:
from pathlib import Path
from symusic import Score
path = Path('BreadMIDI/VGM - X (452 MIDI-files)/X/Xenogears (1998) - Other Version/Xenogears (1998) - Other Version - CD 1 - 22. The Wounded shall Advance into the Light.mid')
midi = Score(path)
test = midi.tracks[0].name
Here is a failing file: Xenogears (1998) - Other Version - CD 1 - 22. The Wounded shall Advance into the Light.mid.zip
@ojus1 this dataset seems to contain a lot of corrupted files; you might want to clean it a bit before using it.
I grabbed these bytes from the file:
\x81\x75\x58\x65\x6e\x6f\x67\x65\x61\x72\x73\x20\x8f\x9d\x82\xe0\x82\xc4\x82\xe9\x82\xed\x82\xea\x82\xe7\x20\x8c\xf5\x82\xcc\x82\xc8\x82\xa9\x82\xf0\x90\x69\x82\xdc\x82\xf1\x81\x76\x20\x5b\x53\x43\x2d\x38\x38\x50\x72\x6f\x5d\x20\x62\x79\x20\x82\xe0\x82\xab\x82\xe3
And that doesn't match any encoding, as even chardet cannot detect it correctly.
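For what it's worth, one quick way to probe candidate codecs against raw bytes like these is to attempt a decode per codec and record which ones succeed. This is only a sketch using the byte string quoted above, and a successful decode only proves the bytes are *valid* in that codec, not that it is the intended one (note Latin-1 in particular accepts any byte sequence):

```python
# The track-name bytes quoted in the comment above.
RAW = (b"\x81\x75\x58\x65\x6e\x6f\x67\x65\x61\x72\x73\x20\x8f\x9d\x82\xe0"
       b"\x82\xc4\x82\xe9\x82\xed\x82\xea\x82\xe7\x20\x8c\xf5\x82\xcc\x82"
       b"\xc8\x82\xa9\x82\xf0\x90\x69\x82\xdc\x82\xf1\x81\x76\x20\x5b\x53"
       b"\x43\x2d\x38\x38\x50\x72\x6f\x5d\x20\x62\x79\x20\x82\xe0\x82\xab"
       b"\x82\xe3")

def probe_codecs(raw, codecs=("utf-8", "latin-1", "shift_jis", "cp932")):
    """Try decoding `raw` with each codec; None means the bytes are invalid."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results

decoded = probe_codecs(RAW)
# utf-8 fails on these bytes (0x81 is not a valid start byte),
# while shift_jis/cp932 happen to decode them without error.
```

These particular bytes decode cleanly as Shift-JIS (Japanese track names are common in game MIDIs), but again, a clean decode alone cannot confirm the original codec.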
We are still looking into it, because I remember we stripped non-UTF-8 characters from track names when parsing MIDI...
And @ojus1, I have also used this dataset myself; it is really dirty. You can simply clean the dataset with:
import os
from symusic import Score

for root, dirs, files in os.walk('path/to/dataset'):
    for f in files:
        if f.upper().endswith('MID') or f.upper().endswith('MIDI'):
            fn = os.path.join(root, f)
            try:
                s = Score(fn)
                track_names = [t.name for t in s.tracks]
            except Exception:
                os.remove(fn)  # drop files symusic cannot parse
symusic can provide a processing speed of ~3000 files/second, so that won't take too long.
Thanks @lzqlzzq @Natooz ! Turns out, it was the corrupted files. Removing them, I was able to get a tokenizer up and running.
@lzqlzzq Thank you for your help!
@ojus1 Great!
I actually just added a filter_dataset method in #160 that does exactly what its name suggests.
Actually, there is a problem with strip_non_utf_8. I shall fix it ASAP; I'm occupied composing my graduation work right now...
I've switched to using utfcpp to remove invalid characters, and it just works well.
Currently, you could try this version by building from the github repository.
Thank you for the fix!
v0.4.5 has been released now, and it fixed the bug.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.