MidiTok Error in training tokenizer "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte"

The error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte

The stack trace:

[00:05:34] Pre-processing sequences       ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 93583    /    93583
[00:00:04] Tokenize words                 ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 4365     /     4365
[00:00:01] Count pairs                    ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 4365     /     4365
[00:06:09] Compute merges                 ███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 15386    /    15386
Traceback (most recent call last):
  File "/home/surya/Documents/Projects/MusicGPT/data.py", line 45, in <module>
    tokenizer.learn_bpe(vocab_size=16000, files_paths=midi_files)
  File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/midi_tokenizer.py", line 2380, in learn_bpe
    self._bpe_model.train_from_iterator(
  File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/bpe_iterator.py", line 87, in __next__
    return self[self.__iter_count - 1]
           ~~~~^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/bpe_iterator.py", line 76, in __getitem__
    return self.load_file(self.files_paths[idx])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/bpe_iterator.py", line 45, in load_file
    token_ids = self.tokenizer(midi)
                ^^^^^^^^^^^^^^^^^^^^
  File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/midi_tokenizer.py", line 3064, in __call__
    return self.midi_to_tokens(obj, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/midi_tokenizer.py", line 1330, in midi_to_tokens
    midi = self.preprocess_midi(midi)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/midi_tokenizer.py", line 342, in preprocess_midi
    merge_same_program_tracks(midi.tracks)
  File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/utils/utils.py", line 490, in merge_same_program_tracks
    new_track = merge_tracks([tracks[idx] for idx in idx_group], effects=effects)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/surya/miniconda3/lib/python3.11/site-packages/miditok/utils/utils.py", line 400, in merge_tracks
    tracks_[0].name += "".join([" / " + t.name for t in tracks_[1:]])
    ^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 0: invalid continuation byte

Code:

from pathlib import Path
from miditok import MIDILike, TokenizerConfig
import os
import fnmatch

# Our parameters
TOKENIZER_PARAMS = {
    "pitch_range": (21, 109),
    "beat_res": {(0, 4): 8, (4, 12): 4},
    "num_velocities": 32,
    "special_tokens": ["PAD", "BOS", "EOS", "MASK"],
    "use_chords": True,
    "use_rests": True,
    "use_tempos": True,
    "use_time_signatures": True,
    "use_programs": True,
    "num_tempos": 32,  # number of tempo bins
    "one_token_stream_for_programs": True,
    "tempo_range": (40, 250),  # (min, max)
}
config = TokenizerConfig(**TOKENIZER_PARAMS)

# Creates the tokenizer
tokenizer = MIDILike(config)


midi_paths = []
root1 = '/home/surya/Documents/datasets/midi1'
root2 = '/home/surya/Documents/Projects/midi2'

def find_midi_files(directory):
    midi_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if fnmatch.fnmatch(file, '*.mid') or fnmatch.fnmatch(file, '*.midi') or fnmatch.fnmatch(file, '*.MIDI') or fnmatch.fnmatch(file, '*.MID'):
                midi_files.append(Path(os.path.join(root, file)))
    return midi_files

midi_files = find_midi_files(root1) + find_midi_files(root2)
print(len(midi_files))

# Builds the vocabulary with BPE
tokenizer.learn_bpe(vocab_size=16000, files_paths=midi_files)
tokenizer.save_pretrained('./tokenizer')

I don't understand what the error is. Can someone help me fix it?

Apr 08 '24 13:04 ojus1

Hi,

The error seems to occur when preprocessing a certain file (more exactly, when merging its tracks). Would you be able to either identify this specific file (by printing its path in the source code) and share it, or share your datasets so that I can reproduce the error?
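
One way to locate the offending file without editing the library is to try each path in isolation and record which ones raise. The sketch below is only an illustration: `find_failing_files` and its `load` parameter are hypothetical names, not part of MidiTok or symusic, and `load` stands for whatever per-file step is being tested.

```python
from pathlib import Path
from typing import Callable, Iterable


def find_failing_files(
    paths: Iterable[Path],
    load: Callable[[Path], object],
) -> list[tuple[Path, Exception]]:
    """Return a (path, exception) pair for every file that `load` cannot process."""
    failures = []
    for path in paths:
        try:
            load(path)
        except Exception as exc:  # deliberately broad: the goal is only to locate bad files
            failures.append((path, exc))
    return failures


# Hypothetical usage, assuming symusic is installed and `midi_files` is the list
# built in the original post:
# from symusic import Score
# bad = find_failing_files(midi_files, lambda p: [t.name for t in Score(p).tracks])
# for path, exc in bad:
#     print(path, repr(exc))
```

Passing the loader as an argument keeps the helper independent of any particular library version, so the same loop can be reused with the tokenizer itself if that is the step that fails.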

Apr 08 '24 13:04 Natooz

Hi! I am using a combination of this dataset and a custom dataset. The error occurs only when using the bread-midi dataset but not with my custom dataset.

Apr 08 '24 13:04 ojus1

Thank you! It's quite large so I hope to catch it quickly and see how I'll handle this.

Apr 08 '24 13:04 Natooz

In the meantime, if you want to work around the issue right now, you can clone the repo and edit the problematic line of the merge_tracks method in the miditok/utils/utils.py file to something like:

try:
    # Change name
    tracks_[0].name += "".join([" / " + t.name for t in tracks_[1:]])
except UnicodeDecodeError:
    pass

It will just skip the track name concatenation if the track names in the MIDI file are encoded with something other than UTF-8.

Apr 08 '24 13:04 Natooz

@Yikai-Liao @lzqlzzq 👋

Sorry to ping you with another bug 😅 I'm not sure what the exact cause of the problem is here, only that symusic fails to bind a few track names that seem to be encoded with something other than UTF-8 (but which codec then, since Latin is already supported?).

To reproduce:

from pathlib import Path
from symusic import Score

path = Path('BreadMIDI/VGM - X (452 MIDI-files)/X/Xenogears (1998) - Other Version/Xenogears (1998) - Other Version - CD 1 - 22. The Wounded shall Advance into the Light.mid')
midi = Score(path)
test = midi.tracks[0].name

Here is a failing file: Xenogears (1998) - Other Version - CD 1 - 22. The Wounded shall Advance into the Light.mid.zip

@ojus1 this dataset seems to contain a lot of corrupted files, so you might want to clean it a bit before using it.

Apr 08 '24 16:04 Natooz

I grabbed these bytes from the file:

\x81\x75\x58\x65\x6e\x6f\x67\x65\x61\x72\x73\x20\x8f\x9d\x82\xe0\x82\xc4\x82\xe9\x82\xed\x82\xea\x82\xe7\x20\x8c\xf5\x82\xcc\x82\xc8\x82\xa9\x82\xf0\x90\x69\x82\xdc\x82\xf1\x81\x76\x20\x5b\x53\x43\x2d\x38\x38\x50\x72\x6f\x5d\x20\x62\x79\x20\x82\xe0\x82\xab\x82\xe3

That does not look like any common encoding to me, as even chardet cannot detect it correctly.
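
For what it is worth, the bytes quoted above do appear to decode cleanly as Shift_JIS, which would be consistent with a Japanese track title. The check below is a minimal sketch under that assumption; it does not show how symusic handles track names, and the choice of `shift_jis` is a guess based on the byte pattern rather than something stated in the file.

```python
# Track-name bytes copied from the dump quoted above.
raw = (
    b"\x81\x75\x58\x65\x6e\x6f\x67\x65\x61\x72\x73\x20"
    b"\x8f\x9d\x82\xe0\x82\xc4\x82\xe9\x82\xed\x82\xea\x82\xe7\x20"
    b"\x8c\xf5\x82\xcc\x82\xc8\x82\xa9\x82\xf0\x90\x69\x82\xdc\x82\xf1\x81\x76\x20"
    b"\x5b\x53\x43\x2d\x38\x38\x50\x72\x6f\x5d\x20\x62\x79\x20"
    b"\x82\xe0\x82\xab\x82\xe3"
)

# Strict UTF-8 decoding fails: 0x81 is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the name was written in Shift_JIS.
name = raw.decode("shift_jis")
print(utf8_ok, name)
```

If this guess is right, the file is not corrupted as such; its metadata is simply stored in a legacy encoding that a UTF-8-only binding cannot represent.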

We are still looking into it, because I remember we stripped non-UTF-8 characters from track names when parsing MIDI...

Apr 09 '24 05:04 lzqlzzq

And @ojus1, I have also used this dataset myself, and it is really dirty. You can clean it with:

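import os
from symusic import Score

# Warning: this permanently deletes every MIDI file that fails to load.
for root, dirs, files in os.walk('path/to/dataset'):
    for f in files:
        # Match MIDI files regardless of the case of the extension
        if f.upper().endswith(('MID', 'MIDI')):
            fn = os.path.join(root, f)
            try:
                s = Score(fn)
                # Reading the names triggers the decoding error on bad files
                track_names = [t.name for t in s.tracks]
            except Exception:
                os.remove(fn)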

symusic can process roughly 3000 files per second, so that won't take too long.

Apr 09 '24 06:04 lzqlzzq

Thanks @lzqlzzq @Natooz ! Turns out, it was the corrupted files. Removing them, I was able to get a tokenizer up and running.

Apr 09 '24 07:04 ojus1

@lzqlzzq Thank you for your help!

@ojus1 Great!

I actually just added a filter_dataset method in #160 that does exactly what its name suggests.

Apr 09 '24 07:04 Natooz

Actually, there is a problem with strip_non_utf_8. I shall fix it ASAP. I'm occupied with composing my graduation work right now...

Apr 09 '24 10:04 lzqlzzq

I've switched to using utfcpp to remove illegitimate characters, and it just works well.

For now, you can try this version by building from the GitHub repository.
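
To illustrate what removing invalid UTF-8 sequences means in practice, here is a small Python sketch. It is not the symusic implementation (which, per the message above, relies on utfcpp on the C++ side); `strip_invalid_utf8` is a hypothetical helper that only shows the general idea using Python's built-in decoder.

```python
def strip_invalid_utf8(raw: bytes) -> str:
    """Decode `raw` as UTF-8, silently dropping any bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


# Valid UTF-8 passes through unchanged; stray bytes such as 0xf0 are dropped.
print(strip_invalid_utf8("Café".encode("utf-8")))
print(strip_invalid_utf8(b"Piano \xf0 left"))
```

The trade-off of this approach is that names stored in another encoding lose their non-ASCII characters, but loading the file no longer raises.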

Apr 10 '24 15:04 Yikai-Liao

Thank you for the fix!

Apr 10 '24 15:04 Natooz

v0.4.5 has been released now, and it fixes the bug.

Apr 12 '24 13:04 Yikai-Liao

This issue is stale because it has been open for 30 days with no activity.

May 04 '24 02:05 github-actions[bot]

This issue is stale because it has been open for 30 days with no activity.

May 26 '24 02:05 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

Jun 02 '24 02:06 github-actions[bot]
