musicaiz
Datasets create empty samples
Hi Carlos,
I created a dataset using Musicaiz. I used the provided code:
from musicaiz.tokenizers import MMMTokenizer, MMMTokenizerArguments
from musicaiz.datasets import JSBChorales

# Tokenize a dataset in musicaiz
output_path = "./BachChorales_4Bar_128"

args = MMMTokenizerArguments(
    prev_tokens="",
    windowing=True,
    time_unit="HUNDRED_TWENTY_EIGHT",
    num_programs=None,
    shuffle_tracks=True,
    track_density=True,
    window_size=4,
    hop_length=2,
    time_sig=False,
    velocity=False,
)

dataset = JSBChorales()
dataset.tokenize(
    dataset_path="/path/JSBChoralesDataset",
    output_path=output_path,
    output_file="token-sequences",
    args=args,
    tokenize_split="all"
)
When reviewing the dataset, I noticed that there were some empty lines.
You can also check it out in Hugging Face.
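One quick way to confirm the problem is to scan the generated token file for blank lines. This is a minimal sketch, assuming the tokenizer writes one token sequence per line to a plain-text file. The helper `find_empty_lines` and the file name `token-sequences.txt` are hypothetical and not part of musicaiz:

```python
from pathlib import Path


def find_empty_lines(path):
    """Return 1-based line numbers of lines that are empty or whitespace-only."""
    with Path(path).open(encoding="utf-8") as handle:
        return [i for i, line in enumerate(handle, start=1) if not line.strip()]


if __name__ == "__main__":
    # Hypothetical location: adjust to wherever the tokenizer wrote its output.
    token_file = Path("./BachChorales_4Bar_128") / "token-sequences.txt"
    if token_file.exists():
        empty = find_empty_lines(token_file)
        print(f"{len(empty)} empty lines, first few: {empty[:10]}")
```

Reporting line numbers rather than just a count makes it easier to map each empty sample back to the piece and window that produced it.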
I am not sure why this is happening. I cannot install the repo locally right now to debug it, but maybe it has to do with the MMM tokenizer? Line 174: tokens += "\n".
I am happy to create a PR if I find the problem. But I wanted to create the issue first 😃
Thanks again for the great library.
Hi Juan Carlos,
Can you please provide the version of musicaiz that you are using?
Sure! I installed it using:
!pip install musicaiz==0.1.2
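To double-check which version is actually active in the environment (for example, if several installs coexist), the standard library can report it. This is a small sketch and the helper name `installed_version` is made up for illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if musicaiz is installed, otherwise None.
print(installed_version("musicaiz"))
```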