
Datasets create empty samples

Open juancopi81 opened this issue 1 year ago • 2 comments

Hi Carlos,

I created a dataset using Musicaiz. I used the provided code:

from musicaiz.tokenizers import MMMTokenizer, MMMTokenizerArguments
from musicaiz.datasets import JSBChorales

# Tokenize a dataset in musicaiz
output_path = "./BachChorales_4Bar_128"

args = MMMTokenizerArguments(
    prev_tokens="",
    windowing=True,
    time_unit="HUNDRED_TWENTY_EIGHT",
    num_programs=None,
    shuffle_tracks=True,
    track_density=True,
    window_size=4,
    hop_length=2,
    time_sig=False,
    velocity=False,
)
dataset = JSBChorales()
dataset.tokenize(
    dataset_path="/path/JSBChoralesDataset",
    output_path=output_path,
    output_file="token-sequences",
    args=args,
    tokenize_split="all"
)

When reviewing the dataset, I noticed that there were some empty lines:

[image: screenshot of the dataset showing the empty lines]

You can also check it out on Hugging Face.
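For anyone who wants to confirm the symptom on their own output, here is a minimal sketch that lists and strips whitespace-only lines in a text file. It assumes the tokenizer output is a plain text file with one token sequence per line. The helper names `find_empty_lines` and `drop_empty_lines` are hypothetical and not part of musicaiz.

```python
from pathlib import Path


def find_empty_lines(path):
    """Return the 1-based numbers of lines that contain only whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.strip()
    ]


def drop_empty_lines(src, dst):
    """Copy src to dst without whitespace-only lines; return how many were dropped."""
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    kept = [line for line in lines if line.strip()]
    # Keep a single trailing newline so the file still ends cleanly.
    Path(dst).write_text("\n".join(kept) + ("\n" if kept else ""), encoding="utf-8")
    return len(lines) - len(kept)
```

This only works around the problem after the fact. It does not explain why the empty samples are produced in the first place.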

I am not sure why this is happening. I have not been able to install the repo locally to debug it yet, but maybe it has to do with the MMM tokenizer? Line 174:

tokens += "\n"
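To make the hypothesis concrete, here is a toy illustration of how an unconditional newline could produce blank lines. This is not the musicaiz implementation. The function `join_windows` and its `skip_empty` flag are made up for this sketch, and it assumes that some windows end up with no tokens at all.

```python
def join_windows(windows, skip_empty=False):
    """Join per-window token lists into one newline-terminated line each."""
    out = ""
    for tokens in windows:
        if skip_empty and not tokens:
            # Guard: a window with no tokens contributes nothing.
            continue
        out += " ".join(tokens)
        out += "\n"  # appended even when the window was empty
    return out


# One empty window in the middle yields a blank line in the output.
with_blank = join_windows([["a", "b"], [], ["c"]])
without_blank = join_windows([["a", "b"], [], ["c"]], skip_empty=True)
```

If the real cause is similar, guarding the newline behind a non-empty check would be one possible fix, but that would need to be verified against the actual tokenizer code.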

I am happy to create a PR if I find the problem. But I wanted to create the issue first 😃

Thanks again for the great library.

juancopi81, Apr 05 '23 23:04

Hi Juan Carlos,

Can you please provide the version of musicaiz that you are using?

carlosholivan, Apr 07 '23 00:04

Sure! I installed it using:

!pip install musicaiz==0.1.2

juancopi81, Apr 07 '23 14:04