
Support random validation for big dataset

Open · rabitt opened this issue 4 years ago • 2 comments

Add an extra option to validate that allows randomly sampling N tracks to validate, in order to support big datasets.

rabitt avatar Oct 19 '20 21:10 rabitt

Moving the conversation from #305 here:

@PRamoneda We are going to add some big datasets to mirdata. https://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python/10382359#10382359

I think that this dependency could save us a lot of time: ijson https://pypi.org/project/ijson/

What's your opinion?
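For context, here is a minimal sketch of how ijson could stream records from a large index without loading it all at once, assuming the index is a single JSON object with a top-level "tracks" mapping; the stream_track_index name and the use of kvitems are illustrative assumptions, not an existing mirdata API:

import ijson

def stream_track_index(index_path):
    # Yield (track_id, record) pairs one at a time instead of loading
    # the whole index dictionary into memory.
    with open(index_path, "rb") as f:
        for track_id, record in ijson.kvitems(f, "tracks"):
            yield track_id, record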

@nkundiushuti Just to put this into context: it is possible to save a large index by writing each record as a separate JSON line rather than accumulating everything into one big dictionary that does not fit into memory. However, reading that file back is cumbersome. The problems appear when validating and loading the data, and we are not sure what the best approach is:

To load the Tracks we currently use the index loaded in memory. @PRamoneda proposed using ijson, but maybe there are alternatives. I am not eager to add another dependency unless it is lightweight. As an alternative, the track_ids could be provided by a generator, for example:

import json

def gen_index(filename):
    # Yield one index record per line instead of loading the whole file.
    with open(filename, "r") as f:
        for line in f:
            yield json.loads(line)

To validate, we can randomly sample N records from the JSON index and validate only those, while keeping an option to fully validate the whole dataset.
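A minimal sketch of what that option could look like, assuming a per-track validation helper; validate_track, n_tracks, and the function name are hypothetical, not mirdata's current API:

import random

def validate_random_sample(track_ids, validate_track, n_tracks=100, seed=None):
    # Validate only a random subset of tracks; n_tracks=None falls back
    # to validating the full dataset.
    if n_tracks is None or n_tracks >= len(track_ids):
        sample = track_ids
    else:
        sample = random.Random(seed).sample(track_ids, n_tracks)
    return {track_id: validate_track(track_id) for track_id in sample}

A fixed seed keeps the sampled subset reproducible across runs, which makes validation failures easier to reproduce.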

rabitt avatar Nov 03 '20 22:11 rabitt

@PRamoneda and @genisplaja also commented that tests on big datasets are slow (with test_full_dataset). They suggested that we could parallelize them to save some time. @rabitt thoughts?
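For reference, one way to parallelize per-track validation using only the standard library; the validate_track helper is hypothetical:

from multiprocessing import Pool

def validate_parallel(track_ids, validate_track, n_workers=4):
    # Run per-track validation in worker processes and collect results.
    with Pool(n_workers) as pool:
        results = pool.map(validate_track, track_ids)
    return dict(zip(track_ids, results))

Note that validate_track must be a top-level (picklable) function for multiprocessing to work. For the test suite itself, pytest-xdist's -n option is an existing off-the-shelf alternative.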

magdalenafuentes avatar Jan 18 '21 16:01 magdalenafuentes