Support random validation for big datasets
Add an extra option to validate that would allow randomly sampling N tracks to validate, in order to support big datasets.
Moving the conversation from #305 here:
@PRamoneda We are going to add some big datasets to mirdata. https://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python/10382359#10382359
I think that this dependency could save us a lot of time: ijson https://pypi.org/project/ijson/
What's your opinion?
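In case it helps the discussion, here is a minimal sketch of how ijson could stream records from a large index without loading the whole file into memory. The top-level "tracks" key and the file layout are assumptions for illustration, not necessarily what mirdata stores:

```python
import ijson

def stream_index(index_path):
    """Lazily yield (track_id, record) pairs from a large JSON index.

    Assumes the index is one JSON object with a top-level "tracks"
    mapping (hypothetical layout used only for this sketch).
    """
    with open(index_path, 'rb') as f:
        for track_id, record in ijson.kvitems(f, 'tracks'):
            yield track_id, record
```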
@nkundiushuti just to put this into context: it is possible to save a large index by writing each record as one JSON object per line, rather than accumulating everything into a big dictionary that does not fit into memory. However, reading that file back is cumbersome. The problems appear when validating and loading the data, and we are not sure what the best approach is:
- To load the Tracks we currently use the index loaded in memory. @PRamoneda proposed using ijson, but maybe there are alternatives. I am not so eager to add another dependency unless it is light. As an alternative, the track_ids may be provided by a generator:
```python
import json

def gen_index(filename):
    with open(filename, 'r') as f:
        for line in f:
            # one JSON record per line
            yield json.loads(line)
```
- To validate, we can randomly sample N records from the JSON index and validate only those (see the sketch below). We may keep the option of fully validating the whole dataset.
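A rough sketch of what the random-sample validation could look like; validate_track and the default N are placeholders for this sketch, not existing mirdata API:

```python
import random

def validate_random_sample(track_ids, validate_track, n=100, seed=None):
    """Validate only a random sample of n tracks from the index.

    validate_track is a placeholder callable that returns the problems
    found for one track (e.g. missing or mismatched files), if any.
    """
    ids = list(track_ids)
    rng = random.Random(seed)
    sample = rng.sample(ids, min(n, len(ids)))
    problems = {}
    for track_id in sample:
        issues = validate_track(track_id)
        if issues:
            problems[track_id] = issues
    return problems
```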
@PRamoneda and @genisplaja also commented that tests on big datasets are slow (with test_full_dataset). They mentioned the idea of parallelizing them to save some time. @rabitt thoughts?
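For what it's worth, one possible shape for the parallelization idea (a sketch only; validate_track is again a placeholder and this is not how the tests are currently written):

```python
from multiprocessing import Pool

def validate_parallel(track_ids, validate_track, processes=4):
    """Run per-track validation across a pool of worker processes.

    validate_track must be a module-level (picklable) callable that
    returns the problems found for one track_id, or an empty result.
    """
    ids = list(track_ids)
    with Pool(processes) as pool:
        results = pool.map(validate_track, ids)
    return {tid: res for tid, res in zip(ids, results) if res}
```

If it is the pytest suite itself that needs speeding up, something like pytest-xdist might also be worth considering instead of changing the code.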