Move index files to a remote location
As the number of supported datasets grows, it will become less and less feasible to include each dataset's index file (of checksums and paths) in the repo. Instead, we can put them on Zenodo and download them when they're needed.
Thoughts?
We discussed offline with @lostanlen and agreed that we should write down somewhere the version policy for moving the indexes to Zenodo:
- what happens in the future with a new release of mirdata
- what happens if we modify an index (e.g. we find a bug and fix it; do we keep the old one on Zenodo for version control as well?)
- and what happens when we include a new loader.
@lostanlen pointed out a fair question: how do we include an index in Zenodo when someone is contributing a new loader? Are we going to upload it ourselves manually? Is it worth it then? @rabitt thoughts?
On versioning: if we update an index, it will get a new download link on Zenodo, so mirdata would also have to change the link to pick up the update; there shouldn't be any "silent" version changes.
I'm imagining a separate zenodo record per index, rather than having one record with all indexes. This way it's easy to add new ones without affecting existing ones.
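To make the "one record per index" idea concrete, here's a minimal sketch (hypothetical names and placeholder URLs/checksums, not an existing mirdata API) of how each index could be declared as a pinned Zenodo URL plus checksum, so that updating an index necessarily changes both fields together:

```python
import hashlib
import os
import urllib.request

# Hypothetical registry: one zenodo record (and checksum) per dataset index.
# Updating an index means minting a new record, so the URL and checksum change
# together and there can be no silent version changes.
INDEX_REMOTES = {
    "orchset": {
        "url": "https://zenodo.org/record/1234567/files/orchset_index.json",  # placeholder
        "checksum": "d41d8cd98f00b204e9800998ecf8427e",  # placeholder md5
    },
}


def fetch_index(dataset_name, save_dir):
    """Download a dataset's index from its zenodo record and verify the checksum."""
    remote = INDEX_REMOTES[dataset_name]
    local_path = os.path.join(save_dir, os.path.basename(remote["url"]))
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(remote["url"], local_path)

    with open(local_path, "rb") as fhandle:
        md5 = hashlib.md5(fhandle.read()).hexdigest()
    if md5 != remote["checksum"]:
        raise IOError("index for {} failed checksum validation".format(dataset_name))
    return local_path
```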
Moving a discussion from #149 to here:
@rabitt (in response to a file-system based dataset.track_ids() function):
I think dataset.track_ids() should be deterministic, instead of based on the user's filesystem. Some ideas:
- What about a file that just stores the track ids (no filenames or checksums)?
- I opened an idea for how we deal with the growing size of the indexes/ folder in #153 - we could do something similar here (download the trackids file from zenodo as part of the download).
@andreasjansson:
I like that idea! In that case we could even store the full index on Zenodo. If you have bandwidth to download ~500GB of MSD, you can probably download another ~100MB of index. The validate method could get an optional subsample_percent=1.0 (or something) argument, in case you want faster validation.
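Here's a rough sketch of what that optional subsampling could look like (subsample_percent is not an existing mirdata argument; I'm assuming an index that maps each track id to {file_key: (relative_path, checksum)} entries):

```python
import hashlib
import os
import random


def validate(index, data_home, subsample_percent=1.0):
    """Check that local files match the index, optionally on a random
    subset of tracks to make validation of huge datasets faster."""
    track_ids = list(index.keys())
    if subsample_percent < 1.0:
        n_checked = max(1, int(len(track_ids) * subsample_percent))
        track_ids = random.sample(track_ids, n_checked)

    missing, invalid = [], []
    for track_id in track_ids:
        for rel_path, checksum in index[track_id].values():
            local_path = os.path.join(data_home, rel_path)
            if not os.path.exists(local_path):
                missing.append(local_path)
            else:
                with open(local_path, "rb") as fhandle:
                    if hashlib.md5(fhandle.read()).hexdigest() != checksum:
                        invalid.append(local_path)
    return missing, invalid
```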
I am currently exploring the possibility of enabling git lfs (large file storage) in mirdata. Here's my fork: https://github.com/lostanlen/lfs-mirdata
here's the breakdown of storage usage per file type for the whole mirdata repository:
(lfsmd) 4c32759bd91d:lfs-mirdata vl238$ git lfs migrate info --include-ref=master
migrate: Sorting commits: ..., done
migrate: Examining commits: 100% (133/133), done
*.json 11 MB 35/35 files(s) 100%
*.wav 6.4 MB 21/21 files(s) 100%
*.py 3.2 MB 550/558 files(s) 99%
*.mp3 2.0 MB 4/4 files(s) 100%
*.md 260 KB 30/30 files(s) 100%
and now for only the mirdata package (as present on PyPI):
(lfsmd) 4c32759bd91d:lfs-mirdata vl238$ git lfs migrate info --include="mirdata/**" --include-ref=master
migrate: Sorting commits: ..., done
migrate: Examining commits: 100% (133/133), done
*.json 10 MB 22/22 files(s) 100%
*.py 2.0 MB 241/241 files(s) 100%
I used the following command:
git lfs migrate import --everything --include="mirdata/indexes/**"
and i managed to package a “mirdata lite” that weighs only 55kb!
then i ran python setup.py sdist, and the distribution installs correctly. it runs download() and cite() normally.
now i need to update util.load_json_index so that it downloads the JSON on the fly from the remote LFS store.
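a minimal sketch of what that could look like (REMOTE_INDEX_URL and the cache directory are assumptions, not the real LFS endpoint): keep the current behaviour when the JSON ships with the package, and fall back to downloading and caching it otherwise:

```python
import json
import os
import urllib.request

# Assumed remote location for a "lite" distribution; in practice this would be
# the LFS endpoint (or a pinned raw.githubusercontent.com URL, see below).
REMOTE_INDEX_URL = "https://example.org/mirdata/indexes/"
CACHE_DIR = os.path.join(os.path.expanduser("~"), ".mirdata", "indexes")


def load_json_index(json_file):
    """Load an index JSON, fetching it on the fly if the package does not
    ship with a local copy (i.e. the lite PyPI distribution)."""
    local_path = os.path.join(os.path.dirname(__file__), "indexes", json_file)
    if not os.path.exists(local_path):
        os.makedirs(CACHE_DIR, exist_ok=True)
        local_path = os.path.join(CACHE_DIR, json_file)
        if not os.path.exists(local_path):
            urllib.request.urlretrieve(REMOTE_INDEX_URL + json_file, local_path)
    with open(local_path, "r") as fhandle:
        return json.load(fhandle)
```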
this "git-lfs-fetch" module is interesting and actively maintained. i am in favor of adding it as a dependency https://github.com/liberapay/git-lfs-fetch.py
however, we also need to add other functionalities to the git-lfs-fetch.py
module so that index loading works on lite PyPI distributions.
here's my plan for this. the idea is to reimplement git_lfs.get_lfs_endpoint_url and extend it so that it works for PyPI distributions. in this way, we can accommodate power users (who clone the source) and casual users (who install from pip).
for pip users, there is no .git folder to pull from, so the endpoint lives on the web (raw.githubusercontent.com). now, we want to download an index from the correct version of the code, so that has to be tied to mirdata.version.
so, in other words, i need to update setup.py so that it produces the "LFS endpoint URL" automatically, and so that get_lfs_endpoint_url works on PyPI.
Example usage:
>>> git_lfs.get_lfs_endpoint_url('/Users/vl238/lfs-mirdata', '/Users/vl238/lfs-mirdata')
'https://github.com/lostanlen/lfs-mirdata.git/info/lfs'
now the URL we need is
https://raw.githubusercontent.com/mir-dataset-loaders/mirdata/ + VERSION + /mirdata/indexes
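in code, that amounts to something like this (a sketch only; the hard-coded VERSION stands in for whatever mirdata.version actually exposes):

```python
import urllib.request

# Placeholder version string; in practice this would come from mirdata.version
# so the downloaded index always matches the installed code.
VERSION = "0.1.0"

INDEX_BASE_URL = (
    "https://raw.githubusercontent.com/mir-dataset-loaders/mirdata/"
    + VERSION
    + "/mirdata/indexes/"
)


def fetch_index(json_file, destination):
    """Download the version-pinned index file to a local path."""
    urllib.request.urlretrieve(INDEX_BASE_URL + json_file, destination)
```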
@rabitt:
"Instead we can put them on Zenodo and download them when they're needed."
what do you mean by "when they're needed"? at import time? if so, users will have to be connected to the Internet the first time they import a mirdata module. is that OK for you?
I created an lfs branch on mirdata: https://github.com/mir-dataset-loaders/mirdata/tree/lfs
Note that master and lfs have entirely different commit histories, so it will not be possible to make a PR for this.

Instead, it must be done by a force-push, which of course would come with a great deal of caution. Our master branch is protected against force-pushes (which is good), so there is no risk of it ever happening by accident.
As open PRs get merged to master (#149, #159, #174, #185, #188, and #190 + probably others), i will merge them to this lfs branch too so that i'm up to date with master.
In the meantime, i'd like to be able to run CircleCI on the lfs branch. May i set this radio button to "Off"?

In the meantime, i'd like to be able to run CircleCI on the lfs branch. May i set this radio button to "Off"?
Yes go ahead!
hey! this branch does not seem to exist (https://github.com/mir-dataset-loaders/mirdata/tree/lfs). should I start from scratch?
Give me a couple of days to check if it's worth starting from a backup version
the index for acousticbrainz genre is 1.5GB, and github's LFS has a 1GB file size limit, so we either have to store the file somewhere else or find another solution such as zenodo. for large indexes/datasets I am in favour of using zenodo and pulling the index when the dataset is imported. there is also the option of storing indexes as github release assets: https://docs.github.com/en/free-pro-team@latest/github/managing-large-files/distributing-large-binaries
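to make the "pull the index when the dataset is imported" idea concrete, here's a minimal sketch (all names and URLs are placeholders, not existing mirdata code) that defers downloading an oversized index until it is first accessed:

```python
import json
import os
import urllib.request

# Hypothetical mapping for the indexes that exceed the LFS limit; each one
# lives in its own zenodo record and is only fetched when first needed.
LARGE_INDEX_URLS = {
    "acousticbrainz_genre": "https://zenodo.org/record/0000000/files/acousticbrainz_genre_index.json",  # placeholder
}


class LazyIndex(object):
    """Defer downloading a large index until the first time it is accessed."""

    def __init__(self, dataset_name, cache_dir):
        self.dataset_name = dataset_name
        self.cache_dir = cache_dir
        self._index = None

    @property
    def index(self):
        if self._index is None:
            local_path = os.path.join(self.cache_dir, self.dataset_name + "_index.json")
            if not os.path.exists(local_path):
                os.makedirs(self.cache_dir, exist_ok=True)
                urllib.request.urlretrieve(LARGE_INDEX_URLS[self.dataset_name], local_path)
            with open(local_path, "r") as fhandle:
                self._index = json.load(fhandle)
        return self._index
```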
https://medium.com/@megastep/github-s-large-file-storage-is-no-panacea-for-open-source-quite-the-opposite-12c0e16a9a91
Closed via #335. Because of the size limits of GitHub LFS, we decided to move big indexes to zenodo.
Reopening to restart the discussion as this is becoming increasingly important.
We now support large indexes on zenodo. We've started discussing putting all indexes on zenodo and removing all local indexes, to better support multiple dataset versions and sample versions as in #433.