
Move index files to a remote location

Open rabitt opened this issue 5 years ago • 12 comments

As this grows, it will get less feasible to include each dataset's index file (of checksums and paths) in the repo. Instead we can put them on Zenodo and download them when they're needed.

Thoughts?

rabitt avatar Nov 20 '19 16:11 rabitt

We discussed offline with @lostanlen that we should write down somewhere the version policy when moving the indexes to Zenodo:

  • what happens in the future with a new release of mirdata
  • what happens if we modify an index (e.g. found a bug, fix it, we keep the old one in Zenodo for version control as well)
  • and when we include a new loader.

magdalenafuentes avatar Feb 11 '20 21:02 magdalenafuentes

@lostanlen pointed out a fair question: how do we include an index in Zenodo when someone contributes a new loader? Are we going to upload it ourselves manually? Is it worth it then? @rabitt thoughts?

magdalenafuentes avatar Feb 11 '20 21:02 magdalenafuentes

On versioning - if we update an index, it will get a new download link on Zenodo, so mirdata would have to change the link to pick up the update. That means there shouldn't be any "silent" version changes.

I'm imagining a separate zenodo record per index, rather than having one record with all indexes. This way it's easy to add new ones without affecting existing ones.

rabitt avatar Feb 21 '20 22:02 rabitt

Moving a discussion from #149 to here:

@rabitt (in response to a file-system based `dataset.track_ids()` function):

I think dataset.track_ids() should be deterministic, instead of based on the user's filesystem. Some ideas:

  • What about a file that just stores the track ids (no filenames or checksums)?
  • I opened an idea for how we deal with the growing size of the indexes/ folder in #153 - we could do something similar here (download the trackids file from zenodo as part of the download).

@andreasjansson:

I like that idea! In that case we could even store the full index on Zenodo. If you have bandwidth to download ~500GB of MSD, you can probably download another ~100MB of index. The validate method could get an optional subsample_percent=1.0 (or something) argument, in case you want faster validation.
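The optional-subsampling idea could be sketched like this; the index layout and the `subsample_percent` signature are assumptions for illustration, not mirdata's actual API:

```python
import os
import random


def validate(index, data_home, subsample_percent=1.0, seed=42):
    """Check that indexed files exist on disk, on a random subsample of tracks.

    `index` maps track_id -> {file_key: (relative_path, checksum)}.
    subsample_percent=1.0 validates everything; e.g. 0.01 spot-checks 1%,
    trading completeness for speed on very large datasets like MSD.
    """
    track_ids = sorted(index.keys())
    if subsample_percent < 1.0:
        n_sample = max(1, int(len(track_ids) * subsample_percent))
        track_ids = random.Random(seed).sample(track_ids, n_sample)

    missing = []
    for track_id in track_ids:
        for rel_path, _checksum in index[track_id].values():
            if not os.path.exists(os.path.join(data_home, rel_path)):
                missing.append(rel_path)
    return missing
```

A fixed default seed keeps spot-check runs reproducible; a full checksum comparison would slot in where the `os.path.exists` check is.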

rabitt avatar Mar 05 '20 17:03 rabitt

I am currently exploring the possibility of enabling git lfs (large file storage) in mirdata. Here's my fork: https://github.com/lostanlen/lfs-mirdata

here’s our breakdown of storage usage per file type for the whole mirdata repository:

```
(lfsmd) 4c32759bd91d:lfs-mirdata vl238$ git lfs migrate info --include-ref=master
migrate: Sorting commits: ..., done
migrate: Examining commits: 100% (133/133), done
*.json	11 MB 	  35/35 files(s)	100%
*.wav 	6.4 MB	  21/21 files(s)	100%
*.py  	3.2 MB	550/558 files(s)	 99%
*.mp3 	2.0 MB	    4/4 files(s)	100%
*.md  	260 KB	  30/30 files(s)	100%
```

and now for only the mirdata package (as present on PyPI)

```
(lfsmd) 4c32759bd91d:lfs-mirdata vl238$ git lfs migrate info --include="mirdata/**" --include-ref=master
migrate: Sorting commits: ..., done
migrate: Examining commits: 100% (133/133), done
*.json	10 MB 	  22/22 files(s)	100%
*.py  	2.0 MB	241/241 files(s)	100%
```

I used the following command:

```
git lfs migrate import --everything --include="mirdata/indexes/**"
```

and i managed to package a “mirdata lite” that weighs only 55kB! then i ran `python setup.py sdist` and the distribution installs correctly. it runs download() and cite() normally.

now i need to update util.load_json_index so that it downloads the JSON on the fly from the remote LFS store
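A minimal sketch of that on-the-fly loading; the `remote_url` parameter is hypothetical, standing in for whatever endpoint resolution ends up being used:

```python
import json
import os
import urllib.request


def load_json_index(json_path, remote_url=None):
    """Load an index from disk, fetching it from a remote store on first use.

    If the index is not present locally (e.g. a "mirdata lite" install that
    ships without indexes), download it from `remote_url` before loading.
    """
    if not os.path.exists(json_path) and remote_url is not None:
        os.makedirs(os.path.dirname(json_path) or ".", exist_ok=True)
        urllib.request.urlretrieve(remote_url, json_path)
    with open(json_path) as fhandle:
        return json.load(fhandle)
```

Power users with a full source checkout never hit the download branch, since the JSON is already on disk.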

this "git-lfs-fetch" module is interesting and actively maintained. i am in favor of adding it as a dependency https://github.com/liberapay/git-lfs-fetch.py

however, we also need to add other functionalities to the git-lfs-fetch.py module so that index loading works on lite PyPI distributions.

here's my plan for this. the idea is to reimplement git_lfs.get_lfs_endpoint_url and extend it so that it works for PyPI distributions. in this way, we can accommodate power users (who clone the source) and casual users (who install from pip).

for pip users, there is no .git folder to pull from, so the endpoint lives on the web (raw.githubusercontent.com). Now, we want to download an index from the correct version of the code, so the endpoint has to be tied to mirdata.version. in other words, i need to update setup.py so that it produces the “LFS endpoint URL” automatically, and so that get_lfs_endpoint_url works on PyPI.

Example usage:

```
>>> git_lfs.get_lfs_endpoint_url('/Users/vl238/lfs-mirdata', '/Users/vl238/lfs-mirdata')
'https://github.com/lostanlen/lfs-mirdata.git/info/lfs'
```

now the URL we need is

```
https://raw.githubusercontent.com/mir-dataset-loaders/mirdata/ + VERSION + /mirdata/indexes
```
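Tying that URL to the installed package version could look like the sketch below; the template and the assumption that release tags match `mirdata.version` verbatim are illustrative, not the repo's confirmed scheme:

```python
# Hypothetical URL builder: pins index downloads to the release that
# matches the installed package version, so pip installs and source
# checkouts resolve the same index contents.
INDEX_URL_TEMPLATE = (
    "https://raw.githubusercontent.com/mir-dataset-loaders/mirdata/"
    "{version}/mirdata/indexes/{index_name}"
)


def index_url(index_name, version):
    """Build the raw.githubusercontent.com URL for an index file,
    pinned to the git tag matching the given package version."""
    return INDEX_URL_TEMPLATE.format(version=version, index_name=index_name)
```

One caveat to check: files tracked by git LFS are served by raw.githubusercontent.com as small pointer files, so the download code would need to resolve those pointers through the LFS endpoint (which is what git-lfs-fetch.py does).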

@rabitt

Instead we can put them on Zenodo and download them when they're needed.

what do you mean by "when they're needed?" at import time? if so, users will have to be connected to the Internet the first time they import a mirdata module. is that OK for you?

lostanlen avatar Mar 07 '20 13:03 lostanlen

I created an lfs branch on mirdata: https://github.com/mir-dataset-loaders/mirdata/tree/lfs

Note that master and lfs have entirely different commit histories, so it will not be possible to make a PR for this.

Screen Shot 2020-03-07 at 9 20 27 AM

Instead, it must be done by a force-push, which would of course require a great deal of caution. Our master branch is protected against force-pushes (which is good), so there is no risk of it ever happening by accident.

As open PRs get merged to master (#149, #159, #174, #185, #188, and #190 + probably others), i will merge them to this lfs branch too so that i'm up to date with master.

In the meantime, i'd like to be able to run CircleCI on the lfs branch. May i set this radio button to "Off"?

Screen Shot 2020-03-07 at 9 24 08 AM

lostanlen avatar Mar 07 '20 14:03 lostanlen

In the meantime, i'd like to be able to run CircleCI on the lfs branch. May i set this radio button to "Off"?

Yes go ahead!

rabitt avatar Mar 10 '20 17:03 rabitt

hey! this branch does not seem to exist: https://github.com/mir-dataset-loaders/mirdata/tree/lfs should I start from scratch?

nkundiushuti avatar Oct 20 '20 10:10 nkundiushuti

Give me a couple of days to check if it's worth starting from a backup version

magdalenafuentes avatar Oct 20 '20 16:10 magdalenafuentes

the index for acousticbrainz genre is 1.5GB, and GitHub LFS has a 1GB file-size limit, so we either have to store the file somewhere else or use a solution like zenodo. I am in favour of using zenodo and pulling the index when the dataset is imported, for large indexes/datasets. there is also the option of storing indexes as release assets on github: https://docs.github.com/en/free-pro-team@latest/github/managing-large-files/distributing-large-binaries

https://medium.com/@megastep/github-s-large-file-storage-is-no-panacea-for-open-source-quite-the-opposite-12c0e16a9a91
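Pulling a 1.5GB index from zenodo at import time would want a streaming download rather than reading the whole response into memory; a minimal sketch, with the function name and chunk size chosen here for illustration:

```python
import urllib.request


def download_large_index(url, local_path, chunk_size=1024 * 1024):
    """Stream a large remote index to disk in fixed-size chunks,
    so a ~1.5GB file never has to fit in memory at once."""
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as fhandle:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            fhandle.write(chunk)
```

In practice this would also want resume-on-failure and a checksum pass at the end, since a multi-gigabyte transfer is much more likely to be interrupted than a few-MB one.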

nkundiushuti avatar Nov 05 '20 10:11 nkundiushuti

Closed via #335. Because of size limits of GitHub LFS we decided to move big indexes to zenodo.

magdalenafuentes avatar Nov 30 '20 16:11 magdalenafuentes

Reopening to restart the discussion as this is becoming increasingly important.

We now support large indexes on zenodo. We've started discussing putting all indexes on zenodo and removing all local indexes, to better support multiple dataset versions and sample versions, as in #433.

rabitt avatar Apr 06 '21 16:04 rabitt