MLDatasets.jl icon indicating copy to clipboard operation
MLDatasets.jl copied to clipboard

move with_accept() to download_dep()

Open alyst opened this issue 7 years ago • 12 comments

I think with_accept() should be in download_dep() rather than datadir(). Otherwise e.g. MNIST.download(i_accept_the_terms_of_use = true) has no chances of setting DATADEPS_ALWAYS_ACCEPT.

alyst avatar Aug 30 '18 09:08 alyst

v0.7 CI is fine (after fixing travis), v1.0 fails because of MAT.jl

alyst avatar Aug 30 '18 10:08 alyst

I think the current implementation should work fine. Do you have an example where it doesn't (the MNIST example you give works for me)? download should accept the keyword argument (see https://github.com/oxinabox/DataDeps.jl/blob/master/src/resolution_automatic.jl#L39-L44).

Its just that the code here outdates these upstream changes a little so we do some redundant work here as far as I recall.

Evizero avatar Aug 30 '18 10:08 Evizero

cc @oxinabox

Evizero avatar Aug 30 '18 11:08 Evizero

Thanks, it really looks like we can remove with_accept() altogether. But the check for DATADEPS_DISABLE_DOWNLOAD (if it's needed) still have to be moved to download_dep() or even to the DataDeps package.

I didn't test MNIST.download(), just looked at the code and was confused. :)

alyst avatar Aug 30 '18 11:08 alyst

I didn't test MNIST.download(), just looked at the code and was confused. :)

Understandibly, its not terribly clear right now. Sorry about that. Basically what happened was that Lyndon moved a couple of concepts that were prototyped here upstream into DataDeps. After that I only made the minimally necessary changes and didn't refactor "properly".

I don't think that refactor is trivial to do either, given that the behaviour that we are going for is a little complex and not exactly the same as DataDeps promotes (which is fine).

Evizero avatar Aug 30 '18 11:08 Evizero

Understandibly, its not terribly clear right now. Sorry about that.

No problem at all. What's the difference in the MLDatasets and DataDeps behaviours?

alyst avatar Aug 30 '18 11:08 alyst

What's the difference in the MLDatasets and DataDeps behaviours?

I don't remember exactly. Might even be that he/we got rid of them (?)

Basically I want to preserve to current behaviour of the download logic here.

Evizero avatar Aug 30 '18 11:08 Evizero

Basically I want to preserve to current behaviour of the download logic here.

You mean DATADEPS_DISABLE_DOWNLOAD should only be checked by datafile(), while download_deps() should ignore this var?

Now I'm not sure how to proceed with this PR, but maybe my original goal will help shaping it. I'm trying to use the MNIST from MLDatasets when testing the other package. Probably I can do MNIST.download(i_accept_the_terms_of_use=true) to get the dataset in CI, but wouldn't it trigger re-download every time when I'm testing the package locally? Unlike datafile(), download_dep() doesn't check for the existence of the file. OTOH MLDatasets.datafile() is called from e.g. MNIST.traindata(), but the latter does not support passing kwargs to datafile() (and, I guess, it really should not). So IIUC, there is no public interface that allows downloading data only once both from interactive and CI sessions.

One solution would be to add force = false kwarg and the checks for dataset existence into download_dep() (similar to the ones that are currently in datafile()). So that the download happens only when the data are missing or when forced.

alyst avatar Aug 30 '18 12:08 alyst

right. the idea of calling MNIST.download is to force the download.

What you can do in your case is just set the environment variable "DATADEPS_ALWAY_ACCEPT" to "true". This will allow you to work with the package in the usual way (no explicit call to download etc) without having to deal with the input prompt.

EDIT: and it will only download the data if it is not already downloaded

Evizero avatar Aug 30 '18 12:08 Evizero

What you can do in your case is just set the environment variable "DATADEPS_ALWAY_ACCEPT" to "true".

Thanks. Probably it will work, I'm just not comfortable that instead of directly defining the behaviour of the functions I'm calling through their arguments, I would be doing it through the environmental variable (OS level) that also affects the global behaviour of the DataDeps package. I have to say, this mechanism of indirect logic injection is quite fragile.

alyst avatar Aug 30 '18 12:08 alyst

So I thought a little about why I made download always force a download. As far as I remember, DataDeps just checks if the folder for the dataset exists (not the files in it). It can't do much more because DataDeps doesn't know what you do in the post-fetch-method. Now MLDatasets offers a user to specify a dir in all the methods (e.g. MNIST.traintensor(dir="./MyMNIST")), including MNIST.download. I think it would be strange that a download chose to not trigger, just because the user created this empty directory already. So I decided to just force the download when explicitly calling download. That's also why I have the datafile, which basically is my workaround for requiring individual files.

I am not against a force keyword argument and it defaulting to false, but it would have to behave how a user expects it to. (i.e. checking if any of the data files is missing and if so trigger the re-download).

Probably the way to do it would be to add a new function maybe_download_dep that has the additional force parameter, and a list of the files that are expected to exist. This function first checks the existence (unless force) and then calls download_dep. All the other parts of download.jl would remain unchanged I think. MNIST.download etc would then call maybe_download_dep instead (and additionally provide a vector of filenames).

Evizero avatar Aug 30 '18 13:08 Evizero

Yes, some things do still need a refactoring here after functionality was moved up stream. And there likely is still some impedance and things we could move upstream to make it nicer.

IIRC the Core difference between his MLDataSets uses DataDeps and how it is "intended" (I say that loosely) to be used, is that the A MLDataSets function can take to a path to a folder has 3 possibilites

1 If path is nothing then download to default dir and read contents 2 if path is to a empty folder download the data there. And read contents 3 if path is to a full folder read content s

Where as "normal" use of DataDeps is to have it as a default arg value. See possible options are just

1 if not provided download to default directory, and read contents 2 if provided, read contents.

Maybe we can do more and lift the 3 option thing into DataDeps.jl.

Environment vars can be defined inside Julia and don't outlive the Julia session. I am generally happy with the idea of configuring UX via environment variables.

DATADEPS_DISABLE_DOWNLOAD is not something I imagine most users ever needing -- it us mostly for testing purposes.

Anyway upstreaming more functionality is something I am generally in favour of.

Embedding.jl also has stuff built on top of DataDeps.jl, So I think maybe there is a need for something more.

oxinabox avatar Aug 30 '18 15:08 oxinabox

closing as outdated

CarloLucibello avatar Nov 22 '22 05:11 CarloLucibello