
Client-side multifile zip download

Open qqmyers opened this issue 3 years ago • 9 comments

What this PR does / why we need it: A possible addition to/replacement for zipping on the server. In this PR, the multi-file download button invokes JavaScript that will download files individually (using direct download if enabled) and create a zip locally, using file names/directoryPaths from the specific datasetVersion being downloaded.
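For reviewers who want the shape of the approach, here is a minimal sketch of the idea, not the PR's actual code: the PR may use a different zip library and different helpers. It assumes a JSZip-style library is loaded on the page and that `files` holds id/label/directoryLabel entries taken from the datasetVersion being downloaded; those names are illustrative.

```javascript
// Illustrative sketch only - the actual implementation in this PR may differ.
// Assumes JSZip is available on the page and `files` is an array of
// { id, directoryLabel, label } objects taken from the datasetVersion.
async function downloadSelectedAsZip(files) {
  const zip = new JSZip();
  for (const f of files) {
    // Each file is fetched individually (direct download, if enabled, would
    // redirect to the storage location instead of streaming through the server).
    const resp = await fetch(`/api/access/datafile/${f.id}?format=original`);
    if (!resp.ok) throw new Error(`Failed to fetch file ${f.id}`);
    const blob = await resp.blob();
    // Preserve the directoryLabel/label hierarchy inside the zip.
    const path = f.directoryLabel ? `${f.directoryLabel}/${f.label}` : f.label;
    zip.file(path, blob);
  }
  // Blob-based generation; scalability for very large datasets is the open question below.
  const zipBlob = await zip.generateAsync({ type: "blob" });
  const url = URL.createObjectURL(zipBlob);
  const a = document.createElement("a");
  a.href = url;
  a.download = "dataverse_files.zip";
  a.click();
  URL.revokeObjectURL(url);
}
```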

Current issues/limitations:

  • It isn't clear that this will work on all browsers
  • There's no error handling yet - it should be possible, for example, to fall back to the server-side zip if something goes wrong or if the browser type/version doesn't support what's needed (see the sketch after this list).
  • It should be more efficient, but I've done minimal scalability testing so far. Nominally, one could allow users to download all files without a size limit. The underlying zip code uses Blobs and Promises and is supposed to scale, but I'm not sure I've configured everything needed for it to do so.
  • The logic that disallows downloading a zip when you're over the size limit has not been changed, so this method is still subject to the same limit for now.
  • The download file is always named dataverse_files.zip as before - the dataset PID/version could potentially be used to create a unique name (with "full" or "partial" to indicate whether some or all files are included).
  • There is currently no manifest file in the zip - it should be possible to add one if desired (or to someday make a Bag).
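On the error-handling point above, one possible fallback pattern (not implemented in this PR) would be to feature-detect and wrap the client-side path, reverting to the existing server-side zip if anything fails. `serverSideZipUrl` below is a hypothetical placeholder, not an actual endpoint name from the PR, and the sketch reuses the `downloadSelectedAsZip` function sketched above.

```javascript
// Hedged sketch of a possible fallback strategy (not part of this PR):
// try the client-side path, fall back to the existing server-side zip URL
// if the browser lacks what's needed or the client-side path throws.
async function downloadWithFallback(files, serverSideZipUrl) {
  const supported = typeof Blob !== "undefined" && typeof Promise !== "undefined";
  if (!supported) {
    window.location.href = serverSideZipUrl;
    return;
  }
  try {
    await downloadSelectedAsZip(files); // sketch above
  } catch (e) {
    console.warn("Client-side zip failed, falling back to server-side zip", e);
    window.location.href = serverSideZipUrl;
  }
}
```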

Which issue(s) this PR closes:

Closes #5864

Special notes for your reviewer: To enable this, I needed to know the datasetVersion in the download code, which required a fix for #5864 - the multifile button doesn't set the datasetVersion in the guestbook by default. If the rest gets delayed, it may be worth pulling out this one-line fix (it's a separate commit).

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

qqmyers · Dec 22 '22 21:12

Coverage Status

Coverage: 20.005% (-0.004%) from 20.009% when pulling bd1396765cee0a0b305ae81a1df0be4719335360 on GlobalDataverseCommunityConsortium:clientsidezip into 1aabf6995305ee17475375322b792cedfd7b2ea2 on IQSS:develop.

coveralls · Dec 22 '22 21:12

This is wonderful! Does it default to the original file format, or does it send surrogate copies?

donsizemore · Dec 22 '22 22:12

Right now it is adding ?format=original when it retrieves each file.
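Concretely (illustrative only - the exact URL construction in the PR may differ), each per-file request looks something like:

```javascript
// Hedged example: request the originally uploaded file rather than the
// ingested (archival) version that a tabular file would return by default.
const fileUrl = `/api/access/datafile/${fileId}?format=original`;
```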

qqmyers · Dec 23 '22 11:12

@qqmyers - @jdmar3 corrects me: from archival theory there is a case to be made either way, but I would argue that prioritizing original file formats over plain-text tabular data incurs tech debt and may make data unusable. Would an "original format" checkbox or some such require much additional work?

donsizemore · Jan 10 '23 15:01

@qqmyers Would it be possible to present an option to add ?format=archival in addition to, or instead of, ?format=original? Also, we're working on some automated user testing for different browsers, so I'd be happy to help with browser testing if needed.

EDIT: @donsizemore beat me to the punch!

jdmar3 · Jan 10 '23 15:01

The Access Dataset menu at the top of the page allows getting either original or archival format. Currently I have not changed those buttons to use client-side zipping, but that's a useful addition once we know it works well for most browsers and sizable datasets.

Both forms are also available at the individual file level, so it is mostly a limitation of the bulk 'Download' button for selected files, regardless of whether the existing server-side zipping or this client-side method is used. I don't want to change that as part of this PR for client-side zipping, but I think both the client-side and the existing server-side algorithms could handle both cases if the user interface work is done to allow it. FWIW: I think the API call to download all files allows you to specify either format as well.

W.r.t. archiving, I would argue that the Bag exports are better than the zip available from the front end (the Bag has fixity info, all the metadata for the dataset, etc.), and since the export is privileged, it doesn't run the risk of files being excluded if you don't have permissions (if I recall correctly, the zip options in the UI include a manifest that lists files that weren't included due to permissions or size limits). There has also been discussion of whether the archival Bag exports should include the ingested formats rather than the originals, but issues have slowed that work, e.g. the fact that Dataverse isn't storing the fixity info for the ingested versions. It would definitely be useful to have some discussion/review of the Bags to decide requirements and priorities.

W.r.t. testing - thanks! The draft PR should work as is, so if we can get a test server (or servers) set up somewhere, it could be tested with different browsers, larger data, etc. I think DataverseNO was going to try to fire one up; Don could probably do that at Odum as well. Assuming that looks promising, I can look into updating the download-all buttons - that shouldn't involve any new risks; if it works for one format, it will work for the other.

qqmyers · Jan 10 '23 18:01

This is not ready for Review/QA (hence the draft status). Testing has shown that the local browser uses significant amounts of memory with large files and can fail with an out-of-memory error. I'm still investigating how to handle this. Perhaps it should not be on the board yet?

qqmyers · Feb 07 '23 16:02

We're excited about it. Let's let Jim size it.

pdurbin · Mar 09 '23 16:03

Sizing:

  • Slid this back to Jim's column as not ready for sizing.

mreekie · Mar 14 '23 15:03