GDCC/Globus and Big Data Support

Open qqmyers opened this issue 2 years ago • 2 comments

What this PR does / why we need it: This PR builds on #7325 and earlier work by Scholars Portal/Borealis to add support in Dataverse for Globus-based data transfer to/from a Dataverse-managed S3 store. It is intended as a minimum viable capability that is expected to continue evolving over time.

This PR allows Globus to be used with one or more specific S3 stores. Use of Globus requires the store to be 'public', which turns off support for restriction and embargo in that store. ('Public' indicates that the store is not capable of enforcing Dataverse's per-file access controls. That is the case for Globus, where access control is per folder and Dataverse stores all files for a dataset in one folder.)
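
The rule above can be sketched in a few lines of Python. This is not Dataverse code: the `Store` class, its `public` and `globus_enabled` fields, and the `validate` and `can_restrict` helpers are hypothetical names chosen only to illustrate the constraint that a Globus-enabled store must be public and that a public store cannot offer restriction or embargo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    """Hypothetical stand-in for a Dataverse storage configuration."""
    store_id: str
    public: bool = False          # True: store cannot enforce per-file access controls
    globus_enabled: bool = False  # True: Globus transfer is offered for this store


def validate(store: Store) -> None:
    """Reject a configuration that enables Globus on a non-public store."""
    if store.globus_enabled and not store.public:
        raise ValueError(f"store {store.store_id!r}: Globus requires a 'public' store")


def can_restrict(store: Store) -> bool:
    """Restriction and embargo are only available when the store is not public."""
    return not store.public
```

Under these assumptions, a regular S3 store keeps restriction and embargo, while the Globus-enabled store loses both because access is granted per folder rather than per file.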

Which issue(s) this PR closes:

Closes #7740 Closes #7626 Closes #5994

Special notes for your reviewer: This PR includes PR #7325 as a practical matter (they were deployed/tested together in one branch) and so that the Globus effort can inherit common code cleanup. As with other PRs, differencing against the branch for that PR would show what is unique here.

W.r.t. closing issues: I'm guessing the ones above can actually close, but there are other open Globus issues. I think most are obsoleted by this update, but for those, and even the ones here, some human review before auto-closing might be in order.

Suggestions on how to test this: There is significant setup involved in supporting Globus transfer - this PR for Dataverse, the updated Dataverse Globus app from Scholars Portal/Borealis, and configuration of a Globus S3 connector and various Globus accounts are required. As part of the development effort, there is an EC2 installation of Dataverse and Globus available. This, combined with a local install of the Angular 9 Dataverse Globus app, allows testing.

Testing could/should cover the basic up/download via Globus as well as regression testing w.r.t. other stores not being affected and with normal up/download to the Globus store also being allowed.

Does this PR introduce a user interface change? If mockups are available, please link/include them here: This adds a Globus upload button on the upload pane (when using a Globus-enabled store) and puts a Globus-transfer item in the file download menu (again when using a Globus-enabled store). The upload/download widgets themselves are the separate Dataverse Globus app.

Is there a release notes update needed for this change?: yes

Additional documentation: There's a demo, talk, Data Commons documentation on setup, etc. that I'll start linking here.

qqmyers avatar Aug 04 '22 12:08 qqmyers

Coverage Status

Coverage decreased (-0.2%) to 19.9% when pulling c554ecc0622188c1ada2500f34efc95bfd21ccd1 on GlobalDataverseCommunityConsortium:GDCC/DC-1 into 454f3f1c448208dbadcf38f8ca796963bfc07bf8 on IQSS:develop.

coveralls avatar Aug 08 '22 21:08 coveralls

With #7325 merged, this is down to 47 files changed from 70!

qqmyers avatar Aug 18 '22 15:08 qqmyers

I tested with http://ec2-44-207-2-30.compute-1.amazonaws.com as of de2d8b3bc7 of this PR and https://github.com/scholarsportal/dataverse-globus/pull/3/commits/5576ce12450ab6d6cc9ee66365a9a2e464d8b1e3

(I realize there's now a couple more commits. As of a243db6 I'm seeing this error: Error: RemoteOverlayAccessIOTest.testRemoteOverlayFiles:101 expected: <true> but was: <false>.)

I had a little trouble getting dataverse-globus (the transfer tool) running but @lubitchv helped me out. We agreed that I'd push a branch to adjust the README but I haven't done this yet.

I logged in with dataverseAdmin. Initially I logged into Globus with HarvardKey but later switched to [email protected]. I'll explain this below.

Rather than running Globus on my Mac I searched for "EuPathDB Public Data".

Below are some notes and screenshots. In this initial pass, I'm just getting the screenshots in place. Then I'll edit a bit for clarity.

Errors or "transfer submitted" should persist until user x's it out. Something like "preparing transfer" flashes but is very easy to miss!

  • https://github.com/scholarsportal/dataverse-globus/blob/5576ce12450ab6d6cc9ee66365a9a2e464d8b1e3/src/app/navigate-template/navigate-template.component.ts#L508

I can't log out of Dataverse Globus Transfer Tool.

  • Go to https://www.globus.org and log out there.

Non-obvious that you can double-click in the left pane to open folders.

API token shows up in browser console log.
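
One common way to avoid this kind of leak is to mask token-like strings before anything is written to a log. The sketch below shows the idea in Python; the transfer tool itself is an Angular app, so this is an illustration of the technique rather than a patch. It assumes tokens are UUID-shaped (8-4-4-4-12 hex digits), and `redact_tokens` is a hypothetical helper name.

```python
import re

# Assumption: tokens are UUID-shaped; adjust the pattern if the real format differs.
_TOKEN_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)


def redact_tokens(message: str) -> str:
    """Replace each token-like substring with its first four characters plus a mask."""
    return _TOKEN_RE.sub(lambda m: m.group(0)[:4] + "****", message)


if __name__ == "__main__":
    # Made-up token value, for demonstration only.
    print(redact_tokens("apiToken=12345678-1234-1234-1234-123456789abc loaded"))
```

Keeping the first few characters lets someone debugging still tell two tokens apart without exposing a usable credential.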

Rename Task to GlobusTask? (Like how we have DataFile instead of File.)

The overall question is, when something goes wrong (in this case, we suspect a cert error) and no feedback is given in the transfer tool or Dataverse, where can the user go for more information? Where can sysadmins go for more information? Also, obviously, can either the transfer tool or Dataverse give information about the failure to the user directly?

From "new dataset" page we see that we have to save the dataset before uploading via Globus

Screen Shot 2022-08-29 at 11 30 26 AM

dataset created

Screen Shot 2022-08-29 at 11 31 09 AM

New "Upload from Globus" button (and new warning at top)

Screen Shot 2022-08-29 at 11 31 28 AM

Screen Shot 2022-08-29 at 11 31 59 AM Screen Shot 2022-08-29 at 11 32 42 AM Screen Shot 2022-08-29 at 11 33 28 AM Screen Shot 2022-08-29 at 11 39 27 AM Screen Shot 2022-08-29 at 11 40 22 AM Screen Shot 2022-08-29 at 11 41 50 AM Screen Shot 2022-08-29 at 11 50 07 AM Screen Shot 2022-08-29 at 12 05 03 PM Screen Shot 2022-08-29 at 1 33 37 PM Screen Shot 2022-08-29 at 1 34 42 PM Screen Shot 2022-08-29 at 1 34 57 PM Screen Shot 2022-08-29 at 1 35 07 PM Screen Shot 2022-08-29 at 1 35 46 PM Screen Shot 2022-08-29 at 1 46 22 PM Screen Shot 2022-08-29 at 1 46 31 PM Screen Shot 2022-08-29 at 1 47 18 PM Screen Shot 2022-08-29 at 1 47 28 PM Screen Shot 2022-08-29 at 1 47 42 PM Screen Shot 2022-08-29 at 1 49 16 PM Screen Shot 2022-08-29 at 2 18 13 PM Screen Shot 2022-08-29 at 2 19 32 PM

pdurbin avatar Aug 29 '22 18:08 pdurbin

Success! I upgraded to 84b393c48f and I was able to upload a file from Globus.

About to click "Submit Transfer"

Screen Shot 2022-08-30 at 1 43 35 PM

Message shown about start of transfer

Screen Shot 2022-08-30 at 1 43 40 PM

Back on the dataset page (haven't refreshed yet)

Screen Shot 2022-08-30 at 1 43 48 PM

Clicking refresh shows "Globus Transfer in Progress"

Screen Shot 2022-08-30 at 1 43 56 PM

The file is there! Somewhat oddly the message about the need to publish is gone and the "Publish Dataset" button is disabled.

Screen Shot 2022-08-30 at 1 44 54 PM

Wait a while and refresh and now the publish button is available as well as the message about the need to publish.

Screen Shot 2022-08-30 at 1 49 39 PM

Notification is present but worded a little strangely. The dataset was uploaded? I thought a file or files were uploaded?

Screen Shot 2022-08-30 at 2 26 56 PM

pdurbin avatar Aug 30 '22 17:08 pdurbin

@qqmyers and I did a demo during tech hours. Overall, it looks good but there was definitely a problem with download. That's probably where I'll focus next.

Meanwhile, I typed up some feedback for both @lubitchv and @qqmyers for the two apps in play. I'm using the term "punchlist" from my days in construction. I hope no one is offended! These lists repeat some of the comments above. The idea is that these lists contain the latest thinking. They are roughly in order of what I feel is the priority.

dataverse-globus repo (transfer tool) punchlist:

  • API token shows up in browser console log.
  • "Up one folder" doesn't work, at least on Firefox. Update: see https://github.com/scholarsportal/dataverse-globus/issues/5
  • Non-obvious that you can (indeed have to!) double click folders to drill into them. Add plus (+) to left of folder icons to help reinforce that it's a folder (like OSF)? (A single click selects the folder for upload.)
  • Too easy to accidentally select a folder. Have to click it on the right to remove it.
  • No indication of file size.
  • Can't click magnifying glass under "Search for Active Endpoints". You have to hit Return.
  • Document how to log out by going to https://www.globus.org and logging out there.
  • In docs, provide guidance on where to install dataverse-globus. Same server as Dataverse (getting the ports right)? Separate server? See Jim's slides with arrows: https://osf.io/8gn7d
  • Should we add a "Close" button? Or buttons? You can close the popup by hitting Esc. You can close the browser tab in the normal way.
  • Rename basicGlobusToken to whatever we rename :BasicGlobusToken to (see below).

dataverse repo (this PR) punchlist:

  • Download isn't working (tech hours demo). Update: fixed. Works at the dataset level.
  • When download is invoked, it takes over the whole browser window (should open a tab instead). Update: fixed.
  • Lots of docs feedback already. Can we address it? Should I help?
    • Move content from Google doc to guides.
    • Bullets don't look right.
    • Typos and rewording.
    • Etc.
  • Rename :BasicGlobusToken to :GlobusTokenBasic (or :GlobusBasicToken)? Update: fixed.
  • Rename Task to GlobusTask. Update: fixed.
  • The "needs to be published" message disappears when the files from Globus first appear (refresh happens automatically). Same thing with ingest? I can help test that. Update: fixed? Intermittent? Not sure.
  • Reword "Publicly-accessible storage – Files in this dataset may be readable outside Dataverse, restricted and embargoed access are disabled" (I can help with this). Update: this is probably fine for now. Perhaps it could use the "brand name" like LibraData instead of "Dataverse".
  • Allow the polling/monitoring interval to be changed (hard coded at 60 seconds). Update: fixed.
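
For the last item, a configurable polling interval might look roughly like the Python sketch below. It is not the code in this PR: `wait_for_transfer`, `get_status`, and `max_polls` are hypothetical names, and the status strings are assumed to follow Globus task states such as ACTIVE, SUCCEEDED, and FAILED. The default of 60 seconds mirrors the previously hard-coded value.

```python
import time
from typing import Callable

DEFAULT_POLL_SECONDS = 60  # the previously hard-coded interval


def wait_for_transfer(
    get_status: Callable[[], str],
    poll_seconds: float = DEFAULT_POLL_SECONDS,
    max_polls: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it reports a terminal state or max_polls is reached."""
    status = "ACTIVE"
    for _ in range(max_polls):
        status = get_status()
        if status in ("SUCCEEDED", "FAILED"):
            return status
        # Wait the configured interval before asking again.
        sleep(poll_seconds)
    return status
```

Passing `sleep` as a parameter is a design choice that keeps the loop testable without real delays; an installation would only need to override `poll_seconds`.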

Notes:

  • Files must be public. Not restricted, not embargoed.
  • Our integration with Globus only works through an S3 store, not a file system or other stores.

pdurbin avatar Aug 30 '22 20:08 pdurbin

I just merged this PR and put a bunch of screenshots over in the issue that has to do with UI for Globus: https://github.com/IQSS/dataverse/issues/7626#issuecomment-1251369839

I wasn't able to test writing from Dataverse to any endpoint other than my laptop. I've applied for access through Harvard and would be happy to test this if the server is still up.

I put "Update: whatever" in the lists above. In short, I feel like Globus support is ready to ship as experimental. Thanks @JayanthyChengan @lubitchv and @qqmyers!

pdurbin avatar Sep 19 '22 18:09 pdurbin