File API download bypasses terms of use

Open scolapasta opened this issue 9 years ago • 4 comments
Currently, when you download a file through the UI, all logic for creating a GuestbookResponse row is done before hitting the API to download the file.

If you download the file directly from the API, no row is created here, so the count does not go up. This also bypasses the terms of use and guestbook completely. We need to make sure that a) a row gets created, so counts are accurate, and b) we decide how we want to handle the bypassing of the terms of use (via a token?) rather than just acting like they don't exist.

scolapasta avatar Feb 03 '16 21:02 scolapasta

Updating this to cover terms of use and not increasing download count, which is covered in #3331.

djbrooke avatar Oct 11 '16 21:10 djbrooke

We are looking forward to this functionality because we are facing copyright issues with organizations that harvest our Dataverse using the API. Since they get a direct download link to the file and put it on their sites, users are downloading the files without any knowledge of, or agreement to, the terms of use.

solhm avatar Jan 26 '18 16:01 solhm

@solhm thanks for your comment. I just brought up "File API download bypasses terms of use" with @djbrooke @scolapasta and @sekmiller while discussing #3758.

pdurbin avatar Oct 28 '19 16:10 pdurbin

@alejandratenorio brought up this issue today and we've been discussing it at https://dataverse.zulipchat.com/#narrow/stream/379856-security/topic/No.20Restricted.20Files.20.2F.20Access.20conditions/near/427955246

pdurbin avatar Mar 20 '24 18:03 pdurbin

Hi all,

Possibly CIMMYT could collaborate on this. As @pdurbin suggested, we would like to have a proposal validated by you before any development. We think that the file download could work as follows, (it's our proposal v. 0.2):

These are our assumptions:

  • You can create an API token only if you have a user on a Dataverse instance. At least we have each user's last name, first name and email address. Desirably the Affiliation.
  • Using the API, anyone can download files with no access restrictions.
  • If someone uses an API token, we can identify the user associated with that token, can't we?
  • As a user, when you request access to a restricted datafile you must accept the Terms of Access for Restricted Files.

File download: Terms of use: As a user, when you request access to a restricted datafile you must accept the Terms of Access for Restricted Files and its Terms of use.

  • If its dataset has no Terms of use & the datafile has no access restrictions:
    • No changes.

  • If its dataset has no Terms of use & the datafile has access restrictions:
    • No changes, an API token is required.

  • If its dataset has Terms of use & the datafile has no access restrictions:
    • An API token is required because we must be sure that a user accepts the terms of use.
    • The API would download the file with its terms of use as a txt file.

  • If its dataset has Terms of use & the datafile has access restrictions:
    • An API token is required.
    • Would the API download the file with its terms of use as a txt file? If the user has already accepted the terms of use, is it necessary?

Guestbook:

  • If a Dataset has no guestbook & the datafile has no access restrictions:
    • No changes.

  • If a Dataset has no guestbook & the datafile has access restrictions:
    • No changes, an API token is required.

  • If a Dataset has guestbook & the datafile has no access restrictions:
    • A token will always be required, and the API would create a GuestbookResponse row with the user's first name, last name and email.

  • If a Dataset has guestbook & the datafile has access restrictions:
    • A token will always be required, no changes.
    • The API would create a GuestbookResponse row with the user's first name, last name and email.
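
A minimal sketch of the v. 0.2 rules listed above, for discussion only. The names `ApiUser`, `token_required` and `guestbook_row` are hypothetical and do not correspond to existing Dataverse code; the returned dict merely stands in for a GuestbookResponse row.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApiUser:
    """Hypothetical view of the account behind an API token."""
    first_name: str
    last_name: str
    email: str


def token_required(restricted: bool, has_terms: bool, has_guestbook: bool) -> bool:
    """A token is needed for restricted files (no change) and, under the
    proposal, also when the dataset has Terms of use or a guestbook."""
    return restricted or has_terms or has_guestbook


def guestbook_row(user: Optional[ApiUser], has_guestbook: bool) -> Optional[dict]:
    """Return the fields for a guestbook row, or None if no guestbook applies."""
    if not has_guestbook:
        return None
    if user is None:
        # Assumption: a download without a token is rejected outright.
        raise PermissionError("API token required: dataset has a guestbook")
    return {
        "first_name": user.first_name,
        "last_name": user.last_name,
        "email": user.email,
    }
```

Note that this sketch only covers the three account fields named in the proposal; it does not address guestbooks with additional required questions.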

We underline the proposed changes. Please let me know your comments and whether this proposal is feasible.

alejandratenorio avatar Apr 02 '24 17:04 alejandratenorio

Hi all,

Due to some observations and comments, we have adjusted our proposal: These are our assumptions:

  • You can create an API token only if you have a user on a Dataverse instance. At least we have each user's last name, first name and email address. Desirably the Affiliation.
  • Using the API, anyone can download files with no access restrictions.
  • If someone uses an API token, we can identify the user associated with that token, can't we?
  • As a user, when you request access to a restricted datafile you must accept the Terms of Access for Restricted Files.

CIMMYT Proposal - File download:

  • Proposed changes are highlighted in italics.

Guestbook: Since not all institutions may require these restrictions, we propose adding a global setting to enable this new functionality.

  • If a Dataset has no guestbook & the datafile has no access restrictions:
    • No changes.

  • If a Dataset has no guestbook & the datafile has access restrictions:
    • No changes, an API token is required.

  • If a Dataset has guestbook & the datafile has no access restrictions:
    • A token will always be required, and the API would create a GuestbookResponse row with the user's first name, last name and email.

  • If a Dataset has guestbook & the datafile has access restrictions:
    • A token will always be required, and the API would create a GuestbookResponse row with the user's first name, last name and email.

Terms of use:

  a. As a user, when you request access to a restricted datafile you must accept the Terms of Access for Restricted Files and its Terms of use.
  b. Since not all institutions may require these restrictions, we propose adding a global setting to enable this new functionality.

  • If its dataset has no Terms of use & the datafile has no access restrictions:
    • No changes.

  • If its dataset has no Terms of use & the datafile has access restrictions:
    • No changes, an API token is required.

  • If its dataset has Terms of use & the datafile has no access restrictions:
    • When a bot or user attempts to download a datafile directly from the API, they will not receive the datafile itself; instead, they will download a PDF or TXT file containing all the metadata of the datafile and data about the user or bot attempting the download: user agent, IP, date and time.
    • Additionally, a message will be added to the file similar to: "If you wish to download the datafile XXXX, please go to [insert Datafile URL]."
    • At the end of the file, a text will also be added mentioning that the datafile is subject to usage restrictions and explicit approval is required.

  • If its dataset has Terms of use & the datafile has access restrictions:
    • An API token is required.
    • No changes, see section A of this block.
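
To make the "Terms of use, no access restrictions" case concrete, here is a rough sketch of how the TXT notice could be assembled. The function name `build_notice`, its parameters and the exact wording of each line are assumptions for illustration; nothing here reflects existing Dataverse behavior.

```python
from datetime import datetime, timezone


def build_notice(file_name: str, file_url: str, metadata: dict,
                 user_agent: str, ip: str, when: datetime) -> str:
    """Build the text returned instead of the datafile (hypothetical layout)."""
    # Datafile metadata first, one "key: value" pair per line.
    lines = [f"{key}: {value}" for key, value in metadata.items()]
    # Then the data about whoever attempted the download.
    lines += [
        f"User agent: {user_agent}",
        f"IP: {ip}",
        f"Date and time: {when.isoformat()}",
        "",
        f"If you wish to download the datafile {file_name}, please go to {file_url}.",
        "",
        "This datafile is subject to usage restrictions and explicit approval is required.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_notice(
        "survey.csv",
        "https://demo.example.org/file/42",  # placeholder URL
        {"Title": "Survey data", "Size": "12 kB"},
        "curl/8.0",
        "203.0.113.5",
        datetime.now(timezone.utc),
    ))
```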

Private link to accept Terms of Use: a. Since not all institutions may require these restrictions, we propose adding a global setting to enable this new functionality.

  • If a dataset has Terms of use, Dataverse opens a pop-up window with the terms to ensure that the user cannot download the files without accepting the terms.

We would like to hear your comments, if you think it could work.

alejandratenorio avatar Apr 09 '24 16:04 alejandratenorio

@alejandratenorio thank you for the detailed writeup! Overall, I think this makes a lot of sense. A few questions:

  • For this part... If you wish to download the datafile XXXX, please go to [insert Datafile URL]... would the second URL always be the same or would it vary and expire over time? If it's the latter, perhaps we could re-use SignedUrls from #9001.
  • What do you think about making the new behavior the default, since it's more secure... and if installations don't like it, the configuration option could revert to the old behavior?
  • For guestbook, what about required fields that aren't in the user account? Custom questions can be created and set as required, which complicates things.
  • Have you considered getting additional feedback from the Dataverse community by posting at https://groups.google.com/g/dataverse-community ? I think others might have opinions on this! I'll also mention this in our internal Slack (DONE).

pdurbin avatar Apr 09 '24 20:04 pdurbin

Hi @pdurbin,

Thank you very much for your comments.

For this part... If you wish to download the datafile XXXX, please go to [insert Datafile URL]... would the second URL always be the same or would it vary and expire over time? If it's the latter, perhaps we could re-use SignedUrls from GDCC/7715 Signed Urls for external tools #9001.

We propose to use its Persistent Datafile URL, something like "If you wish to download the datafile [datafile name], please go to [Persistent Datafile URL]."

What do you think about making the new behavior the default, since it's more secure... and if installations don't like it, the configuration option could revert to the old behavior?

Yeah, great idea.

For guestbook, what about required fields that aren't in the user account? Custom questions can be created and set as required, which complicates things.

Since we do not have all this information, a solution could also be to download the PDF / TXT file. What do you think?

Have you considered getting additional feedback from the Dataverse community by posting at https://groups.google.com/g/dataverse-community ? I think others might have opinions on this! I'll also mention this in our internal Slack (DONE).

We could have a final proposal together and share it, what do you say?

alejandratenorio avatar Apr 10 '24 16:04 alejandratenorio

Sure. I think I'm still a bit confused about the proposed multistep solution for downloading files. Is it something like this?

  • API user tries to download a file with terms. They get a text file instead.
  • The text file has the URL to download the file.

I guess my question is, do they have to parse the text to find the URL? Will this be easy to do?
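
To illustrate what "parsing the text" might involve, here is a small sketch. It assumes the notice contains the sentence from the proposal ("... please go to [URL].") on a line of its own; the pattern and the function name `extract_download_url` are hypothetical.

```python
import re
from typing import Optional

# Matches the URL that follows "please go to", dropping a trailing period.
NOTICE_PATTERN = re.compile(r"please go to (\S+?)\.?\s*$", re.MULTILINE)


def extract_download_url(notice: str) -> Optional[str]:
    """Return the URL embedded in the notice text, or None if absent."""
    match = NOTICE_PATTERN.search(notice)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "Title: Survey data\n"
        "If you wish to download the datafile survey.csv, "
        "please go to https://doi.org/10.1234/ABC.\n"
        "This datafile is subject to usage restrictions."
    )
    print(extract_download_url(sample))
```

This works only as long as the wording of the message never changes, which suggests that a machine-readable field would be easier for API clients than free text.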

What happens to the existing download URL? It stops working? Now the user gets a text file instead?

We can go back to Zulip if that's easier! 😄

Or maybe a Google doc where I can leave comments here or there?

pdurbin avatar Apr 10 '24 20:04 pdurbin

Google doc is at https://docs.google.com/document/d/15UEJMocWDFABRAPaYkRTTFIOpi_Ra-_z0vu2eABMm6o/edit?usp=sharing

qqmyers avatar May 10 '24 18:05 qqmyers

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

cmbz avatar Aug 20 '24 15:08 cmbz

I left a note at https://dataverse.zulipchat.com/#narrow/stream/379856-security/topic/No.20Restricted.20Files.20.2F.20Access.20conditions/near/464124733 that anyone is welcome to open a fresh issue.

pdurbin avatar Aug 21 '24 17:08 pdurbin

@landreev You had this same experience with the DeCode data correct? Can you summarize anything relevant here? @scolapasta is recommending we revisit this after the SPA API work @cmbz

sbarbosadataverse avatar Mar 12 '25 20:03 sbarbosadataverse

2025/03/17

  • Make sure to review this in context of the SPA requirements for 6.7 (July time frame) as well. This functionality will be needed to support API-only access needed by SPA, too.

cmbz avatar Mar 17 '25 14:03 cmbz

@landreev still need your feedback here from our discussion

sbarbosadataverse avatar Apr 07 '25 14:04 sbarbosadataverse

I've edited the original description, to clarify that the downloads via the api ARE counted; and that the issue is now solely about bypassing the agreement to the terms of use/filling the guestbook.

@sbarbosadataverse I have zero feedback on how this could really be solved (and by "really solved" I mean, how to make API users fill guestbooks and/or legally agree to terms of use). Please talk to @scolapasta about any such plans. But, as I mentioned earlier, there was a CIMMYT developer interested in implementing a solution; we should probably get in touch.

There are a couple of simple solutions:

  1. Go back to what we used to do in DVN, where enabling terms of use automatically closed any API access to the file. (Nobody's going to like this/Unlikely that anyone would agree to go with this)
  2. Let the Dataset/Collection owner decide and configure what they want to do, by giving them a choice: allow API access to continue bypassing the terms/guestbook popups vs. close API access to such files. Neither choice is ideal, but at least they will have a choice, which they don't as of now.
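
Option 2 could look roughly like the following sketch. The enum `ApiAccessPolicy` and the function `api_download_allowed` are made-up names for discussion, not existing settings.

```python
from enum import Enum


class ApiAccessPolicy(Enum):
    """Hypothetical per-dataset/collection setting chosen by the owner."""
    ALLOW_BYPASS = "allow"  # current behavior: API downloads skip the popups
    CLOSE_API = "close"     # API access is closed for files behind terms/guestbook


def api_download_allowed(policy: ApiAccessPolicy,
                         has_terms: bool, has_guestbook: bool) -> bool:
    """Decide whether a direct API download may proceed."""
    if not (has_terms or has_guestbook):
        # Nothing to agree to or fill in, so the policy does not apply.
        return True
    return policy is ApiAccessPolicy.ALLOW_BYPASS
```

Setting `CLOSE_API` on every dataset with terms would amount to option 1.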

landreev avatar Apr 07 '25 14:04 landreev

2025/03/17

* Make sure to review this in context of the SPA requirements for 6.7 (July time frame) as well. This functionality will be needed to support API-only access needed by SPA, too.

@cmbz this is not a problem with downloads that go via the SPA per se. The user ends up redirected to, and downloading the file from, the API endpoint. But that is the case with the current UI as well. However, the fact that the user is going through a UI, either old or new, means we can present them with the proper popups before that redirect happens. It's the case where a user or script goes to the API directly that's a problem. There still is an SPA connection or potential dependency. If we were to implement a "real" solution, where we provide a way for an API user to legally agree to some terms and/or enter required guestbook info ahead of time, it will almost certainly involve having to do this in some UI (possibly a UI behind a captcha, to ensure that this step cannot be automated itself). Then this extra UI would need to be added to the SPA. It may make sense to do any such dev. in the SPA only, since the old UI is going away.

landreev avatar Apr 07 '25 16:04 landreev

2025-04-09

  • First goal: Develop a design proposal for how to provide this functionality
    • One possibility: use a signed URL to ensure caller had read/agreed to terms
    • See comments in issue for other possibilities
    • Also review Alejandra's proposal (starting here: https://github.com/IQSS/dataverse/issues/2911#issuecomment-2032586311)
  • Size refers to the design effort, not implementation of the design: preliminary size = 30
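
As input for the design proposal, here is a generic sketch of the signed-URL idea: after a user accepts the terms in some UI, the server issues a download URL that expires and is bound to that user. This uses a plain HMAC and is not the SignedUrl implementation from #9001; the parameter names (`user`, `until`, `token`), the example URL and the shared `secret` are assumptions. The sketch also assumes `base_url` has no query string of its own.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse


def _signature(base_url: str, user: str, expires: int, secret: bytes) -> str:
    payload = f"{base_url}|{user}|{expires}".encode()
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def sign_url(base_url: str, user: str, secret: bytes,
             ttl_seconds: int = 300, now: Optional[float] = None) -> str:
    """Issue a download URL after the user has accepted the terms."""
    issued = time.time() if now is None else now
    expires = int(issued + ttl_seconds)
    token = _signature(base_url, user, expires, secret)
    query = urlencode({"user": user, "until": expires, "token": token})
    return f"{base_url}?{query}"


def verify_url(url: str, secret: bytes, now: Optional[float] = None) -> bool:
    """Check that the URL is unmodified and has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        user = params["user"][0]
        expires = int(params["until"][0])
        token = params["token"][0]
    except (KeyError, ValueError):
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expected = _signature(base_url, user, expires, secret)
    current = time.time() if now is None else now
    # compare_digest avoids leaking information through timing differences.
    return current <= expires and hmac.compare_digest(expected, token)


if __name__ == "__main__":
    key = b"server-side-secret"  # placeholder
    link = sign_url("https://demo.example.org/api/access/datafile/42", "alice", key)
    print(link, verify_url(link, key))
```

A short lifetime limits how long a link copied to a third-party site keeps working, which relates to the harvesting concern raised earlier in this issue.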

cmbz avatar Apr 09 '25 19:04 cmbz