datahub [EPIC] Automate Data Retrieval Requests

Currently, students need to raise a request on GitHub to get a copy of their archived files (https://github.com/berkeley-dsep-infra/datahub/issues/2866). This also creates manual work for @felder, and there are also way more requests than I had thought.

In the text file telling students what to do, we ask them to open an issue here. Instead, we can just provide the signed URL automatically there, so they can self-serve the files without having to bother us. This eliminates an entire class of service requests we need to handle.

Oct 18 '21 19:10 yuvipanda

@yuvipanda my understanding of signed URLs is that their max duration is 7 days.

https://cloud.google.com/storage/docs/gsutil/commands/signurl

We'd really need something that could produce these on the fly, based on user auth, in order to automate this.

Oct 18 '21 19:10 felder

ah damn.

Yeah, in that case I agree it means we've to write some code here. I don't think it needs us to introduce another layer of auth tho, we can just implement signed URLs ourselves with GCP KMS

1. During archival, create a URL that contains all the info needed to figure out how to fetch the file.
2. Sign this URL with a KMS key, and add the signature as a query param or something (gotta do this carefully).
3. Write a simple service that, when accessed, validates this signature and then lets the user download the actual file.
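
A minimal sketch of the sign-and-verify steps above, under stated assumptions: it uses a local HMAC-SHA256 secret from the Python standard library as a stand-in for a GCP KMS key, and `SECRET_KEY`, `BASE_URL`, and the `user` / `archive` / `expires` parameter names are all hypothetical. It is meant to illustrate the idea, not to be the eventual implementation.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical values. In the proposal the key would live in GCP KMS,
# not in the source code; HMAC with a local secret is only a stand-in.
SECRET_KEY = b"replace-with-a-real-secret"
BASE_URL = "https://retrieval.example.org/download"


def _signature(params: dict) -> str:
    """Sign a canonical (sorted, URL-encoded) form of the parameters."""
    canonical = urlencode(sorted(params.items()))
    return hmac.new(SECRET_KEY, canonical.encode(), hashlib.sha256).hexdigest()


def make_signed_url(user: str, archive: str, expires_at: int) -> str:
    """Build a URL carrying everything needed to fetch the file, plus a signature."""
    params = {"user": user, "archive": archive, "expires": str(expires_at)}
    params["sig"] = _signature(params)
    return f"{BASE_URL}?{urlencode(params)}"


def verify_signed_url(url: str, now: Optional[int] = None) -> bool:
    """Return True only if the signature matches and the URL has not expired."""
    params = dict(parse_qsl(urlsplit(url).query))
    sig = params.pop("sig", "")
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(sig.encode(), _signature(params).encode()):
        return False
    now = int(time.time()) if now is None else now
    return now <= int(params["expires"])
```

Because the signature covers every parameter, changing `user` or `archive` in the query string invalidates the URL, and since we pick the expiry ourselves the lifetime is not limited to 7 days.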

Oct 18 '21 19:10 yuvipanda

@yuvipanda It would be fantastic to automate such requests, considering the FERPA requirements and the additional strain on @felder's bandwidth!

How complex do you expect writing this automation service to be? I am conscious of our backlog and want to avoid adding more requests on your end right now. Let me know!

Oct 18 '21 20:10 balajialg

@yuvipanda my concern here is that unless the URL is obfuscated (not a big fan of security by obscurity though), other students may be able to figure out how to gain access to data they should not be able to access. That's the reasoning behind me saying we may want to tie it to auth of some sort.

Oct 18 '21 21:10 felder

@felder absolutely agree it shouldn't be security by obscurity, it should be fairly strong crypto. I think a simple signature where we keep the key private would be good enough. If people can guess those signed URLs most of the crypto we rely on would be considered broken.

Good question on complexity, @balajialg. I'll try to investigate that.

Oct 18 '21 21:10 yuvipanda

@yuvipanda yeah I wouldn't expect people to guess the signed URLs themselves! I'm referring more to the query string parameters that would be used to generate them.

Oct 18 '21 21:10 felder

yeah, i think the signing means it doesn't matter what the user can guess.

However, I think given my current workload, I won't be able to build this anytime soon. So please don't block on it if other privacy preserving workflow changes need to happen.

Oct 26 '21 20:10 yuvipanda

@yuvipanda @felder We have a couple of options in the short term:

1. Continue serving requests the same way, hoping that they are automated in the future, or explore creating private issues in GitHub.
2. Shift to a ticketing system of choice for such requests alone.

I am leaning toward 2. What do you both think?

Oct 27 '21 02:10 balajialg

@balajialg I'm inclined toward 2 as well. However, I do not believe these requests should be considered in a vacuum. We may opt to move these requests first, but ultimately we should consider it as a trial run for a general support process for Berkeley specific operational issues.

Oct 29 '21 19:10 felder

@felder When you say support process, you mean the regular requests we get, right? Package requests, admin access, RAM elevation, etc. Or are you also considering bugs being reported?

If it is a bug, I wonder how issues such as this would be fixed, as they have an upstream dependency and would require interaction with other developers/admins! Let's discuss more during the sprint planning meeting (let's see whether we will be able to wrap up this discussion in time).

Oct 30 '21 01:10 balajialg

I think it might be helpful to have something else that contains possible private information - but I'd love for most things to stay as public as possible here.

Oct 30 '21 06:10 yuvipanda

@balajialg @yuvipanda Anything reported by a student or regarding a specific student where FERPA would apply.

Basically I'd like to start thinking about datahub the UCB-specific service vs datahub the open source software project (not to be confused with datahub the proposed building 😃), with service-related requests having a private ticketing system. Note that requests that require development resources to resolve can have GitHub issues created for them.

I understand that transparency is important, but I do think there are plenty of support requests that don't really require any development resources to fix and probably would not be of that much interest to anyone else.

Individual issues regarding, say, RStudio not launching would fall into this as well, as opposed to generalized solutions such as terminating RStudio gracefully on logout, which would remain here in GitHub.

Nov 01 '21 19:11 felder

@felder Got it! I wonder whether reporting bugs through different systems (based on the nature of the bug) will be a cumbersome support experience for users, as most users would not care to know whether their issue should be raised via GitHub or a ticketing system. For example, the RStudio use case you highlighted.

I am personally aligned with moving chores to a support system (if that is something you feel strongly about) but keeping feature enhancements and reported issues (since many issues are correlated with package requests) on GitHub, considering that they may depend on upstream fixes. Thoughts?

@felder @yuvipanda I did some analysis on the distribution of requests that we get every month. This is what it looks like for the past three months:

August: Package Requests: 11, Admin Requests: 4, Issues: 1, File Requests: 0, RAM: 0

September: Package Requests: 11, File Requests: 4, Issues: 3, Admin Requests: 1, RAM: 1

October: Package Requests: 8, File Requests: 7, Issues: 2, Admin Requests: 1, RAM: 1

Based on the frequency and volume, the routine support requests that really matter are the "package installation/upgrade" and the "retrieval of the file" requests.
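
As a quick check on that conclusion, the monthly counts above can be totalled per category. This is only a small illustrative tally; the dictionary layout and variable names are made up for the example, while the numbers are the ones listed above.

```python
from collections import Counter

# Monthly request counts as listed above (August to October 2021).
monthly = {
    "August": {"Package": 11, "Admin": 4, "Issues": 1, "File": 0, "RAM": 0},
    "September": {"Package": 11, "File": 4, "Issues": 3, "Admin": 1, "RAM": 1},
    "October": {"Package": 8, "File": 7, "Issues": 2, "Admin": 1, "RAM": 1},
}

# Sum each category across the three months.
totals = Counter()
for counts in monthly.values():
    totals.update(counts)

for category, count in totals.most_common():
    print(f"{category}: {count}")
```

Over the three months, package requests (30) and file requests (11) account for 41 of the 55 requests, and file requests grow from 0 to 4 to 7.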

Nov 01 '21 21:11 balajialg

It might be useful to create a service that is proxied by the user's server which can generate these URLs or invoke various APIs. For example it could be a tornado/flask app that runs on a random port in the user's pod and is proxied by jupyter-server-proxy. It would be behind the hub's authentication. There could be a launcher in retro's New > dropdown or in the Lab launcher that invokes /user/{username}/data-retrieval, or we could advertise a URL of the form https://{hub}.datahub.berkeley.edu/user-redirect/data-retrieval which would redirect to that user's service.

I'm not sure about the full details of the signed URL and retrieval process, so this idea might require iteration.
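
A rough sketch of what such a per-user service could look like, with several assumptions: it uses Python's standard-library `http.server` in place of tornado or flask to stay dependency-free, `list_archives` is a hypothetical placeholder for whatever lookup or signed-URL generation the real service would do, and the jupyter-server-proxy registration is not shown. It relies on `JUPYTERHUB_USER`, which JupyterHub sets in the single-user server's environment.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def list_archives(user: str) -> list:
    """Hypothetical placeholder: look up (or sign URLs for) the user's archives."""
    return []


class DataRetrievalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The service runs inside the user's pod behind the hub's auth,
        # so the user identity comes from the environment, not the request.
        user = os.environ.get("JUPYTERHUB_USER", "unknown")
        body = json.dumps({"user": user, "archives": list_archives(user)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging in this sketch.
        pass


def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost only; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), DataRetrievalHandler)


if __name__ == "__main__":
    server = make_server()
    print(f"listening on port {server.server_address[1]}")
    server.serve_forever()
```

Binding to localhost means the service is only reachable through the proxy in the user's own server, which is what keeps it behind the hub's authentication.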

Feb 03 '23 00:02 ryanlovett
