311-data icon indicating copy to clipboard operation
311-data copied to clipboard

Make 311 Data's request data easily accessible for data scientists

Open nichhk opened this issue 2 years ago • 6 comments

Overview

Currently, we have no way of providing our clean request data to our data scientists. @priyakalyan is working on PR #1257, but this script takes a long time to execute to get all the data.

I see two solutions to this:

  1. Make our SQL DB accessible to data scientists + create Python methods that read from the DB and produce dataframes.
  2. During our nightly Prefect runs, add a step that also writes the request data out as a csv file to S3.

I'm leaning towards (1) right now, since it will use less disk space and has fewer moving parts. (1) will take longer to implement, however, since we currently don't have access to make infrastructure changes (waiting for SSH keys or Terraform configs from @mattyweb).

Action Items

  • [ ] Get SSH key passphrase for DB access
  • [ ] Make our SQL DB accessible to data scientists + create Python methods that read from the DB and produce dataframes

nichhk avatar Jun 24 '22 18:06 nichhk

@joshuayhwu , who might be interested in working on this.

nichhk avatar Jun 24 '22 18:06 nichhk

I was taking a look at the lacity API config and attempting to connect to postgres locally. I was wondering what's the passcode for the database, if any? (the code said it defaults to none but psycopg2 said db refused connection)

joshuayhwu avatar Jun 30 '22 02:06 joshuayhwu

Please take a look at this section of the Terraform README. Basically, we cannot directly access our DB right now because it's in a virtual private network (or cloud?). It can only be access through the bastion server, but we don't have the SSH key from Matt yet :(

nichhk avatar Jun 30 '22 20:06 nichhk

Interesting. What are your thoughts on directly accessing the Socrata API and building a separate db?

joshuayhwu avatar Jun 30 '22 21:06 joshuayhwu

Hmm I think that would help us get a solution faster, but we'd have to maintain another component in an already too-complicated system.

If we think that this task is highest priority (i.e., we want it completed in the coming weeks), I would go with option 2 listed in the first comment.

nichhk avatar Jun 30 '22 21:06 nichhk

Sounds good. We can discuss more tonight.

joshuayhwu avatar Jun 30 '22 21:06 joshuayhwu

Hey @joshuayhwu Do you have an update for us on this issue?

Please update:

  • Progress:
  • Blockers:
  • Availability:
  • ETA:

Thanks!

mc759 avatar Dec 13 '22 03:12 mc759