lucky-parking
Create data cleaning pipeline in AWS
Overview
We need to create a data cleaning pipeline that takes raw input data from the Socrata API and updates the AWS database with correctly formatted geospatial data.
Action items
- [x] Create list of data cleaning steps
- [x] Create code to clean data accordingly
- [x] Decide on database technology
- [ ] Do some tests
- [ ] Deploy pipeline to AWS
Resources/Instructions
@gregpawin Please provide an update
- Progress
- Blockers
- Availability
- ETA
- Progress: Created preprocess.py. Still needs work.
- Blockers: Need to figure out how to implement in AWS Glue. Also, need to finish car/aliases.
- Availability: Couple hours/week
- ETA: 1-2 weeks
- Progress: Created Lambda function to download whole dataset and created Glue table, but stopped before doing ETL.
- Blockers: Maybe this isn't necessary. Need to discuss next project redesign with PM.
- Availability: Couple hours/week
- ETA: 1-2 weeks
Cleaned data can be created via the `make data` command on the citation analysis branch.
Reevaluating how often data needs to be kept up to date.
Was wondering about the status of this. The most recent citations I see in the database are from April 1, 2021. I think that's plenty of data to work with for now but the link to the preprocess.py script above is broken and I was wondering if we could put the existing data processing code somewhere and document its progress/usage.
@gregpawin This issue has not had an update since 8/3/21. If you are no longer working on this issue, please let us know. We would appreciate any closing comments on why work on this issue stopped, or any other notes that never got added to the issue. If you are still working on the issue, please provide an update using these guidelines:
- Progress: "What is the current status of your project? What have you completed and what is left to do?"
- Blockers: "Difficulties or errors encountered."
- Availability: "How much time will you have this week to work on this issue?"
- ETA: "When do you expect this issue to be completed?"
- Pictures (if necessary): "Add any pictures that will help illustrate what you are working on."
This issue is a DRAFT for now, but anyone can update the sections based on the format below, especially the Overview section. Once we know what needs to be done and why we can prioritize whether to work on this issue.
Dependencies
ANY ISSUE NUMBERS THAT ARE BLOCKERS OR OTHER REASONS WHY THIS WOULD LIVE IN THE ICEBOX
Overview
WE NEED TO DO X FOR Y REASON
Action Items
A STEP BY STEP LIST OF ALL THE TASK ITEMS THAT YOU CAN THINK OF NOW EXAMPLES INCLUDE: Research, reporting, etc.
Resources/Instructions
REPLACE THIS TEXT -If there is a website which has documentation that helps with this issue provide the link(s) here.
Progress: Finished setting up IAM roles and permissions for AWS Glue job/role
Blockers: Taking time to learn how AWS Glue works, i.e. writing custom transforms in Python
Availability: Will set aside at least 2 hours to work on it.
ETA: I think I can have a beta version up in a week.
Pictures (if necessary):
Progress: Still learning PySpark. Applied custom mapping, using the visual editor to create boilerplate code.
Blockers: Learning PySpark
Availability: Will work on it more over the weekend.
ETA: I hope by next week.
Progress: Created DynamoDB table. Discussing with Glen whether we want to go with Dynamo or EC2 with MongoDB instead. It might also be good to have an API built in to interact with the DB.
Blockers: Working on custom transforms and discussing design with dev team
Availability: Will work on it more over the weekend.
ETA: I hope by next week.
Progress: Created script to find the last updated date from the API. Created a Lambda to download the latest CSV and upload it to an S3 bucket.
Blockers: Working on custom transforms and discussing design with dev team
Availability: Will work on it more over the weekend.
ETA: I hope by next week.
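For reference, the last-updated check described in this update could look roughly like the sketch below. It is not the actual script. It assumes the Socrata metadata endpoint lives at `/api/views/<dataset_id>.json` and reports the last data change as epoch seconds in a `rowsUpdatedAt` field. The function names and the `domain` and `dataset_id` parameters are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone


def fetch_metadata(domain: str, dataset_id: str) -> dict:
    """Download dataset metadata as a dict.

    The URL pattern is an assumption about the Socrata metadata endpoint;
    `domain` and `dataset_id` are placeholders supplied by the caller.
    """
    url = f"https://{domain}/api/views/{dataset_id}.json"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def last_updated(metadata: dict) -> datetime:
    """Convert the (assumed) `rowsUpdatedAt` epoch-seconds field to UTC."""
    return datetime.fromtimestamp(metadata["rowsUpdatedAt"], tz=timezone.utc)


def needs_refresh(metadata: dict, last_loaded: datetime) -> bool:
    """Return True when the source data changed after our last load."""
    return last_updated(metadata) > last_loaded
```

A Lambda handler could call `fetch_metadata` and only download the CSV to S3 when `needs_refresh` returns True, which would avoid re-downloading an unchanged dataset.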
Progress: Met with dev lead to decide on database technology. We will go with MongoDB rather than DynamoDB to take advantage of its geospatial functions.
Blockers: None
Availability: A few hours this week
ETA: I hope by this week.
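To illustrate what "correctly formatted geospatial data" could mean for MongoDB, here is a minimal sketch, not the project's actual code. MongoDB's geospatial queries work on GeoJSON objects, whose coordinates are ordered `[longitude, latitude]`. The function name is hypothetical, and the sketch assumes the cleaned data already has coordinates in decimal degrees.

```python
from typing import Optional


def to_geojson_point(lat: float, lon: float) -> Optional[dict]:
    """Build a GeoJSON Point, or return None for out-of-range coordinates.

    Assumes `lat` and `lon` are decimal degrees; any other coordinate
    system would need converting before this step.
    """
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    # GeoJSON order is longitude first, then latitude.
    return {"type": "Point", "coordinates": [lon, lat]}


# With pymongo, a field holding these points can be indexed for
# geospatial queries, for example:
#   collection.create_index([("location", "2dsphere")])
```

Returning None for invalid coordinates lets the pipeline drop or flag those rows before loading, rather than failing on insert.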