Get basic user metrics we technically have access to
### Overview
We want to be able to see our basic usage metrics:
- how many different IPs are accessing our data via Datasette?
- how many different IPs are accessing our data on S3?
Let's write a script that figures that out and provides metrics!
fly.io currently doesn't retain logs for very long, so we need to use [fly-log-shipper](https://github.com/superfly/fly-log-shipper) to ship logs to S3.
It also doesn't log the IP address of Datasette requests - guessing that the IP currently logged belongs to the load balancer. Load balancers usually record the original client IP in some sort of forwarding header, so we should be able to extract it from there. It seems like we can't configure Datasette's access logs directly, so we'll need to run it behind something we can configure, like NGINX.
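A minimal NGINX sketch of what that could look like - the header choice (`X-Forwarded-For` vs. something Fly-specific) is an assumption until we confirm what Fly.io's proxy actually sets, and the ports/paths are placeholders:

```nginx
# Sketch only: log the forwarded client IP instead of the proxy's IP.
# $http_x_forwarded_for is an assumption; Fly.io may set a different header.
log_format client_ip '$http_x_forwarded_for [$time_local] "$request"';

server {
    listen 8080;
    access_log /var/log/nginx/datasette.log client_ip;

    location / {
        proxy_pass http://127.0.0.1:8001;  # Datasette's default serve port
    }
}
```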
### Success Criteria
- [ ] we can run a script to get timestamp, IP, and accessed resource for every S3 and Datasette access
- [ ] GH action puts summary statistics on GCS every week for posterity.
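The weekly summary step could look something like this stdlib-only sketch - the record shape (`timestamp, ip, resource` tuples) and metric names are assumptions, and the GH action would wrap this output in a GCS upload:

```python
from collections import Counter
from datetime import datetime


def summarize(records):
    """Compute basic usage metrics from (timestamp, ip, resource) records.

    Sketch only: the metrics here (request count, distinct IPs, top
    resources) match the questions in the overview, but names may change.
    """
    return {
        "n_requests": len(records),
        "n_unique_ips": len({ip for _, ip, _ in records}),
        "top_resources": Counter(r for _, _, r in records).most_common(5),
    }
```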
### Next steps
- [ ] get S3 permissions to update IAM credentials so CLI/boto can work
- [ ] set up S3 bucket for fly-log-shipper to ship logs to
- [ ] set up some sort of application access for fly-log-shipper so it can send logs to S3
- [x] deploy Datasette behind NGINX within the docker instance, and configure it to log the X-Forwarded-For or X-Real-IP header... we need to determine which of these the Fly.io load balancer sets. Try this first with a tiny Datasette instance on a free-tier app.
- [ ] write a script to parse logs into timestamp/IP/resource records
- [ ] compare & contrast
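The log-parsing step above could start from something like this sketch - the regex assumes a combined-log-style line, which we'd adjust once we see the real NGINX/S3 log formats:

```python
import re
from datetime import datetime

# Assumed combined-log-style format: ip, identity, user, [timestamp], "request".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<resource>\S+) [^"]*"'
)


def parse_line(line):
    """Parse one access-log line into a (timestamp, ip, resource) record.

    Returns None for lines that don't match, so malformed entries can be
    counted and skipped rather than crashing the script.
    """
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    return ts, m.group("ip"), m.group("resource")
```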