
Including historic data...

Open lionfish0 opened this issue 6 years ago • 5 comments

I wanted to download more than just the data in the API (i.e. I also wanted data from a sensor going back years). To this end I wrote a Python module that uses yours, but also accesses the S3 buckets where OpenAQ keeps its archive. It works... sort of. But each file is about 100 MB for one day, and the sensors are jumbled about inside them, so I find I have to search the whole file for each one. I'm just writing some code now that tries to deal with that, as the same sensor seems to be fairly evenly spaced over the file, but it's not ideal. Just wondering if you've any ideas? Maybe the OpenAQ people could host the files in a different way (e.g. order the files by sensor first, then by time)? Thanks for the great module.
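For what it's worth, the whole-file search can at least be done as a stream so the 100 MB file never has to be held in memory. A minimal sketch, assuming the daily archive files are newline-delimited JSON (optionally gzipped) with a `location` field on each record — both assumptions about the file layout, so adjust to whatever the archive actually contains:

```python
import gzip
import json

def records_for_location(ndjson_path, location):
    """Stream a (possibly gzipped) newline-delimited JSON archive file
    and yield only the records for one sensor location, without ever
    loading the whole file into memory."""
    opener = gzip.open if str(ndjson_path).endswith(".gz") else open
    with opener(ndjson_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if record.get("location") == location:
                yield record
```

Usage would be something like `list(records_for_location("2019-04-17.ndjson.gz", "Oxford Centre"))`, where the filename and location name here are just placeholders.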

lionfish0 avatar Apr 17 '19 10:04 lionfish0

Hi @lionfish0 - I've had several students ask about this recently as well...one option is to use Athena and query for the specific sensor, though that requires an AWS account, which is a non-starter for our students. To use Athena, you can follow this tutorial. I'm not sure what the best alternative is at the moment - I may start keeping a cloned database that updates every day or week that I could potentially share? I'm not sure what the timeline for that would look like, though...
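For anyone who does have an AWS account, the Athena route can be scripted from Python via boto3. A rough sketch — the table and column names below are made up for illustration (the actual schema comes from whatever the Athena tutorial sets up), and `boto3` is imported lazily so the query builder can be used without AWS credentials:

```python
def athena_query_for_location(location, start_date, end_date,
                              table="openaq_fetches"):
    """Build an Athena (Presto) SQL query for one sensor's measurements
    over a date range. Table and column names are hypothetical."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE location = '{location}' "
        f"AND measured_at BETWEEN '{start_date}' AND '{end_date}'"
    )

def run_query(sql, output_s3, region="us-east-1"):
    """Submit the query to Athena; results land as CSV at output_s3.
    Requires AWS credentials configured in the environment."""
    import boto3  # imported here so the query builder works offline
    client = boto3.client("athena", region_name=region)
    resp = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```

You'd then poll `get_query_execution` until the state is `SUCCEEDED` and fetch the result CSV from the output location.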

To recommend changes to the structure/format of the API/database itself, I would get in touch with @RocketD0g or post an issue on the API repo itself. Let me know if you have any other questions/comments!

dhhagan avatar Apr 17 '19 12:04 dhhagan

Hi @lionfish0, thanks for looking to use the OpenAQ data! This definitely sounds more like an OpenAQ question than one for py-openaq, as it comes down to our provided formats. We will likely not reproject our files into another format, as they're pretty tied to how the ingest system works. David pointed out the Athena tutorial, which is currently the best way to access all the data in a performant manner. Note that we are looking at bringing that into the API with https://github.com/openaq/openaq-api/pull/387, but there are a few other things that need to get pulled in before that. But our API would be doing the exact same thing that tutorial does, no extra magic.

@dhhagan just a heads up, we specifically moved away from using the database to hold everything because of the volume. Running aggregations over 400 million+ measurements required a very large (expensive) database. With better indexes, more performant queries, and potentially fewer live users, this may be mitigated, but that's what we ran into.

jflasher avatar Apr 22 '19 14:04 jflasher

Thanks for your input @jflasher. Have you written anything about these issues? I'd be curious to know if the scaling issues were platform dependent and/or generally how much it was costing (both $ and computational resources) to run the site - also a few details about bottlenecks would be really interesting...

dhhagan avatar Apr 23 '19 11:04 dhhagan

@jflasher thanks for looking at this! I completely understand about the awkwardness & cost of hosting additional databases. I was thinking (like @dhhagan suggested) of reorganising the tables each day and hosting new files (organised by sensor), probably on S3 like yourselves. I'm collaborating with a big project looking at air pollution, so maybe that would be a good vehicle for these tables. I really wrote my original message to check I'd not missed anything obvious (e.g. a trick for indexing your files etc.) and to avoid duplicating work. I'll try to get something up and running by about mid-May (other things to work on before then). What AWS tool would you recommend, by the way, for this? (I've mainly just used EC2/EMR/S3, so I don't know what else might be more appropriate for this daily batch-processing task?)
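The per-day re-sharding step itself is simple enough once a daily file is downloaded. A sketch under the same layout assumptions as before (newline-delimited JSON with a `location` field; the `out_dir/<location>/<filename>` output layout is just one possible choice):

```python
import gzip
import json
from collections import defaultdict
from pathlib import Path

def split_day_by_sensor(day_file, out_dir):
    """Re-shard one daily archive file into per-sensor ndjson files,
    so later reads only touch the one sensor they care about.
    Returns a {location: record_count} summary."""
    out_dir = Path(out_dir)
    buckets = defaultdict(list)
    opener = gzip.open if str(day_file).endswith(".gz") else open
    with opener(day_file, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # replace path separators so location names are safe dir names
            key = str(record.get("location", "unknown")).replace("/", "_")
            buckets[key].append(line)
    for location, lines in buckets.items():
        dest = out_dir / location
        dest.mkdir(parents=True, exist_ok=True)
        out_name = Path(day_file).name.replace(".gz", "")
        (dest / out_name).write_text("\n".join(lines) + "\n",
                                     encoding="utf-8")
    return {loc: len(lines) for loc, lines in buckets.items()}
```

The per-sensor outputs could then be uploaded back to an S3 bucket, giving a sensor-first layout like `by-sensor/<location>/<date>.ndjson`.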

lionfish0 avatar Apr 23 '19 19:04 lionfish0

@dhhagan there have been a number of issues opened (and mostly closed) that reference the db. I've listed a few below.

https://github.com/openaq/openaq-fetch/issues/182 https://github.com/openaq/openaq-fetch/issues/177 https://github.com/openaq/openaq-fetch/issues/185 https://github.com/openaq/openaq-fetch/issues/355

If you've got any ideas after looking at those, it would be great to hear them! I am pretty sure we're not totally optimized with what we're doing as far as the db goes. Historically, about 1/3 of our monthly costs go into the database itself, 1/3 goes to using Athena for large aggregations, and 1/3 is compute to power the API, etc.

@lionfish0 Nope, you're not missing anything! I think there are two pieces to think about: one is reorganizing the existing data, and the other is handling the data as it comes in. For the larger task of reorganizing the data that already exists, doing something on an EC2 instance is probably the easiest; it'll just take a while to touch all the files. For the ongoing work, I think utilizing the notifications mentioned at https://medium.com/@openaq/get-faster-access-to-real-time-air-quality-data-from-around-the-world-c6f9793d5242 and using a Lambda function would be best. You can spin up a Lambda function each time a new fetch has happened, sort the data however you'd like, and store to the appropriate objects. Let me know if you have any other questions; I think this sort of thing might be useful for the community if there is any desire to share it!
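The Lambda approach described above might look roughly like this — assuming the fetch notifications arrive as SNS-wrapped standard S3 event records (the exact payload shape for the OpenAQ topic should be checked against the blog post), with the actual re-sharding left as a stub:

```python
import json

def keys_from_sns_event(event):
    """Extract (bucket, key) pairs from an SNS-wrapped S3 notification
    event, i.e. the shape a Lambda receives when subscribed to an SNS
    topic that relays standard S3 ObjectCreated events."""
    pairs = []
    for rec in event.get("Records", []):
        s3_event = json.loads(rec["Sns"]["Message"])
        for s3_rec in s3_event.get("Records", []):
            pairs.append((s3_rec["s3"]["bucket"]["name"],
                          s3_rec["s3"]["object"]["key"]))
    return pairs

def handler(event, context):
    """Lambda entry point: for each newly written fetch object, download
    it, re-shard by sensor, and upload per-sensor objects (sketch only;
    the download/re-shard/upload steps are left as stubs)."""
    for bucket, key in keys_from_sns_event(event):
        print(f"new fetch object: s3://{bucket}/{key}")
        # boto3.client("s3").download_file(bucket, key, local_path)
        # ...re-shard by sensor, then upload the per-sensor files...
```

A nice property of this split is that the event-parsing half is testable locally without any AWS resources.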

jflasher avatar Apr 24 '19 16:04 jflasher