coronavirus-tracker-api
Storing data in a database
Right now the data is just stored in a cache. Would it perhaps be better to sync the data to an actual MySQL database? It would allow for fast querying.
From what I have seen you guys discussing on other issues, it seems like the format of the data is not totally stable. If that's the case, wouldn't it be easier to use a NoSQL database over an RDBMS?
Yes, that would probably be better. MongoDB or the like.
Happy to help with this as well.
This would be good too because the API is now broken for me and others
@traderjosh can you explain in more detail how it's broken for you? JHU (our data provider) made some pretty drastic changes lately which have caused the API's outputs to change (notably the ID indexing, and provinces no longer being present for the USA).
@ExpDev07 it works now - it was the CORS header missing from the API. The recovered field seems to be missing now, though. Is that gone forever? And some countries don't have an ID, you're right.
I think a database would be good for this in case of website changes. MongoDB would be good because the data can change unexpectedly.
For JHU, yes, the recovery stats are gone forever unless they decide to bring them back. I'm gonna see if I can find some other reputable sources that offer them and add them to the API.
I believe their reasoning was that no reputable sources were providing accurate recovery numbers, so they just decided to remove it.
Ok, that's not an issue. Do you want me to help set up MongoDB? I can write the boilerplate and you can integrate an account for it.
It would be awesome if you could start drafting a PR for it. It needs to be compatible with our service provider system (see the `app.services.locations` module). But I think MongoDB will be perfect for it. I'm thinking we periodically sync the DB with data retrieved from the data sources.
And it should be in Python? Not my speciality but I could do some research.
Yeah, feel like that would be best.
https://api.mongodb.com/python/current/
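To make that concrete, here is a minimal sketch of what a Mongo-backed provider could look like with pymongo. The class/method names and connection details are hypothetical; the real interface to match lives in `app.services.locations`.

```python
from pymongo import MongoClient

class MongoLocationService:
    """Hypothetical Mongo-backed provider; the real interface to match
    lives in `app.services.locations`."""

    def __init__(self, uri="mongodb://localhost:27017", source="jhu"):
        # One collection per data source (URI and names are placeholders).
        self.collection = MongoClient(uri)["coronavirus_tracker"][source]

    def get_all(self):
        # Drop Mongo's internal _id so documents match the API's models.
        return list(self.collection.find({}, {"_id": 0}))

    def get(self, id):
        return self.collection.find_one({"id": id}, {"_id": 0})
```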
Perhaps we should use Mongo to store and update the normalized data? We can keep the data in a format that is easy to translate into our various responses.
- Collections for each source
- Documents for the countries/locations
- Background tasks to refresh the sources according to how frequently they are each updated (see the sketch after this list)
- Continue to use caching to minimize database reads
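A rough sketch of the background-task idea, using pymongo and the standard library's `threading.Timer`. The source names, intervals, and the `fetch_locations()` helper are all assumptions, not the project's actual code:

```python
import threading
from pymongo import MongoClient, UpdateOne

# Connection URI and database name are placeholders.
db = MongoClient("mongodb://localhost:27017")["coronavirus_tracker"]

# Illustrative per-source refresh intervals, in seconds.
REFRESH_INTERVALS = {"jhu": 3600, "csbs": 1800}

def fetch_locations(source):
    # Hypothetical stand-in for the existing retrieval logic in
    # app.services.locations; should return normalized location dicts.
    return []

def sync_source(source):
    # Upsert each location into the source's own collection, keyed by
    # its API id, so refreshes update documents in place.
    ops = [
        UpdateOne({"id": loc["id"]}, {"$set": loc}, upsert=True)
        for loc in fetch_locations(source)
    ]
    if ops:
        db[source].bulk_write(ops)

def refresh_periodically(source, interval):
    sync_source(source)
    # Re-arm the timer so each source refreshes on its own schedule.
    threading.Timer(interval, refresh_periodically, args=(source, interval)).start()

for source, interval in REFRESH_INTERVALS.items():
    refresh_periodically(source, interval)
```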
From what I have seen in the code so far, both the JHU and CSBS location models derive from the Location class, with CSBS having some additional fields. This kind of inheritance relationship should be easy to represent (and query) in an RDBMS.
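For example, SQLAlchemy's single-table inheritance maps that shape directly. The columns below are illustrative and don't mirror the project's actual models:

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Location(Base):
    __tablename__ = "locations"
    id = Column(Integer, primary_key=True)
    country = Column(String)
    kind = Column(String)  # discriminator column for the subclass type
    __mapper_args__ = {"polymorphic_on": kind, "polymorphic_identity": "location"}

class CSBSLocation(Location):
    # CSBS-only fields share the same table and stay NULL for other rows.
    county = Column(String)
    state = Column(String)
    __mapper_args__ = {"polymorphic_identity": "csbs"}
```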
Splitting the data across multiple collections in Mongo wouldn't really add much value unless you want to support historical records in multiple formats, which is horrible to query anyway if they aren't backwards compatible.
We are deployed with gunicorn, which runs multiple worker processes (4), each with its own cache.
@cyenyx once we are storing the data in any database (RDBMS, Mongo, Redis, etc.), the workers can use it as a shared cache. Then they don't have to independently rebuild their own separate caches every hour.
It also adds resiliency when one of the dependent services times out or encounters some kind of error, which is often what causes this API to go down.
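For example (a sketch reusing the hypothetical `sync_source` helper from above): if a refresh fails, the previously synced data stays in place and the API keeps serving it instead of going down.

```python
import logging

def safe_refresh(source):
    try:
        sync_source(source)  # hypothetical helper from the earlier sketch
    except Exception:
        # The upstream source timed out or errored; the previously synced
        # documents stay in the database, so the API keeps serving
        # stale-but-valid data instead of going down.
        logging.exception("Refresh failed for %s; keeping last synced data", source)
```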
Related reading:
- https://realpython.com/python-memcache-efficient-caching/
- https://redis.io/topics/lru-cache
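On that note, a tiny read-through sketch with redis-py that all gunicorn workers could share; the key names, TTL, and `fetch_locations` helper are assumptions:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)  # connection details assumed

def get_locations_cached(source, ttl=3600):
    """Read-through cache shared by all worker processes."""
    cached = r.get(f"locations:{source}")
    if cached is not None:
        return json.loads(cached)
    locations = fetch_locations(source)  # hypothetical retrieval helper
    r.setex(f"locations:{source}", ttl, json.dumps(locations))
    return locations
```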
Created a new issue for using a shared cache #304