
Storing data in a database

Open ExpDev07 opened this issue 4 years ago • 15 comments

Right now the data is just stored in a cache. Would it perhaps be better to sync the data to an actual MySQL database? It would allow for fast querying.

ExpDev07 avatar Mar 19 '20 21:03 ExpDev07

From what I've seen you discussing on other issues, it seems like the format of the data is not totally stable. If that's the case, wouldn't it be easier to use a NoSQL database over an RDBMS?

Kilo59 avatar Mar 21 '20 22:03 Kilo59

Yes, that would probably be better. MongoDB or the like.

ExpDev07 avatar Mar 21 '20 22:03 ExpDev07

Happy to help with this as well.

Kilo59 avatar Mar 22 '20 00:03 Kilo59

This would be good too, because the API is now broken for me and others.

focus1691 avatar Mar 25 '20 07:03 focus1691

@traderjosh can you explain in more detail how it's broken for you? JHU (our data provider) made some pretty drastic changes lately which have caused the API's output to change (notably the ID indexing, and provinces no longer being present for the USA).

ExpDev07 avatar Mar 25 '20 07:03 ExpDev07

@ExpDev07 it works now - it was the CORS header missing from the API. The recovered field seems to be missing now, though. Is that gone forever? And some countries don't have an id, you're right.

I think a database would be good here in case the source website changes. MongoDB would be a good fit because the data can change unexpectedly.

focus1691 avatar Mar 25 '20 07:03 focus1691

For JHU, yes, the recovery stats are gone forever unless they decide to bring them back. I'm going to see if I can find some other reputable sources that offer them and add them to the API.

I believe their reasoning was that no reputable sources were providing accurate recovery numbers, so they just decided to remove it.

ExpDev07 avatar Mar 25 '20 07:03 ExpDev07

Ok, that's not an issue. Do you want me to help set up MongoDB? I can write the boilerplate and you can integrate an account for it.

focus1691 avatar Mar 25 '20 07:03 focus1691

It would be awesome if you could start drafting a PR for it. It needs to be compatible with our service provider system (see the "app.services.locations" module), but I think MongoDB will be perfect for it. I'm thinking we periodically sync the DB with data retrieved from the data sources.

ExpDev07 avatar Mar 25 '20 07:03 ExpDev07
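
A rough sketch of what the sync-and-read side could look like, assuming pymongo (the `covid` database name, connection URI, and both helper functions here are hypothetical, not part of the existing codebase):

```python
# Hypothetical sketch: periodically upsert normalized location data into
# MongoDB, then serve reads from the database instead of per-process caches.
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["covid"]  # placeholder database name

def sync_locations(source_name, locations):
    """Upsert normalized location dicts into a per-source collection."""
    ops = [
        UpdateOne({"id": loc["id"]}, {"$set": loc}, upsert=True)
        for loc in locations
    ]
    if ops:
        db[source_name].bulk_write(ops)

def get_locations(source_name):
    """Read all locations for a source, dropping Mongo's internal _id."""
    return list(db[source_name].find({}, {"_id": 0}))
```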

And it should be in Python? Not my speciality but I could do some research.

focus1691 avatar Mar 25 '20 07:03 focus1691

Yeah, I feel like that would be best.

ExpDev07 avatar Mar 25 '20 07:03 ExpDev07

> And it should be in Python? Not my speciality but I could do some research.

https://api.mongodb.com/python/current/

Kilo59 avatar Mar 25 '20 11:03 Kilo59
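
For reference, getting started with pymongo is only a few lines; this is a minimal sketch with placeholder names, not the project's actual setup:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
collection = client["covid"]["jhu_locations"]       # placeholder db/collection

# Write a document, then read it back.
collection.insert_one({"country": "Norway", "confirmed": 1})
print(collection.find_one({"country": "Norway"}))
```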

Perhaps we should use Mongo to store and update the normalized data? We can keep the data in a format that is easy to translate into our various responses (a rough sketch follows the list below).

  1. Collections for each source
  2. Documents for the countries/locations
  3. Background tasks to refresh the sources according to how frequently they are each updated.
  4. Continue to use caching to minimize database reads.

Kilo59 avatar Mar 29 '20 16:03 Kilo59
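
A rough sketch of points 1-3 above, assuming an asyncio-based app; `fetch_source`, `sync_locations`, and the refresh intervals are all stubs/assumptions, not existing code:

```python
import asyncio

# Hypothetical per-source refresh intervals in seconds (point 3).
REFRESH_INTERVALS = {"jhu": 3600, "csbs": 1800}

async def fetch_source(source_name):
    """Stub for the real data fetcher; would hit the upstream endpoints."""
    return []  # list of normalized location dicts

def sync_locations(source_name, locations):
    """Stub for the Mongo upsert helper sketched earlier in the thread."""
    pass

async def refresh_source(source_name, interval):
    """Points 1-2: one collection per source, one document per location."""
    while True:
        locations = await fetch_source(source_name)
        sync_locations(source_name, locations)
        await asyncio.sleep(interval)

async def main():
    # Run one background refresher per source.
    await asyncio.gather(
        *(refresh_source(name, secs) for name, secs in REFRESH_INTERVALS.items())
    )

if __name__ == "__main__":
    asyncio.run(main())
```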

From what I have seen in the code so far, both the JHU and CSBS location models derive from the Location class, with CSBS having some additional fields. This kind of inheritance relationship should be easy to represent (and query) in an RDBMS.

Splitting the data into multiple collections in Mongo wouldn't really add much value unless you want to support historical records in multiple formats, which are horrible to query anyway if they aren't backwards compatible.

cyenyxe avatar Mar 31 '20 00:03 cyenyxe
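
For illustration, this parent/child shape maps cleanly onto joined-table inheritance in an ORM like SQLAlchemy; the field names below are placeholders, not the project's actual models:

```python
from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Location(Base):
    __tablename__ = "locations"
    id = Column(Integer, primary_key=True)
    country = Column(String)
    source = Column(String)  # discriminator column for the subtype
    __mapper_args__ = {"polymorphic_on": source, "polymorphic_identity": "jhu"}

class CSBSLocation(Location):
    # Extra CSBS-only fields live in a joined child table.
    __tablename__ = "csbs_locations"
    id = Column(Integer, ForeignKey("locations.id"), primary_key=True)
    county = Column(String)  # illustrative CSBS-only fields
    state = Column(String)
    __mapper_args__ = {"polymorphic_identity": "csbs"}
```

Querying `Location` then transparently returns both plain JHU rows and CSBS rows with their extra columns joined in.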

We are deployed with gunicorn, which runs multiple worker processes (4), each with its own cache. @cyenyxe once we are storing the data in any database (RDBMS, Mongo, Redis, etc.), the workers can use it like a shared cache. Then they don't have to independently rebuild their own separate caches every hour. It also adds resiliency when one of the dependent services times out or encounters some kind of error, which is often the problem that causes this API to go down.

Related reading

https://realpython.com/python-memcache-efficient-caching/
https://redis.io/topics/lru-cache

Created a new issue for using a shared cache: #304

Kilo59 avatar Apr 25 '20 18:04 Kilo59
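
A minimal sketch of the shared-cache idea using Redis via redis-py; the key scheme and one-hour TTL are assumptions:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)  # shared by all gunicorn workers

CACHE_TTL = 3600  # assumed one hour, matching the current refresh cadence

def get_cached_locations(source_name, fetch):
    """Return cached data if present; otherwise fetch once and share it."""
    key = f"locations:{source_name}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    data = fetch()  # only the first worker to miss pays the fetch cost
    r.set(key, json.dumps(data), ex=CACHE_TTL)
    return data
```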