covid-tracking-api
Feature Request: Time Series
Johns Hopkins has completely fudged their time series; any chance you could host one via the API?
I know I could write something to query day by day, and do the work from there, but it would be great if there was a CSV all set to go.
Hi, @jjshoe, is this purely an API question? If so, can I move it over to the API repo?
@kevee it probably belongs best on the repo where data is assembled.
I'm basically asking that when the state and county data sets get compiled each day, the CSV and whatever is used to drive the API share a single data source that gets each day's data appended.
For example, in CSV this would just mean a new data column (or several columns, really) to contain that day's data:
State | 1/1/2020 | 1/2/2020 |
---|---|---|
Alabama | 1 | 2 |
Alaska | 3 | 4 |
Or perhaps:
State | 1/1/2020 Tested | 1/1/2020 Confirmed |
---|---|---|
Alabama | 10 | 2 |
Alaska | 4 | 4 |
For anyone interested, the following Perl code will generate confirmed cases (excluding deaths) as a time series CSV.
https://gist.github.com/jjshoe/a5c62a7ff12d85f3badaa398fbf0cbff
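For anyone who prefers Python, here is a rough sketch of the same idea: pivoting per-day, per-state records into a wide time-series CSV like the tables above. The field names (`state`, `date`, `positive`) are assumptions for illustration, not the API's documented schema.

```python
import csv
from collections import defaultdict

def pivot_to_timeseries(records, value_field="positive"):
    """Pivot daily per-state records into rows with one column per date.

    `records` is a list of dicts; field names here are assumptions,
    not the API's documented schema.
    """
    dates = sorted({r["date"] for r in records})
    by_state = defaultdict(dict)
    for r in records:
        by_state[r["state"]][r["date"]] = r.get(value_field, "")
    rows = [["State"] + [str(d) for d in dates]]
    for state in sorted(by_state):
        rows.append([state] + [str(by_state[state].get(d, "")) for d in dates])
    return rows

# Toy input mirroring the example tables in this thread.
records = [
    {"state": "Alabama", "date": 20200101, "positive": 1},
    {"state": "Alabama", "date": 20200102, "positive": 2},
    {"state": "Alaska", "date": 20200101, "positive": 3},
    {"state": "Alaska", "date": 20200102, "positive": 4},
]
rows = pivot_to_timeseries(records)
with open("timeseries.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

In practice you would fill `records` by querying the API day by day, which is exactly the extra work a published time-series CSV would save.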
Indeed. Right now I have to make separate queries for `date=20200316`, `date=20200317`, and so on. It would be really nice to have `startDate=20200315&endDate=20200320`.
I would recommend using a half-open timestamp range (inclusive start, exclusive end):
`[2020-04-04T14:26:13.978+00:00 .. 2020-04-05T14:26:13.978+00:00)`, an ISO-8601 encoded date string, assumed to be UTC unless a different timezone is passed.
Maybe something like `checked:<=2020-04-05T14:26:00Z`.
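To illustrate the half-open semantics being proposed (a sketch of the idea, not the API's actual behavior): a record is kept when `start <= timestamp < end`, so two adjacent windows share a boundary without ever double-counting a record.

```python
from datetime import datetime, timezone

def in_window(ts, start, end):
    """Half-open [start, end) membership test over UTC timestamps.

    A record landing exactly on `end` belongs to the next window,
    not this one, so adjacent windows never overlap.
    """
    return start <= ts < end

start = datetime(2020, 4, 4, 14, 26, 13, tzinfo=timezone.utc)
end = datetime(2020, 4, 5, 14, 26, 13, tzinfo=timezone.utc)

assert in_window(start, start, end)      # inclusive lower bound
assert not in_window(end, start, end)    # exclusive upper bound
```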
@Nosferican Why do you think that would be useful in this context? The API doesn’t use that date format to begin with, and at least for myself, I don’t see why a half-open interval would be the desired semantics (unlike in, say, the C++ STL). I do like the idea of `date=20200315..20200320` instead of two separate parameters, but I think a closed interval would be easiest to deal with.
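A closed `date=20200315..20200320` parameter could be expanded client- or server-side into the same per-day queries used today. A minimal sketch, assuming the `..` range syntax proposed above (it is a proposal from this thread, not an existing API feature):

```python
from datetime import date, timedelta

def parse_date_range(value):
    """Expand a hypothetical "YYYYMMDD..YYYYMMDD" query value into a
    closed interval of YYYYMMDD date strings (both endpoints included)."""
    start_s, _, end_s = value.partition("..")
    end_s = end_s or start_s  # a bare date means a one-day range

    def to_date(s):
        return date(int(s[:4]), int(s[4:6]), int(s[6:8]))

    start, end = to_date(start_s), to_date(end_s)
    days = []
    d = start
    while d <= end:  # closed interval: include both endpoints
        days.append(d.strftime("%Y%m%d"))
        d += timedelta(days=1)
    return days

print(parse_date_range("20200315..20200320"))
# six days, 20200315 through 20200320 inclusive
```

The closed interval matches how people talk about date ranges ("the 15th through the 20th"), which is the ease-of-use argument made above.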
The date associated with the data is the day of publication, based on the timestamp. It usually means the data is about the reported tests for the day before. In some cases a case that was reported to the jurisdiction will show up in the API two days later, depending on when it was reported. The actual data collection uses the timestamp of when the source website was downloaded and parsed, which is a better measure of until when the data is "comprehensive". The metadata documents the `LastUpdated` field, but that one differs by jurisdiction and the heuristics make it a bit harder to use. The `date` field is a generated field based on the timestamp. Internally, the timestamp is the raw field that is actually used. Since the data and the API are updated more frequently (e.g., a job starts at 16:00 ET and should be done by 17:00 ET), it makes more sense to use the timestamp. The CSV API backups are always delayed, since that CRON job runs about every 6 hours.
I think the intervals could be closed, but when consuming the data you effectively have [start .. end), since anything after the moment the data was queried is still unknown.
@Nosferican That’s useful information, but I’m not sure I understand how it interacts with time series. We already can query the API by date, so shouldn’t the semantics of a date series be the same? AFAIK the API doesn’t offer any finer time granularity (right?).