
Feature Request: Time Series

Open jjshoe opened this issue 4 years ago • 8 comments

Johns Hopkins has completely fudged their time series; any chance you could host it via API?

I know I could write something to query day by day, and do the work from there, but it would be great if there was a CSV all set to go.

jjshoe avatar Mar 25 '20 19:03 jjshoe

Hi, @jjshoe, is this purely an API question? If so, can I move it over to the API repo?

kevee avatar Mar 26 '20 00:03 kevee

@kevee it probably belongs on the repo where the data is assembled.

I'm basically asking that when the state and county data sets are compiled each day, the singular CSV and whatever is used to drive the API share a single data source that gets each day's data appended to it.

For example, in the CSV this would just mean a new data column (or several columns, really) to contain that day's data; see the sketch after the examples below:

| State   | 1/1/2020 | 1/2/2020 |
| ------- | -------- | -------- |
| Alabama | 1        | 2        |
| Alaska  | 3        | 4        |

Or perhaps:

| State   | 1/1/2020 Tested | 1/1/2020 Confirmed |
| ------- | --------------- | ------------------ |
| Alabama | 10              | 2                  |
| Alaska  | 4               | 4                  |
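
As a rough sketch of that layout (an illustration only, not the project's code), assuming the per-day numbers are available as simple (state, date, value) records, with field names made up for the example, the pivot into one column per date could look like this:

```python
import csv
from collections import defaultdict

# Hypothetical long-format records: one row per state per day.
records = [
    {"state": "Alabama", "date": "1/1/2020", "positive": 1},
    {"state": "Alabama", "date": "1/2/2020", "positive": 2},
    {"state": "Alaska",  "date": "1/1/2020", "positive": 3},
    {"state": "Alaska",  "date": "1/2/2020", "positive": 4},
]

# One column per date, one row per state.
dates = sorted({r["date"] for r in records})  # good enough for this sketch
by_state = defaultdict(dict)
for r in records:
    by_state[r["state"]][r["date"]] = r["positive"]

with open("time_series.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["State"] + dates)
    for state in sorted(by_state):
        writer.writerow([state] + [by_state[state].get(d, "") for d in dates])
```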

jjshoe avatar Mar 26 '20 12:03 jjshoe

For anyone interested, the Perl script linked below will generate confirmed cases (excluding deaths) as a time series CSV.

https://gist.github.com/jjshoe/a5c62a7ff12d85f3badaa398fbf0cbff

jjshoe avatar Mar 26 '20 12:03 jjshoe

Indeed. Right now I have to make separate queries for date=20200316, date=20200317, and so on. It would be really nice to have startDate=20200315&endDate=20200320.
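
For illustration, the day-by-day workaround looks roughly like this (a sketch only; the base URL and response shape are assumptions, not the documented API):

```python
import json
import urllib.request
from datetime import date, timedelta

# Assumed endpoint; the real path may differ.
BASE_URL = "https://covidtracking.com/api/states/daily"

def daterange(start: date, end: date):
    """Yield every date from start to end, inclusive."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

rows = []
for d in daterange(date(2020, 3, 15), date(2020, 3, 20)):
    # One HTTP request per day is required today.
    url = f"{BASE_URL}?date={d.strftime('%Y%m%d')}"
    with urllib.request.urlopen(url) as resp:
        rows.extend(json.load(resp))

# A startDate/endDate (or date=start..end) parameter would collapse
# this whole loop into a single request.
```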

marnen avatar Apr 03 '20 02:04 marnen

I would recommend using the timestamp range, inclusive at the start and non-inclusive at the end:

`[2020-04-04T14:26:13.978+00:00 .. 2020-04-05T14:26:13.978+00:00)` (an ISO-8601 encoded date string, assumed to be UTC unless a different timezone is passed).

Maybe something like `checked:<=2020-04-05T14:26:00Z`
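
To make the half-open semantics concrete, a small sketch with the standard library (the names are illustrative, not API fields; the offset form above parses cleanly, while a trailing Z needs extra handling on older Pythons):

```python
from datetime import datetime

def in_range(checked: str, start: str, end: str) -> bool:
    """True if `checked` falls inside the half-open interval [start, end)."""
    t = datetime.fromisoformat(checked)
    return datetime.fromisoformat(start) <= t < datetime.fromisoformat(end)

start = "2020-04-04T14:26:13.978+00:00"
end = "2020-04-05T14:26:13.978+00:00"

print(in_range(start, start, end))  # True: the start instant is included
print(in_range(end, start, end))    # False: the end instant is excluded
```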

Nosferican avatar Apr 05 '20 14:04 Nosferican

@Nosferican Why do you think that would be useful in this context? The API doesn’t use that date format to begin with, and at least for myself, I don’t see why a half-open interval would be the desired semantics (unlike in, say, the C++ STL). I do like the idea of date=20200315..20200320 instead of two separate parameters, but I think a closed interval would be easiest to deal with.
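
Just to illustrate the proposed semantics (not anything the API implements), a server could parse a closed date=20200315..20200320 range along these lines:

```python
from datetime import datetime, timedelta

def parse_date_range(param: str):
    """Parse 'YYYYMMDD..YYYYMMDD' (or a single 'YYYYMMDD') into a closed list of dates."""
    start_str, _, end_str = param.partition("..")
    start = datetime.strptime(start_str, "%Y%m%d").date()
    end = datetime.strptime(end_str, "%Y%m%d").date() if end_str else start
    # Closed interval: both endpoints are included.
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

print(parse_date_range("20200315..20200320"))  # six dates, 2020-03-15 through 2020-03-20
print(parse_date_range("20200316"))            # a single date still works
```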

marnen avatar Apr 05 '20 14:04 marnen

The date associated with the data is the day of publication, based on the timestamp. It usually refers to the tests reported for the day before, and in some cases a case that was reported to the jurisdiction will show up in the API two days later, depending on when it was reported. The actual data collection uses the timestamp of when the source website was downloaded and parsed, which is a better measure of up to when the data is "comprehensive". The metadata documents the LastUpdated field, but that one differs by jurisdiction and its heuristics make it harder to use.

The date field is generated from the timestamp; internally, the timestamp is the raw field that is actually used. Since the data and the API are updated more frequently (e.g., the job starts at 16:00 ET and should be done by 17:00 ET), it makes more sense to use the timestamp. The CSV API backups are always delayed, since that CRON job runs about every 6 hours.

I think the intervals could be closed, but when consuming the data you effectively have [start .. end), since anything after the time the data was queried is still unknown.
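
To make that concrete, here is a sketch of how a publication date could be derived from the scrape timestamp; the Eastern-time rule below is my assumption for illustration, not the project's documented heuristic:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

EASTERN = ZoneInfo("America/New_York")

def publication_date(timestamp: str) -> str:
    """Derive a YYYYMMDD publication date from an ISO-8601 scrape timestamp.

    Assumption: the generated date is the Eastern-time calendar day the
    source was downloaded; the project's actual rule may differ.
    """
    t = datetime.fromisoformat(timestamp)
    return t.astimezone(EASTERN).strftime("%Y%m%d")

# A scrape finishing at 16:30 ET on April 5 publishes under 20200405,
# even though it largely reflects tests reported for April 4.
print(publication_date("2020-04-05T20:30:00+00:00"))  # 20200405
```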

Nosferican avatar Apr 05 '20 14:04 Nosferican

@Nosferican That’s useful information, but I’m not sure I understand how it interacts with time series. We already can query the API by date, so shouldn’t the semantics of a date series be the same? AFAIK the API doesn’t offer any finer time granularity (right?).

marnen avatar Apr 05 '20 21:04 marnen