Bad data quality South America

Open co-miko opened this issue 4 years ago • 5 comments

Hi Opencovid Team

Thanks again for your efforts on gathering all the data. While looking through the data I observed some strange behaviour in various country districts and municipalities.

Especially in these countries:

Argentina: La Rioja, La Pampa... There the case counts increase at the beginning and decrease again in the middle.

Bolivia: La Paz,... Brazil: Acre,... Chile: Antofagasta,... Peru: Ancash. There the values at the beginning are consistently very wrong.

Mexico: Tlaxcala. The values jump a lot at the start of the tracking.

Poland: Greater Poland,... 13.6. There the values decrease from 2000-something to 24, and increase to about 2000 again the next day.

Czechia: Prague, 13.7. No more data for deaths or recoveries is available.

We use your data for our website to show some statistics and developments. You can have a look at one example here: https://covid.lanthaler.com/BO/cochabamba/

I hope you keep up your great work. Thank you

co-miko avatar Jul 20 '20 13:07 co-miko

Thank you for the kind words and for reporting these issues. I can confirm that I see some of the problems that you reported, for example in Tlaxcala's numbers: [image]

I'm guessing it's some date-parsing error. I'll look into it and get back to you.

owahltinez avatar Jul 20 '20 15:07 owahltinez

We narrowed it down to a particularly careless data source, and we now heavily filter their data to only take what looks reasonable. I visually inspected all the examples you provided, and they look fine to me now. Can you verify?

Also, can I add your page to the grid of data users at the top of the page?

owahltinez avatar Jul 22 '20 14:07 owahltinez

I will check them. And of course you can add us to the grid of data users.

co-miko avatar Jul 22 '20 19:07 co-miko

The data for Bolivia, Brazil, Chile looks very good.

There are still some minor data anomalies:

Argentina:

  • Chubut (has 84 total cases on April 14; on April 15 this is reduced to 1)
  • La Pampa (has more total deaths than total infected)
  • La Rioja (same as Chubut)

Peru:

  • Lima (total deaths are the same as total cases)
  • nearly all provinces show the same behaviour as Lima

Mexico:

  • Baja California: the current day shows only a fraction of the previous day's count (looks like an incomplete count for that day)
  • Campeche (same as Baja)
  • Chiapas (same as Baja)
  • Morelos (same as Baja)
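
The three kinds of anomaly listed above (a cumulative total that drops, more deaths than cases, deaths equal to cases) can be checked mechanically. The following is only a minimal sketch: the row layout `(date, total_cases, total_deaths)` and the function name `find_anomalies` are assumptions made for illustration, not part of the dataset or its tooling.

```python
from typing import List, Tuple

# Hypothetical row layout, assumed sorted by date:
# (date, total_cases, total_deaths)
Row = Tuple[str, int, int]


def find_anomalies(rows: List[Row]) -> List[str]:
    """Return a description of every suspicious row in one subregion."""
    problems = []
    prev_cases = None
    for date, cases, deaths in rows:
        # A cumulative total should never go down from one day to the next.
        if prev_cases is not None and cases < prev_cases:
            problems.append(
                f"{date}: total cases dropped from {prev_cases} to {cases}"
            )
        # Deaths cannot exceed confirmed cases.
        if deaths > cases:
            problems.append(f"{date}: more deaths ({deaths}) than cases ({cases})")
        # Identical non-zero totals hint that one source was read twice.
        elif cases > 0 and deaths == cases:
            problems.append(f"{date}: deaths equal cases ({cases})")
        prev_cases = cases
    return problems
```

For the Chubut case counts quoted above (the death counts here are placeholders, since they were not reported), `find_anomalies([("2020-04-14", 84, 0), ("2020-04-15", 1, 0)])` flags the drop from 84 to 1 on the second day.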

co-miko avatar Jul 22 '20 22:07 co-miko

Thank you for the detailed feedback!

Argentina

We just switched to a new data source via #301 so all of these should be resolved.

Peru

I had made a silly mistake and used the same URL for confirmed and deceased cases... Fixed via #302
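
A simple guard against this kind of mistake is to check that no two variables are fetched from the same URL. This is only a sketch: the mapping of variable name to URL, the function name `find_shared_urls`, and the `example.com` URLs are hypothetical and do not reflect how the pipeline is actually configured.

```python
from collections import defaultdict
from typing import Dict, List


def find_shared_urls(sources: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each URL used by more than one variable to those variables."""
    by_url = defaultdict(list)
    for variable, url in sources.items():
        by_url[url].append(variable)
    # Keep only URLs that feed two or more variables.
    return {url: names for url, names in by_url.items() if len(names) > 1}


# Hypothetical configuration reproducing the copy-paste mistake.
sources = {
    "confirmed": "https://example.com/cases.csv",
    "deceased": "https://example.com/cases.csv",
}
print(find_shared_urls(sources))
```

An empty result means every variable has its own source URL.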

Mexico

I will double check, but I think this is just the nature of our data source, which outputs incomplete data for the latest day. If it's frequent enough (i.e. it's happening for all subregions), I would consider tossing out the latest day, but I would strongly prefer not to filter the data since it's coming directly from an authoritative source.
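
To decide whether it is "frequent enough", one option is to measure how many subregions show a latest value that is only a fraction of the previous day's. The sketch below is an illustration under assumptions, not what the pipeline does: the function names and both thresholds (`ratio=0.5`, `min_share=0.8`) are arbitrary choices.

```python
from typing import Dict, List


def latest_day_looks_incomplete(series: List[int], ratio: float = 0.5) -> bool:
    """True if the last value is below `ratio` times the previous value."""
    if len(series) < 2 or series[-2] <= 0:
        return False
    return series[-1] < ratio * series[-2]


def should_drop_latest_day(
    by_subregion: Dict[str, List[int]],
    min_share: float = 0.8,
    ratio: float = 0.5,
) -> bool:
    """True if at least `min_share` of subregions have an incomplete latest day."""
    if not by_subregion:
        return False
    flagged = sum(
        latest_day_looks_incomplete(series, ratio)
        for series in by_subregion.values()
    )
    return flagged / len(by_subregion) >= min_share
```

With this rule the latest day is dropped only when the pattern is widespread, so a single odd subregion does not cause authoritative data to be discarded.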

owahltinez avatar Jul 22 '20 23:07 owahltinez