covid19-data

District names unstandardized / unknown encoding

Open csaid opened this issue 3 years ago • 5 comments

It appears that some files are encoded in utf-8 and others are encoded in ISO-8859-2. Is that correct?

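import pandas as pd

# Antigen tests by district: this file reads cleanly as UTF-8.
ag = pd.read_csv('AG_Tests/OpenData_Slovakia_Covid_AgTests_District.csv',
                 sep=';', parse_dates=['Date'], encoding='utf-8')

# Deaths by age group and district: this file appears to need ISO-8859-2.
deaths = pd.read_csv('Deaths/OpenData_Slovakia_Covid_Deaths_AgeGroup_District.csv',
                     sep=';', encoding='ISO-8859-2')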

But even then, the characters in the district names are corrupted, which prevents me from joining data in these two files.

For example, "Okres Banská Štiavnica" in one file vs "Banská Štiavnica" in the other file.

csaid avatar Mar 03 '21 00:03 csaid

Does anybody have any code snippets for reading these two files into Python pandas? I can deal with the fact that the string "Okres " is missing from all the districts in some files, but the corrupted characters are a bigger problem.
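For the prefix part only, here is a minimal sketch of a join key. It assumes the only differences between the two files are the leading "Okres " and possibly the Unicode composition form; it does nothing about characters that are already corrupted. The function name `district_key` is made up for illustration.

```python
import unicodedata

PREFIX = "Okres "  # "District" in Slovak, present in some files only


def district_key(name: str) -> str:
    """Return a district name suitable for joining across files.

    Assumption: names differ only by the "Okres " prefix, surrounding
    whitespace, and composed vs. decomposed accents.
    """
    # NFC makes "a" + combining acute compare equal to a precomposed "á".
    name = unicodedata.normalize("NFC", name).strip()
    if name.startswith(PREFIX):
        name = name[len(PREFIX):]
    return name
```

With pandas this could be applied as something like `ag["District"].map(district_key)` before merging, where the column name `District` is an assumption about the file layout.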

csaid avatar Mar 03 '21 00:03 csaid

I just looked at every possible encoding (sometimes Windows-1250 is still used here) and none turn out right. I'd say you're best off downloading the XLSX file and exporting a CSV from that, because in the XLSX the encoding is correct. "Okres" just means "District" in Slovak.
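A rough sketch of the kind of check described above, in case someone wants to repeat it: decode the raw bytes with each candidate encoding and see whether a known district name comes out intact. The helper name `matching_encodings` and the candidate list are assumptions for illustration, not an exhaustive list of encodings.

```python
def matching_encodings(raw: bytes, expected: str,
                       candidates=("utf-8", "iso-8859-2", "cp1250")):
    """Return the candidate encodings under which `expected` appears in `raw`."""
    matches = []
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding at all.
            continue
        # Decoding can succeed and still produce mojibake, so check
        # for a string we know should be in the file.
        if expected in text:
            matches.append(encoding)
    return matches
```

Called on the bytes of the deaths CSV (for example read with `open(path, "rb").read()`) and a name such as "Banská Štiavnica", an empty result would mean none of the candidates reproduces the name, which matches what I saw.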

freedomlives avatar Mar 04 '21 09:03 freedomlives

@csaid you should be aware that mass antigen testing was conducted in the fall, and its results aren't in the table you're looking at but in the file OpenData_Slovakia_National_Testing.xlsx

freedomlives avatar Mar 04 '21 10:03 freedomlives

@freedomlives @csaid all data from the national antigen testing in the fall of 2020 is available in the file Slovakia_National_Testing_Municipality_Data.csv (in the root of this repo); the commit messages also contain notes that help explain some aspects of the testing and the data limitations

sk-juroot avatar Mar 04 '21 10:03 sk-juroot

Hi, first of all, sorry for taking so long to answer. That is correct: all CSV files that are updated regularly are encoded in UTF-8. Because the deaths data could not be published automatically, we didn't deal with its encoding. It now looks like we will be updating it on a daily basis, with UTF-8 encoding. I will keep this issue open and let you know when it is included in our daily update.

KristianSufliarsky avatar Mar 08 '21 11:03 KristianSufliarsky