
Broken encoding in OpenData_Slovakia_CovidAutomat.csv

Open example-sk opened this issue 3 years ago • 5 comments

Hello, the district names in the file OpenData_Slovakia_CovidAutomat.csv are somewhat broken.

For example, the district Stará ?ubov?a or ?adca. In a hex editor I see each of those question-mark characters as 3F (00111111), which really is a question mark in ASCII.
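The symptom described above (a literal 3F byte where an accented letter should be) is what a lossy encode produces. The following is a minimal sketch, assuming the intended names are Stará Ľubovňa and Čadca and that the file was at some point written with a code page lacking Ľ, ň and Č. The choice of `cp1252` is only a guess for illustration, and `lossy_roundtrip` is a hypothetical helper, not something from the repository.

```python
# Characters missing from the target code page are replaced with "?"
# (byte 0x3F) when encoding with errors="replace". The original letters
# are gone after that, so the file cannot be fixed by re-decoding it.
names = ["Stará Ľubovňa", "Čadca"]


def lossy_roundtrip(text: str, codec: str = "cp1252") -> str:
    """Encode with replacement and decode again (codec is an assumption)."""
    return text.encode(codec, errors="replace").decode(codec)


for name in names:
    broken = lossy_roundtrip(name)
    # Show the damaged name together with its bytes in hex.
    print(name, "->", broken, "|", broken.encode("cp1252").hex(" "))
```

Because "á" exists in that code page it survives, while "Ľ", "ň" and "Č" do not, which matches the pattern reported in the issue.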

example-sk avatar Feb 09 '21 09:02 example-sk

+1 for this issue.

The wrong encoding can cause problems for anyone who uses the raw data and filters it, since some of the special characters in the names of cities and villages are misinterpreted. If needed, I can create a Python script that automatically performs a simple search and replace on the CSV file. However, the user would have to run it manually, which is not ideal.

@matejmisik if you (or your team) are, for whatever reason, unable to fix the data before uploading it here on GitHub, please let me know and I'll create a simple Python script to fix the data.
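A search-and-replace repair of the kind proposed here could look roughly like the sketch below. It is not the script from this thread: the `REPAIRS` mapping, the `repair_rows` helper, the `okres` column name and the `;` delimiter are all assumptions made for illustration, and a real mapping would need an entry for every affected name.

```python
import csv
import io

# Hypothetical mapping from damaged names to the intended ones.
REPAIRS = {
    "Stará ?ubov?a": "Stará Ľubovňa",
    "?adca": "Čadca",
}


def repair_rows(src, dst, delimiter=";"):
    """Copy CSV rows from src to dst, replacing whole cells found in REPAIRS."""
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    for row in reader:
        writer.writerow([REPAIRS.get(cell, cell) for cell in row])


# Tiny in-memory sample standing in for the real file.
sample = io.StringIO("okres;value\nStará ?ubov?a;1\n?adca;2\n")
fixed = io.StringIO()
repair_rows(sample, fixed)
print(fixed.getvalue())
```

Matching whole cells rather than single characters avoids touching legitimate question marks. It cannot help if two different names collapse to the same damaged string, because the lost letters are not recoverable from the file itself.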

neisor avatar Mar 04 '21 09:03 neisor

Hello,

it seems that the encoding is still malformed and not easily readable. @neisor, have you found a way to read this data automatically and reliably?

sakonn avatar Sep 16 '21 10:09 sakonn

Hello,

I made a very simple Python script that can repair the broken file. I hope it helps someone.

achjaj avatar Sep 19 '21 09:09 achjaj

My solution is to convert OpenData_Slovakia_CovidAutomat.xlsx to CSV using the CloudConvert service. It's free and works perfectly.

import cloudconvert

api_key = 'XXXXXXX'  # your CloudConvert API key
sandbox = False

cloudconvert.configure(api_key=api_key, sandbox=sandbox)

# One job with three tasks: import the XLSX by URL, convert it to CSV,
# and export the result as a downloadable URL.
result = cloudconvert.Job.create(payload={
    'tasks': {
        'import-covid-data': {
            'operation': 'import/url',
            'url': 'https://github.com/Institut-Zdravotnych-Analyz/covid19-data/raw/main/OpenData_Slovakia_CovidAutomat.xlsx',
            'filename': 'OpenData_Slovakia_CovidAutomat.xlsx'
        },
        'convert-covid-data': {
            'operation': 'convert',
            'input': 'import-covid-data',
            'output_format': 'csv'
        },
        'export-covid-data': {
            'operation': 'export/url',
            'input': 'convert-covid-data',
            'inline': False,
            'archive_multiple_files': False
        }
    }
})

# Look up the export task by name instead of relying on its position in the list.
export_task = next(t for t in result['tasks'] if t['name'] == 'export-covid-data')
res = cloudconvert.Task.wait(id=export_task['id'])  # Wait for the export task to finish
exported_file = res.get('result').get('files')[0]
cloudconvert.download(filename=exported_file['filename'], url=exported_file['url'])

The result is a downloaded file, OpenData_Slovakia_CovidAutomat.csv, without any encoding errors.
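One way to sanity-check a converted file is to decode it strictly and flag any line that still contains a question mark or the Unicode replacement character. This is only a sketch: `suspect_lines` is a hypothetical helper, it assumes the converted CSV is UTF-8, and the sample bytes below stand in for the real file.

```python
def suspect_lines(data: bytes):
    """Return (line_number, text) pairs that contain '?' or U+FFFD.

    Decoding is strict, so invalid UTF-8 raises UnicodeDecodeError.
    """
    text = data.decode("utf-8")
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if "?" in line or "\ufffd" in line
    ]


# In-memory samples: one clean, one with the damage described in this issue.
good = "okres\nStará Ľubovňa\nČadca\n".encode("utf-8")
bad = b"okres\nStar\xc3\xa1 ?ubov?a\n?adca\n"

print(suspect_lines(good))
print(suspect_lines(bad))
```

For a real file, the bytes could be read with `open(path, "rb").read()`. A flagged line is only a hint, since a question mark can also be legitimate data.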

sakonn avatar Sep 20 '21 18:09 sakonn

Btw, I also created a Java wrapper around automat.gov.sk.

achjaj avatar Sep 20 '21 20:09 achjaj