JuliaDB.jl

Difficulties reading a file

Open jmathiasHts opened this issue 7 years ago • 2 comments

Hello, I am a beginner in Julia. I am trying to import the large file of French establishments (the open data "sirene" file: http://files.data.gouv.fr/sirene/sirene_201711_L_M.zip).

I used this code, or variants of it:

addprocs()
using JuliaDB
path = "F:/BD/labo/labo/siren.csv"
sirene = loadtable(path)

And I get errors. At first, I thought the file was too badly built to be imported via loadtable:

- The encoding was WIN-1252.
- Strings were sometimes enclosed in quotes and sometimes not.
- The separator was ";" and not ",".
- The separator could appear inside quoted strings.
- Maybe the file was too big (...?)
- Maybe successive separators caused by a missing field were misinterpreted, so I replaced ",," with ", NULL,".
- Maybe the non-response values were badly recognized, especially in the numeric fields, so in the numeric variables I replaced ", NR," with ", NULL,".

So I applied a set of transformations to the initial file using Perl regular expressions and iconv. I then extracted a small file containing the first 2,500 lines.

This extract can be downloaded here: https://www.justbeamit.com/u42je

I did not notice any major flaw when inspecting this extract in Excel; in particular, the number of fields is the same in every line and equal to 100.

With sirene=loadtable(path)

julia> sirene=loadtable(chemin)
Error parsing F:\BD\labo\labo\test.csv
ERROR: On worker 2:
previous rows had 98 fields but row 2 has 100
guesscolparsers at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:507
#_csvread_internal#35 at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:194
#_csvread_internal at .<missing>:0
#32 at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:92
open at .\iostream.jl:152
#_csvread_f at .<missing>:0
#csvread#34 at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:103
#csvread at .<missing>:0
#_loadtable_serial#2 at C:\Users\jerom.julia\v0.6\JuliaDB\src\util.jl:88
#_loadtable_serial at .<missing>:0
#217 at C:\Users\jerom.julia\v0.6\JuliaDB\src\io.jl:131
do_task at C:\Users\jerom.julia\v0.6\Dagger\src\compute.jl:319
#106 at .\distributed\process_messages.jl:268 [inlined]
run_work_thunk at .\distributed\process_messages.jl:56
macro expansion at .\distributed\process_messages.jl:268 [inlined]
#105 at .\event.jl:73

With sirene=loadtable(path,type_detect_rows=2500)

julia> sirene=loadtable(path,type_detect_rows=2500)
Error parsing F:\BD\labo\labo\test.csv
ERROR: On worker 2:
previous rows had 98 fields but row 2 has 100
guesscolparsers at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:507
#_csvread_internal#35 at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:194
#_csvread_internal at .<missing>:0
#32 at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:92
open at .\iostream.jl:152
#_csvread_f at .<missing>:0
#csvread#34 at C:\Users\jerom.julia\v0.6\TextParse\src\csv.jl:103
#csvread at .<missing>:0
#_loadtable_serial#2 at C:\Users\jerom.julia\v0.6\JuliaDB\src\util.jl:88
#_loadtable_serial at .<missing>:0
#217 at C:\Users\jerom.julia\v0.6\JuliaDB\src\io.jl:131
do_task at C:\Users\jerom.julia\v0.6\Dagger\src\compute.jl:319
#106 at .\distributed\process_messages.jl:268 [inlined]
run_work_thunk at .\distributed\process_messages.jl:56
macro expansion at .\distributed\process_messages.jl:268 [inlined]
#105 at .\event.jl:73

Do you have any idea how to load this file correctly? Am I doing something wrong? I chose JuliaDB because of the size of the file to load (~ 8 GB / 11 000 000 lines and 100 variables).

Best regards

jmathiasHts avatar Jan 24 '18 05:01 jmathiasHts

I think your file has 98 column headers, but 100 columns?
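
One way to check this is to count the fields on the header line and on each data row, ignoring delimiters that sit inside quotes. The following is only a minimal sketch in plain Julia: `count_fields` and `field_count_report` are hypothetical helper names, not part of JuliaDB or TextParse, and the sketch assumes the file is already UTF-8 and that no quoted field spans several lines.

```julia
# Count the fields in one delimited line, ignoring delimiters inside quotes.
# A doubled quote ("") inside a quoted field toggles the flag twice, so it
# does not change the result.
function count_fields(line::AbstractString; delim::Char=',', quotechar::Char='"')
    n = 1
    inquotes = false
    for c in line
        if c == quotechar
            inquotes = !inquotes
        elseif c == delim && !inquotes
            n += 1
        end
    end
    return n
end

# Return the header's field count and a tally of data rows per field count.
function field_count_report(io::IO; delim::Char=',')
    header = count_fields(readline(io); delim=delim)
    tally = Dict{Int,Int}()
    for line in eachline(io)
        isempty(line) && continue          # skip blank lines
        k = count_fields(line; delim=delim)
        tally[k] = get(tally, k, 0) + 1
    end
    return header, tally
end

# Tiny in-memory example: 2 header names, but 3 fields in each data row.
sample = IOBuffer("a,b\n1,\"x,y\",3\n4,5,6\n")
header, tally = field_count_report(sample)
println("header fields: ", header)
println("data rows by field count: ", tally)
```

On a real file the same report could be obtained with something like `open(io -> field_count_report(io; delim=','), path)`, where `path` is the file from the original post. If the header count differs from the count that dominates the tally, the header line is the first place to look.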

shashi avatar Feb 01 '18 15:02 shashi

Hey @JeromeM75, how did you manage to load this data? Did you find a solution?

kirui93 avatar Jan 14 '20 09:01 kirui93