Query.jl
Long wait times for the first run of `@map`, and a secondary problem of the REPL hanging.
I can only reproduce this with my data file, which is hosted on GitHub (link). It's not a large file (270 KB, 250 rows by 500 columns) of tab-delimited integer data. Please download this file and save it as symp.dat (or alter the code below accordingly).
using DataFrames
using Statistics
using Query
using Base.Filesystem
using CSV
headers = ["sim$i" for i = 1:500]
dt = CSV.File("symp.dat", delim='\t', header=headers) |> DataFrame
dt.time = 1:250 ## add a time column for `melt` purposes
f(g) = g |> @map(mean(_))
@time f(dt)
Output:
julia> @time f(dt)
31.975361 seconds (7.63 M allocations: 415.528 MiB, 0.84% gc time)
And then the REPL hangs, for about 2 minutes, before it finishes printing the 250-element query result.
julia> @time f(dt)
31.190349 seconds (7.64 M allocations: 416.400 MiB, 0.94% gc time)
250-element query result
0.249501
0.265469
0.0838323
0.237525
0.323353
0.359281
0.457086
0.566866
0.798403
1.06587
... with 240 more elements
In summary:
- The first run takes about 35 seconds even when `@map` is in a function (maybe this has something to do with the fact that `dt` is global?).
- The REPL hangs for about 2 minutes after the `@time` macro prints its information. It seems to me it takes a long time to "collect" the results of the query and display them back in the REPL.
Version info:
julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin14.5.0)
CPU: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, ivybridge)
This is also reproduced on 1.0.3 on a completely different machine. After printing the `@time` value, the REPL hangs for more than 2 minutes until the results come in.
julia> @time f(dt)
35.196039 seconds (7.61 M allocations: 414.922 MiB, 0.73% gc time)
...wait 2 minutes (as timed by a stopwatch), then
julia> @time f(dt)
35.196039 seconds (7.61 M allocations: 414.922 MiB, 0.73% gc time)
250-element query result
0.249501
0.265469
0.0838323
0.237525
0.323353
0.359281
0.457086
0.566866
0.798403
1.06587
... with 240 more elements
julia> versioninfo()
Julia Version 1.0.3
Commit 04330c0378 (2018-12-16 21:23 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, broadwell)
Thanks, very helpful bug report!
So I've figured out a bit more, but don't have a solution yet.
The reason that you have to wait after the `@time` macro reports something is this: the code that is actually timed here doesn't do anything, it only assembles the query itself. Then, once the `@time` macro finishes and reports its timing, the return value from `f` is passed to `display` because you are in the REPL, and only at that point is the query actually executed so that we can show the first 10 query elements.
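To make `@time` measure the actual work rather than just query assembly, the query can be forced to materialize inside the timed call. A minimal sketch, assuming a recent DataFrames/Query setup; the random matrix here is a hypothetical stand-in for symp.dat:

```julia
using DataFrames, Statistics, Query

# Hypothetical stand-in for the real data: 250 rows × 500 integer columns.
dt = DataFrame(rand(0:5, 250, 500), ["sim$i" for i in 1:500])

# Piping through `collect` forces the lazy query to execute eagerly,
# so @time captures iteration (and first-call compilation), not just
# the cheap construction of the query object.
f_eager(g) = g |> @map(mean(_)) |> collect

@time result = f_eager(dt)
```

With this version, the long pause shows up inside the `@time` measurement instead of afterwards at `display` time.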
I did manage to narrow it down quite a bit more: even something like `dt |> getiterator |> first` takes ages. For some reason the iterator that gets created for this `DataFrame` takes forever to actually iterate a first element. So this issue actually has nothing to do with Query.jl or `@map`; it is the general row iteration implementation that is slow here.
I suspect that the reason are the 500 columns, it seems that the generated functions that are in play here (I think) spend a lot of time compiling for that many columns. But I'm not sure, I'll have to dig a little deeper to figure this out.
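The column-count hypothesis can be checked without Query.jl at all. A hedged sketch using `Tables.namedtupleiterator` (a public Tables.jl entry point related to the `getiterator` path mentioned above; the data here is synthetic):

```julia
using DataFrames
import Tables

# Materializing the first NamedTuple row forces compilation of code
# specialized on the full set of column names and types, so the
# one-time cost should grow sharply with the number of columns.
for ncols in (5, 500)
    df = DataFrame(rand(0:5, 250, ncols), ["sim$i" for i in 1:ncols])
    @time first(Tables.namedtupleiterator(df))
end
```

If the 500-column case dominates the timing on the first call but not on repeated calls, that points at compilation of the per-column-set row code rather than the iteration itself.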
> So this issue has actually nothing to do with Query.jl or @map, it is the general row iteration implementation that is slow here.

Where is `getiterator` implemented? Is that a `DataFrames` thing? Maybe we can cross-post there. I wish I could help, but it's going to take me weeks to get caught up on the code. Sorry.
The real implementation for that is in Tables.jl.
@davidanthoff Do you still need the `symp.dat` file in my repository to reproduce this bug? I was going to make my repository private (and even delete the data file from git tracking). Is there a way to attach a file to a GitHub issue?
No, I can replicate this independently!