Query.jl icon indicating copy to clipboard operation
Query.jl copied to clipboard

long wait times for first run of `@map` and secondary problem of REPL hanging.

Open affans opened this issue 6 years ago • 6 comments

I can only reproduce this with my data file which is hosted on github link. It's not a large file (270kb, 250 rows by 500 columns) of integer data, delimited by tab. Please download this file and save as symp.dat (or alter the code below accordingly).

using DataFrames
using Statistics
using Query
using Base.Filesystem
using CSV

headers = ["sim$i" for i = 1:500]  
dt = CSV.File("symp.dat", delim='\t', header=headers) |> DataFrame
dt.time = 1:250 ## add a time column for `melt` purposes

f(g) = g |> @map(mean(_))
@time f(dt)

Output:

julia> @time f(dt)
 31.975361 seconds (7.63 M allocations: 415.528 MiB, 0.84% gc time)

And then the REPL hangs. It hangs for about 2 minutes! before it finishes printing the 250-element query result.

julia> @time f(dt)
 31.190349 seconds (7.64 M allocations: 416.400 MiB, 0.94% gc time)

250-element query result
 0.249501
 0.265469
 0.0838323
 0.237525
 0.323353
 0.359281
 0.457086
 0.566866
 0.798403
 1.06587
... with 240 more elements

In summary:

  • First run takes about 35 seconds even when @map is in a function (maybe this has to something with the fact that dtis global?)
  • The REPL hangs for about 2 minutes after the @time macro prints its information. It seems to me it takes a long time to "collect" the results of the query and display it back in the REPL.

Version info:

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, ivybridge)

affans avatar Jan 07 '19 21:01 affans

This is also reproduced on 1.0.3 on a completely different machine. After printing the @time value, the REPL hangs for > 2 minutes until the "results" come in..

julia> @time f(dt)
 35.196039 seconds (7.61 M allocations: 414.922 MiB, 0.73% gc time)

...wait 2 minutes as timed by a stopwatch then

julia> @time f(dt)
 35.196039 seconds (7.61 M allocations: 414.922 MiB, 0.73% gc time)
250-element query result
 0.249501
 0.265469
 0.0838323
 0.237525
 0.323353
 0.359281
 0.457086
 0.566866
 0.798403
 1.06587
... with 240 more elements

julia> versioninfo()
Julia Version 1.0.3
Commit 04330c0378 (2018-12-16 21:23 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, broadwell)

affans avatar Jan 07 '19 21:01 affans

Thanks, very helpful bug report!

So I've figured out a bit more, but don't have a solution yet.

The reason that you have to wait after the @time macro reports something is this: the code that is actually timed here doesn't do anything, it only assembles the query itself. Then, once the @time macro finishes and reports its timing, the return value from f is passed to display because you are in the REPL, and only at that point is the query actually executed so that we can show the first 10 query elements.

I did manage to narrow it down quite a bit more: even something like

dt |> getiterator |> first

takes ages. For some reason the iterator that gets created for this DataFrame takes forever to actually iterate a first element. So this issue has actually nothing to do with Query.jl or @map, it is the general row iteration implementation that is slow here.

I suspect that the reason are the 500 columns, it seems that the generated functions that are in play here (I think) spend a lot of time compiling for that many columns. But I'm not sure, I'll have to dig a little deeper to figure this out.

davidanthoff avatar Jan 09 '19 00:01 davidanthoff

So this issue has actually nothing to do with Query.jl or @map, it is the general row iteration implementation that is slow here.

where is getiterator implemented? Is that a DataFrames thing? Maybe we can cross post there. I wish I could help, but it's going to take me weeks to get caught up to the code. Sorry

affans avatar Jan 09 '19 02:01 affans

The real implementation for that is in Tables.jl.

davidanthoff avatar Jan 10 '19 06:01 davidanthoff

@davidanthoff Do you still need the symp.dat file in my repository to reproduce this bug? I was going to make my repository private (and even delete the data file from git tracking). Is there a way to attach a file to a GitHub issue?

affans avatar Jan 15 '19 04:01 affans

No, I can replicate this independently!

davidanthoff avatar Jan 19 '19 16:01 davidanthoff