TypedTables.jl

`show` hangs for Tables with missing-allowed columns

Open fredcallaway opened this issue 6 years ago • 8 comments

I'm not exactly sure what the root of this problem is, but here's a near-minimal example:

I have a file, foo.csv with contents:

c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16

Now in a Julia console...

julia> t = Table(CSV.File("foo.csv"; allowmissing=:none))
Table with 16 columns and 1 row:
     c1  c2  c3  c4  c5  c6  c7  c8  c9  c10  c11  c12  c13  c14  c15  c16
   ┌──────────────────────────────────────────────────────────────────────
 1 │ 1   2   3   4   5   6   7   8   9   10   11   12   13   14   15   16

julia> t = Table(CSV.File("foo.csv"));

julia> show(t)

That last command hangs (for at least 10 minutes) with full CPU usage. It's also worth noting that I don't get the problem when the table has fewer columns. I don't know the exact cutoff, but 3 columns works.

Oh, I should also add that the Julia process keeps consuming more and more memory until the system runs out of swap and pauses the process.

fredcallaway avatar Jan 02 '19 00:01 fredcallaway

It's also worth noting that I don't get the problem when the table has fewer columns.

That's... interesting. I'll check it out - hopefully it's not any sort of compiler bug.

andyferris avatar Jan 02 '19 01:01 andyferris

I believe I may have just encountered the same underlying problem.

High level situation: I was trying to use group on a FlexTable containing missing data. A bit reduced (specific data removed) this looks like:

julia> x = Union{Int,Missing}[1,2]
2-element Array{Union{Missing, Int64},1}:
 1
 2

julia> t = FlexTable(a=x, b=x, c=x, d=x, e=x, f=x, g=x, h=x, i=x)
FlexTable with 9 columns and 2 rows:
     a  b  c  d  e  f  g  h  i
   ┌──────────────────────────
 1 │ 1  1  1  1  1  1  1  1  1
 2 │ 2  2  2  2  2  2  2  2  2

# The following hangs julia, slowly eating all the memory.
julia> group(getproperty(:a), t)
^C^C^C^C^C^CWARNING: Force throwing a SIGINT
^C^C^C^C^C^C^C^C^C^C^C^CWARNING: Force throwing a SIGINT
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^CWARNING: Force throwing a SIGINT
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^CWARNING: Force throwing a SIGINT

It seems the table must have at least 9 columns to trigger the problem, and I think the place things fail is that FlexTable is converted into a Table when Tables.rows() is called on it by the group function. Either there, or something group tries to do with the resulting Table.

c42f avatar Feb 15 '19 06:02 c42f

@c42f I can execute your example in ~106 seconds on both Julia 1.0.2 and 1.1.0. I suspect the compiler is rather inefficient at handling these named tuples with lots of union elements?
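To make the "named tuples with lots of union elements" suspicion concrete, here is a rough sketch using only Base (no TypedTables.jl). It builds the named-tuple type a row would have for n `Union{Int,Missing}` columns and counts how many concrete field-type combinations sit behind it. The helper names `row_type` and `concrete_combinations` are made up for illustration, and whether the compiler actually enumerates these combinations is an assumption, not something confirmed in this thread.

```julia
# Sketch only: uses Base, not TypedTables.jl. `row_type` and
# `concrete_combinations` are hypothetical helper names.

x = Union{Int,Missing}[1, 2]

# Element type a row would have if each row is a named tuple
# holding one value per column.
row_type(cols::NamedTuple) =
    NamedTuple{keys(cols), Tuple{map(eltype, values(cols))...}}

# Number of concrete field-type combinations behind the Union fields.
# `Base.uniontypes` is an internal (unexported) helper that lists the
# members of a Union.
concrete_combinations(R::Type{<:NamedTuple}) =
    prod(length(Base.uniontypes(T)) for T in fieldtypes(R))

for n in (3, 8, 9, 16)
    names = ntuple(i -> Symbol("c", i), n)
    cols = NamedTuple{names}(ntuple(_ -> x, n))
    println(n, " columns: ", concrete_combinations(row_type(cols)), " combinations")
end
```

With two-member unions the count is 2^n, so it doubles with every added column. That would be consistent with 3 columns being fine and 16 columns being hopeless, if the cost does scale with that count.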

andyferris avatar Feb 16 '19 09:02 andyferris

Yep, it seems to fall off a cliff at 8 elements.

In my real problem I had a table with 17 columns with many missing values so julia slowly used all the memory on my machine.

I had in my head that FlexTable would have the performance characteristics of DataFrame (i.e., perform well with a large number of columns) but with a cleaner API. Maybe that's not right though?

c42f avatar Feb 17 '19 00:02 c42f

Unfortunately, no, it’s a bit different to a DataFrame in that it’s still a tuple on the inside, not a vector, and all operations rewrap it as a table. At the moment, for large numbers of columns (especially abstractly typed ones) you are probably better off using DataFrames.jl.

This internal representation might need to change, I feel. Operations which are “columnwise” needn’t go through Table, for example. Involving the compiler to fiddle with horrendous tuple types when a symbol dictionary is fast enough seems dumb.

andyferris avatar Feb 17 '19 02:02 andyferris

Do you foresee any difficulties just swapping out the internal representation for a Dict{Symbol,Any}? I was having a quick look but I don't know much about the tables ecosystem yet.

c42f avatar Feb 18 '19 04:02 c42f

I'd like to do something like that... Three problems with vanilla Dict I'd like to overcome first:

  • People like to control the column ordering
  • I was considering a system where the rows can similarly be hash-based indexable/getpropertiable/iterable thingies (e.g. with an interface like a named tuple but internals like a Dict).
  • Iterating rows shouldn't copy all the keys, etc. on every iteration, as would happen with the native Dict constructors.

DataFrames.jl has an Index type which might be a useful pattern to mimic, there's also OrderedCollections.jl.
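A minimal sketch of how those three points could fit together, using only Base. Everything here (`DictTable`, `RowView`, `dict_table`, `nrows`, `eachrowview`) is a hypothetical name for illustration, not the TypedTables.jl or DataFrames.jl API, and it assumes a plain `Vector{Symbol}` is enough to carry the column order.

```julia
# Sketch only: `DictTable`, `RowView`, `dict_table`, `nrows` and `eachrowview`
# are hypothetical names, not TypedTables.jl or DataFrames.jl API.

struct DictTable
    names::Vector{Symbol}                  # user-controlled column order
    columns::Dict{Symbol,AbstractVector}   # column lookup by name
end

# Keyword arguments arrive in call order, so the column order is preserved.
function dict_table(; cols...)
    names = Symbol[keys(cols)...]
    columns = Dict{Symbol,AbstractVector}(k => v for (k, v) in cols)
    return DictTable(names, columns)
end

nrows(t::DictTable) = isempty(t.names) ? 0 : length(t.columns[first(t.names)])

# A row is just (table, index): no keys or values are copied when iterating.
struct RowView
    table::DictTable
    index::Int
end

# Named-tuple-like access: `row.a` looks the column up in the Dict.
function Base.getproperty(row::RowView, name::Symbol)
    table = getfield(row, :table)
    return table.columns[name][getfield(row, :index)]
end
Base.propertynames(row::RowView) = getfield(row, :table).names

eachrowview(t::DictTable) = (RowView(t, i) for i in 1:nrows(t))

# 17 missing-allowed columns, as in the problem reported above.
x = Union{Int,Missing}[1, missing]
t = dict_table(; (Symbol("c", i) => x for i in 1:17)...)
for row in eachrowview(t)
    println(row.c1, " ", row.c17)
end
```

The column names never appear in a type parameter here, so the number of columns does not affect what the compiler has to specialize on. The trade-off is that `row.c1` is a dynamic lookup with an abstractly typed result.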

andyferris avatar Feb 28 '19 04:02 andyferris

@c42f if you're interested, check out "draft" PR #46 (nifty feature) for something that will have the interface of a Table but an internal representation a bit like a DataFrame.

andyferris avatar Feb 28 '19 12:02 andyferris