TypedTables.jl
`show` hangs for Tables with missing-allowed columns
I'm not exactly sure what the root of this problem is, but here's a near-minimal example:
I have a file, foo.csv with contents:
c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
Now in a Julia console...
julia> t = Table(CSV.File("foo.csv"; allowmissing=:none))
Table with 16 columns and 1 row:
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16
┌──────────────────────────────────────────────────────────────────────
1 │ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
julia> t = Table(CSV.File("foo.csv"));
julia> show(t)
That last command hangs (for at least 10 minutes) with full CPU usage. It's also worth noting that I don't get the problem when the table has fewer columns. I don't know the exact cutoff, but 3 columns works.
Oh, I should also add that the Julia process keeps consuming more and more memory until the system runs out of swap and pauses the process.
> It's also worth noting that I don't get the problem when the table has fewer columns.
That's... interesting. I'll check it out - hopefully it's not any sort of compiler bug.
I believe I may have just encountered the same underlying problem.
High level situation: I was trying to use group on a FlexTable containing missing data. A bit reduced (specific data removed) this looks like:
julia> x = Union{Int,Missing}[1,2]
2-element Array{Union{Missing, Int64},1}:
1
2
julia> t = FlexTable(a=x, b=x, c=x, d=x, e=x, f=x, g=x, h=x, i=x)
FlexTable with 9 columns and 2 rows:
a b c d e f g h i
┌──────────────────────────
1 │ 1 1 1 1 1 1 1 1 1
2 │ 2 2 2 2 2 2 2 2 2
# The following hangs julia, slowly eating all the memory.
julia> group(getproperty(:a), t)
^C^C^C^C^C^CWARNING: Force throwing a SIGINT
^C^C^C^C^C^C^C^C^C^C^C^CWARNING: Force throwing a SIGINT
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^CWARNING: Force throwing a SIGINT
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^CWARNING: Force throwing a SIGINT
It seems the table must have at least 9 columns to trigger the problem, and I think the place things fail is that FlexTable is converted into a Table when Tables.rows() is called on it by the group function. Either there, or something group tries to do with the resulting Table.
@c42f I can execute your example in ~106 seconds on both julia 1.0.2 and 1.1.0. I suspect the compiler is rather inefficient at handling these named tuples with lots of union elements?
Yep, it seems to fall off a cliff at 8 elements.
In my real problem I had a table with 17 columns with many missing values so julia slowly used all the memory on my machine.
I had in my head that FlexTable would have the performance characteristics of DataFrame (ie, perform well with a large number of columns) but with a cleaner API. Maybe that's not right though?
Unfortunately, no, it’s a bit different to a DataFrame in that it’s still a tuple on the inside, not a vector, and all operations rewrap it as a table. At the moment, for large numbers of columns (especially abstractly typed ones) you are probably better off using DataFrames.jl.
This internal representation might need to change, I feel. Operations which are “columnwise” needn’t go through Table, for example. Involving the compiler to fiddle with horrendous tuple types when a symbol dictionary is fast enough seems dumb.
Do you foresee any difficulties just swapping out the internal representation for a Dict{Symbol,Any}? I was having a quick look but I don't know much about the tables ecosystem yet.
I'd like to do something like that... Three problems with vanilla Dict I'd like to overcome first:
- People like to control the column ordering
- I was considering a system where the rows can similarly be hash-based indexable/getpropertiable/iterable thingies (e.g. interface like a named tuple but internally like a Dict).
- Where iterating rows doesn't copy all the keys, etc. on every iteration, as would happen with the native Dict constructors.
DataFrames.jl has an Index type which might be a useful pattern to mimic, there's also OrderedCollections.jl.
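To make the three requirements above concrete, here is a minimal sketch of what an ordered, dictionary-backed column store could look like. It only illustrates the idea and is not how TypedTables.jl or DataFrames.jl are implemented: `ColumnStore`, `RowView`, `nrows` and `eachrowview` are hypothetical names invented for this example. Column order lives in a plain vector, lookup goes through a `Dict{Symbol,Int}` (loosely in the spirit of an index type), and a row is just a reference to the store plus a row number, so iterating rows copies no keys.

```julia
# Hypothetical sketch: none of these names exist in TypedTables.jl.
struct ColumnStore
    names::Vector{Symbol}            # keeps the user's column order
    lookup::Dict{Symbol,Int}         # column name -> position in `columns`
    columns::Vector{AbstractVector}  # element types stay out of the type signature
end

# Build a store from keyword arguments; keyword order is preserved.
function ColumnStore(; kwargs...)
    names = Symbol[]
    lookup = Dict{Symbol,Int}()
    columns = AbstractVector[]
    for (name, col) in kwargs
        push!(names, name)
        push!(columns, col)
        lookup[name] = length(names)
    end
    return ColumnStore(names, lookup, columns)
end

# Column access by name is a hash lookup, with no tuple types involved.
Base.getindex(s::ColumnStore, name::Symbol) = s.columns[s.lookup[name]]

nrows(s::ColumnStore) = isempty(s.columns) ? 0 : length(first(s.columns))

# A row is a lightweight view: a reference to the store and a row number.
struct RowView
    store::ColumnStore
    i::Int
end

Base.getindex(r::RowView, name::Symbol) = r.store[name][r.i]

# Lazily iterate rows without copying keys or values.
eachrowview(s::ColumnStore) = (RowView(s, i) for i in 1:nrows(s))

# Usage: nine columns that allow missing values, as in the example earlier in the thread.
x = Union{Int,Missing}[1, 2]
s = ColumnStore(a=x, b=x, c=x, d=x, e=x, f=x, g=x, h=x, i=x)
for row in eachrowview(s)
    println(row[:a], " ", row[:i])
end
```

The trade-off is that column element types are no longer visible to the compiler, so per-row access is dynamically dispatched. That is the cost of keeping large `Union`-heavy tuple types out of the type signature.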
@c42f if you're interested, check out "draft" PR #46 (nifty feature) for something that will have the interface of a Table but an internal representation a bit like a DataFrame.