Frames
A User Report
I am not sure this is an issue, but I thought I would report back on my experience with a 100M file with 24,001 columns.

- `tableTypes "User" "clouds.csv"` takes a long time: 5+ minutes.
- I didn't dare do `:r`, but just did `let pfStream :: Producer User IO (); pfStream = readTableOpt userParser "clouds.csv"` at the `ghci` prompt. This gave `Context reduction stack overflow; size = 101` with an error message of over 24,000 lines!
- At this point, I am heading back to R :-(

Let me know if you want any more details.
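For context, the failing program amounts to something like the following sketch. The module name and imports are my guesses; `tableTypes`, `readTableOpt`, `userParser`, and the `IO`-based stream type are taken from the report itself, though exact signatures may differ across Frames versions:

```haskell
{-# LANGUAGE DataKinds, TemplateHaskell, TypeOperators #-}
module Clouds where

import Frames
import Pipes (Producer)

-- This splice reads part of "clouds.csv" at compile time, infers a type
-- for each of the 24,001 columns, and generates the 'User' row type plus
-- a 'userParser' value. At this width the splice alone takes minutes.
tableTypes "User" "clouds.csv"

-- Stream rows one at a time rather than loading the 100M file into memory.
pfStream :: Producer User IO ()
pfStream = readTableOpt userParser "clouds.csv"
```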
Wow, thank you! To be frank, it didn't cross my mind to come up with a test case with that many columns. It's a shame that performance is so bad. I imagine that there are a couple things going wrong at that scale. The TemplateHaskell that infers the column types is running without optimizations, and so taking a long time. Normally, the advice would be to not run this in GHCi, but in this case it's the compilation time that is so long, so I don't think it would make a big difference. Then, once we have the types, we have constraints with 24k elements, while GHC starts off with a maximum context stack size of a few dozen by default.
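As an aside, the context-reduction limit itself can be raised, though with constraints this large that mostly trades an immediate error for a very long wait. A hedged sketch of the relevant GHC settings (the flag shown is the GHC 7.x spelling; later releases renamed it `-freduction-depth`):

```haskell
-- Raise the context-reduction limit for one module:
{-# OPTIONS_GHC -fcontext-stack=100000 #-}

-- or interactively:
-- ghci> :set -fcontext-stack=100000
```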
This is really helpful feedback, as I would not have guessed somebody would have that many columns. I'll have to think about it some, but I don't think GHC is ever going to be happy with what is effectively a record with 24k fields. I'll put a warning to that effect in the README.
The way forward:
- Use R for this :-(
- Make the column type inference sensitive to the number of columns. Right now it just reads some number of rows, but we should read fewer rows if there are so many columns. This will make inference less reliable, but 5 minutes is horrible.
- Figure out if we can make the initial inferred row type representation more compact somehow, rather than represented as a list. A problem is that if we want to, say, remove some fields in some operations, we'll need to take apart the initial row type. This means we can't just represent it as some opaque label, and if we work out a way to represent it as an array rather than a list, some operations will be a bit more awkward (e.g. consing a new type onto the row).
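The second bullet could be as simple as holding the total number of inspected cells roughly constant, so very wide tables are sampled over fewer rows. A sketch, with the function name and budget made up for illustration:

```haskell
-- Pick how many rows to sample for column-type inference, given the
-- column count, so total inference work stays bounded for wide tables.
rowsToSample :: Int -> Int
rowsToSample nCols = max minRows (cellBudget `div` max 1 nCols)
  where
    minRows    = 10      -- always inspect a few rows, however wide
    cellBudget = 100000  -- illustrative cap on total cells to inspect
```

With 24,001 columns this samples only the 10-row floor; with 150 columns it samples 666 rows.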
Thanks again for this report. I'm sorry that the experiment wasn't a success, but I really appreciate you taking the time to let me know.
Thanks for your very comprehensive reply. To be fair we are probably using the wrong data structure (CSV) for our data as I may want to use 10x or even more particles. I suspect even R would baulk at that! However, pretty much any tool that claims to do data analysis can handle CSV and changing our data structures is going to be a reasonable amount of work.
BTW I managed to stay in Haskell land by using cassava in a dynamically typed way (everything is a string, and I read values I know to be `Double`).
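That workaround can be sketched roughly like this (the file name and column index are placeholders; every cell is parsed as a `String`, and only fields known to be numeric are converted):

```haskell
import Data.Csv (HasHeader (..), decode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V

main :: IO ()
main = do
  bytes <- BL.readFile "clouds.csv"
  -- Dynamically typed: every cell comes back as a String.
  case decode HasHeader bytes :: Either String (V.Vector (V.Vector String)) of
    Left err   -> putStrLn ("parse error: " ++ err)
    Right rows -> do
      -- Convert a column we know holds Doubles (index 2 is a placeholder).
      let col = V.map (\r -> read (r V.! 2) :: Double) rows
      print (V.sum col)
```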
I would assume that the number of columns could be tens of thousands or more; that sort of use case comes up in a lot of machine learning contexts (it's not uncommon for there to be millions of predictors with mixed types). CSV is not an ideal format, of course, but for better or worse it's a common currency. Handling the tens-of-thousands case, at least, should be considered, as that is quite common.
Nice to hear cassava works, but ideally there would be a better solution than reinventing a poor man's dynamic typing.
I've been thinking about this ever since @idontgetoutmuch's original report, and have made some progress. I'm still trying to cut down compile times, but maybe we should set up a test for this kind of extreme scalability. We should have a generator for a suitably-sized CSV file, and then I'll put my in-progress work up on a branch. It'd be cool if we could use cabal benchmarks to drive a parameterized test, e.g. `cabal bench scalability 10k` (but with whatever the correct syntax is).
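For the generator side, something this small would do; the column/row counts, output name, and cell values are arbitrary (they just need to exercise type inference):

```haskell
import Data.List (intercalate)

-- Write an nCols-wide, nRows-tall CSV of Doubles for scalability testing.
genCsv :: FilePath -> Int -> Int -> IO ()
genCsv path nCols nRows =
  writeFile path . unlines $
    header : [ row r | r <- [1 .. nRows] ]
  where
    header = intercalate "," [ "col" ++ show c | c <- [1 .. nCols] ]
    row r  = intercalate "," [ show (fromIntegral (r * c) / 10 :: Double)
                             | c <- [1 .. nCols] ]

main :: IO ()
main = genCsv "scalability.csv" 24001 100
```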
@acowley Did you ever push that stuff to a branch? Perhaps if it is in bad condition, just push it as is and name the branch `*-wip`?
I just dug up my experiment and updated it to a current LTS, and it's bad news bears. What I did was re-implement part of vinyl using a tree of types to index the record type. In a runtime benchmark, this greatly improves getting a field from a large record. However, compilation times are significantly worse for this approach than for standard vinyl.
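The idea, roughly, is to index the record by a type-level tree so that reaching a field means following a logarithmic-depth path instead of walking a type-level list. A minimal sketch of one way to do it (this is an illustration, not the code on the branch):

```haskell
{-# LANGUAGE DataKinds, GADTs, PolyKinds, TypeOperators #-}

-- A type-level tree of field types, instead of vinyl's type-level list.
data Tree a = Leaf a | Node (Tree a) (Tree a)

-- A record shaped like its index tree.
data TRec (f :: k -> *) (t :: Tree k) where
  TLeaf :: f a                  -> TRec f ('Leaf a)
  TNode :: TRec f l -> TRec f r -> TRec f ('Node l r)

-- A path from the root of the tree down to one field.
data Path (t :: Tree k) (a :: k) where
  Here :: Path ('Leaf a) a
  GoL  :: Path l a -> Path ('Node l r) a
  GoR  :: Path r a -> Path ('Node l r) a

-- Field access follows the path: O(depth) rather than O(fields).
tget :: Path t a -> TRec f t -> f a
tget Here    (TLeaf x)   = x
tget (GoL p) (TNode l _) = tget p l
tget (GoR p) (TNode _ r) = tget p r
```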
I apparently implemented the tree-indexed records two different ways, and wrote a compilation benchmark using the GHC API that gives me these results today with GHC-8.0.1:
```
vinyl: Definition: 17.78 ms;  Use: 27.89 ms
tree:  Definition: 221.95 ms; Use: 42.80 ms
tree2: Definition: 39.61 ms;  Use: 31.05 ms
```
That is the type checking time for a module that defines a record value, and the type checking time for a module that pulls a field out of that record.
The one bright spot is a runtime benchmark. Though there is a ton of noise, you can see the lower and upper bound times for each variation are quite distinct:
```
benchmarking record get
time                 7.530 ns   (7.378 ns .. 7.683 ns)
                     0.996 R²   (0.995 R² .. 0.998 R²)
mean                 7.489 ns   (7.338 ns .. 7.647 ns)
std dev              533.4 ps   (447.7 ps .. 657.5 ps)
variance introduced by outliers: 86% (severely inflated)

benchmarking Rec get
time                 919.9 ns   (903.3 ns .. 934.2 ns)
                     0.997 R²   (0.996 R² .. 0.998 R²)
mean                 909.9 ns   (891.7 ns .. 927.6 ns)
std dev              59.79 ns   (49.87 ns .. 73.85 ns)
variance introduced by outliers: 77% (severely inflated)

benchmarking TRec get
time                 28.47 ns   (27.96 ns .. 29.00 ns)
                     0.996 R²   (0.993 R² .. 0.997 R²)
mean                 28.67 ns   (27.87 ns .. 29.60 ns)
std dev              2.817 ns   (2.338 ns .. 3.694 ns)
variance introduced by outliers: 91% (severely inflated)
```
That's pulling the 26th field out of a 26-field record using a regular Haskell data type, a vinyl `Rec`, and a tree-indexed record.
I haven't looked at this code in about a year, so I'm not familiar with exactly what it's all about, but it looks like the effort to improve compile times was a bust.
It looks like a few data sets I'm playing with actually have 150 columns. In GHCi, `tableTypes'` works fine, but when I actually try to compile, I get 40GB of memory usage. I bet if I compile with `ghc -O0` it will compile as it did in GHCi... though in my verbose compile I noticed it got hung up on the third round of simplification, so my intuition leads me to wonder if I've found a case the simplifier's loop breakers are missing.
Is the thing you're compiling just a `tableTypes'` splice, or is there some other code, too?
@acowley It's just a `tableTypes'` splice. I'll see if I can make a minimal reproduction.
A good thing we could do then is comment out most of what gets spliced in to see which part is blowing things up (hopefully it's not the type alias for the row type!).
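One way to get at the spliced code for that kind of bisection is GHC's splice dump, which records exactly what `tableTypes'` generated so pieces can be pasted into the module and commented out by hand (the dump-file name below is my understanding of GHC's naming scheme):

```haskell
-- Compile with the splice dump enabled to see the generated declarations:
--
--   ghc -ddump-splices -ddump-to-file MyModule.hs
--
-- The generated code lands in a MyModule.dump-splices file next to the
-- build products; it can replace the splice in the source for bisection.
```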