
A User Report

idontgetoutmuch opened this issue on Sep 05 '15 • 10 comments

I am not sure this is an issue, but I thought I would report back on my experience with a 100M file with 24,001 columns.

  1. tableTypes "User" "clouds.csv" takes a long time: 5+ minutes.
  2. I didn't dare do :r, but just did let pfStream :: Producer User IO (); pfStream = readTableOpt userParser "clouds.csv" at the ghci prompt (sketched below). This gave Context reduction stack overflow; size = 101, with an error message of over 24,000 lines!
  3. At this point, I am heading back to R :-(

Let me know if you want any more details.
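
For reference, a sketch of the GHCi session from item 2 above, together with the flag from that era for raising GHC's context-reduction limit (-fcontext-stack in GHC 7.x, renamed -freduction-depth in GHC 8). Raising the limit only lifts the cap; with 24,001 columns the constraints themselves remain enormous, so this is a possible mitigation rather than a known fix:

```
ghci> :set -fcontext-stack=1000
ghci> let pfStream :: Producer User IO (); pfStream = readTableOpt userParser "clouds.csv"
```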

idontgetoutmuch, Sep 05 '15

Wow, thank you! To be frank, it didn't cross my mind to come up with a test case with that many columns. It's a shame that performance is so bad. I imagine that there are a couple of things going wrong at that scale. The TemplateHaskell that infers the column types is running without optimizations, and so takes a long time. Normally, the advice would be to not run this in GHCi, but in this case it's the compilation time that is so long, so I don't think it would make a big difference. Then, once we have the types, we have constraints with 24k elements, whereas GHC starts off with a maximum context stack size of a few dozen by default.

This is really helpful feedback, as I would not have guessed somebody would have that many columns. I'll have to think about it some, but I don't think GHC is ever going to be happy with what is effectively a record with 24k fields. I will put a warning to that effect in the README.
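
To make the scale concrete, this is the shape of the row representation involved: a vinyl Rec indexed by a type-level list, sketched here with placeholder column types (assuming vinyl's Data.Vinyl and Data.Vinyl.Functor modules; the real splice also names and wraps each column):

```haskell
{-# LANGUAGE DataKinds, TypeOperators #-}
import Data.Vinyl (Rec(..))
import Data.Vinyl.Functor (Identity(..))

-- Illustrative only: with clouds.csv this type-level list would have
-- 24,001 entries, and any constraint that walks the list grows with it.
type Row = Rec Identity '[Double, Double, Double]

row :: Row
row = Identity 1.0 :& Identity 2.0 :& Identity 3.0 :& RNil
```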

The way forward:

  • Use R for this :-(
  • Make the column type inference sensitive to the number of columns. Right now it just reads some number of rows, but we should read fewer rows when there are that many columns. This will make inference less reliable, but 5 minutes is horrible. (A possible shape for this is sketched after this list.)
  • Figure out whether we can make the initial inferred row type representation more compact somehow, rather than representing it as a list. A problem is that if we want to, say, remove some fields in some operations, we'll need to take apart the initial row type. This means we can't just represent it as some opaque label, and if we work out a way to represent it as an array rather than a list, some operations will be a bit more awkward (e.g. cons'ing a new type onto the row).
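
A hypothetical helper (not part of Frames) showing one possible shape for that column-count-sensitive sampling:

```haskell
-- Hypothetical helper: choose how many rows to sample for column-type
-- inference from a total cell budget, so very wide files read fewer rows.
rowsToSample :: Int  -- ^ number of columns in the file
             -> Int  -- ^ rows the inference would normally inspect
             -> Int
rowsToSample numCols defaultRows =
  let cellBudget = 100000  -- inspect at most this many individual cells
  in max 1 (min defaultRows (cellBudget `div` max 1 numCols))
```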

Thanks again for this report. I'm sorry that the experiment wasn't a success, but I really appreciate you taking the time to let me know.

acowley, Sep 05 '15

Thanks for your very comprehensive reply. To be fair, we are probably using the wrong data structure (CSV) for our data, as I may want to use 10x or even more particles. I suspect even R would baulk at that! However, pretty much any tool that claims to do data analysis can handle CSV, and changing our data structures is going to be a reasonable amount of work.

BTW I managed to stay in Haskell land by using cassava in a dynamically typed way (everything is a string and I read values I know to be Double).
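
For the record, a minimal sketch of that dynamically typed approach, assuming cassava's Data.Csv API (the file name and column index are placeholders):

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL
import qualified Data.Csv as Csv
import qualified Data.Vector as V

main :: IO ()
main = do
  raw <- BL.readFile "clouds.csv"
  -- Every field comes back as a ByteString; we only parse the ones we
  -- know to be Doubles.
  case Csv.decode Csv.HasHeader raw :: Either String (V.Vector (V.Vector BC.ByteString)) of
    Left err   -> putStrLn err
    Right rows -> do
      let firstCol r = read (BC.unpack (r V.! 0)) :: Double  -- quick and dirty
      print (V.sum (V.map firstCol rows))
```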

idontgetoutmuch, Sep 06 '15

I would assume that the number of columns could be tens of thousands or more; that sort of use case comes up in a lot of machine learning contexts (it's not uncommon for there to be millions of predictors with mixed types). CSV is not an ideal format, of course, but for better or worse it's a common currency. Handling the tens-of-thousands case at least should be considered, as that is quite common.

Nice to hear cassava works, but ideally there would be a better solution than reinventing a poor man's dynamic typing.

vixr, Jan 18 '16

I've been thinking about this ever since @idontgetoutmuch's original report, and have made some progress. I'm still trying to cut down compile times, but maybe we should set up a test for this kind of extreme scalability. We should have a generator for a suitably-sized CSV file, and then I'll put my in-progress work up on a branch. It'd be cool if we could use cabal benchmarks to drive a parameterized test, e.g. "cabal bench scalability 10k" (but with whatever the correct syntax is).
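
A possible generator for such a file (file name, sizes, and the value formula are placeholders); on the cabal side, the --benchmark-options flag may be the way to pass the size through:

```haskell
import Data.List (intercalate)

-- Write an nCols-wide, nRows-long CSV of Doubles with a header row.
writeWideCsv :: FilePath -> Int -> Int -> IO ()
writeWideCsv path nCols nRows =
  writeFile path . unlines $
      intercalate "," [ "col" ++ show c | c <- [1..nCols] ]
    : [ intercalate "," [ show (fromIntegral (r * c) / 7 :: Double)
                        | c <- [1..nCols] ]
      | r <- [1..nRows] ]

-- e.g. writeWideCsv "wide.csv" 10000 100
```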

acowley, Jan 18 '16

@acowley Did you ever push that stuff to a branch? Perhaps if it is in bad condition just push it as is and name the branch *-wip?

codygman, Dec 05 '16

I just dug up my experiment and updated it to a current LTS, and it's bad news bears. What I did was re-implement part of vinyl using a tree of types to index the record type. In a runtime benchmark, this greatly improves getting a field from a large record. However, compilation times are significantly worse for this approach than for standard vinyl.
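
A minimal sketch of the tree-of-types idea (not the code on the branch, just its shape): instead of a cons-list index, the record is indexed by a tree, so fetching a field follows a short path of branches rather than walking a 24k-element list.

```haskell
{-# LANGUAGE DataKinds, FlexibleInstances, GADTs, KindSignatures,
             MultiParamTypeClasses, ScopedTypeVariables, TypeOperators #-}
module TreeRecSketch where

import Data.Proxy (Proxy(..))

-- The index: a tree of field types instead of a cons-list of them.
data Tree a = Leaf a | Node (Tree a) (Tree a)

-- A path through the tree: go left or right at each Node.
data Side = L | R

-- A record indexed by a promoted tree of field types.
data TRec (f :: * -> *) (t :: Tree *) where
  TLeaf :: f a -> TRec f ('Leaf a)
  TNode :: TRec f l -> TRec f r -> TRec f ('Node l r)

-- Fetch the field at the end of a path; the instance chain is as long as
-- the depth of the tree, not the number of fields.
class HasPath (p :: [Side]) (t :: Tree *) (a :: *) where
  getPath :: Proxy p -> TRec f t -> f a

instance HasPath '[] ('Leaf a) a where
  getPath _ (TLeaf x) = x

instance HasPath p l a => HasPath ('L ': p) ('Node l r) a where
  getPath _ (TNode l _) = getPath (Proxy :: Proxy p) l

instance HasPath p r a => HasPath ('R ': p) ('Node l r) a where
  getPath _ (TNode _ r) = getPath (Proxy :: Proxy p) r

-- e.g. pulling the left-hand field out of a two-field record:
leftField :: Maybe Int
leftField = getPath (Proxy :: Proxy '[ 'L ])
                    (TNode (TLeaf (Just 1)) (TLeaf (Just "s")))
```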

I apparently implemented the tree-indexed records two different ways, and wrote a compilation benchmark using the GHC API that gives me these results today with GHC-8.0.1:

vinyl: Definition: 17.78 ms; Use: 27.89 ms
tree: Definition: 221.95 ms; Use: 42.80 ms
tree2: Definition: 39.61 ms; Use: 31.05 ms

That is the type checking time for a module that defines a record value, and the type checking time for a module that pulls a field out of that record.

The one bright spot is a runtime benchmark. Though there is a ton of noise, you can see the lower and upper bound times for each variation are quite distinct:

benchmarking record get
time                 7.530 ns   (7.378 ns .. 7.683 ns)
                     0.996 R²   (0.995 R² .. 0.998 R²)
mean                 7.489 ns   (7.338 ns .. 7.647 ns)
std dev              533.4 ps   (447.7 ps .. 657.5 ps)
variance introduced by outliers: 86% (severely inflated)

benchmarking Rec get
time                 919.9 ns   (903.3 ns .. 934.2 ns)
                     0.997 R²   (0.996 R² .. 0.998 R²)
mean                 909.9 ns   (891.7 ns .. 927.6 ns)
std dev              59.79 ns   (49.87 ns .. 73.85 ns)
variance introduced by outliers: 77% (severely inflated)

benchmarking TRec get
time                 28.47 ns   (27.96 ns .. 29.00 ns)
                     0.996 R²   (0.993 R² .. 0.997 R²)
mean                 28.67 ns   (27.87 ns .. 29.60 ns)
std dev              2.817 ns   (2.338 ns .. 3.694 ns)
variance introduced by outliers: 91% (severely inflated)

That's pulling the 26th field out of a 26-field record using a regular Haskell data type, a vinyl Rec, and a tree-indexed record.

I haven't looked at this code in about a year, so I'm not familiar with exactly what it's all about, but it looks like the effort to improve compile times was a bust.

acowley, Dec 06 '16

It looks like a few data sets I'm playing with actually have 150 columns. In GHCi, tableTypes' will work fine, but when I actually try to compile it I get 40 GB of memory usage. I bet if I compile with ghc -O0 it will compile as it did in GHCi... though in my verbose compile I noticed it got hung up on the 3rd round of simplification, so my intuition leads me to wonder if I found a case the simplifier's loop breakers are missing.

codygman, Dec 11 '16

Is the thing you're compiling just a tableTypes' splice, or is there some other code, too?

acowley, Dec 12 '16

@acowley It's just a tableTypes' splice. I'll see if I can make a minimal reproduction.
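
A hypothetical shape for that minimal reproduction (this uses the plain tableTypes form from the original report; the tableTypes'/rowGen invocation differs slightly, the exact extension list depends on the Frames version, and "wide.csv" is a placeholder for the 150-column file):

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, OverloadedStrings,
             TemplateHaskell, TypeOperators #-}
module Repro where

import Frames

-- The whole module is the splice; comparing `ghc Repro.hs` with
-- `ghc -O0 Repro.hs` should show whether the blow-up is in the simplifier.
tableTypes "Wide" "wide.csv"
```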

codygman, Dec 12 '16

A good thing we could do, then, is comment out most of what gets spliced in to see which part is blowing things up (hopefully it's not the type alias for the row type!).
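
One way to get hold of what the splice produces, so pieces can be pared away by hand, is GHC's splice dump (a suggestion only; the flags go at the top of the module being tested):

```haskell
-- -ddump-splices prints the declarations generated by Template Haskell, and
-- -ddump-to-file writes them to a .dump-splices file; that output can then be
-- pasted into a module and commented out piece by piece.
{-# OPTIONS_GHC -ddump-splices -ddump-to-file #-}
```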

acowley, Dec 12 '16