printing of large dataset fails
I created a large dataset of shape [4 3724776],
and printing even a small part of it in the REPL fails:
(ds/head ds)
; Error printing return value (ArithmeticException) at java.lang.Math/toIntExact (Math.java:1074).
; integer overflow
clj꞉text-perf꞉>
clojure.main/repl (main.clj:442)
clojure.main/repl (main.clj:459)
clojure.main/repl (main.clj:368)
nrepl.middleware.interruptible-eval/evaluate (interruptible_eval.clj:84)
nrepl.middleware.interruptible-eval/evaluate (interruptible_eval.clj:56)
nrepl.middleware.interruptible-eval/interruptible-eval (interruptible_eval.clj:152)
nrepl.middleware.session/session-exec (session.clj:218)
nrepl.middleware.session/session-exec (session.clj:217)
java.lang.Thread/run (Thread.java:829)
I should mention that the dataset is really big: it has about 3.7 million rows of text, roughly 8 GB in total.
I loaded it successfully using the "mmap" feature, like this:
(ds/->dataset "bigdata/repeatedAbstrcats_3.7m_.txt"
              {:text-temp-dir "/tmp/xxx"
               :file-type :tsv
               :header-row? false})
Restricting it to 1 million rows made it work. In that case the mmap file on disk was about 2 GB.
I have worked with TMD datasets with far more rows before, so the row count itself is not the problem.
I am not sure whether the cause is the "mmap" support or the strings being printed.
It seems to be a printing-related issue: I can run operations on the dataset, I just cannot print any of their results.
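
For what it's worth, the exception itself can be reproduced without TMD. The snippet below is only a sketch under an unconfirmed assumption: that somewhere in the print path a byte offset or length into the roughly 8 GB of text is narrowed to a 32-bit int. The byte count is made up for illustration and is not taken from the real file.

```clojure
;; Largest value a 32-bit int can hold.
(def int-limit Integer/MAX_VALUE)        ; 2147483647

;; Hypothetical byte count of about 8 GB (illustrative, not measured).
(def eight-gb (* 8 1000 1000 1000))      ; 8000000000

;; A value at the limit still narrows cleanly.
(Math/toIntExact int-limit)
;; => 2147483647

;; Anything larger throws the same exception as in the trace above.
(try
  (Math/toIntExact eight-gb)
  (catch ArithmeticException e
    (.getMessage e)))
;; => "integer overflow"
```

If that assumption holds, it would also fit the observation that the 1 million row subset prints fine while the full dataset does not.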