tech.ml.dataset

printing of large dataset fails

Open behrica opened this issue 1 year ago • 2 comments

I created a large dataset of shape [4 3724776], and printing even small parts of it in the REPL fails:

(ds/head ds)
; Error printing return value (ArithmeticException) at java.lang.Math/toIntExact (Math.java:1074).
; integer overflow
clj꞉text-perf꞉> 
clojure.main/repl (main.clj:442)
clojure.main/repl (main.clj:459)
clojure.main/repl (main.clj:368)
nrepl.middleware.interruptible-eval/evaluate (interruptible_eval.clj:84)
nrepl.middleware.interruptible-eval/evaluate (interruptible_eval.clj:56)
nrepl.middleware.interruptible-eval/interruptible-eval (interruptible_eval.clj:152)
nrepl.middleware.session/session-exec (session.clj:218)
nrepl.middleware.session/session-exec (session.clj:217)
java.lang.Thread/run (Thread.java:829)

behrica, Oct 28 '24 11:10

I should mention that the dataset is really big: it has about 3.7 million rows of text, roughly 8 GB in total.

I loaded it successfully using the "mmap" feature, like this:

(ds/->dataset "bigdata/repeatedAbstrcats_3.7m_.txt"
              {:text-temp-dir "/tmp/xxx"
               :file-type :tsv
               :header-row? false})

behrica, Oct 28 '24 11:10

Restricting it to 1 million rows made it work. In this case the mmap file on disk was about 2 GB.

I have worked with TMD datasets with far more rows before, so the row count is not the problem.

I am not sure whether it is the "mmap" support or the strings being printed.

It seems to be a printing-related issue. I can run operations on the dataset, but I cannot print any of their results.
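
For reference, the exception in the trace is what `Math/toIntExact` throws whenever a long value does not fit into an int. The sketch below only reproduces that general behaviour. It assumes, without confirmation from the library source, that some byte offset or length above `Integer/MAX_VALUE` gets narrowed to an int somewhere in the printing path. The helper `narrow-to-int` is a made-up name for illustration, not a tech.ml.dataset function.

```clojure
;; Illustration only: this is not code from tech.ml.dataset.
;; narrow-to-int is a hypothetical helper that mimics narrowing a
;; long (e.g. a byte offset into a large file) to an int.
(defn narrow-to-int
  "Narrows a long to an int, throwing ArithmeticException on overflow."
  [^long n]
  (Math/toIntExact n))

;; A value below Integer/MAX_VALUE (2147483647) narrows without error.
(narrow-to-int 2000000000)

;; A value on the order of 8 GB expressed in bytes does not fit in an int.
(defn overflow-message
  "Returns the exception message when n overflows an int, else nil."
  [^long n]
  (try
    (narrow-to-int n)
    nil
    (catch ArithmeticException e
      (.getMessage e))))

(overflow-message 8000000000) ; => "integer overflow"
```

If that assumption holds, it would fit the observation that the run with the roughly 2 GB mmap file printed fine while the roughly 8 GB one did not, but I have not verified where the narrowing actually happens.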

behrica avatar Oct 28 '24 11:10 behrica