tech.ml.dataset

added printing of `meta` to Column

Open behrica opened this issue 2 years ago • 8 comments

I changed only one line, but the diff looks huge because of some whitespace mess.

behrica avatar Jun 21 '22 20:06 behrica

This PR clearly collides with #203, which wants single-line printing for columns...

behrica avatar Jun 21 '22 20:06 behrica

I'm watching this too. I like the suggestion in #203 as well. The single-line, reader-safe option seems nice. It also connects with the project I'm engaged in right now, adding a new column API to tablecloth (broader context here). A reader-safe print could do a lot, I think, to give the column its own identity, insofar as that ends up being useful.

ezmiller avatar Jun 29 '22 13:06 ezmiller

@ezmiller I was looking at your article and PR. Interesting work.

I could make some comments regarding "finally, consider importing ideas from R’s “factors”". Where should I make those?

behrica avatar Jun 30 '22 12:06 behrica

I don't think printing the metadata should be the default. The example in Zulip had a relatively small lookup table, but when you have, say, 100 items in the lookup table, just printing the column metadata is pretty tedious.

Regarding any single-line printing mechanism: I don't think it is appropriate for datasets to have a printing mechanism that will print a 100000-item column to the repl. I don't know what R does when you 'print' a really large column at the repl, but that type of thing results in repls hanging and users having to cancel and restart a repl session. This type of thing works for toy datasets but not for substantial ones, which is exactly when you would want to use datasets in the first place. Nor do I think saving a column as 'edn' is a great idea, for exactly the same reason. I want to avoid systems that break when someone actually needs them, e.g. when their dataset is large enough that normal Clojure pathways are either too slow or just painful.

Putting this together, however: for tmdjs I do have single-line printing in vector style by default, i.e. output of the form `#[column [dtype len] [...]]`. Maybe this is a better way overall for columns, and perhaps a column could also print its lookup table if it has one and the table isn't too large. Then everyone gets a piece of what they want, but we do it carefully to avoid too much noise in the repl.
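To make the proposed format concrete, here is a minimal sketch, written in Python purely for illustration. The function `format_column`, its parameters and the limits `edge_items` and `max_lookup_items` are hypothetical; none of this is the actual tmdjs or tech.ml.dataset implementation. It abbreviates long columns to their first and last few values and prints the lookup table only when it is small.

```python
from typing import Any, Mapping, Optional, Sequence


def format_column(
    values: Sequence[Any],
    dtype: str,
    lookup: Optional[Mapping[Any, Any]] = None,
    edge_items: int = 3,
    max_lookup_items: int = 10,
) -> str:
    """Render a column on one line as #[column [dtype len] [...]].

    Hypothetical helper: names and limits are illustrative assumptions,
    not part of tech.ml.dataset or tmdjs.
    """
    n = len(values)
    if n > 2 * edge_items:
        # Long column: show only the first and last few values.
        head = [repr(v) for v in values[:edge_items]]
        tail = [repr(v) for v in values[n - edge_items:]]
        shown = head + ["..."] + tail
    else:
        shown = [repr(v) for v in values]
    parts = [f"[{dtype} {n}]", "[" + " ".join(shown) + "]"]
    if lookup is not None and len(lookup) <= max_lookup_items:
        # Only include the lookup table when it is small enough to read.
        pairs = " ".join(f"{k!r} {v!r}" for k, v in lookup.items())
        parts.append("{" + pairs + "}")
    return "#[column " + " ".join(parts) + "]"


print(format_column(range(100000), "int64"))
# #[column [int64 100000] [0 1 2 ... 99997 99998 99999]]
print(format_column([0, 1, 0], "int64", lookup={"no": 0, "yes": 1}))
# #[column [int64 3] [0 1 0] {'no' 0 'yes' 1}]
```

With this shape the output stays on one line regardless of column length, and a lookup table larger than `max_lookup_items` is simply left out rather than truncated.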

cnuernber avatar Jun 30 '22 13:06 cnuernber

> @ezmiller I was looking at your article and PR. Interesting work.
>
> I could make some comments regarding "finally, consider importing ideas from R’s “factors”". Where should I make those?

Maybe somewhere in Zulip would be good? I guess dev conversations re: tablecloth also happen under the #tech.ml.dataset.dev stream. If you wanted a more structured context you could also use an issue in the tablecloth repo. I don't have a strong opinion. I'd be very pleased to hear your thoughts!

ezmiller avatar Jun 30 '22 13:06 ezmiller

R looks like this:

> is.vector(c(1:100))
[1] TRUE
> c(1:100)
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100

So `c(1:100000)` just fills up the repl window in a way that isn't particularly helpful.

Numpy does an abbreviation indicating the beginning and the end:

array([    0,     1,     2, ..., 99997, 99998, 99999])

To get additional information about R vectors you have to access what is essentially metadata, i.e. attributes. Interestingly, the Advanced R book says this about attributes:

Attributes should generally be thought of as ephemeral. For example, most attributes are lost by most operations.

(https://adv-r.hadley.nz/vectors-chap.html#getting-and-setting)

The only stable attributes are names (used to name values uniquely) and dims used to make a vector n-dimensional.

ezmiller avatar Jul 01 '22 10:07 ezmiller

I can only reiterate what I think. The current column metadata is important (but hardly visible). If I were to re-invent tech.ml.dataset (and wanted to keep "factors" and other column metadata as part of it), then I would make each column a map of column values and metadata.

(Or make factors completely invisible, as dataset does with String columns and the StringTable, and partially with timestamps, i.e. a compressed representation. But "factors" are important to the user, so that is likely not a good idea either.)

Or we could re-engineer scicloj.ml and require the column metadata to be passed to all scicloj.ml methods which need it, and remove it from dataset. That would be a lot of work and would break backwards compatibility.

-> Printing meta by default (and maybe cutting it off "smartly") is a compromise for me.

I personally hardly ever print datasets in the repl in any case... I know that the metadata is important and I look at it, but a beginner will likely do the opposite:

  • print the dataset in the repl
  • not look at the column metadata explicitly -> will not see it

behrica avatar Jul 01 '22 15:07 behrica

Maybe we could "misuse" `*print-level*` to decide what to print? So for a dataset we would re-interpret the levels as:

  • level nil -> print all, incl. metadata
  • level 0 -> print all, no metadata

or similar.

behrica avatar Jul 01 '22 15:07 behrica