added printing of `meta` to Column
I just changed one line, but the diff looks huge because of some whitespace mess.
This PR clearly collides with #203, which wants single-line printing for columns...
I'm watching this too. I like the suggestion in #203 as well. The single-line, reader-safe option seems nice. It also connects with the project I'm engaged in right now, adding a new column API to tablecloth (broader context here). A reader-safe print could do a lot, I think, to give the column its own identity -- insofar as that ends up being useful.
@ezmiller I was looking at your article and PR. Interesting work.
I could make some comments regarding "finally, consider importing ideas from R’s “factors”". Where should I make those?
I don't think printing the metadata should be the default. The example in Zulip had a relatively small lookup table, but when you have, say, 100 items in the lookup table, just printing the column metadata is pretty tedious.
Regarding any single-line printing mechanism: I don't think it is appropriate for datasets to have a printing mechanism that will print out a 100000-item column to the repl. I don't know what R does when you have a really large column and 'print' it to the repl, but that type of thing results in repls hanging and users having to cancel and restart a repl session. It works for toy datasets but not when someone has a substantial dataset, which is exactly when you would want to use datasets in the first place. Nor do I think saving a column as 'edn' is a great idea, for exactly the same reason. I want to avoid systems that break when someone actually needs them, e.g. in a situation where their dataset is large enough that normal Clojure pathways are either too slow or just painful.
Putting this together, however: for tmdjs I do have single-line printing in vector style by default, i.e. output of the form #[column [dtype len] [...]]. Maybe this is a better way overall for columns, and perhaps for columns it could also print the lookup table if the column has one and the table isn't too large. Then everyone gets a piece of what they want, but we do it carefully to avoid too much noise in the repl.
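To make the proposal concrete, here is a minimal sketch of such a bounded single-line rendering. It is written in Python purely for illustration and is not the tmdjs or tech.ml.dataset implementation; the function name `format_column`, the `max_items` and `max_lookup` limits, and the way the lookup table is appended are all assumptions.

```python
def format_column(dtype, values, lookup=None, max_items=5, max_lookup=10):
    """Render a column on one line without formatting more than max_items values.

    Hypothetical helper: the limits and the output layout are assumptions,
    loosely following the #[column [dtype len] [...]] shape described above.
    """
    n = len(values)
    # Slice first so a 100000-item column never gets formatted in full.
    shown = " ".join(str(v) for v in values[:max_items])
    if n > max_items:
        shown += " ..."
    text = f"#[column [{dtype} {n}] [{shown}]]"
    # Append the lookup table only when it is small enough not to flood the repl.
    if lookup is not None and len(lookup) <= max_lookup:
        pairs = ", ".join(f"{k} {v}" for k, v in lookup.items())
        text += f" {{{pairs}}}"
    return text


print(format_column("int64", range(100000)))
# prints: #[column [int64 100000] [0 1 2 3 4 ...]]
print(format_column("int16", [0, 1, 0], lookup={"a": 0, "b": 1}))
# prints: #[column [int16 3] [0 1 0]] {a 0, b 1}
```

The cost of printing is bounded by `max_items` and `max_lookup` rather than by the column length, which is the property that keeps a large column from hanging the repl.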
> @ezmiller I was looking at your article and PR. Interesting work.
> I could make some comments regarding "finally, consider importing ideas from R’s “factors”". Where should I make those?
Maybe somewhere in Zulip would be good? I guess dev conversations re: tablecloth also happen under the #tech.ml.dataset.dev stream. If you wanted a more structured context you could also use an issue in the tablecloth repo. I don't have a strong opinion. I'd be very pleased to hear your thoughts!
R looks like this:
> is.vector(c(1:100))
[1] TRUE
> c(1:100)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
So c(1:100000) just fills up the repl window in a way that isn't particularly helpful.
Numpy does an abbreviation indicating the beginning and the end:
array([ 0, 1, 2, ..., 99997, 99998, 99999])
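The head-and-tail abbreviation can be sketched in a few lines of plain Python. This is an illustration of the idea only, not NumPy's actual printing code; the function name `abbreviate` is made up, and the defaults `edge=3` and `threshold=1000` are chosen to mirror NumPy's default `edgeitems` and `threshold` print options.

```python
def abbreviate(values, edge=3, threshold=1000):
    """Show every item of a short sequence, but only the first and last
    `edge` items of a long one, with "..." in between.

    Hypothetical helper for illustration; not NumPy's implementation.
    """
    n = len(values)
    if n <= threshold:
        items = [str(v) for v in values]
    else:
        head = [str(v) for v in values[:edge]]
        tail = [str(v) for v in values[n - edge:]]
        items = head + ["..."] + tail
    return "[" + ", ".join(items) + "]"


print(abbreviate(range(100000)))
# prints: [0, 1, 2, ..., 99997, 99998, 99999]
print(abbreviate(range(5)))
# prints: [0, 1, 2, 3, 4]
```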
To get additional information about R vectors you have to access what is essentially metadata, i.e. attributes. Interestingly, the Advanced R book says this about attributes:
Attributes should generally be thought of as ephemeral. For example, most attributes are lost by most operations.
(https://adv-r.hadley.nz/vectors-chap.html#getting-and-setting)
The only stable attributes are names (used to name values uniquely) and dim (used to make a vector n-dimensional).
I can only reiterate what I think.
The current Column metadata is important (but hardly visible).
If I were to re-invent tech.ml.dataset (and wanted to keep "factors" and other column metadata as part of it),
then I would make each column a map of column values and metadata
(or make factors completely invisible, as dataset does with String columns and the StringTable, and partially with timestamps,
i.e. a compressed representation). But "factors" are important to the user, so that is likely not a good idea either.
Or we could re-engineer scicloj.ml and require the column metadata to be passed into all scicloj.ml methods which need it, and remove it from dataset. That would be a lot of work and would break backwards compatibility.
-> Printing meta by default (and maybe cutting it off "smartly") is a compromise for me.
I personally hardly print datasets in the repl in any case... I know that the metadata is important and I look at it, but a beginner will likely do the opposite:
- print the dataset in the repl
- not look at the column metadata explicitly, and so never see it
Maybe we could "misuse" `*print-level*` to decide what to print? We would re-interpret the levels for a dataset as:
- level nil -> print all, including metadata
- level 0 -> print all, no metadata
or similar.
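As a rough illustration of that re-interpretation, the sketch below uses a `print_level` argument as a stand-in for the print level setting, with `None` playing the role of nil. It is Python for illustration only and not how tech.ml.dataset prints; `render_column` is a hypothetical name, and treating every non-nil level like level 0 is an assumption that goes beyond the two cases listed above.

```python
def render_column(name, values, meta, print_level=None):
    """Render a column, including its metadata only when print_level is None.

    Hypothetical sketch: None stands in for a nil print level, and any
    other level (0 included) suppresses the metadata.
    """
    body = f"{name}: {list(values)}"
    if print_level is None and meta:
        body += f" meta={meta}"
    return body


meta = {"categorical?": True}
print(render_column("y", [0, 1], meta))
# prints: y: [0, 1] meta={'categorical?': True}
print(render_column("y", [0, 1], meta, print_level=0))
# prints: y: [0, 1]
```

With this mapping a beginner who simply prints the dataset sees the metadata by default, while anyone who finds it noisy can switch it off with a single setting.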