core.matrix

pretty-printer does not print column names of dataset

Open paultopia opened this issue 9 years ago • 10 comments

Hi there,

So it appears that passing a dataset (as opposed to a plain matrix) to pm just prints the data, not column names.

This is actually a really easy fix if you don't mind a kinda hackish approach. pm does work if applied directly to the columns. e.g.

(require '[clojure.core.matrix.impl.pprint :as pp])
(pp/pm (:column-names some-dataset))

successfully pretty-prints the column names.

So the kinda hackish solution that I have a burning urge to do would be to define a separate pm in the dataset namespace that just pretty-prints the column names, then pretty-prints the rest of the data using the existing function.

Would that be too ugly?

Apr 04 '16 02:04 paultopia

Will that line up names and columns? If so, a temporary solution is better than nothing. I have some code that will print row names as well as column names, and in addition will "rotate" column names so that the characters are lined up vertically (first character at top, last character several lines down, etc.)--this allows a compact representation of the data. However, I wrote this early on, before I knew Clojure or core.matrix well, and it may have some parts that only work with my data structures. I haven't looked at it in a while. Happy to share if it might be helpful.
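For anyone following along, the "rotated" column names idea might look roughly like the sketch below. This is not the code referred to above; `vertical-label-lines` is a made-up name, and it assumes every data column is exactly one character wide with a single space between columns.

```clojure
;; Hypothetical helper, not part of core.matrix.
;; Assumes one-character-wide columns separated by a single space.
(defn vertical-label-lines
  "Turn labels into lines in which each label reads top to bottom.
   Shorter labels are padded with spaces at the bottom."
  [labels]
  (let [height (apply max (map count labels))]
    (for [i (range height)]
      ;; Take the i-th character of every label, or a space if the label is shorter.
      (apply str (interpose \space (map #(get % i \space) labels))))))

;; Print the rotated header above the data rows.
(run! println (vertical-label-lines ["ab" "c"]))
```

Wider columns would need each character padded out to the column width, which this sketch leaves out.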

Apr 04 '16 03:04 mars0i

Hmm... Lining columns up would be slightly more difficult, but not too hard. And yeah, it does seem kinda necessary for this to be useful. Maybe just by truncating/padding colnames to the size of the elements in the data arrays. (Or your vertical idea... would love to see your code!) I'll try to do some experiments tomorrow, time permitting. Will have to figure out how the existing pm function decides on the length of rows.
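A minimal sketch of the truncate/pad idea, in plain Clojure. `fit-to-width` and `fit-header` are hypothetical names, and the widths are assumed to be already known from the data columns.

```clojure
;; Hypothetical helpers, not part of core.matrix.
(defn fit-to-width
  "Truncate or right-pad `label` so it is exactly `width` characters long."
  [label width]
  (let [s (str label)]
    (if (> (count s) width)
      (subs s 0 width)
      (apply str s (repeat (- width (count s)) \space)))))

(defn fit-header
  "Fit each column name to the width of its data column."
  [column-names widths]
  (mapv fit-to-width column-names widths))
```

Truncation loses information for long names, so it is only one option alongside widening the column to fit the name.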

Apr 04 '16 04:04 paultopia

I think a separate pm (or pd for Print Dataset?) function in the dataset namespace makes sense. Probably with some kind of sane defaults and an optional options map.

To me printing with column names is a very distinct operation from simply printing the values in the array. In fact, for consistency you should probably print column indexes if someone does pd on a non-dataset array.
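As a rough illustration of the "options map plus index fallback" idea, here is a sketch in plain Clojure. `pd-header` and the `:column-names` option key are assumptions made up for this example, and rows are assumed to be plain nested vectors rather than a real dataset.

```clojure
;; Hypothetical helper, not part of core.matrix.
(defn pd-header
  "Header cells for printing: the supplied column names,
   or column indexes when no names are given."
  ([rows] (pd-header rows {}))
  ([rows {:keys [column-names]}]
   (if (seq column-names)
     (mapv str column-names)
     ;; Fall back to 0-based column indexes taken from the first row.
     (mapv str (range (count (first rows)))))))
```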

Very happy for any PRs / help people can give with this area!

Apr 04 '16 04:04 mikera

The current pprint code is pretty hacky.... could really use some refactoring and improvement.

But the basic idea is:

  • Run a "formatter" function on each element of the array to get string values
  • Count the max length of each column to work out column widths
  • Print out the array, recursively over array slices until you get down to individual rows
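The three steps above can be sketched in plain Clojure for the two-dimensional case, with the column names treated as an extra first row. This is not the existing pprint code; `column-widths`, `pad-left`, `format-table`, and `print-with-header` are hypothetical names, and `str` stands in for the formatter.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helpers, not part of core.matrix.
(defn column-widths
  "Max string length per column, given rows of strings."
  [string-rows]
  (apply mapv (fn [& cells] (apply max (map count cells))) string-rows))

(defn pad-left
  "Left-pad `s` with spaces up to `width` characters."
  [s width]
  (str (apply str (repeat (- width (count s)) \space)) s))

(defn format-table
  "Returns one aligned line per row, with the column names as the first line."
  [column-names rows]
  (let [string-rows (mapv #(mapv str %) (cons column-names rows)) ; step 1: format
        widths      (column-widths string-rows)]                  ; step 2: widths
    (mapv (fn [row] (string/join " " (map pad-left row widths)))  ; step 3: lines
          string-rows)))

(defn print-with-header
  "Print the aligned table, one println per row."
  [column-names rows]
  (run! println (format-table column-names rows)))
```

Higher-dimensional arrays would need the recursion over slices mentioned in the last step, which this sketch does not attempt.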

Apr 04 '16 04:04 mikera

(Or, since pm doesn't seem to insist on valid matrix input, just concatenate the columns to the body as an ordinary array then pretty-print that with existing functions?)

Apr 04 '16 04:04 paultopia

(Oops, posted that before I saw mikera's comments... will think about the problem in general!)

Apr 04 '16 04:04 paultopia

pprint-nn is my top-level function. ("nn" stands for "neural network", which is what the matrix represents.) The guts of the process are in definitions below that in the same file. I had thought about reworking it and contributing it to core.matrix, but didn't ever find time. If someone else wants to use code from what I did, that's fine with me. Looking at it now, parts of the code are definitely designed specifically for the data structures in my application. The label information comes from outside the matrix. Happy to answer questions, but no time to work on it myself right now. (Um, I GPL'ed the code. What would I need to do to make it suitable for core.matrix, license-wise, if someone does more than simply copy ideas?)

Apr 04 '16 13:04 mars0i

Cool, I've been mulling it over and will work on a new implementation (probably just copying ideas rather than code from mars0i). Can't promise it will be super fast, because I'm still learning! But a very simple implementation is taking shape in my head, one that doesn't involve dropping down to Java's StringBuilder etc.

Might have to make a few passes over the array though. Is it ok to sacrifice a little speed for simplicity? I assume people don't usually try and prettyprint really big arrays repeatedly.

Apr 04 '16 15:04 paultopia

@paultopia You shouldn't need too many passes - ideally there should be just two or three passes over the whole array:

  1. Once to convert all elements to strings
  2. Once to compute the maximum string lengths for each column
  3. Once to print out each row

Might be possible to do 1 and 2 in the same pass if you are clever :-)
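One possible way to fold steps 1 and 2 together, sketched in plain Clojure over nested vectors: a single `reduce` over the rows that collects the formatted strings and the running column widths at the same time. `stringify-with-widths` is a hypothetical name, and `str` stands in for the real formatter.

```clojure
;; Hypothetical helper, not part of core.matrix.
(defn stringify-with-widths
  "Returns {:rows <rows of strings>, :widths <max length per column>}
   after a single traversal of `rows`."
  [rows]
  (reduce (fn [{:keys [widths] :as acc} row]
            (let [strings (mapv str row)
                  lengths (mapv count strings)]
              (-> acc
                  (update :rows conj strings)
                  ;; The first row seeds the widths; later rows take the per-column max.
                  (assoc :widths (if widths (mapv max widths lengths) lengths)))))
          {:rows [] :widths nil}
          rows))
```

The printing pass then only needs the `:rows` and `:widths` values, so the array itself is traversed once.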

I still think it is appropriate to use StringBuilder. Concatenating long Strings is potentially a very slow operation if you do it one-by-one. It's also helpful to do a single println for each row.
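A sketch of what that could look like: one `java.lang.StringBuilder` per row, and one `println` per finished line. `row->line` is a hypothetical name, and the cells and widths are assumed to have been computed already.

```clojure
;; Hypothetical helper, not part of core.matrix.
(defn row->line
  "Build one right-aligned line for a row of cells using a single StringBuilder."
  [cells widths]
  (let [sb (StringBuilder.)]
    (doseq [[i cell w] (map vector (range) cells widths)]
      (let [s (str cell)]
        (when (pos? i) (.append sb " "))                    ; separator between columns
        (dotimes [_ (- w (count s))] (.append sb " "))      ; left padding
        (.append sb s)))
    (.toString sb)))

;; One println per row.
(doseq [row [["1" "22"] ["333" "4"]]]
  (println (row->line row [3 2])))
```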

Basically - it's OK to sacrifice a bit of performance for simplicity / readability, but we should still keep overhead to a minimum.

@mars0i core.matrix is EPL, so you'd need to contribute your code under an EPL-compatible license for it to be usable.

Apr 05 '16 01:04 mikera

@mikera heh, right-on. I'll see what I can come up with. I can think of a way to combine 1 and 2 by just re-implementing max in mutable C-style and then just doing a pass column-wise for both, but that seems a little aggro :-)

(Hmm.... thinking aloud... Maybe this is a good excuse to learn transducers... wonder if I can convince one to return a map as well as a reduction on a column. Insert slightly insane laugh here...)
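For what it's worth, a transducer version of that thought might look like the sketch below: `transduce` with `(map str)` as the transformation and a reducing function whose accumulator is a map holding both the strings and the running max width. `strings-and-width` and `summarize-column` are hypothetical names.

```clojure
;; Hypothetical helpers, not part of core.matrix.
(defn strings-and-width
  "Reducing function with the three arities `transduce` uses."
  ([] {:strings [] :width 0})          ; init
  ([acc] acc)                          ; completion
  ([acc s] (-> acc                     ; step
               (update :strings conj s)
               (update :width max (count s)))))

(defn summarize-column
  "One pass over a column: formatted strings plus the max string length."
  [column]
  (transduce (map str) strings-and-width column))
```

So the "reduction" is simply a map, and nothing mutable is needed to get both results from one pass.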

Apr 05 '16 14:04 paultopia