
withMeta of dataset changes the class of Dataset

Open behrica opened this issue 8 years ago • 3 comments

I maintain some code which relied on the fact that the core.matrix dataset used to be a "defrecord", so the withMeta call kept the type.

See here for usage in my code:

https://github.com/behrica/rojure/blob/56f6641ab8244236ee267daa63a09864be5aad99/src/rojure/convert.clj#L193

I would like to port the code to core.matrix 0.61, but there withMeta returns a vector instead of a Dataset:

https://github.com/mikera/core.matrix/blob/801c55ed9a67eb2c2650e371607aa334ec908756/src/main/clojure/clojure/core/matrix/impl/dataset.clj#L85

I am not really an expert on the withMeta function, but the Clojure documentation seems to suggest that it should return an object of the same type.

Is there anything core.matrix could do better here?
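To make the expectation concrete, here is a minimal sketch using a hypothetical record named Table (a stand-in, not the core.matrix Dataset itself). It only illustrates that, for a defrecord, with-meta hands back an instance of the same record type with the metadata attached.

```clojure
;; Hypothetical stand-in for a record-based dataset; not core.matrix code.
(defrecord Table [column-names columns])

(def t (->Table [:a :b] [[1 2] [3 4]]))

;; Records implement clojure.lang.IObj, so with-meta is supported
;; and returns another Table rather than a different collection type.
(def t-with-meta (with-meta t {:levels {:a ["x" "y"]}}))

(class t-with-meta)   ; the same record class as (class t)
(meta t-with-meta)    ; the metadata map passed to with-meta
(= t t-with-meta)     ; metadata does not take part in equality
```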

behrica · Oct 04 '17 19:10

Well mp/convert-to-nested-vectors will convert to a vector for sure! So I see how this is happening.
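A rough sketch of that failure mode, under the assumption that withMeta builds a nested vector and attaches the metadata to it. LossyDataset is a hypothetical type written for illustration; it is not the actual core.matrix source.

```clojure
;; Hypothetical type that mimics the reported behaviour; not core.matrix code.
(deftype LossyDataset [columns]
  clojure.lang.IMeta
  (meta [_] nil)
  clojure.lang.IObj
  (withMeta [_ m]
    ;; The metadata ends up on a plain vector, so the caller loses the type.
    (with-meta (vec columns) m)))

(def lossy (LossyDataset. [[1 2] [3 4]]))
(def result (with-meta lossy {:source "R"}))

(vector? result)                  ; the result is a plain vector
(instance? LossyDataset result)   ; and no longer a LossyDataset
```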

Perhaps Dataset should revert to using defrecord instead of deftype, since I believe defrecord supports metadata?

mikera · Oct 07 '17 14:10

I am now wondering whether the change should be made in core.matrix or in rojure. The code in rojure uses the metadata mainly for one use case:

It wants to preserve R's concept of storing strings as factor variables, i.e. storing a numeric code for each possible value of a string column. It therefore stores this information in the metadata of the core.matrix dataset.

I am inclined to think that it would be better to change this and store strings instead.

In my view, rojure should be "compliant" with the core.matrix dataset data types, and the current "dataset" implementation in the latest core.matrix has no concept of "levels" / categorical variables.

I believe that rojure should therefore do an internal conversion from R factors to plain strings. @mikera, any opinion on this?

This would mean that I would no longer need to store the "meta" information in the dataset.
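As a rough illustration of the conversion being discussed, the sketch below expands an R-style factor into plain strings and back. The map layout (:levels, :codes) and the function names are assumptions made for this example; they are not taken from rojure or core.matrix.

```clojure
;; Hypothetical representation of an R-style factor column:
;; :levels holds the distinct strings, :codes holds 1-based indices
;; into :levels (R numbers factor levels from 1).
(def factor-column
  {:levels ["low" "medium" "high"]
   :codes  [1 3 3 2 1]})

(defn factor->strings
  "Expands a factor column into a plain vector of strings."
  [{:keys [levels codes]}]
  (mapv #(nth levels (dec %)) codes))

(defn strings->factor
  "Encodes a sequence of strings as a factor column with sorted levels."
  [strings]
  (let [levels (vec (sort (distinct strings)))
        index  (zipmap levels (iterate inc 1))]
    {:levels levels
     :codes  (mapv index strings)}))

(factor->strings factor-column)
;; => ["low" "high" "high" "medium" "low"]
```

With this approach the dataset only ever holds strings, so nothing needs to travel in metadata.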

Nevertheless, I believe it would be cleaner if dataset either did not implement IObj and IMeta at all, or implemented them correctly (I am not sure how). I think that the current behaviour is "wrong", or at least "unexpected", as it changes the type of the dataset in the "with-meta" call.
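One possible way to implement the two interfaces on a deftype so that the type is preserved is sketched below. SimpleDataset and its fields are hypothetical and much simpler than the real Dataset; the point is only that withMeta rebuilds the same type around the same data.

```clojure
;; Hypothetical minimal type; field names are illustrative and this is
;; not the actual core.matrix Dataset definition.
(deftype SimpleDataset [column-names columns _meta]
  clojure.lang.IMeta
  (meta [_] _meta)
  clojure.lang.IObj
  (withMeta [_ m]
    ;; Rebuild the same type with the new metadata, leaving the data as-is.
    (SimpleDataset. column-names columns m)))

(def ds  (SimpleDataset. [:a :b] [[1 2] [3 4]] nil))
(def ds2 (with-meta ds {:source "R"}))

(instance? SimpleDataset ds2)   ; the type survives with-meta
(meta ds2)                      ; the metadata that was attached
```

The cost of this design is one extra field carried by every instance, which is essentially what defrecord does internally.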

behrica · Oct 09 '17 09:10

I think that if you want to use an internal strategy that stores strings as numbers for example then it should be a hidden implementation detail.

I would use a deftype or defrecord that stores any index data required to perform such lookups, and extend the core.matrix protocols to this type as required (which would include doing any conversion etc.). I.e. you are effectively creating your own custom version of the default core.matrix Dataset.

It would be up to you whether you wanted to store the index data in metadata, though I would suggest using an actual field, as it isn't really "metadata"; it is part of your data structure.
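A minimal sketch of that suggestion follows. The protocol PColumnValues is a hypothetical placeholder for whichever core.matrix protocols a real implementation would extend, and FactorColumn is an invented name; neither exists in core.matrix.

```clojure
;; Hypothetical protocol standing in for the core.matrix protocols
;; a real implementation would extend; it is not part of core.matrix.
(defprotocol PColumnValues
  (column-values [col] "Returns the user-visible values of the column."))

;; The level table is an ordinary field of the record, not metadata.
(defrecord FactorColumn [levels codes]
  PColumnValues
  (column-values [_]
    ;; Codes are assumed to be 1-based indices into levels.
    (mapv #(nth levels (dec %)) codes)))

(def col (->FactorColumn ["no" "yes"] [1 2 2 1]))

(column-values col)
;; => ["no" "yes" "yes" "no"]
```

Callers only ever see the strings, so the integer encoding stays a hidden implementation detail.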

Regarding categorical variables in general, note that core.matrix is generally pretty agnostic about what kind of values you use. I would say it is up to higher-level libraries to interpret variables as categorical, numerical, string etc. core.matrix itself can't really assume anything, so the protocols leave that flexibility to the user.

mikera · Oct 10 '17 02:10