Adding a derived column

Open rrottier opened this issue 9 years ago • 10 comments

Is it possible to add a derived column to the frame? I see that there is a frameCons that can add a column to the frame, but I am not sure how to go about defining a type signature for the new column, as all the other types were generated by TH and I have no idea how that works.

For example I have a table with number of girls and boys enrolled each year at primary school and I want to add a derived column for the total number of kids enrolled each year.

Adding a derived column is a fairly common operation in R so being able to duplicate it easily would be a plus.

Thanks Riaan

rrottier avatar Jul 07 '15 10:07 rrottier

I wrote up how to do this, but then added some things to hopefully make it easier and pushed a new version of Frames to hackage. In general, I use the :i command quite a lot in cabal repl to inspect the row type I get from some data. You can also use :browse to see all the generated things.

So here's how you might approach your example problem given the new Frames-0.1.1.0:

{-# LANGUAGE DataKinds, FlexibleContexts, TemplateHaskell, TypeOperators #-}
import Frames
import Lens.Family

tableTypes "Row" "data/SchoolEnrollment.csv"

loadRows :: IO (Frame Row)
loadRows = inCoreAoS $ readTable "data/SchoolEnrollment.csv"

totalEnrollment :: Frame Row
                -> Frame (Record ("Total" :-> Int ': RecordColumns Row))
totalEnrollment = fmap (\r -> frameConsA (r^.girls + r^.boys) r)

type Row' = Record ("Total" :-> Int ': RecordColumns Row)

I put the Row' definition there at the end as you will want to name that new row type if you refer to it more than once. Given that synonym, we'd have totalEnrollment :: Frame Row -> Frame Row'.
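
For readers without Frames installed, the shape of this operation can be sketched with base alone. This is only an illustration under stated assumptions: Col, Row, (:&), addTotal, firstCol and rows are hypothetical names invented here, not Frames or Vinyl API, and a plain list stands in for Frame. The point is that the derived column is consed onto the front of the type-level column list, which is why the result type starts with the new column.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeOperators #-}

import Data.Kind (Type)
import GHC.TypeLits (Symbol)

-- A value tagged with a type-level column name.
-- Hypothetical stand-in for Frames' (:->), not the library type.
newtype Col (s :: Symbol) a = Col a

-- A row is a heterogeneous cons list of named columns.
data Row (cs :: [Type]) where
  Nil  :: Row '[]
  (:&) :: Col s a -> Row cs -> Row (Col s a ': cs)
infixr 5 :&

type Enrollment = '[Col "girls" Int, Col "boys" Int]
type WithTotal  = Col "total" Int ': Enrollment

-- Read the two source columns, then cons the derived column onto the front.
addTotal :: Row Enrollment -> Row WithTotal
addTotal r@(Col g :& Col b :& Nil) = Col (g + b) :& r

-- Read the first column of any non-empty row.
firstCol :: Row (Col s a ': cs) -> a
firstCol (Col x :& _) = x

-- A list of rows stands in for a Frame in this sketch.
rows :: [Row Enrollment]
rows = [Col 3 :& Col 4 :& Nil, Col 5 :& Col 6 :& Nil]

main :: IO ()
main = print (map (firstCol . addTotal) rows)  -- prints [7,11]
```

Naming the extended row (WithTotal here, Row' above) keeps later signatures short.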

acowley avatar Jul 07 '15 23:07 acowley

That looks awesome, I tried building it though and get the following error:

src/Frames/RecF.hs:37:15:
    Not in scope: type constructor or class ‘Applicative’
    Perhaps you meant ‘RecApplicative’ (imported from Data.Vinyl)

src/Frames/RecF.hs:38:34:
    Not in scope: ‘pure’
    Perhaps you meant ‘rpure’ (imported from Data.Vinyl)
cabal: Error: some packages failed to install:
Frames-0.1.1.0 failed during the building phase. The exception was:
ExitFailure 1

I think you meant to add (Applicative, pure) to the imports for RecF. Works if I edit source code to add it.

rrottier avatar Jul 07 '15 23:07 rrottier

Works great, and it is really nice that the other functions are not broken by the additional column. Would it be possible to also add an Applicative version of frameSnoc so that the derived columns can be appended to the end?

Thanks for the quick response.

rrottier avatar Jul 08 '15 00:07 rrottier

Sorry about my screwup with the missing Applicative import. I use GHC 7.10 and didn't wait for Travis to build on 7.8. Frames-0.1.1.1 is up on hackage fixing that.

Sure, we can add frameSnocA, but I had some trouble with the type when looking at your example here. Writing frameSnoc requires that the caller apply the Col newtype constructor (this is what tags each column type with a name).

The underlying issue is that the type-level ++ leads to ambiguity. It's worth playing with a bit more to see if this can be improved, but I didn't find a solution in my quick look. Maybe the injective type family features coming to GHC will be able to help there, or maybe carefully annotating the operation can help.
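
One plausible reading of that ambiguity, sketched with base only: the names Col, Row, (:&), snoc, third and sample are hypothetical and are not the Frames definitions. When the caller passes an already tagged column, snoc type checks, because the column name appears in an ordinary argument type. A bare-value variant would mention the name only inside an application of the ++ type family, and GHC's ambiguity check rejects a signature whose type variable occurs only there.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeFamilies, TypeOperators #-}

import Data.Kind (Type)
import GHC.TypeLits (Symbol)

-- Hypothetical stand-ins, not Frames or Vinyl types.
newtype Col (s :: Symbol) a = Col a

data Row (cs :: [Type]) where
  Nil  :: Row '[]
  (:&) :: Col s a -> Row cs -> Row (Col s a ': cs)
infixr 5 :&

-- Type-level list append.
type family (as :: [Type]) ++ (bs :: [Type]) :: [Type] where
  '[]       ++ bs = bs
  (a ': as) ++ bs = a ': (as ++ bs)

-- Appending works when the caller supplies a tagged column:
-- s and a are fixed by the second argument.
snoc :: Row cs -> Col s a -> Row (cs ++ '[Col s a])
snoc Nil       c = c :& Nil
snoc (x :& xs) c = x :& snoc xs c

-- By contrast, a signature such as
--   a -> Row cs -> Row (cs ++ '[Col s a])
-- mentions s only under the type family (++), so it is ambiguous.

sample :: Row '[Col "girls" Int, Col "boys" Int]
sample = Col 3 :& Col 4 :& Nil

-- Read the last column of a three-column row.
third :: Row '[c1, c2, Col s a] -> a
third (_ :& _ :& Col x :& Nil) = x

main :: IO ()
main = print (third (snoc sample (Col 7 :: Col "total" Int)))  -- prints 7
```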

acowley avatar Jul 08 '15 00:07 acowley

No problem, it forced me to look at what you changed and hopefully increased my Haskell knowledge a little bit.

W.r.t. frameSnocA, I won't pretend to understand the problem. My Haskell type-fu stops at understanding Monads. But reading between the lines, it sounds possible but not straightforward.

I can work around it for now by using fmap (select [pr|Boys,Girls,Total|]) totalEnrollment and/or constructing the type signature by hand, but this means that I cannot really be agnostic to the full data structure. Still, in most cases where I am creating a derived column I have a fairly clear idea of the fields I am considering, so it is not too bad.

I did get the feeling, though, that it was getting a bit slower doing it this way, but maybe this is of the same order as snoc? I am on toy data at the moment, but the real data structures will be much larger, so if this workaround comes with a heavy penalty compared to snoc then it would be better to investigate other options, including just leaving all the derived columns at the start :-)

Thanks anyway for your help and all the work you put into this library. I am enjoying the challenge of trying to understand how it works.

rrottier avatar Jul 08 '15 00:07 rrottier

Ah, so that's another direction that I de-emphasized here in order to make the row type change clear. You can also write:

te :: (Boys ∈ rs, Girls ∈ rs) => FrameRec rs -> FrameRec ("Total" :-> Int ': rs)
te = fmap (\r -> frameConsA (r^.girls + r^.boys) r)

And now you are agnostic to the rest of the row.
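
The same row-polymorphic idea can be sketched with base alone. Everything here is an illustrative assumption, not Frames or Vinyl API: the class Has plays the role that ∈ plays above, and Col, Row, (:&), get, addTotal, narrow and wide are hypothetical names. The function only demands that the two source columns exist somewhere in the row, so it runs unchanged on rows with different shapes.

```haskell
{-# LANGUAGE AllowAmbiguousTypes, DataKinds, FlexibleContexts, FlexibleInstances,
             GADTs, KindSignatures, MultiParamTypeClasses, ScopedTypeVariables,
             TypeApplications, TypeOperators #-}

import Data.Kind (Type)
import GHC.TypeLits (Symbol)

-- Hypothetical stand-ins, not Frames or Vinyl types.
newtype Col (s :: Symbol) a = Col a

data Row (cs :: [Type]) where
  Nil  :: Row '[]
  (:&) :: Col s a -> Row cs -> Row (Col s a ': cs)
infixr 5 :&

-- "Row cs has a column named s holding an a".
class Has (s :: Symbol) a (cs :: [Type]) where
  get :: Row cs -> a

-- The column is at the head of the row.
instance {-# OVERLAPPING #-} Has s a (Col s a ': cs) where
  get (Col x :& _) = x

-- Otherwise keep looking in the tail.
instance Has s a cs => Has s a (c ': cs) where
  get (_ :& rest) = get @s @a rest

-- Needs only the two source columns; the rest of the row is left abstract.
addTotal :: (Has "girls" Int cs, Has "boys" Int cs)
         => Row cs -> Row (Col "total" Int ': cs)
addTotal r = Col (get @"girls" @Int r + get @"boys" @Int r) :& r

narrow :: Row '[Col "girls" Int, Col "boys" Int]
narrow = Col 3 :& Col 4 :& Nil

wide :: Row '[Col "year" Int, Col "boys" Int, Col "girls" Int]
wide = Col 2015 :& Col 6 :& Col 5 :& Nil

main :: IO ()
main = do
  print (get @"total" @Int (addTotal narrow))  -- prints 7
  print (get @"total" @Int (addTotal wide))    -- prints 11
```

The overlapping-instance lookup is just one simple way to encode membership in a sketch; it is not a claim about how the library resolves ∈.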

The cons'ing onto one end of the list is annoying, but I'm sure if everything were built around a snoc-list structure we'd have symmetric annoyances. This aspect of the design is inherited from Vinyl, but the cons list is primary in FP languages, so it's the data structure of least surprise.

It's worth thinking about if this can be improved in Frames. Maybe it's a user experience thing, and just letting the programmer write the types as a snoc list would be better. I'm not sure if that can be done, as it may well run into more problems of not being able to teach the type checker facts about lists. I'll try to find some time to play around with that to see what works out; we could probably offer something like FrameR (Row ':| "Total" :-> Int), but if we wanted the row to print out in reverse order, it'd have to be a newtype rather than just a type family.

acowley avatar Jul 08 '15 02:07 acowley

I would want the row to index and print out like Total is the last column. As this is the way R presents dependent columns it would be the most intuitive I think.

Thanks for clarifying how to make the function more generic. I was just thinking agnostic to the other columns in the table that are generated from the csv file. Will te also work across other structures that have the columns ("boys" :-> Int) and ("girls" :-> Int)?

By the way, how do you get the ∈ symbol in Haskell? I have to copy and paste it every time because I have no idea how to type it. I use emacs as my editor.

rrottier avatar Jul 08 '15 08:07 rrottier

I ran into another problem with the derived columns: when I try to get the data from the derived column using view, I get a not in scope error.

How do I access this column?

rrottier avatar Jul 08 '15 11:07 rrottier

  • Yes, the te example I gave works with any structure that has Boys and Girls columns.
  • Regarding ∈, you can also write out an application of RElem. I enter the symbol in emacs by running toggle-input-method (C-\), and enabling the TeX input method. This lets you write TeX commands such as \in to produce symbols.
  • Since we're creating a new column, no accessor code has been generated for it. I added the following to our running program as an example:
type Total = "Total" :-> Int

total :: (Total ∈ rs) => Functor f => LensLike' f (Record rs) Int
total = rlens [pr|Total|]

getTotal :: (Total ∈ rs) => Record rs -> Int
getTotal = rget total

acowley avatar Jul 08 '15 22:07 acowley

Awesome, all working now, still a bit of boilerplate to shuffle the columns around but not too bad. Great tip on the input method as well.

Thanks

rrottier avatar Jul 09 '15 02:07 rrottier