Deedle icon indicating copy to clipboard operation
Deedle copied to clipboard

Add merge functionality between data frames

Open jeremyhoughton opened this issue 11 years ago • 7 comments

Add a Pandas style merge function to make joining data frames together easier especially when there are non unique columns/indexes. Please see this stackoverflow question here for more detail.

jeremyhoughton avatar Feb 16 '14 22:02 jeremyhoughton

  • One concrete way to do this would be to add a parameterized join operation that uses the specified columns (rather than using the index).
  • Another option would be to allow repeated keys in the index and then this could be done just by using the standard join function.

tpetricek avatar Feb 17 '14 13:02 tpetricek

I think the first idea is easier to implement, and would be useful in general.

adamklein avatar Feb 24 '14 23:02 adamklein

This might also be related, although here is not matching on columns, but on parts of the index: http://stackoverflow.com/questions/22048401/how-to-join-frames-using-fs-deedle-where-one-of-the-frame-has-a-composite-key/22049066

tpetricek avatar Feb 26 '14 17:02 tpetricek

Are there any elegant ways to do this with Deedle now? I created a function for this. But it copies only one column only. Users have to specify a type parameter explicitly.

let mergeColumnFromForeignFrame<'X, 'C, 'R1, 'R2 
              when 'C: equality and 'R1: equality and 'R2: equality>
   (column_to_be_copied: 'C) (missingValue: 'X) (indexed_frame: Frame<'R2,'C>)
   (key_column: 'C) (primary_frame: Frame<'R1,'C>)  =

  let srs_references: Series<'R1, 'R2> = 
      primary_frame |> getCol key_column

  let (srs_cols_to_be_merged: Series<_,'X>) = 
    getCol column_to_be_copied indexed_frame

  let new_srs = 
    srs_references 
    |> Series.mapValues 
        (fun x -> if srs_cols_to_be_merged.ContainsKey(x) then
                    srs_cols_to_be_merged.[x]
                  else missingValue)

  addCol column_to_be_copied new_srs primary_frame

murata2makoto avatar Feb 10 '22 04:02 murata2makoto

I have created a function which selects the keys first and performs a lookup:

let partJoin (keySelect:'R->'V) (df1:Frame<'R,_>) (df2:Frame<_,_>)=

    // 1. Make the key available
    let keySeries =
        df1
        |> Frame.mapRows (fun k _ -> keySelect k)

    // 2. Create a combind column via lookup
    let combined =
        keySeries
        |> Series.mapAll (fun _ vOpt -> 
            vOpt
            |> Option.bind (fun v ->
                df2.TryGetRow v |> OptionalValue.asOption)
                )

    // 3. Split into single columns and append
    df2.ColumnKeys
    |> Seq.fold (fun acc key ->
        let col =
            combined
            |> Series.mapAll (fun _ sOpt ->
                sOpt
                |> Option.bind (fun s ->
                    s.TryGet key |> OptionalValue.asOption
                )
            )
        acc
        |> Frame.addCol key col) df1

Example:

let toFrame (str:string) =
    let bytes = System.Text.Encoding.UTF8.GetBytes str
    let stream =  new MemoryStream( bytes )

    Frame.ReadCsv(
        stream = stream,
        separators = ";",
        hasHeaders = true
    )

let df1 = 
    toFrame "A;B;C\na;1;text1\na;2;text2\nc;2;text3\nc;3;text4"
    |> Frame.indexRowsUsing(fun row ->
        row.GetAs<string>"A",row.GetAs<int>"B"
        )
    |> Frame.sliceCols ["C"]

df1.Print(true)

let df2 = 
    toFrame "B;D;E\n1;2020;bla\n2;;blb\n3;2014;blc"
    |> Frame.indexRowsInt "B"

df2.Print(true)

let test = partJoin snd df1 df2
test.Print(true)
       C        
       (string) 
a 1 -> text1    
  2 -> text2    
c 2 -> text3    
  3 -> text4    


     D         E        
     (int)     (string) 
1 -> 2020      bla      
2 -> <missing> blb      
3 -> 2014      blc      

       C        D         E     
       (string) (obj)     (obj) 
a 1 -> text1    2020      bla   
  2 -> text2    <missing> blb   
c 2 -> text3    <missing> blb   
  3 -> text4    2014      blc   

Unfortunately it loses the type information. Maybe somebody knows how to prevent this.

jim108dev avatar Jun 25 '22 06:06 jim108dev

@jim108dev Another program of mine does something similar. It also loses the type information.

murata2makoto avatar Jun 25 '22 09:06 murata2makoto

@murata2makoto Yes, you are right. I should have paid more attention to your solution. Maybe you could turn on syntax highlighting, makes it easier to read. Also at your program you have to specify the missing value explicitly and you have to perform a lookup for every added column. If we just had somebody who knows about type information. (@tpetricek)

jim108dev avatar Jun 25 '22 10:06 jim108dev