Deedle
Deedle copied to clipboard
Add merge functionality between data frames
Add a Pandas style merge function to make joining data frames together easier especially when there are non unique columns/indexes. Please see this stackoverflow question here for more detail.
- One concrete way to do this would be to add a parameterized join operation that uses the specified columns (rather than using the index).
- Another option would be to allow repeated keys in the index and then this could be done just by using the standard join function.
I think the first idea is easier to implement, and would be useful in general.
This might also be related, although here is not matching on columns, but on parts of the index: http://stackoverflow.com/questions/22048401/how-to-join-frames-using-fs-deedle-where-one-of-the-frame-has-a-composite-key/22049066
Are there any elegant ways to do this with Deedle now? I created a function for this. But it copies only one column only. Users have to specify a type parameter explicitly.
let mergeColumnFromForeignFrame<'X, 'C, 'R1, 'R2
when 'C: equality and 'R1: equality and 'R2: equality>
(column_to_be_copied: 'C) (missingValue: 'X) (indexed_frame: Frame<'R2,'C>)
(key_column: 'C) (primary_frame: Frame<'R1,'C>) =
let srs_references: Series<'R1, 'R2> =
primary_frame |> getCol key_column
let (srs_cols_to_be_merged: Series<_,'X>) =
getCol column_to_be_copied indexed_frame
let new_srs =
srs_references
|> Series.mapValues
(fun x -> if srs_cols_to_be_merged.ContainsKey(x) then
srs_cols_to_be_merged.[x]
else missingValue)
addCol column_to_be_copied new_srs primary_frame
I have created a function which selects the keys first and performs a lookup:
let partJoin (keySelect:'R->'V) (df1:Frame<'R,_>) (df2:Frame<_,_>)=
// 1. Make the key available
let keySeries =
df1
|> Frame.mapRows (fun k _ -> keySelect k)
// 2. Create a combind column via lookup
let combined =
keySeries
|> Series.mapAll (fun _ vOpt ->
vOpt
|> Option.bind (fun v ->
df2.TryGetRow v |> OptionalValue.asOption)
)
// 3. Split into single columns and append
df2.ColumnKeys
|> Seq.fold (fun acc key ->
let col =
combined
|> Series.mapAll (fun _ sOpt ->
sOpt
|> Option.bind (fun s ->
s.TryGet key |> OptionalValue.asOption
)
)
acc
|> Frame.addCol key col) df1
Example:
let toFrame (str:string) =
let bytes = System.Text.Encoding.UTF8.GetBytes str
let stream = new MemoryStream( bytes )
Frame.ReadCsv(
stream = stream,
separators = ";",
hasHeaders = true
)
let df1 =
toFrame "A;B;C\na;1;text1\na;2;text2\nc;2;text3\nc;3;text4"
|> Frame.indexRowsUsing(fun row ->
row.GetAs<string>"A",row.GetAs<int>"B"
)
|> Frame.sliceCols ["C"]
df1.Print(true)
let df2 =
toFrame "B;D;E\n1;2020;bla\n2;;blb\n3;2014;blc"
|> Frame.indexRowsInt "B"
df2.Print(true)
let test = partJoin snd df1 df2
test.Print(true)
C
(string)
a 1 -> text1
2 -> text2
c 2 -> text3
3 -> text4
D E
(int) (string)
1 -> 2020 bla
2 -> <missing> blb
3 -> 2014 blc
C D E
(string) (obj) (obj)
a 1 -> text1 2020 bla
2 -> text2 <missing> blb
c 2 -> text3 <missing> blb
3 -> text4 2014 blc
Unfortunately it loses the type information. Maybe somebody knows how to prevent this.
@jim108dev Another program of mine does something similar. It also loses the type information.
@murata2makoto Yes, you are right. I should have paid more attention to your solution. Maybe you could turn on syntax highlighting, makes it easier to read. Also at your program you have to specify the missing value explicitly and you have to perform a lookup for every added column. If we just had somebody who knows about type information. (@tpetricek)