smile
[Feature proposal] Dataframe merge by ID
I've got a few different dataframes that I'd like to merge when fitting a regression. Right now I do so by converting them to matrices of doubles, aligning the rows by id, and then rebuilding a dataframe. Spark and pandas have utility methods that let you merge dataframes with a by option to specify which column is used to match the data.
Describe the solution you'd like
Extend the merge method with a simple by option to specify the key to merge on, add a mergeWith method, or add a MergeOptions parameter that contains information such as by (the key to join on) and mergeType (inner vs. outer join, left vs. right join).
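To make the requested semantics concrete, here is a minimal sketch of a keyed join in plain Java. It is not Smile's API: the names KeyedJoin, JoinType, and join are hypothetical, and rows are modeled as lists of maps rather than as a DataFrame. It only illustrates what a by key and an inner vs. left mergeType would mean.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Smile.
class KeyedJoin {
    public enum JoinType { INNER, LEFT }

    // Joins two row lists on the column named `by`.
    // Each row is a map from column name to value.
    public static List<Map<String, Object>> join(
            List<Map<String, Object>> left,
            List<Map<String, Object>> right,
            String by,
            JoinType type) {
        // Index the right-hand rows by key so each lookup is O(1).
        Map<Object, List<Map<String, Object>>> index = new HashMap<>();
        for (Map<String, Object> row : right) {
            index.computeIfAbsent(row.get(by), k -> new ArrayList<>()).add(row);
        }

        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> l : left) {
            List<Map<String, Object>> matches = index.get(l.get(by));
            if (matches == null) {
                // LEFT keeps unmatched left rows (right columns are simply absent);
                // INNER drops them.
                if (type == JoinType.LEFT) {
                    out.add(new LinkedHashMap<>(l));
                }
                continue;
            }
            for (Map<String, Object> r : matches) {
                Map<String, Object> merged = new LinkedHashMap<>(l);
                // On a column-name clash (other than the key), the right value wins.
                merged.putAll(r);
                out.add(merged);
            }
        }
        return out;
    }
}
```

Unlike the current workaround of aligning rows by hand, the row order of the two inputs does not matter here; only the key values do. A right or full outer join would follow the same pattern with the roles of the inputs swapped or combined.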
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html
Are you interested in a join or a simple merge? With the existing API you can merge two or more data frames, provided that their rows are in the same order.
More of a join. I've got a lot of dataframes, including some I receive from other departments, and it's sometimes painful to get these into a single, cohesive dataframe that contains the feature set I need.
As an edit: this functionality is exactly what I'd like: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
We added smile.data.SQL for database management, which supports joins. The query/join result is returned as a DataFrame. See SQLTest for examples.