jodie icon indicating copy to clipboard operation
jodie copied to clipboard

Remove Duplicates Order by a column

Open jsubm5 opened this issue 2 years ago • 17 comments

Need to modify Remove Duplicates function to remove duplicates from delta table/parquet file and keep latest record (sort by timestamp column)

jsubm5 avatar Jan 22 '23 23:01 jsubm5

I would like to contribute to this. Let me know if I can take up this issue

jsubm5 avatar Jan 22 '23 23:01 jsubm5

This seems like a useful addition. @brayanjuls - are you cool with this?

I think the sorting should be optional though and we should make sure it only happens if the user wants it (cause we shouldn't have them incur the additional sort cost).

MrPowers avatar Jan 23 '23 08:01 MrPowers

For this function, there is a lot of uses cases, but i have a question the role of this function is to give possibility to remove rows from result request or directly from table itself ?

ilyasse05 avatar Jan 23 '23 08:01 ilyasse05

@ilyasse05 - this is to remove from directly the table itself ;)

MrPowers avatar Jan 23 '23 12:01 MrPowers

@Jegan7 - I just chatted with @brayanjuls on this one. He's going to think about this and ping you with a suggested implementation. After he pings you, it should be easy for you to write the code. We're looking forward to collaborating with you!!

MrPowers avatar Jan 23 '23 12:01 MrPowers

I think it's not secure to do it like that, the more secure will do it on resulat with select and after do a merge.

ilyasse05 avatar Jan 23 '23 13:01 ilyasse05

@ilyasse05 - All the functions that remove duplicates in this library were created with the idea of performing the action in the table and not to work as an in-memory transformation. If you have use cases where you would like to have only the transformation, please open an issue with what you need and we would be happy to brainstorm its implementation.

brayanjuls avatar Jan 23 '23 23:01 brayanjuls

@Jegan7 @MrPowers - This is a good use case. I think to implement it we just need to add .desc to the current orderBy that is used in the row_number window function. Let me just think a little bit more about how we can make it a parameter and I will ping you back with my final suggestions. link to the function that we need to modify: https://github.com/MrPowers/jodie/blob/9614cce474e0253a1c8876075f223dcee99735f0/src/main/scala/mrpowers/jodie/DeltaHelpers.scala#L64

brayanjuls avatar Jan 24 '23 00:01 brayanjuls

@Jegan7 - I finally had the time to think about it, I think you can achieve this uses case in the following manner, assuming input table is the following:

+----+---------+------+-----------+
|  id|firstname| lastname| timestamp|
+----+---------+------+-----------+
|   1|   Benito|  Jackson| t1|     # duplicate
|   2|    Maria|   Willis| t1|
|   3|     Jose| Travolta| t1|     # duplicate
|   4|   Benito|  Jackson| t2|     # duplicate
|   5|     Jose| Travolta| t2|     # duplicate
|   6|    Maria|     Pitt| t1|
|   9|   Benito|  Jackson| t3|     # duplicate
+----+---------+------+------------+

you can call the function in this way

removeDuplicateRecords(deltaTable = table, primaryKey = "timestamp", duplicateColumns = Seq("firstname","lastname"), sorDirection = SortDirection.DESCENDING)

Note that I added a new parameter called sort direction, that's an enum from the spark API (org.apache.spark.sql.catalyst.expressions). Try to implement it using that parameter and send a PR.

brayanjuls avatar Jan 28 '23 23:01 brayanjuls

Thanks Bryan I will look into it

Thanks Jegan


From: Brayan Jules @.> Sent: Saturday, January 28, 2023 6:01:26 PM To: MrPowers/jodie @.> Cc: Jegan7 @.>; Mention @.> Subject: Re: [MrPowers/jodie] Remove Duplicates Order by a column (Issue #22)

@Jegan7https://github.com/Jegan7 - I finally had the time to think about it, I think you can achieve this uses case in the following manner, assuming input table is the following:

+----+---------+------+-----------+ | id|firstname| lastname| timestamp| +----+---------+------+-----------+ | 1| Benito| Jackson| t1| # duplicate | 2| Maria| Willis| t1| | 3| Jose| Travolta| t1| # duplicate | 4| Benito| Jackson| t2| # duplicate | 5| Jose| Travolta| t2| # duplicate | 6| Maria| Pitt| t1| | 9| Benito| Jackson| t3| # duplicate +----+---------+------+------------+

you can call the function in this way

removeDuplicateRecords(deltaTable = table, primaryKey = "timestamp", duplicateColumns = Seq("firstname","lastname"), sorDirection = SortDirection.DESCENDING)

Note that I added a new parameter called sort direction, that's an enum from the spark API (org.apache.spark.sql.catalyst.expressions). Try to implement it using that parameter and send a PR.

— Reply to this email directly, view it on GitHubhttps://github.com/MrPowers/jodie/issues/22#issuecomment-1407506663, or unsubscribehttps://github.com/notifications/unsubscribe-auth/APDZ3ROD77KAGZNQKXQZO7DWUWQMNANCNFSM6AAAAAAUDH3UFE. You are receiving this because you were mentioned.Message ID: @.***>

jsubm5 avatar Jan 30 '23 03:01 jsubm5

@Jegan7 Just checking, do you need help to advance on this issue?

brayanjuls avatar Mar 02 '23 21:03 brayanjuls

Hi @brayanjuls , in your above comment why do you set primarykey to timestamp? Isn't it the order by column?

jsubm5 avatar Mar 16 '23 23:03 jsubm5

@jsubm5 - my proposal is to use the primarykey as an ordering column as well because currently, we use it to sort in ascending order and to identify duplicates. The primarykey should be a column that differentiates(is unique on each duplication set) the duplicated records that's why I use the timestamp there.

The change should be to add something like this new Column(SortOrder(col(primaryKey).expr, sort)) in the orderBy that currently exists in that function. Let me know your thoughts.

brayanjuls avatar Mar 23 '23 13:03 brayanjuls

@brayanjuls ,the sort column can be any column that user pass it in the function (it preferably should be a timestamp column but if there is no timestamp then they can choose any other column from the dataset) and we should let them pass the sort order as well

jsubm5 avatar Mar 24 '23 22:03 jsubm5

@jsubm5 - Indeed the primaryKey could be any other column. The timestamp colum that I mention above is just an example of how the function could be used after the feature is implemented.

brayanjuls avatar Mar 24 '23 23:03 brayanjuls

@brayanjuls should we let the user sort by multiple columns and sort order for each columns?

jsubm5 avatar Apr 01 '23 16:04 jsubm5

@jsubm5 - to achieve that we should support composed primaryKey in this function. I think it would be good to firstly implement this using a single primary key and afterward open a new issue to implement the composed primary key feature which would also allow to order by multiple columns.

brayanjuls avatar Apr 25 '23 12:04 brayanjuls