jodie
Delta Lake and filesystem helper methods
We faced a production issue with some of our pipelines while using Delta 1.0.1 and, as a follow-up, raised an [issue](https://github.com/delta-io/delta/issues/2455) on Delta core. While it seems that this...
See here for the API: https://github.com/MrPowers/mack/#append-data-with-constraints
As mentioned in this [post](https://lakefs.io/blog/how-to-implement-write-audit-publish/#delta-lake), OSS Delta does not support the WAP (write-audit-publish) pattern. I think this is something we can implement here in Jodie. If you don't know...
I think it would be interesting to add an optional parameter ["PathRejects"] to write out the deduplicated rows, in case we need to analyze data quality when the source contains duplicate rows....
Modify the Remove Duplicates function to remove duplicates from a Delta table or Parquet file and keep the latest record (sorted by a timestamp column)
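A minimal sketch of the keep-latest behaviour requested above, assuming plain Spark DataFrames. The names `KeepLatestSketch`, `keepLatest`, `keyCols`, and `tsCol` are hypothetical and are not part of jodie's API; it only illustrates the window-function approach.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object KeepLatestSketch {
  // Hypothetical helper (not jodie's API): keeps one row per key,
  // choosing the row with the greatest value in tsCol.
  def keepLatest(df: DataFrame, keyCols: Seq[String], tsCol: String): DataFrame = {
    require(keyCols.nonEmpty, "at least one key column is required")
    // Rank rows within each key group, newest first.
    val newestFirst = Window
      .partitionBy(keyCols.map(col): _*)
      .orderBy(col(tsCol).desc)
    df.withColumn("__rank", row_number().over(newestFirst))
      .filter(col("__rank") === 1) // rank 1 is the latest record
      .drop("__rank")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("keep-latest-sketch")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: id 1 appears twice with different timestamps.
    val df = Seq((1, "a", 10L), (1, "b", 20L), (2, "c", 5L)).toDF("id", "value", "ts")
    keepLatest(df, Seq("id"), "ts").show()
    spark.stop()
  }
}
```

If two rows share the same key and the same timestamp, `row_number` picks one of them arbitrarily, so a tie-breaking column would be needed for deterministic results.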
Add function to delete from a Delta table where rows exist in a DataFrame + update README
Hello, an interesting function would be one that deletes rows from a table when the values of some columns exist in a DataFrame. I searched for a function like that but haven't found it...
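One way the request above could be sketched is with a Delta `merge` whose only matched action is a delete. This assumes the Delta Lake Scala API (`io.delta.tables.DeltaTable`); the names `DeleteWhereExistsSketch`, `deleteMatching`, `keys`, and `keyCols` are hypothetical and do not describe an existing jodie function.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object DeleteWhereExistsSketch {
  // Hypothetical helper (not jodie's API): deletes every row of `table`
  // whose values in keyCols also appear in `keys`.
  def deleteMatching(table: DeltaTable, keys: DataFrame, keyCols: Seq[String]): Unit = {
    require(keyCols.nonEmpty, "at least one key column is required")
    // Build "target.`a` = source.`a` AND target.`b` = source.`b`".
    // Plain equality means rows with NULL keys never match.
    val condition = keyCols
      .map(c => s"target.`$c` = source.`$c`")
      .mkString(" AND ")
    // Only the key columns are needed on the source side.
    val source = keys.select(keyCols.map(col): _*).distinct()
    table.as("target")
      .merge(source.as("source"), condition)
      .whenMatched()
      .delete()
      .execute()
  }
}
```

A call might look like `DeleteWhereExistsSketch.deleteMatching(DeltaTable.forPath(spark, path), keysDf, Seq("id"))`, where `spark`, `path`, and `keysDf` are placeholders for an existing session, table location, and DataFrame.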
```
val duplicates = df
  .select()
  .withColumn("__file_path", col("_metadata.file_path"))
  .withColumn("__row_index", col("_metadata.row_index"))
  .withColumn(
    "rank",
    row_number().over(
      Window()
        .partitionBy()
        .orderBy()))
  .filter("rank > 1")
  .drop("rank")
```

And then:

```
df.alias("old")
  .merge(
    duplicates.alias("new"),
    "old. = new....
```
Add all jodie-related blogs to the project README.
We want to let developers know about the public interface of this project on social media:

* `Type2Scd.upsert`
* `DeltaHelpers.removeDuplicateRecords` (remove all occurrences)
* ~~`DeltaHelpers.removeDuplicateRecords` (leave one occurrence)~~
* `DeltaHelpers.latestVersion`...