Add support for adding the file path as a new column when reading from CSV/JSON, etc.

Open djouallah opened this issue 1 year ago • 4 comments

A common pattern when reading from CSV, JSON, etc. is to add a column to the destination table recording which file each row came from. The next time you add new CSV files, you can skip the files that were already processed, so you will not end up with duplicate values. duckdb/polars, for example, support this using filename = true
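To make the pattern above concrete, here is a minimal sketch using only the Python standard library. It is not Daft's API: the helper names `read_with_filename`, `load_new_files`, and `demo`, the in-memory list standing in for the destination table, and the sample file names are all assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def read_with_filename(path):
    """Read one CSV file and tag every row with the file it came from."""
    with open(path, newline="") as f:
        return [{**row, "filename": path.name} for row in csv.DictReader(f)]


def load_new_files(folder, table):
    """Append rows only from files whose names are not already in the table."""
    seen = {row["filename"] for row in table}
    loaded = []
    for path in sorted(Path(folder).glob("*.csv")):
        if path.name in seen:
            continue  # already ingested on an earlier run, skip to avoid duplicates
        table.extend(read_with_filename(path))
        loaded.append(path.name)
    return loaded


def demo():
    """Run two ingestion passes over a temporary folder (hypothetical data)."""
    with tempfile.TemporaryDirectory() as folder:
        table = []  # stands in for the destination table
        Path(folder, "a.csv").write_text("id,value\n1,x\n2,y\n")
        first = load_new_files(folder, table)
        Path(folder, "b.csv").write_text("id,value\n3,z\n")
        second = load_new_files(folder, table)
        return first, second, table
```

On the second pass only `b.csv` is read, because `a.csv` already appears in the `filename` column of the table.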

djouallah avatar Sep 07 '24 01:09 djouallah

Thanks for raising this! This is something we have been thinking about adding as well.

kevinzwang avatar Sep 10 '24 00:09 kevinzwang

@colin-ho could you pick this issue up?

jaychia avatar Sep 25 '24 19:09 jaychia

As an added bonus: if we could figure out that the dataframe is partitioned by filename (if no file splitting was performed) that could be really cool.

This could enable easy and cheap data manipulation such as: df.read_parquet("...", filename=True).groupby("filename").count()
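A rough sketch of why this would be cheap, in plain Python rather than Daft: if every partition holds rows from exactly one file, a per-file count is just the length of each partition, with no data exchanged between partitions. The function name `count_rows_per_file` and the list-of-lists partition layout are assumptions for illustration only.

```python
def count_rows_per_file(partitions, column="filename"):
    """Count rows per file, assuming each partition holds rows from one file."""
    counts = {}
    for part in partitions:
        if not part:
            continue  # empty partitions contribute nothing
        names = {row[column] for row in part}
        if len(names) != 1:
            # The shortcut only holds when no file splitting or mixing occurred.
            raise ValueError("partition mixes rows from several files")
        name = names.pop()
        counts[name] = counts.get(name, 0) + len(part)
    return counts


# Hypothetical data: two partitions, one per source file.
partitions = [
    [{"id": 1, "filename": "a.parquet"}, {"id": 2, "filename": "a.parquet"}],
    [{"id": 3, "filename": "b.parquet"}],
]
result = count_rows_per_file(partitions)
```

If a file were split across partitions, the counts for its pieces would still add up; only partitions that mix several files break the assumption and are rejected.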

jaychia avatar Sep 26 '24 09:09 jaychia

Example use case for counting the number of distinct rows, grouped by filename: [image]

jaychia avatar Sep 26 '24 09:09 jaychia

Any update on this?

djouallah avatar Oct 11 '24 12:10 djouallah

Hi @djouallah, sorry for the delay. I'm currently finalizing the PR for this and will let you know once it is ready.

colin-ho avatar Oct 11 '24 18:10 colin-ho

Hey @djouallah, this feature should be ready in the next release!

colin-ho avatar Oct 15 '24 17:10 colin-ho

This feature is ready in v0.3.9, closing the issue.

colin-ho avatar Oct 24 '24 16:10 colin-ho