Daft: add support for adding the file path as a new column when reading from CSV/JSON etc.

A common pattern when reading from CSV, JSON, etc. is to add a column to the destination table recording which file each row came from. The next time you add new CSV files, you can skip the files that were already processed and will not end up with duplicate values. DuckDB and Polars, for example, support this using filename = true.
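To make the pattern concrete, here is a minimal sketch in plain Python, using only the standard library rather than Daft, DuckDB, or Polars. The helper name `load_new_files` and the `filename` column name are hypothetical, chosen only for illustration.

```python
import csv
from pathlib import Path


def load_new_files(folder, already_loaded):
    """Read every CSV in `folder` that is not in `already_loaded`,
    tagging each row with the name of the file it came from."""
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):
        if path.name in already_loaded:
            continue  # this file was ingested by an earlier run
        with path.open(newline="") as f:
            for record in csv.DictReader(f):
                record["filename"] = path.name  # the extra source-file column
                rows.append(record)
    return rows
```

On each run, `already_loaded` would be the set of distinct `filename` values already present in the destination table, so only unseen files are appended.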

Sep 07 '24 01:09 djouallah

Thanks for raising this! This is something we have been thinking about adding as well.

Sep 10 '24 00:09 kevinzwang

@colin-ho could you pick this issue up?

Sep 25 '24 19:09 jaychia

As an added bonus: if we could figure out that the dataframe is partitioned by filename (if no file splitting was performed) that could be really cool.

This could enable easy and cheap data manipulation such as: daft.read_parquet("...", filename=True).groupby("filename").count()
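To illustrate why this would be cheap, here is a rough sketch in plain Python of a per-file count when each partition holds rows from exactly one file. This is not Daft's implementation. The `count_by_filename` helper and the list-of-dicts partition layout are assumptions made up for the example.

```python
from collections import Counter


def count_by_filename(partitions):
    """Count rows per filename, one partition at a time.

    Assumes no file splitting: each partition holds rows from exactly one
    file and each file maps to one partition, so every partial count is
    already final and no cross-partition shuffle is needed.
    """
    counts = {}
    for rows in partitions:
        local = Counter(row["filename"] for row in rows)
        if len(local) > 1:
            raise ValueError("partition spans more than one file")
        counts.update(local)  # safe only under the one-file-per-partition assumption
    return counts
```

If files were split across partitions, the partial counts would have to be merged by key instead, which is the shuffle that the partitioning guarantee would avoid.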

Sep 26 '24 09:09 jaychia

Example use-case for counting number of distinct rows, grouped by filename:

Sep 26 '24 09:09 jaychia

Any update on this?

Oct 11 '24 12:10 djouallah

Hi @djouallah, sorry for the delay. I'm currently finalizing the PR for this and will let you know once it is ready.

Oct 11 '24 18:10 colin-ho

Hey @djouallah, this feature should be ready in the next release!

Oct 15 '24 17:10 colin-ho

This feature is ready in v0.3.9, closing the issue.

Oct 24 '24 16:10 colin-ho