Add option to include source filename and filepath in dataframe

Open D1xieFlatline opened this issue 1 year ago • 12 comments

Problem description

When reading data from a large number of files, it can be helpful to keep track of the source file for a few reasons:

  • Identifying the source of data issues
  • Ability to reload a specific file rather than refreshing the entire data set
  • Giving users visibility into where data came from

You can always capture the file name/path in a variable and add that to the df after a file is loaded, but this creates extra steps and doesn't seem to work well with globs.

Adding options to include the name/path of source files when a file is read would be a nice quality of life feature.

I propose adding two new parameters to each file input function:

  • include_source_path: If True, includes the file path as an additional column. Default: False
  • include_source_name: If True, includes the file name as an additional column. Default: False

D1xieFlatline avatar Aug 14 '23 16:08 D1xieFlatline

Hey - I think https://github.com/pola-rs/polars/issues/5117#issue-1398256074 would allow you to do this

MarcoGorelli avatar Aug 14 '23 17:08 MarcoGorelli

Hi Marco, thanks for sharing that enhancement. I think that looks useful, but after reading the linked examples I see two key differences from this suggestion:

  • Users need to tag metadata rather than providing an option to capture it automatically when a dataframe is created
  • It doesn't appear that metadata is automatically passed on when a dataframe is written to a file

I usually add some variation of these three lines every time I read a file into a dataframe, and I think I'd still need to do some version of that if #5117 was implemented.

pl.lit(source_path).alias('SOURCE_FILEPATH'),
pl.lit(source_file).alias('SOURCE_FILENAME'),
pl.lit(source_sheet).alias('SOURCE_SHEETNAME'),

Maybe a good compromise would be to capture any parameter that's used to read a file as metadata by default?

D1xieFlatline avatar Aug 14 '23 19:08 D1xieFlatline

Thanks for your suggestion!

Writing

pl.lit(source_path).alias('SOURCE_FILEPATH'),
pl.lit(source_file).alias('SOURCE_FILENAME'),

looks fine, I don't really see the advantage compared with

include_source_path=True,
include_source_name=True,

Closing then

MarcoGorelli avatar Aug 14 '23 20:08 MarcoGorelli

I'm sorry, but how does the above actually address the concerns? If you are reading file by file then it is fine, but the bulk read commands do not expose that option.

MSKDom avatar Mar 03 '24 22:03 MSKDom

Thanks for the ping - reopening for now, will take another look in the week

MarcoGorelli avatar Mar 03 '24 22:03 MarcoGorelli

Just to clarify and add some context.

I've used CSV bulk read on path e.g. foo/*/*.csv. Part of the filename is a timestamp of when it was processed/uploaded, so something like this would solve the issue as I could simply apply a transform on that column to extract it.

Instead I had to traverse the file tree, use the standard library's globbing to find matches, and read file by file. It's not a massive workaround, but it prevents using the built-in methods and makes things a bit slower.

MSKDom avatar Mar 04 '24 11:03 MSKDom

I think another valid use case is when using a remote glob? e.g. scan_csv("http://.../foo/*.csv")

In which case the workaround approach is not applicable.

For reference, DuckDB has filename=true for its readers

  • https://duckdb.org/docs/data/multiple_files/overview#filename

cmdlineluser avatar Mar 04 '24 13:03 cmdlineluser

per discussion: accepted (pending some discussion on the dtype of the filename column)

MarcoGorelli avatar Mar 08 '24 13:03 MarcoGorelli

I wonder if this would intrinsically fix https://github.com/pola-rs/polars/issues/14936 or if that issue would still persist.

deanm0000 avatar Apr 02 '24 14:04 deanm0000

It would persist @deanm0000. This only appends metadata to files so I would guess it's a different area involved

MSKDom avatar Apr 03 '24 13:04 MSKDom

I have the same problem as described by @MSKDom, i.e. having to loop over files to import instead of bulk loading, because I need some filename info inside my created DataFrame. A simple but somewhat inelegant workaround is to use DuckDB instead of polars for the bulk loading of flat files.

klwlevy avatar Apr 15 '24 20:04 klwlevy

+1. Having a functionality similar to DuckDB's filename flag would be great!

pietrolesci avatar Apr 22 '24 10:04 pietrolesci