Add option to include source filename and filepath in dataframe
Problem description
When reading data from a large number of files, it can be helpful to keep track of the source file for a few reasons:
- Identifying the source of data issues
- Ability to reload a specific file rather than refreshing the entire data set
- Giving users visibility into where data came from
You can always capture the file name/path in a variable and add that to the df after a file is loaded, but this creates extra steps and doesn't seem to work well with globs.
Adding options to include the name/path of source files when a file is read would be a nice quality of life feature.
I propose adding two new parameters to each file input function:
- include_source_path: If True, includes the file path as an additional column. Default: False
- include_source_name: If True, includes the file name as an additional column. Default: False
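For illustration, a sketch of how the proposed parameters might be used on a glob read (these flags are only proposed here and do not exist in polars today; names are illustrative):

```python
import polars as pl

# Hypothetical usage of the proposed flags: the source columns would be added
# automatically, even when the input is a glob covering many files.
df = pl.read_csv(
    "data/*.csv",
    include_source_path=True,   # proposed: add a column with the full file path
    include_source_name=True,   # proposed: add a column with just the file name
)
```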
Hey - I think https://github.com/pola-rs/polars/issues/5117#issue-1398256074 would allow you to do this
Hi Marco, thanks for sharing that enhancement. I think that looks useful, but after reading the linked examples I see two key differences from this suggestion:
- Users need to tag metadata rather than providing an option to capture it automatically when a dataframe is created
- It doesn't appear that metadata is automatically passed on when a dataframe is written to a file
I usually add some variation of these three lines every time I read a file into a dataframe, and I think I'd still need to do some version of that if #5117 was implemented.
pl.lit(source_path).alias('SOURCE_FILEPATH'),
pl.lit(source_file).alias('SOURCE_FILENAME'),
pl.lit(source_sheet).alias('SOURCE_SHEETNAME'),
Maybe a good compromise would be to capture any parameter that's used to read a file as metadata by default?
Thanks for your suggestion!
Writing
pl.lit(source_path).alias('SOURCE_FILEPATH'),
pl.lit(source_file).alias('SOURCE_FILENAME'),
looks fine, I don't really see the advantage compared with
include_source_path=True,
include_source_name=True,
Closing then
I'm sorry, but how does the above actually address the concerns? If you are reading file by file it is fine, but the bulk read commands do not expose that option.
Thanks for the ping - reopening for now, will take another look in the week
Just to clarify and add some context.
I've used the CSV bulk read on a path like foo/*/*.csv. Part of the filename is a timestamp of when it was processed/uploaded, so something like this would solve the issue, as I could simply apply a transform on that column to extract it.
Instead I had to traverse the file tree, use the standard library's globbing to find matches, and read file by file. Not a massive workaround, but it does hinder the ability to use built-in methods and makes things a bit slower.
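A minimal sketch of that file-by-file workaround, assuming local files laid out as foo/*/*.csv and a made-up timestamp format in the filenames (the column names and regex are illustrative, and str.to_datetime requires a reasonably recent polars):

```python
import polars as pl
from pathlib import Path

# Glob the files ourselves, tag each frame with its source path, then concatenate.
frames = [
    pl.read_csv(path).with_columns(pl.lit(str(path)).alias("SOURCE_FILEPATH"))
    for path in Path("foo").glob("*/*.csv")
]
df = pl.concat(frames)

# With the path available as a column, an upload timestamp embedded in the
# filename (assumed here to look like data_20240131T120000.csv) can be parsed:
df = df.with_columns(
    pl.col("SOURCE_FILEPATH")
    .str.extract(r"(\d{8}T\d{6})")
    .str.to_datetime("%Y%m%dT%H%M%S")
    .alias("UPLOADED_AT")
)
```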
I think another valid use case is a remote glob, e.g. scan_csv("http://.../foo/*.csv"), in which case the workaround approach is not applicable.
For reference, DuckDB has filename=true for its readers: https://duckdb.org/docs/data/multiple_files/overview#filename
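For comparison, a small sketch of that DuckDB behaviour via its Python package (the .pl() conversion to a polars DataFrame assumes a reasonably recent duckdb release):

```python
import duckdb

# filename=true adds a "filename" column containing the source file of each row.
df = duckdb.sql(
    "SELECT * FROM read_csv_auto('foo/*/*.csv', filename = true)"
).pl()
```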
Per discussion: accepted (pending some discussion on the dtype of the filename column).
I wonder if this would intrinsically fix https://github.com/pola-rs/polars/issues/14936 or if that issue would still persist.
It would persist, @deanm0000. This only appends metadata to files, so I would guess it's a different area involved.
I have the same problem as described by @MSKDom, i.e. having to loop over files to import them instead of bulk loading, because I need some filename info inside my created DataFrame. A simple but somewhat inelegant workaround is to use DuckDB instead of polars for the bulk loading of flat files.
+1. Having functionality similar to DuckDB's filename flag would be great!