polars
Add add_filename to pl.read_csv (and other read operations)
Description
Occasionally, the filename of a data file is itself critical information.
For example:
Users/
    Alice.csv
    Bob.csv
    Charlie.csv
When using glob patterns to read this data, the file name itself is lost - which all but forces the user to loop over the files and read them manually.
from glob import glob
import polars as pl

# This does not preserve which row belongs to which user
df = pl.read_csv('Users/*.csv')

# This works, but is a bit long
df = pl.concat([
    pl.read_csv(file).with_columns(filename=pl.lit(file))
    for file in glob('Users/*.csv')
])
A parameter that adds a column with the source file name when reading data via a glob pattern would be nice to have.
include_file_paths was added for most of the formats: https://github.com/pola-rs/polars/pull/17563
>>> pl.scan_csv("*.csv", include_file_paths="filename").collect()
shape: (2, 4)
┌─────┬─────┬─────┬──────────┐
│ a   ┆ b   ┆ c   ┆ filename │
│ --- ┆ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ i64 ┆ str      │
╞═════╪═════╪═════╪══════════╡
│ 1   ┆ 2   ┆ 3   ┆ a.csv    │
│ 4   ┆ 5   ┆ 6   ┆ b.csv    │
└─────┴─────┴─────┴──────────┘
Seems it just needs to be exposed via read_csv
Thank you, I was on an old version of Polars and had not noticed. Adding it to the eager methods would be nice.
Yes, let's expose this to the eager methods as well.
This would greatly benefit from using a categorical for the include_file_paths columns, no? Presumably the number of records is typically much greater than the number of files.
Trying to tackle this one
@ritchie46 is this still open, or is it a stale issue? I would love to work on this.