polars: Add add_filename to pl.read_csv (and other read operations)

Open mkleinbort-wl opened this issue 1 year ago • 5 comments

Description

Occasionally, the filename of a data file is itself critical information.

For example:

Users/
   Alice.csv
   Bob.csv
   Charlie.csv

When using glob patterns to read this data, the file name itself is lost - which all but forces the user to loop over the files and read them manually.


from glob import glob

import polars as pl

# This does not preserve which row belongs to which user
df = pl.read_csv('Users/*.csv')

# This keeps the file name, but is a bit long
df = pl.concat([
    pl.read_csv(file).with_columns(filename=pl.lit(file))
    for file in glob('Users/*.csv')
])

A parameter that adds a column with the source file name when reading data via a glob pattern would be nice to have.

Oct 16 '24 20:10 mkleinbort-wl

include_file_paths was added for most of the formats: https://github.com/pola-rs/polars/pull/17563

>>> pl.scan_csv("*.csv", include_file_paths="filename").collect()
shape: (2, 4)
┌─────┬─────┬─────┬──────────┐
│ a   ┆ b   ┆ c   ┆ filename │
│ --- ┆ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ i64 ┆ str      │
╞═════╪═════╪═════╪══════════╡
│ 1   ┆ 2   ┆ 3   ┆ a.csv    │
│ 4   ┆ 5   ┆ 6   ┆ b.csv    │
└─────┴─────┴─────┴──────────┘

Seems it just needs to be exposed via read_csv

Oct 16 '24 20:10 cmdlineluser

Thank you, I was on an old version of Polars and had not noticed. Adding it to the eager methods would be nice.

Oct 16 '24 20:10 mkleinbort-wl

Yes, let's expose this to the eager methods as well.

Oct 17 '24 06:10 ritchie46

This would greatly benefit from using a categorical for the include_file_paths columns, no? Presumably the number of records is typically much greater than the number of files.

Oct 17 '24 18:10 mcrumiller

Trying to tackle this one

Oct 17 '24 19:10 alonme

@ritchie46 is this still open, or is it a stale issue? I would love to work on this.

Sep 01 '25 18:09 ghost
