polars icon indicating copy to clipboard operation
polars copied to clipboard

Combine `engine_options` and `read_options` into a single parameter in `read_excel`

Open mcrumiller opened this issue 1 year ago • 5 comments

Description

Somewhat related to #17263.

read_excel has two parameters, engine_options and read_options:

  • engine_options - Additional options passed to the underlying engine’s primary parsing constructor (given below), if supported
  • read_options - Options passed to the underlying engine method that reads the sheet data

This is really confusing from a user perspective. What's the difference between engine options that parse the data and engine options that reads the data? IMO, it would be easier to provide a universal set of options that are converted to the engine-specific options behind the scenes. All engines provide the same functionality for reading at this point, so it would be nice to instead have the following parameters (feel free to add):

  • sheet_name - specifies sheet by name
  • sheet_id - specifies sheet by ID, 1-indexed. If 0, return all sheets as a dict.
  • range - e.g. B3:AC52 or something like [(2,5), (83, 15)]
  • ~dtypes - dtypes of columns~ (already available as "schema_overrides")

mcrumiller avatar Jun 28 '24 12:06 mcrumiller

I'm pretty sure @alexander-beedie already has a design for this (see https://github.com/pola-rs/polars/pull/15808#issuecomment-2067944349). Not sure it will make it into 1.0.0 though.

stinodego avatar Jun 28 '24 13:06 stinodego

I do, and it's "mostly" option 3, where the most common options are all exposed at the top-level and we convert them to engine-specific calls ourselves so the user doesn't have to (as we're now doing with schema_overrides, for example).

We will still maintain read_options and engine_options for what should become a vanishingly small number of cases not handled by the top-level parameters.

alexander-beedie avatar Jun 28 '24 13:06 alexander-beedie

Thanks; should I go ahead and close this one?

mcrumiller avatar Jun 28 '24 13:06 mcrumiller

Thanks; should I go ahead and close this one?

It's ok, can leave it here so I have something to focus on and close when it's done (hopefully very shortly) 🤣

alexander-beedie avatar Jun 28 '24 16:06 alexander-beedie

FYI: #17263 now contains a common columns param - more to come...

alexander-beedie avatar Jun 29 '24 21:06 alexander-beedie

...and just added a common has_header param for all engines too👌

alexander-beedie avatar Aug 07 '24 14:08 alexander-beedie

I'd like to comment on the suggestion for range with a +1. Often times spreadsheets have extra information in additional cells, so it's important to build the dataframe from a specified range or something like an excel table (in calamine and usage)

acxz avatar Sep 07 '24 21:09 acxz