`select` and order of columns
Today I had to figure out how to generate a CSV with Explorer.DataFrame, in a way that ensures the CSV columns will be in a specific order.
Here are my findings, which led me to `select`:
- If you feed a list of maps to the `DataFrame`, the fields appear to be defined in alphabetical order (at least when `IO.inspect` is called on it):
#Explorer.DataFrame<
Polars[1 x 40]
accessibilite_pmr string ["Accessible mais non réservé PMR"]
adresse_station string ["26 rue des écluses, 17430 Champdolent"]
cable_t2_attache boolean [false]
code_insee_commune string ["17085"]
condition_acces string ["Accès libre"]
contact_amenageur string ["[email protected]"]
contact_operateur string ["[email protected]"]
coordonneesXY string ["[-0.799141,45.91914]"]
date_maj string ["2024-10-17"]
date_mise_en_service string ["2024-10-02"]
- Calling `select` with an (ordered) list of fields appears to implicitly order them as provided:
columns = ["nom_amenageur", ...]
valid_record = DB.Factory.IRVE.generate_row()
Explorer.DataFrame.new([valid_record])
|> Explorer.DataFrame.select(columns)
#Explorer.DataFrame<
Polars[1 x 40]
nom_amenageur string ["Métropole de Nulle Part"]
siren_amenageur string ["123456782"]
contact_amenageur string ["[email protected]"]
nom_operateur string ["Opérateur de Charge"]
- This sorting appears to be an expectation of some users of Polars:
- https://github.com/pola-rs/polars/issues/24636
- https://stackoverflow.com/questions/71353113
- The documentation (https://hexdocs.pm/explorer/Explorer.DataFrame.html#select/2) does not mention anything specific
(essentially, maybe the behavior is unspecified, or covered by tests only, or just by Polars itself)
So this leads me to wonder whether I can safely (in a future-proof way) rely on `select` to order the columns of a generated CSV.
Does anyone have any certainty on that topic?
Thank you!
Hi @thbar!
The theme here is that if the input Enumerable has a well defined order, then so will the output.
Concerning the order of columns:
- There is no expected order from `new([map])`
- There is an expected order from `new([keyword_list])`
- There is an expected order from `select(df, list_of_column_names)`
Some examples:
iex> Mix.install [:explorer]
# Swapped order here is fine
iex> df = Explorer.DataFrame.new([%{b: 2, a: 1}, %{a: 3, b: 4}])
#Explorer.DataFrame<
Polars[2 x 2]
a s64 [1, 3]
b s64 [2, 4]
>
# Swapped order here causes an issue
iex> df = Explorer.DataFrame.new([[b: 2, a: 1], [a: 3, b: 4]])
** (RuntimeError) key-value records must have columns in the same order, expected :b, but got :a
(table 0.1.2) lib/table/reader/enumerable.ex:100: Table.Reader.Enumerable.keyval_values/2
(table 0.1.2) lib/table/mapper.ex:40: anonymous fn/4 in Enumerable.Table.Mapper.reduce/3
(elixir 1.18.3) lib/enum.ex:4968: Enumerable.List.reduce/3
(elixir 1.18.3) lib/enum.ex:4515: Enum.reduce/3
(table 0.1.2) lib/table.ex:171: Table.read_columns/2
(explorer 0.11.1) lib/explorer/polars_backend/data_frame.ex:560: Explorer.PolarsBackend.DataFrame.from_tabular/2
iex:3: (file)
# The order here is respected
iex> df |> Explorer.DataFrame.select([:b, :a])
#Explorer.DataFrame<
Polars[2 x 2]
b s64 [2, 4]
a s64 [1, 3]
>
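To complement the failing keyword-list example above, here is a minimal sketch of the case where every row lists its keys in the same order. Based on the rule stated earlier, the columns should then come out as `b` followed by `a`. The `DF` alias is only a convenience introduced for this sketch.

```elixir
Mix.install([:explorer])

alias Explorer.DataFrame, as: DF

# Every row lists its keys in the same order (:b first, then :a),
# so the "columns in the same order" error is not raised.
df = DF.new([[b: 2, a: 1], [b: 4, a: 3]])

# Column names, in the order the data frame stores them.
IO.inspect(DF.names(df))
```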
The documentation (https://hexdocs.pm/explorer/Explorer.DataFrame.html#select/2) does not mention anything specific
I agree the documentation could be clearer on this. Part of the problem is that `select` and friends accept a number of different types:
- A single column name
- A list of column names
- A single integer (the column index)
- A list of integers
- A range of integers
- A callback function of type `name() -> boolean()`
- A callback function of type `(name(), dtype()) -> boolean()`
- A regex
The reason this is so expansive is that Polars provides a fairly broad "selector" API, of which we implement a good chunk. But I think we should do better about documenting that.
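For illustration, here is a rough sketch of a few of those selector forms applied to a small data frame. The column names (`alpha`, `beta`, `gamma`) and the `DF` alias are made up for this example, and the dtype check assumes string columns are reported as `:string`.

```elixir
Mix.install([:explorer])

alias Explorer.DataFrame, as: DF

# Hypothetical data frame with three columns of different dtypes.
df = DF.new(alpha: [1, 2], beta: ["x", "y"], gamma: [1.0, 2.0])

# A list of column names: the order of the list is the order of the result.
by_names = df |> DF.select(["gamma", "alpha"]) |> DF.names()

# A range of column indexes.
by_range = df |> DF.select(0..1) |> DF.names()

# A regex matched against column names.
by_regex = df |> DF.select(~r/^b/) |> DF.names()

# A callback receiving the column name.
by_name_fun = df |> DF.select(&String.starts_with?(&1, "g")) |> DF.names()

# A callback receiving the column name and its dtype.
by_dtype_fun = df |> DF.select(fn _name, dtype -> dtype == :string end) |> DF.names()

IO.inspect({by_names, by_range, by_regex, by_name_fun, by_dtype_fun})
```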
In short: select(df, list_of_names) is guaranteed to respect the column order, so you can rely on that.
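Applied to the original CSV question, a minimal sketch could look like the following. The record, its values, and the three-column list are placeholders invented for this example (the real data has 40 columns), and `dump_csv!/1` is used here only to get the CSV as a string; writing to a file would work along the same lines.

```elixir
Mix.install([:explorer])

alias Explorer.DataFrame, as: DF

# Hypothetical column order wanted in the CSV header.
columns = [:nom_amenageur, :siren_amenageur, :contact_amenageur]

# Hypothetical record: a map, so its key order carries no meaning.
record = %{
  contact_amenageur: "contact@example.com",
  siren_amenageur: "123456782",
  nom_amenageur: "Métropole de Nulle Part"
}

csv =
  [record]
  |> DF.new()
  # select/2 with a list of names fixes the column order.
  |> DF.select(columns)
  |> DF.dump_csv!()

IO.puts(csv)
```

Keeping the list of column names in one place (a module attribute, for instance) makes the CSV layout explicit and independent of how the rows were built.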
Many thanks @billylanchantin for the confirmation!