
`select` and order of columns

Open thbar opened this issue 2 months ago • 2 comments

Today I had to figure out how to generate a CSV with Explorer.DataFrame, in a way that ensures the CSV columns will be in a specific order.

Here are my findings, which led me to select:

  1. If you feed a list of maps to the DataFrame, the fields appear to be ordered alphabetically (at least when IO.inspect is called on it):
#Explorer.DataFrame<
  Polars[1 x 40]
  accessibilite_pmr string ["Accessible mais non réservé PMR"]
  adresse_station string ["26 rue des écluses, 17430 Champdolent"]
  cable_t2_attache boolean [false]
  code_insee_commune string ["17085"]
  condition_acces string ["Accès libre"]
  contact_amenageur string ["[email protected]"]
  contact_operateur string ["[email protected]"]
  coordonneesXY string ["[-0.799141,45.91914]"]
  date_maj string ["2024-10-17"]
  date_mise_en_service string ["2024-10-02"]
  1. Calling select with an (ordered) list of fields appears to implicitly order them as provided:
columns = ["nom_amenageur", ...]
valid_record = DB.Factory.IRVE.generate_row()
Explorer.DataFrame.new([valid_record])
|> Explorer.DataFrame.select(columns)

#Explorer.DataFrame<
  Polars[1 x 40]
  nom_amenageur string ["Métropole de Nulle Part"]
  siren_amenageur string ["123456782"]
  contact_amenageur string ["[email protected]"]
  nom_operateur string ["Opérateur de Charge"]
  1. This ordering appears to be an expectation of some Polars users:
  • https://github.com/pola-rs/polars/issues/24636
  • https://stackoverflow.com/questions/71353113
  1. The documentation (https://hexdocs.pm/explorer/Explorer.DataFrame.html#select/2) does not mention anything specific

(essentially, maybe the behavior is unspecified, or covered by tests only, or just by Polars itself)

So I wonder whether I can safely (in a future-proof way) rely on select to order the columns of a generated CSV.

Does anyone have any certainty on this topic?

Thank you!

thbar, Oct 15 '25 15:10

Hi @thbar!

The theme here is that if the input Enumerable has a well-defined order, then so will the output.

Concerning the order of columns:

  1. There is no expected order from new([map])
  2. There is an expected order from new([keyword_list])
  3. There is an expected order from select(df, list_of_column_names)

Some examples:

iex> Mix.install [:explorer]

# Swapped order here is fine
iex> df = Explorer.DataFrame.new([%{b: 2, a: 1}, %{a: 3, b: 4}])
#Explorer.DataFrame<
  Polars[2 x 2]
  a s64 [1, 3]
  b s64 [2, 4]
>

# Swapped order here causes an issue
iex> df = Explorer.DataFrame.new([[b: 2, a: 1], [a: 3, b: 4]])
** (RuntimeError) key-value records must have columns in the same order, expected :b, but got :a
    (table 0.1.2) lib/table/reader/enumerable.ex:100: Table.Reader.Enumerable.keyval_values/2
    (table 0.1.2) lib/table/mapper.ex:40: anonymous fn/4 in Enumerable.Table.Mapper.reduce/3
    (elixir 1.18.3) lib/enum.ex:4968: Enumerable.List.reduce/3
    (elixir 1.18.3) lib/enum.ex:4515: Enum.reduce/3
    (table 0.1.2) lib/table.ex:171: Table.read_columns/2
    (explorer 0.11.1) lib/explorer/polars_backend/data_frame.ex:560: Explorer.PolarsBackend.DataFrame.from_tabular/2
    iex:3: (file)

# The order here is respected
iex> df |> Explorer.DataFrame.select([:b, :a])
#Explorer.DataFrame<
  Polars[2 x 2]
  b s64 [2, 4]
  a s64 [1, 3]
>
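
To complement the examples above, here is a minimal sketch of case 2 where every keyword list uses the same key order, so construction succeeds. The row values are made up for illustration, and `Explorer.DataFrame.names/1` is used only to read back the column order.

```elixir
Mix.install([:explorer])

# Every row lists its keys in the same order (:b, then :a),
# so the key-value reader accepts the input.
rows = [[b: 2, a: 1], [b: 4, a: 3]]
df = Explorer.DataFrame.new(rows)

# Per the rule above, the columns should come back in the order
# they were given in the keyword lists, i.e. b before a.
IO.inspect(Explorer.DataFrame.names(df))
```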

> The documentation (https://hexdocs.pm/explorer/Explorer.DataFrame.html#select/2) does not mention anything specific

I agree the documentation could be clearer on this. Part of the problem is that select and friends accept a number of different types:

  • A single column name
  • A list of column names
  • A single integer (the column index)
  • A list of integers
  • A range of integers
  • A callback function of type name() -> boolean()
  • A callback function of type (name(), dtype()) -> boolean()
  • A regex
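
As a rough illustration of those argument types, the sketch below calls select with each of them on a small made-up frame. The column names and values are hypothetical, and the regex form is taken from the list above, so it is worth verifying against your Explorer version.

```elixir
Mix.install([:explorer])

alias Explorer.DataFrame, as: DF

# Hypothetical frame, built from a keyword list of columns.
df = DF.new(a: [1, 3], b: [2, 4], c: ["x", "y"])

by_name = DF.select(df, "a")
by_names = DF.select(df, ["b", "a"])
by_index = DF.select(df, 0)
by_indexes = DF.select(df, [1, 0])
by_range = DF.select(df, 0..1)
by_name_fun = DF.select(df, fn name -> name != "c" end)
by_dtype_fun = DF.select(df, fn _name, dtype -> dtype == :string end)
by_regex = DF.select(df, ~r/^[ab]$/)

# Print the resulting column names for each selector.
selections = [
  by_name, by_names, by_index, by_indexes,
  by_range, by_name_fun, by_dtype_fun, by_regex
]

for selected <- selections do
  IO.inspect(DF.names(selected))
end
```

Only the list-of-names form carries the explicit ordering guarantee discussed in this thread; the sketch makes no claim about the order produced by the other forms.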

The list is this long because Polars provides a fairly expansive "selector" API, of which we implement a good chunk. But I think we should do a better job of documenting that.

In short: select(df, list_of_names) is guaranteed to respect the column order, so you can rely on that.
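
For the CSV use case from the original question, a minimal sketch could look like the following. The record and column names are made up, and `dump_csv!` is used here only to get the CSV back as a string; `to_csv` would write it to a file instead.

```elixir
Mix.install([:explorer])

alias Explorer.DataFrame, as: DF

# Hypothetical record; map keys carry no guaranteed column order.
record = %{a: 1, b: 2, c: 3}

# The desired CSV column order, stated explicitly.
columns = ["c", "a", "b"]

csv =
  [record]
  |> DF.new()
  |> DF.select(columns)
  |> DF.dump_csv!()

# The header line should list the columns in the selected order.
IO.puts(csv)
```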

billylanchantin, Oct 17 '25 13:10

> In short: select(df, list_of_names) is guaranteed to respect the column order, so you can rely on that.

Many thanks @billylanchantin for the confirmation!

thbar, Nov 04 '25 13:11