Would it make sense to introduce a DataFrame.to_tensor?

Open NduatiK opened this issue 3 years ago • 15 comments

We have a Series.to_tensor method, however, for many ML tasks, we will want more than one column.

Would it make sense to add one for DataFrames?

Right now I am doing the following as a quick hack:

def df_to_tensor(df) do
  df
  |> DataFrame.names()
  |> Enum.map(fn name ->
    df[name]
    |> Series.to_tensor()
    |> Nx.reshape({:auto, 1})
  end)
  |> Nx.concatenate(axis: 1)
end

x =
  DataFrame.select(df, ["Life", "Country"], :drop)
  |> df_to_tensor()
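
As an aside, the same column-wise conversion can be sketched with `Nx.stack/2` instead of the reshape-and-concatenate dance (assuming every selected column is numeric and converts cleanly via `Series.to_tensor/1`):

```elixir
# Alternative sketch of the hack above using Nx.stack/2;
# each column becomes a 1-d tensor and they are stacked along axis 1.
def df_to_tensor(df) do
  df
  |> DataFrame.names()
  |> Enum.map(fn name -> Series.to_tensor(df[name]) end)
  |> Nx.stack(axis: 1)
end
```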

A sufficient implementation would at minimum include the following:

  1. Control over which columns are selected and in what order
  2. Control over the final data type (not sure if this would require a single type option to be passed to the inner Nx.tensor)

Beyond that, I can't think of any other changes that could not be trivially performed in Explorer before conversion or in Nx after.


Why not do this?

  • Tensor options become non-obvious. Right now, it is not clear how to handle names. What happens if more options are added to Nx.tensor?

Proposal

If this is a good idea:

  @doc """
  Converts a dataframe to a `t:Nx.Tensor.t/0`.

  Can also convert a subset of columns by name.

  ## Supported dtypes

    * `:float`
    * `:integer`

  ## Examples

      iex> df = Explorer.DataFrame.from_map(%{floats: [1.0, 2.0], ints: [1, 2]})
      #Explorer.DataFrame<
        [rows: 2, columns: 2]
        floats float [1.0, 2.0]
        ints integer [1, 2]
      >
      iex> Explorer.DataFrame.to_tensor(df)
      #Nx.Tensor<
        f32[2][2]
        [
          [1.0, 1.0],
          [2.0, 2.0]
        ]
      >
  """
  def to_tensor(%DataFrame{} = df, column_names \\ nil) do
    (column_names || names(df))
    |> Enum.map(fn name ->
      df[name]
      |> Series.to_tensor()
      |> Nx.reshape({:auto, 1})
    end)
    |> Nx.concatenate(axis: 1)
  end

NduatiK avatar Jan 27 '22 20:01 NduatiK

To me, a dataframe is a map of 1-dimensional tensors with the column names as keys.

In fact, Nx defines a Nx.Container protocol. My hope is that you can fully pass a dataframe into a defn function, and if you implement the protocol, all "columns" would automatically convert to tensors.

So for example, your floats-ints dataframe, if we implement the Nx.Container protocol, you could do this:

defn add_floats_and_ints(df) do
  df["floats"] + df["ints"]
end

And it would return a tensor with the results of adding those.

Therefore, if we were to add some functionality, it would be something like Df.to_tensors_map(df).
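
For reference, a minimal sketch of what such a `to_tensors_map/1` could look like (the name comes from this thread, not an existing API; it assumes all columns have dtypes `Series.to_tensor/1` can handle):

```elixir
# Hypothetical to_tensors_map/1: returns a map of column name => 1-d tensor.
def to_tensors_map(df) do
  df
  |> Explorer.DataFrame.names()
  |> Map.new(fn name -> {name, Explorer.Series.to_tensor(df[name])} end)
end
```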

josevalim avatar Jan 27 '22 20:01 josevalim

This is interesting; I had only thought of DataFrames as tables. It definitely makes sense to return a map since it gives flexibility on the datatype of the individual 1-d tensors.

I will have a look at the Nx.Container and create a PR for discussion tomorrow.

Todo:

  • [ ] Df.to_tensors_map(df)
  • [ ] Nx integration through Nx.Container

NduatiK avatar Jan 27 '22 21:01 NduatiK

I am not sure if we should do the to_tensors_map version. Today you can easily get a column as a series right? This means you can get a tensor for it. :) but exploration into containers would be cool!

josevalim avatar Jan 27 '22 21:01 josevalim

Hmmm... giving this some thought. @josevalim I totally agree. However, re: your point about a map of 1d tensors... while this is usually how I think about it too, sometimes a dataframe has portions which should be treated as two-dimensional tensors. For example, you might have generated dummy variables and it totally makes sense to treat that as a matrix. You may want to pass a dataframe of features to an ML algorithm -- even just linear regression would treat a dataframe as a 2d matrix.

Agreed that this should be achieved with the Nx.Container protocol, just calling it out for consideration and some colour. For example, in R you pass a dataframe to lm and identify the formula with regressand ~ regressor_1 + regressor_2 etc. So you might have a dataframe df with variables height, age and sex and you might write model = lm(height ~ age + sex, data = df). In this case, R treats age and sex together as a 2d matrix. In Python you can do very similar with sklearn and pandas.

At the simplest level, this definitely gets used pretty frequently in pandas.

R is a little more similar to what I think we'd do, which is very similar to your suggestion above @NduatiK. If you have a purely numerical dataframe, you can just call as.matrix on it.

cigrainger avatar Jan 27 '22 23:01 cigrainger

The sklearn and pandas example is indeed what I was thinking of.

I had been worried about the datatypes on the final tensor but it is already possible to isolate the numeric columns with a select and then convert the result to a preferred tensor type using Nx.as_type.
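
For example, something along these lines (a sketch; the column names are made up, and `df_to_tensor` is the hack from earlier in this thread):

```elixir
# Sketch: keep only numeric columns, convert to a 2-d tensor,
# then cast the whole thing to a preferred type with Nx.as_type/2.
df
|> Explorer.DataFrame.select(["height", "age"])
|> df_to_tensor()
|> Nx.as_type({:f, 32})
```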

I will look into Nx.Container

NduatiK avatar Jan 28 '22 05:01 NduatiK

So what do you think about DF.to_tensor(df, ["height", "width"])? This ends up a 2d tensor.

josevalim avatar Jan 28 '22 07:01 josevalim

Not sure if you are asking me or @cigrainger, but that sounds good to me. I am currently doing the same thing, just using a DF.select first.

Would this allow column names to be shared or is it a normal tensor?

NduatiK avatar Jan 28 '22 09:01 NduatiK

I was asking both/everyone. :)

Would this allow column names to be shared or is it a normal tensor?

What do you mean?

josevalim avatar Jan 28 '22 09:01 josevalim

One thing is that tensor names are atoms and columns are strings. But perhaps we could introduce a convention: if the column name is a string, then it is not named in the tensor. If an atom, it preserves its name in the tensor?

josevalim avatar Jan 28 '22 10:01 josevalim

What do you mean?

I meant: Given dataframes do not have an obvious sorting of columns, would the returned tensor contain information about which series on the dataframe a column was created from. A DF with age and salary would lose data semantics as soon as it turns into a tensor. One would have to get the DataFrame.names and keep that around.
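
To illustrate, the caller ends up carrying the column order themselves (a sketch; `df_to_tensor` is the hack from earlier in this thread):

```elixir
# Sketch: keep the column order alongside the tensor so the
# data semantics survive the conversion.
names = Explorer.DataFrame.names(df)
tensor = df_to_tensor(df)

# later, look up which tensor column came from "age"
age_index = Enum.find_index(names, &(&1 == "age"))
age_column = Nx.slice_along_axis(tensor, age_index, 1, axis: 1)
```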

One thing is that tensor names are atoms and columns are strings

Whoops, just reread the docs. It turns out tensor names actually identify axes, not columns, so DataFrame column names do not map onto Nx.Tensor names. If we were to do this, it would have to use something else.

I am not sure it would make sense to add a field to the Nx.Tensor type, since the column names would have to be tracked and could easily be destroyed (e.g. when you run a dot product). Perhaps naming is something that should be tracked by the ML algorithms that consume DataFrames.

Of course, like you said, this issue goes away if we can just treat DataFrames as tensors.

NduatiK avatar Jan 28 '22 11:01 NduatiK

Whoops, just reread the docs. It turns out tensor names actually identify axes, not columns, so DataFrame column names do not map onto Nx.Tensor names. If we were to do this, it would have to use something else.

Ah, good call. The information is then lost, I am afraid.

Of course, like you said, this issue goes away if we can just treat DataFrames as tensors.

Right, my understanding right now is that those are two separate problems. Nx.Container is about accessing each column individually. What you told me is that sometimes you may also want to get certain (or all?) columns from a DF as a matrix; that would be a different operation (either to_matrix or to_tensor).

josevalim avatar Jan 28 '22 11:01 josevalim

Yes, my original issue was about a convenience function on the DataFrame itself.

By default it would convert all columns, but a list of names can be passed in. When the list of names is provided, the tensor is ordered by that list.

When the names are not provided, we might need to also return the list of column names, since DataFrame.names/1 depends on the backend and I am not sure all backends will guarantee ordering.

NduatiK avatar Jan 28 '22 13:01 NduatiK

Hi @josevalim, what has prevented an Nx.Container implementation so far? I had assumed the solution would be as simple as adapting the Map implementation.

Are there any special cases to be aware of given the different backends and the possible need to transfer values to the device?

NduatiK avatar Jan 31 '22 10:01 NduatiK

I think there are no blockers; it is just that nobody has implemented it so far. I think the implementation is similar to the one for maps, yeah. I would even make it so the DataFrame becomes a map.
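
For anyone following along, a rough sketch of what such an implementation could look like, modeled on the Map implementation of Nx.Container (the callback details here are assumptions, not tested code):

```elixir
# Rough sketch of an Nx.Container implementation for Explorer.DataFrame.
# The dataframe is traversed into a plain map of column name => tensor.
defimpl Nx.Container, for: Explorer.DataFrame do
  def traverse(df, acc, fun) do
    {pairs, acc} =
      df
      |> Explorer.DataFrame.names()
      |> Enum.map_reduce(acc, fn name, acc ->
        {tensor, acc} = fun.(Explorer.Series.to_tensor(df[name]), acc)
        {{name, tensor}, acc}
      end)

    {Map.new(pairs), acc}
  end

  def reduce(df, acc, fun) do
    df
    |> Explorer.DataFrame.names()
    |> Enum.reduce(acc, fn name, acc ->
      fun.(Explorer.Series.to_tensor(df[name]), acc)
    end)
  end
end
```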

josevalim avatar Jan 31 '22 10:01 josevalim

Got it, thanks

NduatiK avatar Jan 31 '22 10:01 NduatiK

@josevalim should this be closed now that we have TensorFrame?

cigrainger avatar Dec 11 '22 22:12 cigrainger

Thanks @josevalim and @cigrainger, I think your zero-copy work is much better than what I would have been able to do. ❤️💚💙

NduatiK avatar Dec 12 '22 11:12 NduatiK