Would it make sense to introduce a `DataFrame.to_tensor`? We have a `Series.to_tensor` function; however, for many ML tasks we will want more than one column. Would it make sense to add one for DataFrames?
Right now I am doing the following as a quick hack:

```elixir
def df_to_tensor(df) do
  df
  |> DataFrame.names()
  |> Enum.map(fn name ->
    df[name]
    |> Series.to_tensor()
    |> Nx.reshape({:auto, 1})
  end)
  |> Nx.concatenate(axis: 1)
end

x =
  DataFrame.select(df, ["Life", "Country"], :drop)
  |> df_to_tensor()
```
A sufficient implementation would at minimum include the following:

- Control over which columns are selected and in what order
- Control over the final data type (not sure if this would require a single `type` option to be passed to the inner `Nx.tensor`)
Beyond that, I can't think of any other changes that could not be trivially performed in Explorer before conversion or Nx after.
Why not do this?

- Tensor options become non-obvious. Right now, it is not clear how to handle names. What happens if more options are added to `Nx.tensor`?
Proposal
If this is a good idea:
```elixir
@doc """
Converts a dataframe to a `t:Nx.Tensor.t/0`.

Can also convert a subset of columns by name.

## Supported dtypes

  * `:float`
  * `:integer`

## Examples

    iex> df = Explorer.DataFrame.from_map(%{floats: [1.0, 2.0], ints: [1, 2]})
    #Explorer.DataFrame<
      [rows: 2, columns: 2]
      floats float [1.0, 2.0]
      ints integer [1, 2]
    >
    iex> Explorer.DataFrame.to_tensor(df)
    #Nx.Tensor<
      f32[2][2]
      [
        [1.0, 1.0],
        [2.0, 2.0]
      ]
    >
"""
def to_tensor(%DataFrame{} = df, column_names \\ nil) do
  # Default to all columns so the single-argument call in the doctest works.
  column_names = column_names || DataFrame.names(df)

  column_names
  |> Enum.map(fn name ->
    df[name]
    |> Series.to_tensor()
    |> Nx.reshape({:auto, 1})
  end)
  |> Nx.concatenate(axis: 1)
end
```
To me, a dataframe is a map of 1-dimensional tensors with the column names as keys.
In fact, Nx defines a `Nx.Container` protocol. My hope is that you can fully pass a dataframe into a `defn` function, and if you implement the protocol, all "columns" would automatically convert to tensors.

So for example, with your floats-ints dataframe, if we implement the `Nx.Container` protocol, you could do this:

```elixir
defn add_floats_and_ints(df) do
  df["floats"] + df["ints"]
end
```

And it would return a tensor with the results of adding those columns. Therefore, if we were to add some functionality, it would be something like `Df.to_tensors_map(df)`.
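To make the idea concrete, here is a rough sketch of what such a protocol implementation could look like. This is purely hypothetical: it assumes the protocol's `traverse/3` and `reduce/3` callbacks, and the real Explorer implementation may look quite different (for example, it may need backend-specific handling).

```elixir
# Hypothetical sketch -- not the actual Explorer implementation.
# Walks every column, converting each series to a tensor on the fly.
defimpl Nx.Container, for: Explorer.DataFrame do
  def traverse(df, acc, fun) do
    {pairs, acc} =
      df
      |> Explorer.DataFrame.names()
      |> Enum.map_reduce(acc, fn name, acc ->
        tensor = Explorer.Series.to_tensor(df[name])
        {result, acc} = fun.(tensor, acc)
        {{name, result}, acc}
      end)

    # Returning a plain map is one way the dataframe could
    # "become a map" of tensors inside defn.
    {Map.new(pairs), acc}
  end

  def reduce(df, acc, fun) do
    df
    |> Explorer.DataFrame.names()
    |> Enum.reduce(acc, fn name, acc ->
      fun.(Explorer.Series.to_tensor(df[name]), acc)
    end)
  end
end
```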
This is interesting, I had only thought of DataFrame as tables. It definitely makes sense to return a map since it gives flexibility on the datatype of the individual 1-d tensors.
I will have a look at the Nx.Container and create a PR for discussion tomorrow.
Todo:
- [ ] Df.to_tensors_map(df)
- [ ] Nx integration through Nx.Container
I am not sure if we should do the `to_tensors_map` version. Today you can easily get a column as a series, right? This means you can get a tensor for it. :) But exploration into containers would be cool!
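For reference, that per-column path already composes into a map with the existing API (a sketch; `df` here stands for any dataframe in scope):

```elixir
# Build a map of column name => 1-d tensor using only existing functions.
tensors_map =
  df
  |> Explorer.DataFrame.names()
  |> Map.new(fn name -> {name, Explorer.Series.to_tensor(df[name])} end)
```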
Hmmm... giving this some thought. @josevalim I totally agree. However, re: your point about a map of 1d tensors... while this is usually how I think about it too, sometimes a dataframe has portions which should be treated as two-dimensional tensors. For example, you might have generated dummy variables and it totally makes sense to treat that as a matrix. You may want to pass a dataframe of features to an ML algorithm -- even just linear regression would treat a dataframe as a 2d matrix.
Agreed that this should be achieved with the `Nx.Container` protocol, just calling it out for consideration and some colour. For example, in R you pass a dataframe to `lm` and identify the formula with `regressand ~ regressor_1 + regressor_2`, etc. So you might have a dataframe `df` with variables `height`, `age`, and `sex`, and you might write `model = lm(height ~ age + sex, data = df)`. In this case, R treats `age` and `sex` together as a 2d matrix. In Python you can do something very similar with `sklearn` and `pandas`.
At the simplest level, this definitely gets used pretty frequently in pandas.
R is a little more similar to what I think we'd do, which is very similar to your suggestion above @NduatiK. If you have a purely numerical dataframe, you can just call `as.matrix` on it.
The `sklearn` and `pandas` example is indeed what I was thinking of.

I had been worried about the datatypes of the final tensor, but it is already possible to isolate the numeric columns with a `select` and then convert the result to a preferred tensor type using `Nx.as_type`.
I will look into `Nx.Container`.
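That workflow can be sketched by reusing the `df_to_tensor` helper from earlier in the thread (the column names here are only illustrative):

```elixir
# Keep only the numeric feature columns, convert to a 2-d tensor,
# then cast the whole tensor to the preferred type.
features =
  df
  |> Explorer.DataFrame.select(["height", "age"])
  |> df_to_tensor()
  |> Nx.as_type({:f, 32})
```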
So what do you think about `DF.to_tensor(df, ["height", "width"])`? This ends up a 2d tensor.
Not sure if you are asking me or @cigrainger, but that sounds good to me. I am currently doing the same thing but just using a `DF.select` first.
Would this allow column names to be shared or is it a normal tensor?
I was asking both/everyone. :)
> Would this allow column names to be shared or is it a normal tensor?
What do you mean?
One thing is that tensor names are atoms and columns are strings. But perhaps we could introduce a convention: if the column name is a string, then it is not named in the tensor. If an atom, it preserves its name in the tensor?
> What do you mean?
I meant: given that dataframes do not have an obvious ordering of columns, would the returned tensor contain information about which series on the dataframe each column was created from? A DF with age and salary would lose data semantics as soon as it turns into a tensor. One would have to get the `DataFrame.names` and keep that around.
> One thing is that tensor names are atoms and columns are strings
Whoops, just reread the docs. It turns out tensor names actually identify axes, not columns; `DataFrame` column names do not map onto `names` on `Nx.Tensor`. If we were to do this, it would be using something else.

I am not sure it would make sense to add a field to the `Nx.Tensor` type, since the column names would have to be tracked and could easily be destroyed (e.g. when you run a dot product). Perhaps naming is something that should be tracked by the ML algorithms that consume `DataFrame`s.

Of course, like you said, this issue goes away if we can just treat `DataFrame`s as tensors.
> Whoops, just reread the docs. Turns out tensor names actually identify axes not columns, DataFrame column names do not map onto names on Nx.Tensor. If we were to do this, it would be using something else.
Ah, good call. The information is then lost, I am afraid.
> Of course, like you said, this issue goes away if we can just treat DataFrames as tensors.
Right, my understanding right now is that those are two separate problems. `Nx.Container` is about accessing each column individually. What you told me is that sometimes you may also want to get certain (or all?) columns from a DF as a matrix; that would be a different operation (either `to_matrix` or `to_tensor`).
Yes, my original issue was about a convenience function on the DataFrame itself.
By default it would convert all columns, but a list of names can be passed in. When the list of names is provided, the tensor is ordered by that list.
When the names are not provided, we might need to return the list of columns. `names` depends on the backend, and I am not sure if all backends will guarantee ordering.
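One way to address that ordering concern could be a helper along these lines. This is a hypothetical sketch built on the proposed `to_tensor/2` above, not an agreed API:

```elixir
# Hypothetical: return the column order alongside the tensor so the
# caller can later map tensor columns back to dataframe columns.
def to_tensor_with_names(df) do
  names = Explorer.DataFrame.names(df)
  {to_tensor(df, names), names}
end
```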
Hi @josevalim, what has prevented an `Nx.Container` implementation so far? I had assumed that the solution would be as simple as adapting the `Map` implementation.

Are there any special cases to be aware of given the different backends and the possible need to transfer values to the device?
I think there are no blockers; it is just that nobody has implemented the solution so far. I think the implementation is similar to maps, yeah. I would even make it so the DataFrame becomes a map.
Got it, thanks
@josevalim should this be closed now that we have `TensorFrame`?
Thanks @josevalim and @cigrainger, I think your zero-copy work is much better than what I would have been able to do. ❤️💚💙