DataFrames.jl
Make row lookup easier
This is a speculative idea. Maybe we could define GroupedDataFrame to be callable like this:
(gdf::GroupedDataFrame)(idxs...) = gdf[idxs]
In this way, instead of writing gdf[("val",)], users could write gdf("val").
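A runnable sketch of the proposal (the toy data frame is made up; the method definition is exactly the one above):
using DataFrames

# proposed convenience (not current API): forward call syntax to key indexing
(gdf::GroupedDataFrame)(idxs...) = gdf[idxs]

df = DataFrame(x = ["val", "other"], y = [1, 2])
gdf = groupby(df, :x)
gdf[("val",)]   # current syntax: index by key tuple
gdf("val")      # proposed shorthand, same SubDataFrame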
@nalimilan, @pdeffebach - what do you think?
Ref: https://discourse.julialang.org/t/any-plan-for-functionality-like-pandas-loc/81134
I'm not a fan of this. You only save 3 characters and the syntax is harder to understand.
It wouldn't really improve the workflow discussed in the Discourse thread, would it?
It would not - that is why we did not implement it originally. It is just non-conflicting and would make newcomers' lives slightly easier. I wanted to discuss it to make sure we are OK with the current design.
I was thinking of the case of a unique index. I could convert the DataFrame to a dictionary of DataFrame rows:
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
d = Dict( df.name .=> eachrow(df) )
This then allows d["Sally"].age, but it loses all the other DataFrame functionality.
There are so many times I want to do several lookups on different columns but using the same index column.
Example
I have a table of FX Forward rates I'm using to bootstrap discount factors. Columns are :Currency :Tenor :Rate.
The bootstrap is done separately per currency so first I'd like to @groupby :Currency.
Then @setindex :Tenor.
Possible Tenor values are: ON TN SPOT 1W 2W 1M 3M 6M 1Y 2Y
Using IndexValue:Column notation, the formulas are
:Fwd = SPOT:Rate + :Rate
SPOT:Fwd = :SPOT:Rate
TN:Fwd = SPOT:Rate - TN:Rate
ON:Fwd = TN:Rate - ON:Rate
TodayRate = ON:Fwd
:Discount = :Fwd / TodayRate * USDiscount(:Tenor)
Currently I'm calling a function for each group. The function then converts each sub-DataFrame to named arrays. This works fine; it would just be nice to do it all in a DataFrame.
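For reference, a minimal sketch of that per-group workaround using plain DataFrames.jl (column names from the example above; the rates are made up):
using DataFrames

fx = DataFrame(Currency = ["USD", "USD", "USD"],
               Tenor    = ["SPOT", "1M", "3M"],
               Rate     = [1.10, 0.002, 0.006])

# per currency, find the SPOT row by position and combine it with every row's rate
transform(groupby(fx, :Currency),
          [:Tenor, :Rate] =>
              ((t, r) -> r .+ r[findfirst(==("SPOT"), t)]) => :Fwd)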
I am moving it to the 1.5 release. Essentially the requirement is for a Dict-like object allowing an easy row lookup. It should be a wrapper around GroupedDataFrame allowing easy indexing on group keys.
So this would be something like (in general probably this could be made more efficient, but this shows the idea - please look at the API not at the implementation):
struct RowLookup{T<:GroupedDataFrame}
    gdf::T
end

function rowlookup(df::AbstractDataFrame, cols)
    gdf = groupby(df, cols)
    # the key columns must identify each row uniquely
    length(gdf) == nrow(df) || throw(ArgumentError("rows are not unique"))
    return RowLookup(gdf)
end

Base.getindex(rl::RowLookup, idxs...) = only(rl.gdf[idxs])
Base.parent(rl::RowLookup) = parent(rl.gdf)
This would allow for easy indexing like (referring to the example above):
d1 = rowlookup(df, :name)
d1["Sally"].age
d2 = rowlookup(df, [:name, :age])
d2["Sally", 42].children
Do we think it is worth adding something like this?
If this is for a unique index, would it be better to return a DataFrameRow rather than a 1-row DataFrame? So replace [1:1, :] with [1, :]. Then d1["Sally"].age returns a number instead of a 1-element vector. I think generally the number would be preferred.
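For comparison, what this looks like with today's GroupedDataFrame API (data from the example above):
using DataFrames

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3, 5, 2])
gdf = groupby(df, :name)
gdf[("Sally",)].age         # 1-element vector from the 1-row SubDataFrame
only(gdf[("Sally",)]).age   # 42.0, a scalar from the DataFrameRow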
It would be nice to have the rowlookup functionality but keep the object as a DataFrame, so you can add columns and do other stuff to it. With this implementation, you can't for example do d1.ageBy2 = 2 .* d1.age, because d1 wraps a GroupedDataFrame rather than a DataFrame. I guess this would make the implementation much more difficult, as you couldn't use the pre-existing GroupedDataFrame functionality.
So replace [1:1, :] with [1,:]
Yes - it was a typo; I meant [1, :]. Actually it should be only, as in the code above - I will fix :).
It would be nice to have the rowlookup functionality but keep the object as a DataFrame
This cannot be done, as indexing a DataFrame has different semantics.
What we could consider doing is to add syntax like:
df[:name => "Sally", :age => 42].children
with the requirement that the condition must identify a unique row. The only drawback would be that it would be an O(n) operation, but maybe this is not the end of the world, as it would be a convenience function?
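For illustration, what such a call would do under the hood - a sketch with a hypothetical findrow helper (an O(n) scan plus a uniqueness check via only):
using DataFrames

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3, 5, 2])

# hypothetical helper: only() errors unless exactly one row matches
findrow(df, prs::Pair...) =
    df[only(findall(r -> all(r[c] == v for (c, v) in prs), eachrow(df))), :]

findrow(df, :name => "Sally", :age => 42.0).children   # 5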
@nalimilan - what do you think?
It seems problematic to assume that df[:name => "Sally", :age => 42].children should reference a single row, as nothing in the syntax indicates that. I would rather expect it to be equivalent to subset(df, :name => ByRow(==("Sally")), :age => ByRow(==(42))).children.
I feel like we're not targeting the real problem here. Maybe the problem is just that when using @rtransform, one needs to repeat the data frame name. Using @transform, this works and is not too ugly IMO:
using DataFrames, DataFramesMeta

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
@chain df begin
    @transform :Age_relative_to_Sally = :age .- only(:age[:name .== "Sally"])
end
Or even without only (broadcasting will fail if more than one row is selected):
@chain df begin
    @transform :Age_relative_to_Sally = :age .- :age[:name .== "Sally"]
end
Maybe what's missing is a way to do the same using @rtransform? For example, wrapping :age[:name .== "Sally"] within a special marker to indicate that it should be evaluated on whole vectors rather than row-by-row? It could also be useful for things like :x - mean(:x).
CC @pdeffebach
I can't think of an easy way to mark just some columns as "whole column" references and others as not. The current implementation just takes the anonymous function created by the parsing and wraps it in ByRow.
We could have something that expands to :x - mean(df.x), but that results in type instabilities and would slow down performance.
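Roughly why: a closure over df sees a column whose concrete type the compiler cannot infer, whereas the function-barrier pattern (what ByRow relies on) specializes once the column is passed as an argument. A toy illustration (function names made up):
using DataFrames, Statistics

df = DataFrame(x = rand(10))

# closure over df: the compiler only knows df.x isa AbstractVector, so this is not inferable
demeaned_unstable(df) = map(v -> v - mean(df.x), df.x)

# function barrier: once the column is an argument, the code specializes on its element type
demeaned_barrier(x::AbstractVector) = x .- mean(x)
demeaned_barrier(df.x)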
Aren't there array types that allow for something like this?
@rtransform df :Age_relative_to_Sally = :age .- :age."Sally"
That seems pretty clean to me.
That's a really nice syntax if it's possible.
With NamedArrays.jl you can do
using DataFrames, DataFramesMeta, NamedArrays

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
@chain df begin
    @aside age = NamedArray(_.age, _.name)
    @rtransform :Age_relative_to_Sally = :age - age["Sally"]
end
But of course changing the NamedArray won't change the column.
@chain df begin
    @aside age = NamedArray(_.age, _.name)
    @rtransform :Age_relative_to_Sally = :age - age["Sally"]
end
This still has some performance issues, since age["Sally"] is called every time. Better to use @transform instead of @rtransform and broadcast.
Ok. @transform :Age_relative_to_Sally = :age .- age["Sally"]
Suppose there are a few columns you would like to attach the index to:
@aside age = NamedArray( _.age, _.name )
@aside height = NamedArray( _.height, _.name )
@aside weight = NamedArray(_.weight, _.name )
Could this be condensed to something like this?
addindex(df, :name, to = [:age, :height, :weight])
And could the NamedArray functionality be merged with the DataFrame column, such that you can alter cells as well as reference them? E.g.
@transform :age["Sally"] = 30
Something I just realised: DataFrame columns can be defined as NamedArrays, and cell values referenced within a @transform statement. A cell value can't be altered within a @transform block but can be altered as below.
using DataFrames, DataFramesMeta, NamedArrays

df = DataFrame(T = ["1M","2M","3M","6M"], x = [1,2,3,4], y = [10,11,12,13])
@chain df begin
    @DataFramesMeta.transform begin
        :x = NamedArray(:x, :T)
        :y = NamedArray(:y, :T)
    end
    @DataFramesMeta.transform :W = :x["1M"]
    (
        _.x["1M"] = 100;
        _
    )
end
https://discourse.julialang.org/t/why-is-it-so-complicated-to-access-a-row-in-a-dataframe/103162/10 - I find this solution very elegant, although I agree that it is problematic in the case of multiple rows.
What we could consider doing is to add syntax like:
df[:name => "Sally", :age => 42].children
with the requirement that the condition must identify a unique row. The only drawback would be that it would be an O(n) operation, but maybe this is not the end of the world, as it would be a convenience function?
As indicated in the post above, "The biggest issue is that the condition you might want to use could return exactly one row, or multiple rows (where 0 rows is a special case of multiple)".
Although I don't think this is really in line with the philosophy of DataFrames.jl or its internal implementation, one feature I really like in Pandas is MultiIndex (https://pandas.pydata.org/docs/user_guide/advanced.html). Even if they're not obvious to use, they make the DataFrame much more readable and could help solve this kind of problem.
I had some bad experiences with pandas.MultiIndex. Would something as flexible as Boost.MultiIndex be more desirable?
Having thought about it:
- If one wants to select multiple rows, then GroupedDataFrame gives all the functionality one might want.
- So essentially we are looking for a solution for when the user expects exactly one row. We just need to settle on the syntax.
The syntax (note a slight change, but this is a minor thing):
df[(:name => "Sally", :age => 42), :]
is tempting, but in the past @nalimilan and I wanted to avoid adding too much functionality to df indexing. We would need to think about it. The second downside, as was commented, is that the operation would be O(n).
The alternative would be to define something like:
l = lookup(df, :name, :age)
which would be a wrapper around df (like GroupedDataFrame), call it RowLookup tentatively, that would assume a unique-key situation.
Then one could write l["Sally", 42] to get the desired row. The additional benefit of this approach is that l would probably do some pre-processing, so that the lookup operation later would be O(1).
If a single lookup were required, then one could of course write lookup(df, :name, :age)["Sally", 42] (but the whole design would be around ensuring fast repeated lookup).
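A minimal sketch of what such a wrapper could look like (all names hypothetical; a real version would be more careful about key and index types):
using DataFrames

struct Lookup{T<:AbstractDataFrame}
    df::T
    index::Dict{Any,Int}   # key tuple => row number, built once up front
end

function lookup(df::AbstractDataFrame, cols...)
    index = Dict{Any,Int}()
    for (i, r) in enumerate(eachrow(df))
        key = Tuple(r[c] for c in cols)
        haskey(index, key) && throw(ArgumentError("key $key is not unique"))
        index[key] = i
    end
    return Lookup(df, index)
end

# O(1) per lookup after the O(n) preprocessing above
Base.getindex(l::Lookup, key...) = l.df[l.index[key], :]

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3, 5, 2])
l = lookup(df, :name, :age)
l["Sally", 42.0].children   # 5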
Is the problem here the syntax or the speed? If it's the former, I'm not sure if there's any benefit from notation like this
df[(:name => "Sally", :age => 42), :]
lookup(df, :name, :age)["Sally", 42]
relative to
df[df.name .== "Sally" .&& df.age .== 42, :]
We'd avoid writing df. repeatedly. But, apart from that, it'd be quite limited, since it could only handle conjunctions (".&&"), and it'd add the problem that new users could expect the functionality to behave as ".||".
I'm usually against adding new notation, especially if it creates additional ways of doing the same thing while having approximately the same readability/number of characters.
Maybe it's a problem more related to documentation? In the indexing documentation, I noticed that the explanations for column selectors and row indexing are intertwined. Moreover, there's only one example with the broadcasting notation, and it's at the end of the section.
If the problem is speed, then I agree that having an optimized function like "lookup" could be beneficial.
My proposal with the lookup wrapper is related to speed and correctness (and indirectly a bit to syntax). The point is that df[df.name .== "Sally" .&& df.age .== 42, :] is slow, as was noted. Given this we have two calls for action:
- Improve documentation (this is clearly doable, and I can add more examples).
- Decide if we want the lookup variant for speed and correctness. Note the "correctness" aspect: lookup would make sure that the row is unique and error otherwise (so users would avoid a common error of having data with duplicates but thinking that they generate a unique row).
So the question is whether we want the addition of lookup or not.
If it's for speed and correctness, I like the idea. Maybe we could think more about the API? The API and name of lookup don't convince me.
I'm thinking of two possibilities. If it emulates filter (in the sense of a function returning a data frame), something along the lines of:
rowsubset(df, :id => 10, :name => "John")
The name makes it clear that it should return a row and that the API is similar to subset. Regardless of the internals, it makes it clear that it emulates the subset API.
If the intention is to be used for indexing, maybe the name row would suffice:
df[row(:id => 10, :name => "John"), cols]
BTW, is the use of grouped dfs for selecting multiple rows documented?
is the use of grouped dfs for selecting multiple rows documented?
see https://dataframes.juliadata.org/stable/lib/indexing/#Indexing-GroupedDataFrames
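In short, key-based indexing of a GroupedDataFrame supports several forms:
using DataFrames

df = DataFrame(a = [1, 1, 2], b = [3, 4, 5])
gdf = groupby(df, :a)
gdf[(1,)]            # by Tuple key
gdf[(a = 1,)]        # by NamedTuple key
gdf[Dict(:a => 1)]   # by Dict key
gdf[[(1,), (2,)]]    # several groups at once, returns a GroupedDataFrame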
Maybe we could think more about the API?
This is the point - we need a good API. But your proposals will not work well. We need a structure of the following type (I will be verbose):
- first a lookup object is created (one operation)
- then lookup is performed (second operation)
We need this separation to ensure performance.
For example, with groupby you have:
- create an object: gdf = groupby(df, [:a, :b])
- do a lookup: gdf[(1, 2)] (get the group with value 1 in column :a and value 2 in column :b)
We need the same kind of separation, because for group-by it is exactly the fact that we do groupby(df, [:a, :b]) that ensures that the later lookup is fast.
In database parlance the first operation is making a unique primary key (done once) and the second operation is actual lookup (done many times).
I hope it is clear where the issue is.
Just my two cents as a user: I think that the problem is then more related to the documentation. I'd expect to see the use of groupby for subsetting a dataframe documented here.
In fact, I didn't know about this possibility or the fact that this option is faster (although I rarely think about performance when I'm working with dataframes). And I've been using dataframes for a while. In hindsight, it can be obvious that groupby could be used for subsetting, but this is not immediately obvious and it is easy to discard as it seems a more cumbersome way to subset rows. Anticipating that it's faster also can't be done without knowing the internals.
As for the API, if it needs to respect that form, I'm not sure how useful it is. It seems that the user would only avoid using only or [1], it requires that you're completely sure that a row is returned, it requires knowing a new function, etc. I'd rather improve the documentation and show how general groupby is.
Again, this is just my perspective as a user; maybe some people would find this addition critical. I always prefer to avoid new syntax if it's possible to achieve the same result with what we have.
One more comment about the documentation. The section on the API, and in particular the part you mention, has always been confusing to me. Some of the reasons:
- Since it's not part of the "User Guide", it's not the first place where I search for a technique.
- There are so many cases that I always end up confused. Maybe there could be a way to distinguish between what's essential/recommended and what's secondary? For instance, I always get confused by the difference in use of .= relative to native Julia, and its relation with ! and :. If you look, there are countless bullets indicating what each case implies (I know that it was necessary to accommodate all cases).
- This is especially important since the syntax deviates from base Julia. For instance, I'm still not sure why df[!,col] .= 1 is slower than df[:,col] .= 1, as I would expect that the former is a view and performs an in-place operation. The explanation is not important; I'm just using it as an example that there's so much information that I took this as a fact and moved on.
I always prefer to avoid new syntax if it's possible to achieve the same result with what we have.
I agree, and that is why we have not added this functionality yet. However, this issue is meant to discuss this possibility and make a decision.
The section API and in particular the part you mention has been always confusing to me.
This section is "specification documentation", i.e. it is meant for someone who wants a definitive and precise definition of how things work. Such documentation is essential to have, although it is probably not very useful for most users.
I agree that improvements in the "guide" section of the documentation are welcome. I will make them based on your comments at some point.
I always get confused by the difference in use of .= relative to native Julia, and its relation with ! and :.
The simple rule is: : works the same as Base Julia; ! is for cases when you want something other than what Base Julia allows for.
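A small example of the difference with broadcasted assignment (assuming an Int column):
using DataFrames

df = DataFrame(x = [1, 2, 3])
df[:, :x] .= 10    # Base-like: broadcasts into the existing vector in place; eltype stays Int
df[!, :x] .= 1.5   # replaces :x with a freshly allocated column; eltype becomes Float64
eltype(df.x)       # Float64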
A small question about speed, because I kept thinking about the use of groupby for looking up values. Testing that option, it always comes out the slowest.
using DataFrames, BenchmarkTools
nr_obs = 1_000_000
dff = DataFrame(x = rand(nr_obs), y = repeat([1,2,3,4], Int(nr_obs/4)), z = repeat(["a","b"], Int(nr_obs/2)))
groupby(dff, [:z,:y])
@btime groupby(dff, [:z,:y])[("a", 1)] # 26.121 ms (221 allocations: 40.81 MiB)
@btime filter([:z,:y] => ((z,y) -> (z == "a" && y == 1)), dff) # 8.552 ms (59 allocations: 7.76 MiB)
@btime subset(dff, :z => ByRow(==("a")), :y => ByRow(==(1))) # 9.255 ms (398 allocations: 9.68 MiB)
@btime dff[dff.z .== "a" .&& dff.y .== 1,:] # 8.761 ms (58 allocations: 7.76 MiB)
You are computing both the lookup and the creation time. The benchmark should be:
julia> gdf = groupby(dff, [:z,:y]);
julia> @btime gdf[("a", 1)];
302.100 μs (11 allocations: 1.91 MiB)
julia> @btime filter([:z,:y] => ((z,y) -> (z == "a" && y == 1)), dff);
5.160 ms (32 allocations: 7.75 MiB)
and you see that indexing of gdf is much faster (after the initial cost of instantiating the object, which you included in your timing).
Call the function indexby then (rather than lookup) and maintain a similar API to groupby.
groupby will assume multiple rows per group. indexby will assume one row per index.
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
idf = indexby(df, :name)
sally_age = idf["Sally"].age
(I first considered just overloading groupby with groupby(df, :name; unique=true), but then I think the function name doesn't make much sense.)
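This is essentially the RowLookup sketch from earlier in the thread under a new name - a minimal version for concreteness (all names hypothetical):
using DataFrames

struct IndexedDataFrame{G<:GroupedDataFrame}
    gdf::G
end

function indexby(df::AbstractDataFrame, cols)
    gdf = groupby(df, cols)
    # exactly one row per index key, otherwise error
    length(gdf) == nrow(df) || throw(ArgumentError("index columns are not unique"))
    return IndexedDataFrame(gdf)
end

# return the single DataFrameRow for a key
Base.getindex(idf::IndexedDataFrame, key...) = only(idf.gdf[key])

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3, 5, 2])
idf = indexby(df, :name)
sally_age = idf["Sally"].age   # 42.0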
In fact, I didn't know about this possibility or the fact that this option is faster (although I rarely think about performance when I'm working with dataframes). And I've been using dataframes for a while. In hindsight, it can be obvious that groupby could be used for subsetting, but this is not immediately obvious and it is easy to discard as it seems a more cumbersome way to subset rows.
I agree. I have avoided grouped data frames until today. After today, I still find them confusing. This is a lot of parentheses/brackets to index two keys: (link)
iris_gdf = groupby(iris, :Species)
iris_gdf[[("Iris-virginica",), ("Iris-setosa",)]]
Why does this error?
julia> iris_gdf["Iris-virginica"]
ERROR: ArgumentError: invalid index: "Iris-virginica" of type String
It is the API proposed for lookup/indexby:
Then one could write l["Sally", 42] to get the desired row.
It should also work for GroupedDataFrame (just returning a different object), so the two functions operate consistently. Requiring Tuples or Dicts is pretty cumbersome; even Pairs don't work for indexing unless wrapped in a Dict: iris_gdf[Dict(:Species => "Iris-virginica")].
Why does this error?
See the first post of this discussion. That was the original proposal. The idea is that you need a tuple, such as:
iris_gdf[("Iris-virginica",)]
I misinterpreted the whole discussion, I think. Given the benchmarks I showed, I thought the discussion was about performing fast indexing, but it's more related to subsetting a dataframe assuming that you'll perform multiple operations on it.
Maybe there should be a clarification that if you just want to index, the approach with groupby is not ideal.
I'm still unsure about the benefits of adding this functionality, even more now that it's not necessarily more efficient.
In any case, I think the name proposed by @nathanrboyer makes its purpose clear and shows that the API follows groupby.
Why does this error?
julia> iris_gdf["Iris-virginica"]
This is an issue with the legacy groupby API that was inherited from its initial design. The point is that gdf[1] picks the first group of gdf (not the group whose key value is 1). This is unfortunate, as picking a group by its number is rarely useful. It would be much better if gdf did not allow indexing by group number. Then we could allow what you ask for.
Because of this I proposed the iris_gdf("Iris-virginica") syntax as an alternative; it would work exactly the way you want, but the proposal did not get much support back then. Maybe we could reconsider it, and then we would have:
gdf = groupby(df, :x)
idf = indexby(df, :x) # this will error if grouping by :x creates duplicates
gdf("v") # get a data frame that corresponds to group with key value "v"
idf("v") # get a unique row that corresponds to group with key value "v"
This would be consistent. The only drawback is that we use ( ) brackets and not [ ] brackets.
It would be much better if gdf did not allow indexing by a group number. Then we could allow what you ask for.
DataFrames 2.0? Could you still iterate over a GroupedDataFrame if you disallow indexing into it by Integer?
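For reference, iteration is defined through Base.iterate rather than integer getindex, so in principle it could keep working even without integer indexing:
using DataFrames

df = DataFrame(a = [1, 1, 2], b = [3, 4, 5])
for g in groupby(df, :a)   # uses the iteration protocol, not getindex(gdf, ::Integer)
    @show g.b
end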
This would be consistent. The only drawback is that we use ( ) brackets and not [ ] brackets.
Square brackets would be best, but I like gdf("v") better than gdf[("v",)].