DataFrames.jl
Make row lookup easier
This is a speculative idea. Maybe we could define GroupedDataFrame to be callable like this:
(gdf::GroupedDataFrame)(idxs...) = gdf[idxs]
In this way, instead of writing gdf[("val",)], users could write gdf("val").
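A runnable sketch of the proposal (the toy data frame is made up; the method definition is exactly the one above):
using DataFrames

# proposed convenience (not current API): forward call syntax to key indexing
(gdf::GroupedDataFrame)(idxs...) = gdf[idxs]

df = DataFrame(x = ["val", "other"], y = [1, 2])
gdf = groupby(df, :x)
gdf[("val",)]   # current syntax: index by key tuple
gdf("val")      # proposed shorthand, same SubDataFrame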
@nalimilan, @pdeffebach - what do you think?
Ref: https://discourse.julialang.org/t/any-plan-for-functionality-like-pandas-loc/81134
I'm not a fan of this. You only save 3 characters and the syntax is harder to understand.
It wouldn't really improve the workflow discussed in the Discourse thread, would it?
It would not - that is why we did not implement it originally. It is just non-conflicting and would make newcomers' lives slightly easier. I wanted to discuss it to make sure we are OK with the current design.
I was thinking of the case of a unique index. I could convert the DataFrame to a dictionary of DataFrame rows:
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
d = Dict( df.name .=> eachrow(df) )
This then allows d["Sally"].age, but it loses all the other DataFrame functionality.
There are so many times I want to do several lookups on different columns but using the same index column.
Example
I have a table of FX Forward rates I'm using to bootstrap discount factors. Columns are :Currency :Tenor :Rate.
The bootstrap is done separately per currency so first I'd like to @groupby :Currency.
Then @setindex :Tenor.
Possible Tenor values are: ON TN SPOT 1W 2W 1M 3M 6M 1Y 2Y
Using IndexValue:Column notation, the formulas are
:Fwd = SPOT:Rate + :Rate
SPOT:Fwd = :SPOT:Rate
TN:Fwd = SPOT:Rate - TN:Rate
ON:Fwd = TN:Rate - ON:Rate
TodayRate = ON:Fwd
:Discount = :Fwd / TodayRate * USDiscount(:Tenor)
Currently I'm calling a function for each group. The function then converts each sub-DataFrame to named arrays. This works fine; it would just be nice to do it all in a DataFrame.
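For reference, a minimal sketch of that per-group workaround using plain DataFrames.jl (column names from the example above; the rates are made up):
using DataFrames

fx = DataFrame(Currency = ["USD", "USD", "USD"],
               Tenor    = ["SPOT", "1M", "3M"],
               Rate     = [1.10, 0.002, 0.006])

# per currency, find the SPOT row by position and combine it with every row's rate
transform(groupby(fx, :Currency),
          [:Tenor, :Rate] =>
              ((t, r) -> r .+ r[findfirst(==("SPOT"), t)]) => :Fwd)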
I am moving it to the 1.5 release. Essentially the requirement is for a Dict-like object allowing an easy row lookup. It should be a wrapper around GroupedDataFrame allowing easy indexing on group keys.
So this would be something like (in general probably this could be made more efficient, but this shows the idea - please look at the API not at the implementation):
struct RowLookup{T<:GroupedDataFrame}
    gdf::T
end

function rowlookup(df::AbstractDataFrame, cols)
    gdf = groupby(df, cols)
    # the key columns must identify each row uniquely
    length(gdf) == nrow(df) || throw(ArgumentError("rows are not unique"))
    return RowLookup(gdf)
end

Base.getindex(rl::RowLookup, idxs...) = only(rl.gdf[idxs])
Base.parent(rl::RowLookup) = parent(rl.gdf)
This would allow for easy indexing like (referring to the example above):
d1 = rowlookup(df, :name)
d1["Sally"].age
d2 = rowlookup(df, [:name, :age])
d2["Sally", 42].children
Do we think it is worth adding something like this?
If this is for a unique index, would it be better to return a DataFrameRow rather than a 1-row DataFrame? So replace [1:1, :] with [1, :]. Then d1["Sally"].age returns a number instead of a 1-element vector. I think generally the number would be preferred.
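For comparison, what this looks like with today's GroupedDataFrame API (data from the example above):
using DataFrames

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3, 5, 2])
gdf = groupby(df, :name)
gdf[("Sally",)].age         # 1-element vector from the 1-row SubDataFrame
only(gdf[("Sally",)]).age   # 42.0, a scalar from the DataFrameRow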
It would be nice to have the rowlookup functionality but keep the object as a DataFrame, so you can add columns and do other stuff to it. With this implementation, you can't for example do d1.ageBy2 = 2 .* d1.age, because d1 wraps a GroupedDataFrame rather than a DataFrame. I guess this would make the implementation much more difficult, as you couldn't use the pre-existing GroupedDataFrame functionality.
So replace [1:1, :] with [1,:]
Yes - it was a typo; I meant [1, :]. Actually it should be only, as in the code above - I will fix :).
It would be nice to have the rowlookup functionality but keep the object as a DataFrame
This cannot be done, as indexing a DataFrame has different semantics.
What we could consider doing is to add syntax like:
df[:name => "Sally", :age => 42].children
with the requirement that the condition must identify a unique row. The only drawback would be that it would be an O(n) operation, but maybe this is not the end of the world, as it would be a convenience function?
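For illustration, what such a call would do under the hood - a sketch with a hypothetical findrow helper (an O(n) scan plus a uniqueness check via only):
using DataFrames

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3, 5, 2])

# hypothetical helper: only() errors unless exactly one row matches
findrow(df, prs::Pair...) =
    df[only(findall(r -> all(r[c] == v for (c, v) in prs), eachrow(df))), :]

findrow(df, :name => "Sally", :age => 42.0).children   # 5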
@nalimilan - what do you think?
It seems problematic to assume that df[:name => "Sally", :age => 42].children should reference a single row, as nothing in the syntax indicates that. I would rather expect it to be equivalent to subset(df, :name => ByRow(==("Sally")), :age => ByRow(==(42))).children.
I feel like we're not targeting the real problem here. Maybe the problem is just that when using @rtransform, one needs to repeat the data frame name. Using @transform, this works and is not too ugly IMO:
using DataFrames, DataFramesMeta

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
@chain df begin
    @transform :Age_relative_to_Sally = :age .- only(:age[:name .== "Sally"])
end
Or even without only (broadcasting will fail if more than one row is selected):
@chain df begin
    @transform :Age_relative_to_Sally = :age .- :age[:name .== "Sally"]
end
Maybe what's missing is a way to do the same using @rtransform? For example, wrapping :age[:name .== "Sally"] within a special marker to indicate that it should be evaluated on whole vectors rather than row-by-row? It could also be useful for things like :x - mean(:x).
CC @pdeffebach
I can't think of an easy way to mark just some columns as "whole column" references and others as not. The current implementation just takes the anonymous function created by the parsing and wraps it in ByRow.
We could have something that expands to :x - mean(df.x), but that results in type instabilities and would slow down performance.
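Roughly why: a closure over df sees a column whose concrete type the compiler cannot infer, whereas the function-barrier pattern (what ByRow relies on) specializes once the column is passed as an argument. A toy illustration (function names made up):
using DataFrames, Statistics

df = DataFrame(x = rand(10))

# closure over df: the compiler only knows df.x isa AbstractVector, so this is not inferable
demeaned_unstable(df) = map(v -> v - mean(df.x), df.x)

# function barrier: once the column is an argument, the code specializes on its element type
demeaned_barrier(x::AbstractVector) = x .- mean(x)
demeaned_barrier(df.x)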
Aren't there array types that allow for something like this?
@rtransform df :Age_relative_to_Sally = :age .- :age."Sally"
That seems pretty clean to me.
That's a really nice syntax if it's possible.
With NamedArrays.jl you can do
using DataFrames, DataFramesMeta, NamedArrays

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
@chain df begin
    @aside age = NamedArray(_.age, _.name)
    @rtransform :Age_relative_to_Sally = :age - age["Sally"]
end
But of course changing the NamedArray won't change the column.
@chain df begin
    @aside age = NamedArray(_.age, _.name)
    @rtransform :Age_relative_to_Sally = :age - age["Sally"]
end
This still has some performance issues, since age["Sally"] is called every time. Better to use @transform instead of @rtransform and broadcast.
Ok. @transform :Age_relative_to_Sally = :age .- age["Sally"]
Suppose there are a few columns you would like to attach the index to:
@aside age = NamedArray( _.age, _.name )
@aside height = NamedArray( _.height, _.name )
@aside weight = NamedArray(_.weight, _.name )
Could this be condensed to something like this?
addindex(df, :name, to = [:age, :height, :weight])
And could the NamedArray functionality be merged with the DataFrame column, such that you can alter cells as well as reference them? E.g.
@transform :age["Sally"] = 30
Something I just realised: DataFrame columns can be defined as NamedArrays, and cell values referenced within a @transform statement. A cell value can't be altered within a @transform block but can be altered as below.
using DataFrames, DataFramesMeta, NamedArrays

df = DataFrame(T = ["1M","2M","3M","6M"], x = [1,2,3,4], y = [10,11,12,13])
@chain df begin
    @DataFramesMeta.transform begin
        :x = NamedArray(:x, :T)
        :y = NamedArray(:y, :T)
    end
    @DataFramesMeta.transform :W = :x["1M"]
    (
        _.x["1M"] = 100;
        _
    )
end
https://discourse.julialang.org/t/why-is-it-so-complicated-to-access-a-row-in-a-dataframe/103162/10 - I find this solution very elegant, although I agree that it is problematic in the case of multiple rows.
What we could consider doing is to add syntax like:
df[:name => "Sally", :age => 42].children
with the requirement that the condition must identify a unique row. The only drawback would be that it would be an O(n) operation, but maybe this is not the end of the world, as it would be a convenience function?
As indicated in the post above, "The biggest issue is that the condition you might want to use could return exactly one row, or multiple rows (where 0 rows is a special case of multiple)".
Although I don't think this is really in line with the philosophy of DataFrames.jl or its internal implementation, one feature I really like in Pandas is MultiIndex (https://pandas.pydata.org/docs/user_guide/advanced.html). Even if they're not obvious to use, they make the DataFrame much more readable and could help solve this kind of problem.
I had some bad experiences with pandas.MultiIndex. Would something as flexible as Boost.MultiIndex be more desirable?
Having thought about it:
- If one wants to select multiple rows, then GroupedDataFrame gives all the functionality one might want.
- So essentially we are looking for a solution for when the user expects exactly one row. We just need to settle on the syntax.
The syntax (note a slight change, but this is a minor thing):
df[(:name => "Sally", :age => 42), :]
is tempting, but in the past @nalimilan and I wanted to avoid adding too much functionality to df indexing. We would need to think about it. The second downside, as was commented, is that the operation would be O(n).
The alternative would be to define something like:
l = lookup(df, :name, :age)
which would be a wrapper around df (like GroupedDataFrame), call it RowLookup tentatively, that would assume a unique-key situation.
Then one could write l["Sally", 42] to get the desired row. The additional benefit of this approach is that l would probably do some pre-processing, so that the lookup operation later would be O(1).
If a single lookup were required, then one could of course write lookup(df, :name, :age)["Sally", 42] (but the whole design would be around ensuring fast repeated lookup).
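A minimal sketch of what such a wrapper could look like (all names hypothetical; a real version would be more careful about key and index types):
using DataFrames

struct Lookup{T<:AbstractDataFrame}
    df::T
    index::Dict{Any,Int}   # key tuple => row number, built once up front
end

function lookup(df::AbstractDataFrame, cols...)
    index = Dict{Any,Int}()
    for (i, r) in enumerate(eachrow(df))
        key = Tuple(r[c] for c in cols)
        haskey(index, key) && throw(ArgumentError("key $key is not unique"))
        index[key] = i
    end
    return Lookup(df, index)
end

# O(1) per lookup after the O(n) preprocessing above
Base.getindex(l::Lookup, key...) = l.df[l.index[key], :]

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3, 5, 2])
l = lookup(df, :name, :age)
l["Sally", 42.0].children   # 5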
Is the problem here the syntax or the speed? If it's the former, I'm not sure if there's any benefit from notation like this
df[(:name => "Sally", :age => 42), :]
lookup(df, :name, :age)["Sally", 42]
relative to
df[df.name .== "Sally" .&& df.age .== 42, :]
We'd avoid writing df. repeatedly. But, apart from that, it'd be quite limited, since it could only handle conjunctions (".&&"), and it'd add the problem that new users could expect the functionality to behave as ".||".
I'm usually against adding new notation, especially if it creates additional ways of doing the same thing while having approximately the same readability/number of characters.
Maybe it's a problem more related to documentation? In the indexing documentation, I noticed that the explanations for column selectors and row indexing are intertwined. Moreover, there's only one example with the broadcasting notation, and it's at the end of the section.
If the problem is speed, then I agree that having an optimized function like "lookup" could be beneficial.
My proposal with the lookup wrapper is related to speed and correctness (and indirectly a bit to syntax). The point is that df[df.name .== "Sally" .&& df.age .== 42, :] is slow, as was noted. Given this we have two calls for action:
- Improve documentation (this is clearly doable, and I can add more examples).
- Decide if we want the lookup variant for speed and correctness. Note the "correctness" aspect: lookup would make sure that the row is unique and error otherwise (so users would avoid a common error of having data with duplicates but thinking that they generate a unique row).
So the question is whether we want the addition of lookup or not.
If it's for speed and correctness, I like the idea. Maybe we could think more about the API? The API and name of lookup don't convince me.
I'm thinking of two possibilities. If it emulates filter (in the sense of a function returning a data frame), something along the lines of:
rowsubset(df, :id => 10, :name => "John")
The name makes it clear that it should return a row and that the API is similar to subset. Regardless of the internals, it makes it clear that it emulates the subset API.
If the intention is to be used for indexing, maybe the name row would suffice:
df[row(:id => 10, :name => "John"), cols]
BTW, is the use of grouped dfs for selecting multiple rows documented?
is the use of grouped dfs for selecting multiple rows documented?
see https://dataframes.juliadata.org/stable/lib/indexing/#Indexing-GroupedDataFrames
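In short, key-based indexing of a GroupedDataFrame supports several forms:
using DataFrames

df = DataFrame(a = [1, 1, 2], b = [3, 4, 5])
gdf = groupby(df, :a)
gdf[(1,)]            # by Tuple key
gdf[(a = 1,)]        # by NamedTuple key
gdf[Dict(:a => 1)]   # by Dict key
gdf[[(1,), (2,)]]    # several groups at once, returns a GroupedDataFrame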
Maybe we could think more about the API?
This is the point - we need a good API. But your proposals will not work well. We need a structure of the following type (I will be verbose):
- first a lookup object is created (one operation)
- then lookup is performed (second operation)
We need this separation to ensure performance.
For example, with groupby you have:
- create an object: gdf = groupby(df, [:a, :b])
- do a lookup: gdf[(1, 2)] (get the group with value 1 in column :a and value 2 in column :b)
We need the same kind of separation, because for group-by it is exactly the fact that we do groupby(df, [:a, :b]) that ensures that the later lookup is fast.
In database parlance the first operation is making a unique primary key (done once) and the second operation is actual lookup (done many times).
I hope it is clear where the issue is.
Just my two cents as a user: I think that the problem is then more related to the documentation. I'd expect to see the use of groupby for subsetting a dataframe documented here.
In fact, I didn't know about this possibility or the fact that this option is faster (although I rarely think about performance when I'm working with dataframes). And I've been using dataframes for a while. In hindsight, it can be obvious that groupby could be used for subsetting, but this is not immediately obvious and it is easy to discard as it seems a more cumbersome way to subset rows. Anticipating that it's faster also can't be done without knowing the internals.
As for the API, if it needs to respect that form, I'm not sure how useful it is. It seems that the user would only avoid using only or [1], it requires that you're completely sure that a row is returned, it requires knowing a new function, etc. I'd rather improve the documentation and show how general groupby is.
Again, this is just my perspective as a user; maybe some people would find this addition critical. I always prefer to avoid new syntax if it's possible to achieve the same result with what we have.
One more comment about the documentation. The section on the API, and in particular the part you mention, has always been confusing to me. Some of the reasons:
- Since it's not part of the "User Guide", it's not the first place where I search for a technique.
- There are so many cases that I always end up confused. Maybe there could be a way to distinguish between what's essential/recommended and what's secondary? For instance, I always get confused by the difference in use of .= relative to native Julia, and its relation with ! and :. If you look, there are countless bullets indicating what each case implies (I know that it was necessary to accommodate all cases).
- This is especially important since the syntax deviates from base Julia. For instance, I'm still not sure why df[!,col] .= 1 is slower than df[:,col] .= 1, as I would expect that the former is a view and performs an in-place operation. The explanation is not important; I'm just using it as an example that there's so much information that I took this as a fact and moved on.
I always prefer to avoid new syntax if it's possible to achieve the same result with what we have.
I agree, and that is why we have not added this functionality yet. However, this issue is meant to discuss this possibility and make a decision.
The section API and in particular the part you mention has been always confusing to me.
This section is "specification documentation", i.e. it is meant for someone who wants a definitive and precise definition of how things work. Such documentation is essential to have, although it is probably not very useful for most users.
I agree that improvements in the "guide" section of the documentation are welcome. I will make them based on your comments at some point.
I always get confused by the difference in use of .= relative to native Julia, and its relation with ! and :.
The simple rule is: : works the same as Base Julia; ! is for cases when you want something other than what Base Julia allows for.
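A small example of the difference with broadcasted assignment (assuming an Int column):
using DataFrames

df = DataFrame(x = [1, 2, 3])
df[:, :x] .= 10    # Base-like: broadcasts into the existing vector in place; eltype stays Int
df[!, :x] .= 1.5   # replaces :x with a freshly allocated column; eltype becomes Float64
eltype(df.x)       # Float64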
A small question about speed, because I kept thinking about the use of groupby for looking up values. Testing that option, it always comes out the slowest.
using DataFrames, BenchmarkTools
nr_obs = 1_000_000
dff = DataFrame(x = rand(nr_obs), y = repeat([1,2,3,4], Int(nr_obs/4)), z = repeat(["a","b"], Int(nr_obs/2)))
groupby(dff, [:z,:y])
@btime groupby(dff, [:z,:y])[("a", 1)] # 26.121 ms (221 allocations: 40.81 MiB)
@btime filter([:z,:y] => ((z,y) -> (z == "a" && y == 1)), dff) # 8.552 ms (59 allocations: 7.76 MiB)
@btime subset(dff, :z => ByRow(==("a")), :y => ByRow(==(1))) # 9.255 ms (398 allocations: 9.68 MiB)
@btime dff[dff.z .== "a" .&& dff.y .== 1,:] # 8.761 ms (58 allocations: 7.76 MiB)
You are computing both the lookup and the creation time. The benchmark should be:
julia> gdf = groupby(dff, [:z,:y]);
julia> @btime gdf[("a", 1)];
302.100 μs (11 allocations: 1.91 MiB)
julia> @btime filter([:z,:y] => ((z,y) -> (z == "a" && y == 1)), dff);
5.160 ms (32 allocations: 7.75 MiB)
and you see that indexing of gdf is much faster (after the initial cost of instantiating the object, which you included in your timing).
Call the function indexby then (rather than lookup) and maintain a similar API to groupby.
groupby will assume multiple rows per group. indexby will assume one row per index.
df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
idf = indexby(df, :name)
sally_age = idf["Sally"].age
(I first considered just overloading groupby with groupby(df, :name; unique=true), but then I think the function name doesn't make much sense.)
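This is essentially the RowLookup sketch from earlier in the thread under a new name - a minimal version for concreteness (all names hypothetical):
using DataFrames

struct IndexedDataFrame{G<:GroupedDataFrame}
    gdf::G
end

function indexby(df::AbstractDataFrame, cols)
    gdf = groupby(df, cols)
    # exactly one row per index key, otherwise error
    length(gdf) == nrow(df) || throw(ArgumentError("index columns are not unique"))
    return IndexedDataFrame(gdf)
end

# return the single DataFrameRow for a key
Base.getindex(idf::IndexedDataFrame, key...) = only(idf.gdf[key])

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3, 5, 2])
idf = indexby(df, :name)
sally_age = idf["Sally"].age   # 42.0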
In fact, I didn't know about this possibility or the fact that this option is faster (although I rarely think about performance when I'm working with dataframes). And I've been using dataframes for a while. In hindsight, it can be obvious that groupby could be used for subsetting, but this is not immediately obvious and it is easy to discard as it seems a more cumbersome way to subset rows.
I agree. I have avoided grouped data frames until today. After today, I still find them confusing. This is a lot of parentheses/brackets to index two keys: (link)
iris_gdf = groupby(iris, :Species)
iris_gdf[[("Iris-virginica",), ("Iris-setosa",)]]
Why does this error?
julia> iris_gdf["Iris-virginica"]
ERROR: ArgumentError: invalid index: "Iris-virginica" of type String
It is the API proposed for lookup/indexby:
Then one could write l["Sally", 42] to get the desired row.
It should also work for GroupedDataFrame (just returning a different object), so the two functions operate consistently. Requiring Tuples or Dicts is pretty cumbersome; even Pairs don't work for indexing unless wrapped in a Dict: iris_gdf[Dict(:Species => "Iris-virginica")].
Why does this error?
See the first post of this discussion. That was the original proposal. The idea is that you need a tuple, such as:
iris_gdf[("Iris-virginica",)]
I misinterpreted the whole discussion, I think. Given the benchmarks I showed, I thought the discussion was about performing fast indexing, but it's more related to subsetting a dataframe assuming that you'll perform multiple operations on it.
Maybe there should be a clarification that if you just want to index, the approach with groupby is not ideal.
I'm still unsure about the benefits of adding this functionality, even more now that it's not necessarily more efficient.
In any case, I think the name proposed by @nathanrboyer makes its purpose clear and shows that the API follows groupby.
Why does this error?
julia> iris_gdf["Iris-virginica"]
This is an issue with the legacy groupby API that was inherited from its initial design. The point is that gdf[1] picks the first group of gdf (not the group whose key value is 1). This is unfortunate, as picking a group by its number is rarely useful. It would be much better if gdf did not allow indexing by group number. Then we could allow what you ask for.
Because of this I proposed the iris_gdf("Iris-virginica") syntax as an alternative; it would work exactly the way you want, but the proposal did not get much support back then. Maybe we could reconsider it, and then we would have:
gdf = groupby(df, :x)
idf = indexby(df, :x) # this will error if grouping by :x creates duplicates
gdf("v") # get a data frame that corresponds to group with key value "v"
idf("v") # get a unique row that corresponds to group with key value "v"
This would be consistent. The only drawback is that we use ( ) brackets and not [ ] brackets.
It would be much better if gdf did not allow indexing by a group number. Then we could allow what you ask for.
DataFrames 2.0? Could you still iterate over a GroupedDataFrame if you disallow indexing into it by Integer?
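For reference, iteration is defined through Base.iterate rather than integer getindex, so in principle it could keep working even without integer indexing:
using DataFrames

df = DataFrame(a = [1, 1, 2], b = [3, 4, 5])
for g in groupby(df, :a)   # uses the iteration protocol, not getindex(gdf, ::Integer)
    @show g.b
end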
This would be consistent. The only drawback is that we use ( ) brackets and not [ ] brackets.
Square brackets would be best, but I like gdf("v") better than gdf[("v",)].