dm
dm copied to clipboard
Self-join filtered table
Hello, I apologise if this is an inappropriate place to ask this question; I am very new to the open source community.
I am wondering whether it is possible to self-join a filtered zoomed table to the unfiltered zoomed table. So far, the only way I have been able to achieve a similar result is by inserting a new table into the dm and defining a FK relation between the new filtered table and the original unfiltered table.
Below is an example of this approach using data from the nycflights13 package
dm_weather <- dm(nycflights13::weather) %>%
dm_rename_tbl(weather = `nycflights13::weather`)
dm_weather <- dm_weather %>%
dm_zoom_to(weather) %>%
count(origin, year, month, day, hour) %>%
filter(n > 1) %>%
dm_insert_zoomed("dupe_weather_tups")
dm_weather %>%
dm_add_fk(weather, c('origin', 'year', "month", "day", "hour"),
dupe_weather_tups,c('origin', 'year', "month", "day", "hour")) %>%
dm_join_to_tbl(weather, dupe_weather_tups, semi_join)
Using dplyr, I can achieve an equivalent result with
df_weather <- nycflights13::weather
df_weather %>%
count(origin, year, month, day, hour) %>%
filter(n > 1) %>%
dplyr::semi_join(x = df_weather, y = .)
Thanks. Off the top of my head inserting a copy of the table looks right. Should a relationship between the original and the copy be added automatically? Is this operation frequent enough that we should think about defining a macro?
Thanks for your reply.
Having to insert a new zoomed table and define a FK relation with the original table just seems a bit cumbersome compared to an SQL style self join using table aliases. Another solution I found which seems more elegant to me is below. I don't know if this could be somehow incorporated into a macro for self-joins? I am not an expert in SQL, but I believe self-joins are a reasonably common operation.
dm_weather %>%
dm_zoom_to(weather) %>%
count(origin, year, month, day, hour) %>%
filter(n > 1) %>%
pull_tbl() %>%
dplyr::semi_join(x = dm_weather$weather, y = ., copy = TRUE)
Thanks. This kind of self-join could be substituted with add_count()
, essentially a window function:
dm_weather$weather %>%
add_count(origin, year, month, day, hour) %>%
filter(n > 1)
I agree there might be other cases where this is not as straightforward, so that a self-join might make sense here. It's not a macro, rather the support for an edge case.
The other problem I see is that we don't support %>% xxx_join(x = ...)
on dm objects, because the method is then dispatched on whatever is passed to the x
argument. Do we need xxx_join_rev(y, x, ...)
that dispatches on y
and reverses the argument order?
Finally, self-joins are also useful when the data represents a tree-like structure (parent-child relationships). Self-joins on the same column should explicitly use the by
argument.