dm icon indicating copy to clipboard operation
dm copied to clipboard

Self-join filtered table

Open morgan-dgk opened this issue 3 years ago • 3 comments

Hello, I apologise if this is an inappropriate place to ask this question; I am very new to the open source community.

I am wondering whether it is possible to self-join a filtered zoomed table to the unfiltered zoomed table. So far, the only way I have been able to achieve a similar result is by inserting a new table into the dm and defining a FK relation between the new filtered table and the original unfiltered table.

Below is an example of this approach using data from the nycflights13 package

dm_weather <- dm(nycflights13::weather) %>%
  dm_rename_tbl(weather = `nycflights13::weather`)
      
dm_weather <- dm_weather %>% 
    dm_zoom_to(weather) %>% 
    count(origin, year, month, day, hour) %>% 
    filter(n > 1) %>% 
    dm_insert_zoomed("dupe_weather_tups")

dm_weather  %>% 
  dm_add_fk(weather, c('origin', 'year', "month", "day", "hour"),
            dupe_weather_tups,c('origin', 'year', "month", "day", "hour")) %>% 
  dm_join_to_tbl(weather, dupe_weather_tups, semi_join)

Using dplyr, I can achieve an equivalent result with

df_weather <- nycflights13::weather

df_weather %>% 
  count(origin, year, month, day, hour) %>% 
  filter(n > 1) %>% 
  dplyr::semi_join(x = df_weather, y = .)

morgan-dgk avatar Nov 15 '21 05:11 morgan-dgk

Thanks. Off the top of my head inserting a copy of the table looks right. Should a relationship between the original and the copy be added automatically? Is this operation frequent enough that we should think about defining a macro?

krlmlr avatar Nov 15 '21 07:11 krlmlr

Thanks for your reply.

Having to insert a new zoomed table and define a FK relation with the original table just seems a bit cumbersome compared to an SQL style self join using table aliases. Another solution I found which seems more elegant to me is below. I don't know if this could be somehow incorporated into a macro for self-joins? I am not an expert in SQL, but I believe self-joins are a reasonably common operation.

dm_weather %>%
  dm_zoom_to(weather) %>% 
  count(origin, year, month, day, hour) %>%
  filter(n > 1) %>% 
  pull_tbl() %>% 
  dplyr::semi_join(x = dm_weather$weather, y = ., copy = TRUE)

morgan-dgk avatar Nov 15 '21 23:11 morgan-dgk

Thanks. This kind of self-join could be substituted with add_count(), essentially a window function:

dm_weather$weather %>%
  add_count(origin, year, month, day, hour) %>%
  filter(n > 1)

I agree there might be other cases where this is not as straightforward, so that a self-join might make sense here. It's not a macro, rather the support for an edge case.

The other problem I see is that we don't support %>% xxx_join(x = ...) on dm objects, because the method is then dispatched on whatever is passed to the x argument. Do we need xxx_join_rev(y, x, ...) that dispatches on y and reverses the argument order?

Finally, self-joins are also useful when the data represents a tree-like structure (parent-child relationships). Self-joins on the same column should explicitly use the by argument.

krlmlr avatar Nov 16 '21 05:11 krlmlr