kaskada icon indicating copy to clipboard operation
kaskada copied to clipboard

feat: Eliminate entity grouping

Open kerinin opened this issue 1 year ago • 8 comments

Summary See discussion in https://github.com/kaskada-ai/kaskada/issues/268

kerinin avatar Apr 28 '23 16:04 kerinin

This is likely a good idea, but some things to note:

  1. We likely want to think about a way to make the implicit join more explicit. Otherwise this will increase the rate of "confusing implicit joins".
  2. I think I took a look at this at one point and there was some reason why it wouldn't be trivial to do this. It may just be that there are a lot of places that are affected (since the "entity grouping" is used to determine whether things need to be repartitioned), but it may be that there are deeper issues to iron out. This may be easier to address once we have a logical plan instead of the existing DFG.

bjchambers avatar Apr 28 '23 17:04 bjchambers

A quick and dirty way of accomplishing this would be to use the key's type (i.e. string) as the grouping ID.

kerinin avatar Apr 28 '23 19:04 kerinin

If we want to go the "quick and dirty way", I can make this change in wren for now:

Stop asking for grouping_id on the external API. Always set the internal (sparrow-api) grouping_id to the entityKey type at query time.

epinzur avatar May 08 '23 15:05 epinzur

I support this direction. Any downvotes?

kerinin avatar May 08 '23 17:05 kerinin

@bjchambers any objection to using the "quick and dirty" solution for now?

epinzur avatar May 10 '23 05:05 epinzur

My concern here is that we have multiple times had issues related to the implicit join that can occur. This change will increase the occurrences of implicit joins, and likely increase the occurrences of that problem. I think the current compilation will also cause us to join everything with the same grouping. This could actually create problems.

Consider:

  1. Two different entity types (user and store), but both keyed by i64. With this change, they will now be the same entity type.
  2. A query which does a lookup between the two. Before, these were seen as two different entities, so we'd run two separate operations. After this change, there is a chance that we will treat them as a single entity type, thus merging them.
  3. Normally, this wouldn't be a problem due to the way "newness" is handled, but there are aggregations that would allow an aggregation over the "user" entity to count the occurrences of the store as well.

One place that we've seen this occur is something like ticks, which enumerate the entities. With this change, it will enumerate all users and store IDs rather than just all user IDs.

bjchambers avatar May 10 '23 15:05 bjchambers

seems like we should NOT do the "quick and dirty" solution then.

epinzur avatar May 15 '23 20:05 epinzur

The thing is that grouping doesn't prevent, or in my experience meaningfully reduce, the incidence of implicit joins, however it does increase the complexity of the 'user model'.

Adjacent query languages (SQL, CEP, Cascalog, etc) don't provide typed groupings, so users are unlikely to expect the language to provide this type of "safety", and if they experience case (1) I think it's unlikely they'll perceive the resulting empty-join as a failure of Kaskada or the query language. OTOH, if they are trying to work across tables with the "same entity" but without the appropriate configuration I think they're likely to perceive grouping as a confusing feature of Kaskada's design, as a result of its novelty within data query languages.

I think the possibility of implicit joins is orthogonal to grouping - we should solve this on its own terms, not with grouping, since grouping doesn't actually prevent implicit joins.

I also feel that "typed" entities is a layer of complexity in the user model that doesn't pay for itself.

If there are errors in the implementation of joins across tables with the same grouping or in how ticks operate that should be addressed separately, since we currently allow configuration of the table's grouping - the proposal here is simply to use the existing system mechanisms - nothing proposed here changes the current system's behavior.

kerinin avatar May 15 '23 21:05 kerinin