Unite "join" equality comparison syntax with expression syntax, informing the user where it's limited
When coming up with the example Zed in https://github.com/brimdata/zed/issues/2715#issuecomment-846649422, I was thrown at first by how the portion that's establishing the pairing of columns that establish a successful join (join on uid=uid ...) currently uses a single = (e.g. as of Zed commit 7adbb36) rather than ==. To me as a user, it appears to be expressing a test for equality, and other recent Zed language changes have == as the equality testing operator in other contexts such as expressions.
In terms of implementation, @mccanne pointed out that the way it's been using = was in some ways a reflection of it not being the same kind of equality test as appears elsewhere in the language (e.g. you can’t join on arbitrary boolean expressions). Specifically, he explained that join can be efficient if this equality comparison is implemented as a "merge join" or a "hash join", but if we tried to support everything that's possible currently via Zed expressions, there's corner cases such as comparing to regexps that would require comparing the cross product across columns, which is technically feasible but will be prohibitively expensive when there's lots of data to join. That said, @nwt did some research to compare our limitations with that of SQL and it looked to him like they do allow these arbitrary expressions, so that gives us more to think about.
In a group discussion, the main take-away is that we intend to do more research on the compare-to-SQL front. Beyond that, there did seem to be consensus that we unite the syntax as best we can (e.g. move to using == in join) and if there's permutations we can't support yet (such as the expensive cross product) we could inform the user they're not supported.
The Zed docs published in #2836 cited this issue as a current limitation. When this issue is addressed, the docs should be updated.
@mattnibs and I recently bumped into a specific use case that would have benefitted from what's described here. The Geolocation data from MaxMind is bucketed by CIDR block, e.g.:
$ zq -Z 'head 1' GeoLite2-City-Blocks-IPv4.csv
{
network: "1.0.0.0/24",
geoname_id: 2077456.,
registered_country_geoname_id: 2077456.,
represented_country_geoname_id: null,
is_anonymous_proxy: 0.,
is_satellite_provider: 0.,
postal_code: null,
latitude: -33.494,
longitude: 143.2104,
accuracy_radius: 1000.
}
Meanwhile, a user's network security data they'd want to map to locations will likely be based on individual IP address. Therefore it would be beneficial to be able to use "truthy functions" (e.g., ones that return booleans, such as cidr_match() in this case) in the join expression so the user could map a single IP to the location for its matching CIDR block.