pinot icon indicating copy to clipboard operation
pinot copied to clipboard

Add Support for IP Address Function

Open SabrinaZhaozyf opened this issue 2 years ago • 9 comments

Label = feature

This PR adds a scalar function isSubnetOff(IP_PREFIX, IP_ADDRESS) which checks if IP_ADDRESS is in the subnet of IP_PREFIX (this function handles both IPv4 and IPv6).

Both IP_PREFIX and IP_ADDRESS are STRING type.

Use Cases SELECT isSubnetOf('1.2.3.128/26', '1.2.3.129') FROM FOO --> returns true
SELECT isSubnetOf('64:fa9b::17/64', '64:ffff::17') FROM FOO --> returns false

SabrinaZhaozyf avatar Sep 29 '22 22:09 SabrinaZhaozyf

Context - One of our observability users wanted subnet-contains and other similar functionality for IP addresses stored as STRINGs. To be used in SQL WHERE clause mostly

siddharthteotia avatar Sep 30 '22 01:09 siddharthteotia

Can we add some e2e query execution tests where we use this in query execution (WHERE clause) similar to how you added other transform functions in the past ?

Let's add some tests with meaningful IP addresses and then test out the functionality for both IPv4 and IPv6.

siddharthteotia avatar Oct 04 '22 04:10 siddharthteotia

Codecov Report

Merging #9501 (a4093cb) into master (9f5aa89) will increase coverage by 39.29%. The diff coverage is 42.85%.

@@              Coverage Diff              @@
##             master    #9501       +/-   ##
=============================================
+ Coverage     24.62%   63.91%   +39.29%     
- Complexity       53     4910     +4857     
=============================================
  Files          1935     1893       -42     
  Lines        103815   101871     -1944     
  Branches      15758    15529      -229     
=============================================
+ Hits          25560    65107    +39547     
+ Misses        75635    31997    -43638     
- Partials       2620     4767     +2147     
Flag Coverage Δ
integration2 ?
unittests1 67.40% <42.85%> (?)
unittests2 15.76% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...not/common/function/scalar/IpAddressFunctions.java 42.85% <42.85%> (ø)
...va/org/apache/pinot/core/routing/RoutingTable.java 0.00% <0.00%> (-100.00%) :arrow_down:
...va/org/apache/pinot/common/config/NettyConfig.java 0.00% <0.00%> (-100.00%) :arrow_down:
...a/org/apache/pinot/common/metrics/MinionMeter.java 0.00% <0.00%> (-100.00%) :arrow_down:
...g/apache/pinot/common/metrics/ControllerMeter.java 0.00% <0.00%> (-100.00%) :arrow_down:
.../apache/pinot/common/metrics/BrokerQueryPhase.java 0.00% <0.00%> (-100.00%) :arrow_down:
.../apache/pinot/common/metrics/MinionQueryPhase.java 0.00% <0.00%> (-100.00%) :arrow_down:
...ache/pinot/server/access/AccessControlFactory.java 0.00% <0.00%> (-100.00%) :arrow_down:
...he/pinot/common/messages/SegmentReloadMessage.java 0.00% <0.00%> (-100.00%) :arrow_down:
...he/pinot/common/messages/TableDeletionMessage.java 0.00% <0.00%> (-100.00%) :arrow_down:
... and 1578 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Oct 04 '22 19:10 codecov-commenter

We've been using https://github.com/seancfoley/IPAddress for exactly this type of address parsing. It's very well tested, and seems robust. Wondering why you wouldn't want to use it for this functionality (and eventually other IP address functions supported by Presto, such as ip_prefix).

kkrugler avatar Oct 05 '22 17:10 kkrugler

Nit - PR description says the function name is isSubnetOf, but code seems to be using is_subnet_of.

kkrugler avatar Oct 05 '22 17:10 kkrugler

One more comment - I think Presto handles either VARCHAR or IPADDRESS for the ip address. For one of our use cases, we've got IPv4 stored as longs in the Pinot table, which would be more efficient to query as-is versus using a custom function to convert back to a string. We could extend this support in the future, wondering if there would be a reason we shouldn't plan on being able to do that.

kkrugler avatar Oct 05 '22 17:10 kkrugler

@walterddr - please take a look as well

siddharthteotia avatar Oct 07 '22 23:10 siddharthteotia

Based on @kkrugler 's suggestion to explore library, I had asked @SabrinaZhaozyf to do some investigation along the following lines.

  • Do we think library will offer comprehensive IP processing functionality besides subnet_contains (which is the current focus of your PR but we want to add more more in future) ?

  • Suggest taking a look at IP related functions supported by Kusto / Presto / Snowflake / Postgres to get an idea of IP processing landscape

  • Do we think our impl can be simplified by using the library and be less error prone ?

  • Probably not a good idea to use the library if it only supports specific functionality.

  • Some systems support IP address as a data type (NETWORK_ADDRESS IIRC ). We don't want to go there (yet) but good to check once if using the library will force us down a particular approach

@SabrinaZhaozyf , you may want to share any of your findings here for reference / discussion.

@Jackie-Jiang @walterddr @xiangfu0 - both approaches are implemented here. Can we get consensus on how we move forward ? One thing is that we will have to overload this for different types. Have you seen a similar requirement ?

siddharthteotia avatar Oct 18 '22 00:10 siddharthteotia

Based on @kkrugler 's suggestion to explore library, I had asked @SabrinaZhaozyf to do some investigation along the following lines.

  • Do we think library will offer comprehensive IP processing functionality besides subnet_contains (which is the current focus of your PR but we want to add more more in future) ?
  • Suggest taking a look at IP related functions supported by Kusto / Presto / Snowflake / Postgres to get an idea of IP processing landscape
  • Do we think our impl can be simplified by using the library and be less error prone ?
  • Probably not a good idea to use the library if it only supports specific functionality.
  • Some systems support IP address as a data type (NETWORK_ADDRESS IIRC ). We don't want to go there (yet) but good to check once if using the library will force us down a particular approach

@SabrinaZhaozyf , you may want to share any of your findings here for reference / discussion.

@Jackie-Jiang @walterddr @xiangfu0 - both approaches are implemented here. Can we get consensus on how we move forward ? One thing is that we will have to overload this for different types. Have you seen a similar requirement ?

I have summarized my research on IP processing functions in other systems and the library in this doc: https://docs.google.com/document/d/1SS04jQCojcvCrrIGJX_aHocUEXXy8rpX0uUhwAren4k/edit?usp=sharing

Please feel free to provide feedback / comments and let's get consensus on how we want to proceed from here.

SabrinaZhaozyf avatar Oct 18 '22 20:10 SabrinaZhaozyf

I was about to suggest the same as Rong. Given that we are going toward s postgres with joins, let's emulate the functions as well.

https://www.postgresql.org/docs/current/functions-net.html

kishoreg avatar Oct 22 '22 23:10 kishoreg