clickhouse-docs
clickhouse-docs copied to clipboard
Approach to IP-based Geolocation based on CIDR ranges is wrong
I'm referring to this article: https://clickhouse.com/blog/geolocating-ips-in-clickhouse-and-grafana
Steps
- Let's follow an author in everything up to the point, when
ip_triedictionary is ready to be used. An important part is that source data includes the following lines (currently is does, but it may change):
169.136.101.0,169.136.101.254,SG,,,Singapore (Orchard),,1.30283,103.834, 169.136.101.255,169.136.107.255,HK,Central and Western,,Hong Kong,,22.3193,114.169,
- Execute:
select dictGet('ip_trie', ('country_code'), tuple('169.136.101.255'::IPv4))
Result:
SG
Expected results
HK
Investigation
With this query:
select
ip_range_start,
ip_range_end,
bitXor(ip_range_start, ip_range_end) as xor,
bin(xor) as xor_binary,
if(xor != 0, ceil(log2(xor)), 0) as unmatched,
32-unmatched as cidr_suffix
from geoip_url
where ip_range_start = '169.136.101.0'::IPv4 or ip_range_start = '169.136.101.255'::IPv4
169.136.101.0 169.136.101.254 254 11111110 8.0 24.0 169.136.101.0
169.136.101.255 169.136.107.255 3584 0000111000000000 12.0 20.0 169.136.96.0
We can see that in the first row's cidr_ columns we've lost the information about where exactly the SG range ends. It may look expected, since the next ip address range may be more specific and be something like 169.136.107.255/32, and everything will work, but in reality, the next ip range spans over a bigger range too.
So the second line has lost information about where the second range starts too.
From this point, all further steps will not be able to recover that information, and it's correct to say that results for such ranges will not be determined.
This problem exists not only for ranges with last byte breaks, and not only for 1 invalid IP address. It's possible to propose a theoretical example, when it will produce much bigger harm:
> ip_range_start ip_range_end xor xor_binary unmatched cidr_suffix cidr_address
> 169.136.0.0 169.136.254.255 65279 1111111011111111 16.0 16.0 169.136.0.0
> 169.136.255.0 169.140.255.255 262399 000001000000000011111111 19.0 13.0 169.136.0.0
In this case, the entire 169.136.255.0/24 subnet was assigned to the wrong, and potentially very distant location.
Thanks @astef - will take a look
@ArctypeZach Let's take a look.
@astef sorry for the delay. I think I've written a new query that should yield much more accurate results:
with
toUInt64(ip_range_start) as int_start,
bitXor(ip_range_start, ip_range_end) as
xor,
bin(xor) as xor_binary,
if(xor != 0,ceil(log2(xor)),0) as unmatched,
range(32 - toUInt8(unmatched) + 1, 33, 1) as cidrs,
arrayMap(x -> (toUInt64(pow(2, 32 - x))), cidrs) as cidr_length,
arrayJoin(arrayZip(cidrs, cidr_length)) as cidr_info,
cidr_info.1 as suffix,
cidr_info.2 as length,
sum(length) over w as block_offset,
toIPv4(int_start + block_offset) as next_block_start,
toIPv4(toUInt64(next_block_start::UInt64 -1)) as block_end,
toIPv4(bitAnd(bitNot(pow(2, 32 - suffix)),next_block_start)::UInt64) as block_start
select
ip_range_start,
ip_range_end,
block_start,
block_end,
suffix,
concat(toString(block_start),'/',toString(suffix)) as cidr
from
geoip_url
window w as (partition by ip_range_start order by suffix asc)
qualify toUInt64(block_end) <= toUInt64(ip_range_end)
Using one of your addresses from above, 169.136.101.0 -> 169.136.101.254, we can see some better results:
┌─ip_range_start─┬─ip_range_end────┬─block_start─────┬─block_end───────┬─suffix─┬─cidr───────────────┐
1. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.0 │ 169.136.101.127 │ 25 │ 169.136.101.0/25 │
2. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.128 │ 169.136.101.191 │ 26 │ 169.136.101.128/26 │
3. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.192 │ 169.136.101.223 │ 27 │ 169.136.101.192/27 │
4. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.224 │ 169.136.101.239 │ 28 │ 169.136.101.224/28 │
5. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.240 │ 169.136.101.247 │ 29 │ 169.136.101.240/29 │
6. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.248 │ 169.136.101.251 │ 30 │ 169.136.101.248/30 │
7. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.252 │ 169.136.101.253 │ 31 │ 169.136.101.252/31 │
8. │ 169.136.101.0 │ 169.136.101.254 │ 169.136.101.254 │ 169.136.101.254 │ 32 │ 169.136.101.254/32 │
└────────────────┴─────────────────┴─────────────────┴─────────────────┴────────┴────────────────────┘
This would appear to line up with the ARIN CIDR Calculator as well:
wdyt? I can adjust the blog post if we feel that this is now correct.
One addendum - to address the original reproduction. I loaded the contents of this revised query into the same dictionary.
select dictGet('ip_trie', ('country_code'), tuple('169.136.101.255'::IPv4)) as country
Now yields
┌─country─┐
1. │ HK │
└─────────┘