quilkin
More in-depth network metrics about clients (IPv4 and IPv6)
Introduction
Quilkin will do many things at Embark for our games. One of the things we want to use Quilkin for is re-routing players to specific proxies, or blacklisting specific proxies for specific players, when we notice broken routing or just poor performance to a specific proxy.
In order to make good decisions we will need more metrics to base these decisions on than just IP addresses. In our case we would like to have the AS number, AS name, AS location, IPv4/IPv6 prefix, and prefix owner for each IP address that establishes a game session with us.
With this information we can then build tools and heatmaps that show where players are experiencing poor connection performance or quality to our infrastructure.
Proposed solution
I recently came across IPNetDB, which sparked an idea I have been thinking about for some time. Essentially, I would like Quilkin to support performing a lookup for each connection when it is established, retrieving specific information about the IP address, and then sending that downstream with the other metrics.
IPNetDB in this case is distributed in the MaxMind database format (mmdb), but support should probably also include GeoIP2 and GeoLite2 databases. There are libraries in Rust for this already, for instance oschwald/maxminddb-rust.
Below you can see the output of a query for my home IP address. All the information we would need is in the result.
```
{'allocation': '37.247.0.0/19',
 'allocation_cc': 'SE',
 'allocation_registry': 'ripe',
 'allocation_status': 'allocated',
 'as': 33885,
 'as_cc': 'SE',
 'as_entity': 'Ownit Broadband AB',
 'as_name': 'OWNIT',
 'as_private': False,
 'as_registry': 'ripe',
 'prefix': '37.247.0.0/19',
 'prefix_asset': [],
 'prefix_assignment': 'allocated pa',
 'prefix_bogon': False,
 'prefix_entity': 'Ownit Broadband AB',
 'prefix_name': 'SE-OWNIT-20120417',
 'prefix_origins': [33885],
 'prefix_registry': 'ripe'}
```
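To give a sense of what a typed representation of this record could look like in Quilkin, here is a sketch of a Rust struct mirroring a subset of the fields above. The struct name and field selection are assumptions for illustration; a real implementation would deserialize these fields from the mmdb file via a library such as oschwald/maxminddb-rust rather than construct them by hand.

```rust
/// Subset of the IPNetDB prefix-record fields shown in the query output
/// above. Field names mirror the keys in the example output; `as` is a
/// Rust keyword, hence `as_number`. This is an illustrative sketch, not
/// Quilkin's actual API.
#[derive(Debug, Clone, PartialEq)]
struct PrefixInfo {
    allocation: String,        // e.g. "37.247.0.0/19"
    allocation_cc: String,     // country code of the allocation
    as_number: u32,            // the "as" key in the raw record
    as_name: String,
    as_entity: String,
    prefix: String,
    prefix_entity: String,
    prefix_origins: Vec<u32>,  // ASNs originating this prefix
}
```

Attaching a struct like this to each session record would give downstream tooling the AS number, AS name, and prefix owner directly, without a second lookup.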
Problems? Since these databases are file-based, we would need to figure out a solution where Quilkin uses and updates the file on disk in order to always have the most up-to-date version of the database.
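One possible shape for the update problem is to track the database file's metadata and re-open it when it changes. The sketch below is a minimal, std-only polling watcher; the `DbWatcher` name and design are assumptions, and a real implementation would more likely use a filesystem-notification crate (e.g. `notify`) than polling.

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

/// Polls the on-disk database file and reports when the in-memory copy
/// is stale. Illustrative sketch only, not Quilkin's actual mechanism.
struct DbWatcher {
    path: PathBuf,
    /// (mtime, size) of the file at the last reload, if any.
    last_seen: Option<(SystemTime, u64)>,
}

impl DbWatcher {
    fn new(path: impl AsRef<Path>) -> Self {
        DbWatcher { path: path.as_ref().to_path_buf(), last_seen: None }
    }

    /// Returns true when the file has changed since the last call,
    /// meaning the caller should re-open the database from disk.
    fn needs_reload(&mut self) -> bool {
        let current = fs::metadata(&self.path)
            .ok()
            .and_then(|m| m.modified().ok().map(|t| (t, m.len())));
        if current != self.last_seen {
            self.last_seen = current;
            true
        } else {
            false
        }
    }
}
```

The first call always reports a reload (nothing has been loaded yet); subsequent calls only fire when the file's mtime or size changes, which keeps the hot path cheap.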
Sounds reasonable to me overall! Implementation-wise, I'm thinking it should rather be that Quilkin makes the connection info available, and then some service hosts the DB and can use that info to answer queries. Unless there's an upside to talking directly to Quilkin for this (I couldn't come up with any)? It would mean the proxy doesn't enter the data/analytics business, since that isn't the goal of a proxy, and when the types of queries/datasources need to change or be improved, users don't have to wait for a new version of Quilkin.
For example, a filter that pushes "this new client IP:Port showed up at this timestamp" to some data store. A tiny HTTP service somewhere with an API can then consume that data to show e.g. which clients are currently talking to which proxies, and do a join against the ipnetdb on disk to answer any query.
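The core of such a filter is small. Here is a std-only sketch of the record it would emit, with an in-memory map standing in for the external data store (a real filter would push these records to the hypothetical HTTP service instead); `SessionRecord` and `on_new_packet` are illustrative names, not Quilkin APIs.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::{SystemTime, UNIX_EPOCH};

/// "This client IP:Port showed up at this timestamp" — the datum the
/// filter would push out for later joining against ipnetdb.
#[derive(Debug, Clone, PartialEq)]
struct SessionRecord {
    client: SocketAddr,
    first_seen_unix: u64,
}

/// Records the first time each client address is seen by this proxy.
/// The HashMap stands in for the external data store in this sketch.
fn on_new_packet(store: &mut HashMap<SocketAddr, SessionRecord>, client: SocketAddr) {
    store.entry(client).or_insert_with(|| SessionRecord {
        client,
        first_seen_unix: SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock before Unix epoch")
            .as_secs(),
    });
}
```

Keying on the client address means repeated packets from the same session don't generate duplicate records, so the downstream service only sees one row per client.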
@suom1 Are there any particular kinds of metrics you're envisioning being included with this use case? At the moment I can mostly come up with being able to track latency - e.g. the proxy can say a packet from IP:Port arrived at this timestamp - but I'm not sure how useful that would be, since a single proxy doesn't have enough info to say how fast/slow things are going.
With this information we can then build tools and heatmaps that show where players are experiencing poor connection performance or quality to our infrastructure.
Interesting topic, and definitely something I want to see us be able to do with Quilkin! 👍🏻
Sounds reasonable to me overall! I'm thinking impl wise it should rather be that Quilkin makes the connection info available, then some service hosts the DB and can use the info to answer queries.
I would agree with this sentiment. I don't think it's Quilkin's job to host this database -- but I do think it should provide the relevant information to an external service so that it can join this data together in a meaningful way -- so we would need to make sure that Quilkin is exporting the data needed for that join.
I have a few questions here (actually some of these have also been rolling around in my head lately as well):
- How should we be measuring latency? Do we send an echo packet? Is there another way? Is this something we should bake into Quilkin? (I think the answer is yes, but there are some interesting technical design discussions there.) If we agree we should do this, then we should start a new ticket.
- How do we want to export this type of data? My initial thought was to export some kind of distributed trace information (OpenTelemetry seems to be the leader here), down to the player/address level - which could include latency information. (Some details in #258). I'd prefer not to create a new standard if we can avoid it.
Exposing this as traces makes a lot of sense from Quilkin's perspective. It does make it a lot harder to consume the data, though (a user would need custom tooling to extract the traces from whatever store they end up in, which would be a lot of work), but it's likely more reasonable than doing something custom.
How should we be measuring latency?
Do you mean latency from the proxy to every connected upstream endpoint?
I agree that running the lookup service inside Quilkin might not be the optimal solution, since each instance of Quilkin would then need to be shipped with a geo-IP database. Since our players' game traffic will go through Quilkin instances first, I think it's suitable to implement some kind of filter that pushes the IPv4/IPv6 address of the connected client to some data store, like @iffyio suggested.
The end-to-end measurement of latency, however, would need to be done from the client all the way to the game server. But an interesting metric would also be to monitor latency between the client and the active Quilkin instance. Over time, this information would give us a better understanding of where we need to host more or new Quilkin instances.
Not sure if this answers your question @iffyio
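The client-to-Quilkin latency idea could be measured with an echo probe: stamp a probe packet at send time and compute the round trip when the matching reply arrives. The sketch below shows just the timing core under that assumption; `EchoProbe` is a hypothetical name, and matching replies to probes and the wire format are left out.

```rust
use std::time::{Duration, Instant};

/// One hypothetical echo exchange between a client and a Quilkin
/// instance: stamp at send time, measure round-trip on reply.
struct EchoProbe {
    sent_at: Instant,
}

impl EchoProbe {
    /// Called when the probe packet is sent.
    fn send() -> Self {
        EchoProbe { sent_at: Instant::now() }
    }

    /// Called when the matching echo reply arrives; the elapsed time
    /// is the round-trip latency between client and proxy.
    fn rtt(&self) -> Duration {
        self.sent_at.elapsed()
    }
}
```

Using a monotonic clock (`Instant`) rather than wall-clock time matters here: it avoids negative or skewed measurements when either side's system clock is adjusted mid-probe.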
Exposing this as traces makes a lot of sense from Quilkin's perspective. It does make it a lot harder to consume the data, though (a user would need custom tooling to extract the traces from whatever store they end up in, which would be a lot of work), but it's likely more reasonable than doing something custom.
I say this as someone who has never used distributed tracing/open tracing - so please feel free to take the ideas with large grains of salt 😄 This is all theory at this stage.
This is GCP specific, but there is an export from Tracing -> BigQuery: https://cloud.google.com/trace/docs/trace-export-bigquery
Doing some digging around, there are a few tracing backend services that allow querying of data. My theory is that arbitrary querying of trace data is a common enough problem for observability platforms that we should be able to find patterns or tools to export data without needing to write our own custom tooling.
But like I said - this is my theory at least. 😄
But an interesting metric would also be to monitor latency between client and the active Quilkin instance.
That was what I was thinking. I'd recommend a ticket to design a filter that could measure, capture, and expose this information. I can think of a few potential approaches.