snuba
Fix snuba admin's query tracing to connect to the right storage and query nodes
Snuba admin's tracing tool shows query trace output. In addition to that, it will now show profile events data. This PR parses the trace output and builds a dict mapping node name to query id. A system query is then executed on each of those nodes to fetch the profile events for the corresponding query id.
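A minimal sketch of the mapping step, assuming summaries shaped like the trace sample further down this thread (the function name and signature are illustrative, not snuba's actual internals):

```python
# Hypothetical sketch: build a node-name -> query-id dict from the parsed
# trace summary; each summary is assumed to expose a query_id attribute,
# as in the summarized_trace_output sample below.
def node_to_query_id(query_summaries: dict) -> dict[str, str]:
    return {
        node_name: summary.query_id
        for node_name, summary in query_summaries.items()
    }
```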
Before connecting to the nodes, a socket connection is attempted to check whether the hostname resolves. If it does not, the node is assumed to be 127.0.0.1. This is required because, when running locally, the snuba admin process connects to the clickhouse container, but the container's hostname does not resolve, so we default to 127.0.0.1.
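A minimal sketch of that fallback using only the standard library (the helper name is an assumption, not the PR's actual code):

```python
import socket

# Hypothetical sketch: keep the hostname if it resolves; otherwise assume
# the node is the local clickhouse container whose ports are forwarded to
# 127.0.0.1, as described above.
def resolve_node_host(hostname: str) -> str:
    try:
        socket.gethostbyname(hostname)
        return hostname
    except socket.gaierror:
        return "127.0.0.1"
```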
To make sure hostnames in the CI jobs resolve, the CI docker-compose file is updated.
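For illustration only, one way a compose file can pin a predictable, resolvable name is a `hostname` key on the service; the service name, image, and hostname below are assumptions, not the actual CI config:

```yaml
# Hypothetical sketch: give the clickhouse container a fixed hostname
# instead of a docker-generated container id.
services:
  clickhouse:
    image: clickhouse/clickhouse-server
    hostname: clickhouse.dev.local
```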
:x: 1 Tests Failed:
| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 472 | 1 | 471 | 1 |
View the top 1 failed tests by shortest run time
tests.admin.test_api test_query_trace | Stack Traces | 0.192s run time
Traceback (most recent call last):
  File ".../tests/admin/test_api.py", line 258, in test_query_trace
    assert response.status_code == 200
AssertionError: assert 500 == 200
 +  where 500 = <WrapperTestResponse streamed [500 INTERNAL SERVER ERROR]>.status_code
To view individual test run time comparison to the main branch, go to the Test Analytics Dashboard
and actually this does not seem like a good idea at all. what is the problem you're solving? having everyone edit `/etc/hosts` on their mac is not a good solution and is going to cause endless pain for developers
@asottile-sentry thanks for the review. Let's take a step back to the larger problem. I will edit the description if this explanation makes sense.
At Sentry, snuba developers use a tool called snuba admin. It has a query tracing section that lets snuba developers trace how a query is executed in clickhouse. We wanted to enhance that tool to also show clickhouse's profile events. Without this enhancement, snuba developers have to go through these manual steps to get profile events:
- Manually copy query ids from the raw trace output.
- Navigate to the system query dashboard.
- Type in a different query to get profile events, replacing the query id (one at a time) with the strings gathered in step 1.
- As if this weren't enough, choose each host from the dropdown in the system query dashboard and repeat step 3.
You can see that these manual steps are error-prone and there is scope to automate them.
The query trace output has a field called summarized_trace_output. Here is a sample of it:
{
"module": "snuba.admin.views",
"event": "summarized_trace_output = TracingSummary(query_summaries={'ebaf1a40d262': QuerySummary(node_name='ebaf1a40d262', is_distributed=True, query_id='fc3e8017-51ce-47c0-bb0f-209470fb70c8', execute_summaries=[ExecuteSummary(rows_read=1, memory_size='4.01 KiB', seconds=0.007365, rows_per_second=135.77732518669382, bytes_per_second='544.17 KiB')], select_summaries=None, index_summaries=None, stream_summaries=None, aggregation_summaries=None, sorting_summaries=None)})",
"severity": "info",
"user_ip": "127.0.0.1",
"endpoint": "clickhouse_trace_query",
"timestamp": "2024-09-05T03:47:13.227537Z"
}
The node_name='ebaf1a40d262' is the container id/hostname of the clickhouse container that the snuba admin tool wants to connect to in order to run system queries and fetch profile events. It has these ports forwarded:
127.0.0.1:8123->8123/tcp, 127.0.0.1:9000->9000/tcp, 127.0.0.1:9009->9009/tcp. To connect to this host reliably, we need it to have a proper name that we control (not a container id that the docker runtime chooses). To resolve that hostname, we need an entry in the /etc/hosts file. Without name resolution, running a system query gives an error like "Host clickhouse.dev.local and port 9000 are not valid."
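To connect the dots, here is a hedged sketch of the per-node step; `run_system_query` is a made-up helper, while `system.query_log` and its `ProfileEvents` column are standard clickhouse system tables:

```python
# Hypothetical sketch: for each node named in the trace, ask that node's
# system.query_log for the profile events recorded under the traced
# query's query_id.
PROFILE_EVENTS_SQL = """
SELECT ProfileEvents
FROM system.query_log
WHERE query_id = %(query_id)s AND type = 'QueryFinish'
"""

def gather_profile_events(node_to_query_id: dict, run_system_query) -> dict:
    return {
        node: run_system_query(node, PROFILE_EVENTS_SQL, {"query_id": query_id})
        for node, query_id in node_to_query_id.items()
    }
```

For local runs, name resolution would come from a hosts entry along the lines of `127.0.0.1 clickhouse.dev.local` (hostname taken from the error message above; illustrative only).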
To be clear, not every developer needs this entry in /etc/hosts; only those who want to fix issues in or enhance snuba admin while running it locally. The configuration in the CI pipeline is different, so I made the changes to the CI clickhouse hostname there.
wouldn't it be simpler, and a better experience, to write a tool which does all of the manual translation automatically (including replacing with the appropriate localhost / loopback addresses)? -- we already utilize `host.docker.internal` for example
I resolved the issues based on the feedback in our call. No network settings need to be changed to make this work now.