Support HTTP compression
About
CrateDB’s HTTP interface supports gzip and deflate compressed requests, but the crate-python client currently does not utilize this capability. Adding request compression would reduce bandwidth usage, improve performance for large queries and bulk inserts, and align crate-python with best practices seen in other database clients.
As a user, I want the option to send compressed requests to CrateDB to improve performance on congested networks.
Requirements:
- [ ] Add a configuration option to enable request compression (gzip or deflate) when sending requests to CrateDB.
- [ ] Compression should be enabled by default.
- [ ] TBD: Introduce a size threshold to determine when compression is applied. Context: Sending a Content-Encoding header for every request adds unnecessary overhead, so compression should only be used when the request size exceeds a configurable threshold (e.g., 1 KB, 2 KB, or 4 KB, similar to other libraries).
> [!WARNING]
> This is primarily about request encoding / compression. HTTP response encoding is vulnerable to BREACH and therefore requires additional countermeasures.
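A minimal sketch of how the threshold requirement could work, using only the standard library. The helper name `prepare_body` and the 1 KiB default are hypothetical illustrations, not part of crate-python's actual API:

```python
import gzip
import json

# Hypothetical: only compress payloads larger than this many bytes.
COMPRESS_THRESHOLD = 1024

def prepare_body(payload: dict, threshold: int = COMPRESS_THRESHOLD):
    """Serialize a payload and gzip it only when it exceeds the threshold.

    Returns a tuple of (body_bytes, extra_headers).
    """
    raw = json.dumps(payload).encode("utf-8")
    if len(raw) <= threshold:
        return raw, {}
    return gzip.compress(raw), {"Content-Encoding": "gzip"}

# Small payloads pass through uncompressed ...
small, headers = prepare_body({"stmt": "SELECT 1"})
assert headers == {}

# ... while large bulk payloads are gzipped.
bulk = {"stmt": "INSERT INTO t VALUES (?, ?)",
        "bulk_args": [[i, "test"] for i in range(10_000)]}
big, headers = prepare_body(bulk)
assert headers == {"Content-Encoding": "gzip"}
assert len(big) < len(json.dumps(bulk).encode("utf-8"))
```

Whether the threshold belongs in the client or should be fixed is exactly the TBD above.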
@proddata said:
It seems like CrateDB's HTTP interface accepts gzip / deflate compressed data. It might also be interesting to add this capability to crate-python.
@surister said:
```python
import gzip
import json

import requests

objects = [[1, "test"] for _ in range(200_000)]
body = {
    "stmt": "INSERT INTO t VALUES (?, ?)",
    "bulk_args": objects,
}

# Uncompressed request
response = requests.post('http://192.168.88.251:4200/_sql', json=body)
print(response.request.headers.get('content-length'))

# Gzip-compressed request
response = requests.post('http://192.168.88.251:4200/_sql',
                         data=gzip.compress(json.dumps(body).encode('utf8')),
                         headers={'Content-Encoding': 'gzip',
                                  'Content-Type': 'application/gzip; charset=utf-8'})
print(response.request.headers.get('content-length'))
```

Output:

```
2600054
5149
```
References
- https://github.com/crate/crate/pull/17494
@proddata: Thanks for your feedback. The requirements sound reasonable, if we really need the ability to turn the feature off on demand.
Naming things
I know this driver uses a standard protocol, while most other low-level database drivers use different protocols, mostly binary or otherwise proprietary to their needs, where compression might be handled differently, or is enabled from the start.
In order to adhere to relevant (naming) conventions as well as possible when making the feature configurable as proposed, can we look into how others are doing it, most prominently to learn about interfaces and parameter names?
a) How other Python DBAPI libraries handle this situation and make the relevant parameters configurable (names, values, units), and b) in particular, how PostgreSQL (clients) are doing it, both in Python land and beyond.
... of course, this only applies if compression is also a concern there. I think it might be, but I might also be too naive.
Rationale
How are others doing it?
In particular, I am not exclusively concerned about the DBAPI driver here, because it rarely has direct exposure other than being used from applications. However, SQLAlchemy is a different animal, because its connection string is exposed to wider audiences of people and machines, using it in downstream applications of many kinds, standalone or cloud-based, library-shaped or ephemerally-hosted, or not. You name it.
In this spirit, we aim to standardize on naming conventions here,
- https://github.com/crate/sqlalchemy-cratedb/issues/197
so I am asking to do the same here, looking at how others name their parameters, also for the compression feature.
In the Elasticsearch Python client, HTTP compression is controlled by a simple on/off switch (`http_compress`); see its configuration reference and implementation details.
However, it’s important to distinguish between the two aspects of compression support in the client: sending and receiving compressed data. Only the content itself is compressed, not the overall request structure. For example, a simple query like SELECT 1 results in approximately 250 bytes transmitted in the request body. Enabling compression in such cases provides little benefit in terms of data reduction but, in initial tests, introduced a slight increase in latency (a few milliseconds per request).
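The size effect described above can be reproduced with a quick stdlib-only check; the payloads here are illustrative, and the exact byte counts will vary:

```python
import gzip
import json

# gzip adds roughly 18 bytes of header/trailer, so compressing a tiny
# body such as {"stmt": "SELECT 1"} does not make it smaller.
small = json.dumps({"stmt": "SELECT 1"}).encode("utf-8")
print(len(small), len(gzip.compress(small)))

# A large, repetitive bulk payload, by contrast, shrinks dramatically.
bulk = json.dumps({
    "stmt": "INSERT INTO t VALUES (?, ?)",
    "bulk_args": [[1, "test"]] * 200_000,
}).encode("utf-8")
print(len(bulk), len(gzip.compress(bulk)))
```

This matches the intuition behind a size threshold: below some payload size, compression costs CPU and latency without saving bytes.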
> b) in particular, how PostgreSQL (clients) are doing it, both in Python lands, and beyond.
PostgreSQL does not support request compression in the same way. Neither the PostgreSQL wire protocol nor common client implementations (in Python or other languages) provide native compression mechanisms similar to Elasticsearch’s HTTP-based approach.
Thanks, I've added the information about Elasticsearch to the table below.
With the PostgreSQL wire protocol, compression can be enabled, if your OpenSSL library supports zlib, by toggling the connection parameter sslcompression=1. It was enabled by default when introduced in version 9.2, but this default was disabled again in version 11.
I have not been able to spot any threshold parameters other than with Oracle, up until now. Of course this enumeration is neither exhaustive, nor deep, and just tries to tap a little bit into the topic of proper "naming things", and "exploring the landscape".
| Database | Client docs | Parameter names (driver/server/SQL) | Parameter names (JDBC, ODBC) | Parameter names (SQLAlchemy) |
|---|---|---|---|---|
| Elasticsearch | Python client: HTTP compression | API argument: `http_compress=True` | `http_compress` ℹ | |
| MongoDB | MongoDB Network Compression: A Win-Win | API argument: `compressors='zstd'` | n/a | |
| MySQL | 6.2.8 Connection Compression Control | setting: `protocol_compression_algorithms`; CLI: `--compression-algorithms`, `--compress` | `compress` ℹ | |
| Oracle | Advanced Network Compression Whitepaper | `SQLNET.COMPRESSION`, `SQLNET.COMPRESSION_LEVELS`, `SQLNET.COMPRESSION_THRESHOLD` | | |
| PostgreSQL | 31.1. Database Connection Control Functions » 31.1.2. Parameter Key Words (via SO) | connection: `sslcompression=1`; env: `PGSSLCOMPRESSION=1` | Did not discover anything. Maybe generic pass-through? | |
> With the PostgreSQL wire protocol, compression can be enabled, if your OpenSSL library supports zlib, by toggling the connection parameter sslcompression=1, which has been enabled by default starting with its introduction on version 9.2, but this default has been disabled again with version 11.

TLS compression has been removed in TLS v1.3 due to CRIME.
HTTP compression is somewhat vulnerable to BREACH, so it’s important to differentiate between request and response encoding. Ideally, a client should support both, but they should be managed with separate settings, as request compression generally doesn’t present the same risks. That is also partially why I only initially talked about request encoding.
I see. Thank you very much.
So, in order to use other implementations and their parameterizations as blueprints for "naming things", we eventually need to focus on databases that specifically use HTTP as a communication protocol rather than TLS-level compression via OpenSSL, because those details (e.g. request vs. response compression parameters) will only be present and of concern in such environments.
In this case, selecting Elasticsearch is a perfect choice [^1]. However, relevant parameter sets seem pretty thin in this regard: http_compress neither provides the option to discriminate between request and response, nor conveys any threshold details. Maybe OpenSearch offers a better way to adjust those parameters?
[^1]: Can you think of any other database servers that use HTTP?
OpenSearch Python Client
The OpenSearch Python Low-Level Client supports HTTP compression for request bodies:
```python
http_compress = True  # Enables gzip compression for request bodies
```
Contrary to ES, they mention only request bodies (I haven't checked the actual implementation).
ClickHouse Python Client
The ClickHouse Python Client (with limited SQLAlchemy support) also provides compression settings:
ClickHouse Java Client
The ClickHouse Java Client uses the HTTP interface and provides three compression-related settings:
| Setting | Description | Effect |
|---|---|---|
| `compressServerResponse(boolean enabled)` | `enabled`: flag that indicates if the option should be enabled | Sets whether the server should compress its responses. |
| `compressClientRequest(boolean enabled)` | `enabled`: flag that indicates if the option should be enabled | Sets whether the client should compress its requests. |
| `useHttpCompression(boolean enabled)` | `enabled`: flag that indicates if the option should be enabled | Enables HTTP compression for client/server communication if the corresponding options are enabled. |
ClickHouse JavaScript Client
The ClickHouse JavaScript Client differentiates between request and response compression:
Thank you. What do you think about those parameter names, for both DB API's .connect() function, and SQLAlchemy's engine or query options?
- `compress_request=true` vs. `compress_response=true`
- `compress_request_threshold=2k`
- `compress_request_encoding=` vs. `compress_response_encoding=` (one of `lz4`, `zstd`, `br`, `gzip`, `deflate`, when applicable)

Since these settings primarily deal with content encoding, we might consider naming them accordingly and aligning with HTTP conventions, such as:

- `encoding_request='gzip' | 'none' | <any others in the future>`
- `encoding_response='none' | 'gzip' | <any others in the future>`
Semantically, this would also remove the need for a dedicated on/off parameter.
WDYT?
For the threshold, I think it would make sense to run some benchmarks—potentially, this parameter might not be needed after all.
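A hypothetical sketch of how the encoding-style parameters proposed above could be handled; the parameter names come from the proposal, while the validation logic is purely illustrative and not an actual crate-python API:

```python
# Currently plausible values; others (zstd, br, ...) may follow later.
SUPPORTED_ENCODINGS = {"none", "gzip"}

def resolve_encodings(encoding_request="gzip", encoding_response="none"):
    """Validate the two encoding parameters.

    'none' disables compression for that direction, so no separate
    on/off switch is needed.
    """
    for name, value in (("encoding_request", encoding_request),
                        ("encoding_response", encoding_response)):
        if value not in SUPPORTED_ENCODINGS:
            raise ValueError(f"{name}: unsupported encoding {value!r}")
    return encoding_request, encoding_response

assert resolve_encodings() == ("gzip", "none")
assert resolve_encodings("none", "none") == ("none", "none")
```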
Hi. When possible, I would like to use a naming scheme that is independent of the protocol used and not necessarily tied to it, focusing on the semantic meaning of the compress label, like the other database vendors do, because everyone will understand what it is about, contrary to encoding.
Maybe let's get rid of request/response, instead using client/server, like ClickHouse does?
Then maybe:

- `compress = 'request' | 'response' | 'both' / 'all' | 'none'`, or
- `compress = 'client' | 'server' | 'both' / 'all' | 'none'`

which in the future would allow for:

- `compress_encoding = 'gzip' | 'zstd' | 'lz4' | 'br' | 'deflate'`

Or alternatively:

- `compress = True`
- `compress_mode = 'request' | 'response' | 'both' | 'none'`
which in the future would allow for:
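The compress / compress_mode variant could be resolved by a small mapping, as in this hypothetical sketch; the names mirror the proposal above, and the logic is illustrative, not an actual crate-python API:

```python
# Map each proposed compress_mode value to a pair of booleans:
# (compress_request, compress_response).
MODES = {
    "request": (True, False),
    "response": (False, True),
    "both": (True, True),
    "none": (False, False),
}

def resolve_compression(compress: bool = True, compress_mode: str = "request"):
    """Turn the two proposed parameters into per-direction flags.

    compress=False wins over any compress_mode value.
    """
    if not compress:
        return (False, False)
    try:
        return MODES[compress_mode]
    except KeyError:
        raise ValueError(f"Unknown compress_mode: {compress_mode!r}")

assert resolve_compression(True, "request") == (True, False)
assert resolve_compression(False, "both") == (False, False)
```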
Slightly OT: Discovered via a Dependabot update, it looks like Prometheus also introduced zstd compression support recently.
- https://github.com/prometheus/client_golang/pull/1496
- https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md#1220--2025-04-07