
Support HTTP compression

Open amotl opened this issue 1 year ago • 11 comments

About

CrateDB’s HTTP interface supports gzip and deflate compressed requests, but the crate-python client currently does not utilize this capability. Adding request compression would reduce bandwidth usage, improve performance for large queries and bulk inserts, and align crate-python with best practices seen in other database clients.

As a user, I want the option to send compressed requests to CrateDB to improve performance on congested networks.

Requirements:

  • [ ] Add a configuration option to enable request compression (gzip or deflate) when sending requests to CrateDB.
  • [ ] Compression should be enabled by default.
  • [ ] TBD: Introduce a size threshold to determine when compression is applied. Context: Sending a Content-Encoding header for every request adds unnecessary overhead, so compression should only be used when the request size exceeds a configurable threshold (e.g., 1 KB, 2 KB, or 4 KB, similar to other libraries).
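The threshold idea from the last requirement could be sketched like this, using only the standard library. All names (`prepare_body`, `COMPRESSION_THRESHOLD`) are illustrative and not part of any existing crate-python API:

```python
import gzip
import json

# Hypothetical threshold below which compression is skipped.
COMPRESSION_THRESHOLD = 1024  # bytes

def prepare_body(payload: dict) -> tuple[bytes, dict]:
    """Serialize a request payload, gzip-compressing it only when the
    serialized size exceeds the configured threshold."""
    raw = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json; charset=utf-8"}
    if len(raw) > COMPRESSION_THRESHOLD:
        headers["Content-Encoding"] = "gzip"
        return gzip.compress(raw), headers
    return raw, headers

# Small payloads pass through unchanged ...
small_body, small_headers = prepare_body({"stmt": "SELECT 1"})

# ... while large bulk payloads get compressed.
large_body, large_headers = prepare_body(
    {"stmt": "INSERT INTO t VALUES (?, ?)",
     "bulk_args": [[i, "test"] for i in range(10_000)]}
)
```

A helper like this would let the client advertise `Content-Encoding: gzip` only on requests where compression actually pays off.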

[!warning] This is primarily about request encoding / compression. HTTP response encoding is vulnerable to BREACH and therefore requires additional measures.


@proddata said:

It seems like CrateDB's HTTP interface accepts gzip / deflate compressed data. It might also be interesting to add this capability to crate-python.

@surister said:

import gzip
import json
import requests

# 200,000 bulk arguments make for a sizeable request body.
objects = [
    [1, "test"] for _ in range(200_000)
]

body = {
    "stmt": "INSERT INTO t VALUES (?, ?)",
    "bulk_args": objects
}

# Uncompressed request.
response = requests.post('http://192.168.88.251:4200/_sql', json=body)
print(response.request.headers.get('content-length'))

# The same request, gzip-compressed. The Content-Encoding header tells
# the server how to decode the body; the Content-Type still describes
# the decoded payload, so it stays application/json.
response = requests.post('http://192.168.88.251:4200/_sql',
                         data=gzip.compress(json.dumps(body).encode('utf8')),
                         headers={'Content-Encoding': 'gzip',
                                  'Content-Type': 'application/json; charset=utf-8'})
print(response.request.headers.get('content-length'))

Output:

2600054
5149

References

  • https://github.com/crate/crate/pull/17494

amotl avatar Feb 21 '25 19:02 amotl

@proddata: Thanks for your feedback. The requirements sound reasonable, provided we really need the ability to turn the feature off on demand.

Naming things

I know this driver uses a standard protocol, while most other low-level database drivers use different protocols, mostly binary or otherwise proprietary to their needs, where compression might be handled differently, or is enabled from the start.

To adhere to relevant naming conventions as well as possible when making the feature configurable as proposed, can we look into how others are doing it, most prominently to learn about their interfaces and parameter names?

a) How do other Python DBAPI libraries handle this situation and make the relevant parameters configurable (names, values, units), and b) in particular, how do PostgreSQL clients do it, both in Python land and beyond?

... of course, this only applies if compression is also a concern there. I think it might be, but I might also be too naive.

Rationale

How others are doing it?

In particular, I am not exclusively concerned about the DBAPI driver here, because it rarely has direct exposure beyond being used from applications. SQLAlchemy, however, is a different animal: its connection string is exposed to a wider audience of people and machines, who use it in downstream applications of many kinds, standalone or cloud-based, library-shaped or ephemerally hosted, or not. You name it.

In this spirit, we aim to standardize on naming conventions here,

  • https://github.com/crate/sqlalchemy-cratedb/issues/197

so I am asking to do the same, looking how others are naming their parameters, also for the compression feature.

amotl avatar Feb 25 '25 15:02 amotl

In the Elasticsearch Python client, HTTP compression is controlled by a simple on/off switch: Configuration Reference Implementation Details

However, it’s important to distinguish between the two aspects of compression support in the client: sending and receiving compressed data. Only the content itself is compressed, not the overall request structure. For example, a simple query like SELECT 1 results in approximately 250 bytes transmitted in the request body. Enabling compression in such cases provides little benefit in terms of data reduction but, in initial tests, introduced a slight increase in latency (a few milliseconds per request).

b) in particular, how PostgreSQL (clients) are doing it, both in Python lands, and beyond.

PostgreSQL does not support request compression in the same way. Neither the PostgreSQL wire protocol nor common client implementations (in Python or other languages) provide native compression mechanisms similar to Elasticsearch’s HTTP-based approach.

proddata avatar Feb 25 '25 16:02 proddata

Thanks, I've added the information about Elasticsearch to the table below.

With the PostgreSQL wire protocol, compression can be enabled by toggling the connection parameter sslcompression=1, provided your OpenSSL library supports zlib. It was enabled by default from its introduction in version 9.2, but that default was disabled again in version 11.

I have not been able to spot any threshold parameters other than with Oracle, up until now. Of course this enumeration is neither exhaustive, nor deep, and just tries to tap a little bit into the topic of proper "naming things", and "exploring the landscape".

| Database | Client docs | Parameter names (driver/server/SQL) | Parameter names (JDBC, ODBC) | Parameter names (SQLAlchemy) |
|---|---|---|---|---|
| Elasticsearch | Python client: HTTP compression | api argument: `http_compress=True` | | `http_compress` |
| MongoDB | MongoDB Network Compression: A Win-Win | api argument: `compressors='zstd'` | | n/a |
| MySQL | 6.2.8 Connection Compression Control | setting: `protocol_compression_algorithms`, cli: `--compression-algorithms`, cli: `--compress` | | `compress` |
| Oracle | Advanced Network Compression Whitepaper | `SQLNET.COMPRESSION`, `SQLNET.COMPRESSION_LEVELS`, `SQLNET.COMPRESSION_THRESHOLD` | | |
| PostgreSQL | 31.1. Database Connection Control Functions » 31.1.2. Parameter Key Words (via SO) | connection: `sslcompression=1`, env: `PGSSLCOMPRESSION=1` | | Did not discover anything. Maybe generic pass-through? |

amotl avatar Feb 26 '25 00:02 amotl

With the PostgreSQL wire protocol, compression can be enabled by toggling the connection parameter sslcompression=1, provided your OpenSSL library supports zlib. It was enabled by default from its introduction in version 9.2, but that default was disabled again in version 11.

TLS compression was removed in TLS 1.3 due to CRIME.


HTTP compression is somewhat vulnerable to BREACH, so it’s important to differentiate between request and response encoding. Ideally, a client should support both, but they should be managed with separate settings, as request compression generally doesn’t present the same risks. That is also partially why I only initially talked about request encoding.

proddata avatar Feb 26 '25 07:02 proddata
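The separation @proddata describes can be made concrete with a stdlib-only sketch: compress the request body while explicitly declining response compression via `Accept-Encoding: identity`. The endpoint URL is a placeholder, not part of any existing API:

```python
import gzip
import json
import urllib.request

body = gzip.compress(json.dumps({"stmt": "SELECT 1"}).encode("utf-8"))

# Request compression and response compression are independent knobs:
# Content-Encoding describes the body we send, while
# Accept-Encoding: identity asks the server not to compress its reply.
req = urllib.request.Request(
    "http://localhost:4200/_sql",  # placeholder endpoint
    data=body,
    headers={
        "Content-Type": "application/json; charset=utf-8",
        "Content-Encoding": "gzip",     # request compression: on
        "Accept-Encoding": "identity",  # response compression: off
    },
    method="POST",
)
```

Keeping the two directions on separate headers is what allows a client to offer request compression by default while leaving response compression opt-in, sidestepping the BREACH concern.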

I see. Thank you very much.

So, to use other implementations and their parameterizations as blueprints for "naming things", we eventually need to focus on databases that specifically use HTTP as a communication protocol rather than TLS-level (OpenSSL) compression, because those details (e.g. separate request vs. response compression parameters) will only be present, and of concern, in such environments.

In this case, selecting Elasticsearch is a perfect choice [^1]. However, the relevant parameter set seems pretty thin in this regard: http_compress neither provides the option to discriminate between request and response, nor conveys any threshold details. Maybe OpenSearch offers a better way to adjust those parameters?

[^1]: Can you think of any other database servers that use HTTP?

amotl avatar Feb 26 '25 14:02 amotl

OpenSearch Python Client

The OpenSearch Python Low-Level Client supports HTTP compression for request bodies:

http_compress = True  # Enables gzip compression for request bodies

Contrary to Elasticsearch, they mention only request bodies (I haven't checked the actual implementation).

ClickHouse Python Client

The ClickHouse Python Client (with limited SQLAlchemy support) also provides compression settings:

ClickHouse Java Client

The ClickHouse Java Client uses the HTTP interface and provides three compression-related settings:

| Setting | Description | Effect |
|---|---|---|
| `compressServerResponse(boolean enabled)` | `enabled`: flag that indicates if the option should be enabled | Sets whether the server should compress its responses. |
| `compressClientRequest(boolean enabled)` | `enabled`: flag that indicates if the option should be enabled | Sets whether the client should compress its requests. |
| `useHttpCompression(boolean enabled)` | `enabled`: flag that indicates if the option should be enabled | Enables HTTP compression for client/server communication if the corresponding options are enabled. |

ClickHouse JavaScript Client

The ClickHouse JavaScript Client differentiates between request and response compression:

proddata avatar Feb 26 '25 14:02 proddata

Thank you. What do you think about those parameter names, for both DB API's .connect() function, and SQLAlchemy's engine or query options?

  • compress_request=true vs. compress_response=true
  • compress_request_threshold=2k
  • compress_request_encoding= vs. compress_response_encoding= (one of lz4, zstd, br, gzip, deflate, when applicable)

amotl avatar Feb 28 '25 03:02 amotl
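The parameter names proposed above could be tried out in a hypothetical `connect()` signature; nothing here exists in crate-python today, and the threshold parser is purely illustrative:

```python
import re

# Hypothetical connect() signature using the proposed parameter names.
def connect(servers,
            compress_request=False,
            compress_response=False,
            compress_request_threshold="2k",
            compress_request_encoding="gzip"):
    """Parse and validate compression settings (sketch only)."""
    # Accept plain byte counts as well as "k" / "M" suffixed sizes.
    match = re.fullmatch(r"(\d+)([kM]?)", str(compress_request_threshold))
    if match is None:
        raise ValueError(f"invalid threshold: {compress_request_threshold!r}")
    factor = {"": 1, "k": 1024, "M": 1024 * 1024}[match.group(2)]
    return {
        "servers": servers,
        "compress_request": compress_request,
        "compress_response": compress_response,
        "compress_request_threshold": int(match.group(1)) * factor,
        "compress_request_encoding": compress_request_encoding,
    }

conn = connect("http://localhost:4200",
               compress_request=True,
               compress_request_threshold="2k")
```

One upside of this shape is that the request/response split and the threshold each remain independently tunable, matching the BREACH-driven separation discussed earlier in the thread.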

Since these settings primarily deal with content encoding, we might consider naming them accordingly and aligning with HTTP conventions, such as:

  • encoding_request='gzip' | 'none' | <any others in the future>
  • encoding_response='none' | 'gzip' | <any others in the future>

Semantically, this would also remove the need for a dedicated on/off parameter.

WDYT?

For the threshold, I think it would make sense to run some benchmarks—potentially, this parameter might not be needed after all.

proddata avatar Feb 28 '25 08:02 proddata

Hi. I would like to use a naming scheme that is, when possible, independent of the protocol used and not necessarily tied to it, focusing on the semantic meaning of the compress label like the other database vendors do, because everyone will understand what it is about, contrary to encoding.

Maybe let's get rid of request/response and use client/server instead, like ClickHouse does?

amotl avatar Feb 28 '25 09:02 amotl

Then maybe:

  • compress = 'request' | 'response' | 'both' / 'all' | 'none'
  • or compress = 'client' | 'server' | 'both' / 'all' | 'none'
    which in the future would allow for:
  • compress_encoding = 'gzip' | 'zstd' | 'lz4' | 'br' | 'deflate'

Or alternatively:

  • compress = True
  • compress_mode = 'request' | 'response' | 'both' | 'none'
    which in the future would allow for:

proddata avatar Feb 28 '25 09:02 proddata

Slightly OT: Discovered via a Dependabot update: it looks like Prometheus also introduced zstd compression support recently.

  • https://github.com/prometheus/client_golang/pull/1496
  • https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md#1220--2025-04-07

amotl avatar Apr 14 '25 07:04 amotl