python-driver icon indicating copy to clipboard operation
python-driver copied to clipboard

Connection to a ScyllaDB cluster is delayed as it tries with unsupported versions first

Open mykaul opened this issue 2 years ago • 40 comments

Looking at a pcap, the ScyllaDB Python driver tries to connect with versions we do not support - 66 (DSE_V2), 65 (DSE_V1), 5. That means 3 connections that are responded with an ERROR, until we succeed to connect. Each connection is of course the whole 3 way TCP conn, then closed, etc... cc @fruch (I've seen it in cqlsh, now I understand it's due to the Python driver defaults)

mykaul avatar Jul 20 '23 08:07 mykaul

We can (should?) also change:

SUPPORTED_VERSIONS = (66, 65, 6, 5, 4, 3, 2, 1)

To what we actually support.

mykaul avatar Jul 20 '23 11:07 mykaul

We can change the order in which protocols are tried, I'll try to get to it soon. However, I wanted to note that this doesn't really matter if there aren't many connections (like when using clqsh). Performance of connecting will matter if there are a lot of connections made at the same time. And it that case, protocol_version should be explicitly set when constructing Cluster object instead of relying on driver guessing the version - if our tests or clients don't do it, they should be instructed to. If someone for some reason needs connection setup performance while using cqlsh - it has a --protocol-version parameter which should be used in that case.

Lorak-mmk avatar Jul 20 '23 13:07 Lorak-mmk

We can (should?) also change:

SUPPORTED_VERSIONS = (66, 65, 6, 5, 4, 3, 2, 1)

To what we actually support.

We only test protocol v4 in CI, but I think all the others should be supported by driver (I may be wrong here) - which ones do you think should be removed from this list?

Lorak-mmk avatar Jul 20 '23 13:07 Lorak-mmk

We (ScyllaDB) do not support version >4, but I reckon you raise an interesting argument - what if someone wants to use our driver against both ScyllaDB and some DSE version... I don't think we test or should focus on the usecase. I prefer faster connection out of the box.

mykaul avatar Jul 20 '23 19:07 mykaul

We can change the order in which protocols are tried, I'll try to get to it soon. However, I wanted to note that this doesn't really matter if there aren't many connections (like when using clqsh). Performance of connecting will matter if there are a lot of connections made at the same time. And it that case, protocol_version should be explicitly set when constructing Cluster object instead of relying on driver guessing the version - if our tests or clients don't do it, they should be instructed to. If someone for some reason needs connection setup performance while using cqlsh - it has a --protocol-version parameter which should be used in that case.

Yes, I was lazy so used cqlsh, but the issue is in the Python driver (and perhaps others?) and I'm looking at the impact when hundreds or thousands of connection attempts are made. The argument about specifying the exact protocol version is valid, but assumes someone takes this into consideration in the first place.

mykaul avatar Jul 20 '23 19:07 mykaul

An easy solution is put the version that Scylla supports (4) as the first one in the long list.

But this change may break a guarantee made in the driver documentation https://docs.datastax.com/en/developer/python-driver/3.25/api/cassandra/cluster/:

protocol_version = 66 - The maximum version of the native protocol to use. If not set in the constructor, the driver will automatically downgrade version based on a negotiation with the server, but it is most efficient to set this to the maximum supported by your version of Cassandra. Setting this will also prevent conflicting versions negotiated if your cluster is upgraded.

This text suggests that the driver defaults to trying version 66 and then downgrading. If it starts with 4, it will never try anything higher. This text suggest that it's the user's responsibility to set the protocol_version parameter in the cluster (and as @Lorak-mmk said, we indeed do this in test/cql-pytest/util.py and other places).

We could, of course, change this documentation to explain we actually always use 4, and if you want a higher protocol, you should set it manually.

By the way, we have Scylla issues on how Scylla handles any protocol versions except 4 incorrectly :( See https://github.com/scylladb/scylladb/issues/9791 and https://github.com/scylladb/scylladb/issues/11899

nyh avatar Jul 20 '23 19:07 nyh

This text suggests that the driver defaults to trying version 66 and then downgrading. If it starts with 4, it will never try anything higher. This text suggest that it's the user's responsibility to set the protocol_version parameter in the cluster (and as @Lorak-mmk said, we indeed do this in test/cql-pytest/util.py and other places).

The above is not our driver's text. There's a good open question about ScyllaDB driver support for Cassandra latest versions. I assume 'not tested' (certainly not for their enterprise versions!) is an accurate statement, though in reality we are probably fine...

mykaul avatar Jul 25 '23 10:07 mykaul

The above is not our driver's text.

There's a separate argument why we maintain our own driver with random improvements (and not just Scylla-specific features), and whether Scylla users must use the Scylla verision of the driver for every language (do we even have a Scylla version of every language?). I guess this issue is not the place to hold this discussion.

There's a good open question about ScyllaDB driver support for Cassandra latest versions. I assume 'not tested' (certainly not for their enterprise versions!) is an accurate statement, though in reality we are probably fine...

I don't think anybody will seriously consider using Scylla's driver for running against Cassandra, so the answer to this question is not important.

nyh avatar Jul 25 '23 11:07 nyh

The above is not our driver's text. There's a good open question about ScyllaDB driver support for Cassandra latest versions. I assume 'not tested' (certainly not for their enterprise versions!) is an accurate statement, though in reality we are probably fine...

I don't think anybody will seriously consider using Scylla's driver for running against Cassandra, so the answer to this question is not important.

Please see: https://github.com/scylladb/python-driver/blob/master/docs/upgrading.rst I'll paste the relevant fragment below:

Since 3.21.0, scylla-driver fully supports DataStax products. dse-driver and
dse-graph users should now migrate to scylla-driver to benefit from latest bug fixes
and new features. The upgrade to this new unified driver version is straightforward
with no major API changes.

Changing the default version of the protocol would break backwards compatibility (because our docs say what version is the default, as @nyh noticed) and it would also break above promise.

There's a separate argument why we maintain our own driver with random improvements (and not just Scylla-specific features), and whether Scylla users must use the Scylla verision of the driver for every language (do we even have a Scylla version of every language?). I guess this issue is not the place to hold this discussion.

We don't - most notably, we don't have C# and Node.js drivers.

Lorak-mmk avatar Jul 25 '23 12:07 Lorak-mmk

@Lorak-mmk

the docs mostly copy-paste from upstream with search and replace for the driver name, we did removed lots of parts, this part can be removed as well.

I don't think we need to support or test DSE versions, if we are compatible with cassandra it's probably enough for our users.

Also using CQL v4, should be working with cassandra and DSE,

starting with the version we support best, sounds o.k., especially since user would be able to select anything else they might need, if then need any of the advanced features of those versions.

fruch avatar Jul 25 '23 12:07 fruch

If we are ok with not supporting DSE and dropping this promise from docs, then awesome - we can easily drop default protocol version to V5, and potentially remove a lot of DSE-specific code from the driver in the future. That would decrease failed attempts from 3 to 1 when connecting to Scylla without specifying protocol version. Dropping to V4 would still be a breaking change (as we want to be compatible with Cassandra), wouldn't it? Unless there are no new features there.

Lorak-mmk avatar Jul 25 '23 12:07 Lorak-mmk

@tzach , @annastuchlik - please see https://github.com/scylladb/python-driver/issues/244#issuecomment-1649731361 - this is probably a bug in our documentation.

mykaul avatar Jul 25 '23 13:07 mykaul

If we are ok with not supporting DSE and dropping this promise from docs, then awesome - we can easily drop default protocol version to V5, and potentially remove a lot of DSE-specific code from the driver in the future. That would decrease failed attempts from 3 to 1 when connecting to Scylla without specifying protocol version. Dropping to V4 would still be a breaking change (as we want to be compatible with Cassandra), wouldn't it? Unless there are no new features there.

The downside is that syncing with upstream changes are going to be more challenging - either you resolve conflicts, or you (re)apply patches after syncing (the latter is easier?) - but it becomes a process. I assume we have the former today, in some cases, anyway.

mykaul avatar Jul 25 '23 13:07 mykaul

This issue affects control connection, not any of subsequent connections. So if your client connects to a cluster which has 8 nodes, 32 shards each, then the control connection will have 3 failures, as described, but the rest of shard connections (256) should connect at first try. Initially after reading description I thought that each connection is affected, so I'm noting it here in case anyone else misunderstands.

Lorak-mmk avatar Aug 20 '23 19:08 Lorak-mmk

This issue affects control connection, not any of subsequent connections. So if your client connects to a cluster which has 8 nodes, 32 shards each, then the control connection will have 3 failures, as described, but the rest of shard connections (256) should connect at first try. Initially after reading description I thought that each connection is affected, so I'm noting it here in case anyone else misunderstands.

Isn't it per node given (the control connection)? Or once an initial connection is made, for the very 1st time, everything else works with the correct version, across nodes and shards? (I've tried with cqlsh against a single shard, single node docker instance...)

mykaul avatar Aug 21 '23 07:08 mykaul

(The change in cqlsh is https://github.com/scylladb/scylla-cqlsh/pull/45 )

mykaul avatar Aug 21 '23 07:08 mykaul

After reading the code I'm pretty sure it's once per Cluster object - so only one connection, not one per node, and if you create multiple Session objects from one Cluster, only the first one will encounter this issue. I'll verify this in Wireshark later.

Lorak-mmk avatar Aug 21 '23 12:08 Lorak-mmk

After reading the code I'm pretty sure it's once per Cluster object - so only one connection, not one per node, and if you create multiple Session objects from one Cluster, only the first one will encounter this issue. I'll verify this in Wireshark later.

If that's the case, it is really benign and not worth fixing (as it'll introduce conflict with upstream on every rebase and I've offered a fix for cqlsh already)

mykaul avatar Aug 21 '23 12:08 mykaul

Not worth handling, per the above discussion - closing.

mykaul avatar Sep 21 '23 13:09 mykaul

Perhaps we should fix it?

Image

mykaul avatar Jun 12 '25 11:06 mykaul

Perhaps we should fix it?

Image

I think we definitely should fix it if we want to mitigate connection storm effects.

https://github.com/scylladb/python-driver/issues/244#issuecomment-1643964340:

However, I wanted to note that this doesn't really matter if there aren't many connections (like when using clqsh).

Yes, but as above, drivers starting connecting with unsupported protocols makes connection storms worse.

ptrsmrn avatar Jun 18 '25 15:06 ptrsmrn

We definitely should fix it

dkropachev avatar Jun 19 '25 00:06 dkropachev

Per my discussion with @avikivity , we should remove DSE versions from our driver.

SUPPORTED_VERSIONS = (DSE_V2, DSE_V1, V6, V5, V4, V3, V2, V1)

Should be:

SUPPORTED_VERSIONS = (V5, V4, V3, V2, V1)

To start with.

mykaul avatar Jun 19 '25 11:06 mykaul

@nopzdk and I discussed also this morning that we should add to our documentation that customer should explicitly config in their app to use V4. Just like I did in https://github.com/scylladb/scylla-cqlsh/pull/45

mykaul avatar Jun 19 '25 11:06 mykaul

Per my discussion with @avikivity , we should remove DSE versions from our driver.

SUPPORTED_VERSIONS = (DSE_V2, DSE_V1, V6, V5, V4, V3, V2, V1) Should be:

SUPPORTED_VERSIONS = (V5, V4, V3, V2, V1) To start with.

Let's check if DSE is needed during migrations.

avikivity avatar Jun 19 '25 13:06 avikivity

@nopzdk and I discussed also this morning that we should add to our documentation that customer should explicitly config in their app to use V4. Just like I did in scylladb/scylla-cqlsh#45

This means that when we support v5, applications won't transparently upgrade.

avikivity avatar Jun 19 '25 13:06 avikivity

@nopzdk and I discussed also this morning that we should add to our documentation that customer should explicitly config in their app to use V4. Just like I did in scylladb/scylla-cqlsh#45

This means that when we support v5, applications won't transparently upgrade.

That too. I've set it now in the driver to v5.

mykaul avatar Jun 19 '25 14:06 mykaul

Per my discussion with @avikivity , we should remove DSE versions from our driver. SUPPORTED_VERSIONS = (DSE_V2, DSE_V1, V6, V5, V4, V3, V2, V1) Should be: SUPPORTED_VERSIONS = (V5, V4, V3, V2, V1) To start with.

Let's check if DSE is needed during migrations.

I'm not sure what this means.

mykaul avatar Jun 19 '25 14:06 mykaul

I think best fix would be on server side. Server have to report client what is the best protocol version.

We can address it on driver side as well, pull protocol version from the control connection, but it is going to make old protocol version stick on upgrade.

dkropachev avatar Jun 19 '25 14:06 dkropachev

I think proper fix would be on server side. Server have to report client what is the best protocol version.

How? Isn't the very first message sent by a client is its own protocol version? And then if it is DSE or V5, server sends a protocol error?

mykaul avatar Jun 19 '25 14:06 mykaul