Question on bulk queries
Hello!
(not sure if this is the right place to ask questions - feel free to close if not)

Going over some online resources on Scylla performance, there's a recurring theme:

- it may be beneficial to do bulk queries (using `SELECT ... IN` on reads and `BATCH` on writes) instead of doing more parallel queries.
- when doing bulk queries, it is better to avoid requiring coordination across multiple nodes (= avoid bulk queries that require data from different nodes).
Example resources that touch on this:

- On `SELECT ... IN` queries: https://www.scylladb.com/tech-talk/planning-queries-maximum-performance-scylla-summit-2017/ (a bit old, but I think it is still relevant?)
- On `BATCH` insert queries: https://www.scylladb.com/2019/03/27/best-practices-for-scylla-applications/
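For concreteness, the two bulk-query shapes in question look roughly like this with gocql - a sketch only, assuming an illustrative table `demo.kv (pk int PRIMARY KEY, v text)`:

```go
package bulk

import "github.com/gocql/gocql"

// bulkExamples shows the two shapes: a multi-key SELECT ... IN read and an
// unlogged BATCH of writes. The table demo.kv is assumed for illustration.
func bulkExamples(session *gocql.Session) error {
	keys := []int{1, 2, 3}

	// Bulk read: gocql binds a Go slice to the IN marker.
	iter := session.Query(`SELECT pk, v FROM demo.kv WHERE pk IN ?`, keys).Iter()
	var (
		pk int
		v  string
	)
	for iter.Scan(&pk, &v) {
		// ... use pk and v ...
	}
	if err := iter.Close(); err != nil {
		return err
	}

	// Bulk write: an unlogged batch of inserts.
	b := session.NewBatch(gocql.UnloggedBatch)
	for _, k := range keys {
		b.Query(`INSERT INTO demo.kv (pk, v) VALUES (?, ?)`, k, "value")
	}
	return session.ExecuteBatch(b)
}
```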
Now, the application I'm working on needs to do large bulk reads and writes, so I want to make sure I'm not adding artificial bottlenecks.
With that in mind, I'm a bit puzzled about how to implement the above recommendations with `gocql`. Ideally the driver would let me split my bulk queries by node, but there does not seem to be a way to do so (?)
It seems the closest I can get with `gocql` is to have my application split bulk queries by partition key. But if I'm trying to read, say, 200K rows with all-different partition keys, every bulk query degenerates into 200K single-row queries - so in practice I can't use bulk queries at all, which may be detrimental to my application's performance.
Given that the driver has access to the information required to split queries intelligently, I was wondering why there is no such facility. I'm not sure whether I'm misunderstanding how best to use the driver, or whether the driver is currently missing this feature. It would admittedly require the driver to expose a rather odd-looking API (the current gocql API is quite abstracted away from such concerns), but that seems necessary to make bulk queries as efficient as they can be.
I'm completely new to the Scylla ecosystem so apologies if my question seems odd. Advice welcome, and sorry if I missed something obvious :)
Thanks!
Yep, I'm not aware of any public API to determine which node (or, in the case of Scylla, which shard) a partition key maps to.
We just split our queries by partition key, although in our case we don't need to read 200K specific partition keys at once, and we read more than one row per partition.
One advantage of splitting by partition key is that each per-partition query can return an error separately (e.g. in case some partition takes long to process), though that might be irrelevant if you only have one row per partition.
It would definitely be interesting to see a benchmark comparing the options.
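For readers landing here, a minimal sketch of that "split by partition key" approach (table and column names are illustrative; the session would typically be built with `gocql.TokenAwareHostPolicy` so each single-partition query is routed to a replica that owns the key):

```go
package bulk

import (
	"sync"

	"github.com/gocql/gocql"
)

// fetchConcurrently splits a bulk read into one single-partition query per
// key and runs them on a bounded worker pool. Each key can fail separately;
// failures are collected and returned per key.
func fetchConcurrently(session *gocql.Session, keys []int, workers int) map[int]error {
	var (
		jobs = make(chan int)
		mu   sync.Mutex
		errs = make(map[int]error)
		wg   sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for k := range jobs {
				var v string
				if err := session.Query(`SELECT v FROM demo.kv WHERE pk = ?`, k).Scan(&v); err != nil {
					mu.Lock()
					errs[k] = err
					mu.Unlock()
				}
				// ... use v ...
			}
		}()
	}
	for _, k := range keys {
		jobs <- k
	}
	close(jobs)
	wg.Wait()
	return errs
}
```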
I too would like the ability to batch my data according to which node/shard it is going to. Any new information regarding this?
Same problem here. I extended the gocql driver with a function that helps me split keys into shard-aware batches. Maybe it will be helpful: https://github.com/scylladb/gocql/pull/164
Did you benchmark your approach? For what it's worth, I tested my application with similar changes and found that in my case it didn't help. I stopped using `SELECT ... IN` and `BATCH` entirely. ScyllaDB is so good at handling concurrent queries that I found no compelling reason to extend the driver code and have to maintain that.

I suspect that ScyllaDB is less efficient at load balancing queries that vary in size than it is at handling very many concurrent queries. That was 3 years ago, though.
It really depends on the kind of load. ScyllaDB is really good at handling concurrent queries, but when we have 1.5M rps with 100-500 different keys (from different partitions) per insert or read query, concurrent single-key queries are not efficient.
We did several experiments and found that ScyllaDB can insert big `BATCH`es efficiently if all keys in the batch map to a single host (so the strategy is host-aware), and that `SELECT ... IN` queries are efficient when they contain at most 10 keys mapping to a single shard (so the strategy is shard-aware). That's why I wrote this extension API and use it in our application.
You should also understand that ScyllaDB is really good at parallelism, so it's important to issue queries concurrently in your application in order to ensure maximum utilization of ScyllaDB.
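To make the grouping strategy concrete, here is a hedged sketch: `shardOf` below is a hypothetical stand-in for whatever key-to-shard mapping the extension API provides (it is not part of the public gocql API), and the chunk size of 10 comes from the experiments described above.

```go
package bulk

// groupForShardAwareIN buckets keys by shard, then chunks each bucket into
// groups of at most maxPerQuery keys; each group becomes one SELECT ... IN
// query. shardOf is a hypothetical stand-in for a driver-provided
// key-to-shard mapping - it does not exist in upstream gocql.
func groupForShardAwareIN(keys []int, shardOf func(int) int, maxPerQuery int) [][]int {
	byShard := make(map[int][]int)
	for _, k := range keys {
		s := shardOf(k)
		byShard[s] = append(byShard[s], k)
	}
	var groups [][]int
	for _, bucket := range byShard {
		for len(bucket) > maxPerQuery {
			groups = append(groups, bucket[:maxPerQuery])
			bucket = bucket[maxPerQuery:]
		}
		groups = append(groups, bucket)
	}
	return groups
}
```

Each resulting group can then be issued concurrently as its own `SELECT ... IN ?` query.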
@moguchev - how do you handle individual query failures?
For insert queries we have a background retry queue, so we hope that failed queries will eventually be retried successfully (and of course we monitor such queries).
For real-time read queries there is nothing we can do about individual failed queries, so we degrade functionality and notify clients that they can re-request some values. This is a trade-off; in some applications it cannot be done. We are lucky that we can afford this strategy.
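For what it's worth, a minimal sketch of what such a background retry queue could look like (the fixed delay, attempt limit, and drop-on-full behavior are all illustrative assumptions, not necessarily what is described above):

```go
package bulk

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

// failedInsert is an insert that failed and is waiting to be retried.
type failedInsert struct {
	stmt     string
	args     []interface{}
	attempts int
}

// runRetryQueue re-executes failed inserts with a fixed delay, giving up
// after maxAttempts. The queue should be buffered so requeueing does not
// block; full-queue handling here just drops and logs.
func runRetryQueue(session *gocql.Session, queue chan failedInsert, maxAttempts int) {
	for f := range queue {
		time.Sleep(time.Second) // crude fixed backoff between attempts
		err := session.Query(f.stmt, f.args...).Exec()
		if err == nil {
			continue
		}
		f.attempts++
		if f.attempts >= maxAttempts {
			log.Printf("giving up on %q after %d attempts: %v", f.stmt, f.attempts, err)
			continue
		}
		select {
		case queue <- f: // requeue for another attempt
		default:
			log.Printf("retry queue full, dropping %q: %v", f.stmt, err)
		}
	}
}
```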
I would suggest closing this issue in favor of https://github.com/scylladb/gocql/issues/200