perf: Optimize metadata records processing in `SqlStorageClient`

Open Mantisus opened this issue 1 month ago • 4 comments

Description

  • This PR adds new buffer tables to improve the handling of metadata records. The key change is that metadata updates are now accumulated in a buffer table and applied only when `get_metadata` is called (sketched below). With the old behavior, metadata records were updated immediately inside a transaction, which meant that under high concurrency writers spent time waiting for metadata row locks to be released.
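
A minimal sketch of the buffering idea, assuming hypothetical `dataset_metadata` and `dataset_metadata_buffer` tables; the actual crawlee-python schema and storage-client code differ:

```python
from sqlalchemy import (
    Column,
    Integer,
    MetaData,
    String,
    Table,
    create_engine,
    delete,
    func,
    insert,
    select,
    update,
)

metadata_obj = MetaData()

# The main metadata record; under the buffered scheme it is only
# brought up to date when get_metadata() is called.
dataset_metadata = Table(
    "dataset_metadata",
    metadata_obj,
    Column("id", String, primary_key=True),
    Column("item_count", Integer, nullable=False, server_default="0"),
)

# The buffer: one append-only row per increment, so concurrent writers
# never contend on the metadata row itself.
metadata_buffer = Table(
    "dataset_metadata_buffer",
    metadata_obj,
    Column("buffer_id", Integer, primary_key=True, autoincrement=True),
    Column("dataset_id", String, nullable=False),
    Column("delta", Integer, nullable=False),
)

engine = create_engine("sqlite:///:memory:")
metadata_obj.create_all(engine)


def record_increment(dataset_id: str, delta: int = 1) -> None:
    """Record a metadata change as a cheap append instead of an in-place update."""
    with engine.begin() as conn:
        conn.execute(insert(metadata_buffer).values(dataset_id=dataset_id, delta=delta))


def get_metadata(dataset_id: str) -> int:
    """Compact buffered increments into the metadata row, then read it.

    Assumes the metadata row for `dataset_id` already exists.
    """
    with engine.begin() as conn:
        # Aggregate all pending increments for this dataset.
        pending = conn.execute(
            select(func.coalesce(func.sum(metadata_buffer.c.delta), 0)).where(
                metadata_buffer.c.dataset_id == dataset_id
            )
        ).scalar_one()
        if pending:
            # Apply the aggregate once, then drain the buffer.
            conn.execute(
                update(dataset_metadata)
                .where(dataset_metadata.c.id == dataset_id)
                .values(item_count=dataset_metadata.c.item_count + pending)
            )
            conn.execute(
                delete(metadata_buffer).where(metadata_buffer.c.dataset_id == dataset_id)
            )
        return conn.execute(
            select(dataset_metadata.c.item_count).where(dataset_metadata.c.id == dataset_id)
        ).scalar_one()
```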

Issues

  • Closes: #1533

Mantisus avatar Nov 11 '25 14:11 Mantisus

Interesting! I'd imagine that transactions consisting of, e.g., an insert into the dataset_items table and an update to the dataset metadata wouldn't lock the metadata table for that long, since you can commit right after the metadata update.

Also, the buffering approach is faster because the buffer table gets a row for each increment and those get compacted later on, correct?

janbuchar avatar Nov 18 '25 09:11 janbuchar

> update to dataset metadata wouldn't lock the metadata table for that long

They create many short-lived locks, and with many highly concurrent clients inserting new records, the effect accumulates. This is exactly what @ericvg97 pointed out: https://github.com/apify/crawlee-python/issues/1533#issuecomment-3491525019

Although, of course, the strongest impact is on the RequestQueue.

Yes. Inserts into the buffer table are quite fast, and we can then simply apply the aggregated result to update the metadata record (see the sketch below).
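
For illustration, the compaction could even be a single aggregate statement instead of a read-then-write, again using the hypothetical table names from the sketch above (the correlated-subquery syntax works in SQLite and PostgreSQL):

```python
from sqlalchemy import text

# Fold all buffered increments for one dataset into the metadata row
# in a single statement.
COMPACT_SQL = text(
    "UPDATE dataset_metadata "
    "SET item_count = item_count + ("
    "  SELECT COALESCE(SUM(delta), 0) "
    "  FROM dataset_metadata_buffer "
    "  WHERE dataset_id = dataset_metadata.id"
    ") "
    "WHERE id = :dataset_id"
)
```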

Mantisus avatar Nov 18 '25 12:11 Mantisus

> > update to dataset metadata wouldn't lock the metadata table for that long
>
> They create many short-lived locks, and with many highly concurrent clients inserting new records, the effect accumulates. This is exactly what @ericvg97 pointed out: #1533 (comment)
>
> Although, of course, the strongest impact is on the RequestQueue.

I see, thanks. And is there any chance that the lock is held for too long because of how we work with SQLAlchemy? In other words, would it be better if we just executed SQL such as `insert ...; update ...; commit` in one go (see the sketch below)? If so, it might be worth trying that before adding three new tables to the whole thing.
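
A minimal sketch of what this alternative might look like, with assumed table and column names rather than the actual crawlee-python schema:

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite:///:memory:")


def push_item(dataset_id: str, item_json: str) -> None:
    """Insert an item and bump the metadata counter in one short transaction."""
    with engine.begin() as conn:  # single transaction, committed on exit
        conn.execute(
            text("INSERT INTO dataset_items (dataset_id, data) VALUES (:d, :j)"),
            {"d": dataset_id, "j": item_json},
        )
        conn.execute(
            text("UPDATE dataset_metadata SET item_count = item_count + 1 WHERE id = :d"),
            {"d": dataset_id},
        )
        # The transaction commits when the block exits, immediately after
        # the UPDATE, so the metadata row lock is held only briefly.
```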

janbuchar avatar Nov 19 '25 14:11 janbuchar

> it might be worth trying that before adding three new tables to the whole thing.

I will test this approach.

Mantisus avatar Nov 19 '25 15:11 Mantisus