crawlee-python

Updating request queue metadata performs a full table scan in SQL storage

Open · ericvg97 opened this issue 1 month ago • 9 comments

Updating the metadata runs this subquery (as part of a larger update query), which performs a full table scan, making all my crawlers much slower (and causing tasks to time out):

SELECT count(*) FROM request_queue_records
WHERE request_queue_id = 'AO2Uinca6RHO1Mo6X'
AND is_handled IS true
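
For anyone reproducing this: the sequential scan shows up directly in the query plan. A quick check, assuming a PostgreSQL backend (which the public schema and btree index mentioned below suggest):

-- look for "Seq Scan on request_queue_records" in the plan output
EXPLAIN ANALYZE
SELECT count(*) FROM request_queue_records
WHERE request_queue_id = 'AO2Uinca6RHO1Mo6X'
  AND is_handled IS true;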

The same happens with the other two subqueries:

SELECT count(*) FROM request_queue_records
WHERE request_queue_id = 'AO2Uinca6RHO1Mo6X'
  AND is_handled IS false

SELECT count(*) FROM request_queue_records
WHERE request_queue_id = 'AO2Uinca6RHO1Mo6X'

This is a bottleneck: all crawler instances want to update this row at the same time, and the query itself is slow. Besides adding an index, which would make the query much faster, is there an option to not update this metadata at all? Is it used for something?

ericvg97 · Nov 05 '25 12:11

I added this index and the table scan is gone:

CREATE INDEX idx_request_queue_records_queueid_handled ON public.request_queue_records USING btree (request_queue_id, is_handled)
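
If the table is already large and crawlers keep writing to it, a non-blocking build may be safer. A sketch, assuming PostgreSQL (note that CONCURRENTLY cannot run inside a transaction block):

-- builds the same index without taking a write lock on the table
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_request_queue_records_queueid_handled
    ON public.request_queue_records USING btree (request_queue_id, is_handled);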

although (having no idea what the metadata is used for) I would be super interested in deleting this query altogether if it is only for analytics purposes

ericvg97 · Nov 05 '25 12:11

Hi @ericvg97, thanks for catching that, and for your research into the issue.

Unfortunately, at this stage, I don't see a way to remove this query completely, although the metadata record is indeed a bottleneck under concurrent access.

Metadata is part of the public storage API and also acts as a queue state synchronizer between clients.

This query synchronizes the metadata, preventing the inconsistent state that can otherwise occur in certain cases.

Mantisus · Nov 05 '25 13:11

Oh, I see. And what do you think about adding the index?

ericvg97 · Nov 05 '25 13:11

Yes, of course. I will test the queue with the index, and if I don't see any potential problems, I will add it or something similar. I will also see if we can run this query less frequently.

Mantisus · Nov 05 '25 13:11

Perfect, thanks! I am also having a similar problem with datasets. Adding a new entry triggers an update in the datasets table (metadata only), and it also blocks my crawlers... Any suggestions? I was thinking of creating multiple datasets, but I am not saving different types of data, just the URLs I visit.

ericvg97 · Nov 05 '25 14:11

I will consider optimizing metadata record updates. This is an issue for both Dataset and RequestQueue, as both have counters that require updating.
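
One common pattern for such counters, shown purely as a sketch with illustrative table and column names (not crawlee's actual schema), is to apply deltas when a request is handled instead of recounting the whole table:

-- hypothetical delta-based metadata update: O(1) per handled request instead of a full recount
UPDATE request_queues
   SET handled_request_count = handled_request_count + 1,
       pending_request_count = pending_request_count - 1,
       modified_at = now()
 WHERE id = 'AO2Uinca6RHO1Mo6X';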

Mantisus · Nov 05 '25 16:11

What would you think about calculating these values for RequestQueue on demand? Is it possible that these counts are written much more often than they are read? The change could be minimal: just drop the counts from the request_queue table and run the query in the get_data or get_metadata method instead of when adding requests to (or retrieving them from) the queue. (See the sketch below.)

(For Datasets I understand it is not that easy, because modified_at can't be inferred from the dataset records table, but we could store a timestamp in the DatasetRecords table and compute modified_at at query time as well.)
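
A sketch of what the on-demand read could look like, assuming PostgreSQL; the aliases are illustrative, and dataset_records with a per-record created_at timestamp is a hypothetical schema for the dataset case:

-- all three queue counters in one indexed pass (FILTER is PostgreSQL syntax)
SELECT count(*)                               AS total_request_count,
       count(*) FILTER (WHERE is_handled)     AS handled_request_count,
       count(*) FILTER (WHERE NOT is_handled) AS pending_request_count
FROM request_queue_records
WHERE request_queue_id = 'AO2Uinca6RHO1Mo6X';

-- dataset modified_at computed at read time from a stored per-record timestamp
SELECT count(*) AS item_count, max(created_at) AS modified_at
FROM dataset_records
WHERE dataset_id = '<dataset_id>';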

ericvg97 · Nov 06 '25 08:11

I am considering a slightly larger update that will change the approach to working with metadata records. I am still implementing it, and it will require preliminary testing to make sure it actually improves the situation.

Mantisus · Nov 06 '25 09:11

Thanks!! We can be beta testers if you need it.

ericvg97 · Nov 06 '25 09:11