
Concurrency issue with work allocation

Open hoyla opened this issue 3 months ago • 1 comment

It seems that when several workers are triggered at the same time, or only a few milliseconds apart, they can end up picking up the same task as each other.

This Kibana query shows all three workers locking the same task on blob on4SvXDmv0xdd5C3CBsvBqgOdy2Dv35JppouasncS0BzV7LNH-NUUC24tSIqnCOmUbOFvrX_SZxH4bArIx2vsA:

http://localhost:5601/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15d,to:now))&_a=(columns:!(message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:d2720940-0e30-11eb-a809-5706b391daea,key:app,negate:!f,params:(query:pfi-worker),type:phrase),query:(match_phrase:(app:pfi-worker)))),index:d2720940-0e30-11eb-a809-5706b391daea,interval:auto,query:(language:kuery,query:%22on4SvXDmv0xdd5C3CBsvBqgOdy2Dv35JppouasncS0BzV7LNH-NUUC24tSIqnCOmUbOFvrX_SZxH4bArIx2vsA%22),sort:!(!('@timestamp',desc)))

Running match (b: Blob {uri: "on4SvXDmv0xdd5C3CBsvBqgOdy2Dv35JppouasncS0BzV7LNH-NUUC24tSIqnCOmUbOFvrX_SZxH4bArIx2vsA"}) return b shows the blob is locked by all three instances. The workers aren't fetching new tasks to complete.
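
To see exactly who holds the lock, a query along these lines lists every relationship attached to the blob and the node at the other end - the lock relationship type and the connection details below are assumptions, so treat this as a sketch rather than the exact schema. For example, via the Java driver:

import org.neo4j.driver.{AuthTokens, GraphDatabase, Values}

object BlobLockInspection extends App {
  // Local connection details - assumptions for illustration only.
  val driver = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "password"))
  val session = driver.session()

  // Return every relationship touching the blob and whatever is on the other end;
  // the lock relationships created by each worker should be among them.
  val result = session.run(
    "MATCH (locker)-[lock]-(b:Blob {uri: $uri}) RETURN type(lock) AS lockType, locker",
    Values.parameters("uri", "on4SvXDmv0xdd5C3CBsvBqgOdy2Dv35JppouasncS0BzV7LNH-NUUC24tSIqnCOmUbOFvrX_SZxH4bArIx2vsA")
  )
  result.forEachRemaining(record => println(record))

  session.close()
  driver.close()
}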

This doesn’t seem to be causing anything to break, but it is affecting Giant’s performance and it’s worth thinking about how to fix it. After some discussion, we think possible ways to improve this could be:

  • Using a messaging system such as a Kinesis stream
  • Moving the work allocation away from neo4j and into postgres (see the sketch after this list)
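
On the postgres idea: the usual pattern there is a single claim query with FOR UPDATE SKIP LOCKED, so a row already claimed by another worker's open transaction gets skipped rather than handed out twice. A rough JDBC sketch - the work_items table and its columns are made up purely for illustration:

import java.sql.DriverManager

object PostgresWorkClaimSketch extends App {
  // Connection string and schema are illustrative assumptions.
  val conn = DriverManager.getConnection("jdbc:postgresql://localhost:5432/giant", "giant", "giant")
  conn.setAutoCommit(false)

  // SKIP LOCKED makes concurrent workers skip rows another open transaction has
  // already selected FOR UPDATE, so two workers cannot claim the same row.
  val claim = conn.prepareStatement(
    """UPDATE work_items
      |   SET locked_by = ?, locked_at = now()
      | WHERE id = (
      |   SELECT id FROM work_items
      |    WHERE locked_by IS NULL
      |    ORDER BY created_at
      |    LIMIT 1
      |    FOR UPDATE SKIP LOCKED)
      | RETURNING id""".stripMargin)
  claim.setString(1, "worker-1")

  val rs = claim.executeQuery()
  if (rs.next()) println(s"claimed work item ${rs.getLong("id")}")
  else println("no unclaimed work right now")

  conn.commit()
  conn.close()
}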

hoyla - Sep 26 '25 11:09

There’s https://github.com/softwaremill/elasticmq which could let you unify everything around SQS, running that for local dev/offline
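
To make that concrete, here's a minimal sketch of a worker loop pointed at a local ElasticMQ instance - the queue name, endpoint, region and credentials are all assumptions (9324 is ElasticMQ's default SQS port, and the queue URL shape varies a bit between ElasticMQ versions):

import java.net.URI
import scala.jdk.CollectionConverters._
import software.amazon.awssdk.auth.credentials.{AwsBasicCredentials, StaticCredentialsProvider}
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.sqs.SqsClient
import software.amazon.awssdk.services.sqs.model.{DeleteMessageRequest, ReceiveMessageRequest}

object ElasticMqWorkerSketch extends App {
  // The standard SQS client pointed at ElasticMQ running locally.
  val sqs = SqsClient.builder()
    .endpointOverride(URI.create("http://localhost:9324"))
    .region(Region.EU_WEST_1)
    .credentialsProvider(StaticCredentialsProvider.create(AwsBasicCredentials.create("x", "x")))
    .build()

  val queueUrl = "http://localhost:9324/000000000000/giant-work" // hypothetical queue

  // A received message is hidden from other consumers for the visibility timeout,
  // which takes over the job the neo4j lock does today; deleting it marks the work done.
  val messages = sqs.receiveMessage(
    ReceiveMessageRequest.builder()
      .queueUrl(queueUrl)
      .maxNumberOfMessages(1)
      .waitTimeSeconds(20)
      .build()
  ).messages().asScala

  messages.foreach { msg =>
    println(s"processing ${msg.body()}")
    sqs.deleteMessage(DeleteMessageRequest.builder()
      .queueUrl(queueUrl)
      .receiptHandle(msg.receiptHandle())
      .build())
  }
}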

mbarton - Oct 04 '25 11:10

I did a bit of digging into this today. During the last 24 hours - a busy period for giant - giant workers "found work" 2600 times. In 214 instances, two or more workers "found work" in the same second, with the same number of assignments. In 12 instances this happened to 3 or more workers.

It is possible that the workers got different assignments - I haven't dug that far - but based on the errors I'm seeing in the logs and the resulting chaos in neo4j, I think we can assume that in a lot of these instances it is the same work that is getting picked up.

fetchWork fetches and locks the work in a single query, so I am assuming the only way workers can end up working on the same thing is if they both hit neo4j at exactly the same time, find the same work and lock the same work.
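
For illustration, a relationship-based fetch-and-lock of roughly this shape - a hypothetical reconstruction, not the real fetchWork, and the Worker/LOCKED_BY names are invented - shows why that race is possible: the MATCH/WHERE read doesn't block anyone, so two workers running it at the same instant can both see the blob as unlocked and both go on to create a lock relationship.

object FetchWorkShape {
  // Hypothetical single-query fetch-and-lock, purely to illustrate the race.
  // The read in the WHERE clause takes no locks, so under read committed two
  // concurrent transactions can both pass it before either CREATE runs.
  val lockQuery: String =
    """MATCH (b:Blob)
      |WHERE NOT (b)<-[:LOCKED_BY]-()
      |WITH b LIMIT 1
      |MATCH (w:Worker {name: $worker})
      |CREATE (w)-[:LOCKED_BY]->(b)
      |RETURN b.uri AS uri""".stripMargin
}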

I don't think it's possible to put a uniqueness constraint on relationships in neo4j (ideally it would be nice for fetchWork to fail if another worker has locked the same blobs).

I think this is a problem specific to the 'read committed' nature of neo4j concurrency:

A transaction that reads a node/relationship does not block another transaction from writing to that node/relationship before the first transaction finishes. This type of isolation is weaker than serializable isolation level but offers significant performance advantages while being sufficient for the overwhelming majority of cases. (from here - note that this is for a recent version of neo4j)

Solutions:

  • Switch to a queue system instead of using the database! Probably the best option, supported by @mbarton above
  • Make the existing locking mechanism better... I had a nice chat with an LLM about this - I think that if we used a property rather than a relationship to lock the work, the transaction would fail and roll back if two workers tried to set the property - something like this:
MATCH (b:Blob {uri: $uri})
WHERE b.lockedBy IS NULL
SET b.lockedBy = $workerName
RETURN b
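
If we went that route, the worker side only needs to check whether the query returned a row; a rough sketch with the Java driver, where the connection details and parameter values are assumptions:

import org.neo4j.driver.{AuthTokens, GraphDatabase, Values}

object PropertyLockSketch extends App {
  // Local connection details - assumptions for illustration.
  val driver = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "password"))
  val session = driver.session()

  val result = session.run(
    """MATCH (b:Blob {uri: $uri})
      |WHERE b.lockedBy IS NULL
      |SET b.lockedBy = $workerName
      |RETURN b.uri AS uri""".stripMargin,
    Values.parameters("uri", "some-blob-uri", "workerName", "worker-1")
  )

  // If another worker has already set lockedBy, the WHERE clause filters the blob
  // out and no row comes back, so an empty result means the lock wasn't acquired.
  if (result.hasNext()) println("lock acquired, start work")
  else println("blob already locked by someone else")

  session.close()
  driver.close()
}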

I think the above would be a cheap option compared to diving into SQS, but given the success we've had with the SQS-based, infinitely scalable 'ocr blaster', it would be nice to move towards a solution that doesn't put a single db instance in the middle of all those squabbling workers.

philmcmahon - Dec 19 '25 12:12