
[Bug][gitextractor] Deadlocks when collecting commits

Open pimguilherme opened this issue 1 year ago • 21 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

I am running into deadlocks when trying to execute two parallel pipelines for distinct repositories.

This is the error I encounter (screenshot attached).

What do you expect to happen

There should be no deadlocks, or the system should be resilient and retry the transactions a few times (perhaps a configurable number of attempts?)

I have found the following issue describing this same behavior: https://github.com/apache/incubator-devlake/issues/5819

Is it expected that we disable commit file extraction to avoid this error? I just want to confirm that we would not be missing any relevant metrics with this setting.

How to reproduce

Trigger simultaneous pipelines on repositories with over 20k commits each. (We were able to trigger this with 2 parallel pipelines on a db.t3.micro and with 3 on a db.m5.xlarge.)

Anything else

No response

Version

v0.18.0, v0.19.0, v0.20.0

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

pimguilherme avatar Mar 04 '24 18:03 pimguilherme

UPDATE: I tried using SKIP_COMMIT_FILES and although it seems faster, I still observe deadlocks 🤔

I have also hit deadlocks in another plugin:

time="2024-03-04 21:43:16" level=info msg=" [pipeline service] [pipeline #28] [task #215] [CollectPr] existing data parser return ErrFinishCollect, but skip. rawId: #787"
time="2024-03-04 21:43:17" level=info msg=" [pipeline service] [pipeline #28] [task #215] [CollectPr] finished records: 92"
time="2024-03-04 21:43:21" level=error msg=" [pipeline service] [pipeline #28] [task #215] [CollectPr] ended Graphql collector execution with error
	caused by: 
	combined messages: 
	{
		error adding result to batch (500)
		=====================
		error adding result to batch (500)
	}"
time="2024-03-04 21:43:21" level=error msg=" [pipeline service] [pipeline #28] [task #215] [CollectPr] the first error of them
	caused by: error adding result to batch (500)
	Wraps: (2) Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction (500)
	Wraps: (3) Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
	Error types: (1) *hintdetail.withDetail (2) *hintdetail.withDetail (3) *mysql.MySQLError"
time="2024-03-04 21:43:21" level=error msg=" [pipeline service] [pipeline #28] [task #215] subtask CollectPr ended unexpectedly
	Wraps: (2)
	  | combined messages: 
	  | {
	  | 	error adding result to batch (500)
	  | 	=====================
	  | 	error adding result to batch (500)
	  | }
	Error types: (1) *hintdetail.withDetail (2) *errors.errorString"

This was running 0.18.0 on AWS (an RDS MySQL database with 2 vCPUs + 1 GB, and a 3 CPU + 3 GB pod), trying 3 parallel pipelines (screenshot attached).

pimguilherme avatar Mar 04 '24 21:03 pimguilherme

this is the latest lock information from InnoDB:

------------------------
LATEST DETECTED DEADLOCK
------------------------
2024-03-06 09:28:06 22494182373120
*** (1) TRANSACTION:
TRANSACTION 9964411, ACTIVE 1 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 1128, 2 row lock(s), undo log entries 5
MySQL thread id 354, OS thread handle 22494692480768, query id 995814 10.172.34.108 devlake update
INSERT INTO `commit_parents` (`created_at`,`updated_at`,`_raw_data_params`,`_raw_data_table`,`_raw_data_id`,`_raw_data_remark`,`commit_sha`,`parent_commit_sha`) VALUES ('2024-03-06 09:28:05.308','2024-03-06 09:28:05.308','github:GithubRepo:1:189630055','gitextractor',0,'','7b2bc5000669d9b3fd5d231a581cc52271e36f4b','a4b384bf1c0f675739d075a146607518c1ce1457'),('2024-03-06 09:28:05.308','2024-03-06 09:28:05.308','github:GithubRepo:1:189630055','gitextractor',0,'','a4b384bf1c0f675739d075a146607518c1ce1457','8eadbeed5aee0e7b7c0ee0dba4dab06059950416'),('2024-03-06 09:28:05.308','2024-03-06 09:28:05.308','github:GithubRepo:1:189630055','gitextractor',0,'','8eadbeed5aee0e7b7c0ee0dba4dab06059950416','4ed72106c022a15b53adfa259510f0dcbc78ac99'),('2024-03-06 09:28:05.308','2024-03-06 09:28:05.308','github:GithubRepo:1:189630055','gitextractor',0,'','8eadbeed5aee0e7b7c0ee0dba4dab06059950416','6bf4ac71efaaf5407a9a9c8f7f27200

*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 2340 page no 783 n bits 160 index PRIMARY of table `lake`.`commit_parents` trx id 9964411 lock_mode X locks rec but not gap
Record lock, heap no 83 PHYSICAL RECORD: n_fields 10; compact format; info bits 64
 0: len 30; hex 366266346163373165666161663534303761396139633866376632373230; asc 6bf4ac71efaaf5407a9a9c8f7f2720; (total 40 bytes);
 1: len 30; hex 346564373231303663303232613135623533616466613235393531306630; asc 4ed72106c022a15b53adfa259510f0; (total 40 bytes);
 2: len 6; hex 000000980b7b; asc      {;;
 3: len 7; hex 81000036b50278; asc    6  x;;
 4: len 7; hex 99b2cc97050c08; asc        ;;
 5: len 7; hex 99b2cc97050c08; asc        ;;
 6: len 29; hex 6769746875623a4769746875625265706f3a313a313839363330303535; asc github:GithubRepo:1:189630055;;
 7: len 12; hex 676974657874726163746f72; asc gitextractor;;
 8: len 8; hex 0000000000000000; asc         ;;
 9: len 0; hex ; asc ;;


*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2340 page no 1944 n bits 168 index PRIMARY of table `lake`.`commit_parents` trx id 9964411 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 14 PHYSICAL RECORD: n_fields 10; compact format; info bits 96
 0: len 30; hex 346564383363313161326163326630383431383039323835633161363065; asc 4ed83c11a2ac2f0841809285c1a60e; (total 40 bytes);
 1: len 30; hex 643464323931376263313931353639313063346336323335373636656162; asc d4d2917bc19156910c4c6235766eab; (total 40 bytes);
 2: len 6; hex 000000980b69; asc      i;;
 3: len 7; hex 01000027fc2b94; asc    ' + ;;
 4: len 7; hex 99b2ca77d6000a; asc    w   ;;
 5: len 7; hex 99b2ca77d6000a; asc    w   ;;
 6: len 28; hex 6769746875623a4769746875625265706f3a313a3936353934363639; asc github:GithubRepo:1:96594669;;
 7: len 12; hex 676974657874726163746f72; asc gitextractor;;
 8: len 8; hex 0000000000000000; asc         ;;
 9: len 0; hex ; asc ;;


*** (2) TRANSACTION:
TRANSACTION 9964393, ACTIVE 1 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 333 lock struct(s), heap size 41080, 19875 row lock(s), undo log entries 15662
MySQL thread id 342, OS thread handle 22494723909376, query id 995725 10.172.34.108 devlake updating
DELETE FROM `commit_parents` WHERE _raw_data_table = 'gitextractor' AND _raw_data_params = 'github:GithubRepo:1:96594669'

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 2340 page no 1944 n bits 168 index PRIMARY of table `lake`.`commit_parents` trx id 9964393 lock_mode X locks gap before rec
Record lock, heap no 14 PHYSICAL RECORD: n_fields 10; compact format; info bits 96
 0: len 30; hex 346564383363313161326163326630383431383039323835633161363065; asc 4ed83c11a2ac2f0841809285c1a60e; (total 40 bytes);
 1: len 30; hex 643464323931376263313931353639313063346336323335373636656162; asc d4d2917bc19156910c4c6235766eab; (total 40 bytes);
 2: len 6; hex 000000980b69; asc      i;;
 3: len 7; hex 01000027fc2b94; asc    ' + ;;
 4: len 7; hex 99b2ca77d6000a; asc    w   ;;
 5: len 7; hex 99b2ca77d6000a; asc    w   ;;
 6: len 28; hex 6769746875623a4769746875625265706f3a313a3936353934363639; asc github:GithubRepo:1:96594669;;
 7: len 12; hex 676974657874726163746f72; asc gitextractor;;
 8: len 8; hex 0000000000000000; asc         ;;
 9: len 0; hex ; asc ;;


*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 2340 page no 783 n bits 160 index PRIMARY of table `lake`.`commit_parents` trx id 9964393 lock_mode X waiting
Record lock, heap no 83 PHYSICAL RECORD: n_fields 10; compact format; info bits 64
 0: len 30; hex 366266346163373165666161663534303761396139633866376632373230; asc 6bf4ac71efaaf5407a9a9c8f7f2720; (total 40 bytes);
 1: len 30; hex 346564373231303663303232613135623533616466613235393531306630; asc 4ed72106c022a15b53adfa259510f0; (total 40 bytes);
 2: len 6; hex 000000980b7b; asc      {;;
 3: len 7; hex 81000036b50278; asc    6  x;;
 4: len 7; hex 99b2cc97050c08; asc        ;;
 5: len 7; hex 99b2cc97050c08; asc        ;;
 6: len 29; hex 6769746875623a4769746875625265706f3a313a313839363330303535; asc github:GithubRepo:1:189630055;;
 7: len 12; hex 676974657874726163746f72; asc gitextractor;;
 8: len 8; hex 0000000000000000; asc         ;;
 9: len 0; hex ; asc ;;

*** WE ROLL BACK TRANSACTION (1)
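Reading the dump: transaction (2) is a bulk DELETE scoped only by the `_raw_data_table`/`_raw_data_params` columns; if those are not selectively indexed, InnoDB scans the primary key and takes next-key (gap) locks on every row it examines, including rows belonging to the other repository. Transaction (1) is the multi-row INSERT for that other repository: it already holds record locks on rows it just inserted, and now needs an insert-intention lock inside a gap that (2) holds, while (2) in turn waits on a record lock that (1) holds, closing the cycle. An illustrative two-session transcript (not a runnable script; assumes the default REPEATABLE READ isolation):

```sql
-- session A: pipeline for repo github:GithubRepo:1:96594669
BEGIN;
DELETE FROM commit_parents
 WHERE _raw_data_table  = 'gitextractor'
   AND _raw_data_params = 'github:GithubRepo:1:96594669';
-- scans the primary key, taking next-key (gap) locks on rows it passes,
-- including rows that belong to the other repository

-- session B, concurrently: pipeline for repo github:GithubRepo:1:189630055
BEGIN;
INSERT INTO commit_parents (commit_sha, parent_commit_sha, ...) VALUES (...);
-- holds record locks on its freshly inserted rows, then blocks waiting
-- for an insert-intention lock inside a gap held by session A

-- session A keeps scanning and blocks on a record lock held by session B;
-- InnoDB detects the cycle and rolls one transaction back with
-- Error 1213 (40001): Deadlock found when trying to get lock
```

Under READ COMMITTED, InnoDB skips most gap locking on ordinary scans, which would break this particular cycle.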

pimguilherme avatar Mar 06 '24 09:03 pimguilherme

Perhaps the data-access layer should be hardened to attempt a few retries in the event of deadlocks. @klesh

keon94 avatar Mar 08 '24 19:03 keon94

@keon94 Maybe, or we could try other tricks to avoid such massive locking, since there is a performance penalty. Batch insertion/updates, maybe?

klesh avatar Mar 11 '24 08:03 klesh

hey guys, thanks for getting back to me!

I think it's not unusual to go with retry attempts, but indeed the more we can avoid them, the better.. it sounds like a combination of both would be ideal, if that's not too much to ask:

  • maybe configurable deadlock retries (e.g. exponential backoff, number of attempts, delay)
  • maybe batch insertion, though that might also trigger deadlocks, maybe even more of them: you would batch a series of records in distinct pipelines, and while one pipeline is committing its data, another one with its own big batch would probably need a lot of locks for its records (just brainstorming here)

I am trying to take some time to understand better where in the code the deadlock lies.. if you have any pointers, they would be very welcome!

pimguilherme avatar Mar 12 '24 13:03 pimguilherme

I just checked the code and there is already batch insertion, but I still don't understand how the deadlock happened; it doesn't make sense to me. It should be fine for different threads to insert records into the same table.

I feel like we should figure out why it happened before we could fix it correctly.

klesh avatar Mar 13 '24 08:03 klesh

sounds good! I will try to come up with a reproducible way of triggering it

pimguilherme avatar Mar 13 '24 17:03 pimguilherme

@pimguilherme That is very kind of you, thanks for taking on the tough part. Looking forward to hearing from you. 💪

klesh avatar Mar 14 '24 10:03 klesh

my pleasure @klesh! spoiler -- it seems I was able to trigger it using 3 Apache repositories.. I will try to capture the steps properly and share them here

one thing I wanted to clarify.. we are using MySQL, but do you support Postgres?

thanks!

pimguilherme avatar Mar 14 '24 13:03 pimguilherme

@pimguilherme Good to know.

You may use PG to store the data without any problem. However, all Grafana dashboards are designed for MySQL and may use some MySQL-specific syntax.

klesh avatar Mar 14 '24 13:03 klesh

Cool, thanks for confirming @klesh!

I have put together a small script that triggers the deadlock on an empty local DevLake instance running 0.18.0 (it would probably work on newer versions with some small adjustments).

Here are the steps:

  • copy env.example to .env and set PIPELINE_MAX_PARALLEL=5 and ENCRYPTION_SECRET=secret
  • bring DevLake up with docker-compose -f docker-compose-dev.yml up
  • once DevLake is up, run the attached bash script deadlock-setup.txt with bash deadlock-setup.txt (NOTE: you need jq, curl, gunzip and grep installed). It simply makes a series of API calls to set up 3 Apache repositories/projects (druid, arrow-datafusion and incubator-devlake), sets a smaller sync window, and triggers the pipelines continuously until a deadlock is hit. I have included a test-user GitHub token for public repos.
  • wait until the deadlock is hit (it took about 7 minutes in my tests, according to the attached logs): deadlock-example.txt, deadlock-example2.txt

please let me know how it goes

PS: I also hit this error somewhat consistently in the first pipelines, but I think we can focus on the deadlocks (screenshot attached).

cheers!

pimguilherme avatar Mar 14 '24 21:03 pimguilherme

one thing to note: apparently this issue does not occur on Postgres (I have been running the same test script and it hasn't triggered in 30 minutes)

pimguilherme avatar Mar 15 '24 13:03 pimguilherme

Found an article that might be helpful: https://medium.com/assoconnect/dealing-with-deadlocks-on-insert-requests-59eb93ef8dce

klesh avatar Mar 18 '24 05:03 klesh

oh, that's pretty interesting @klesh, I will try that out as well, because it would be easier than having to migrate infrastructure to Postgres

thanks for the pointer!

pimguilherme avatar Mar 18 '24 13:03 pimguilherme

@pimguilherme Good hunting 🦆

klesh avatar Mar 19 '24 11:03 klesh

thanks @klesh! we have been running smoothly on the deadlock side after changing the transaction isolation :)

Ideally, the code would be able to handle things even without this setting, just running more slowly, or at least with a configurable number of deadlock retries. But that may be over-engineering for now, considering that setting transaction_isolation to READ-COMMITTED is a simple and sensible change (it is apparently the default for Postgres, which is probably why things run smoothly there too).
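For reference, making READ-COMMITTED the server-wide default is a one-line MySQL configuration change (the file path below is just an example; SET GLOBAL transaction_isolation = 'READ-COMMITTED' achieves the same until the next restart):

```ini
# e.g. /etc/mysql/conf.d/isolation.cnf
[mysqld]
# READ-COMMITTED disables most gap locking, which removes the
# DELETE-vs-INSERT deadlock cycle seen in the InnoDB dump above;
# it is also the default isolation level in Postgres.
transaction_isolation = READ-COMMITTED
```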

it would probably be a good idea to mention this in the docs, and maybe make it the default configuration for DevLake on MySQL.. what do you think?

thanks!!

pimguilherme avatar Mar 21 '24 13:03 pimguilherme

@pimguilherme I think we could do both 😄

klesh avatar Mar 21 '24 14:03 klesh

sounds good @klesh, thanks! :smile:

we have been able to run 3 parallel pipelines very stably in this fashion.. we did a few load tests, and I think the summary for coping with parallelism in DevLake is that it is ideal to have 1 CPU + 4 GB of memory on the MySQL database per parallel pipeline

so, for 3 parallel pipelines, at least 3 CPUs and 12 GB of memory.. at least in our setup, DevLake is quite CPU-intensive on the database during some pipeline steps (especially the DORA calculation on bigger repos, I think, but we haven't investigated further yet)

as for the DevLake process itself, we have been running it with 3Gi of memory and it doesn't seem to be very CPU-intensive

we basically use the Github connector (without issue tracking) for all our repos, and DORA and Azure Devops for some

we haven't hit any rate limits on github yet :raised_hands:

thanks for the support! :green_heart:

pimguilherme avatar Mar 26 '24 14:03 pimguilherme

@pimguilherme Thanks for sharing, this is very informative. I wonder whether we could set the isolation level on the DevLake end, or whether it should be done when setting up the MySQL server? Either way, would you like to contribute the code or docs for this? Thanks in advance.

klesh avatar Mar 28 '24 08:03 klesh

okay @klesh, thanks for the suggestion! I will try to allocate some time to do this :)

we could set transaction_isolation at the session level, I think, but that would override any defaults on the database.. the best route, I think, would be to set it at the session level and provide a configuration option to override it, just in case (although it sounds really unlikely that anyone would 🤔)
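A session-level override can be done without touching the server at all: go-sql-driver/mysql applies unrecognized DSN parameters as session system variables (string values must be percent-quoted), so appending the variable to the connection string makes every new connection run with READ-COMMITTED. A sketch with a hypothetical helper name and DSN:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// withReadCommitted appends a session-level isolation override to a
// go-sql-driver/mysql DSN. The driver treats unknown DSN parameters as
// session system variables, so each new connection effectively runs
// SET transaction_isolation='READ-COMMITTED'; no server config needed.
func withReadCommitted(dsn string) string {
	sep := "?"
	if strings.Contains(dsn, "?") {
		sep = "&"
	}
	// String-valued variables must be quoted, and the quotes
	// percent-encoded: 'READ-COMMITTED' -> %27READ-COMMITTED%27
	return dsn + sep + "transaction_isolation=" + url.QueryEscape("'READ-COMMITTED'")
}

func main() {
	// hypothetical DSN for illustration only
	fmt.Println(withReadCommitted("merico:merico@tcp(mysql:3306)/lake?charset=utf8mb4"))
}
```

A configuration flag could then decide whether to append the parameter, leaving the database default untouched when disabled.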

pimguilherme avatar Apr 02 '24 19:04 pimguilherme

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Aug 06 '24 00:08 github-actions[bot]