fluo
fluo copied to clipboard
Lock resolution failed
During a long run of webindex because of #656 there was some data that was unable to be processed for a long time. When I updated the Accumulo iterators on the cluster with the fix for #656 some data then failed to process with following error.
21:15:42.284 [Fluo-0005-001-ntfyProc] ERROR io.fluo.core.worker.WorkTaskAsync - Failed to process work um:u:2ff: fluoRecipes cfm:um 649491705
java.lang.IllegalStateException: can not abort : um:u:2ff:^C^Acom.blogger>.www>o>/profile/0391446164822345113�^@^@^@^@^R��� u v 340836728 (UNKNOWN)
at io.fluo.core.impl.LockResolver.resolveLocks(LockResolver.java:169) ~[fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.core.impl.ParallelSnapshotScanner.scan(ParallelSnapshotScanner.java:123) ~[fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.core.impl.TransactionImpl.get(TransactionImpl.java:179) ~[fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.recipes.map.CollisionFreeMap.getCurrentValues(CollisionFreeMap.java:233) ~[fluo-recipes-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.recipes.map.CollisionFreeMap.process(CollisionFreeMap.java:174) ~[fluo-recipes-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.recipes.map.CollisionFreeMapObserver.process(CollisionFreeMapObserver.java:44) ~[fluo-recipes-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.core.worker.WorkTaskAsync.run(WorkTaskAsync.java:85) ~[fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.core.worker.NotificationProcessor$NotificationProcessingTask.run(NotificationProcessor.java:147) [fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
21:15:46.894 [Fluo-0005-002-ntfyProc] ERROR io.fluo.core.worker.WorkTaskAsync - Failed to process work um:u:04b: fluoRecipes cfm:um 649486038
java.lang.IllegalStateException: can not abort : um:u:04b:^C^A�^Acom.bulletinmarketing>.adserv>o>/w/1.0/rc?cs=621c506d44&cb=INSERT_RANDOM_NUMBER_HERE^@^@^@^@^T4E u v 371126526 (UNKNOWN)
at io.fluo.core.impl.LockResolver.resolveLocks(LockResolver.java:169) ~[fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.core.impl.ParallelSnapshotScanner.scan(ParallelSnapshotScanner.java:123) ~[fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.core.impl.TransactionImpl.get(TransactionImpl.java:179) ~[fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.recipes.map.CollisionFreeMap.getCurrentValues(CollisionFreeMap.java:233) ~[fluo-recipes-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.recipes.map.CollisionFreeMap.process(CollisionFreeMap.java:174) ~[fluo-recipes-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.recipes.map.CollisionFreeMapObserver.process(CollisionFreeMapObserver.java:44) ~[fluo-recipes-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.core.worker.WorkTaskAsync.run(WorkTaskAsync.java:85) ~[fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.core.worker.NotificationProcessor$NotificationProcessingTask.run(NotificationProcessor.java:147) [fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
21:15:47.298 [HashNotificationFinder ScanTask] DEBUG i.f.core.worker.finder.hash.ScanTask - Scanned 362 of 362 tablets, added 0 new notifications (total queued 0)
21:15:47.617 [Fluo-0005-003-ntfyProc] ERROR io.fluo.core.worker.WorkTaskAsync - Failed to process work um:u:01f: fluoRecipes cfm:um 649493345
java.lang.IllegalStateException: can not abort : um:u:01f:^C^Acom.nisource>.www>o>�^@^@^@^@^U�+� u v 370729353 (UNKNOWN)
at io.fluo.core.impl.LockResolver.resolveLocks(LockResolver.java:169) ~[fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.core.impl.ParallelSnapshotScanner.scan(ParallelSnapshotScanner.java:123) ~[fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.core.impl.TransactionImpl.get(TransactionImpl.java:179) ~[fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.recipes.map.CollisionFreeMap.getCurrentValues(CollisionFreeMap.java:233) ~[fluo-recipes-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.recipes.map.CollisionFreeMap.process(CollisionFreeMap.java:174) ~[fluo-recipes-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.recipes.map.CollisionFreeMapObserver.process(CollisionFreeMapObserver.java:44) ~[fluo-recipes-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.core.worker.WorkTaskAsync.run(WorkTaskAsync.java:85) ~[fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at io.fluo.core.worker.NotificationProcessor$NotificationProcessingTask.run(NotificationProcessor.java:147) [fluo-core-1.0.0-beta-3-SNAPSHOT.jar:1.0.0-beta-3-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Digging into this there were locks pointing to a primary that did not exists.
Below is the place where the primary should be.
$ fluo scan webindex --raw -p um:u:2ff | grep 0391446164822345
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00$\x95\xd7\xdb u:v [] 613849491-WRITE 613799899
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00$\x95\xd7\xdb u:v [] 613799899-DATA \x0b\x01\x00\x02
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00$\xe4;\xc5 u:v [] 619085141-WRITE 618937285
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00$\xe4;\xc5 u:v [] 618937285-DATA \x0b\x01\x00\x02
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00% \xf7F u:v [] 623105934-WRITE 622917446
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00% \xf7F u:v [] 622917446-DATA \x0b\x01\x00\x02
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00%]\x1fI u:v [] 626964290-WRITE 626859849
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00%]\x1fI u:v [] 626859849-DATA \x0b\x01\x00\x02
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00%`\x96\xb6 u:v [] 627150900-WRITE 627087030
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00%`\x96\xb6 u:v [] 627087030-DATA \x0b\x01\x00\x02
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00%\xe2[; u:v [] 635789987-WRITE 635591483
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00%\xe2[; u:v [] 635591483-DATA \x0b\x01\x00\x02
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00&b\xc8\xc0 u:v [] 644258355-WRITE 644008128
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00&b\xc8\xc0 u:v [] 644008128-DATA \x0b\x01\x00\x02
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00&\xa4\xb0Y u:v [] 648413575-WRITE 648327257
um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00&\xa4\xb0Y u:v [] 648327257-DATA \x0b\x01\x00\x02
Below are some locks that point to a non-existent primary.
$ fluo scan webindex --raw -p um:d:2ff | grep 0391446164822345 | grep LOCK | head
um:d:2ff:\x03\x01\xda\x03com.digg>.www>o>/submit?phase=2&url=http://www.cheatsguru.com/psp/harry_potter_and_the_goblet_of_fire/&topic=gaming&title=Harry Potter and the Goblet of Fire+cheats+at+CheatsGuru.com&jump=close&v=4 data:current [] 340836728-LOCK um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00\x12\x9e\xf5\xe3 u v WRITE NOT_DELETE NOT_TRIGGER z
um:d:2ff:\x03\x01\xda\x03com.digg>>o>/submit?phase=2&url=http%3A%2F%2Fwww.defensivecarry.com%2Fforum%2Fdefensive-carry-guns%2F25154-if-you-could-only-carry-one-concealed-handgun.html&title=If+you+could+only+carry+ONE+concealed+handgun data:current [] 340836728-LOCK um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00\x12\x9e\xf5\xe3 u v WRITE NOT_DELETE NOT_TRIGGER z
um:d:2ff:\x03\x01\xda\x03com.informationweek>.www>o>/whitepaper/Mobility/Mobile-Business/five-questions-to-ask-before-choosing-responsive-w-wp1396379164?articleID=191741381&cid=SBX_ddj_fture_wp_default_video&itc=SBX_ddj_fture_wp_default_video data:current [] 340836728-LOCK um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00\x12\x9e\xf5\xe3 u v WRITE NOT_DELETE NOT_TRIGGER z
um:d:2ff:\x03\x01\xda\x03com.instapaper>.www>o>/hello2?title=Kelly+Clarkson+upset+with+U.K.+editors+over+newspaper+article&url=http%3A%2F%2Fwww.wonderwall.com%2Fmusic%2Fkelly-clarkson-upset-with-uk-editors-over-newspaper-article-1720612.story data:current [] 340836728-LOCK um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00\x12\x9e\xf5\xe3 u v WRITE NOT_DELETE NOT_TRIGGER z
um:d:2ff:\x03\x01\xda\x03com.itcentralstation>.www>o>/products/comparisons/ca-access-control_vs_novell-identity-manager?aid=TechTargetRef&lp=comparison1&tid=category1917_topic1324101_sitesearchWindowsServer_contentEditorialContent-tip_widget2 data:current [] 340836728-LOCK um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00\x12\x9e\xf5\xe3 u v WRITE NOT_DELETE NOT_TRIGGER z
um:d:2ff:\x03\x01\xda\x03com.linkedin>.www>o>/shareArticle?url=http%3A%2F%2Fliveeventfacebook.eventbrite.com%2F%3Faff%3Desli&mini=true&source=Eventbrite&title=FACEBOOK+-+LIVE+MARKETING+WORKSHOPS&utm_medium=eventpage&utm_term=share data:current [] 340836728-LOCK um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00\x12\x9e\xf5\xe3 u v WRITE NOT_DELETE NOT_TRIGGER z
um:d:2ff:\x03\x01\xda\x03com.pinterest>>o>/pin/create/button/?url=http%3A%2F%2Fwww.head-fi.org%2Ft%2F641953%2Fkoss-prodj100-headphones-mint-bought-only-2-months-ago&description=KOSS+ProDJ100+Headphones+-+MINT+-+Bought+only+2+months+ago%21 data:current [] 340836728-LOCK um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00\x12\x9e\xf5\xe3 u v WRITE NOT_DELETE NOT_TRIGGER z
um:d:2ff:\x03\x01\xda\x03com.twitter>>s>/share?text=U%2ES%2E%20Directly%20Blames%20China%E2%80%99s%20Military%20for%20Cyberattacks&url=http%3A%2F%2Ftechnews%2Eacm%2Eorg%2Farchives%2Ecfm%3Ffo%3D2013%2D08%2DMay%2FMay%2D08%2D2013%2Ehtml%23652745 data:current [] 340836728-LOCK um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00\x12\x9e\xf5\xe3 u v WRITE NOT_DELETE NOT_TRIGGER z
um:d:2ff:\x03\x01\xda\x03com.yahoo>.login>s>/config/login?.intl=us&.src=smw&.pd=smw_ver%3d1&.done=http%3A%2F%2Fwww.mp3car.com%2Fvbulletin%2Foff-topic%2F70039-thinking-about-going-manual.html&title=Thinking+about+going+manual.. data:current [] 340836728-LOCK um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00\x12\x9e\xf5\xe3 u v WRITE NOT_DELETE NOT_TRIGGER z
um:d:2ff:\x03\x01\xda\x03gov.nih>.nlm.ncbi.www>o>/entrez/query.fcgi?db=PubMed&cmd=Search&doptcmdl=Citation&defaultField=Title+Word&term=Fischer%5Bauthor%5D+AND+Many+cuts+to+ruin%3A+a+comprehensive+update+of+caspase+substrates. data:current [] 340836728-LOCK um:u:2ff:\x03\x01com.blogger>.www>o>/profile/0391446164822345113\xb7\x00\x00\x00\x00\x12\x9e\xf5\xe3 u v WRITE NOT_DELETE NOT_TRIGGER z
I think I have tracked down one possible cause of this. When a transaction rolls itself back it does not properly mark the primary column.
I am not sure if this is 100% the same thing, but on Fluo 2.0.0-SNAPSHOT we ran into something that looks similar. This occurred on a single test server that was likely very overloaded at the time of failure. When trying to scan we see that exception
root@fluo-oracle:/# fluo scan -a crucible -p dataset_offsets:Alerts:
dataset_offsets:Alerts:0 offset 21179
dataset_offsets:Alerts:1 offset 21179
dataset_offsets:Alerts:2 offset 21179
dataset_offsets:Alerts:3 offset 21179
Exception in thread "main" java.lang.IllegalStateException: can not abort : record:Alerts:4:000021178 10 143265458 (UNKNOWN)
at org.apache.fluo.core.impl.LockResolver.resolveLocks(LockResolver.java:201)
at org.apache.fluo.core.impl.SnapshotScanner$SnapIter.resolveLock(SnapshotScanner.java:184)
at org.apache.fluo.core.impl.SnapshotScanner$SnapIter.getNext(SnapshotScanner.java:221)
at org.apache.fluo.core.impl.SnapshotScanner$SnapIter.hasNext(SnapshotScanner.java:131)
at com.google.common.collect.TransformedIterator.hasNext(TransformedIterator.java:42)
at org.apache.fluo.core.util.ScanUtil.scan(ScanUtil.java:124)
at org.apache.fluo.core.util.ScanUtil.scanFluo(ScanUtil.java:152)
at org.apache.fluo.command.FluoScan.execute(FluoScan.java:109)
at org.apache.fluo.command.FluoProgram.runFluoCommand(FluoProgram.java:69)
at org.apache.fluo.command.FluoProgram.main(FluoProgram.java:33)
When looking at the raw data we see the lock to the primary
...
dataset_offsets:Alerts:4 :offset [] 113249162-WRITE 113249161
dataset_offsets:Alerts:4 :offset [] 113247916-WRITE 113247915
dataset_offsets:Alerts:4 :offset [] 113246900-WRITE 113246897
dataset_offsets:Alerts:4 :offset [] 85418754-WRITE 85418753
dataset_offsets:Alerts:4 :offset [] 143265458-LOCK record:Alerts:4:000021178 10 WRITE NOT_DELETE NOT_TRIGGER c
dataset_offsets:Alerts:4 :offset [] 143265458-DATA 21178
dataset_offsets:Alerts:4 :offset [] 143265003-DATA 21177
dataset_offsets:Alerts:4 :offset [] 114952314-DATA 21159
...
But maybe in this case the primary does exist?
record:Alerts:4:000021178 :10 [] 143265458-LOCK record:Alerts:4:000021178 10 WRITE NOT_DELETE NOT_TRIGGER c
Are there any recovery tools or process for cleaning up a situation like this? Thanks for any advice anyone may have. cc @wjsl
@jaredwinick would the IllegalstateExcep happen every-time the data was scanned? Want to confirm this is an issue with the persisted data and not a transient issue in the code. Seems like a problem with the data.
What command did you run to see the following output :
record:Alerts:4:000021178 :10 [] 143265458-LOCK record:Alerts:4:000021178 10 WRITE NOT_DELETE NOT_TRIGGER c
Are there any recovery tools or process for cleaning up a situation like this?
I don't think so.
It’s a part of the data when we scan it.
On Sep 1, 2021, at 10:01 AM, Keith Turner @.***> wrote:
@jaredwinick https://github.com/jaredwinick would the IllegalstateExcep happen every-time the data was scanned? Want to confirm this is an issue with the persisted data and not a transient issue in the code. Seems like a problem with the data.
What command did you run to see the following output :
record:Alerts:4:000021178 :10 [] 143265458-LOCK record:Alerts:4:000021178 10 WRITE NOT_DELETE NOT_TRIGGER c Are there any recovery tools or process for cleaning up a situation like this?
I don't think so.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/apache/fluo/issues/660#issuecomment-910314915, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKB7ZKPOZMPFMKTHU6YXMTT7YW3PANCNFSM4CC6NMYA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
What command did you run to see the following output :
record:Alerts:4:000021178 :10 [] 143265458-LOCK record:Alerts:4:000021178 10 WRITE NOT_DELETE NOT_TRIGGER c
It was
fluo scan -a crucible -p record:Alerts:4:000021178 --raw
the values are big so i just copied what looked like the relevant part of the output
I think we tracked down the problem in #1123