QCFractal
QCFractal copied to clipboard
Completed tasks not being pushed because server returns Entity Too Large
Describe the bug
Workers have been recently marking jobs stale, consistently (several per day). After tracing it back to interface/client.py#L235 (in FractalClient::_request), the logs show
[W 200515 22:51:33 managers:579] Post complete tasks was not successful. Attempting again on next update. Error:
[W 200515 22:51:33 managers:580] Server communication failure. Reason: Request Entity Too Large
[I 200515 22:51:33 managers:616] Tried to push 2 complete tasks to the server (2 success / 0 fail).
(I locally modified queue/manager.py#L579 to show this the return message from FractalClient)
This will occur until it hits the retries threshold, then bail. In all failures, the same error is given.
To Reproduce
Spin up managers, wait ~12 hours. I can provide task IDs if they are useful for debugging; I can't tell which job is causing the issues, but I can collect all pulled tasks up to that point.
Ok now that makes sense. This error doesn't come from QCFractal, but from nginx (and now I see it in the nginx error log). For a short term fix, I can increase that limit. Longer term, it might be good to chunk the uploads: https://requests.readthedocs.io/en/master/user/advanced/#chunk-encoded-requests
These seem to be some pretty large results. We currently have a limit of 50M but that is easily changed. Can you give me some of the task ids? I am just curious as to why they are so large
Sure, a quick grep for duplicate ids that appear before a stale update since May 11 is below. In particular, 10499131 might be of interest
standard/log: {'id': '10411286'
standard/log: {'id': '10411469'
standard/log: {'id': '10430259'
standard/log: {'id': '10430426'
standard/log: {'id': '10430617'
standard/log: {'id': '10430707'
standard/log: {'id': '10430760'
standard/log: {'id': '10431318'
standard/log: {'id': '10431326'
standard/log: {'id': '10431549'
standard/log: {'id': '10431597'
standard/log: {'id': '10431784'
standard/log: {'id': '10431864'
standard/log: {'id': '10431878'
standard/log: {'id': '10431952'
standard/log: {'id': '10431962'
standard/log: {'id': '10431987'
standard/log: {'id': '10432011'
standard/log: {'id': '10432023'
standard/log: {'id': '10432089'
standard/log: {'id': '10432123'
standard/log: {'id': '10432130'
standard/log: {'id': '10432133'
standard/log: {'id': '10479610'
standard/log: {'id': '10482724'
standard/log: {'id': '10483560'
standard/log: {'id': '10483578'
standard/log: {'id': '10483621'
standard/log: {'id': '10483741'
standard/log: {'id': '10488902'
standard/log: {'id': '10488960'
standard/log: {'id': '10489211'
standard/log: {'id': '10489230'
standard/log: {'id': '10489670'
standard/log: {'id': '10489735'
standard/log: {'id': '10489806'
standard/log: {'id': '10489841'
standard/log: {'id': '10490038'
standard/log: {'id': '10490168'
standard/log: {'id': '10490170'
standard/log: {'id': '10490208'
standard/log: {'id': '10490231'
standard/log: {'id': '10490256'
standard/log: {'id': '10490257'
standard/log: {'id': '10490259'
standard/log: {'id': '10490268'
standard/log: {'id': '10490269'
standard/log: {'id': '10490271'
standard/log: {'id': '10490272'
standard/log: {'id': '10490274'
standard/log: {'id': '10490275'
standard/log: {'id': '10490277'
standard/log: {'id': '10490278'
standard/log: {'id': '10493209'
standard/log: {'id': '10496349'
standard/log: {'id': '10497676'
standard/log: {'id': '10499131'
Those all seem complete on on the backend. Those just the gradient calculations, though. I have raised the limit on the nginx server. Hopefully that clears up the stale jobs, but is only a temporary fix.
Interesting. Here is a complete list from a start up to shutdown on between 200514 21:21:57 and 200516 00:45:34. I added the in the list points where it tried to push the jobs, where the error is returned, by #Tried:
'10497676'
'10499131'
'10499988'
'10499993'
'10500012'
'10500013'
'10499995'
'10500015'
'10500018'
'10500069'
'10500090'
'10500145'
'10500704'
'10500706'
'10500718'
'10500721'
'10500888'
'10500891'
'10500893'
'10500933'
'10500936'
'10494185'
'10500937'
'10500962'
'10501000'
'10479610'
'10498965'
'10501120'
'10501308'
'10501318'
'10501343'
'10501350'
'10501376'
'10501467'
'10501505'
'10501555'
'10501629'
'10502060'
'10502062'
'10502236'
'10502237'
'10502275'
'10502485'
'10502528'
'10502615'
'10502656'
'10502948'
'10502995'
'10503255'
'10503257'
'10503263'
'10503266'
'10503269'
'10503272'
'10503359'
'10503408'
'10503427'
'10503442'
'10503518'
'10503552'
'10502945'
'10503746'
'10503767'
'10503777'
'10503780'
'10503795'
'10503808'
'10503812'
'10503825'
'10503831'
'10503857'
'10503858'
'10503885'
'10503887'
'10503888'
'10503920'
'10503954'
'10503970'
'10502946'
'10503196'
'10503645'
'10501883'
'10502440'
'10503256'
'10503361'
'10503665'
'10503768'
'10503919'
'10504299'
'10504329'
'10504332'
'10504337'
'10504348'
'10504106'
'10504349'
'10504354'
'10504356'
'10504359'
'10504380'
'10504387'
'10504390'
'10504392'
'10504398'
'10504401'
'10504409'
'10504462'
'10503505'
'10504579'
'10504583'
'10504586'
'10504588'
'10504592'
'10504597'
'10504625'
'10504627'
'10504683'
'10504712'
'10504714'
'10504793'
'10504804'
'10504805'
'10504811'
'10504815'
'10504818'
'10504838'
'10504845'
'10504873'
'10504894'
'10504939'
'10504943'
'10505034'
'10505035'
'10504679'
'10505293'
'10505395'
'10505425'
'10505463'
'10431966'
'10505862'
'10505864'
'10505866'
'10505867'
'10505879'
'10505880'
'10505891'
'10505895'
'10505897'
'10505900'
'10505958'
'10506169'
'10501504'
'10506680'
'10506687'
'10506706'
'10506745'
'10506789'
'10478553'
'10506796'
'10506799'
'10506808'
'10506813'
'10506817'
'10506831'
'10506822'
'10506828'
'10506838'
'10506843'
'10495089'
'10506866'
'10506879'
'10506883'
'10506872'
'10506893'
'10430383'
'10506894'
'10506898'
'10506901'
'10506917'
'10506721'
'10505521'
'10506948'
'10506961'
'10506949'
'10506125'
'10506964'
'10506966'
'10506978'
'10506984'
'10507021'
'10507053'
'10507027'
'10507029'
'10507057'
'10507080'
'10507064'
'10507093'
'10507110'
'10507116'
'10507118'
'10507127'
'10507130'
'10507134'
'10507151'
'10507154'
'10507156'
'10507172'
'10507175'
'10507198'
'10432436'
'10507195'
'10507221'
'10507240'
'10507241'
'10507245'
'10507262'
'10507270'
'10506914'
'10507266'
'10507275'
'10506720'
'10506419'
'10506748'
'10507282'
'10507286'
'10507291'
'10507292'
'10507302'
'10507303'
'10430120'
'10506264'
'10507307'
'10507309'
'10507312'
'10507314'
'10507318'
'10507320'
'10507323'
'10507325'
'10507328'
'10507341'
'10507343'
'10507346'
'10507350'
'10507352'
'10507364'
'10507382'
'10507383'
'10507392'
'10507393'
'10507400'
'10507407'
'10507413'
'10507464'
'10507415'
'10507420'
'10507424'
'10431970'
'10432113'
'10507496'
'10507024'
'10507426'
'10431633'
'10507525'
'10507527'
'10507428'
'10507432'
'10507545'
'10507449'
'10507451'
'10507455'
'10507569'
'10507456'
'10507583'
'10507462'
'10507477'
'10507479'
'10507613'
'10507481'
'10507482'
'10507484'
'10507485'
'10507490'
'10507493'
'10507510'
'10507513'
'10507517'
'10507521'
'10507534'
'10507536'
'10507539'
'10507544'
'10507552'
'10507554'
'10507555'
'10507558'
'10507560'
'10507573'
'10507584'
'10507585'
'10507587'
'10507606'
'10507767'
'10507609'
'10507614'
'10507616'
'10507650'
'10507654'
'10507657'
'10507687'
'10507688'
'10507690'
'10507693'
'10507695'
'10507698'
'10507699'
'10507701'
'10507744'
'10507746'
'10507748'
'10507750'
'10507757'
'10507781'
'10432000'
'10507784'
'10507785'
'10507813'
'10507815'
'10507816'
'10507818'
'10507829'
'10507837'
'10507838'
'10507839'
'10507841'
'10507845'
'10507849'
'10507852'
'10507890'
'10507891'
'10507893'
'10507895'
'10507896'
'10507914'
'10507918'
'10495448'
'10507932'
'10507942'
'10507953'
'10507958'
'10507959'
'10507962'
'10507965'
'10507921'
'10507969'
'10507972'
'10507975'
'10507987'
'10507990'
'10507995'
'10507998'
'10508013'
'10508015'
'10508072'
'10508032'
'10508034'
'10508039'
'10508093'
'10508096'
'10508043'
'10508046'
'10508054'
'10508055'
'10508059'
'10508066'
'10508067'
'10508069'
'10508075'
'10508079'
'10508081'
'10508085'
'10508088'
'10508099'
'10508104'
'10508109'
'10508110'
'10508112'
'10508116'
'10508124'
'10508127'
'10508130'
'10508139'
'10508143'
'10508160'
'10508165'
'10508166'
'10508168'
'10508170'
'10508268'
'10508269'
'10508276'
'10508277'
'10508278'
'10501159'
'10506779'
'10508207'
'10508210'
'10508212'
'10507412'
'10430617'
'10430707'
'10431532'
'10431549'
'10431561'
'10431597'
'10431784'
'10431817'
'10431864'
'10431875'
'10431952'
'10431962'
'10431969'
'10431993'
'10432011'
'10432021'
'10432059'
'10432081'
'10432129'
'10432130'
'10432147'
'10432481'
'10506672'
'10506885'
'10507769'
'10507770'
'10508095'
'10508254'
'10508270'
'10507904'
'10431815'
'10507957'
'10507973'
'10508425'
'10508426'
'10507988'
'10508012'
'10508431'
'10508432'
'10508040'
'10508037'
'10508065'
'10508082'
'10508102'
'10508520'
'10508239'
'10508323'
'10508329'
'10508632'
'10508356'
'10508371'
'10508375'
'10508378'
'10430924'
'10508463'
'10508493'
'10508501'
'10508652'
'10508796'
'10508804'
'10508879'
'10508901'
'10508921'
'10508926'
'10508972'
#Tried
'10508993'
'10509028'
'10509057'
'10509088'
'10509116'
'10509154'
'10509239'
#Tried
'10509275'
'10509278'
'10509281'
'10509299'
'10509300'
'10509325'
'10509371'
'10509383'
'10509391'
'10509866'
#Tried
'10509422'
'10509487'
'10509488'
'10509502'
'10509515'
'10509517'
'10509562'
'10509604'
'10509660'
'10509695'
'10509702'
#Tried
'10509729'
'10509732'
'10509740'
'10509745'
'10509758'
'10509798'
'10509813'
'10509838'
'10509852'
'10509883'
'10509010'
'10509922'
'10509953'
'10509564'
'10509797'
'10509968'
'10510003'
'10510004'
'10510012'
'10510023'
'10510025'
'10510037'
'10510043'
'10510063'
'10510068'
'10510071'
'10510073'
'10510092'
'10510097'
'10510098'
'10510104'
'10510127'
'10510465'
'10509896'
'10510144'
'10510161'
'10510195'
'10510215'
'10510235'
'10510238'
'10510251'
'10510216'
'10510573'
'10510603'
'10509568'
'10509611'
'10509737'
'10509780'
'10509816'
'10509899'
#Tried
'10509969'
'10510000'
'10508118'
#Tried
'10508326'
'10508962'
'10509320'
'10508825'
'10509849'
'10509920'
'10509966'
'10510077'
'10510103'
'10510136'
'10510138'
'10510147'
#Tried
'10510193'
'10510240'
'10510252'
'10510725'
'10510731'
'10510787'
'10510791'
'10510796'
'10511058'
'10511059'
'10510798'
'10510810'
'10510815'
'10510863'
'10510864'
'10510866'
'10431348'
'10431960'
'10507235'
#Tried
'10511122'
'10411469'
#Tried
'10510766'
'10511057'
'10511167'
'10511168'
'10511169'
'10511172'
'10511186'
#Tried
'10511189'
'10511191'
'10511192'
'10511195'
'10511196'
'10492564'
'10493127'
'10493129'
'10493156'
'10494176'
'10494418'
'10495391'
'10496349'
'10496902'
'10496904'
'10496905'
'10504514'
'10506764'
'10508582'
#Tried
'10508773'
Actually, I'll just attach said log file here qcf-stale.log
That stuff is all looking pretty normal. I don't think I will be able to absolutely figure out why these are so big without more instrumentation that catches large pushes and shows what is in them.
If you feel like giving it a shot, I think I know the easiest place to add it. If not, I will add debugging stuff to the next release.
Sure, I was thinking to just add some more prints in _request to see if I can pinpoint it. If you have some ideas I'd be happy to try them out.
ok that's actually what I was thinking. If you can check the size of the serialized payload and if greater than 50MB or so, dump the unserialized data to a file.
Yeah. Will do once I get more time on the cluster.. seemed to have run dry... will post again when I get some results. I can try other clusters to see if it appears.
That said, I did not get any more errors after the limit was raised, so it seems like to solved the immediate issue.
We are running our instances behind traefik now. I haven't seen these errors in a while.
Also, the managers in v0.50 are smarter and I think handle this more gracefully.