
Completed tasks not being pushed because server returns Entity Too Large

Open trevorgokey opened this issue 5 years ago • 9 comments

Describe the bug

Workers have recently been marking jobs stale consistently (several per day). After tracing it back to interface/client.py#L235 (in FractalClient::_request), the logs show

[W 200515 22:51:33 managers:579] Post complete tasks was not successful. Attempting again on next update. Error:
[W 200515 22:51:33 managers:580] Server communication failure. Reason: Request Entity Too Large
[I 200515 22:51:33 managers:616] Tried to push 2 complete tasks to the server (2 success / 0 fail).

(I locally modified queue/manager.py#L579 to show the return message from FractalClient.)

This repeats until it hits the retry threshold, then bails. Every failure gives the same error.

To Reproduce

Spin up managers, wait ~12 hours. I can provide task IDs if they are useful for debugging; I can't tell which job is causing the issues, but I can collect all pulled tasks up to that point.

trevorgokey avatar May 16 '20 06:05 trevorgokey

Ok, now that makes sense. This error doesn't come from QCFractal, but from nginx (and now I see it in the nginx error log). As a short-term fix, I can increase that limit. Longer term, it might be good to chunk the uploads: https://requests.readthedocs.io/en/master/user/advanced/#chunk-encoded-requests
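For reference, chunked uploads with requests look roughly like this. This is only a sketch, not QCFractal code: `chunked_body` and the URL are illustrative, and the real client would serialize the payload first.

```python
def chunked_body(payload: bytes, chunk_size: int = 1024 * 1024):
    """Yield a serialized payload in fixed-size pieces.

    Passing a generator as ``data`` to requests.post() makes requests
    send the request with Transfer-Encoding: chunked, so no single
    write has to carry the whole body at once.
    """
    for start in range(0, len(payload), chunk_size):
        yield payload[start:start + chunk_size]

# Hypothetical usage inside FractalClient._request (names illustrative):
#   import requests
#   requests.post(server_address + "task_queue", data=chunked_body(body))
```

Note that chunking only changes how the bytes go over the wire; nginx still counts the total body size against its limit, so the limit would need raising either way.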

These seem to be some pretty large results. We currently have a limit of 50M, but that is easily changed. Can you give me some of the task IDs? I am just curious as to why they are so large.

bennybp avatar May 16 '20 12:05 bennybp

Sure, a quick grep for duplicate IDs that appear before a stale update since May 11 is below. In particular, 10499131 might be of interest:

standard/log:    {'id': '10411286'
standard/log:    {'id': '10411469'
standard/log:    {'id': '10430259'
standard/log:    {'id': '10430426'
standard/log:    {'id': '10430617'
standard/log:    {'id': '10430707'
standard/log:    {'id': '10430760'
standard/log:    {'id': '10431318'
standard/log:    {'id': '10431326'
standard/log:    {'id': '10431549'
standard/log:    {'id': '10431597'
standard/log:    {'id': '10431784'
standard/log:    {'id': '10431864'
standard/log:    {'id': '10431878'
standard/log:    {'id': '10431952'
standard/log:    {'id': '10431962'
standard/log:    {'id': '10431987'
standard/log:    {'id': '10432011'
standard/log:    {'id': '10432023'
standard/log:    {'id': '10432089'
standard/log:    {'id': '10432123'
standard/log:    {'id': '10432130'
standard/log:    {'id': '10432133'
standard/log:    {'id': '10479610'
standard/log:    {'id': '10482724'
standard/log:    {'id': '10483560'
standard/log:    {'id': '10483578'
standard/log:    {'id': '10483621'
standard/log:    {'id': '10483741'
standard/log:    {'id': '10488902'
standard/log:    {'id': '10488960'
standard/log:    {'id': '10489211'
standard/log:    {'id': '10489230'
standard/log:    {'id': '10489670'
standard/log:    {'id': '10489735'
standard/log:    {'id': '10489806'
standard/log:    {'id': '10489841'
standard/log:    {'id': '10490038'
standard/log:    {'id': '10490168'
standard/log:    {'id': '10490170'
standard/log:    {'id': '10490208'
standard/log:    {'id': '10490231'
standard/log:    {'id': '10490256'
standard/log:    {'id': '10490257'
standard/log:    {'id': '10490259'
standard/log:    {'id': '10490268'
standard/log:    {'id': '10490269'
standard/log:    {'id': '10490271'
standard/log:    {'id': '10490272'
standard/log:    {'id': '10490274'
standard/log:    {'id': '10490275'
standard/log:    {'id': '10490277'
standard/log:    {'id': '10490278'
standard/log:    {'id': '10493209'
standard/log:    {'id': '10496349'
standard/log:    {'id': '10497676'
standard/log:    {'id': '10499131'

trevorgokey avatar May 16 '20 16:05 trevorgokey

Those all seem complete on the backend; they're just the gradient calculations, though. I have raised the limit on the nginx server. Hopefully that clears up the stale jobs, but it is only a temporary fix.
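For anyone hitting the same 413, the nginx directive involved is `client_max_body_size`; a minimal sketch of raising it (the value here is illustrative, not what our server uses):

```nginx
http {
    server {
        # nginx returns "413 Request Entity Too Large" when a request
        # body exceeds this limit; the default is 1m, and 0 disables
        # the check entirely.
        client_max_body_size 200m;
    }
}
```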

bennybp avatar May 16 '20 18:05 bennybp

Interesting. Here is a complete list from startup to shutdown, between 200514 21:21:57 and 200516 00:45:34. I marked the points in the list where it tried to push the jobs and the error was returned with #Tried:

'10497676'
'10499131'
'10499988'
'10499993'
'10500012'
'10500013'
'10499995'
'10500015'
'10500018'
'10500069'
'10500090'
'10500145'
'10500704'
'10500706'
'10500718'
'10500721'
'10500888'
'10500891'
'10500893'
'10500933'
'10500936'
'10494185'
'10500937'
'10500962'
'10501000'
'10479610'
'10498965'
'10501120'
'10501308'
'10501318'
'10501343'
'10501350'
'10501376'
'10501467'
'10501505'
'10501555'
'10501629'
'10502060'
'10502062'
'10502236'
'10502237'
'10502275'
'10502485'
'10502528'
'10502615'
'10502656'
'10502948'
'10502995'
'10503255'
'10503257'
'10503263'
'10503266'
'10503269'
'10503272'
'10503359'
'10503408'
'10503427'
'10503442'
'10503518'
'10503552'
'10502945'
'10503746'
'10503767'
'10503777'
'10503780'
'10503795'
'10503808'
'10503812'
'10503825'
'10503831'
'10503857'
'10503858'
'10503885'
'10503887'
'10503888'
'10503920'
'10503954'
'10503970'
'10502946'
'10503196'
'10503645'
'10501883'
'10502440'
'10503256'
'10503361'
'10503665'
'10503768'
'10503919'
'10504299'
'10504329'
'10504332'
'10504337'
'10504348'
'10504106'
'10504349'
'10504354'
'10504356'
'10504359'
'10504380'
'10504387'
'10504390'
'10504392'
'10504398'
'10504401'
'10504409'
'10504462'
'10503505'
'10504579'
'10504583'
'10504586'
'10504588'
'10504592'
'10504597'
'10504625'
'10504627'
'10504683'
'10504712'
'10504714'
'10504793'
'10504804'
'10504805'
'10504811'
'10504815'
'10504818'
'10504838'
'10504845'
'10504873'
'10504894'
'10504939'
'10504943'
'10505034'
'10505035'
'10504679'
'10505293'
'10505395'
'10505425'
'10505463'
'10431966'
'10505862'
'10505864'
'10505866'
'10505867'
'10505879'
'10505880'
'10505891'
'10505895'
'10505897'
'10505900'
'10505958'
'10506169'
'10501504'
'10506680'
'10506687'
'10506706'
'10506745'
'10506789'
'10478553'
'10506796'
'10506799'
'10506808'
'10506813'
'10506817'
'10506831'
'10506822'
'10506828'
'10506838'
'10506843'
'10495089'
'10506866'
'10506879'
'10506883'
'10506872'
'10506893'
'10430383'
'10506894'
'10506898'
'10506901'
'10506917'
'10506721'
'10505521'
'10506948'
'10506961'
'10506949'
'10506125'
'10506964'
'10506966'
'10506978'
'10506984'
'10507021'
'10507053'
'10507027'
'10507029'
'10507057'
'10507080'
'10507064'
'10507093'
'10507110'
'10507116'
'10507118'
'10507127'
'10507130'
'10507134'
'10507151'
'10507154'
'10507156'
'10507172'
'10507175'
'10507198'
'10432436'
'10507195'
'10507221'
'10507240'
'10507241'
'10507245'
'10507262'
'10507270'
'10506914'
'10507266'
'10507275'
'10506720'
'10506419'
'10506748'
'10507282'
'10507286'
'10507291'
'10507292'
'10507302'
'10507303'
'10430120'
'10506264'
'10507307'
'10507309'
'10507312'
'10507314'
'10507318'
'10507320'
'10507323'
'10507325'
'10507328'
'10507341'
'10507343'
'10507346'
'10507350'
'10507352'
'10507364'
'10507382'
'10507383'
'10507392'
'10507393'
'10507400'
'10507407'
'10507413'
'10507464'
'10507415'
'10507420'
'10507424'
'10431970'
'10432113'
'10507496'
'10507024'
'10507426'
'10431633'
'10507525'
'10507527'
'10507428'
'10507432'
'10507545'
'10507449'
'10507451'
'10507455'
'10507569'
'10507456'
'10507583'
'10507462'
'10507477'
'10507479'
'10507613'
'10507481'
'10507482'
'10507484'
'10507485'
'10507490'
'10507493'
'10507510'
'10507513'
'10507517'
'10507521'
'10507534'
'10507536'
'10507539'
'10507544'
'10507552'
'10507554'
'10507555'
'10507558'
'10507560'
'10507573'
'10507584'
'10507585'
'10507587'
'10507606'
'10507767'
'10507609'
'10507614'
'10507616'
'10507650'
'10507654'
'10507657'
'10507687'
'10507688'
'10507690'
'10507693'
'10507695'
'10507698'
'10507699'
'10507701'
'10507744'
'10507746'
'10507748'
'10507750'
'10507757'
'10507781'
'10432000'
'10507784'
'10507785'
'10507813'
'10507815'
'10507816'
'10507818'
'10507829'
'10507837'
'10507838'
'10507839'
'10507841'
'10507845'
'10507849'
'10507852'
'10507890'
'10507891'
'10507893'
'10507895'
'10507896'
'10507914'
'10507918'
'10495448'
'10507932'
'10507942'
'10507953'
'10507958'
'10507959'
'10507962'
'10507965'
'10507921'
'10507969'
'10507972'
'10507975'
'10507987'
'10507990'
'10507995'
'10507998'
'10508013'
'10508015'
'10508072'
'10508032'
'10508034'
'10508039'
'10508093'
'10508096'
'10508043'
'10508046'
'10508054'
'10508055'
'10508059'
'10508066'
'10508067'
'10508069'
'10508075'
'10508079'
'10508081'
'10508085'
'10508088'
'10508099'
'10508104'
'10508109'
'10508110'
'10508112'
'10508116'
'10508124'
'10508127'
'10508130'
'10508139'
'10508143'
'10508160'
'10508165'
'10508166'
'10508168'
'10508170'
'10508268'
'10508269'
'10508276'
'10508277'
'10508278'
'10501159'
'10506779'
'10508207'
'10508210'
'10508212'
'10507412'
'10430617'
'10430707'
'10431532'
'10431549'
'10431561'
'10431597'
'10431784'
'10431817'
'10431864'
'10431875'
'10431952'
'10431962'
'10431969'
'10431993'
'10432011'
'10432021'
'10432059'
'10432081'
'10432129'
'10432130'
'10432147'
'10432481'
'10506672'
'10506885'
'10507769'
'10507770'
'10508095'
'10508254'
'10508270'
'10507904'
'10431815'
'10507957'
'10507973'
'10508425'
'10508426'
'10507988'
'10508012'
'10508431'
'10508432'
'10508040'
'10508037'
'10508065'
'10508082'
'10508102'
'10508520'
'10508239'
'10508323'
'10508329'
'10508632'
'10508356'
'10508371'
'10508375'
'10508378'
'10430924'
'10508463'
'10508493'
'10508501'
'10508652'
'10508796'
'10508804'
'10508879'
'10508901'
'10508921'
'10508926'
'10508972'
#Tried
'10508993'
'10509028'
'10509057'
'10509088'
'10509116'
'10509154'
'10509239'
#Tried
'10509275'
'10509278'
'10509281'
'10509299'
'10509300'
'10509325'
'10509371'
'10509383'
'10509391'
'10509866'
#Tried
'10509422'
'10509487'
'10509488'
'10509502'
'10509515'
'10509517'
'10509562'
'10509604'
'10509660'
'10509695'
'10509702'
#Tried
'10509729'
'10509732'
'10509740'
'10509745'
'10509758'
'10509798'
'10509813'
'10509838'
'10509852'
'10509883'
'10509010'
'10509922'
'10509953'
'10509564'
'10509797'
'10509968'
'10510003'
'10510004'
'10510012'
'10510023'
'10510025'
'10510037'
'10510043'
'10510063'
'10510068'
'10510071'
'10510073'
'10510092'
'10510097'
'10510098'
'10510104'
'10510127'
'10510465'
'10509896'
'10510144'
'10510161'
'10510195'
'10510215'
'10510235'
'10510238'
'10510251'
'10510216'
'10510573'
'10510603'
'10509568'
'10509611'
'10509737'
'10509780'
'10509816'
'10509899'
#Tried
'10509969'
'10510000'
'10508118'
#Tried
'10508326'
'10508962'
'10509320'
'10508825'
'10509849'
'10509920'
'10509966'
'10510077'
'10510103'
'10510136'
'10510138'
'10510147'
#Tried
'10510193'
'10510240'
'10510252'
'10510725'
'10510731'
'10510787'
'10510791'
'10510796'
'10511058'
'10511059'
'10510798'
'10510810'
'10510815'
'10510863'
'10510864'
'10510866'
'10431348'
'10431960'
'10507235'
#Tried
'10511122'
'10411469'
#Tried
'10510766'
'10511057'
'10511167'
'10511168'
'10511169'
'10511172'
'10511186'
#Tried
'10511189'
'10511191'
'10511192'
'10511195'
'10511196'
'10492564'
'10493127'
'10493129'
'10493156'
'10494176'
'10494418'
'10495391'
'10496349'
'10496902'
'10496904'
'10496905'
'10504514'
'10506764'
'10508582'
#Tried
'10508773'

trevorgokey avatar May 16 '20 18:05 trevorgokey

Actually, I'll just attach said log file here qcf-stale.log

trevorgokey avatar May 16 '20 18:05 trevorgokey

That stuff is all looking pretty normal. I don't think I will be able to absolutely figure out why these are so big without more instrumentation that catches large pushes and shows what is in them.

If you feel like giving it a shot, I think I know the easiest place to add it. If not, I will add debugging stuff to the next release.

bennybp avatar May 17 '20 02:05 bennybp

Sure, I was thinking to just add some more prints in _request to see if I can pinpoint it. If you have some ideas I'd be happy to try them out.

trevorgokey avatar May 17 '20 03:05 trevorgokey

Ok, that's actually what I was thinking. You could check the size of the serialized payload and, if it's greater than 50MB or so, dump the unserialized data to a file.
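Something along these lines could go in `_request`. This is a debugging sketch under assumptions, not QCFractal code: the real client uses its own serializer rather than plain json, and `dump_if_oversized` is a hypothetical helper name.

```python
import json
import tempfile

LIMIT = 50 * 1024 * 1024  # roughly the proxy's body-size limit


def dump_if_oversized(payload, limit=LIMIT):
    """Serialize a payload and, if it exceeds `limit` bytes, dump the
    unserialized data to a file for later inspection.

    Returns the path of the dump file, or None if the payload was
    under the limit.
    """
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= limit:
        return None
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", prefix="qcf_oversized_", delete=False
    ) as fh:
        json.dump(payload, fh)
        return fh.name
```

The dumped file would show which result (wavefunction, stdout, etc.) is blowing up the push.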

bennybp avatar May 18 '20 14:05 bennybp

Yeah, will do once I get more time on the cluster (it seems to have run dry). I'll post again when I get some results. I can also try other clusters to see if the error appears there.

That said, I have not gotten any more errors since the limit was raised, so it seems to have solved the immediate issue.

trevorgokey avatar May 18 '20 21:05 trevorgokey

We are running our instances behind traefik now. I haven't seen these errors in a while.

Also, the managers in v0.50 are smarter and I think handle this more gracefully.

bennybp avatar Sep 14 '23 16:09 bennybp