List jobs crash when the last path component is too long

Open • JustAnotherArchivist opened this issue 7 years ago • 1 comment

Job c6gd8eb8rwk5su6ijj5g52sv5 crashed with the following traceback:

Starting StartHeartbeat for Item 
Finished StartHeartbeat for Item 
Starting SetFetchDepth for Item 
Finished SetFetchDepth for Item 
Starting PreparePaths for Item 
Finished PreparePaths for Item 
Starting WriteInfo for Item 
Failed WriteInfo for Item 
Traceback (most recent call last):
  File "/home/matt/.local/lib/python3.5/site-packages/seesaw/task.py", line 69, in enqueue
    self.process(item)
  File "/home/matt/ArchiveBot/pipeline/archivebot/seesaw/tasks.py", line 270, in process
    with open(item['source_info_file'], 'w') as f:
OSError: [Errno 36] File name too long: '/home/matt/ArchiveBot/pipeline/data/1537864410911f514affdc1bdf-2636/c6gd8eb8rwk5su6ijj5g52sv5/urls-transfer.sh-Speedosausage-PolygonCherub-mitarashikousei-mashu_003-chakkuheart-chalsoma-akituti-dragooooooooon-Gendo0032-gn_yaky-hataman331-i_n_u-KSUWABE-mahayang0128-mmgrk-Moginiki-RockLow696-rswxx-sameduma-sidotama-U9Works-yunoi_terere-shift0808-pjaniishimo-tweets-shallow-20180925-092352-c6gd8.json'
Waiting 10 seconds...

When the filename is too long, the pipeline should shorten it. I don't know whether there's a way to determine which length is safe. It could simply remove one character at a time from the end of the slug (i.e. the part before -(shallow|inf)-) until the write succeeds. However, it would also need to account for other files from the same job having longer filenames (e.g. -00000.warc.gz instead of .json); ideally, all files from a job should share the same base filename.
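One possible alternative to trial and error, sketched here as an illustration only: on POSIX systems, os.pathconf() with "PC_NAME_MAX" reports the per-component limit in bytes for a given directory, so the slug could be trimmed once, up front, against the longest suffix the job is expected to produce. The names shorten_slug, fixed_tail, longest_suffix and name_max are hypothetical and not part of the ArchiveBot pipeline; the 255-byte fallback and the idea that the caller knows the longest suffix are assumptions.

```python
import os


def shorten_slug(slug, fixed_tail, longest_suffix, directory=".", name_max=None):
    """Trim slug so that slug + fixed_tail + longest_suffix fits in one path component.

    Hypothetical helper: fixed_tail is the part after the slug that every file
    shares (e.g. "-shallow-20180925-092352-c6gd8"), and longest_suffix is the
    longest per-file ending the job is assumed to produce.
    """
    if name_max is None:
        try:
            # Limit is in bytes and depends on the filesystem holding `directory`.
            name_max = os.pathconf(directory, "PC_NAME_MAX")
        except (AttributeError, ValueError, OSError):
            name_max = -1
        if name_max <= 0:
            name_max = 255  # assumed fallback, common on Linux filesystems

    budget = name_max - len((fixed_tail + longest_suffix).encode("utf-8"))
    if budget < 1:
        raise ValueError("no room left for the slug")

    encoded = slug.encode("utf-8")
    if len(encoded) <= budget:
        return slug
    # Cut at the byte budget; errors="ignore" drops a partial trailing character.
    return encoded[:budget].decode("utf-8", errors="ignore")
```

Because the same shortened slug would be used for every file of the job, the .json, .warc.gz and any temporary files would keep a common base name. Note that a hard byte cut can split a percent-escape such as %D8 in the slug; that is only cosmetic for a filename, but a real implementation might prefer to cut at a cleaner boundary.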

JustAnotherArchivist, Nov 12 '18 00:11

As an example regarding that last point, job 2dpu0yxg4tnyzemryrs5iyvxs just crashed on the WARC wpullinc file:

Starting WgetDownload for Item 
Manhole[3714:1569868995.0911]: Patched <built-in function fork> and <built-in function forkpty>.
Manhole[3714:1569868995.0947]: Manhole UDS path: /tmp/manhole-3714
Manhole[3714:1569868995.0949]: Waiting for new connection (in pid:3714) ...
ERROR OSError: [Errno 36] File name too long: '/home/archivebot/ArchiveBot/pipeline-c/data/15698689839fc5fe70c33baa66-83/2dpu0yxg4tnyzemryrs5iyvxs/urls-transfer.notkiska.pw-facebook-@DrSayed-Noorullah-Jalili-%D8%AF%D9%88%DA%A9%D8%AA%D9%88%D8%B1-%D8%B3%DB%8C%D8%AF%D9%86%D9%88%D8%B1%D8%A7%D9%84%D9%84%D9%87-%D8%AC%D9%84%DB%8C%D9%84%DB%8C-2280050385540346-shallow-20190930-184313-2dpu0-00000.warc.gz-wpullinc'
Finished WgetDownload for Item 

Incidentally, that job did not report an error or crash properly; cf. #423

JustAnotherArchivist, Sep 30 '19 18:09