List jobs crash when the last path component is too long
Job c6gd8eb8rwk5su6ijj5g52sv5 crashed with the following traceback:
Starting StartHeartbeat for Item
Finished StartHeartbeat for Item
Starting SetFetchDepth for Item
Finished SetFetchDepth for Item
Starting PreparePaths for Item
Finished PreparePaths for Item
Starting WriteInfo for Item
Failed WriteInfo for Item
Traceback (most recent call last):
  File "/home/matt/.local/lib/python3.5/site-packages/seesaw/task.py", line 69, in enqueue
    self.process(item)
  File "/home/matt/ArchiveBot/pipeline/archivebot/seesaw/tasks.py", line 270, in process
    with open(item['source_info_file'], 'w') as f:
OSError: [Errno 36] File name too long: '/home/matt/ArchiveBot/pipeline/data/1537864410911f514affdc1bdf-2636/c6gd8eb8rwk5su6ijj5g52sv5/urls-transfer.sh-Speedosausage-PolygonCherub-mitarashikousei-mashu_003-chakkuheart-chalsoma-akituti-dragooooooooon-Gendo0032-gn_yaky-hataman331-i_n_u-KSUWABE-mahayang0128-mmgrk-Moginiki-RockLow696-rswxx-sameduma-sidotama-U9Works-yunoi_terere-shift0808-pjaniishimo-tweets-shallow-20180925-092352-c6gd8.json'
Waiting 10 seconds...
When the filename is too long, the pipeline should shorten it. I don't know whether there is a reliable way to determine which length is safe. It could simply remove one character at a time from the end of the slug (i.e. the part before -(shallow|inf)-) until opening the file succeeds. However, it would also need to take into account that other files from the same job have longer names (e.g. they end in -00000.warc.gz instead of .json); ideally, all files from a job should share the same base name.
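One possible alternative to trial and error, as a rough sketch only: on POSIX systems, os.pathconf() can report the per-component limit (NAME_MAX) of the target directory, and the slug could be trimmed once so that the base name plus the longest suffix any file of the job appends still fits. The function names, the LONGEST_SUFFIX value (taken from the second traceback in this report), and the fallback of 255 bytes are assumptions for illustration, not existing ArchiveBot code.

```python
import os

# Assumption: the longest suffix appended to a job's base name.
# '-00000.warc.gz-wpullinc' is the longest one seen in this report.
LONGEST_SUFFIX = '-00000.warc.gz-wpullinc'


def name_max(directory):
    """Return the maximum filename length in bytes for `directory`.

    Falls back to 255 (a common limit, assumed here) if the value
    cannot be queried on this platform or filesystem.
    """
    try:
        return os.pathconf(directory, 'PC_NAME_MAX')
    except (OSError, ValueError, AttributeError):
        return 255


def shorten_base_name(slug, tail, limit, longest_suffix=LONGEST_SUFFIX):
    """Trim `slug` so that slug + tail + longest_suffix fits in `limit` bytes.

    `tail` is the fixed part after the slug, e.g.
    '-shallow-20180925-092352-c6gd8'. Returns the shared base name
    (slug + tail) that every file of the job would use.
    """
    budget = limit - len((tail + longest_suffix).encode('utf-8'))
    if budget < 1:
        raise ValueError('no room left for the slug')
    encoded = slug.encode('utf-8')
    if len(encoded) > budget:
        # Cut on bytes, since the limit is in bytes; errors='ignore'
        # drops a multi-byte character that was split by the cut.
        slug = encoded[:budget].decode('utf-8', errors='ignore')
    return slug + tail
```

For example, shorten_base_name(slug, tail, name_max(job_directory)) would give one base name that is safe for the .json file as well as the WARC files. Note that this cuts blindly and may split a percent-encoded sequence such as %D8 in the slug, which only affects how the name reads.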
As an example regarding that last point, job 2dpu0yxg4tnyzemryrs5iyvxs just crashed on the WARC wpullinc file:
Starting WgetDownload for Item
Manhole[3714:1569868995.0911]: Patched <built-in function fork> and <built-in function forkpty>.
Manhole[3714:1569868995.0947]: Manhole UDS path: /tmp/manhole-3714
Manhole[3714:1569868995.0949]: Waiting for new connection (in pid:3714) ...
ERROR OSError: [Errno 36] File name too long: '/home/archivebot/ArchiveBot/pipeline-c/data/15698689839fc5fe70c33baa66-83/2dpu0yxg4tnyzemryrs5iyvxs/urls-transfer.notkiska.pw-facebook-@DrSayed-Noorullah-Jalili-%D8%AF%D9%88%DA%A9%D8%AA%D9%88%D8%B1-%D8%B3%DB%8C%D8%AF%D9%86%D9%88%D8%B1%D8%A7%D9%84%D9%84%D9%87-%D8%AC%D9%84%DB%8C%D9%84%DB%8C-2280050385540346-shallow-20190930-184313-2dpu0-00000.warc.gz-wpullinc'
Finished WgetDownload for Item
Incidentally, that job did not report an error or crash properly; cf. #423