internetarchive
Way to skip PDFs that cause `Syntax error`
Hello!
Not sure if there is any workaround for this currently, but I'm trying to bulk upload a set of ~70,000 PDFs using `ia upload`. The problem is that I periodically get the error:
Uploaded content is unacceptable. - Syntax error detected in pdf data. You may be able to repair the pdf file with a repair tool, pdftk is one such tool.
The command then exits with an error. I have to manually delete the offending PDF and run the command again to resume uploading. Is there a way to automatically skip the PDFs that throw the error so that manual intervention is not required?
There is not currently a way to skip failed uploads and continue uploading other files specified in the command (I support this feature though, if anybody has time to add it).
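In the meantime, one possible workaround is to call `ia upload` once per file, so that a rejected PDF only fails its own invocation. This is a rough sketch, not a built-in feature: `upload_each`, the `FAILED_LOG` variable and the `failed_uploads.txt` name are made up for illustration, and it assumes `ia upload` exits with a non-zero status when a file is rejected.

```shell
# upload_each IDENTIFIER FILE...
# Assumption: `ia upload` exits non-zero when the upload is rejected.
# FAILED_LOG / failed_uploads.txt are hypothetical names for this sketch.
upload_each() {
  identifier=$1
  shift
  failed_log=${FAILED_LOG:-failed_uploads.txt}
  for f do
    # One file per invocation, so a rejected PDF only fails its own upload.
    if ! ia upload "$identifier" "$f"; then
      # Record the path so it can be repaired or retried later.
      printf '%s\n' "$f" >> "$failed_log"
    fi
  done
}
```

It would be called as `upload_each my_item_identifier my_pdf_dir/*.pdf`, where `my_item_identifier` is a placeholder. One invocation per file is likely slower than a single bulk command, so this trades speed for not having to babysit the upload.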
I would suggest finding and filtering any invalid PDFs before uploading:
» find my_pdf_dir -type f | parallel 'pdfinfo -- {} >/dev/null 2>&1 || echo invalid pdf: {}'
Thanks! For posterity: that command didn't output anything, even though running pdfinfo on an individual bad file printed the error correctly. I ended up writing a non-parallelized version:
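# Glob instead of parsing ls, and quote "$f", so names with spaces survive.
for f in *; do
  # 2>&1 >/dev/null sends only stderr to grep; stdout is discarded.
  if pdfinfo -- "$f" 2>&1 >/dev/null | grep 'Syntax'; then
    echo "Error on $f"
  fi
done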
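For anyone who needs to scan nested directories, the same check can be wrapped in a small function. This is only a sketch under a few assumptions: `find_bad_pdfs` is a made-up name, the match is on the word "syntax" in pdfinfo's stderr (case-insensitive, since the exact wording may differ between pdfinfo versions), and the exit status of pdfinfo is deliberately ignored.

```shell
# find_bad_pdfs DIR
# Prints the path of every file under DIR for which pdfinfo reports a
# syntax error on stderr. The exit status of pdfinfo is not consulted.
find_bad_pdfs() {
  find "$1" -type f -exec sh -c '
    for f do
      # Only stderr reaches grep; stdout (the normal metadata) is discarded.
      if pdfinfo -- "$f" 2>&1 >/dev/null | grep -qi "syntax"; then
        printf "%s\n" "$f"
      fi
    done
  ' sh {} +
}
```

Running `find_bad_pdfs my_pdf_dir > bad_pdfs.txt` would give a list of files to repair or move aside before uploading; `bad_pdfs.txt` is just an example name.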