Way to skip PDFs that cause `Syntax error`

Open wcedmisten opened this issue 1 year ago • 2 comments

Hello!

I'm not sure whether there is currently a workaround for this, but I'm trying to bulk upload a set of ~70,000 PDFs using `ia upload`. The problem is that I periodically get this error:

Uploaded content is unacceptable. - Syntax error detected in pdf data. You may be able to repair the pdf file with a repair tool, pdftk is one such tool.

The command then exits with an error, and I have to manually delete the PDF and run the command again to resume uploading. Is there a way to automatically skip the PDFs that trigger the error so that manual intervention is not required?

wcedmisten avatar Jun 02 '24 18:06 wcedmisten

There is not currently a way to skip failed uploads and continue uploading other files specified in the command (I support this feature though, if anybody has time to add it).

I would suggest finding and filtering any invalid PDFs before uploading:

» find my_pdf_dir -type f | parallel 'pdfinfo -- {} >/dev/null 2>&1 || echo invalid pdf: {}'

jjjake avatar Jun 04 '24 18:06 jjjake

Thanks! For posterity, that command didn't output anything, even though pdfinfo on an individual bad file printed the error as expected. I ended up writing a non-parallelized version:

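# Report every file in the current directory for which pdfinfo prints a "Syntax" message.
for f in *; do
  # 2>&1 >/dev/null sends stderr to grep and discards stdout
  if pdfinfo -- "$f" 2>&1 >/dev/null | grep 'Syntax'; then
    echo "Error on $f"
  fi
done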
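
The same check can be wrapped in functions so that the clean files, rather than the bad ones, are what gets printed. This is only a sketch: the function names `pdf_has_syntax_error` and `list_clean_pdfs` are made up for illustration, it assumes the files end in `.pdf`, and it assumes, as in the loop above, that pdfinfo reports a bad file with a message containing "Syntax" on stderr.

```shell
#!/bin/sh
# Sketch: print the PDFs in a directory that pdfinfo reads without a
# "Syntax" message, and report the rest on stderr.
# Function names are hypothetical, not part of ia or pdfinfo.

# Succeeds (exit 0) if pdfinfo prints a line containing "Syntax" on stderr.
pdf_has_syntax_error() {
  # 2>&1 >/dev/null sends stderr into the pipe and discards stdout.
  pdfinfo -- "$1" 2>&1 >/dev/null | grep -q 'Syntax'
}

# Print the clean *.pdf files directly inside directory $1, one per line.
list_clean_pdfs() {
  for f in "$1"/*.pdf; do
    [ -f "$f" ] || continue   # the glob matched nothing
    if pdf_has_syntax_error "$f"; then
      echo "skipping: $f" >&2
    else
      printf '%s\n' "$f"
    fi
  done
}
```

Something like `list_clean_pdfs my_pdf_dir > clean.txt` would then give a list of files to upload. Passing each listed file to its own `ia upload <identifier> <file>` call, instead of one call for all files, should also keep a single rejected PDF from stopping the rest, at the cost of one invocation per file.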

wcedmisten avatar Jun 08 '24 16:06 wcedmisten

Error

Tasnem2000 avatar Nov 22 '24 13:11 Tasnem2000