tarindexer
Crash, most likely due to non-unicode characters in file name
Hello! Thanks for the useful idea!
I tried to use your program on a big archive under a UTF-8 locale, and it crashed with this stack trace:
Traceback (most recent call last):
  File "tarindexer.py", line 123, in <module>
  File "tarindexer.py", line 118, in main
    indextar(dbtarfile,indexfile)
  File "tarindexer.py", line 66, in indextar
    outfile.write(rec)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 40-47: surrogates not allowed
The file name that most likely triggered the crash is
\317\360\356\341\353\345\354\373\ \341\345\347\356\357\340\361\355\356\361\362\350\ \342\ \310\322.pdf
(as output by ls -b), which indeed does not look like valid UTF-8.
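For reference, here is a minimal sketch of what I think is happening. It assumes the name reaches the index writer as a `str` decoded with the `surrogateescape` error handler (the default for Python 3's `tarfile` module) and that the output file is written as strict UTF-8. The variable names are mine, not taken from tarindexer.py.

```python
# First word of the reported file name, given as the octal escapes from ls -b.
raw = b"\317\360\356\341\353\345\354\373.pdf"

# With errors="surrogateescape", every byte that is not valid UTF-8
# becomes a lone surrogate code point (U+DC80..U+DCFF).
name = raw.decode("utf-8", "surrogateescape")

try:
    # A text file opened with a strict UTF-8 encoding does the same on write().
    name.encode("utf-8")
    strict_ok = True
    reason = None
except UnicodeEncodeError as exc:
    strict_ok = False
    reason = exc.reason

print(strict_ok, reason)

# The original bytes are still recoverable with the same error handler.
roundtrip = name.encode("utf-8", "surrogateescape")
```

The eight escaped bytes map to eight surrogates, which would match the eight-character span (positions 40-47) in the error message if the record starts with this name at that offset.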
Unfortunately, I cannot send you the archive, mostly because this file and the surrounding files are rather big.
Having this file in the archive is my fault, but I think the program should avoid crashing, perhaps by printing ls -b-style output for such names instead.
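A rough sketch of the kind of escaping I mean, in case it helps. `escape_name` is a hypothetical helper, not something that exists in tarindexer.py. It assumes the name was decoded with `surrogateescape`, and it only approximates ls -b (octal escapes for non-printable or non-ASCII bytes, a backslash before spaces).

```python
def escape_name(name: str) -> str:
    """Return name unchanged if it encodes as UTF-8, else an ls -b-like form."""
    try:
        name.encode("utf-8")
        return name
    except UnicodeEncodeError:
        pass

    # Recover the original bytes; this assumes the surrogates came from
    # the surrogateescape handler and nowhere else.
    raw = name.encode("utf-8", "surrogateescape")

    out = []
    for b in raw:
        if b == 0x20:
            out.append("\\ ")          # space, as ls -b prints it
        elif b == 0x5C:
            out.append("\\\\")         # literal backslash
        elif 0x21 <= b <= 0x7E:
            out.append(chr(b))         # printable ASCII stays as-is
        else:
            out.append("\\%03o" % b)   # everything else as a 3-digit octal escape
    return "".join(out)


bad = b"\310\322 a.pdf".decode("utf-8", "surrogateescape")
print(escape_name(bad))
print(escape_name("ok.pdf"))
```

A simpler alternative might be to open the index file with `errors="backslashreplace"`, which writes the surrogates as `\udcXX` escapes instead of raising, at the cost of a less readable index.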