opengrok
Opengrok is unable to detect that txt file containing © is a text file
Hey all,
I'm running OpenGrok 1.0 over my source code, and I noticed that many text files are treated as an unknown type because of a special character on the first line. The character is ©. OpenGrok's output when it reads such a file is:
INFO: Add: /locales/es_US.txt (FileAnalyzer)
Original file is at:
As I said, I'm running OpenGrok 1.0 with Java jdk1.8.0_45. The OS is Linux 4.2.0-42-generic #49~14.04.1-Ubuntu SMP Wed Jun 29 20:22:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Any chance to fix this issue?
Thanks in advance
Marcelo M
What is your locale?
Hello @tarzanek,
See my terminal output:
LANG=en_US.UTF-8
LANGUAGE=en_US:
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
Thanks
Definitely a bug ... I guess I should find some time and finally solve all the UTF-8 issues once and for all ... (and I did hope all our readers used the current locale/UTF-8 by default ... oh well ... )
This is somewhat resolved by open pull request #1817, where the UTF-8 BOM in that sample file, es_US.txt, would be a precise indicator that the file is plain text.
(If the file lacked the BOM — and UTF-8 BOM is generally non-standard — that two-byte copyright symbol would still prevent the file from being inferred to be text.)
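To make the byte-level point concrete, here is a minimal Python sketch (not OpenGrok's actual detection code). The `looks_like_ascii_text` heuristic and the sample line are assumptions made up for illustration; only the UTF-8 BOM bytes (EF BB BF) and the two-byte encoding of © (C2 A9) are facts.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the 3-byte UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def looks_like_ascii_text(data: bytes) -> bool:
    """Hypothetical naive heuristic: accept only printable ASCII plus
    tab, LF, and CR. Any byte >= 0x80 makes the data look non-text."""
    return all(b in (0x09, 0x0A, 0x0D) or 0x20 <= b <= 0x7E for b in data)


# Hypothetical first line of a locale file, encoded as UTF-8.
copyright_line = "© 2016 Example\n".encode("utf-8")
with_bom = codecs.BOM_UTF8 + copyright_line

print("©".encode("utf-8").hex())              # the two bytes of the symbol
print(has_utf8_bom(with_bom))                 # BOM present
print(has_utf8_bom(copyright_line))           # BOM absent
print(looks_like_ascii_text(copyright_line))  # rejected by the ASCII-only check
```

With the BOM present, the first check alone is enough to call the file text. Without it, an ASCII-only check like the one assumed here rejects the line because of the C2 A9 bytes, even though the data decodes cleanly as UTF-8.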
Another thought is that PlainAnalyzerFactory should perhaps define at least one suffix that should be quickly inferred as plain text (e.g., "TXT").
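As a rough illustration of the suffix idea, again a Python sketch rather than anything from the OpenGrok code base: `PLAIN_TEXT_SUFFIXES` and `guess_plain_by_suffix` are hypothetical names, and the uppercase comparison simply mirrors the "TXT" spelling used above.

```python
from pathlib import PurePosixPath

# Hypothetical suffix table; "TXT" is the only entry suggested in this thread.
PLAIN_TEXT_SUFFIXES = {"TXT"}


def guess_plain_by_suffix(path: str) -> bool:
    """Return True if the file name ends in a suffix registered as plain text.

    The suffix is compared in uppercase and without the leading dot.
    Files with no suffix never match.
    """
    suffix = PurePosixPath(path).suffix.lstrip(".").upper()
    return suffix in PLAIN_TEXT_SUFFIXES


print(guess_plain_by_suffix("/locales/es_US.txt"))
print(guess_plain_by_suffix("/src/Main.java"))
```

A lookup like this would run before any content sniffing, so a file such as /locales/es_US.txt would be classified as plain text regardless of what its first bytes are.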
With 1.1-rc17 and higher, text files with UTF BOMs are analyzed with PlainAnalyzer.