Opengrok is unable to detect that txt file containing © is a text file

Open mmmarq opened this issue 7 years ago • 5 comments

Hey all,

I'm running Opengrok 1.0 over my source code and I noticed that many text files are treated as an unknown type because of a special character on the first line. The character is ©. The Opengrok output when it reads this file is:

INFO: Add: /locales/es_US.txt (FileAnalyzer)

Original file is at:

es_US.txt

As I said, I'm running Opengrok 1.0 with Java jdk1.8.0_45. The OS is Linux 4.2.0-42-generic #49~14.04.1-Ubuntu SMP Wed Jun 29 20:22:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Any chance to fix this issue?

Thanks in advance

Marcelo M

mmmarq avatar Oct 05 '17 21:10 mmmarq

What is your locale?

tarzanek avatar Oct 06 '17 20:10 tarzanek

Hello @tarzanek,

See my terminal output:

LANG=en_US.UTF-8
LANGUAGE=en_US:
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Thanks

mmmarq avatar Oct 06 '17 21:10 mmmarq

Definitely a bug ... I guess I should find some time and finally solve all the UTF-8 issues once and for all ... (and I did hope all our readers would default to the current locale/UTF-8 ... oh well ...)

tarzanek avatar Oct 10 '17 14:10 tarzanek

This is somewhat resolved by open pull request #1817, where the UTF-8 BOM in that sample file, es_US.txt, would be a precise indicator that the file is plain text.

(If the file lacked the BOM, and a UTF-8 BOM is permitted but generally not recommended, that two-byte copyright symbol (0xC2 0xA9 in UTF-8) would still prevent the file from being inferred to be text.)

Another thought is that PlainAnalyzerFactory should perhaps define at least one suffix that should be quickly inferred as plain text (e.g., "TXT").
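
To make the two ideas above concrete, here is a minimal sketch. It is not OpenGrok's actual code: the class `TextSniff` and the methods `hasUtf8Bom` and `hasTxtSuffix` are hypothetical names made up for illustration. It checks for the three-byte UTF-8 BOM (EF BB BF) at the start of the content, and separately shows what a suffix shortcut for "TXT" might look like.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of OpenGrok.
public class TextSniff {
    // The UTF-8 byte order mark is the byte sequence EF BB BF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true if the given leading bytes start with the UTF-8 BOM. */
    public static boolean hasUtf8Bom(byte[] head) {
        if (head.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (head[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    /** Assumed suffix shortcut: treat names ending in .txt (any case) as plain text. */
    public static boolean hasTxtSuffix(String fileName) {
        return fileName.toUpperCase(Locale.ROOT).endsWith(".TXT");
    }

    public static void main(String[] args) {
        // The copyright sign U+00A9 encodes to two bytes in UTF-8: C2 A9.
        byte[] copyright = "\u00A9".getBytes(StandardCharsets.UTF_8);
        System.out.printf("copyright bytes: %02X %02X%n", copyright[0], copyright[1]);

        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, copyright[0], copyright[1]};
        System.out.println("BOM present: " + hasUtf8Bom(withBom));
        System.out.println("BOM present without prefix: " + hasUtf8Bom(copyright));
        System.out.println("es_US.txt has TXT suffix: " + hasTxtSuffix("es_US.txt"));
    }
}
```

A BOM check only helps files that carry one; the suffix shortcut would cover BOM-less files like the case described in the parenthetical above, at the cost of trusting the file name.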

idodeclare avatar Oct 10 '17 18:10 idodeclare

With 1.1-rc17 and higher, text files with UTF BOMs are analyzed with PlainAnalyzer.

idodeclare avatar Jan 27 '18 21:01 idodeclare