xed's detection is a bit better than cchardet's

Open JCCyC opened this issue 1 year ago • 1 comments

OS/Arch

system='Linux', node='jclvdell', release='6.8.0-40-generic', version='#40~22.04.3-Ubuntu SMP PREEMPT_DYNAMIC Tue Jul 30 17:30:19 UTC 2', machine='x86_64'

Python version

3.10.12

cChardet version

2.1.7

What is the problem?

The attached file contains a Euro sign. The xed editor correctly detects it as ISO-8859-15, but cchardet reports ISO-8859-1.

Expected behavior

Corações Psicodélicos Nélida Piñón, § 2º, alínea 4ª, a 47° do eixo x. Custo: 50000¥ (ou €313,84)

Actual behavior

Corações Psicodélicos Nélida Piñón, § 2º, alínea 4ª, a 47° do eixo x. Custo: 50000¥ (ou ¤313,84)

(Euro symbol appears as "¤")
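For context, the two renderings differ in a single byte: ISO-8859-1 and ISO-8859-15 assign different characters to 0xA4. A minimal sketch using only Python's built-in codecs (the sample bytes are an illustration, not read from the attached file):

```python
# The byte 0xA4 is the currency sign in ISO-8859-1 and the euro sign in ISO-8859-15.
sample = b"(ou \xa4313,84)"  # illustrative bytes, not taken from pagininha2.html

as_latin1 = sample.decode("iso-8859-1")
as_latin9 = sample.decode("iso-8859-15")

print(as_latin1)  # (ou ¤313,84)
print(as_latin9)  # (ou €313,84)
```

Both decodes succeed without error, which is likely why a detector cannot tell the two encodings apart from byte validity alone.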

Steps to reproduce the behavior

  1. Get this file: pagininha2.html.gz

  2. Do this:

$ gunzip pagininha2.html.gz
$ python
>>> import cchardet as chardet
>>> with open("pagininha2.html", "rb") as f:
...   msg = f.read()
...   result = chardet.detect(msg)
...   print(result)
... 
{'encoding': 'ISO-8859-1', 'confidence': 0.7640712261199951}
>>> 
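A possible workaround on the caller's side, sketched under assumptions: `prefer_latin9` is a hypothetical helper, not part of cchardet, and it simply assumes that a 0xA4 byte in text detected as ISO-8859-1 is more likely a euro sign than a generic currency sign. That assumption will be wrong for files that really do use "¤".

```python
from typing import Optional

EURO_BYTE = 0xA4  # euro sign in ISO-8859-15, currency sign in ISO-8859-1


def prefer_latin9(data: bytes, detected: Optional[str]) -> Optional[str]:
    """Hypothetical helper: upgrade an ISO-8859-1 guess to ISO-8859-15
    when the raw bytes contain 0xA4. Any other guess is returned unchanged."""
    if detected and detected.upper() == "ISO-8859-1" and EURO_BYTE in data:
        return "ISO-8859-15"
    return detected
```

With the session above, this could be used as `msg.decode(prefer_latin9(msg, result["encoding"]))`.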

JCCyC avatar Sep 05 '24 17:09 JCCyC

For long inputs, I prefer charset_normalizer, but it's slower than cchardet.

milahu avatar Nov 21 '24 17:11 milahu