Opening file with wrong encoding.

Open cmal opened this issue 7 years ago • 9 comments

Hi, I find that vlf cannot detect a file's encoding and open it with the right encoding the way GNU Emacs's default find-file does.

I can open the same file with the right encoding using find-file.

How can I open the file with the right encoding?

Thanks!

cmal avatar Apr 18 '17 09:04 cmal

Can you tell what encoding find-file reports? After opening the file, this can be checked with:

M-x describe-current-coding-system

Maybe I will be able to reproduce it on an arbitrary file of my own and attempt some tweaks. Otherwise, it's a known issue that detecting the correct encoding when starting at a random part of a file is imperfect: #16

m00natic avatar Apr 18 '17 22:04 m00natic

Coding system for saving this buffer:
  c -- chinese-gbk-dos (alias: gbk-dos cp936-dos windows-936-dos)

Default coding system (for new files):
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for keyboard input:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for terminal output:
  U -- utf-8-unix (alias: mule-utf-8-unix)

Coding system for inter-client cut and paste:
  nil
Defaults for subprocess I/O:
  decoding: U -- utf-8-unix (alias: mule-utf-8-unix)

  encoding: U -- utf-8-unix (alias: mule-utf-8-unix)


Priority order for recognizing coding systems when reading files:
  1. utf-8 (alias: mule-utf-8)
  2. chinese-gbk (alias: gbk cp936 windows-936)
  3. iso-2022-cn (alias: chinese-iso-7bit)
  4. chinese-big5 (alias: big5 cn-big5 cp950)
  5. chinese-iso-8bit (alias: cn-gb-2312 euc-china euc-cn cn-gb gb2312)
  6. iso-2022-7bit 
  7. iso-2022-8bit-ss2 
  8. emacs-mule 
  9. raw-text 
  10. iso-2022-jp (alias: junet)
  11. in-is13194-devanagari (alias: devanagari)
  12. utf-8-auto 
  13. utf-8-with-signature 
  14. utf-16 
  15. utf-16be-with-signature (alias: utf-16-be)
  16. utf-16le-with-signature (alias: utf-16-le)
  17. utf-16be 
  18. utf-16le 
  19. japanese-shift-jis (alias: shift_jis sjis)
  20. undecided 

  Other coding systems cannot be distinguished automatically
  from these, and therefore cannot be recognized automatically
  with the present coding system priorities.

Particular coding systems specified for certain file names:

  OPERATION	TARGET PATTERN		CODING SYSTEM(s)
  ---------	--------------		----------------
  File I/O      "\\.dz\\'"              (no-conversion . no-conversion)
                "\\.txz\\'"             (no-conversion . no-conversion)
                "\\.xz\\'"              (no-conversion . no-conversion)
                "\\.lzma\\'"            (no-conversion . no-conversion)
                "\\.lz\\'"              (no-conversion . no-conversion)
                "\\.g?z\\'"             (no-conversion . no-conversion)
                "\\.\\(?:tgz\\|svgz\\|sifz\\)\\'"
                                        (no-conversion . no-conversion)
                "\\.tbz2?\\'"           (no-conversion . no-conversion)
                "\\.bz2\\'"             (no-conversion . no-conversion)
                "\\.Z\\'"               (no-conversion . no-conversion)
                "\\.elc\\'"             utf-8-emacs
                "\\.el\\'"              prefer-utf-8
                "\\.utf\\(-8\\)?\\'"    utf-8
                "\\.xml\\'"             xml-find-file-coding-system
                "\\(\\`\\|/\\)loaddefs.el\\'"
                                        (raw-text . raw-text-unix)
                "\\.tar\\'"             (no-conversion . no-conversion)
                "\\.po[tx]?\\'\\|\\.po\\."
                                        po-find-file-coding-system
                "\\.\\(tex\\|ltx\\|dtx\\|drv\\)\\'"
                                        latexenc-find-file-coding-system
                ""                      (undecided)
  Process I/O	nothing specified
  Network I/O	nothing specified

cmal avatar Apr 19 '17 05:04 cmal

I found that vlf correctly opens a smaller file that I cut from the beginning of the large file, even though the large file itself cannot be opened correctly.

cmal avatar Apr 19 '17 05:04 cmal

Thank you for the details! It seems in line with what I observed once upon a time with utf-16. The case back then was that there were some magic header bytes at the beginning of the file which specified the encoding. Inserting an arbitrary batch from anywhere besides the beginning doesn't get this information, and the insert function is unable to detect the proper encoding.
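To illustrate the utf-16 case, here is a minimal Python sketch (not VLF's code, which is Emacs Lisp). The helper `has_utf16_bom` and the batch size of 16 bytes are made up for the illustration. It only shows that the byte order mark is visible in the first batch and absent from a batch taken further into the file.

```python
import codecs


def has_utf16_bom(chunk: bytes) -> bool:
    """Hypothetical helper: True if the chunk starts with a UTF-16 byte order mark."""
    return chunk.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


# Python's "utf-16" codec prepends a BOM when encoding.
text = "编码检测 encoding detection\n" * 4
data = text.encode("utf-16")

first_batch = data[:16]     # starts at offset 0, so it carries the BOM
middle_batch = data[16:32]  # starts inside the file, so it has no BOM

print(has_utf16_bom(first_batch))
print(has_utf16_bom(middle_batch))
```

A detector that relies on those header bytes has nothing to go on for the second batch.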

Probably in such cases VLF has to keep track of the initially observed encoding and use it in case auto detection fails on other batches. I'll look deeper, probably this weekend, and hopefully come up with a solution this time. Keep your file around just in case ;-)
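The fallback idea could look roughly like the Python sketch below. This is only an assumption about the shape of such a fix, not the actual VLF change: `PRIORITY`, `detect`, and `BatchDecoder` are hypothetical names, detection is reduced to "first candidate that decodes strictly", and the sample text is made up.

```python
from typing import Optional

# Hypothetical candidate list, loosely echoing the priority order reported above.
PRIORITY = ("utf-8", "gbk")


def detect(chunk: bytes) -> Optional[str]:
    """Return the first candidate that decodes the chunk strictly, else None."""
    for name in PRIORITY:
        try:
            chunk.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


class BatchDecoder:
    """Remember the encoding seen on the first batch and reuse it on failure."""

    def __init__(self) -> None:
        self.initial: Optional[str] = None

    def decode(self, chunk: bytes, first: bool = False) -> str:
        detected = detect(chunk)
        if first:
            self.initial = detected
        # Fall back to the remembered encoding, then to latin-1 as a last resort.
        encoding = detected or self.initial or "latin-1"
        return chunk.decode(encoding, errors="replace")


data = "大文件编码测试\n".encode("gbk") * 3  # each line is 15 bytes in GBK
decoder = BatchDecoder()
head = decoder.decode(data[:15], first=True)  # a full line from offset 0
tail = decoder.decode(data[15:28])            # ends in the middle of a character

print(decoder.initial)
print(tail)
```

Here the second batch is cut in the middle of a two-byte character, so strict detection fails for every candidate and the remembered encoding is used instead. A batch that starts in the middle of a character would need extra boundary handling that this sketch does not attempt.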

m00natic avatar Apr 20 '17 01:04 m00natic

Thank you for your work. I recall that one of the chapters of the Emacs or Elisp manual has some description of the magic header bytes of files with other encodings.

cmal avatar Apr 20 '17 14:04 cmal

I've just pushed something that fixes the issue with utf-16 (at least). Hopefully it will work in this case too.

m00natic avatar May 01 '17 16:05 m00natic

Sorry for reopening. I had just opened the wrong file. The file mentioned above still cannot be opened correctly.

cmal avatar May 02 '17 03:05 cmal

The file is at http://vdisk.weibo.com/s/utbH7Zm3Y8yvm , if you can access it and want to use it for testing.

To download it, please click on the image in this page,

and then click on the image in the popup window.

Note that this page should not be opened on mobile. You can check the URL after opening it: it should not have changed to http://vdisk.weibo.com/wap/s/utbH7Zm3Y8yvm .

If you cannot access the file and want it for testing, please @ me and I will upload it to Dropbox and send it to you.

Thanks a lot!

cmal avatar May 02 '17 03:05 cmal

Got the file, thanks!

So the battle continues. I'll investigate in the coming days.

m00natic avatar May 02 '17 18:05 m00natic