
Output not saved as UTF-8, UnicodeDecodeError('utf-8',

Open Yensan opened this issue 7 years ago • 10 comments

If a notebook contains characters beyond ASCII, the output is not saved as UTF-8. For example, with Chinese text in *.ipynb files, the merged file is actually saved as GBK, which causes UnicodeDecodeError('utf-8',
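A minimal sketch of the symptom, assuming the merged file really was written as GBK and later read as UTF-8. The sample string is made up for illustration and nothing here is nbmerge's own code:

```python
# Hypothetical sample text; any Chinese characters would behave similarly.
text = "中文"

# Encode as GBK, the legacy encoding a Chinese-locale Windows console typically uses.
gbk_bytes = text.encode("gbk")

# Reading those bytes back as UTF-8 fails with the error type reported above.
try:
    gbk_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print("decode failed:", err.reason)

# Decoding with the encoding that was actually used round-trips cleanly.
round_trip = gbk_bytes.decode("gbk")
```

If this matches the situation, re-saving the file as UTF-8 in an editor would be expected to make the error go away.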

Yensan avatar Jan 31 '18 02:01 Yensan

Hello, @Yensan. Can you post a gist to an example notebook along with the command you used to reproduce it?

Thanks!

jbn avatar Feb 07 '18 22:02 jbn

@jbn I ran it just as you describe in the README: nbmerge file_1.ipynb file_2.ipynb file_3.ipynb > merged.ipynb, but the files I edited contain some characters beyond ASCII. It is very simple for you to reproduce: create a new *.ipynb; paste in some Chinese text; then run nbmerge file_1.ipynb file_2.ipynb file_3.ipynb > merged.ipynb. If I use VS Code (the editor) to reset the encoding, everything is OK.

Yensan avatar Feb 08 '18 02:02 Yensan

Sorry for the delay, @Yensan!

I was unable to replicate this. Are you on Windows? I think the default encoding for the command line on Windows is not Unicode, so piping the output is going to cause a problem. Try running

nbmerge file_1.ipynb file_2.ipynb file_3.ipynb -o _merged.ipynb

instead, which skips the pipe. If that doesn't help, let me know and I'll go back to debugging.
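A rough sketch of why skipping the pipe could matter. It compares writing text to a file opened with an explicit encoding against writing it through a stream that uses a cp936 code page. This is an illustration only: the `notebook` dict and `fake_stdout` are hypothetical stand-ins, and whether `-o` in nbmerge opens its file this way is an assumption, not something confirmed in this thread.

```python
import io
import json
import os
import tempfile

# Hypothetical stand-in for a merged notebook containing non-ASCII text.
notebook = {"cells": [{"cell_type": "markdown", "source": ["中文"]}]}
payload = json.dumps(notebook, ensure_ascii=False)

# Writing to a file with an explicit encoding does not depend on the console code page.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "_merged.ipynb")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(payload)
    with open(path, "rb") as fp:
        raw = fp.read()

# Simulate a stdout stream configured for cp936: same text, different bytes.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding="cp936")
fake_stdout.write(payload)
fake_stdout.flush()
piped = fake_stdout.buffer.getvalue()

print("file bytes are valid UTF-8:", raw.decode("utf-8") == payload)
print("piped bytes differ from file bytes:", piped != raw)
```

The text is identical in both cases; only the bytes that reach the disk differ, which is why a later UTF-8 read of the piped output could fail.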

jbn avatar Mar 06 '18 14:03 jbn

@jbn No need to apologize. Thank you for this tool and for the reply. Yes, you are right: I was using a company computer running Win7. I use macOS myself, and I resigned one week ago, so it will take me a while to replicate this.

Yensan avatar Mar 17 '18 10:03 Yensan

Hi @Yensan.

I read up a bit on the problem and would like to fix it. Any chance I could get you to run this script:

https://gist.github.com/jbn/6b87f180cff5dae4b6554ef58ba26c6f

in the directory with your notebooks, replacing "./YOUR_NOTEBOOK_FILE.ipynb" with your notebook name. If you copy and paste the output, it should be a relatively easy fix.

Thanks if you can :)

jbn avatar Apr 18 '18 13:04 jbn

(⊙o⊙) Oh! Sorry, I can't open https://gist.github.com/ on my network because of the 'Great Firewall' issue 😄 You can just paste it here. I am at a new company now, so this is not the same environment, but I will use Chinese or other non-ASCII text to test it. Recently I ran into some ctypes trouble; if you know how to solve it, please post your answer. https://stackoverflow.com/questions/49913956/ctypes-use-pointer-and-cfunctype

Yensan avatar Apr 19 '18 06:04 Yensan


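import sys, locale

# Expressions to evaluate, one per line; each reports an encoding-related setting.
exprs = """
locale.getpreferredencoding()
type(fp)
fp.encoding
sys.stdout.isatty()
sys.stdout.encoding
sys.stdin.isatty()
sys.stdin.encoding
sys.stderr.isatty()
sys.stderr.encoding
sys.getdefaultencoding()
sys.getfilesystemencoding()
"""

# No explicit encoding is passed, so fp.encoding reports the platform default.
with open("./YOUR_NOTEBOOK_FILE.ipynb", "r") as fp:
    for expr in exprs.strip().split():
        # Right-align the expression text, then print its evaluated value.
        print(expr.rjust(30), eval(expr))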

I can't help with the ctypes issue; I never really use it.

jbn avatar Apr 20 '18 12:04 jbn

I am so sorry to reply so late; my career has been rather tortuous. (I would be grateful for any remote job.)

This .ipynb file was edited on both Windows and Mac, and I then ran your script on Windows 10 Pro (Simplified Chinese). The Windows 10 machine is a virtual machine, but that should not matter; the result is the same.

Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)] on win32

Windows:

C:\Users\aC>systeminfo
Host Name:              C53
OS Name:                Microsoft Windows 10 Pro
OS Version:             10.0.17763 N/A Build 17763
OS Manufacturer:        Microsoft Corporation
OS Configuration:       Standalone Workstation
OS Build Type:          Multiprocessor Free
Original Install Date:  2019/1/6, 14:03:29
System Boot Time:       2019/1/11, 0:28:07
System Type:            x64-based PC
Processor(s):           1 Processor(s) Installed.
                        [01]: Intel64 Family 6 Model 61 Stepping 4 GenuineIntel ~1600 Mhz
BIOS Version:           Parallels Software International Inc. 14.0.1 (45154), 2018/9/7
System Locale:          zh-cn;Chinese (China)
Input Locale:           en-us;English (United States)

Your script output:

 locale.getpreferredencoding() cp936
                      type(fp) <class '_io.TextIOWrapper'>
                   fp.encoding cp936
           sys.stdout.isatty() True
           sys.stdout.encoding cp936
            sys.stdin.isatty() True
            sys.stdin.encoding cp936
           sys.stderr.isatty() True
           sys.stderr.encoding cp936
      sys.getdefaultencoding() utf-8
   sys.getfilesystemencoding() mbcs
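The output above shows fp.encoding matching locale.getpreferredencoding(), which is what Python's open() does when no encoding is given (unless UTF-8 mode is enabled). A small sketch of the difference an explicit encoding makes; the file name and contents are hypothetical, and this is not a statement about how nbmerge opens notebooks:

```python
import locale
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical notebook-like file, written as UTF-8.
    path = os.path.join(tmp, "example.ipynb")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write('{"cells": []}')

    # No encoding argument: Python picks a platform-dependent default.
    with open(path, "r") as fp:
        default_encoding = fp.encoding

    # Explicit encoding: the same on every platform and locale.
    with open(path, "r", encoding="utf-8") as fp:
        explicit_encoding = fp.encoding

print("locale preferred:", locale.getpreferredencoding(False))
print("open() default:  ", default_encoding)
print("open() explicit: ", explicit_encoding)
```

On a machine like the one reported here, the first two lines would be expected to show cp936 while the third shows utf-8.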

Yensan avatar Jan 10 '19 16:01 Yensan

Hello @jbn, I'm also having this problem while merging three notebooks with Chinese characters. Here is the command I ran and the output of your script; I've also attached the three files to be merged: Desktop.zip

!nbmerge 1.ipynb 2.ipynb 3.ipynb > merged.ipynb

Thx a lot!!

Best, PJ

 locale.getpreferredencoding() cp936
                      type(fp) <class '_io.TextIOWrapper'>
                   fp.encoding cp936
           sys.stdout.isatty() False
           sys.stdout.encoding UTF-8
            sys.stdin.isatty() False
            sys.stdin.encoding cp936
           sys.stderr.isatty() False
           sys.stderr.encoding UTF-8
      sys.getdefaultencoding() utf-8
   sys.getfilesystemencoding() utf-8 

shpj123 avatar Dec 08 '19 09:12 shpj123

To clarify, is this issue only on Windows, and not Unix (Linux or Mac OS)?

EDIT: I just ran this on Ubuntu Bionic (copy-pasted Chinese characters into two notebooks), e.g.

nbmerge unicode1.ipynb unicode2.ipynb > new.ipynb

and ran into no issues whatsoever.

So I think it could be helpful to label this issue as being specific to Windows only (to avoid unnecessarily freaking out/turning off people who aren't running this with Windows).

This is a great package by the way! Elegant solution to a recurring problem.

krinsman avatar Feb 02 '20 20:02 krinsman