camel
[BUG] Pre-commit license check fails due to encoding issues (GBK vs UTF-8)
Required prerequisites
- [X] I have read the documentation https://camel-ai.github.io/camel/camel.html.
- [X] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
- [X] Consider asking first in a Discussion.
What version of camel are you using?
0.1.0
System information
>>> import sys, camel
>>> print(sys.version, sys.platform)
3.8.10 (tags/v3.8.10:3d8993a, May 3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)] win32
>>> print(camel.version)
0.1.0
Problem description
Title: Encoding issues with UTF-8 and GBK across different systems
Problem Description:
I am currently using camel version x.y.z on a Windows system. When attempting to run pre-commit checks using the update_license.py script, I encountered an error. This error appears to be due to an encoding mismatch - while my system is defaulting to GBK, the script seems to be expecting files encoded in UTF-8.
This issue specifically occurs when the script attempts to open and read from a file. The error message received is:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x9d in position 145: illegal multibyte sequence
The expected behavior is for the script to successfully read from the file and execute the pre-commit checks. However, due to the encoding mismatch, this is not happening.
Reproducible example code
Reproducible Example Code:
Python Snippets:
Unfortunately, without knowing the exact content of your update_license.py script, I can only provide a generic example of where the issue may arise. The issue most likely occurs when the script attempts to read a file:
# No encoding argument, so Python falls back to the locale default (GBK here).
with open("file.txt") as f:
    content = f.read()
If file.txt is encoded in UTF-8 but the system defaults to GBK, this will raise a UnicodeDecodeError.
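The mismatch can be reproduced on any platform by passing the encoding explicitly instead of relying on the system default. The following is a minimal sketch, not the actual update_license.py code: the sample text, the file name, and the helper `try_read` are assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical sample text. The right curly quote (U+201D) is encoded in
# UTF-8 as the bytes E2 80 9D; the trailing 0x9D followed by a newline is
# not a valid GBK byte pair.
SAMPLE = "See the \u201cLicense\u201d\n"


def try_read(path, encoding):
    """Return the decoded text, or the UnicodeDecodeError that was raised."""
    try:
        with open(path, encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError as exc:
        return exc


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")
    # Write the file as UTF-8, as most source files in a repository are.
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    as_gbk = try_read(path, "gbk")      # simulates a GBK system default
    as_utf8 = try_read(path, "utf-8")   # explicit UTF-8 reads cleanly

print(type(as_gbk).__name__)
print(as_utf8 == SAMPLE)
```

Reading with "gbk" mimics what a bare open() does on a machine whose default encoding is GBK, while the explicit "utf-8" read succeeds regardless of the locale.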
Command Lines:
This issue is encountered when running pre-commit checks using the update_license.py script:
python update_license.py
Extra Dependencies:
No additional dependencies are necessary to reproduce this issue. However, ensure you are using the correct version of Python and that you have all necessary packages installed.
Steps to Reproduce:
- Create or prepare a file encoded in UTF-8.
- On a Windows machine with Python installed, attempt to run the update_license.py script.
- When the script attempts to open and read from the file, observe the UnicodeDecodeError.
Traceback
> git -c user.useConfigOnly=true commit --quiet --allow-empty-message --file -
Format code..............................................................Passed
Sort imports.............................................................Passed
Check PEP8...............................................................Passed
Check License............................................................Failed
- hook id: check-license
- exit code: 1
Traceback (most recent call last):
  File "****\camel\licenses\update_license.py", line 118, in <module>
    update_license_in_directory(
  File "****\camel\licenses\update_license.py", line 93, in update_license_in_directory
    if update_license_in_file(
       ^^^^^^^^^^^^^^^^^^^^^^^
  File "****\camel\licenses\update_license.py", line 42, in update_license_in_file
    content = f.read()
              ^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0x9d in position 145: illegal multibyte sequence
### Expected behavior
Expected Behavior:
The script should successfully read from the file, regardless of the encoding used. It should handle different types of encodings without raising an error, and should carry out the pre-commit checks seamlessly.
### Additional context
## Potential Solution:
I suggest that the script be modified to explicitly use UTF-8 encoding when opening files, irrespective of the system defaults. This can help avoid such issues in the future, especially considering that UTF-8 is widely used across many systems and platforms.
Another option is to provide a way for users to specify the encoding that should be used by the script. This can be in the form of a command-line argument or a configuration file setting.
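Both suggestions could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: the helper `read_text`, the function `build_parser`, and the `--encoding` flag are hypothetical names that do not exist in update_license.py as far as this issue shows.

```python
import argparse


def read_text(path, encoding="utf-8"):
    """Read a file with an explicit encoding instead of the locale default."""
    with open(path, encoding=encoding) as f:
        return f.read()


def build_parser():
    """Build a parser with a hypothetical --encoding option (default UTF-8)."""
    parser = argparse.ArgumentParser(
        description="License header updater (sketch)")
    parser.add_argument(
        "--encoding",
        default="utf-8",
        help="Encoding used to read source files")
    return parser


# An explicit argument list keeps the sketch independent of sys.argv.
args = build_parser().parse_args(["--encoding", "utf-8"])
print(args.encoding)
```

Defaulting to UTF-8 keeps behavior identical across platforms, and the optional flag leaves room for repositories that really do store files in another encoding.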
## Impact:
This issue can disrupt workflows, especially for users working on Windows systems. It can prevent successful execution of pre-commit checks, which can lead to overlooked errors or inconsistencies in the code.
## Additional Context:
This issue seems to stem from the fact that different systems default to different encodings. For instance, Python on Windows defaults to the system's ANSI code page, which is GBK (cp936) on Simplified Chinese installations, while Linux and macOS typically default to UTF-8. Given that UTF-8 is widely used and is a standard on many systems, it may be beneficial to align the script's encoding handling with this standard.
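To see which default a given machine actually uses, a small check like the following may help. It is a sketch using only the standard library; the function name `describe_default_encoding` is made up for this example.

```python
import locale
import sys


def describe_default_encoding():
    """Report the encoding a bare open() would use and whether UTF-8 mode is on."""
    return {
        # Encoding used by open() when no encoding argument is given.
        "preferred": locale.getpreferredencoding(False),
        # True when Python runs in UTF-8 mode (Python 3.7+).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


info = describe_default_encoding()
print(info)
```

On an affected Windows setup the "preferred" value would be expected to be a GBK code page such as cp936, while on most Linux and macOS systems it is UTF-8.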
Hm.. I develop on Windows and do not encounter this error. I normally run stuff in Anaconda powershell console.
Thanks. I tried pre-commit on Linux, and it works too. It could be a problem with my Windows environment or default settings. When I resolve the issue, I will post the feedback here.
Hi, I found this solution works for me on Windows 11:
To change the default character encoding in Windows, you need to modify Python's locale settings. Python uses the locale library to handle locale-related tasks such as character encoding, number, and date formats, etc. This is a somewhat advanced operation and may affect all Python programs on your system.
In Python 3.7 and later versions, you can globally set Python to use UTF-8 encoding by default in Windows environments by setting the PYTHONUTF8 environment variable to 1.
Here are the steps to do that:
- Press Win+X, and select System.
- Click on About, then on the right, select System info.
- In the list on the left, choose Advanced system settings.
- In the System Properties dialog, select Environment Variables.
- In the Environment Variables dialog, click on New below, and in the new row, input PYTHONUTF8 and 1.
Then click OK, close all dialog boxes.
Restart your command prompt or PowerShell window, Python will use UTF-8 as the default character encoding.
Please note that this method will change the default encoding method for all Python programs. If some programs depend on GBK or other encodings, unpredictable problems may occur. You need to ensure you understand the impact of this operation and know how to restore settings if something goes wrong.
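To check the effect of PYTHONUTF8 without changing system-wide settings, the variable can be set for a single child process only. This is a minimal sketch; the function name `default_encoding_with_utf8_mode` is an assumption for illustration.

```python
import os
import subprocess
import sys


def default_encoding_with_utf8_mode():
    """Start a child Python with PYTHONUTF8=1 and return its default encoding."""
    # Copy the current environment and enable UTF-8 mode for the child only.
    env = dict(os.environ, PYTHONUTF8="1")
    result = subprocess.run(
        [sys.executable, "-c",
         "import locale; print(locale.getpreferredencoding(False))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


child_encoding = default_encoding_with_utf8_mode()
print(child_encoding)
```

Because only the child's environment is modified, other Python programs that depend on GBK are unaffected.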
It seems unrelated to the project itself but more like a common pitfall for contributors with a Chinese-English working environment. Maybe introducing a docker container or VSCode Dev container in the workflow could eliminate such issues from the root.
Introducing a docker container sounds great. Thanks @kuang-da for the suggestion!
Thank you very much for your method! I am also a Windows 11 user and a contributor with a Chinese-English working environment. I met the same error and found your solution useful. By the way, as a temporary fix, running 'set PYTHONUTF8=1' before 'git commit ...' in Command Prompt is also convenient (in PowerShell the equivalent is '$env:PYTHONUTF8 = "1"').
@yiyiyi0817 Glad to hear that it is helpful for you.