# Support for non-UTF-8 text encodings

## Describe the feature

Zed currently supports only UTF-8 text. This is an enhancement request for it to support other text encodings:
- [ ] ISO-8859-1 (Latin-1) latin1.txt
- [ ] CP865 (DOS Nordic) cp865.txt
- [ ] KOI8-R (Russian) koi8_r.txt
- [ ] UTF-16 (16-bit Unicode) utf16.txt
- [ ] Shift-JIS (Japanese) main_shift-jis.cpp.txt
- [ ] Windows 1252 (CP1252) cp1252.txt
- [ ] Big-5 (Traditional Chinese) big5.txt
- [ ] GB 2312, GB18030, GBK (Simplified Chinese) gb2312.txt
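To illustrate why encoding support matters, the same byte can mean entirely different characters under several of the encodings requested above. A small sketch using Python's stdlib codecs (codec names as Python spells them):

```python
# One and the same byte decodes to different characters under
# different single-byte encodings.
b = b"\xe4"
assert b.decode("latin-1") == "ä"   # ISO-8859-1
assert b.decode("koi8_r") == "Д"    # KOI8-R
assert b.decode("cp865") == "Σ"     # CP865 (DOS Nordic)

# Multi-byte encodings like Big5 reject a lone 0xE4 outright:
# it is only valid as the lead byte of a two-byte sequence.
try:
    b.decode("big5")
except UnicodeDecodeError:
    print("0xe4 alone is not valid Big5")
```

Without knowing (or guessing) the source encoding, an editor cannot pick the right interpretation, which is why each entry above needs explicit support.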
## Workarounds
You can convert your files to UTF-8 using external tools like Sublime Text or `iconv`. If you know the encoding (e.g. ISO-8859-1, Latin-1):

```shell
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
```
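The same conversion can be done in a few lines of Python, in case `iconv` is not available. A minimal sketch operating on bytes (the `iso-8859-1` default here is just an example; substitute your file's actual encoding):

```python
def to_utf8(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode legacy-encoded bytes as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")

# 0xE9 is é in ISO-8859-1; in UTF-8 it becomes the two bytes 0xC3 0xA9.
assert to_utf8(b"caf\xe9") == b"caf\xc3\xa9"
```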
## Etc
There are separate issues for supporting non-text binary data and CR/LF line-endings:
- https://github.com/zed-industries/zed/issues/5250
- https://github.com/zed-industries/zed/issues/5294
If there are additional specific encodings you would like to see, please comment below with a sample file and I will add it to the list above.
Please 👍 upvote this issue if you would like to see this feature prioritized. (+1 comments will be removed).
I think Windows 1252 (or CP1252) would be a useful addition. cp1252.txt
Yes. UTF-16 is what I need; such files don't open.
Overall, my first impression of Zed is that it is amazingly simple and fast. What can we do to help additional encodings see the light of day?
Windows-1251 and really all 125x
I also would like to be able to load files that have some invalid UTF-8 in them but are otherwise UTF-8. For example, I have some log files collected over serial that have an invalid byte or two at the beginning, but are otherwise valid.
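For files that are valid UTF-8 apart from a few stray bytes, lossy decoding (what Rust's `String::from_utf8_lossy` does) would recover everything readable. A rough Python illustration of the idea, with hypothetical sample bytes:

```python
# Two stray bytes from a serial capture, then valid UTF-8.
data = b"\xfe\xfahello, world\n"

# Strict decoding fails on the very first byte...
try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid byte at offset", err.start)

# ...but lossy decoding keeps the readable remainder, replacing
# each invalid byte with U+FFFD (the replacement character).
text = data.decode("utf-8", errors="replace")
assert text == "\ufffd\ufffdhello, world\n"
```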
Edit: Filed as:
- https://github.com/zed-industries/zed/issues/21072
The way Zed silently ignores files it can't open is a bad thing; it should be considered a critical bug. I had to refresh my memory of a 20-year-old Delphi project today and was searching for an `ASCII` variable. Zed: 3 files, 5 matches. Sublime: 1191 matches across 10 files.
Why not warn: "ignored files we can't read"?
Also:
There should be a way to copy this text (or error ID)
– "Stream did not contain valid UTF-8"? Shouldn't that be "stream contains INVALID UTF-8?"
> There should be a way to copy this text (or error ID)
Please file an issue for this, that's a defect.
> Why not to warn: "ignored files we can't read"?
I understand the desire for this, but practically speaking the existence of a single binary or non-UTF-8 file in a repository would mean this warning would trigger on every single project search. Additionally, the immediate response to seeing that warning would be "OK, which files?", which requires some way for us to enumerate/display the skipped files. Since many projects would trip this on every search, we would also want to make it persistently dismissible, which would require that state be serialized into the workspace DB and/or a setting to ignore it altogether. I'm not saying we shouldn't warn, but there's more complexity here than a simple if-statement -> warning pop-up.
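The enumeration piece of that is straightforward in principle. A naive sketch (hypothetical helper, not Zed code) that walks a tree and collects the files a UTF-8-only search would have to skip:

```python
from pathlib import Path


def skipped_files(root):
    """Return files under root whose contents are not valid UTF-8."""
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path)
    return bad
```

The real cost is not this loop but, as noted above, the UI for surfacing and dismissing the result.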
I would suggest to add support for Chinese encodings like GB2312 and BIG-5.
`encoding_rs` seems to be an option for the encoding and decoding work.
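One pattern a library like `encoding_rs` enables is "try strict UTF-8 first, fall back to a legacy encoding". A rough Python analogue of that pattern (the `cp1252` fallback is an arbitrary example; a real implementation would pick per user setting or detection):

```python
def decode_best_effort(data: bytes, fallback: str = "cp1252"):
    """Strict UTF-8 first, then a single-byte legacy fallback.

    Returns (text, encoding_used).
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback), fallback


assert decode_best_effort(b"hello") == ("hello", "utf-8")
assert decode_best_effort(b"na\xefve") == ("naïve", "cp1252")
```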
@Jisu-Woniu Can you provide example files for these encodings?
Would it also be possible to open a mixed-encoding / "Non-ISO extended-ASCII text" file, at least decoded under some chosen Unicode/ASCII encoding, with a warning that some letters are wrong?
I have a file where, with only the letter ď inside, the command `file test.txt` says "ISO-8859 text".
The second problematic letter, š, makes the file report as "Non-ISO extended-ASCII text".
This is a standard text file created under the Slovak environment of Windows. As far as I know, the country uses ISO-8859-2 ("Latin Extended"), but some letters like the mentioned š are written in non-Latin Windows 1252 when it's not necessary to use the Latin one, and copy-pasted into a file such as this one.
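That mix is explainable at the byte level. A small sketch (using Windows-1250, the Slovak Windows code page, where š shares the 0x9A slot with Windows-1252):

```python
# š sits at different byte values in the two encodings...
assert "š".encode("iso-8859-2") == b"\xb9"
assert "š".encode("cp1250") == b"\x9a"

# ...while ď shares the same byte, so a file containing only ď
# still looks like plain "ISO-8859 text" to file(1).
assert "ď".encode("iso-8859-2") == b"\xef"
assert "ď".encode("cp1250") == b"\xef"

# A Windows-encoded š lands in the C1 control range (0x80-0x9F),
# which ISO-8859-* never uses for printable characters. That is
# why file(1) flips to "Non-ISO extended-ASCII text".
assert 0x80 <= b"\x9a"[0] <= 0x9f
```

So the file genuinely contains bytes from two incompatible encodings, and only a lossy or per-byte-flagged decode could display it.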
Maybe it's possible to show the raw format instead as an alternative, or to integrate a hex editor. Some config files are not opening.
What's going on now? Is this still under consideration?
There are already text editors implemented in Rust that support multiple encodings. I hope this feature lands as soon as possible.
https://github.com/search?q=repo%3Amicrosoft%2Fedit+encoding&type=code
Please add support for US-ASCII. I can't open files with US-ASCII encoding; I get the error "stream did not contain valid UTF-8".
@Benjdao Do you have an example file which triggers this? My understanding is that all ASCII files are valid UTF-8 (7-bit safe). Most likely your file is actually latin1, cp1252, cp437, or some other "Extended ASCII" flavor.
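That claim is easy to check: ASCII is a strict subset of UTF-8, so the error must come from bytes above 0x7F. A quick demonstration:

```python
# Every 7-bit byte decodes to the same character under ASCII and UTF-8,
# so a pure US-ASCII file can never trigger a UTF-8 decode error.
ascii_bytes = bytes(range(128))
assert ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

# "Extended ASCII" is another story: 0xA0 is fine in latin-1
# but is not a valid standalone byte in UTF-8.
assert b"\xa0".decode("latin-1") == "\xa0"
try:
    b"\xa0".decode("utf-8")
except UnicodeDecodeError:
    print("0xa0 alone is not valid UTF-8")
```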
An example from a PowerShell script — not my script, but one I was looking to review. It intentionally looks for characters that are not 'correct' and then transforms them (in this case to simple non-accented Latin equivalents). Attempting to open it gives the "stream did not contain valid UTF-8. Please try again" error.
In this case using iconv to 'fix' the text will break the purpose of the program. It would be nice if it were possible to open the file containing invalid characters and ideally flag somehow that they are not compliant with the target charset, while still allowing the file to save. e.g. "I know what I'm doing, save anyway!"
Vim as a workaround fills the gap for now, so not anything urgent, but it'd be nice.
@puckdoug This is off-topic, but what you attached is actually valid utf-8/iso-8859-1/latin-1 text. I assume somewhere else in your script there are some byte sequences like ÿ (utf-8 U+00FF; Windows-1252/Latin-1 0xFF) which are actually invalid UTF-8. I believe in PowerShell you could replace those with their escaped equivalents ([char]0xFF), but obviously not everyone can convert legacy projects to only use UTF-8.
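The ÿ case makes the distinction concrete: the single byte 0xFF is a valid Latin-1/Windows-1252 character but can never appear in valid UTF-8, where the same character is a two-byte sequence. A quick sketch:

```python
raw = b"\xff"            # ÿ in Windows-1252 / Latin-1

try:
    raw.decode("utf-8")  # 0xFF is never a valid UTF-8 byte
except UnicodeDecodeError:
    print("0xff is not valid UTF-8")

assert raw.decode("cp1252") == "ÿ"
assert "ÿ".encode("utf-8") == b"\xc3\xbf"  # the valid UTF-8 spelling
```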
Yes, but still: why is it not possible to open such a file and view the wrong sequences? Notepad++, vim, and nano can all do that. It would be great to view all files without switching to a different tool for such a minor reason.
Please add ISO-8859-2
> Please add ISO-8859-2
Why not all of ISO-8859-*, from 1 to 16 (except 12, which was abandoned and never used)?
I mentioned `encoding_rs` being an option for the encoding and decoding work; it supports nearly all the encodings mentioned above.
> Please add ISO-8859-2
> Why not all [ISO-8859-*]
I use only 8859-2
> I use only 8859-2
I just wanted to say, Zed should support all ISO-8859 encodings.
It's kind of annoying to find out that Zed does not support other text encodings. I've been loving the editor, so please add this.
Hello, any news about this issue?
As a non-US-based Windows user currently trying the Zed Windows beta, I stumbled over this multiple times in my first attempts at using Zed.
I stumbled over it so often that I think it should be a blocker for a stable Windows release.
+1 to the non-US-based Windows user above: we do need ISO-8859-1 encoding.
UTF-16 LE support would also be great; a good example is Microsoft SQL Server ERRORLOG files, which default to this encoding.
+1 for UTF-16 LE. This format is used in EDK II for storing multilingual strings (in .uni files), meaning UEFI devs like myself require support for this encoding.
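For anyone unfamiliar with the format, UTF-16 LE stores each code unit as two bytes with the low byte first, and a leading BOM (0xFF 0xFE) lets generic decoders detect the byte order. A small sketch (the `ERRORLOG` string is just an example):

```python
text = "ERRORLOG"
le = text.encode("utf-16-le")

# Each ASCII character becomes two bytes, low byte first.
assert le[:2] == b"E\x00"

# utf-16-le decodes without a BOM; the generic "utf-16" codec
# uses the BOM to detect byte order.
assert le.decode("utf-16-le") == text
assert (b"\xff\xfe" + le).decode("utf-16") == text
```

This is also why such files fail UTF-8 decoding immediately: every other byte of ASCII text is 0x00.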
I receive a slightly different error when trying to open a file that Notepad++ shows as having ANSI encoding — nothing related to UTF-8, which made it a little tricky to track down. Opening it in Notepad++, choosing Convert to UTF-8, and saving then allowed Zed to open the file.
> Notepad++ shows as having ANSI encoding
"ANSI" is not a specific encoding but the local encoding of a Windows system (i.e. the default encoding for your system language): for example, Windows-1252 for Western European languages, GBK for Simplified Chinese, and Shift-JIS for Japanese.
Please upload a sample text file for that encoding, especially one containing characters outside the ASCII character set.
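The ambiguity is real: the same "ANSI" bytes decode to completely different text depending on which locale produced them. A quick illustration with two of the code pages mentioned above:

```python
data = b"\x93\xfa"
# On a Japanese system this is the character 日 (Shift-JIS)...
assert data.decode("shift_jis") == "日"
# ...on a Western European system it is a curly quote plus ú (Windows-1252).
assert data.decode("cp1252") == "“ú"
```

Without a sample file, there is no way to tell which interpretation was intended.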