
Support for non UTF-8 text encodings

Open notpeter opened this issue 1 year ago • 12 comments

Describe the feature

Zed currently only supports UTF-8 text. This is an enhancement request for it to support other text encodings.

Workarounds

You can convert your files to UTF-8 using external tools like Sublime Text or iconv. If you know the encoding (e.g. ISO-8859-1 Latin1):

iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
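If iconv isn't available (e.g. on a stock Windows install), the same conversion can be sketched in a few lines of Python. This is only an illustration: `convert_to_utf8` is a made-up helper name, and it assumes you already know the source encoding.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str = "iso-8859-1") -> None:
    """Re-encode a text file as UTF-8, roughly like `iconv -f <enc> -t UTF-8`."""
    # Decode the raw bytes with the known source encoding. A wrong guess
    # either raises UnicodeDecodeError or silently produces mojibake.
    text = Path(src).read_bytes().decode(source_encoding)
    # Write the same characters back out as UTF-8 bytes.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Usage would be `convert_to_utf8("input.txt", "output.txt")`. Working on bytes rather than text-mode files keeps the original line endings untouched.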

Etc

There are separate issues for supporting non-text binary data and CR/LF line-endings:

  • https://github.com/zed-industries/zed/issues/5250
  • https://github.com/zed-industries/zed/issues/5294

If there are additional specific encodings you would like to see, please comment below with a sample file and I will add it to the list above.

Please 👍 upvote this issue if you would like to see this feature prioritized. (+1 comments will be removed).

notpeter avatar Aug 27 '24 18:08 notpeter

I think Windows 1252 (or CP1252) would be a useful addition. cp1252.txt

FrittenKeeZ avatar Aug 28 '24 07:08 FrittenKeeZ

Yes. I need UTF-16, and those files don't open.

Overall, my first impression of Zed is that it is amazingly simple and fast. What can we do to help additional encodings see the light of day?

polytect avatar Oct 10 '24 11:10 polytect

Windows-1251 and really all 125x

niki-sp avatar Nov 13 '24 20:11 niki-sp

I also would like to be able to load files that have some invalid UTF-8 in them but are otherwise UTF-8. For example, I have some log files collected over serial that have an invalid byte or two at the beginning, but are otherwise valid.

Edit: Filed as:

  • https://github.com/zed-industries/zed/issues/21072
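For illustration, here is roughly what "open it anyway" could look like, sketched in Python rather than in Zed's own code. The function names and the sample bytes are hypothetical. The first helper substitutes U+FFFD for each invalid byte; the second keeps the invalid bytes recoverable so the file can be saved back unchanged.

```python
def decode_lossy(raw: bytes) -> tuple[str, int]:
    """Decode as UTF-8, replacing invalid bytes with U+FFFD.

    Returns the text and the number of U+FFFD characters in it. Note the
    count also includes any U+FFFD that was legitimately in the file.
    """
    text = raw.decode("utf-8", errors="replace")
    return text, text.count("\ufffd")


def decode_roundtrip(raw: bytes) -> str:
    """Decode as UTF-8, carrying invalid bytes through as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


# A hypothetical serial log: two junk bytes, then valid UTF-8 text.
sample = b"\xff\xfeboot ok\n"
text, bad = decode_lossy(sample)
```

With `surrogateescape`, encoding the string again with the same error handler gives back the original bytes, which is the property an editor would need to avoid corrupting such files on save.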

russelltg avatar Nov 22 '24 16:11 russelltg

The way Zed silently ignores files it can't open is a bad thing. It should be considered as a critical bug. I had to refresh my memory with 20 y/o Delphi project today. Was searching for ASCII variable. Zed – 3 files, 5 matches. Sublime – 1191 matches across 10 files.

Why not to warn: "ignored files we can't read"?

Also: image There should be a way to copy this text (or error ID)

– "Stream did not contain valid UTF-8"? Shouldn't that be "stream contains INVALID UTF-8?"

varyform avatar Nov 24 '24 20:11 varyform

There should be a way to copy this text (or error ID)

Please file an issue for this, that's a defect.

Why not to warn: "ignored files we can't read"?

I understand the desire for this, but practically speaking the existence of a single binary or non-UTF-8 file in a repository would mean this warning would trigger on every single project search. Additionally, the immediate response to seeing that warning would be "Ok, which files?", which would require some way for us to enumerate/display the skipped files. Since many projects would trip this on every search, we would also want to make it persistently dismissible, which would require that to be serialized into the workspace db and/or a setting to ignore it altogether. I'm not saying we shouldn't warn, but there's more complexity beyond a simple if statement -> warning pop-up.
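To make the "which files?" part concrete, here is a rough sketch in Python. It is not Zed's search code, and `search_project` is a hypothetical helper. It shows a search that returns the skipped files alongside the matches instead of dropping them silently. The display and dismissal questions above are not addressed.

```python
from pathlib import Path


def search_project(root: str, needle: str) -> tuple[list[tuple[str, int]], list[str]]:
    """Return (matches, skipped).

    matches: (path, 1-based line number) pairs for lines containing needle.
    skipped: paths of files that could not be decoded as UTF-8.
    """
    matches: list[tuple[str, int]] = []
    skipped: list[str] = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            # Record the file instead of ignoring it silently.
            skipped.append(str(path))
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                matches.append((str(path), lineno))
    return matches, skipped
```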

notpeter avatar Nov 25 '24 14:11 notpeter

I would suggest to add support for Chinese encodings like GB2312 and BIG-5.

encoding_rs seems to be an option for the encoding and decoding stuff.

Jisu-Woniu avatar Jan 19 '25 15:01 Jisu-Woniu

@Jisu-Woniu Can you provide example files for these encodings?

notpeter avatar Jan 20 '25 15:01 notpeter

Sure! Here are some examples:

big5.txt

gb2312.txt (also compatible with GBK and GB18030)
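As a quick illustration of that compatibility claim (a Python sketch with a made-up sample string, not the contents of the attached file): text encoded as GB2312 decodes identically under GBK and GB18030, since each is an extension of the previous one.

```python
# Hypothetical sample text; any string made only of GB2312 characters works.
sample = "中文编码"

raw = sample.encode("gb2312")

# The same bytes decode to the same text under all three codecs.
decoded = {name: raw.decode(name) for name in ("gb2312", "gbk", "gb18030")}
```

The reverse does not hold in general: GBK and GB18030 contain characters that GB2312 cannot represent.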

Jisu-Woniu avatar Jan 21 '25 01:01 Jisu-Woniu

Would it also be possible to open a mixed-encoding ("Non-ISO extended-ASCII text") file, at least decoded with some Unicode/ASCII-compatible encoding and shown with a warning that some letters may be wrong?

I have a file containing the letter ď, for which the command file test.txt reports "ISO-8859 text". A second problematic letter, š, makes it report "Non-ISO extended-ASCII text" instead.

test.txt

This is a standard text file created in a Slovak Windows environment. As far as I know, the country uses ISO-8859-2 ("Latin extended"), but some letters, like the š mentioned above, are written in Windows 1252 rather than the Latin one when the Latin one isn't strictly needed, and then get copy-pasted into files such as this one.
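One possible reading, sketched in Python under an assumption not checked against test.txt: that ď is stored as byte 0xEF and š as byte 0x9A. If so, the file may simply be Windows-1250 throughout rather than truly mixed. Byte 0x9A falls in the 0x80-0x9F range that ISO-8859-2 reserves for control codes, which would explain the "Non-ISO extended-ASCII" verdict.

```python
# Assumed bytes for the two letters (NOT verified against the attachment).
D_CARON = b"\xef"
S_CARON = b"\x9a"


def code_points(raw: bytes, codec: str) -> list[str]:
    """Decode raw with codec and list the resulting code points as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in raw.decode(codec)]


for codec in ("iso8859_2", "cp1250", "cp1252"):
    print(codec, code_points(D_CARON + S_CARON, codec))
```

Under that assumption, only cp1250 yields both ď (U+010F) and š (U+0161). ISO-8859-2 turns 0x9A into a control character, and cp1252 turns 0xEF into ï.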

verybigelephants avatar Jan 23 '25 09:01 verybigelephants

Maybe it's possible to show the raw format instead as an alternative, or to integrate a hex editor. Some config files won't open.

iMonZ avatar Apr 18 '25 12:04 iMonZ

What's going on now? Is it still under consideration?

davelet avatar May 08 '25 08:05 davelet

There are already text editors implemented in Rust that support multiple encodings. I hope this feature will be prioritized as soon as possible.

https://github.com/search?q=repo%3Amicrosoft%2Fedit+encoding&type=code

Vixb1122 avatar May 22 '25 03:05 Vixb1122

Please add support for US-ASCII. I can't open files with US-ASCII encoding. The error is "stream did not contain valid UTF-8".

Benjdao avatar Jun 05 '25 09:06 Benjdao

@Benjdao Do you have an example file which triggers this? My understanding is that all ASCII files are valid UTF-8 (7bit safe). Most likely your file is actually latin1, cp1252, cp437 or some other "Extended ASCII" flavor.
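A quick way to check a file yourself, as a Python sketch (`classify` is a hypothetical helper, not part of any tool): pure 7-bit data is reported as ASCII, and anything else is tested for UTF-8 validity, with the offset of the first offending byte reported on failure.

```python
def classify(raw: bytes) -> str:
    """Roughly classify raw bytes as ASCII, UTF-8, or neither."""
    if all(b < 0x80 for b in raw):
        return "ascii"  # every ASCII file is also valid UTF-8
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that failed to decode.
        return f"not utf-8 (first bad byte 0x{raw[err.start]:02x} at offset {err.start})"
```

For example, `classify(open("yourfile.txt", "rb").read())` on a Latin-1 or CP1252 file with accented letters should land in the third branch.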

notpeter avatar Jun 05 '25 18:06 notpeter

An example from a powershell script. Not my script but one I was looking to review. This is intentionally looking for characters that are not 'correct' and then transforming them (in this case to simple non-accented latin equivalent). Attempting to open this gives the "stream did not contain valid UTF-8. Please try again" error.

invalid.txt

In this case using iconv to 'fix' the text will break the purpose of the program. It would be nice if it were possible to open the file containing invalid characters and ideally flag somehow that they are not compliant with the target charset, while still allowing the file to save. e.g. "I know what I'm doing, save anyway!"

Vim as a workaround fills the gap for now, so not anything urgent, but it'd be nice.

puckdoug avatar Jun 12 '25 17:06 puckdoug

@puckdoug This is off-topic, but what you attached is actually valid utf-8/iso-8859-1/latin-1 text. I assume somewhere else in your script there are some byte sequences like ÿ (utf-8 U+00FF; Windows-1252/Latin-1 0xFF) which are actually invalid UTF-8. I believe in PowerShell you could replace those with their escaped equivalents ([char]0xFF) but obviously not everyone can convert legacy projects to only use UTF-8.
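To spell out the ÿ example in Python (illustration only; `is_valid_utf8` is a made-up helper): the character U+00FF is two bytes in UTF-8 but a single 0xFF byte in Latin-1 and Windows-1252, and a lone 0xFF byte can never appear in valid UTF-8.

```python
ch = "ÿ"  # U+00FF

utf8_bytes = ch.encode("utf-8")      # two bytes: C3 BF
latin1_bytes = ch.encode("latin-1")  # one byte: FF
cp1252_bytes = ch.encode("cp1252")   # one byte: FF


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```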

notpeter avatar Jun 12 '25 17:06 notpeter

@puckdoug This is off-topic, but what you attached is actually valid utf-8/iso-8859-1/latin-1 text. I assume somewhere else in your script there are some byte sequences like ÿ (utf-8 U+00FF; Windows-1252/Latin-1 0xFF) which are actually invalid UTF-8. I believe in PowerShell you could replace those with their escaped equivalents ([char]0xFF) but obviously not everyone can convert legacy projects to only use UTF-8.

Yes but still why is it not possible to open this file and view the wrong sequences. Notepad++, vim, nano, all can do that. So it would be great to view all files without switching to different tools for a minor reason.

iMonZ avatar Jun 12 '25 17:06 iMonZ

Please add ISO-8859-2

lykamspam avatar Jul 07 '25 21:07 lykamspam

Please add ISO-8859-2

Why not all ISO-8859-*, from 1 to 16 (except 12 that was abandoned and never used)?

I mentioned encoding_rs being an option for the encoding and decoding stuff, they support nearly all the encodings mentioned above.

Jisu-Woniu avatar Jul 08 '25 04:07 Jisu-Woniu

Please add ISO-8859-2

Why not all [ISO-8859-*]

I use only 8859-2

lykamspam avatar Jul 10 '25 12:07 lykamspam

I use only 8859-2

I just wanted to say, Zed should support all ISO-8859 encodings.

Jisu-Woniu avatar Jul 10 '25 12:07 Jisu-Woniu

Kinda annoying to find out that zed does not support other text encodings. I've been loving the editor so add this please

zamonary1 avatar Jul 13 '25 20:07 zamonary1

Hello, is there any news about this issue?

Image

RubenVP2 avatar Jul 23 '25 09:07 RubenVP2

As a non-US based Windows user currently trying the Zed Windows Beta, I stumbled over this multiple times in my first attempts at using Zed.

I stumbled over this so often that I think it should be a blocker for a stable Windows release.

thatguy7 avatar Oct 09 '25 08:10 thatguy7

+1 to a non-US based Windows user: we do need ISO-8859-1 encoding.

sergpryimachuk avatar Oct 16 '25 07:10 sergpryimachuk

UTF-16 LE support would also be great. A good example is Microsoft SQL Server ERRORLOG files, which default to this encoding.

Image

pnvnd avatar Oct 18 '25 03:10 pnvnd

+1 for UTF-16 LE. This format is used in EDK II for storing multilingual strings (in .uni files), meaning UEFI devs like myself require support for this encoding.
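For files that start with a byte-order mark, picking the encoding can be fairly mechanical. Here is a minimal Python sketch: `decode_with_bom` is a hypothetical helper, UTF-32 is deliberately ignored even though its little-endian BOM begins with the same two bytes as UTF-16 LE, and BOM-less UTF-16 is not handled at all.

```python
import codecs


def decode_with_bom(raw: bytes, fallback: str = "utf-8") -> tuple[str, str]:
    """Pick a codec from a leading byte-order mark; return (text, codec name)."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8"
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le"), "utf-16-le"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be"), "utf-16-be"
    # No BOM: fall back to an assumed encoding (an assumption, not detection).
    return raw.decode(fallback), fallback
```

Remembering the detected codec name matters for saving: the file should be written back in the encoding it was read in, BOM included.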

brendon-felix avatar Oct 20 '25 16:10 brendon-felix

I receive a slightly different error when trying to open a file that Notepad++ shows as having ANSI encoding. Nothing related to UTF-8 which made it a little tricky to track down. Opening in Notepad++ and then choosing Convert to UTF8 and saving then allowed Zed to open the file.

Image

kczx3 avatar Nov 20 '25 16:11 kczx3

Notepad++ shows as having ANSI encoding

ANSI is not a specific encoding, but the local encoding of the Windows system (i.e. the default encoding for your system language). For example, Windows 1252 for West European languages, GBK for Simplified Chinese, and Shift-JIS for Japanese.
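To illustrate why "ANSI" alone is ambiguous, here is a small Python sketch that decodes the same two bytes under three common Windows code pages. The byte pair is an arbitrary example, and cp932 is Microsoft's variant of Shift-JIS.

```python
# One arbitrary two-byte sequence, interpreted under three "ANSI" code pages.
raw = b"\xb0\xa1"

interpretations = {codec: raw.decode(codec) for codec in ("cp1252", "gbk", "cp932")}

for codec, text in interpretations.items():
    print(codec, [f"U+{ord(ch):04X}" for ch in text])
```

The bytes come out as two Latin-1 symbols under cp1252, a single Chinese character under GBK, and two half-width katakana marks under cp932, so the file content alone does not say which one was intended.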

You are advised to upload a sample text file for the encoding, especially those characters not in ASCII character set.

Jisu-Woniu avatar Nov 21 '25 01:11 Jisu-Woniu