Spell Checking

Open rkusa opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe.

I regularly make a lot of typos in code comments, docs, and also in variable names. It is always annoying when pull requests get postponed just because of typos that reviewers found.

Describe the solution you'd like

I would find it tremendously helpful if Zed could spell check my code. I personally rely a lot on this in, e.g.:

  • Sublime Text: https://www.sublimetext.com/docs/spell_checking.html, and
  • VSCode: https://marketplace.visualstudio.com/items?itemName=streetsidesoftware.code-spell-checker.

Things the spell checker might check:

  • comments,
  • doc comments,
  • strings,
  • segments of a variable name (e.g., in get_name/GetName/getName, it would check get and name).
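
The last item, checking the segments of an identifier, can be illustrated with a minimal sketch. The helper name `split_identifier` and the regular expression are assumptions made for illustration; they are not taken from Zed or from any existing spell checker.

```python
import re

# Hypothetical helper: split an identifier into lowercase word segments so
# that each segment can be looked up in a dictionary on its own.
# The pattern matches, in order of preference:
#   - a run of capitals not followed by a lowercase letter (acronyms, "HTTP"),
#   - an optional capital followed by lowercase letters ("Name", "get"),
#   - a run of digits.
# Underscores and other separators are simply skipped.
_SEGMENT = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(name: str) -> list[str]:
    """Return the lowercase word segments of a snake_case or camelCase name."""
    return [match.group(0).lower() for match in _SEGMENT.finditer(name)]


for identifier in ("get_name", "GetName", "getName"):
    print(identifier, "->", split_identifier(identifier))
```

All three spellings from the example reduce to the same two segments, so a single dictionary lookup per segment would cover every naming convention.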

Additional useful features:

  • The possibility to add words to a system-wide dictionary.
  • The possibility to add words to a project-specific dictionary.
  • The possibility to check multiple languages at the same time (e.g. for non-English code, variables are often in English, while strings and comments might contain text in another language).

I'd find it especially neat if the spell check would use native system APIs so that I can expect consistent behaviour between the editor and other apps on my system.

Screenshots

Sublime Text:

image

VSCode Code Spell Checker extension:

image

rkusa avatar Jul 10 '22 13:07 rkusa

My dream would be to embed something like Grammarly right in Zed :)

iamnbutler avatar Jul 10 '22 14:07 iamnbutler

Just came across the following code spelling library written in Rust. Just dropping the link in case it helps: https://github.com/crate-ci/typos

rkusa avatar Dec 10 '22 09:12 rkusa

I just tried setting up Zed today and this was the showstopper issue that made me go back to VS Code. Spell check is a critical concern of an editor; it's generally too complex to enforce correct spelling cross-platform via CLI and CI with any degree of confidence and you may be contributing spelling errors to projects with tooling and CI not in your control. I am hopelessly dependent on spellcheck when writing hundreds of words worth of comments and docs per day, or when naming things like types, functions, and variables.

VS Code spell check is quite poor because it's via an extension (https://marketplace.visualstudio.com/items?itemName=streetsidesoftware.code-spell-checker) and not the native macOS spellcheck system, so the spell check UI is inconsistent with other apps and the learned words are not in sync with every other app on your system. But at least it has a solution.

There are non VS Code editors out there that have really nice native OS based spell check, so it's possible.

jaydenseric avatar Jan 25 '24 03:01 jaydenseric

I would prefer LanguageTool to Grammarly as it is FOSS and supports more languages.

SpyrosMourelatos avatar Mar 04 '24 12:03 SpyrosMourelatos

This feature feels like a rather large one to add to Zed, when considering all of the things you should be able to do with spell check. What does a good first pass look like? What would be the bare minimum needed to ship something useable?

Some unknowns:

  • Is there a Rust crate out there suitable for spell checking in the context of code?
  • Would we have some sort of setting for which file types to spell check, or, maybe a setting for which file types to not spell check.. or both? I think VS Code's spell check has these settings.
  • How do we want to surface the spelling errors? Is it just another red squiggly line and do we have a way to suggest a fix? I feel like for a first pass, we could simply display the potential matches in the hover, and a future follow-up could add the possibility to accept a suggestion for a typo.
  • VS Code provides spell checking for multiple languages, do we just ship English first? Maybe we'll be lucky and have a crate that can handle multiple languages.

Bonus points

Spell checking in Zed's chat editor.

Future AI ideas

A distant future goal might be to somehow leverage the supported AI models in Zed to fill suggestions for misspelled words, if the suggestions provided by some crate aren't the greatest.

JosephTLyons avatar Mar 25 '24 17:03 JosephTLyons

How do we want to surface the spelling errors? Is it just another red squiggly line and do we have a way to suggest a fix?

In VS Code, the Grammarly extension integrates with the "suggested fix" feature provided by language servers. I can use the same keyboard shortcut to fix ESLint violations and spelling/grammar mistakes.
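
As a rough sketch of what that integration looks like on the wire: under the Language Server Protocol, a server reports a problem as a Diagnostic and offers the fix as a CodeAction of kind `quickfix`. The helper names, the `spell-check` source label, and the message text below are made up for illustration; only the field layout follows the LSP specification.

```python
import json


def spelling_diagnostic(line: int, start: int, end: int, word: str) -> dict:
    """Build an LSP Diagnostic for one unknown word (positions are zero-based)."""
    return {
        "range": {
            "start": {"line": line, "character": start},
            "end": {"line": line, "character": end},
        },
        "severity": 3,  # DiagnosticSeverity.Information
        "source": "spell-check",  # hypothetical source label
        "message": f'Unknown word: "{word}"',
    }


def quick_fix(uri: str, diagnostic: dict, suggestion: str) -> dict:
    """Build an LSP CodeAction that replaces the flagged range with a suggestion."""
    return {
        "title": f'Replace with "{suggestion}"',
        "kind": "quickfix",
        "diagnostics": [diagnostic],
        "edit": {
            "changes": {
                uri: [{"range": diagnostic["range"], "newText": suggestion}]
            }
        },
    }


diagnostic = spelling_diagnostic(line=4, start=10, end=16, word="lenght")
action = quick_fix("file:///src/main.rs", diagnostic, "length")
print(json.dumps(action, indent=2))
```

Because the fix travels as an ordinary code action, an editor can presumably bind it to the same shortcut it already uses for linter fixes, which is the behavior described above.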

bajtos avatar Mar 26 '24 08:03 bajtos

Vale is an offline rule-based "prose linter" (spelling & style checker) with an official LSP implementation. It is also code-aware so it can check code comments and won't get confused by markdown or HTML. Seems like a great fit?

jansol avatar Mar 26 '24 17:03 jansol

I've been testing Vale this week, and it works quite well at the CLI level. The configuration might be a bit tricky to set up, but it should be smooth sailing afterwards.

arthur-st avatar Apr 05 '24 14:04 arthur-st

Future AI ideas

A distant future goal might be to somehow leverage the supported AI models in Zed to fill suggestions for misspelled words, if the suggestions provided by some crate aren't the greatest.

I frequently switch among Zed, ChatGPT, Gemini, etc., for Markdown editing, focusing on three key tasks:

  1. Proofing: My standard response settings are tailored for specific styles, as exemplified below. I need the proofreading to adhere to a distinct style based on the context—basic questions, formal responses, and technical discourse—and to be delivered in Markdown format using British English spelling.

For example, part of my ChatGPT response template is: "Basic questions should be concise and direct, emulating the style of the Australian Daily Telegraph, without colloquialisms. Formal responses should be brief and direct, utilizing a military writing style. Technical responses should cater to a postgraduate audience and remain technical." Ideally, I'd like the option to easily select the level of proofing required right after highlighting the text I've just written, i.e., a "quick proof", followed by a more reasoned proof, where I give the AIs more context to assist in their feedback.

  2. Argument Improvement & Revision: I alternate between AI tools to refine and enhance my work. Direct integration of such feedback within Zed would streamline my editing process.

  3. Post-Production: The final adjustments include ensuring APA v7 compliance for references, adding document headers, and formatting the output to specific standards (PDF/HTML or voice script).

Incorporating these features into Zed would significantly enhance my editing workflow, minimizing the need to switch between different tools for each task, and negate the need for inline spelling checks.

phaynes avatar Apr 06 '24 00:04 phaynes

I would prefer LanguageTool to Grammarly as it is FOSS and supports more languages.

Yeah, I've never been overly impressed with the feedback I get from Grammarly.

brandondrew avatar Jun 27 '24 22:06 brandondrew

I'm trying to use zed to write docs in MD and MDX. The lack of a spell checker is really painful.

levlaz avatar Jul 11 '24 22:07 levlaz

Future AI ideas

A distant future goal might be to somehow leverage the supported AI models in Zed to fill suggestions for misspelled words, if the suggestions provided by some crate aren't the greatest.

I frequently switch among Zed, ChatGPT, Gemini, etc., for Markdown editing, focusing on three key tasks:

1. _Proofing_: My standard response settings are tailored for specific styles, as exemplified below. I need the proofreading to adhere to a distinct style based on the context—basic questions, formal responses, and technical discourse—and to be delivered in Markdown format using British English spelling.

For example, part of my ChatGPT response template is: "Basic questions should be concise and direct, emulating the style of the Australian Daily Telegraph, without colloquialisms. Formal responses should be brief and direct, utilizing a military writing style. Technical responses should cater to a postgraduate audience and remain technical." Ideally, I'd like the option to easily select the level of proofing required right after highlighting the text I've just written, i.e., a "quick proof", followed by a more reasoned proof, where I give the AIs more context to assist in their feedback.

2. _Argument Improvement & Revision_: I alternate between AI tools to refine and enhance my work. Direct integration of such feedback within Zed would streamline my editing process.

3. _Post-Production_: The final adjustments include ensuring APA v7 compliance for references, adding document headers, and formatting the output to specific standards (PDF/HTML or voice script).

Incorporating these features into Zed would significantly enhance my editing workflow, minimizing the need to switch between different tools for each task, and negate the need for inline spelling checks.

This is very common for me as well. As someone who struggles with formatting and grammar, I use ChatGPT and Gemini to help me spot issues, though I don't usually use them for their generative capabilities. A powerful spell and grammar checker like Grammarly would greatly assist with this. I don't think, however, you need to integrate any AI into the codebase.

albassort avatar Jul 22 '24 22:07 albassort

Hi,

I have taken a first stab at creating a configurable full spell checker / grammar checker and proofing engine that integrates with Zed. Example key bindings are included, and it uses the OpenAI and Anthropic APIs.

The markdown proofing engine is here, and I am finalising an initial baseline to part of a general publication engine from text - although I am starting with research papers.

Any and all feedback would be greatly appreciated. This is genuinely a first drop of the approach.

Philip

phaynes avatar Jul 23 '24 23:07 phaynes

I truly appreciate the effort the team and contributors put into bringing this feature to life, but I have to ask: is there any chance of getting a spellchecker for Zed that doesn't rely on a remote service or require a local GPU?

florinpatrascu avatar Jul 27 '24 15:07 florinpatrascu

I truly appreciate the effort the team and contributors put into bringing this feature to life, but I have to ask: is there any chance of getting a spellchecker for Zed that doesn't rely on a remote service or require a local GPU?

The Vale extension does exactly this. It relies on a local dictionary (defined with plain text files) that it matches words and phrases against with plain old regular expressions. Unfortunately, it currently only advertises support for Markdown files (Vale itself also supports spellchecking comments in programming languages), and the language server was crashing a lot when used from Zed the last time I tried. Nobody really knew why, though.

jansol avatar Jul 27 '24 21:07 jansol

The short answer is yes, if you are happy to run a local LLM. I performed a smoke test of the concept using Llama-3.1-8b-Instant on groq, and this seems to be viable.

Would a solution that uses Llama-3.1-8b running on something like Ollama locally work for you, or running Llama inside a Docker container?

The engine above is designed to support multiple AI engines (via the -ai flag), but it needs checking and would likely require modifications to the configurable prompts.

Some more explanation:

The goal of the above engine is to not only consider spell checking for different international languages but also complex grammar checking and proofing against a range of criteria required for publication-grade technical documentation.

Thus, spell checking the following should be fine:

  • Definition: Acids are hydrogen-containing compounds that dissociate in water to give H\(^+\) ions.
  • Equation: \(\text{HCl(aq)} \rightarrow \text{H}^+(\text{aq}) + \text{Cl}^-(\text{aq})\)

As should the following prompt:

Without providing comments, proof the following text for spelling and grammar, using British English, active voice, markdown, academic style, and do not change BibTeX references.

Furthermore, spell and grammar checking within a software IDE requires an understanding of code and data formats such as JSON to avoid the ongoing frustration of not being able to closely integrate code and data across the full large-scale software engineering lifecycle.

While it is conceivable these considerations could be met otherwise, with the advent of high-quality open-source LLMs, I believe it would be cost prohibitive to do so.

phaynes avatar Jul 27 '24 22:07 phaynes

While it is conceivable these considerations could be met otherwise, with the advent of high-quality open-source LLMs, I believe it would be cost prohibitive to do so.

An understandable concern; this has only been done without LLMs for decades, after all. Faster and more deterministically than an LLM ever will, too.

It is truly fascinating how quickly people have discarded the very idea of even trying to do anything without an LLM.

jansol avatar Jul 27 '24 23:07 jansol

Would a solution that uses Llama-3.1-8b running on something like Ollama locally work for you, or running Llama inside a Docker container?

As a non-native English speaker, I find a standard spellchecker incredibly helpful. If I have to run any LLM in the background or use Docker for such tasks, it would diminish my enjoyment of coding. Additionally, I do not consider Docker an essential tool for my workstation — servers or similarly related tasks/needs, perhaps, but not for my personal setup. This is just my personal preference, of course.

florinpatrascu avatar Jul 28 '24 00:07 florinpatrascu

An understandable concern; this has only been done without LLMs for decades, after all. Faster and more deterministically than an LLM ever will, too.

It is truly fascinating how quickly people have discarded the very idea of even trying to do anything without an LLM.

The x86 assembly source code for my 1984 scientific word processor was indeed fast but is unfortunately lost to history. However, your assertion that "people have quickly discarded the idea of doing anything without an LLM" is incorrect and underestimates the complexity of the problem I have outlined.

Large companies like Grammarly exist for a reason: grammar checking, citations, and writing to a style are challenging problems, particularly for any spoken language. This is before considering how these technologies must be closely integrated within software environments and optimized against specific criteria. These functions are indeed well-suited for LLMs.

As I stated, my goal is to have the capability to produce publication-grade documentation from the source. This capability is a key mechanism for whole lifecycle integration in large software-intensive programs.

If this is not within your scope, that is okay too.

phaynes avatar Jul 28 '24 01:07 phaynes

As a non-native English speaker, I find a standard spellchecker incredibly helpful. If I have to run any LLM in the background or use Docker for such tasks, it would diminish my enjoyment of coding. Additionally, I do not consider Docker an essential tool for my workstation — servers or similarly related tasks/needs, perhaps, but not for my personal setup. This is just my personal preference, of course.

I’m glad I asked! Would the following option be useful instead: a simpler, dictionary-based English spell checker that assumes the entire line is text? It would handle only basic spelling errors within the text block, leaving grammar and more sophisticated proofing for a different stage.

phaynes avatar Jul 28 '24 01:07 phaynes

Almost every spellchecker out there runs on plain old dictionary checks. Think of your iPhone or Android keyboards. Your Mac. Every browser you've used. They all run with fairly basic spell checking, and that's alright. lenght, heigth, gramar, enginer, ...

Overcomplicating spell checking with LLMs brings no tangible benefits to the table, only disadvantages such as higher resource usage, slower feedback loops, and non-deterministic results. It's a code editor, not MS Word. I ain't writing Harry Potter books on it.

Altair-Bueno avatar Jul 28 '24 08:07 Altair-Bueno

Would a solution that uses Llama-3.1-8b running on something like Ollama locally work for you, or running Llama inside a Docker container?

As a non-native English speaker, I find a standard spellchecker incredibly helpful. If I have to run any LLM in the background or use Docker for such tasks, it would diminish my enjoyment of coding. Additionally, I do not consider Docker an essential tool for my workstation — servers or similarly related tasks/needs, perhaps, but not for my personal setup. This is just my personal preference, of course.

@florinpatrascu actually, I have a different idea that I would appreciate your input on.

Question 1: As a non-native English speaker, would the ability to write initial comments in your native language and then translate them to English be beneficial?

Context:

“Spell checking” can mean various things: from simple autocorrect in mobile messaging to spell checking in text, basic spell checking in code, and the more rigorous spell checking required in formal system documents, typically for larger software systems. To avoid the complexity of writing parsers for every language and context, robustly identifying written words in text, and enabling sentence-level contextual spell checking:

Question 2: Would having a small embedded LLM periodically run on your local machine as needed to profile your code statistically, simplify spell checking, and incorporate custom rules (e.g., ignoring LaTeX/references), be acceptable?

phaynes avatar Jul 29 '24 02:07 phaynes

Almost every spellchecker out there runs on plain old dictionary checks. Think of your iPhone or Android keyboards. Your Mac. Every browser you've used. They all run with fairly basic spell checking, and that's alright. lenght, heigth, gramar, enginer, ...

Overcomplicating spell checking with LLMs brings no tangible benefits to the table, only disadvantages such as higher resource usage, slower feedback loops, and non-deterministic results. It's a code editor, not MS Word. I ain't writing Harry Potter books on it.

Your 2 points are at odds with each other. First you want a simple spell checker that "runs on plain old dictionary checks". Next you point out that Zed is a code editor.

While I can see why someone would choose Zed to write a novel in--it's a pretty awesome editor--you're exactly right that Zed is primarily aimed at editing code. This very fact makes the spell checker much more complicated than you seem to realize. For example, do you want your spell checker complaining about every token in your code that doesn't match a word in the dictionary? That's what you would get with nothing but "plain old dictionary checks".

brandondrew avatar Jul 29 '24 04:07 brandondrew

do you want your spell checker complaining about every token in your code that doesn't match a word in the dictionary

Yes. Exactly. That's great behavior. Even better if I can add my own custom dictionaries. For non-native English speakers, such as Florin and me, that's all we need. And even native English speakers would benefit from it.

However, it needs to be mindful of case style (e.g., CollectorBuilder should be split into collector and builder, and each checked separately).

Altair-Bueno avatar Jul 29 '24 07:07 Altair-Bueno

do you want your spell checker complaining about every token in your code that doesn't match a word in the dictionary

Yes. Exactly. That's a great behaviour.

No, that is terrible behavior and will never work.

I sit in the camp of writing code, yet I somehow find myself writing 30-40K+ high-quality words per year across the full SDLC.

My guess is that it is important Zed correctly supports international work. The spell checker can’t disrupt global build systems when people from the Commonwealth insist on using British English, and US teams dig their heels in, insisting on American English. Also, we all know how much the French love having their code comments translated to English, and that Japanese teams prefer writing and reading code in a language other than US English.

For this reason, I think the spell checker needs to work in batch mode with common parameters. In this way, code comments are saved and then checked into a common baseline, then translated to the programmer’s spoken language of choice on the fly when opening a file, permitting editing of the file in the person’s native language. When checking the code back in, the written words are transformed back into the baseline language. This is just solving a more complicated version of the spaces/tabs and curly braces wars.

Additionally, the near-ubiquitous use of code within comments pretty much requires some sort of smart parameterization so only written words are checked. The spell checker needs to work not only in mainline code but with config and data files of every flavor. Again, this will need to be tailored to the circumstances. I can’t see how this can easily be done without the use of LLMs.

Now, I definitely think an embedded mode would be super useful—and indeed essential when development environments are not internet-connected. Given that LLMs can also generate code, an acceptable compromise could be for the LLM to pre-assess code locally, allowing simplified spell checkers to be generated on the fly and permit more lightweight checking.

Anyway, these are my thoughts and relatively straightforward enhancements based on the work I have done above.

Philip

phaynes avatar Jul 29 '24 08:07 phaynes

Hi @phaynes,

Thank you for your questions and for your interest in my/our style of composing and editing code.

Your questions have already been covered by the responses of other participants, so I do not believe I can or should add lots more.

In my case, I do not want any LLM running in the background on my local machine when I am composing and editing code. I have no trouble using a remote service, but that's off-topic.

I understand that I might be part of the last generation of programmers who write code "by hand" — my mentor, very many moons ago, used to say that we should aspire to become "software writers" not "software developers", and I am convinced that I have failed that aspiration :)

Therefore, I will continue to write code as I have been, relying on my own "local LLM" that I have "trained" over several decades, whether it is considered foolish or not - it continues to pay my bills, so far.

I am certain that the solutions you suggest will appeal to 99% of the new generation. However, I continue to seek a simple and non-intrusive solution - I do not want to have to use rebase/amend/etc. and push again, because of a typo. That is all.

I have much more to say about the aspect related to comments written in the native language, but I do not think additional support is necessary. Zed already has a well done in-line assistant that is very good for this purpose, if one desires it.

However, if someone does not know English and relies solely on an automatic translator, they should at least ensure they write grammatically correct text in their own language. Otherwise, the result can be deplorable: I have seen code written by consultants who barely spoke English, and I can only say that I would have preferred that code to have no comments at all or, better yet, to have been written by someone with real coding experience. But I digress.

That is all I had to say.

Thank you. ッFlorin

florinpatrascu avatar Jul 29 '24 12:07 florinpatrascu

do you want your spell checker complaining about every token in your code that doesn't match a word in the dictionary

Yes. Exactly. That's great behavior. Even better if I can add my own custom dictionaries. For non-native English speakers, such as Florin and me, that's all we need. And even native English speakers would benefit from it.

However, it needs to be mindful of case style (e.g., CollectorBuilder should be split into collector and builder, and each checked separately).

You just contradicted yourself, proving my point. First you said that would be great behavior, but then you added a caveat, showing that you don't actually want simple comparison of tokens to dictionary entries. You actually want the spell checker to be "smart" enough to split compound class names into separate tokens--basically you want it to be aware of differences that are a result of the fact that you're editing code and not normal English text. You can't simultaneously claim this is a simple problem and also say that you want such benefits which require added complexity. The two claims are contradictory. You can't have it both ways.

Maybe it would be better to be thankful for all the work that the Zed programmers are doing for us, instead of complaining that it's not good enough.

brandondrew avatar Jul 29 '24 16:07 brandondrew

You know what? You are right. Splitting tokens accordingly is not that easy a problem. Thankfully, it is an already solved one, and much easier with Tree-sitter built in. You can ignore language keywords to speed up spellchecking.
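
To make the idea concrete, here is a minimal sketch of syntax-aware filtering. It uses Python's standard `tokenize` module as a stand-in for Tree-sitter (an assumption made so the example runs without extra dependencies) and only handles Python source. The names `prose_words` and `SAMPLE` are illustrative.

```python
import io
import re
import tokenize

WORD = re.compile(r"[A-Za-z]+")

# Sample source with typos in an identifier, a comment, and a string literal.
SAMPLE = 'def lenght(x):\n    # retrun the lenght\n    return "helo"\n'


def prose_words(source: str):
    """Yield (line, word) pairs found in comments and string literals only.

    Keywords, identifiers, and operators never reach the spell checker,
    because their tokens are skipped entirely. The line number is the line
    on which the token starts, so it is approximate for multi-line strings.
    """
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.STRING):
            for match in WORD.finditer(tok.string):
                yield tok.start[0], match.group(0)


for line, word in prose_words(SAMPLE):
    print(line, word)
```

Only the words from the comment and the string are produced; `def`, `return`, and the identifier `lenght` are never considered. Checking identifiers as well would mean adding name tokens and splitting them by case style first.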

However, instead of debating how difficult this is, we can just look at what is out there.

This is the source code for IntelliJ's spellchecker, which I personally consider to be the best out there (code-wise). The license is Apache: https://github.com/JetBrains/intellij-community/tree/master/spellchecker/src/com/intellij

And this one is Code Spell Checker for VSCode. Not as good. Not as powerful (IntelliJ also caches a lot more stuff like prepositions), but probably simpler. GNU license. https://github.com/streetsidesoftware/vscode-spell-checker

Both of these implement the behavior I stated earlier.

Altair-Bueno avatar Jul 29 '24 22:07 Altair-Bueno

Dear @florinpatrascu,

Thank you very much for your response and for answering my questions. I found your insights very interesting and appreciate your considered post above.

Perhaps Zed's performant nature is the reason for this discussion. It is addictive, like driving a Ferrari. The choice to drive a Ferrari comes with emotion and passion, and of course, a price tag. You often have to accept that the Ferrari omits many features available in consumer cars, like a Ford.

The "driving pleasure" of Zed is something I haven't experienced since the '80s with my x86 assembly language scientific editor. Since then, due to economics, a reasonable IDE, and Word, is what I have had to use. Now, later in life, I routinely apply the Marie Kondo principle of "Ask Yourself If It Sparks Joy" when deciding whether to keep something. Reflecting on my over 25 years of Word usage, I find that Word does not "spark joy" for me. I am quite motivated to place it into the garbage bin and now have a clear and achievable pathway for doing so.

Florin, I completely respect your choice not to have an LLM running in parallel with your editor. You enjoy the Ferrari driving experience, accept it without heated seats, and are able to commute and live around your home without expecting to go cross-country — awesome. It is fascinating to see how different people live and the choices they make when you travel.

People are often a product of their experiences. Part of mine involves seeing how fantastically expensive it can become when writing precise programming tools, and that in practice, it is almost always harder than initially thought or reasonable. So, congratulations to the Zed team for the amazing work they have done. My ambition and budget are more modest. I aim to do all my programming and editing work in a single performant environment, with Word and its ecosystem, along with tools such as Grammarly, becoming just a painful memory.

In this context, I find connecting Zed to LLMs as definitely the lesser of two evils. My usage requires high-quality grammar checking, connection of different tools and security standards into the programming environment, and production of publication-grade documentation.

This trade-off is not for everyone. That stated, mostly I get to drive the Ferrari, but I accept that occasionally the Alfa is driven when ferrying the kids about. But at least the Ford can be sold.

Do look after yourself, and enjoy the Olympics.

Philip

phaynes avatar Jul 30 '24 00:07 phaynes

In my opinion, an LLM based spell checking tool is an entirely different feature from a local spell checker, and arguing it is a good replacement is like saying a typewriter is a replacement for a pen - sure for some people for some tasks it's a much better tool, but for others it's simply not a replacement.

An LLM based spell checking tool currently has some very big drawbacks:

  • Requires running a local LLM (on most computers today it requires large power draw, heat and noise) or relying on (and maybe paying for) some external server farm
  • Significantly worse performance on languages that are not English
  • Sometimes makes mistakes

All of these will improve over time, and maybe at some point in the future it would not make sense to have dictionary-based spell checking, but currently this is not an alternative for the majority of people.

Also, that is not to say that language models cannot be used within such a local spell checker, but they have to be tiny enough to run on most configurations, optional, and still use a dictionary as a source of truth. One really good example of that is the ordering of possible corrections: as a non-native English speaker, I often find myself chucking a really badly spelled word into a search engine, and it gives much better results than the built-in autocomplete in browsers. Firefox had success doing something similar with a GPT-2 based model for alt text generation.
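
The split between "dictionary as the source of truth" and "something smarter for ordering" can be sketched as follows. The ranking step here is a plain string-similarity score from Python's standard `difflib`, used as a placeholder for whatever small model might do the ordering; the word list and the `suggest` helper are illustrative assumptions.

```python
import difflib

# Tiny illustrative word list; a real checker would load a full dictionary.
DICTIONARY = ["length", "lengthy", "height", "weight", "grammar", "engineer"]


def suggest(word: str, dictionary: list[str] = DICTIONARY, limit: int = 3) -> list[str]:
    """Return up to `limit` corrections, best match first.

    The dictionary decides whether the word is wrong at all; the ranking
    function only orders candidates that already exist in the dictionary,
    so it can never invent a word.
    """
    lowered = word.lower()
    if lowered in dictionary:
        return []  # Correctly spelled: nothing to suggest.
    return difflib.get_close_matches(lowered, dictionary, n=limit, cutoff=0.6)


for typo in ("lenght", "gramar", "enginer"):
    print(typo, "->", suggest(typo))
```

Swapping `difflib.get_close_matches` for a learned ranker would change the order of the candidates but not the set of valid words, which keeps the results deterministic with respect to the dictionary.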

bbb651 avatar Aug 02 '24 12:08 bbb651