Non UTF-8 inputs

Open iXialumy opened this issue 1 year ago • 7 comments

⭐ Suggestion

I would like to be able to use ast-grep on files that are saved in an encoding other than UTF-8 or UTF-16.

💻 Use Cases

Especially in older code bases (where large-scale automatable refactorings would be really useful), many files are not encoded in ASCII, UTF-8 or UTF-16, but in some other encoding. At the moment the only workaround is to re-encode the files to UTF-8, run ast-grep for the refactorings, and finally re-encode back to the original encoding so that build systems etc. do not break.

In theory this could be done with a CLI flag where you specify the known file encoding. I do not know how reading files is implemented at the moment, but I think Rust should provide a way to specify the encoding when reading from a file, so the implementation should (hopefully) be pretty straightforward.
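The re-encode, refactor, re-encode workaround described above can be sketched in a few lines. This is only an illustration under stated assumptions: `rewrite_in_place` is a hypothetical helper, and its `transform` argument is a stand-in for whatever refactoring step (for example an ast-grep run on the decoded text) you would apply in between. It is not part of ast-grep.

```python
from pathlib import Path
from typing import Callable


def rewrite_in_place(path: Path, encoding: str,
                     transform: Callable[[str], str]) -> None:
    """Hypothetical helper: decode a legacy file, rewrite it, re-encode it.

    `transform` stands in for the refactoring step; it receives and
    returns ordinary Unicode text.
    """
    # Read raw bytes so that line endings are preserved exactly.
    # Strict decoding raises UnicodeDecodeError on bytes that are
    # invalid in the given encoding instead of silently replacing them.
    text = path.read_bytes().decode(encoding)

    new_text = transform(text)

    # Strict encoding raises UnicodeEncodeError if the rewrite introduced
    # a character the original encoding cannot represent.
    path.write_bytes(new_text.encode(encoding))
```

Working on bytes and decoding strictly means a wrong encoding guess tends to fail loudly rather than corrupt the file, which matters when the result has to keep building with the original toolchain.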

iXialumy avatar Sep 12 '24 14:09 iXialumy

Thanks for the report. This is a valid use case but I'm not personally interested.

If anyone wants to work on it, please leave a comment here and we can discuss the implementation. Otherwise, if GitHub sponsors back this request, it will be prioritized.

HerringtonDarkholme avatar Sep 12 '24 14:09 HerringtonDarkholme

I could work on implementing it. It seems to me that file reading is done in utils. We could use something like encoding_rs to decode the given encoding and handle the file as UTF-8 internally. I would propose validating the encoding up front.
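As a rough sketch of that idea (validate the encoding label once, then decode so that everything downstream only sees Unicode text), here is a version that uses Python's standard `codecs` module as a stand-in for encoding_rs. The function names `resolve_encoding` and `read_as_text` are hypothetical and do not reflect how ast-grep is actually structured.

```python
import codecs


def resolve_encoding(label: str) -> str:
    """Validate a user-supplied encoding label before any file is read.

    Returns the codec's canonical name, or raises ValueError for an
    unknown label so the error surfaces up front.
    """
    try:
        return codecs.lookup(label).name
    except LookupError as err:
        raise ValueError(f"unknown encoding: {label!r}") from err


def read_as_text(data: bytes, label: str) -> str:
    """Decode raw file bytes into Unicode text using a validated encoding.

    Later stages (parsing, matching, rewriting) then never need to know
    what the on-disk encoding was.
    """
    return data.decode(resolve_encoding(label))
```

Validating the label before touching any file means a typo in the encoding name is reported immediately, rather than after some files have already been processed.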

iXialumy avatar Sep 12 '24 15:09 iXialumy

You also need to design the CLI argument, the default behavior, the docs and a lot of other stuff. It is not an easy task, though.

My suggestion is to first list the work items, and then we can discuss.

HerringtonDarkholme avatar Sep 12 '24 15:09 HerringtonDarkholme

Can you point me to an example that I could use as a reference? Otherwise I would start with a proof of concept that implements reading with another encoding and go from there.

iXialumy avatar Sep 12 '24 15:09 iXialumy

Regarding defaults: if no encoding is specified, I would expect the default behaviour not to change.

iXialumy avatar Sep 12 '24 15:09 iXialumy

https://github.com/BurntSushi/ripgrep/blob/master/GUIDE.md

Please read the ripgrep guide to design the behavior and the CLI argument.

HerringtonDarkholme avatar Sep 12 '24 15:09 HerringtonDarkholme

That said, I don't expect ast-grep to handle non-UTF-8 legacy codebases.

File encoding is probably the first thing to handle if you want to refactor your codebase. Code rewriting is a lower priority than fixing the encoding.

HerringtonDarkholme avatar Sep 12 '24 16:09 HerringtonDarkholme

I will close this for now. @iXialumy if you still want to work on this, let me know and I will reopen it.

HerringtonDarkholme avatar Nov 03 '24 03:11 HerringtonDarkholme