Non UTF-8 inputs
⭐ Suggestion
I would like to be able to use ast-grep on files that are saved in an encoding other than UTF-8 or UTF-16.
💻 Use Cases
Especially in older codebases (where large-scale automatable refactorings would be really useful), many files are not encoded in ASCII, UTF-8, or UTF-16 but in some other encoding. At the moment the only workaround is to re-encode the files to UTF-8, run ast-grep for the refactorings, and then re-encode them back to the original encoding so that build systems etc. do not break.
In theory this could be done with a CLI flag that specifies the known file encoding. I do not know how reading files is implemented at the moment, but I think Rust should provide a way to specify an encoding when reading from a file, so the implementation should (hopefully) be pretty straightforward.
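For anyone who needs the workaround today, here is a minimal sketch of the round trip in Python. The function name `run_on_utf8_copy` and the `step` callback are hypothetical; `step` stands in for whatever invokes ast-grep on the file, and the legacy encoding (for example `cp1252`) is assumed to be known in advance.

```python
from pathlib import Path
from typing import Callable


def run_on_utf8_copy(path: Path, encoding: str, step: Callable[[Path], None]) -> None:
    """Re-encode `path` to UTF-8, run `step` on it, then restore `encoding`.

    `step` is a placeholder for the refactoring tool (e.g. a subprocess
    call that runs ast-grep on `path`); it is not part of ast-grep itself.
    """
    original = path.read_bytes()
    # Strict decoding raises if the bytes are not valid in `encoding`.
    path.write_bytes(original.decode(encoding).encode("utf-8"))
    try:
        step(path)
        rewritten = path.read_bytes().decode("utf-8")
        # Strict encoding raises if the rewrite introduced characters
        # that the legacy encoding cannot represent.
        path.write_bytes(rewritten.encode(encoding))
    except Exception:
        # Put the untouched original bytes back so the build does not break.
        path.write_bytes(original)
        raise
```

Working on raw bytes avoids newline translation, and restoring the original bytes on any failure keeps a half-converted file from being left behind.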
Thanks for the report. This is a valid use case, but I'm not personally interested.
If anyone wants to work on it, please leave a comment here and we can discuss the implementation. Otherwise, if GitHub sponsors back this request, it will be prioritized.
I could work on implementing it. It seems to me that reading files is done in utils. We could use something like encoding_rs to look up the encoding and handle the file as UTF-8 internally. I would propose validating the encoding up front.
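To make the idea concrete, here is a rough sketch of "validate up front, then decode to Unicode internally". It is written in Python with the standard `codecs` module purely as an illustration; it is not the encoding_rs API and not ast-grep code, and the function names `validate_encoding` and `decode_source` are made up for this example.

```python
import codecs


def validate_encoding(label: str) -> str:
    """Return the canonical codec name for `label`, or raise ValueError.

    Intended to run once at startup, before any file is read.
    """
    try:
        return codecs.lookup(label).name
    except LookupError as err:
        raise ValueError(f"unknown encoding: {label!r}") from err


def decode_source(data: bytes, label: str) -> str:
    """Decode raw file bytes into text for internal processing."""
    # Strict decoding: invalid byte sequences raise UnicodeDecodeError
    # instead of being silently replaced.
    return data.decode(validate_encoding(label))
```

Validating the label before touching any file means a typo in the encoding name fails immediately rather than halfway through a large refactoring run.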
You also need to design the CLI argument, the default behavior, the docs, and a lot of other stuff. It is not an easy task, though.
My suggestion is to first list the work items, and then we can discuss.
Can you point me to an example that I could use as a reference? Otherwise I would start with a proof of concept that implements reading with another encoding and go from there.
Regarding defaults: without any option provided, I would expect the default behaviour not to change.
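As a starting point for discussion, the default could look roughly like the sketch below. The `--encoding` flag is hypothetical (the name is borrowed from ripgrep's `--encoding` option) and does not exist in ast-grep; treating UTF-8 as the fallback is also an assumption about the current behaviour. The sketch uses Python's `argparse` only to show the shape of the option.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with an optional, hypothetical --encoding flag."""
    parser = argparse.ArgumentParser(prog="example-tool")
    parser.add_argument(
        "--encoding",
        default="utf-8",  # assumed default: behaviour is unchanged when omitted
        help="encoding used to read and write source files",
    )
    parser.add_argument("paths", nargs="*", help="files to process")
    return parser
```

With this shape, `build_parser().parse_args(["src/a.c"])` leaves `encoding` at `"utf-8"`, and only users who pass the flag opt in to a different encoding.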
https://github.com/BurntSushi/ripgrep/blob/master/GUIDE.md
Please read the ripgrep guide when designing the behavior and the CLI argument.
That said, I don't expect ast-grep to handle non-UTF-8 legacy codebases.
File encoding is probably the first thing to handle if you want to refactor your codebase. Code rewriting is a lower priority than fixing the encoding.
I will close this first. @iXialumy if you still want to work on this, let me know and I will reopen it.