
[WIP] tools: Add pygments import script

Open iamkroot opened this issue 6 years ago • 3 comments

Currently, we can retrieve the regex patterns from lexers for the required tokens of all languages not found in the coAST schema.
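As a rough illustration of what retrieving patterns from a lexer can look like (this is not the script from this PR, only a minimal sketch), the snippet below walks the `tokens` table of a Pygments `RegexLexer` subclass and collects the patterns mapped to `Keyword` or one of its subtypes. The helper name `keyword_rules` is hypothetical, and `PythonLexer` is used only as a convenient example.

```python
from pygments.lexer import words
from pygments.lexers import PythonLexer
from pygments.token import Keyword


def keyword_rules(lexer_cls, token=Keyword):
    """Collect (state, pattern) pairs whose action is `token` or a subtype.

    Hypothetical helper: only looks at the lexer's own `tokens` table.
    """
    found = []
    for state, rules in lexer_cls.tokens.items():
        for rule in rules:
            # include()/default()/inherit entries are not plain rule tuples.
            if not isinstance(rule, tuple) or len(rule) < 2:
                continue
            pattern, action = rule[0], rule[1]
            # Callables such as bygroups(...) are never "in" a token type,
            # so they are skipped here.
            if action in token:
                found.append((state, pattern))
    return found


rules = keyword_rules(PythonLexer)

# Patterns built with words(...) already carry their literal keyword list;
# plain regex strings would still need to be converted.
keywords = sorted(
    {w for _, pattern in rules if isinstance(pattern, words) for w in pattern.words}
)
regex_only = [pattern for _, pattern in rules if isinstance(pattern, str)]
```

Rules whose action is a callable (for example `bygroups(...)`) are ignored in this sketch, so a real import script would likely need extra handling for them.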

TODO:

  • [x] Identify all the required Token types, and corresponding coAST entities
  • [x] Write proper abstraction to handle regex -> keyword conversion
  • [x] (Optional) Add the filenames property to Language schema

Will close #96

iamkroot • Dec 18 '18 16:12

I feel like the number of lines is getting too big. Will probably break up the script into two or three files.

iamkroot • Dec 21 '18 17:12

Also, I'm not really satisfied with the extraction logic for the keywords. I'm currently going word by word, handling each regex metacharacter and its behaviour separately, which is obviously not very sustainable and leaves out many edge cases. To verify that keywords have been extracted properly, we simply match each keyword against the original pattern it was extracted from. As of now, the script fails for about 100 languages. This can be improved drastically by doing either of the following:

  • manually handle each edge case - easily leads to bloated code, which will be hard to maintain/update
  • make a nice parser/abstraction

I've been trying, rather unsuccessfully, to do the second one using regexes, but I'm not very skilled at that, so I couldn't figure out the proper logic to do so. If someone can help out, it would be greatly appreciated :smiley:
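One possible direction for the second option, sketched under assumptions: instead of writing regexes to take regexes apart, let Python's own regex parser build a tree and walk it. The parser module used below is an undocumented CPython internal (`re._parser` on 3.11+, `sre_parse` before that), so this relies on implementation details. The names `expand` and `extract_keywords` are hypothetical, and only literals, alternation, groups, simple character sets, `?`, and zero-width anchors are handled; anything else raises `ValueError`. The last step mirrors the verification described above, matching each keyword against the original pattern.

```python
import re

try:  # Python 3.11+ ships the regex parser inside the re package
    import re._constants as C
    import re._parser as parser
except ImportError:  # older interpreters expose it as top-level modules
    import sre_constants as C
    import sre_parse as parser


def expand(parsed):
    """Return every string a parsed pattern can match, or raise ValueError."""
    results = [""]
    for op, av in parsed:
        if op == C.LITERAL:
            choices = [chr(av)]
        elif op == C.IN:
            # The parser may turn single-character alternatives into a set.
            choices = []
            for item_op, item_av in av:
                if item_op != C.LITERAL:
                    raise ValueError(f"unsupported set item: {item_op}")
                choices.append(chr(item_av))
        elif op == C.BRANCH:
            choices = [s for alt in av[1] for s in expand(alt)]
        elif op == C.SUBPATTERN:
            choices = expand(av[-1])  # the sub-pattern is the last field
        elif op == C.MAX_REPEAT and av[0] == 0 and av[1] == 1:
            choices = [""] + expand(av[2])  # optional part: absent or present
        elif op == C.AT:
            continue  # anchors such as \b match no characters
        else:
            raise ValueError(f"unsupported construct: {op}")
        results = [prefix + c for prefix in results for c in choices]
    return results


def extract_keywords(pattern):
    """Expand `pattern` into keywords and verify each against the pattern."""
    keywords = expand(parser.parse(pattern))
    compiled = re.compile(pattern)
    bad = [kw for kw in keywords if not compiled.fullmatch(kw)]
    if bad:
        raise ValueError(f"keywords do not match their pattern: {bad}")
    return keywords


python_like = extract_keywords(r"\b(?:if|elif|else|for|while)\b")
c_like = extract_keywords(r"(?:un)?signed|int")
```

Patterns that cannot be reduced to a finite word list (for example `[a-z]+`) fail loudly instead of being silently mangled, which keeps the unsupported cases easy to count.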

iamkroot • Dec 21 '18 17:12

I guess most of the hard part is completed now. I've hit a snag with dumping the YAML file, as the pyyaml package sorts the keys in alphabetical order before the dump. There's already a PR in place to fix this over at yaml/pyyaml#254, so we might have to wait for that to be merged, but even that will only help on Python >= 3.6, where dicts preserve creation order. The other alternative is to use wimglenn/oyaml, but I would prefer not to add another dependency for this.
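For reference, a minimal sketch of an order-preserving dump, assuming a PyYAML release whose dump functions accept a `sort_keys` argument (to my knowledge this arrived in PyYAML 5.1; treat the exact version as an assumption). The helper name `dump_in_order` and the sample fields are hypothetical and are not taken from the coAST schema.

```python
import yaml


def dump_in_order(data):
    """Dump `data` as block-style YAML, keeping dict insertion order.

    Assumes PyYAML accepts sort_keys (an assumption about the installed
    version) and a Python version where dicts preserve insertion order.
    """
    return yaml.safe_dump(data, sort_keys=False, default_flow_style=False)


# Hypothetical language entry; the field names are for illustration only.
language = {"name": "Python", "filenames": ["*.py"], "aliases": ["py"]}
text = dump_in_order(language)
```

With `sort_keys=True` (the default) the same call would emit `aliases` first, which is the alphabetical reordering described above.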

iamkroot • Mar 09 '19 18:03