crystal-libraries-needed
String tokenizer
A string tokenizer like PragmaticTokenizer (Ruby).
I am going to take a quick stab at this if no one has tackled it. I gave the PragmaticTokenizer source a quick look, and the only unmet dependency I saw for porting it to Crystal almost as-is is the CGI call for unescaping HTML text. Crystal has an HTML module for that instead of CGI.
@GrgDev It's not really the CGI dependency that is the issue. There are a couple of hurdles (from my sketchy memory):
- The regexes (there are a lot of them, and they are complex) won't work as is.
- The organization of the Ruby code cannot be directly duplicated, so it's a re-engineering job in that respect.
Otherwise I'll be interested in what you find, and may (time permitting) be able to offer some assistance.
Yeah, I see that the project setup and file organization would need to change a bit. I'll dig into the regex differences and see what I find.
It's the metaprogramming I'm more worried about... it's a bit tricky to untangle (unless you have a clear head and are locked in a room in silence).
Can you add a link to the tokenizer you're proposing to 'duplicate'? (So that people coming here won't have to search for it to see what you mean.)
https://github.com/diasks2/pragmatic_tokenizer
I'm busy at work right now, but I went ahead and stubbed out a quick empty repo here. Please excuse the corny name.
https://github.com/GrgDev/crystalized_tokenizer
If I get around to this, the work will be there.
Not done yet. Putting a comment here in case I drop this for some reason so someone else can learn what I found already.
I ran into the metaprogramming issues, but they don't seem to be too bad. I have only found two so far:
- It extends String inline to add new custom methods. I just converted them to non-destructive methods that you pass the string to instead.
- It checks at runtime whether certain methods are `#defined?` in a language module, which we replace with the `#responds_to?` macro. So far I have only found this with the check for `SingleQuotes`, but that's a class/struct, not a method, so it might be worth the hack of just adding a constant Bool to each language module that says whether it has it.
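The two conversions above can be sketched roughly as follows. This is written in Ruby, the language of the original gem, only to show the shape of the change. Every name here (`TextHelpers`, `strip_trailing_dot`, `HAS_SINGLE_QUOTES`, `English`, `German`) is hypothetical and not taken from PragmaticTokenizer or from the port. The same structure should carry over to Crystal, but that is an assumption rather than something verified here.

```ruby
# All names below are hypothetical illustrations, not PragmaticTokenizer's API.

# Before (Ruby-only style): reopening String to add a helper.
#   class String
#     def strip_trailing_dot
#       sub(/\.\z/, "")
#     end
#   end

# After: a non-destructive module function that takes the string as an argument.
module TextHelpers
  def self.strip_trailing_dot(text)
    text.sub(/\.\z/, "") # sub returns a new string; the argument is not mutated
  end
end

# Each language module declares the capability as a plain Boolean constant,
# so nothing needs to be probed with `defined?` at runtime.
module English
  HAS_SINGLE_QUOTES = true
end

module German
  HAS_SINGLE_QUOTES = false
end

# Reads the flag from whichever language module is passed in.
def handles_single_quotes?(language)
  language.const_get(:HAS_SINGLE_QUOTES)
end
```

The constant makes the check explicit at the cost of requiring every language module to define it.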
The regex so far has been a non-issue in terms of difficulty, just an annoyance. You just replace the `\u`s with `\x`. I also went through and replaced the inline non-ASCII characters with their proper Unicode escape forms.
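For anyone repeating this by hand, the rewrite is mechanical enough to script. Below is a rough Ruby sketch of such a helper. It is hypothetical (the comment above does not say a script was used), and it assumes the target form is the PCRE-style `\x{XXXX}` escape. It works on the pattern's source text, not on a compiled regex, and it does not handle edge cases such as an escaped backslash directly before a `u`.

```ruby
# Hypothetical porting helper; not part of PragmaticTokenizer or the Crystal port.
# Rewrites regex *source text* so that it uses PCRE-style \x{XXXX} escapes.
def to_pcre_escapes(pattern_source)
  # \uXXXX (exactly four hex digits) becomes \x{XXXX}
  converted = pattern_source.gsub(/\\u(\h{4})/) { "\\x{#{Regexp.last_match(1)}}" }
  # Any remaining literal non-ASCII character becomes its \x{XXXX} escape
  converted.gsub(/[^[:ascii:]]/) { |char| format("\\x{%04X}", char.ord) }
end

# Single quotes keep the backslashes literal, as they would appear in a source file.
puts to_pcre_escapes('[\u2018\u2019]') # prints [\x{2018}\x{2019}]
```

The converted patterns still need to be checked in the target language, since escape handling differs between regex engines.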
My project Cadmium has a number of tokenizers built in. None of them are quite as advanced as PragmaticTokenizer, but they should be sufficient for most needs.
I'm happy to announce that I just finished adding a (pretty much API-complete) replica of the pragmatic_tokenizer gem to Cadmium. You can check the docs here. This was a lot of work, but hopefully it does everything that's needed and more.
beautiful!
My understanding is that Cadmium is an NLP engine. Since people may want this pragmatic tokenizer functionality without necessarily wanting the NLP as well, wouldn't it be a good idea to extract that into a separate shard?
@HCLarsen I believe that Crystal ignores unused code when compiling, so importing the whole library shouldn't hurt you.
You could argue that, but tokenization is most commonly used while doing NLP, so it's kind of 50/50 in my opinion. (Sent by email on 3 Jun 2019, in reply to Chris Larsen.)
@watzon I do believe that's true. However, my reasoning isn't about the code size of the executable. It's more related to things like users being able to find a library that does this, or whether an update concerns a project that uses it as a dependency. Look at it as a matter of the Single Responsibility Principle, but applied to libraries. Software (and its shard.yml file) is much more concise and easier to understand if the dependencies are also concise.
@HCLarsen I can see that. I may split the library apart into several different shards, as is common practice with a lot of bigger JS libraries, but for now I'm going to work on completing Cadmium as a whole.
I guess this can be closed now since @watzon extracted the code back then.
PS: Just for completeness, @chenkovsky also created a tokenizer back then, but I'm not quite sure how mature it is. You can find it here.
Yeah, I totally forgot this issue existed, so I never posted an update. Everything from Cadmium exists in separate libraries now, though.