crystal-libraries-needed String tokenizer

a string tokenizer like PragmaticTokenizer (Ruby)

May 15 '17 17:05 johnjansen

I am going to give this a stab real quick if no one has tackled it. I gave the PragmaticTokenizer source a quick look, and the only unmet dependency I saw for porting it to Crystal almost as-is is the CGI call for unescaping HTML text, for which Crystal has an HTML module instead.

Oct 16 '17 16:10 GrgDev

@GrgDev it's not really the CGI dep that is the issue; there are a couple of hurdles (from my sketchy memory):

the regexes (there are a lot of them, and they are complex) won't work as is.
the organization of the Ruby code cannot be directly duplicated, so it's a re-engineer in that respect

otherwise I'll be interested in what you find, and may (time permitting) be able to offer some assistance

Oct 16 '17 17:10 johnjansen

Yeah, I see that the project setup and file organization would need to change a bit. I'll dig into the regex differences and see what I find.

Oct 16 '17 17:10 GrgDev

it's the metaprogramming I'm more worried about ... it's a bit tricky to untangle (unless you have a clear head and are locked in a room in silence)

Oct 16 '17 17:10 johnjansen

Can you add a link to the tokenizer you're proposing to 'duplicate'? (So people coming here won't have to search for it to see what you mean.)

Oct 16 '17 19:10 bew

https://github.com/diasks2/pragmatic_tokenizer

Oct 16 '17 20:10 johnjansen

I'm busy at work right now, but I went ahead and stubbed out a quick empty repo here. Please excuse the corny name.

https://github.com/GrgDev/crystalized_tokenizer

If I get around to this, the work will be there.

Oct 16 '17 20:10 GrgDev

Not done yet. Putting a comment here in case I drop this for some reason so someone else can learn what I found already.

I ran into the metaprogramming issues, but they don't seem to be too bad. The only two found so far are:

It reopens String inline to add new custom methods. I just converted them to non-destructive methods that you pass the string to instead.
It checks at runtime whether certain methods are #defined? in a language module, which we replace with the #responds_to? macro. So far I have only found this in the check for SingleQuotes, but that's a class/struct, not a method, so it might be worth the hack of just adding a constant Bool to each language module to say whether it has one.
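
A rough sketch of those two conversions, written in Ruby (the language of the original gem) so the before and after can be compared. Every name here (TokenHelpers, strip_edge_quotes, HAS_SINGLE_QUOTES, the Languages modules) is made up for illustration and is not taken from PragmaticTokenizer or from the port:

```ruby
# All names below are hypothetical; they only illustrate the two patterns.

# Pattern 1. The original style reopens String to add a custom method:
#
#   class String
#     def strip_edge_quotes
#       ...
#     end
#   end
#
# Converted style: a module function that takes the string as an argument
# and returns a new string, leaving the argument untouched.
module TokenHelpers
  # Returns a copy of text without a leading or trailing single quote.
  def self.strip_edge_quotes(text)
    text.sub(/\A'/, "").sub(/'\z/, "")
  end
end

# Pattern 2. Instead of probing a language module at runtime to see whether
# it defines something, each language module declares a constant flag.
module Languages
  module English
    HAS_SINGLE_QUOTES = true
  end

  module Common
    HAS_SINGLE_QUOTES = false
  end
end

# Applies the single-quote handling only when the language module says so.
def handle_single_quotes(text, language)
  return text unless language.const_get(:HAS_SINGLE_QUOTES)

  TokenHelpers.strip_edge_quotes(text)
end
```

With this shape, handle_single_quotes("'hello'", Languages::English) strips the quotes, while the same call with Languages::Common returns the text unchanged. The flag has to be present in every language module, which is the cost of dropping the runtime check.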

The regex so far has been a non-issue in terms of difficulty, just an annoyance. You just replace the \u escapes with \x. I also went through and converted the inline non-ASCII characters to their proper Unicode escape form.
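
For anyone who would rather script that rewrite than do it by hand, here is a minimal sketch in Ruby that works on the pattern source as plain text. It assumes the target form is the PCRE-style \x{XXXX} escape, and both helper names are hypothetical. It only handles the four-digit \uXXXX form, not \u{...}, and it does not special-case an escaped backslash followed by a u:

```ruby
# Hypothetical helpers; not part of PragmaticTokenizer or of any port.

# Rewrites four-digit \uXXXX escapes in a pattern's source text to \x{XXXX}.
def convert_unicode_escapes(pattern_source)
  pattern_source.gsub(/\\u([0-9A-Fa-f]{4})/) do
    "\\x{#{Regexp.last_match(1)}}"
  end
end

# Replaces each literal non-ASCII character with a \x{XXXX} escape
# built from its code point.
def escape_non_ascii(pattern_source)
  pattern_source.gsub(/[^\x00-\x7F]/) { |ch| format("\\x{%04X}", ch.ord) }
end
```

The single-quoted Ruby string '[\u2018\u2019]' holds the escapes as literal text, so passing it to convert_unicode_escapes gives the text [\x{2018}\x{2019}]. The output would still need to be checked against the regex engine it is pasted into.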

Oct 23 '17 16:10 GrgDev

My project Cadmium has a number of tokenizers built in. None of them are quite as advanced as PragmaticTokenizer, but they should be sufficient for most needs.

May 03 '19 08:05 watzon

I'm happy to announce that I just finished adding a (pretty much API-complete) replica of the pragmatic_tokenizer gem to Cadmium. You can check the docs here. This was a lot of work, but hopefully it does everything that's needed and more.

May 27 '19 10:05 watzon

beautiful!

May 27 '19 21:05 johnjansen

My understanding is that Cadmium is an NLP engine. Since people may want this pragmatic tokenizer functionality without necessarily wanting the NLP as well, wouldn't it be a good idea to extract that into a separate shard?

Jun 03 '19 04:06 HCLarsen

@HCLarsen I believe that Crystal ignores unused code when compiling, so importing the whole library shouldn't hurt you.

Jun 03 '19 04:06 watzon

You could argue that but tokenization is most commonly used while doing NLP so it's kind of 50/50 in my opinion

Jun 03 '19 04:06 johnjansen

@watzon I do believe that's true. However, my reasoning isn't about the code size of the executable. It's more related to things like users being able to find a library that does this, or whether an update concerns a project that uses it as a dependency. Look at it as a matter of the Single Responsibility Principle, but applied to libraries. Software (and the shard.yml file) is much more concise and easy to understand if the dependencies are also concise.

Jun 03 '19 05:06 HCLarsen

@HCLarsen I can see that. I may split the library apart into several different shards, as is common practice with a lot of bigger JS libraries, but for now I'm going to work on completing Cadmium as a whole.

Jun 03 '19 05:06 watzon

I guess this can be closed now since @watzon extracted the code back then.

PS: Just for completeness, @chenkovsky also created a tokenizer back then, but I'm not quite sure how mature it is. You can find it here.

Jan 03 '24 16:01 alexanderadam

Yeah I totally forgot this issue existed so I never posted an update. Everything from cadmium exists in separate libraries now though.

Jan 03 '24 18:01 watzon
