
Adding Glide Typing To Heliboard Without libjni_latinimegoogle.so

Open the-eclectic-dyslexic opened this issue 10 months ago • 17 comments

I finally have a little bit of time to write down what I have been doing on this front, so I wanted to open an issue here to keep track of the progress I have made, just in case anyone wants to give me a hand, as my time is super limited at the moment and I blow chunks at juggling tasks. Also, it should mean my work here doesn't go to waste if I fall off the edge of the world or something, haha!

So, first off, I will put a list of things I have yet to figure out, so I can get moving on this. Unfortunately, most of these have nothing to do with the algorithm that needs to be implemented, but rather my unfamiliarity with the Heliboard code base and the JNI overall. If I already knew about these things, I might have already made a working implementation to build on.

Stuff I know I still need to do before I could bring gesture typing to Heliboard without the google library as a hack:

  • learn how to work with whatever form the pointer data for the gesture takes.
  • learn how to iterate over words in the dictionary data structure, in order to compare the input gesture against expected gestures for each word. This is just going to involve me reading a lot of code, because I can't really find any documentation on it.
  • learn how suggestions work in Heliboard. I don't know exactly what to return yet even if I wrote the gesture recognition code. This will also involve me reading a lot of code. Once again, I haven't found any documentation on this.
  • learn how to write the JNI code so that the JVM garbage collection can handle clean up where appropriate... such as when suggestions are returned? This is going to involve me following some JNI tutorials, which will probably take a while.
  • learn how to properly name a function inside the C++ code so that it conforms with the way it is done in Heliboard. The way it is done there seems to be different from the JNI tutorials I have followed. I think this was done by someone who worked on the original AOSP code and didn't like the JNI function naming scheme that allows the JVM to find function entry points in native libraries... so they have a clever way around it that involves some naughty casting. This might be standard in advanced JNI work, but I simply don't know. Regardless, the code that does this is definitely inherited from the AOSP; I checked as part of some code archeology. Below is an example of the clever code I don't understand yet, pulled from app/src/main/jni/com_android_inputmethod_latin_DicTraverseSession.cpp.

// Each entry maps a Java-side method name and JNI type signature to a native
// function pointer. A table like this is registered explicitly with
// env->RegisterNatives (typically from JNI_OnLoad), which is what frees the
// functions from the usual Java_com_example_Class_method naming scheme. The
// const_casts are only needed because JNINativeMethod's fields are plain char*.
static const JNINativeMethod sMethods[] = {
    {
        const_cast<char *>("setDicTraverseSessionNative"),
        const_cast<char *>("(Ljava/lang/String;J)J"),
        reinterpret_cast<void *>(latinime_setDicTraverseSession)
    },
    {
        const_cast<char *>("initDicTraverseSessionNative"),
        const_cast<char *>("(JJ[II)V"),
        reinterpret_cast<void *>(latinime_initDicTraverseSession)
    },
    {
        const_cast<char *>("releaseDicTraverseSessionNative"),
        const_cast<char *>("(J)V"),
        reinterpret_cast<void *>(latinime_releaseDicTraverseSession)
    }
};

That last point is more important than it may appear at first, because doing this correctly is necessary to avoid breaking compatibility with the google library's interface. Maintaining that compatibility is important, at least until the open implementation is as good as or better than the google library's. I will touch on this more in a bit.

Now, onto the stuff I do know something about... I am sure some people reading this would already know some of the following, but I didn't know any of it... so it would be nice if it was documented here! I will also write proper docs for this stuff and do a PR at some point, unless someone beats me to it. It will likely take me a while.

I would put money on what is inside that google library. It is not just a gesture typing library, as anyone who understands the dynamic loading of the google library already knows. I am almost completely certain it is actually a closed-source fork of the AOSP keyboard JNI code, similar to how Chrome is a closed-source fork of Chromium. I believe it was abandoned when google keyboard was superseded by gboard, and it hasn't actually changed in any appreciable way since. I looked inside gboard's apk, and there was no googlelatinime library in there from what I can remember. It is possible that gboard just relies on this library being in the system, though; I don't know. Regardless, most of the code in the google library is likely exactly what is inside Heliboard. There are some other changes beyond just the glide typing implementation, which someone else can probably speak to, but those aren't super important yet and I don't know what they are in practice. The main thing this means is: any change to the JNI code is totally useless, or worse, if we want to allow users to load the google latinime and assume users will do that until our gesture typing is better. Anything we do to change this interface is going to mean extra work to handle exceptions arising from differences between the two libraries, unless it is done with strict adherence to the interface offered by the google library. In my ideal world, I would write some unit tests that feed both libraries the exact same function calls, and compare the outputs to know exactly what is different. That is a pretty big ordeal though, as I would need a bunch of simulated user data, and I would need to build mock environments around those tests. I don't think it is absolutely necessary, but it would be the "right" way to go about this.

In lieu of doing it the "right" way, we should at least know how the interfaces are different. I have run a comparison of the two libraries and output the results into compare.csv. The point of this is that it will allow me to cross-reference all the functions offered inside the google library with the Java/Kotlin code, and ensure that heliboard provides proper implementations of all of those JNI calls. I have not done this yet, and I don't know when I will be able to get to it, but I do plan to do this leg work. I suspect all we need to do is implement a single additional function that is used to request suggestions based on an input gesture. I don't know this for sure though, which is why I plan to check and report back to this thread.

compare.csv

The short version of what that csv says is: the google JNI library's interface is almost a strict superset of Heliboard's JNI library interface. This lines up with what I claimed about the google latinime being a fork of the AOSP code. It means that if we don't stray from that common interface where it matters, people outside of this project could extract the shared object file out of heliboard and use it in place of the google library for the various community-maintained packagings of GApps, such as MindTheGApps. This would give LineageOS and other Android forks that use the AOSP keyboard a fully open keyboard by default, if they packaged the heliboard JNI instead of the google one.

The long version is, there are only 7 functions found inside the Heliboard latinime library that are not in the google latinime library. I don't think any of them are directly referenced in any Java/Kotlin code, but I haven't searched yet. They probably don't matter for the interface definition, as they are likely helper functions that have been added by contributors over the years since the code was inherited from the AOSP. Of the remaining 814 functions in the comparison, 111 are unique to the google library. Most have names that suggest they are unique to gesture typing, and they give a tiny window into how gesture typing was implemented. Some of the functions that exist in only one library or the other are also likely generated from internal use of stdlib templates. Unfortunately, my analysis of the shared objects ends there. It should be possible to see which functions call which other functions by further static analysis, but I don't think that is necessary. It is also not something I have done before, and I don't know much about tools that could assist with it.

For anyone who cares how I know what functions are in the google latinime (because I had no idea how to do this before), I included some scripts in jni_compare.zip. Simply cd into the extracted directory, run jni_lib_extractor.sh, and pick the architecture so it downloads the appropriate shared object files. Then run compare.sh, and you should be able to replicate my results, assuming you have all the dependencies (python, awk, c++filt, readelf, and grep). You can also use jni_lib_extractor.sh if you are looking to extract the google library for gesture typing in heliboard but are a bit confused by the process of acquiring it... and don't trust someone sending you a random shared object file, but do trust the maintainers of MindTheGApps.

jni_compare.zip

Now, comes the stuff I am most familiar with. I haven't implemented glide typing before, but I have worked on something similar. I worked on optimizing keyboard layouts for glide typing, to reduce the number of inherent errors in glide typing, based on user vocabulary. This basically involved implementing half-baked glide typing, to compare the gestures of words against one another.

This is all to say, I am somewhat familiar with literature around gesture typing, at least as far as it pertained to the work I was doing in the past on this. Odds are pretty decent that the method the google library uses for gesture recognition, as part of glide typing, is a modified version of SHARK2. One of the inventors of it, Shumin Zhai, has been associated with Google for a long time. It is likely, based on the timing and the naming of functions in the google library, that this paper is all we need to get very similar results. There is probably more recent literature in this domain, but I don't know any of it. Sorry! The bright side is though, once we have SHARK2 implemented, we can swap in other gesture recognition implementations as long as they can fit inside the same interface... or we can just iterate on what we have until it feels right! The thing about it is, while the algorithm is 20 years old, it does work pretty well, so I am not all that concerned. The more important part is how you score the results the gesture recognition returns to you, based on context. The original paper, and it appears the google library, simply use bigrams for context, but I think it is perfectly possible to do better than that once we have something to work with. I would put some money down, betting that whoever wrote the stuff in the google library at the very least talked to Shumin Zhai about how to do it. :-)

https://www.researchgate.net/publication/228875756_SHARK2A_large_Vocabulary_shorthand_writing_system_for_pen-based_computers

So, it has been a little while since I have had to implement anything related to SHARK2, but what I can tell you is there are a few important points.

All characters on the board can and should be represented as points, not as collision boxes, like they might often be in other forms of touch screen keyboard input. This allows us to create ideal versions of gestures to compare against, by simply playing connect the dots. These gestures to compare against are referred to as templates. How we actually compare the input gesture against the templates will be explained in a little bit.
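To make that concrete, here is a minimal sketch of the idea. The key names, coordinates, and function names are all mine, invented for illustration; they are not taken from Heliboard's actual data structures:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Each key is represented only by the point at its centre, not a collision box.
struct Point { float x; float y; };

// Hypothetical key centres for a few QWERTY keys, in arbitrary units.
const std::unordered_map<char, Point> kKeyCentres = {
    {'f', {3.75f, 1.5f}}, {'e', {2.5f, 0.5f}}, {'l', {8.75f, 1.5f}},
};

// The ideal gesture template for a word is just connect-the-dots through the
// centres of its keys, in order.
std::vector<Point> buildTemplate(const std::string& word) {
    std::vector<Point> path;
    for (char c : word) path.push_back(kKeyCentres.at(c));
    return path;
}
```

Once templates are this cheap to build, everything after this point is plain geometry on polylines.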

Some words will have the same gesture template no matter what you do, and there isn't really a clear way of disambiguating them, aside from letting the user draw exaggerated circular paths at the double letters. A lot of users won't want to do this, because the whole point of glide typing is that it is supposed to be faster than the alternative. Instead, we should infer based on context, by leveraging other suggestion methods. The idea is you take the suggestions given based on the gesture, and compare their likelihood of occurring based on the context of the previous few words. It is not too dissimilar to the way Heliboard already suggests the next word to you outside of gesture typing. As an example of what I am talking about: "feel", "fell", and "fel" are all linked to the same gesture, the one described by creating a path through the keys "fel" in order. The only way to disambiguate these is with context. Additionally, characters that aren't directly represented on the base layer of the keyboard need to be simplified, which will create yet more ambiguous templates. For instance, "it's" and "its" have the same template. So do "résumé" and "resume".
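Here is a sketch of the kind of simplification I mean. The function name is mine, and a real version would also need proper Unicode accent folding to handle cases like "résumé" -> "resume"; this ASCII-only toy skips that:

```cpp
#include <string>

// Collapse a word down to the key sequence its gesture actually traces.
// Consecutive repeats of a letter collapse into one pass over that key, and
// characters with no key on the base layer (like the apostrophe) are dropped.
// NOTE: real dictionaries need Unicode accent folding too, which is omitted here.
std::string simplifyForGesture(const std::string& word) {
    std::string out;
    for (char c : word) {
        if (c == '\'') continue;                         // no apostrophe key on the base layer
        if (!out.empty() && out.back() == c) continue;   // double letters share one key
        out.push_back(c);
    }
    return out;
}
```

This is exactly why "feel", "fell", and "fel" all land on the same template.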

Reducing the search space is a must. You don't want to compare your input gesture against every single word in the dictionary. Ideally, all the words in the dictionary would be organized into buckets labelled by the shared first and last character of the words in that bucket. Then we would perform collision checks between the points that represent the keys and some shapes, likely circles, positioned at the first and last points in the input gesture. This allows us to narrow our search space down to a much more manageable list. It is also possible to simply filter the dictionary for words that match the pattern we are looking for, but it is slower than having them pre-sorted into buckets. An example of how this would work is illustrated below.

gesture_reduce

Here we are looking at the gesture for the word "watch" in QWERTY. We have performed a collision check at the beginning and the end. The result is we know the word we are looking for should start with one of (q,w,e,a,s), and we also know it should end with one of (y,u,g,h,j,v,b,n). Then all we would have to do is search through the buckets labelled accordingly, reducing the search space down to 5x8=40 different buckets, as opposed to looking in all 26x26=676 buckets. This is not going to be a 94% reduction in search space, as the bucket count may suggest, because words are not evenly distributed among the buckets. It does help a huge amount though! Sometimes you will get an outlandish reduction in the search space this way, and sometimes you only get a big reduction. The keyboard layout actually plays a role in how much this will speed up the evaluation of common gestures. The size of these collision shapes is variable. There is probably a suggestion in the paper, but nothing stops us from tweaking them. It is going to be a matter of balancing speed against the danger of pruning away the correct answer. It is very possible to make the size of these collision shapes user configurable... assuming the interface allows for this. It is also possible to experiment with different shapes, based on assumptions about how users usually make motor errors while glide typing. That probably isn't necessary, but it is worth thinking about. Every time you reduce the search space without removing the correct gesture from it, you have a chance to improve the suggestions returned... and it goes faster!
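A sketch of that endpoint collision check, assuming circles as the collision shape. The key coordinates are invented for illustration, and the radius is the tunable pruning parameter discussed above:

```cpp
#include <cmath>
#include <unordered_map>
#include <vector>

struct Point { float x; float y; };

// Hypothetical key centres for a slice of QWERTY, in key-width units.
const std::unordered_map<char, Point> kKeyCentres = {
    {'q', {0.5f, 0.5f}},  {'w', {1.5f, 0.5f}},  {'e', {2.5f, 0.5f}},
    {'a', {0.75f, 1.5f}}, {'s', {1.75f, 1.5f}},
    {'y', {5.5f, 0.5f}},  {'u', {6.5f, 0.5f}},
    {'g', {4.75f, 1.5f}}, {'h', {5.75f, 1.5f}}, {'j', {6.75f, 1.5f}},
    {'v', {4.25f, 2.5f}}, {'b', {5.25f, 2.5f}}, {'n', {6.25f, 2.5f}},
};

// Collect every key whose centre falls inside a circle of the given radius
// around a gesture endpoint. Run this once at the start point and once at the
// end point, then only search the buckets labelled by those (start, end) pairs.
std::vector<char> keysNear(Point p, float radius) {
    std::vector<char> hits;
    for (const auto& [key, centre] : kKeyCentres) {
        float dx = centre.x - p.x;
        float dy = centre.y - p.y;
        if (std::sqrt(dx * dx + dy * dy) <= radius) hits.push_back(key);
    }
    return hits;
}
```

A bigger radius prunes less but is safer against sloppy starts and ends; a smaller one is faster but risks throwing away the correct word.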

We might not have much choice between buckets and filtering the dictionary, depending on what we have to work with for the interface. I would, however, assume there is a way to filter the dictionary pretty quickly, as the original paper did not use the bucket method, and simply filtered the dictionary, which was stored in a linked list.

In the absolute ideal scenario, if we did presort the word list into buckets, we would probably want to presort the simplified words, not the raw words. We would also want a map for reversing the process on all the suggestions at the end of the gesture comparisons; we don't want to return the simplified words, we want to return the actual words they correspond to. This may result in returning more suggestions than the function caller expected, but never fewer. The only way to end up with fewer than the caller expects is if we narrow our search space down to a word list smaller than the number of suggestions expected. With regards to my ideal scenario, I highly doubt iterating over the simplified words and then decoding after the comparisons are done is the way it is done in the google library; it might be a good stretch goal for heliboard, as an optimized way of generating glide typing suggestions. The sizable downside is it would likely require breaking compatibility with the google library. I haven't seen what the real-world performance difference is in practice, but not having to iterate over the entire dictionary is always good. I know doing this made my code for optimizing keyboard layouts go way, way faster, but that code didn't need to run in real time, so performance measurements are a bit different. It was the sort of thing you set to run on your plugged-in laptop and come back to the next day. Not iterating over the whole dictionary shaved a couple of hours off the run time, but that may not be noticeable from the perspective of response time provided to the user.
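A sketch of what I mean by indexing simplified words with a map for reversing the process. All the names here are mine, not from any real code base:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Index the dictionary by simplified form once, up front, so gesture matching
// can run over the (smaller) set of simplified templates, and the winners can
// be expanded back into real words afterwards.
class SimplifiedIndex {
public:
    void addWord(const std::string& word, const std::string& simplified) {
        wordsBySimplified_[simplified].push_back(word);
    }

    // Expand matched simplified forms back into actual suggestions. Note this
    // can return *more* words than simplified forms went in, never fewer.
    std::vector<std::string> expand(const std::vector<std::string>& matches) const {
        std::vector<std::string> out;
        for (const std::string& m : matches) {
            auto it = wordsBySimplified_.find(m);
            if (it == wordsBySimplified_.end()) continue;
            out.insert(out.end(), it->second.begin(), it->second.end());
        }
        return out;
    }

private:
    std::unordered_map<std::string, std::vector<std::string>> wordsBySimplified_;
};
```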

Now that we have pruned our search space, we need to compare the user input against the templates for our possible suggestions. There are two scores we need to care about: shape scores and location scores. The smaller these scores are, the better our match is considered to be. As we look at each potential suggestion, we score it and keep it in a priority queue with a max capacity of the number of suggestions we want. It is then possible to return this queue as our set of ordered suggestions. Interface permitting, we can also return the shape and location scores (or a single combined score) along with the results to the caller. This allows them to use the scores as part of how they weigh the suggestions.
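The bounded priority queue part can be sketched like this (lower score = better match; all names are mine):

```cpp
#include <queue>
#include <string>
#include <vector>

struct Scored {
    std::string word;
    float score;  // lower is a better match
};

// Keep only the best maxSize candidates seen so far. The internal queue is
// ordered worst-first, so the candidate to evict is always on top.
class SuggestionQueue {
public:
    explicit SuggestionQueue(size_t maxSize) : maxSize_(maxSize) {}

    void offer(Scored s) {
        queue_.push(std::move(s));
        if (queue_.size() > maxSize_) queue_.pop();  // drop the current worst
    }

    // Drain into a best-first list of suggestions.
    std::vector<Scored> drainSorted() {
        std::vector<Scored> out(queue_.size());
        for (size_t i = queue_.size(); i-- > 0; ) {
            out[i] = queue_.top();  // pops come out worst-first, so fill backwards
            queue_.pop();
        }
        return out;
    }

private:
    struct WorstFirst {
        bool operator()(const Scored& a, const Scored& b) const {
            return a.score < b.score;  // largest (worst) score ends up on top
        }
    };
    size_t maxSize_;
    std::priority_queue<Scored, std::vector<Scored>, WorstFirst> queue_;
};
```

The nice property is that scoring N candidates into a queue of size k costs O(N log k) rather than sorting everything.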

Shape scores are really straightforward. The first step is to sample the user input uniformly along the path it describes. Then we take the same number of samples, uniformly, of the gesture template for the word we want to compare against. (Remember, the gesture template is where we play connect-the-dots with the keys that make up the word we care about.) Next we want to normalize the gesture and the template we are comparing against. We do this by scaling the template so that the longest dimension of its bounding box matches the longest dimension of the user input's bounding box. Then we translate the template by some delta so that the centre of its bounding box lands in the same location as the centre of the user input's bounding box. (In theory, we should actually scale both the template and the user input to some predefined size, but it is also possible to just scale the final score appropriately so that it is properly normalized, instead of moving everything everywhere.) Then we use the Pythagorean theorem to get the Euclidean distance between each pair of comparable points. Once we have all those distances, we take their average, and tada, you have a shape score! An example of comparing some fake user input against the gesture for the word "watch" is below.

gesture_compare

The red path is the template, and the blue path would be the user input. The black line segments represent the Euclidean distance measurements between each pair of sampled points. I have ignored the normalization step in the diagram, because I am not sure how to draw it. If it becomes something people are confused by, I am happy to take a shot at drawing it though. All you really need to know about normalizing is that we need the gestures in the same place when comparing them, and they should be the same size. Our previous pruning step actually comes in to save us here: if we weren't pruning the search space, we could have an identical template somewhere else completely on the keyboard, and once we normalized to our template, the two would look the same even though we didn't gesture anywhere near that location!
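To pin the shape-score recipe down, here is a runnable sketch. The resampling uses the standard equidistant-along-the-path trick (the same one the $1 gesture recognizer popularized); all names and the exact normalization choices are mine, not pulled from any keyboard code base:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Point { float x; float y; };

static float dist(Point a, Point b) { return std::hypot(a.x - b.x, a.y - b.y); }

// Resample a polyline into n points spaced uniformly along its arc length.
std::vector<Point> resample(std::vector<Point> pts, int n) {
    float total = 0;
    for (size_t i = 1; i < pts.size(); ++i) total += dist(pts[i - 1], pts[i]);
    const float interval = total / (n - 1);
    float covered = 0;
    std::vector<Point> out = {pts[0]};
    for (size_t i = 1; i < pts.size() && (int)out.size() < n; ++i) {
        float d = dist(pts[i - 1], pts[i]);
        if (d > 0 && covered + d >= interval) {
            float t = (interval - covered) / d;
            Point q{pts[i - 1].x + t * (pts[i].x - pts[i - 1].x),
                    pts[i - 1].y + t * (pts[i].y - pts[i - 1].y)};
            out.push_back(q);
            pts.insert(pts.begin() + i, q);  // continue measuring from q
            covered = 0;
        } else {
            covered += d;
        }
    }
    while ((int)out.size() < n) out.push_back(pts.back());  // float round-off
    return out;
}

static void boundingBox(const std::vector<Point>& p, Point& lo, Point& hi) {
    lo = hi = p[0];
    for (const Point& q : p) {
        lo.x = std::min(lo.x, q.x); lo.y = std::min(lo.y, q.y);
        hi.x = std::max(hi.x, q.x); hi.y = std::max(hi.y, q.y);
    }
}

// Mean distance between corresponding samples, after scaling the template so
// its longest bounding-box side matches the input's and lining up the two
// bounding-box centres. Lower = better match.
float shapeScore(const std::vector<Point>& input,
                 const std::vector<Point>& tmpl, int n) {
    std::vector<Point> a = resample(input, n);
    std::vector<Point> b = resample(tmpl, n);
    Point ilo, ihi, tlo, thi;
    boundingBox(a, ilo, ihi);
    boundingBox(b, tlo, thi);
    float iDim = std::max(ihi.x - ilo.x, ihi.y - ilo.y);
    float tDim = std::max(thi.x - tlo.x, thi.y - tlo.y);
    float s = tDim > 0 ? iDim / tDim : 1.0f;  // guard single-key templates
    Point ic{(ilo.x + ihi.x) / 2, (ilo.y + ihi.y) / 2};
    Point tc{(tlo.x + thi.x) / 2, (tlo.y + thi.y) / 2};
    float sum = 0;
    for (int i = 0; i < n; ++i) {
        // scale the template sample about its box centre, then translate it
        Point p{(b[i].x - tc.x) * s + ic.x, (b[i].y - tc.y) * s + ic.y};
        sum += dist(a[i], p);
    }
    return sum / n;
}
```

An identical shape drawn anywhere on the board scores (near) zero against its template, which is exactly why the pruning step before this one matters.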

The number of samples we take is determined by how fast versus how accurate we want the gesture comparisons to be. 100 samples is a good place to start, but this too can be a user setting, if the interface allows. On some proprietary keyboards this was/is done with a slider, with one end being "accurate" and the other being "fast". I remember someone suggesting there is also the possibility of using some fancy continuous math to avoid sampling the path and get a more accurate measure. However, applying the Pythagorean theorem 100 times is way more convenient, and I don't remember the specifics of the continuous approach, only the basic idea. That method might also be prone to the same problems the authors of the paper found with elastic matching. It is better to just do the dumber thing for now; it works.

With location scores, we do something a little different. The idea here is, if the user does an excellent job of giving precise input, we should boost the closest match. I never needed to use these in my past work, because if I remember correctly, they were too expensive to consider when trying to score a keyboard layout. So... I am a bit hazy on these. I also suck at reading mathematical notation, so I am winging it a bit here as I read the paper while writing this.

It looks like we sample the user input uniformly again, but this time we are trying to ensure that no points of the user input fall outside of an area defined by some distance to the template path. In the paper they used one key width as this distance cutoff. We don't actually want to normalize the input to compare favourably to the template this time, because the whole point of this is to reward precise inputs. The analogy here is sort of like a driving test: if you stay on the road the whole time, you don't receive any penalty to your test score, but the moment you go off the road, you get penalized. I think the trick here is to take every pair of key centers that make up the line segments in the unsampled template, and then check whether each user input sample falls within a certain distance of any of those line segments. If the current point doesn't fall inside the area we want, add the distance between that sample point and the corresponding point in the template samples to a running tally. It looks like they do it a little differently, but I think my way would be faster? Theirs looks like it would be O(N^2): for every point in the sampled user input, they perform Euclidean distance measurements against every sample point in the template. It is really expensive compared to the shape score, and we also need to keep in mind that we are comparing this input against multiple templates. I will probably have to think about this some more, to make sure I understand well enough to reduce the complexity the way I am thinking. It should be possible to perform no more than 3-9 measurements per user input sample, instead of 100. That is if we even need location scores; they may not be necessary at all to start, based on what I am reading in the following sections. I don't know how to draw this yet, and I am tired, so I won't for now. Ask if you want this and I will draw an example.
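Here is a sketch of my reading of the "stay on the road" check, using distance to the unsampled template's segments rather than the paper's pairwise sample comparison, so treat this as my variation rather than a faithful reproduction of SHARK2:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Point { float x; float y; };

// Distance from point p to the line segment ab.
float pointToSegment(Point p, Point a, Point b) {
    float vx = b.x - a.x, vy = b.y - a.y;
    float len2 = vx * vx + vy * vy;
    float t = len2 > 0 ? ((p.x - a.x) * vx + (p.y - a.y) * vy) / len2 : 0.0f;
    t = std::clamp(t, 0.0f, 1.0f);
    return std::hypot(p.x - (a.x + t * vx), p.y - (a.y + t * vy));
}

// Penalize only the input samples that stray more than tunnelWidth (one key
// width in the paper) from the unsampled template polyline. Staying "on the
// road" the whole way costs nothing.
float locationScore(const std::vector<Point>& inputSamples,
                    const std::vector<Point>& templateKeys,
                    float tunnelWidth) {
    float total = 0;
    for (const Point& p : inputSamples) {
        float best = pointToSegment(p, templateKeys[0], templateKeys[0]);
        for (size_t i = 1; i < templateKeys.size(); ++i) {
            best = std::min(best, pointToSegment(p, templateKeys[i - 1], templateKeys[i]));
        }
        if (best > tunnelWidth) total += best;  // only off-road samples cost anything
    }
    return total / inputSamples.size();
}
```

Note this is O(samples × template segments), and a word only has a handful of segments, which is where my "3-9 measurements per sample" guess comes from.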

It looks like there are two options presented here. We can either try to force the two different scores into a single final score, by adjusting the location score so that it is more directly comparable to the shape score, or we can weight dynamically between them. The dynamic weighting looks at whether the gesture is being performed above or below some threshold speed. Given that I didn't use location scores, I didn't look at dynamic weighting methods at all. The math involved here is the stuff I find the most difficult to understand; I may get this entirely wrong. It looks like they have a special function, based on a long history of lab data from previous studies, that estimates how long it should take the average user to glide type a specific word's gesture. How fast the gesture is drawn determines how much weight each input model gets. Forcing the two scores into a single static score is possible, but it might not give great results without a lot of tweaking, and may never give results as good as dynamic weighting. It looks like they actually trained a Bayesian model to find good weights for their location scores... for every word, per keyboard layout. That is certainly a way to do it, but I don't think we will be doing that. They had the benefit of working with one keyboard layout and only having to support English.

So, with all of that in mind. Let's get into related reading!

There are some github projects that implement gesture typing. A few of them are explicitly billed as implementing SHARK2. They look like they might be university assignments? Each spins up a web server and allows you to glide type in a browser using a mouse. Unfortunately, the webpages they provide aren't very mobile friendly, so you can't really test whether they work well with thumb input. Some of them are god awful, but I did find one that was good enough that I was sold, and I will probably use it to figure out how to weight the shape vs location scores, when we get to that point.

https://github.com/varun93/shark2
https://github.com/aferruzza/SHARK2-Gesture-Typing <- the good one, if I remember correctly
https://github.com/Trisha11r/Decode_Gesture_SHARK2_algo

None of these have licenses attached to them, though, so copying directly is a no-no.

Helium also directed me at an implementation of gesture typing inside an OpenBoard fork, by a user named wordmage. That is also not licensed, because it is based on some 7-year-old code by a user named shijieKika which is, of course... not licensed! I have suspicions about what the older repo might be, but I won't go on about them because I am not certain my thoughts are founded one way or the other. In a total shot in the dark, I opened an issue asking the author to add a license. I expect nothing to come of that, so don't count on being able to copy any code from that repo. It is probably okay to get ideas from these two projects about how to use the data structures in Heliboard while doing gesture matching, though. We can also reference at least one of the python projects to get a different perspective on the same thing that is going on inside the paper.

https://github.com/wordmage/openboard https://github.com/shijieKika/GoogleKeyboardV7

I will state, I think even if we did directly copy the code from the wordmage repo or the shijieKika repo, which we definitely should not unless it magically gets licensed, it may not even do what we want. The wordmage repo has worse gesture typing than the google library does, and it doesn't seem to follow the correct interface based on a silly test I tried where I pulled out its internal library and tried to import it into heliboard in place of the google library. (edit: Helium confirmed it does follow the same interface, as long as a commit that messed up some of the names is reverted https://github.com/wordmage/openboard/commit/70e59e88ec3d4ac5a50190ceb199765a5ad0078c). I also haven't been able to get the shijieKika repo to compile at all yet. However, to be fair, I haven't tried that hard.

The only option I know of that contains gesture typing licensed in a way we can use directly is florisboard. It was added in an older version with commit https://github.com/florisboard/florisboard/commit/c0f90a7ea4cc59f671e174bd08226e3c7872d156, I think? I haven't looked it over yet; add it to the pile of todos.

Okay, that is all I have for now. Info dump done. I will check back when I have time to continue my quest. Let me know if anyone wants clarification on anything I wrote, or wants to help me on my fool's errand. :-)

the-eclectic-dyslexic — Apr 10 '24 07:04