Symbol tables and compilation
Hi,
I've got three changes for consideration. I still need to comment the code better, but I thought it would be better to get early feedback on the approach first. Feel free to reject this if I am breaking style.
- I added a SymbolTable class, which is callable so it can act as a string_mapper, but which also allows for two-way lookup and dynamic allocation of symbols.
- I allow for different symbol maps on the input and output sides, as in standard OpenFst. This also required changing the check for input/output matching in the jupyter output.
- I created a compiler function which replicates fstcompile. This is the biggest question: should this actually be a function in the FST class?
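For readers following the first item, here is a minimal sketch of what a callable, two-way symbol table with dynamic allocation could look like. It is not the class from this pull request: the method names (`add`, `find`), the `<eps>` default, and the assumption that symbols are strings are all illustrative choices.

```python
class SymbolTable:
    """Hypothetical two-way symbol table; not the class from this pull request.

    Assumes symbols are strings, so that an int key can be read as an id.
    """

    def __init__(self, symbols=(), epsilon="<eps>"):
        # Id 0 is reserved for epsilon, following the OpenFst convention.
        self._sym_to_id = {epsilon: 0}
        self._id_to_sym = [epsilon]
        for sym in symbols:
            self.add(sym)

    def add(self, symbol):
        """Return the id for symbol, allocating the next free id if it is new."""
        if symbol not in self._sym_to_id:
            self._sym_to_id[symbol] = len(self._id_to_sym)
            self._id_to_sym.append(symbol)
        return self._sym_to_id[symbol]

    def __call__(self, symbol):
        # Being callable lets the table stand in wherever a symbol -> id
        # mapping function is expected.
        return self.add(symbol)

    def find(self, key):
        """Two-way lookup: an int id gives the symbol, a symbol gives its id."""
        if isinstance(key, int):
            return self._id_to_sym[key]
        return self._sym_to_id[key]

    def __len__(self):
        return len(self._id_to_sym)


table = SymbolTable()
a_id = table("a")  # allocated on first use
b_id = table("b")
```

Calling the table allocates ids on demand, while `find` never allocates and raises for unknown keys, which keeps typos in lookups from silently growing the table.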
Feedback on approach is appreciated.
(and apologies for all of the commits - the only way I could test was within Colab so I had to push debugging out to GitHub, which wasn't pretty.)
-Eric
As a tip, if you install jupyter locally, you can install the package using python setup.py develop and then edit the files without having to reinstall.
It is also possible to do this on a remote Linux machine and then use port forwarding to access jupyter.
Lol, yeah, I was just being a bit lazy. My usual MO is to run a virtual machine on my mac with Linux. Trying to compile OpenFST on my mac is usually a bear; haven't tried it on my most recent mac.
So I don't understand what the compiler function is supposed to do. It looks like it is very specific to a particular use case. For example, there is a string (or list of strings) that it takes as an argument. That string is split into different components, which are then matched positionally.
It looks like you are intending for the user to do something like mfst.compile(['something 1 2.3', 'another']) but I can't easily follow the control flow here.
This pull request also doesn't address the main issue I see with adding symbol tables to mfst, which is how methods like compose are going to deal with different symbol tables. Currently, by not handling symbol tables (or by leaving them to the user, with string_mapper as an assistant), the compose method does not have to deal with symbol table mismatch. But suppose that I am composing two FSTs with the following symbol tables:
FST 1:
1: abc
FST 2:
1: a
2: b
3: c
To make different symbol tables "just work", there needs to be some way to align two different symbol tables. That alignment would have to be constructed as another FST and added into the mix when doing compose.
Furthermore, there is no reason to assume that the symbol table corresponds with strings. They could instead correspond with arbitrary Python values (which is at least somewhat common in some ongoing research that we have) so then it would basically be impossible to do this sort of alignment automatically.
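To make the alignment idea concrete, here is a rough sketch of the arcs such an aligning transducer would need for the two tables above. It assumes symbols are strings, that every symbol missing from the second table can be spelled out character by character with symbols from that table, and that label 0 means epsilon; the function name and the plain-tuple arc representation are hypothetical and do not use the mfst API.

```python
EPSILON = 0  # label 0 is epsilon, following the OpenFst convention


def alignment_arcs(left, right):
    """Return arcs (src, dst, ilabel, olabel) mapping left ids to right ids.

    left and right are dicts of id -> string symbol, without the epsilon
    entry. State 0 is both the start state and the only final state.
    Raises KeyError if a left symbol cannot be spelled with right symbols.
    """
    right_ids = {sym: i for i, sym in right.items()}
    arcs = []
    next_state = 1
    for left_id, symbol in sorted(left.items()):
        if symbol in right_ids:
            # Same symbol in both tables: a simple relabeling self-loop.
            arcs.append((0, 0, left_id, right_ids[symbol]))
            continue
        # Otherwise spell the symbol out one character at a time. The first
        # arc consumes the left symbol; the remaining arcs consume epsilon.
        out_ids = [right_ids[ch] for ch in symbol]
        src = 0
        for pos, out_id in enumerate(out_ids):
            is_last = pos == len(out_ids) - 1
            dst = 0 if is_last else next_state
            if not is_last:
                next_state += 1
            arcs.append((src, dst, left_id if pos == 0 else EPSILON, out_id))
            src = dst
    return arcs


arcs = alignment_arcs({1: "abc"}, {1: "a", 2: "b", 3: "c"})
```

Composing the first FST with a transducer built from these arcs would put it into the second FST's alphabet. The sketch only works because the symbols are strings with an obvious decomposition; with arbitrary Python values as symbols there is no such rule to fall back on.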
> Trying to compile OpenFST on my mac is usually a bear; haven't tried it on my most recent mac.
mfst works on Mac OS X and will compile OpenFst for you. It is even in the CI tests: https://github.com/matthewfl/openfst-wrapper/runs/1755496329?check_suite_focus=true
You should be able to install it using the same pip install, assuming you have a python environment/anaconda set up.
compiler is meant to replicate the general command-line fstcompile of OpenFst, which converts an ASCII representation of an FST/A (left over from the AT&T FSM days) into the internal representation. The wrapper in openfst-python provides a similar interface, which I was trying to replicate.
The idea is that you might have different symbol tables on input/output, which is slightly more general than the original case.
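For context, the text format that fstcompile reads has one arc per line (`src dst ilabel olabel [weight]`) and one line per final state (`state [weight]`), with the source state of the first line taken as the start state. A minimal sketch of a parser for the transducer form, with separate input and output symbol maps, might look like the following. It is not the compiler function from this pull request: the function name, the tuple-based return value, and the default weight of 0.0 (the tropical semiring's one) are assumptions, and it produces plain Python data rather than an mfst FST.

```python
def parse_att(text, isymbols=None, osymbols=None):
    """Parse AT&T-style transducer text into (start, arcs, finals).

    isymbols and osymbols are optional dicts of symbol -> id for the input
    and output sides; when a table is None, labels must already be integers.
    """

    def lookup(table, token):
        return int(token) if table is None else table[token]

    start, arcs, finals = None, [], {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        src = int(fields[0])
        if start is None:
            start = src  # the first state mentioned is the start state
        if len(fields) <= 2:
            # "state [weight]" marks a final state.
            finals[src] = float(fields[1]) if len(fields) == 2 else 0.0
        elif len(fields) in (4, 5):
            # "src dst ilabel olabel [weight]" describes an arc.
            weight = float(fields[4]) if len(fields) == 5 else 0.0
            arcs.append((
                src,
                int(fields[1]),
                lookup(isymbols, fields[2]),
                lookup(osymbols, fields[3]),
                weight,
            ))
        else:
            raise ValueError("unrecognized line: %r" % line)
    return start, arcs, finals


start, arcs, finals = parse_att(
    "0 1 a x 0.5\n1 2 b y\n2 1.5\n",
    isymbols={"a": 1, "b": 2},
    osymbols={"x": 1, "y": 2},
)
```

Keeping the two symbol maps as separate arguments mirrors the separate input and output symbol options of fstcompile; passing the same dict for both recovers the shared-table case.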
An example of the use case for MFST is at https://colab.research.google.com/drive/1yE4hClLwHkffdET5i4OvmOu84yuQUV7x?usp=sharing
The original exercise I'm trying to replicate with MFST is at https://colab.research.google.com/drive/13ptBdIX-ZOtTyYGIZ1qA3sN-XEmWBSby?usp=sharing
That said, I just encountered one bug, which is that the symbol table after fstcompose isn't set correctly, so there is clearly some work left to do. Students can use a single shared symbol table for now, which should be ok, but having separate symbol tables allows the ideas of encode/decode to be used appropriately.
For getting something working for your class, I think that it is fine to add whatever functions or behavior you need. IMO, unless you already have a bunch of FSTs encoded in the old AT&T FSM format, there is no reason to use the fstcompile method in the first place. Using the add_arc method is probably more pythonic anyway.
As I mentioned in the above comment, the different symbol tables run into issues with compose, so it is not just as simple as adding two different symbol table attributes to the FST class.