Daniel Swanson comments

Results: 66 comments by Daniel Swanson

Merge tagger binary formats into a single format

https://gist.github.com/mr-martian/a8d1562ee95a0f636effd37b35e33171

The above is a Python implementation of our 2.5 methods of reading and writing binary data (3 for floats and 2 for everything else) and I could probably turn...

Merge tagger binary formats into a single format

If `modes.xml` is to be trusted, Perceptron is only used by English, and Unigram 2 is used by the following places:

```
./apertium-nhi-nhn
./apertium-oci
./apertium-tur-tat
./apertium-tur-uzb
./apertium-nci-nhi
./apertium-fao-nor
./apertium-hin-pan
./apertium-kan-mar...
```

Merge tagger binary formats into a single format

I'm currently working on this in #130

My current design plan is as follows (feedback welcome):

The new binary format will start with a header like the transducer one, probably...

Merge tagger binary formats into a single format

New proposal: All of the tagger models are equivalent to single-layer perceptrons with various restrictions on what the features can be. Thus I would like to amend my previous plan...
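To illustrate the equivalence claimed above, here is a minimal sketch of a unigram model rewritten as a single-layer perceptron whose only allowed feature is the surface form. The function names, the `word=` feature naming, and the toy counts are assumptions made for this example; none of them come from the Apertium codebase or from the proposed format.

```python
import math
from collections import defaultdict


def unigram_to_weights(counts):
    """Turn (word, tag) counts into perceptron weights keyed by (feature, tag)."""
    totals = defaultdict(int)
    for (word, _tag), n in counts.items():
        totals[word] += n
    # The weight is the log relative frequency of the tag for that word.
    return {
        (f"word={word}", tag): math.log(n / totals[word])
        for (word, tag), n in counts.items()
    }


def score(weights, features, tag):
    # Unseen (feature, tag) pairs score -inf, so they never win against a seen pair.
    return sum(weights.get((feat, tag), float("-inf")) for feat in features)


def predict(weights, features, tags):
    # A single-layer perceptron picks the tag with the highest summed weight.
    return max(tags, key=lambda tag: score(weights, features, tag))


# Hypothetical toy counts, for illustration only.
counts = {("saw", "vblex"): 8, ("saw", "n"): 2, ("dog", "n"): 5}
weights = unigram_to_weights(counts)
best = predict(weights, ["word=saw"], ["n", "vblex"])
```

With the restriction "one feature, the word itself", the perceptron's argmax is the same as picking the most frequent tag for the word, which is what a unigram tagger does.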

Merge tagger binary formats into a single format

Yes, old `.prob` files would continue working, but rather than there being a separate HMM tagger, it would be because the file-reading code interprets HMM files as Perceptrons where the features...
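One way to picture reading an HMM as a perceptron is sketched below: each transition and emission probability becomes a log-weight on an indicator feature, so the summed perceptron score of a tag sequence equals its HMM log-probability. This is only an illustration under stated assumptions. The table layout, the `prev=` and `word=` feature names, and the toy probabilities are invented for the example and do not describe the actual contents of a `.prob` file or the feature set the comment goes on to name.

```python
import math


def hmm_to_weights(transition, emission):
    """Map HMM probabilities to log-weights on indicator features."""
    weights = {}
    for (prev, tag), p in transition.items():
        weights[(f"prev={prev}", tag)] = math.log(p)
    for (tag, word), p in emission.items():
        weights[(f"word={word}", tag)] = math.log(p)
    return weights


def path_score(weights, words, tags, start="<s>"):
    """Sum the weights of the two features that fire at each position."""
    total, prev = 0.0, start
    for word, tag in zip(words, tags):
        for feat in (f"prev={prev}", f"word={word}"):
            total += weights.get((feat, tag), float("-inf"))
        prev = tag
    return total


# Hypothetical toy model, for illustration only.
transition = {("<s>", "det"): 0.6, ("<s>", "n"): 0.4, ("det", "n"): 0.9}
emission = {("det", "the"): 0.7, ("n", "dog"): 0.2}

weights = hmm_to_weights(transition, emission)
linear = path_score(weights, ["the", "dog"], ["det", "n"])
hmm_log_prob = math.log(0.6 * 0.7 * 0.9 * 0.2)
```

Here `linear` and `hmm_log_prob` agree up to floating-point error, which is the sense in which an HMM is a perceptron with a restricted feature set.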

Merge tagger binary formats into a single format

```
Parameters:
  window width W
  depth D
  beam search size B

Algorithm:
  for word in input:
    for reading in word:
      features = [extract from an FST] + [last W selected...
```
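Since the pseudocode above is cut off, the following is not a reconstruction of it. It is a generic beam-search tagger sketched under assumptions: `window` stands in for W and `beam` for B, the depth parameter D is left out because its role is not visible in the excerpt, and `score` is a hypothetical callback standing in for the FST-based feature extraction plus weight lookup. Each word is assumed to have at least one reading.

```python
def beam_tag(sentence, score, window=2, beam=3):
    """Pick one reading per word, keeping only the `beam` best partial paths."""
    # Each hypothesis is (total_score, readings_chosen_so_far).
    hyps = [(0.0, [])]
    for readings in sentence:
        expanded = []
        for total, chosen in hyps:
            # Context is the last `window` selected readings (empty if window == 0).
            context = tuple(chosen[-window:]) if window else ()
            for reading in readings:
                expanded.append((total + score(reading, context), chosen + [reading]))
        # Prune to the best `beam` hypotheses before moving to the next word.
        expanded.sort(key=lambda hyp: hyp[0], reverse=True)
        hyps = expanded[:beam]
    return hyps[0][1]


def toy_score(reading, context):
    # Hypothetical scorer: reward a noun reading right after a determiner.
    if context and context[-1].endswith("<det>") and reading.endswith("<n>"):
        return 1.0
    return 0.0


sentence = [["the<det>"], ["saw<n>", "saw<vblex>"]]
tagged = beam_tag(sentence, toy_score, window=1, beam=2)
```

A larger `beam` trades speed for a lower chance of pruning the path that would have scored best overall, and `beam=1` reduces this to greedy left-to-right selection.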

Capitalization Post-processor

> And we keep lemmas dictionary-cased throughout the pipeline?

Yes.

> I think it should work, if the second module can have some lemma/PoS-specific rules and access to at least...

Capitalization Post-processor

Actually the reason I started on this now was because I couldn't figure out how to handle case properly within postgen and so was trying to handle it after postgen....

Capitalization Post-processor

So the proposed pipeline would be

| command | use |
|------|--------|
| `lt-proc -b` | generator |
| `cg-proc` | preferences |
| `lsx-proc -p` (or something) | postgen...

Capitalization Post-processor

It occurs to me that the rules in the final step could have roughly the same syntax and semantics as LRX: ```xml ```