lazy
A lazy decoder for syntax
This code takes in a hypergraph and a language model then outputs a sentence.
It is split into a library (search/) and a standalone wrapper (alone/). The
library is also in Moses (-search-algorithm 5) and cdec (--incremental_search
$lm).
COMPILING
Requires Boost >= 1.41. Tested on Linux.
Compile with ./bjam
USAGE
After compiling, the decoder is bin/decode. Run without an argument for help.
To run, you will need one language model, feature weights, and hypergraphs.
The language model must be in ARPA or KenLM format. Pass -l lm where lm is the file name.
Feature weights can be specified in a file using -w or on the command line with -W. Weights are key=value pairs, as in cdec. The hard-coded features are LanguageModel, LanguageModel_OOV, and WordPenalty. WordPenalty is word count times -1/ln(10) for odd historical reasons dating back to Hiero. The feature definitions are compatible with Moses and cdec.
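As a rough illustration of the weight format and the WordPenalty definition above, here is a minimal Python sketch. It is not part of the decoder: the function names parse_weights and word_penalty are hypothetical, the weight values are made up, and it assumes the key=value pairs are separated by whitespace.

```python
import math


def parse_weights(text):
    """Parse whitespace-separated key=value pairs into a dict of floats."""
    weights = {}
    for pair in text.split():
        name, _, value = pair.partition("=")
        if not name or not value:
            raise ValueError(f"expected key=value, got {pair!r}")
        weights[name] = float(value)
    return weights


def word_penalty(word_count):
    """WordPenalty feature value: word count times -1/ln(10)."""
    return word_count * (-1.0 / math.log(10))


# Illustrative weights only; real values come from tuning.
weights = parse_weights("LanguageModel=1.0 LanguageModel_OOV=-2.5 WordPenalty=-0.5")

# Contribution of WordPenalty to the score of a 5-word output.
contribution = weights["WordPenalty"] * word_penalty(5)
```

Since -1/ln(10) is negative, a negative WordPenalty weight rewards longer output and a positive one rewards shorter output.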
Hypergraphs are stored in a directory with one file per sentence. The files are named starting with 0. The first line of each file is
total_vertex_count total_edge_count
Then the file enumerates each vertex in bottom-up order (i.e. a vertex can only reference vertices that have already been defined). A vertex is simply a list of competing ways to derive it (downward edges). The first line of a vertex gives its number of edges; the edges follow, one per line. An edge looks like
foo [3] bar [7] [5] baz ||| Feature=5 AnotherFeature=10
where foo, bar, and baz are literal words and [n] references vertex n. Edges
can have arbitrary arity (i.e. as many references as desired). The tokens
<s> and </s> should appear explicitly; they are not added by the decoder.
A complete example:
7 13
1
<s> ||| Quux=10
2
[0] le ||| Distance=1.5
[0] la ||| Distance=1.1
2
[1] petit ||| Distance=0.0
[1] peti ||| Distance=3.0 Foo=4
3
[2] chas ||| Distance=1.1
[2] char [1] ||| Distance=0.8
[2] chat ||| Distance=1.0
2
[3] est ||| Distance=2.0
[3] Est ||| Distance=0.0
2
[4] more ||| Distance=1.0
[4] mort ||| Distance=0.0
1
[5] </s> |||
This is the format produced by cdec's --show_target_graph option. But if you're using cdec, the code has already been natively ported and can be accessed using --incremental_search lm.
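To make the file format concrete, the following Python sketch reads one hypergraph file into memory. It is only an illustration, not the decoder's own reader: parse_edge and parse_hypergraph are hypothetical names, the sample graph is made up, and the sketch assumes that blank lines can be skipped and that any token of the form [n] is a vertex reference.

```python
def parse_edge(line):
    """Split an edge line into (tokens, features).

    In tokens, a literal word stays a str and a reference like [3]
    becomes the int 3.
    """
    target, _, feature_text = line.partition("|||")
    tokens = []
    for tok in target.split():
        if tok.startswith("[") and tok.endswith("]") and tok[1:-1].isdigit():
            tokens.append(int(tok[1:-1]))
        else:
            tokens.append(tok)
    features = {}
    for pair in feature_text.split():
        name, _, value = pair.partition("=")
        features[name] = float(value)
    return tokens, features


def parse_hypergraph(text):
    """Return a list of vertices; each vertex is a list of parsed edges."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    vertex_count, edge_count = (int(x) for x in lines[0].split())
    vertices = []
    pos = 1
    for _ in range(vertex_count):
        n_edges = int(lines[pos])
        pos += 1
        edges = []
        for line in lines[pos:pos + n_edges]:
            tokens, features = parse_edge(line)
            # Bottom-up order: references must point at earlier vertices.
            for tok in tokens:
                if isinstance(tok, int) and tok >= len(vertices):
                    raise ValueError(f"forward reference to vertex {tok}")
            edges.append((tokens, features))
        pos += n_edges
        vertices.append(edges)
    if sum(len(v) for v in vertices) != edge_count:
        raise ValueError("edge count does not match the header")
    return vertices


sample = """3 4
1
<s> ||| Quux=10
2
[0] le ||| Distance=1.5
[0] la ||| Distance=1.1
1
[1] </s> |||
"""
vertices = parse_hypergraph(sample)
```

The last vertex in the file is the root, so vertices[-1] holds the edges that complete a sentence.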
DIRECTORY LAYOUT
util and lm: copied from KenLM
search: the core search algorithm, portable to other decoders.
alone: a standalone wrapper around the search implementation.