nlp-discussion
Existing Work: Readers/Writers/Datatypes
e.g. for CoNLL format(s)
For parsers of NLP-related formats, there are, e.g.:
- CoNLL-X reader / writer by @danieldk: https://github.com/danieldk/conllx-rs
- CoNLL-U, CoNLL-X, CoreNLP CoNLL parsers: https://docs.rs/nlp-io/. Unfortunately, the underlying repo no longer exists and the releases were yanked from crates.io. We could try to contact the authors once this repo is in a bit better shape.
- Another format I would be interested in is some way to represent sparse document-term matrices. It's maybe more related to serializing sparse data in general. For instance,
  - the svmlight / libsvm sparse data file format, similar to https://github.com/mblondel/svmlight-loader
  - or, more generally, some way of serializing sparse CSR/CSC matrices to some standard (language-agnostic) format
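To make the svmlight / libsvm idea concrete, here is a minimal sketch of reading that text format into a CSR layout using only the standard library. The names `CsrMatrix` and `parse_svmlight` are made up for this example and do not come from any existing crate. The sketch assumes each line has the shape `<label> <index>:<value> ...`, treats everything after `#` as a comment, keeps feature indices exactly as written in the file (they are commonly 1-based), and does not handle `qid:` fields used for ranking data.

```rust
/// Hypothetical CSR (compressed sparse row) container for this example.
#[derive(Debug, Default, PartialEq)]
pub struct CsrMatrix {
    /// Row i occupies `indices[indptr[i]..indptr[i + 1]]`.
    pub indptr: Vec<usize>,
    /// Column (feature) index of each stored value, as written in the file.
    pub indices: Vec<usize>,
    /// The stored values.
    pub data: Vec<f64>,
}

/// Parse svmlight / libsvm text, one `<label> <index>:<value> ...` row per line.
/// Returns the labels and the feature matrix, or an error message.
pub fn parse_svmlight(input: &str) -> Result<(Vec<f64>, CsrMatrix), String> {
    let mut labels = Vec::new();
    let mut m = CsrMatrix {
        indptr: vec![0],
        indices: Vec::new(),
        data: Vec::new(),
    };

    for (n, raw) in input.lines().enumerate() {
        // Strip a trailing `# comment` and skip lines that are then empty.
        let line = raw.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }

        let mut fields = line.split_whitespace();
        let label = fields
            .next()
            .ok_or_else(|| format!("line {}: missing label", n + 1))?;
        labels.push(
            label
                .parse::<f64>()
                .map_err(|e| format!("line {}: bad label {:?}: {}", n + 1, label, e))?,
        );

        for field in fields {
            let (idx, val) = field
                .split_once(':')
                .ok_or_else(|| format!("line {}: expected index:value, got {:?}", n + 1, field))?;
            m.indices.push(
                idx.parse::<usize>()
                    .map_err(|e| format!("line {}: bad index {:?}: {}", n + 1, idx, e))?,
            );
            m.data.push(
                val.parse::<f64>()
                    .map_err(|e| format!("line {}: bad value {:?}: {}", n + 1, val, e))?,
            );
        }
        // Close the row: the next row starts where this one ended.
        m.indptr.push(m.indices.len());
    }

    Ok((labels, m))
}
```

The three `Vec`s are the same arrays that most CSR implementations expose, so writing them out in any language-agnostic container would cover the second, more general point as well.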
I just released the first proper version of a crate to read and process constituency trees at https://github.com/sebpuetz/lumberjack.
The crate is still rather unpolished and I'm unsure about what the public API should be, but it supports reading the NEGRA export format and various flavours of bracketed trees, as well as conversion to and from @danieldk's conllx format, with or without encoded constituency structure. A number of tree operations are also possible, such as filtering specific non-terminals.
There is another inactive Rust crate for reading bracketed constituency trees at https://github.com/sjmielke/ptb-reader-rust.
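For readers unfamiliar with bracketed constituency trees, here is a rough sketch of what reading one involves. This is not the API of lumberjack or ptb-reader-rust; `Tree`, `parse_bracketed` and `words` are names invented for this example. It assumes the simplest flavour, where every node is `(LABEL child ...)` and every preterminal is `(TAG word)`, and it does not handle treebank extras such as an unlabeled outer bracket.

```rust
/// Hypothetical tree type for this example.
#[derive(Debug, PartialEq)]
pub enum Tree {
    /// A preterminal such as `(NN dog)`.
    Terminal { tag: String, form: String },
    /// An inner node such as `(NP ...)`.
    NonTerminal { label: String, children: Vec<Tree> },
}

/// Parse a single bracketed tree, e.g. `(S (NP (DT the) (NN dog)))`.
pub fn parse_bracketed(input: &str) -> Result<Tree, String> {
    // Pad brackets with spaces so that whitespace splitting tokenizes them.
    let spaced = input.replace('(', " ( ").replace(')', " ) ");
    let tokens: Vec<&str> = spaced.split_whitespace().collect();
    let mut i = 0;
    let tree = parse_node(&tokens, &mut i)?;
    if i != tokens.len() {
        return Err(format!("unexpected token at position {}", i));
    }
    Ok(tree)
}

fn expect_close(tokens: &[&str], i: &mut usize) -> Result<(), String> {
    if tokens.get(*i) == Some(&")") {
        *i += 1;
        Ok(())
    } else {
        Err(format!("expected ')' at token {}", *i))
    }
}

fn parse_node(tokens: &[&str], i: &mut usize) -> Result<Tree, String> {
    if tokens.get(*i) != Some(&"(") {
        return Err(format!("expected '(' at token {}", *i));
    }
    *i += 1;

    // The first token after '(' is the node label or part-of-speech tag.
    let label = match tokens.get(*i) {
        Some(&t) if t != "(" && t != ")" => t.to_string(),
        _ => return Err(format!("expected a label at token {}", *i)),
    };
    *i += 1;

    match tokens.get(*i) {
        // Nested brackets: this is an inner node with one or more children.
        Some(&"(") => {
            let mut children = Vec::new();
            while tokens.get(*i) == Some(&"(") {
                children.push(parse_node(tokens, i)?);
            }
            expect_close(tokens, i)?;
            Ok(Tree::NonTerminal { label, children })
        }
        Some(&")") | None => Err(format!("node {:?} has no content", label)),
        // A plain token: this is a preterminal carrying a word form.
        Some(&form) => {
            *i += 1;
            expect_close(tokens, i)?;
            Ok(Tree::Terminal { tag: label, form: form.to_string() })
        }
    }
}

/// Collect the word forms of a tree from left to right.
pub fn words(tree: &Tree) -> Vec<&str> {
    match tree {
        Tree::Terminal { form, .. } => vec![form.as_str()],
        Tree::NonTerminal { children, .. } => {
            children.iter().flat_map(|c| words(c)).collect()
        }
    }
}
```

Calling `parse_bracketed("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))")` and then `words` on the result yields the three word forms in sentence order.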