langid.c icon indicating copy to clipboard operation
langid.c copied to clipboard

Pure C natural language identifier with support for 97 languages

================ langid.c readme

Introduction

langid.c is an experimental implementation of the language identifier described by [1] in pure C. It is largely based on the design of langid.py[2], and uses langid.py to train models.

Planned features

See TODO

Speed

Initial comparisons against Google's cld2[3] suggest that langid.c is about twice as fast.

(langid.c) @mlui langid.c git:[master] wc -l wikifiles 
28600 wikifiles
(langid.c) @mlui langid.c git:[master] time cat wikifiles | ./compact_lang_det_batch > xxx
cat wikifiles  0.00s user 0.00s system 0% cpu 7.989 total
./compact_lang_det_batch > xxx  7.77s user 0.60s system 98% cpu 8.479 total
(langid.c) @mlui langid.c git:[master] time cat wikifiles | ./langidOs -b > xxx           
cat wikifiles  0.00s user 0.00s system 0% cpu 3.577 total
./langidOs -b > xxx  3.44s user 0.24s system 97% cpu 3.759 total

(langid.c) @mlui langid.c git:[master] wc -l rcv2files 
20000 rcv2files
(langid.c) @mlui langid.c git:[master] time cat rcv2files | ./langidO2 -b > xxx     
cat rcv2files  0.00s user 0.00s system 0% cpu 31.702 total
./langidO2 -b > xxx  8.23s user 0.54s system 22% cpu 38.644 total
(langid.c) @mlui langid.c git:[master] time cat rcv2files | ./compact_lang_det_batch > xxx 
cat rcv2files  0.00s user 0.00s system 0% cpu 18.343 total
./compact_lang_det_batch > xxx  18.14s user 0.53s system 97% cpu 19.155 total

Model Training

Google's protocol buffers [4] are used to transfer models between languages. The Python program ldpy2ldc.py can convert a model produced by langid.py [2] into the protocol-buffer format, and also the C source format used to compile an in-built model directly into executable.

Dependencies

Protocol buffers [4] protobuf-c [5]

Contact

Marco Lui [email protected]

References

[1] http://aclweb.org/anthology-new/I/I11/I11-1062.pdf [2] https://github.com/saffsd/langid.py [3] https://code.google.com/p/cld2/ [4] https://github.com/google/protobuf/ [5] https://github.com/protobuf-c/protobuf-c