crfsuite
Multi core support for training on large number of instances
I think CRFSuite can be optimized to utilize the multiple cores available on all machines these days. A simple fix, I thought, would be to parallelize the score computation in the for loop of encoder_objective_and_gradients_batch,
especially at line https://github.com/chokkan/crfsuite/blob/8c0028c5070546a5fc08d2c8175ab244c618f35f/lib/crf/src/crf1d_encode.c#L825
An additional dependency might be needed if we want to use a parallelization library such as OpenMP to implement this feature; it could be switched on or off with a build flag.
Some API changes might also be needed to ensure proper aggregation of the results from each of the parallel jobs.
I would love to get feedback on this and to know whether anyone else is working on such a patch.
@kmike @chokkan @ogrisel What do you think about it?
I just submitted pull request #68 for it, with the different loops annotated.
@tianjianjiang All due respect to the author of CRFSuite (who did a really great job), but it may take a while to get your improvement merged. Perhaps your best bet would be to fork the project and work there. Thanks for your contribution.
@usptact, I think he already did: https://github.com/tianjianjiang/crfsuite-openmp
@usptact Not a problem at all. @bratao Thanks for the clarification.
In fact, it's probably a good idea to wait for a while. I've noticed that on different OSes, with different compilers, and on certain data sets, the computation can be inefficient or even hang (0% CPU time).
Pull request #68 has just been updated to improve the performance. It finally seems faster than the original version.
@tianjianjiang Thanks for the work. Can you add some test scripts for benchmarking the performance? An IPython notebook would be a very good option.
Hi, I am new to the field of multiprocessing, and I just want to know how to run CRFsuite with OpenMP, since without it, it's extremely slow on big data sets. Thank you in advance.
@CSabty If you need speed when learning from very large datasets, please take a look at Wapiti, or use Vowpal Wabbit in learning-to-search mode. I use the latter when I need to train a NER model very quickly.
@usptact Could you please share the command line you used for NER with Vowpal? I was never able to come up with a working command line for tagging.
@bratao Sure, here you go:
```
vw --data train.feat \
   --learning_rate 0.5 \
   --cache --kill_cache \
   --threads \
   --passes 10 \
   --search_task sequence \
   --search $NUM_LABELS \
   --search_rollin=policy \
   --search_rollout=none \
   --named_labels "$(< labels)" \
   -b 28 \
   --l1=1e-7 \
   -f $MODEL \
   --readable_model $MODEL.txt
```
You will need the training file ("train.feat") in multi-line format (see the docs) and a file "labels" with the string labels, which are BIO tags (in my case). If there are only a few, you can list the tags as a comma-separated list on the command line.
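For reference, a multi-line VW input in the sense above might look like the sketch below. The token labels and feature names are made up, and the exact syntax should be checked against the VW search-task documentation; the general shape is one `label | features` line per token, with a blank line separating sequences:

```
B-PER | word=John cap=1
I-PER | word=Smith cap=1
O | word=works cap=0
O | word=here cap=0

O | word=hello cap=0
```

The "labels" file read by `--named_labels "$(< labels)"` would then contain the tag inventory on one line, e.g. `B-PER,I-PER,O`.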
@usptact Thank you so much for your reply; I am working on NER training as well. Do you think Wapiti or Vowpal Wabbit perform better (speed-wise) than CRF++? I was planning to use CRF++ with multi-core support, because I feel it has more resources online and may be simpler than the other options.
@CSabty In my experience, performance-wise, CRF is still the best, although I did not do a thorough comparison.
@usptact
You will need the training file ("train.feat") in multi-line format (see the docs) and a file "labels" with the string labels, which are BIO tags (in my case). If there are only a few, you can list the tags as a comma-separated list on the command line.
For a POS tagging task, can I use the same features with Vowpal Wabbit as with crfsuite? In the crfsuite training data, a feature can be followed by a ":" and then a float scaling value, but in Vowpal Wabbit the ":" seems to set the feature value rather than the feature importance.
It's quite painful to use Vowpal Wabbit; have you written any blog posts about sequence search? Thanks!
In both CRFSuite and VW, the ":" character is special. In the former you can escape it as "\:", but in the latter you can't. This assumes you don't want to change the default weight of 1.0.
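To make the ":" distinction concrete, here are two hypothetical CRFsuite training lines (tab-separated; the label and feature names are made up, so check the CRFsuite data-format docs for the exact rules):

```
B-NP	word=John	prefix=Jo:2.0
B-NP	word=12\:30	suffix=30
```

On the first line, `prefix=Jo:2.0` is the feature `prefix=Jo` with a scaling value of 2.0; on the second, `word=12\:30` is a single feature whose name contains a literal colon, thanks to the backslash escape. In VW, by contrast, `feature:2.0` always means "feature with value 2.0", and there is no escape for a literal colon in a feature name.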
I wonder whether this multicore CRF development has stalled. I am dying for such a feature.
@jbkoh If you are looking for multi CPU training of CRFs, take a look at https://github.com/zhongkaifu/CRFSharp
In my experience, CRFsuite and libLBFGS are not OpenMP friendly. Of course there are other ways to add multi-core support, but OpenMP might require fundamental changes to CRFsuite, which is probably an unacceptable cost.
@usptact @tianjianjiang Thanks for the information! I wish I could have exploited multiple cores with PyCRFSuite, but I can switch to the suggested alternative. Thank you all.