crfsuite
Multi core support for training on large number of instances
I think CRFSuite can be optimized to utilize the multiple cores available on all machines these days. A simple fix, I thought, would be to parallelize the score computation in the for loop of encoder_objective_and_gradients_batch,
especially at line https://github.com/chokkan/crfsuite/blob/8c0028c5070546a5fc08d2c8175ab244c618f35f/lib/crf/src/crf1d_encode.c#L825
An additional dependency might be needed if we want to use a parallelization library such as OpenMP to implement this feature; it could be switched on or off with a build flag.
Some API changes might also be needed to ensure proper aggregation of the results from each of the parallel jobs.
I would love to get feedback on this and to know whether anyone else is working on such a patch.
@kmike @chokkan @ogrisel What do you think about it?
I just submitted pull request #68 for it, with the different loops annotated.
@tianjianjiang All due respect to the author of CRFSuite (who did a really great job), but it may take a while to get your improvement merged. Perhaps your best bet would be to fork the project and work there. Thanks for your contribution.
@usptact, I think he already did: https://github.com/tianjianjiang/crfsuite-openmp
@usptact Not a problem at all. @bratao Thanks for the clarification.
In fact, it's probably a good idea to wait for a while. I've noticed that on different OSes, with different compilers, and on certain data sets, the computation can be inefficient or even hang (0% CPU time).
Pull request #68 has just been updated to improve the performance. It finally seems faster than the original version.
@tianjianjiang Thanks for the work. Can you add some test scripts for benchmarking the performance? An IPython notebook would be a very good option.
Hi, I am new to the field of multiprocessing, and I just want to know how to run CRFsuite with OpenMP, since without it, it's extremely slow on big data sets. Thank you in advance.
@CSabty If you need speed when learning from very large datasets, please take a look at Wapiti, or use Vowpal Wabbit in learning-to-search mode. I use the latter when I need to train a NER model very quickly.
@usptact Could you please share the command line you used for NER with Vowpal? I was never able to come up with a working command line for tagging.
@bratao Sure, here you go:
```
vw --data train.feat \
   --learning_rate 0.5 \
   --cache --kill_cache \
   --threads \
   --passes 10 \
   --search_task sequence \
   --search $NUM_LABELS \
   --search_rollin=policy \
   --search_rollout=none \
   --named_labels "$(< labels)" \
   -b 28 \
   --l1=1e-7 \
   -f $MODEL \
   --readable_model $MODEL.txt
```
You will need the training file ("train.feat") in multi-line format (see the docs) and a file "labels" with the string labels, which are BIO tags (in my case). If there are only a few, you can list the tags as a comma-separated list on the command line.
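For reference, a multi-line VW input in the sense above might look like the sketch below. The token labels and feature names are made up, and the exact syntax should be checked against the VW search-task documentation; the general shape is one `label | features` line per token, with a blank line separating sequences:

```
B-PER | word=John cap=1
I-PER | word=Smith cap=1
O | word=works cap=0
O | word=here cap=0

O | word=hello cap=0
```

The "labels" file read by `--named_labels "$(< labels)"` would then contain the tag inventory on one line, e.g. `B-PER,I-PER,O`.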
@usptact Thank you so much for your reply; I am working on NER training as well. Do you think Wapiti or Vowpal Wabbit perform better (speed-wise) than CRF++? I was planning to use CRF++ with multi-core support, because I feel it has more resources online and may be simpler than the other options.
@CSabty In my experience, performance-wise, CRF is still the best, although I did not do a thorough comparison.
@usptact
You will need the training file ("train.feat") in multi-line format (see the docs) and a file "labels" with the string labels, which are BIO tags (in my case). If there are only a few, you can list the tags as a comma-separated list on the command line.
For a POS tagging task, can I use the same features with Vowpal Wabbit as with crfsuite? In the crfsuite training data, a feature can be followed by a ":" and then a float scaling value, but in Vowpal Wabbit the ":" seems to set the feature value rather than the feature importance.
It's quite painful to use Vowpal Wabbit; have you written any blog posts about sequence search? Thanks!
In both CRFSuite and VW, the ":" character is special. In the former you can escape it as "\:", but in the latter you can't. This assumes you don't want to change the default weight of 1.0.
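To make the ":" distinction concrete, here are two hypothetical CRFsuite training lines (tab-separated; the label and feature names are made up, so check the CRFsuite data-format docs for the exact rules):

```
B-NP	word=John	prefix=Jo:2.0
B-NP	word=12\:30	suffix=30
```

On the first line, `prefix=Jo:2.0` is the feature `prefix=Jo` with a scaling value of 2.0; on the second, `word=12\:30` is a single feature whose name contains a literal colon, thanks to the backslash escape. In VW, by contrast, `feature:2.0` always means "feature with value 2.0", and there is no escape for a literal colon in a feature name.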
I wonder whether this multicore CRF development has stalled. I am dying for such a feature.
@jbkoh If you are looking for multi CPU training of CRFs, take a look at https://github.com/zhongkaifu/CRFSharp
In my experience, CRFsuite and libLBFGS are not OpenMP friendly. Of course there are other ways to add multi-core support, but OpenMP might require fundamental changes to CRFsuite, which is probably an unacceptable cost.
@usptact @tianjianjiang Thanks for the information! I wish I could have exploited multiple cores with PyCRFSuite, but I can switch to the suggested alternative. Thank you all.