
Multi core support for training on large number of instances

Open napsternxg opened this issue 8 years ago • 19 comments

I think CRFSuite could be optimized to use the multiple cores available on most machines these days. A simple approach I thought of is to compute the per-instance scores in parallel inside the for loop of encoder_objective_and_gradients_batch, in particular at https://github.com/chokkan/crfsuite/blob/8c0028c5070546a5fc08d2c8175ab244c618f35f/lib/crf/src/crf1d_encode.c#L825

Implementing the feature with a multiprocessing library such as OpenMP would add a dependency, which could be switched on or off with a flag.

Some API changes might also be needed in order to ensure the proper aggregation of results from each of the parallel jobs.
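As a rough illustration of the aggregation problem, here is a minimal sketch, not CRFsuite's actual code. The names `instance_fn`, `toy_instance`, `toy_ctx`, and `batch_objective` are hypothetical and only stand in for the per-instance work done in that loop. It assumes an OpenMP 4.5 compiler (for the array-section reduction) and falls back to a plain serial loop when OpenMP is disabled.

```c
#include <stddef.h>

/* Hypothetical per-instance callback (not a CRFsuite type): returns the loss
 * of instance i and ADDS that instance's gradient into grad[0..n_features). */
typedef double (*instance_fn)(int i, double *grad, int n_features, void *ctx);

/* Toy model used only for illustration: instance i touches the single
 * feature k = i % n_features and has squared-error loss 0.5 * (w[k] - y[i])^2. */
typedef struct {
    const double *w;  /* current weights, length n_features */
    const double *y;  /* targets, one per instance */
} toy_ctx;

double toy_instance(int i, double *grad, int n_features, void *ctx)
{
    const toy_ctx *c = (const toy_ctx *)ctx;
    int k = i % n_features;
    double r = c->w[k] - c->y[i];
    grad[k] += r;          /* d/dw[k] of 0.5 * r^2 */
    return 0.5 * r * r;
}

/* Sum the loss and gradient over all instances. With OpenMP enabled, each
 * thread accumulates into its own private copy of loss and grad, and the
 * copies are added together when the loop ends. Without OpenMP the pragma
 * is ignored and the loop runs serially with the same result. */
double batch_objective(instance_fn fn, void *ctx, int n_instances,
                       double *grad, int n_features)
{
    double loss = 0.0;

    for (int k = 0; k < n_features; ++k) {
        grad[k] = 0.0;
    }

    /* Array-section reduction needs OpenMP 4.5 or later. */
    #pragma omp parallel for reduction(+:loss) reduction(+:grad[:n_features])
    for (int i = 0; i < n_instances; ++i) {
        loss += fn(i, grad, n_features, ctx);
    }

    return loss;
}
```

Built with something like `gcc -fopenmp` this runs the loop on several threads; without the flag it stays single-threaded, which is one way to make the dependency optional. The private gradient copies cost memory proportional to the number of threads times the number of features, so for large feature sets a heap-allocated per-thread buffer may be the safer design.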

I would love some feedback on this, and I'd like to know whether anyone else is already working on such a patch.

napsternxg avatar Jun 06 '16 21:06 napsternxg

@kmike @chokkan @ogrisel What do you think about it?

napsternxg avatar Jun 08 '16 20:06 napsternxg

I just submitted pull request #68 for this, but with different loops annotated.

tianjianjiang avatar Jul 02 '16 20:07 tianjianjiang

@tianjianjiang With all due respect to the author of CRFSuite (who did a really great job), it may take a while to get your improvement merged. Perhaps your best bet would be to fork the project and work there. Thanks for your contribution.

usptact avatar Jul 04 '16 17:07 usptact

@usptact, I think he already did: https://github.com/tianjianjiang/crfsuite-openmp

bratao avatar Jul 04 '16 19:07 bratao

@usptact Not a problem at all. @bratao Thanks for the clarification.

In fact, it's a good idea to wait for a while. I've noticed that on different OSes, with different compilers, and on certain data sets, the computation can be inefficient or can even hang (0% CPU time).

tianjianjiang avatar Jul 06 '16 08:07 tianjianjiang

Pull request #68 has just been updated to improve performance. It finally seems to be faster than the original version.

tianjianjiang avatar Jul 14 '16 06:07 tianjianjiang

@tianjianjiang Thanks for the work. Can you add some test scripts for benchmarking the performance? An IPython notebook would be a very good option.

napsternxg avatar Jul 15 '16 03:07 napsternxg

Hi, I am new to multiprocessing and would like to know how to run CRFsuite with OpenMP, since without it training is extremely slow on big data sets. Thank you in advance.

CSabty avatar Feb 18 '17 12:02 CSabty

@CSabty If you need speed for learning from very large datasets, please take a look at Wapiti or use Vowpal Wabbit in learning to search mode. I use the latter when I need to train a NER model very quickly.

usptact avatar Feb 18 '17 18:02 usptact

@usptact Could you please share the command line you used for NER with Vowpal Wabbit? I was never able to come up with a working command line for tagging.

bratao avatar Feb 18 '17 19:02 bratao

@bratao Sure, here you go:

vw  --data train.feat \
    --learning_rate 0.5 \
    --cache --kill_cache \
    --threads \
    --passes 10 \
    --search_task sequence \
    --search $NUM_LABELS \
    --search_rollin=policy \
    --search_rollout=none \
    --named_labels "$(< labels)" \
    -b 28 \
    --l1=1e-7 \
    -f $MODEL \
    --readable_model $MODEL.txt

You will need the training file ("train.feat") in multi-line format (see the docs) and a file "labels" containing the string labels, which are BIO tags in my case. If there are only a few, you can pass the tags as a comma-separated list directly on the command line.

usptact avatar Feb 18 '17 20:02 usptact

@usptact Thank you so much for your reply. I am working on NER training as well. Do you think Wapiti or Vowpal Wabbit is faster than CRF++? I was planning to use CRF++ with multiple cores because I feel it has more resources online and may be simpler than the others.

CSabty avatar Feb 20 '17 10:02 CSabty

@CSabty In my experience, performance-wise, CRF is still the best, although I did not do a thorough comparison.

usptact avatar Feb 20 '17 17:02 usptact

@usptact

> You will need the training file ("train.feat") in multi-line format (see doc) and a file "labels" with string labels that are BIO tags (in my case). If there are only few, you can list the tags as comma-separated list in console.

For a POS tagging task, can I use the same features I use with CRFsuite when training with Vowpal Wabbit? In a CRFsuite training data set, a feature can be followed by a ":" and then a float scaling value, but in Vowpal Wabbit the ":" seems to set the feature value rather than the feature importance.

Vowpal Wabbit is too painful to use. Have you written a blog post about sequence search? Thanks.

yiqingyang2012 avatar Jun 29 '17 08:06 yiqingyang2012

In both CRFSuite and VW, the ":" character is special. In the former you can escape it as "\:", but in the latter you can't. This assumes you don't want to change the default weight of 1.0.
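For the CRFSuite side, a small helper along these lines could do the escaping when writing the training file. This is only a sketch: `escape_attribute` is a hypothetical name, not part of the CRFSuite API, and escaping the backslash itself as "\\" is an assumption based on my reading of the CRFSuite data format notes, so please check it against your version.

```c
#include <stddef.h>

/* Hypothetical helper (not a CRFSuite function): copy an attribute name into
 * out, prefixing every ':' and '\' with a backslash so the ':' is not read
 * as the start of a scaling value. Escaping '\' is an assumption, see above.
 * Returns the length of the escaped string, or -1 if out (capacity cap,
 * including the terminating NUL) is too small. */
int escape_attribute(const char *name, char *out, size_t cap)
{
    size_t n = 0;

    for (; *name != '\0'; ++name) {
        if (*name == ':' || *name == '\\') {
            if (n + 1 >= cap) {
                return -1;          /* no room for the escape character */
            }
            out[n++] = '\\';
        }
        if (n + 1 >= cap) {
            return -1;              /* no room for the character and NUL */
        }
        out[n++] = *name;
    }

    if (cap == 0) {
        return -1;                  /* cannot even store the terminator */
    }
    out[n] = '\0';
    return (int)n;
}
```

For example, the attribute name `w[0]=10:30` would be written as `w[0]=10\:30`, and an explicit weight can still be appended afterwards as `:0.5`. For VW there is no equivalent, so the usual workaround is to replace ":" with another character before writing the features.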

usptact avatar Jun 29 '17 17:06 usptact

I wonder whether development of multicore CRF training is dead. I am dying for such a feature.

jbkoh avatar Sep 21 '17 06:09 jbkoh

@jbkoh If you are looking for multi-CPU training of CRFs, take a look at https://github.com/zhongkaifu/CRFSharp

usptact avatar Sep 22 '17 05:09 usptact

In my experience, CRFsuite and libLBFGS are not OpenMP-friendly. Of course there are other ways to get multi-core support, but doing it with OpenMP might require fundamental changes to CRFsuite, which is probably an unacceptable cost.

tianjianjiang avatar Sep 22 '17 10:09 tianjianjiang

@usptact @tianjianjiang Thanks for the information! I wish I could have made use of all the cores with PyCRFSuite, but I can switch to the project you pointed me to. Thank you all.

jbkoh avatar Sep 22 '17 19:09 jbkoh