Size of Training Data.
I'm trying to train a model with a text file that is 42 GB in size. I have more than enough memory on my machine, but I keep getting a segmentation fault (core dump) during training. Any reason why this would happen?
My team and I have trained multiple models on smaller datasets on the same machine, so we are confident that crfsuite is set up correctly.
Hi @akazer2, were you able to resolve the issue? I was also trying to train a model on a text file (10 MB), but crfsuite gives a segmentation fault. Thanks in advance.
Did anyone manage to resolve this?
The thing is that during training much more memory is requested than what it takes just to fit your dataset in memory.
For datasets this big, I suggest using online algorithms. I found Vowpal Wabbit to be not only very versatile but also to scale very well. Yes, it handles sequence tagging like CRFSuite does. I can show how to do sequence tagging with VW.
@usptact, could you please provide an example of sequence tagging in Vowpal Wabbit? What command line and input format?
The data format is similar to that of CRFSuite, except that spaces are used to separate features. VW also introduces feature namespaces. The following is a training set of two sequences in VW format (notice the empty line between the two sequences; I am using only one namespace, called "f"):
label1 |f f1 f2 f3
label2 |f f2 f3 f4
label3 |f f4 f5 f1

label2 |f f2 f4
label3 |f f1 f3
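If your data is already in CRFSuite format (one token per line, label first, tab-separated features, blank line between sequences), something like this awk one-liner can convert it. This is only a minimal sketch under those assumptions, and the file names train.crfsuite and train.feat are placeholders:

awk -F'\t' 'NF { printf "%s |f", $1; for (i = 2; i <= NF; i++) printf " %s", $i; print "" } !NF { print "" }' train.crfsuite > train.feat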
The sequence tagging model can be trained with this command:
# --passes          keep this small
# --search_task     the task is sequence tagging
# --search          number of possible labels
# --named_labels    provide a comma-separated list of string labels if integer labels are not used
# -b                number of bits for feature hashing - more is better
# --l2, --l1        per-example regularization
# -f                store the model
# --readable_model  store the model in readable format
vw --data train.feat \
   --cache \
   --passes 10 \
   --search_task sequence \
   --search $NUM_LABELS \
   --search_rollin policy \
   --search_rollout none \
   --named_labels "$(< labels)" \
   -b 28 \
   --l2 1e-5 \
   --l1 1e-7 \
   -f $MODEL \
   --readable_model $MODEL.txt
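Once the model is trained, predictions on held-out data in the same format can be obtained along these lines (a sketch; test.feat and predictions.txt are placeholder names; -t disables learning, -i loads the stored model, and -p writes the predicted labels):

vw --data test.feat -t -i $MODEL -p predictions.txt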