Size of Training Data.
I'm trying to train a model with a text file that is 42 GB in size. I have more than enough memory on my machine, but I keep getting a segmentation fault (core dump) during training. Any reason why this would happen?
My team and I have trained multiple models on smaller datasets on the same machine, so we are confident that crfsuite is set up correctly.
Hi @akazer2, were you able to resolve the issue? I was also trying to train a model on a text file (10 MB), but crfsuite gives a segmentation fault. Thanks in advance.
Did anyone manage to resolve this?
The thing is that during training much more memory is requested than what it takes just to fit your dataset in memory.
For datasets this big, I suggest using online algorithms. I found Vowpal Wabbit to be not only very versatile but also to scale very well. Yes, it handles sequence tagging like CRFSuite does. I can show how to do sequence tagging with VW.
@usptact, could you please provide an example of sequence tagging in Vowpal Wabbit? What command line and input format?
The data format is similar to that of CRFSuite, except that spaces are used to separate features. VW also introduces feature namespaces. The following is a training set of two sequences in VW format (notice the empty line between the two sequences; I am using only one namespace, called "f"):
label1 |f f1 f2 f3
label2 |f f2 f3 f4
label3 |f f4 f5 f1

label2 |f f2 f4
label3 |f f1 f3
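If your data is already in CRFSuite format (one token per line, label first, tab-separated features, blank line between sequences), something like this awk one-liner can convert it. This is only a minimal sketch under those assumptions, and the file names train.crfsuite and train.feat are placeholders:

awk -F'\t' 'NF { printf "%s |f", $1; for (i = 2; i <= NF; i++) printf " %s", $i; print "" } !NF { print "" }' train.crfsuite > train.feat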
The sequence tagging model can be trained with this command:
# --passes          keep this small
# --search_task     the task is sequence tagging
# --search          number of possible labels
# --named_labels    provide a comma-separated list of string labels if integer labels are not used
# -b                number of bits for feature hashing - more is better
# --l2, --l1        per-example regularization
# -f                store the model
# --readable_model  store the model in readable format
vw --data train.feat \
   --cache \
   --passes 10 \
   --search_task sequence \
   --search $NUM_LABELS \
   --search_rollin policy \
   --search_rollout none \
   --named_labels "$(< labels)" \
   -b 28 \
   --l2 1e-5 \
   --l1 1e-7 \
   -f $MODEL \
   --readable_model $MODEL.txt
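Once the model is trained, predictions on held-out data in the same format can be obtained along these lines (a sketch; test.feat and predictions.txt are placeholder names; -t disables learning, -i loads the stored model, and -p writes the predicted labels):

vw --data test.feat -t -i $MODEL -p predictions.txt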