Exposing lower-level model evaluation data

Open phillbaker opened this issue 7 years ago • 2 comments

Thanks for all the hard work on this! Parserator has definitely made it easy to create a model with crfsuite. As I dig into fine-tuning my model, I'd like to have access to the metrics provided by crfsuite (accuracy, precision, recall).

It looks like the Python wrapper does provide access to this data (https://github.com/scrapinghub/python-crfsuite/issues/42#issuecomment-227805517). What do you think of a PR that exposes this as a return value of trainModel (https://github.com/datamade/parserator/blob/master/parserator/training.py#L29)?
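
For reference, the token-level metrics mentioned above can also be computed by hand from gold and predicted tag sequences. This is a minimal sketch that is independent of crfsuite and parserator; the function name `token_metrics` and the label names in the toy data are hypothetical, and it is not necessarily how crfsuite itself reports these numbers.

```python
from collections import Counter


def token_metrics(gold, predicted):
    """Return (accuracy, per_label) for two equal-length tag sequences.

    per_label maps each label to a (precision, recall) pair.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    true_pos, pred_count, gold_count = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        gold_count[g] += 1
        pred_count[p] += 1
        if g == p:
            true_pos[g] += 1

    accuracy = sum(true_pos.values()) / len(gold) if gold else 0.0
    per_label = {}
    for label in sorted(set(gold_count) | set(pred_count)):
        # Guard against division by zero for labels never predicted or never present.
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / gold_count[label] if gold_count[label] else 0.0
        per_label[label] = (precision, recall)
    return accuracy, per_label


# Toy data: the label names are made up for illustration.
gold = ["Number", "Street", "Street", "City"]
predicted = ["Number", "Street", "City", "City"]
accuracy, per_label = token_metrics(gold, predicted)
```

Here one of the four tokens is mislabeled, so "Street" loses recall and "City" loses precision, while overall accuracy only records that one token was wrong.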

phillbaker avatar Feb 07 '17 04:02 phillbaker

That sounds interesting. Would like to see a PR, yes.

fgregg avatar Feb 07 '17 04:02 fgregg

Just following up on this. We ended up using modified versions of the parse and tag functions:

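from collections import OrderedDict

# TAGGER, MODEL_FILE, tokenize and tokens2features are defined elsewhere
# in the parser module.


def parse(raw_string, verbose=False):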
    if not TAGGER:
        raise IOError(
            '\nMISSING MODEL FILE: %s\nYou must train the model before you can '
            'use the parse and tag methods\nTo train the model and create the '
            'model file, run:\nparserator train [traindata] [modulename]' % MODEL_FILE)

    tokens = tokenize(raw_string)
    if not tokens:
        return []

    features = tokens2features(tokens)

    tags = TAGGER.tag(features)

    if verbose:
        # Marginal probability of each predicted tag at its token position.
        probabilities = [TAGGER.marginal(tag, index)
                         for index, tag in enumerate(tags)]
        return list(zip(tokens, tags, probabilities))

    return list(zip(tokens, tags))


def tag(raw_string, probability_cutoff=None):
    tagged = OrderedDict()
    if probability_cutoff is not None:  # a cutoff of 0 is still a valid cutoff
        tagged_probability = OrderedDict()
        for token, label, probability in parse(raw_string, verbose=True):
            # Multiply the marginals of all tokens that share a label.
            entry = tagged_probability.setdefault(
                label, {'tokens': [], 'probability': 1.0})
            entry['probability'] *= probability
            entry['tokens'].append(token)


        for label, token_probabilities in tagged_probability.items():
            if token_probabilities['probability'] > probability_cutoff:
                tagged[label] = token_probabilities['tokens']
    else:
        for token, label in parse(raw_string):
            tagged.setdefault(label, []).append(token)

    # Join each label's tokens into one string and trim stray punctuation.
    for label in tagged:
        component = ' '.join(tagged[label])
        tagged[label] = component.strip(' ,;')

    return tagged
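
To make the cutoff behaviour concrete, here is a minimal standalone sketch of the same grouping logic, run on made-up probabilities instead of a trained model. The helper name `group_with_cutoff` and the sample triples are hypothetical; they stand in for the output of `parse(raw_string, verbose=True)`.

```python
from collections import OrderedDict


def group_with_cutoff(triples, probability_cutoff):
    """Group tokens by label, keeping labels whose joint probability exceeds the cutoff."""
    grouped = OrderedDict()
    for token, label, probability in triples:
        entry = grouped.setdefault(label, {'tokens': [], 'probability': 1.0})
        entry['probability'] *= probability
        entry['tokens'].append(token)

    return OrderedDict(
        (label, ' '.join(entry['tokens']).strip(' ,;'))
        for label, entry in grouped.items()
        if entry['probability'] > probability_cutoff
    )


# Made-up marginals standing in for parse(raw_string, verbose=True).
triples = [
    ('123', 'Number', 0.99),
    ('Main', 'Street', 0.90),
    ('St,', 'Street', 0.80),
    ('Springfield', 'City', 0.40),
]
result = group_with_cutoff(triples, probability_cutoff=0.5)
```

With a cutoff of 0.5, "Number" and "Street" are kept and "City" is dropped. Because the per-label score is a product of token marginals, labels that span many tokens are more likely to fall below a given cutoff than single-token labels.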

phillbaker avatar Nov 10 '17 16:11 phillbaker