A Python library for identifying and disambiguating acronyms and abbreviations in text.

Abbreviation Resolver

Abbreviation resolver is a Python library for identifying and disambiguating acronyms and abbreviations in text. For example, given the sentence "Web site underwent a severe DOS attack.", the program should suggest the right interpretation of "DOS" from the candidate set "Denial-of-service", "Disk operating system" and "Data over signalling".

Abbreviation resolver supports Python versions 2.7 and 3.4.

Installation

    $ git clone https://github.com/estnltk/abbreviation-resolver
    $ cd abbreviation-resolver
    $ python setup.py install

Development installation with zc.buildout

    $ git clone https://github.com/estnltk/abbreviation-resolver
    $ cd abbreviation-resolver
    $ python bootstrap.py
    $ ./bin/buildout

Usage

To run abbreviation resolver, first create a configuration file that specifies the locations of the abbreviation and word2vec models, e.g.

[MODEL]
ABBREVIATION_MODEL=/opt/home/sass/projects/lyhendid/tasks/model/results/model.csv
WORD2VEC_MODEL=/opt/home/sass/projects/lyhendid/tasks/etl/results/word2vec/all.snts.word.wvm

Then export an environment variable CONFIG that points to the configuration file:

$ export CONFIG=<configuration file path>
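Before starting the resolver it can be handy to check that the configuration file parses and that both entries are present. The following is a minimal sketch using only the standard library. The helper name read_model_paths is hypothetical and this is not necessarily how abbreviation resolver itself reads the file; only the [MODEL] section, its two keys and the CONFIG variable are taken from the description above.

```python
import os
import configparser  # this module is named ConfigParser on Python 2.7


def read_model_paths(config_path=None):
    """Return the two model paths from the [MODEL] section.

    Falls back to the CONFIG environment variable when no path is given.
    (Hypothetical helper, not part of the abresolver package.)
    """
    config_path = config_path or os.environ["CONFIG"]
    parser = configparser.ConfigParser()
    # read() returns the list of files it managed to parse
    if not parser.read(config_path):
        raise IOError("cannot read configuration file: " + config_path)
    return {
        "abbreviation_model": parser.get("MODEL", "ABBREVIATION_MODEL"),
        "word2vec_model": parser.get("MODEL", "WORD2VEC_MODEL"),
    }
```

Calling read_model_paths() with no argument after the export step should return both paths, and raises an error if the file or one of the keys is missing.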

Now abbreviation resolver is ready for use:

>>> from abresolver import Text
>>> t = Text(u'kolmas p palavik')
>>> t.tokenize_abs()
[{
  'text': 'p',
  'start': 7,
  'end': 8,
  'expansions': ['päev',
                 'parem',
                 'parietaalne',
                 'pupill',
                 'pool'],
  'scores': [0.99974249284129602,
             0.00013896032431022265,
             0.00010385371199489893,
             9.3145225880433136e-06,
             5.3785998108879645e-06],
  }]

>>> t = Text(u'püsib p pahhüpleuraalne ladestus')
>>> t.tokenize_abs()
[{'text': 'p',
  'start': 6,
  'end': 7,
  'expansions': ['parietaalne',
                 'päev',
                 'parem',
                 'pupill',
                 'pool'],
  'scores': [0.83779262694858747,
             0.072167145074973585,
             0.06692486376766027,
             0.023099431317849875,
             1.5932890928882162e-05],
  }]

A call to tokenize_abs() creates a new layer 'abr' in the Text object, which contains analysis information for each abbreviation or acronym identified in the text. Each analysis entry includes the abbreviation text itself, its start and end positions in the document, and a list of candidate full forms with their corresponding scores. The candidates are sorted by descending score, so the most likely candidate comes first.
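Because the candidates are sorted by score, picking the most likely full form amounts to taking the first element of each list. A small sketch that works on the plain list of dictionaries shown above; the function name best_expansions and the min_score threshold are hypothetical and not part of the library:

```python
def best_expansions(entries, min_score=0.0):
    """Return (abbreviation, start, end, top expansion, top score) tuples.

    `entries` is a list of dictionaries shaped like the tokenize_abs()
    output above. `min_score` is a hypothetical cut-off for dropping
    abbreviations whose best candidate is not confident enough.
    """
    results = []
    for entry in entries:
        if not entry["expansions"]:
            continue  # no candidates were proposed for this abbreviation
        # Candidates are sorted by descending score, so index 0 is the best.
        top_term = entry["expansions"][0]
        top_score = entry["scores"][0]
        if top_score >= min_score:
            results.append(
                (entry["text"], entry["start"], entry["end"], top_term, top_score)
            )
    return results


# Entry copied from the first example above.
entries = [{
    "text": "p",
    "start": 7,
    "end": 8,
    "expansions": [u"päev", u"parem", u"parietaalne", u"pupill", u"pool"],
    "scores": [0.99974249284129602,
               0.00013896032431022265,
               0.00010385371199489893,
               9.3145225880433136e-06,
               5.3785998108879645e-06],
}]
print(best_expansions(entries, min_score=0.9))
```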

These attributes can be accessed individually using the corresponding properties:

>>> t = Text(u'püsib p pahhüpleuraalne ladestus. kolmas p palavik')
>>> t.abr_texts
['p', 'p']
>>> t.abr_spans
[(6, 7), (41, 42)]
>>> t.abr_expansions
[['parietaalne', 'päev', 'parem', 'pupill', 'pool'],
 ['päev', 'parem', 'parietaalne', 'pupill', 'pool']]
>>> t.abr_scores
[[0.6196715074809509,
  0.36973995956261818,
  0.0097006165941946505,
  0.00087920614701522952,
  8.7102152210321713e-06],
 [0.99974249284129602,
  0.00013896032431022265,
  0.00010385371199489893,
  9.3145225880433136e-06,
  5.3785998108879645e-06]]
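The spans and expansions can be combined to write the full forms back into the text. A minimal sketch, assuming the values of abr_spans and abr_expansions are passed in as plain lists like the ones printed above; expand_abbreviations is a hypothetical helper, not a function provided by the library:

```python
def expand_abbreviations(text, spans, expansions):
    """Replace each abbreviation span with its top-ranked expansion."""
    pairs = sorted(zip(spans, expansions), key=lambda pair: pair[0], reverse=True)
    # Work from the last span to the first so that the offsets of the
    # earlier spans stay valid after each replacement.
    for (start, end), candidates in pairs:
        if candidates:
            text = text[:start] + candidates[0] + text[end:]
    return text


text = u"püsib p pahhüpleuraalne ladestus. kolmas p palavik"
spans = [(6, 7), (41, 42)]
expansions = [
    [u"parietaalne", u"päev", u"parem", u"pupill", u"pool"],
    [u"päev", u"parem", u"parietaalne", u"pupill", u"pool"],
]
print(expand_abbreviations(text, spans, expansions))
# püsib parietaalne pahhüpleuraalne ladestus. kolmas päev palavik
```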

Data

Abbreviation resolver requires two data files, an abbreviation model and a word2vec model, which are not included in the package for data protection reasons.

Abbreviation Model

The abbreviation model provides probabilities P(term|abbreviation) estimated from a training corpus. The model is stored in a .csv file with the columns term, abbreviation, and P(term|abbreviation), e.g.

t a P(t|a)
temperatuur t 0.383632
tund t 0.242967
tänav t 0.005115
tumor t 0.056266
diameeter d 0.669767
diagnoos d 0.304651
distants d 0.016279
disc d 0.009302
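To inspect such a file, the rows can be grouped by abbreviation. The sketch below is an illustration under assumptions: the space delimiter mirrors the sample as displayed here and may differ in a real model file, and load_abbreviation_model is a hypothetical helper rather than the loader used by the library.

```python
import csv
import io
from collections import defaultdict


def load_abbreviation_model(stream, delimiter=" "):
    """Map each abbreviation to its (term, probability) pairs, best first.

    The delimiter is an assumption; adjust it to match the actual file.
    """
    model = defaultdict(list)
    reader = csv.reader(stream, delimiter=delimiter)
    next(reader)  # skip the header row
    for row in reader:
        if not row:
            continue  # ignore blank lines
        term, abbreviation, probability = row
        model[abbreviation].append((term, float(probability)))
    for candidates in model.values():
        candidates.sort(key=lambda pair: pair[1], reverse=True)
    return dict(model)


# Rows taken from the sample table above.
sample = io.StringIO(u"""t a P(t|a)
temperatuur t 0.383632
tund t 0.242967
tänav t 0.005115
tumor t 0.056266
diameeter d 0.669767
diagnoos d 0.304651
distants d 0.016279
disc d 0.009302
""")
model = load_abbreviation_model(sample)
print(model["d"][0])
```

Since the values are conditional probabilities, the entries for one abbreviation should sum to about 1 when all of its terms are listed, which makes a quick sanity check for a model file.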

Word2vec Model

The word2vec model makes it possible to estimate how well individual words, such as abbreviation full forms, fit the sentence context. Word2vec models can be trained using the gensim or word2vec software. To load the model, abbreviation resolver uses the gensim API:

gensim.models.Word2Vec.load(model_file_name)

Pre-trained general-purpose word2vec models for Estonian can be obtained from https://github.com/estnltk/word2vec-models.
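To give an intuition for what "fitting the sentence context" can mean, the sketch below compares a candidate's vector with the average vector of the surrounding words using cosine similarity. This is one common way to use word vectors and is an assumption, not a description of the scoring that abbreviation resolver actually implements. The 3-dimensional vectors are invented for illustration; a real model would supply vectors with far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_fit(candidate, context_words, vectors):
    """Similarity between a candidate and the mean vector of its context."""
    known = [vectors[w] for w in context_words if w in vectors]
    if candidate not in vectors or not known:
        return 0.0  # nothing to compare against
    mean = [sum(column) / len(known) for column in zip(*known)]
    return cosine(vectors[candidate], mean)


# Hypothetical toy vectors, invented for this example only.
toy_vectors = {
    u"palavik": [0.9, 0.1, 0.0],
    u"kolmas": [0.7, 0.3, 0.1],
    u"päev": [0.8, 0.2, 0.1],
    u"pupill": [0.0, 0.2, 0.9],
}
context = [u"kolmas", u"palavik"]
ranked = sorted(
    [u"pupill", u"päev"],
    key=lambda term: context_fit(term, context, toy_vectors),
    reverse=True,
)
print(ranked)
```

With these toy vectors the candidate whose vector points in the same direction as the context words is ranked first.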