prolog-talk icon indicating copy to clipboard operation
prolog-talk copied to clipboard

An English Grammar Dictionary and a Natural Language Parser --- in Common Lisp and Prolog

  • PROLOG-TALK

** Description This project is written in Common Lisp and an implementation of Prolog built on top of it (https://code.google.com/p/cl-gambol/), for the 2013 summer programming contest Lisp in Summer Projects (http://lispinsummerprojects.org/).

With this project I wanted to create a better english dictionary, freely usable for AI programming in common lisp. The other objective was exploring the field of natural language parsing, building a parser for standard english sentences.

*** The English Grammar Dictionary At the beginning of the project I tried to use the module Basic English Grammar (http://www.cliki.net/Basic-English-Grammar). Although it is a nice starting point for simple experiments it lacks some indispensable relationships between words, and only contains ~10000 not unique words (i.e. the conjugations of verbs are counted, etc...).

I built a new /English Grammar dictionary/ exploiting the niceties of the prolog language to define relationships between words. As a reference I used the Webster Unabridged Dictionary from Project Gutenberg (http://www.gutenberg.org/ebooks/673). This dictionary is no longer covered by copyrights and contains about 110000 unique words (i.e. plurals, conjugations are not counted), more than ten times larger Basic English Grammar.

In addition some useful grammar's relationships:

  1. Verbs are marked as intransitive or transitive and are linked to their third person, imperfect, past participle, and -ing forms.
  2. Nouns are linked to their plurals.
  3. Adjectives are linked to their comparative and superlative forms.
  4. The dictionary includes pronouns, adverbs and prepositions.

Since the dictionary is simply a text file containing prolog facts it can be easily reformatted to be used with other programming languages.

*** The Natural Language Parser This part of the project is an experiment I used to improve my knowledge of linguistics and natural language processing.

The prolog-talk.lisp file is an english sentence parser, written using a mix of lisp and prolog. In particular:

  • facts (i.e. words and their categories) are coded in prolog to exploit it's automatic backtracking. Is thus very easy to add new words or to define new relationships between them.
  • finite state automatas are also prolog fact, but written in a special DSL using lisp macros. Sometimes prolog programs needs long lists of facts, and a macro system is perfect to generate them. This was the /main reason for the common-lisp/prolog/ coupling.

**** Usage

The gambol module is required to run the script. You can use quicklisp (http://www.quicklisp.org/) to install it, or if your common-lisp implementation imcludes asdf you can use asdf-install (deprecated). Here we assume gambol is correctly installed and the lisp implementation is SBCL.

To run the program (inside the directory): #+BEGIN_SRC $ sbcl --script ./prolog-talk.lisp #+END_SRC

After the launch, the program loads the dictionary. This can require some time since the prolog implementation we use is not optimized for big databases of facts.

When the loading process completes you are free to enter english sentences: #+BEGIN_SRC Loading dictionary: A.B.C.D.E.F.G.H.I.J.K.L.M.N.O.P.Q.R.S.T.U.V.W.X.Y.Z. input> #+END_SRC

The program is not perfect, but can handle english phrases of a certain complexity: #+BEGIN_SRC input> the monkey heard about the very next ship which is yellow and green (:SENTENCE (:NOMINAL (:ARTICLE THE) (:NOUN MONKEY)) (:VERBAL :INTRANSITIVE (:PAST-SIMPLE (:IMPERFECT HEARD)) (:PREPOSITIONAL (:PREP ABOUT) (:NOMINAL (:ARTICLE THE) (:ADJECTIVAL (:ADVERBAL (:ADVERB VERY)) (:ADJ NEXT)) (:NOUN SHIP) (:PRONOMINAL (:GAP-PRONOUN WHICH) (:SENTENCE (:GAP) (:VERBAL :INTRANSITIVE (:PRESENT-SIMPLE (:VERB-I IS)) (:ADJECTIVAL (:ADJ YELLOW) (:CONJ AND) (:ADJ GREEN))))))))) #+END_SRC

Without taking into account the senselessness of the sentence, we see that the sintactic structure is correctly recognized, even the subordinate clause which includes a gap.

The structure is better visualized using a tree. We can generate visual graphs instead of lisp-style trees giving the command: #+BEGIN_SRC dot output #+END_SRC

The same sentence generates this graph: #+BEGIN_SRC digraph G {

"SENTENCE NIL" -> "NOMINAL (1)" "NOMINAL (1)" -> "ARTICLE (1 1)" "ARTICLE (1 1)" -> "THE (1 1 1)" "THE (1 1 1)" [shape=box,style=filled,color=".7 .3 1.0"]

"NOMINAL (1)" -> "NOUN (1 2)" "NOUN (1 2)" -> "MONKEY (1 2 1)" "MONKEY (1 2 1)" [shape=box,style=filled,color=".7 .3 1.0"]

"SENTENCE NIL" -> "VERBAL (2)" "VERBAL (2)" -> "INTRANSITIVE (2 1)" "INTRANSITIVE (2 1)" [style=filled,color=yellow] "VERBAL (2)" -> "PAST-SIMPLE (2 2)" "PAST-SIMPLE (2 2)" -> "IMPERFECT (2 2 1)" "IMPERFECT (2 2 1)" -> "HEARD (2 2 1 1)" "HEARD (2 2 1 1)" [shape=box,style=filled,color=".7 .3 1.0"]

"VERBAL (2)" -> "PREPOSITIONAL (2 3)" "PREPOSITIONAL (2 3)" -> "PREP (2 3 1)" "PREP (2 3 1)" -> "ABOUT (2 3 1 1)" "ABOUT (2 3 1 1)" [shape=box,style=filled,color=".7 .3 1.0"]

"PREPOSITIONAL (2 3)" -> "NOMINAL (2 3 2)" "NOMINAL (2 3 2)" -> "ARTICLE (2 3 2 1)" "ARTICLE (2 3 2 1)" -> "THE (2 3 2 1 1)" "THE (2 3 2 1 1)" [shape=box,style=filled,color=".7 .3 1.0"]

"NOMINAL (2 3 2)" -> "ADJECTIVAL (2 3 2 2)" "ADJECTIVAL (2 3 2 2)" -> "ADVERBAL (2 3 2 2 1)" "ADVERBAL (2 3 2 2 1)" -> "ADVERB (2 3 2 2 1 1)" "ADVERB (2 3 2 2 1 1)" -> "VERY (2 3 2 2 1 1 1)" "VERY (2 3 2 2 1 1 1)" [shape=box,style=filled,color=".7 .3 1.0"]

"ADJECTIVAL (2 3 2 2)" -> "ADJ (2 3 2 2 2)" "ADJ (2 3 2 2 2)" -> "NEXT (2 3 2 2 2 1)" "NEXT (2 3 2 2 2 1)" [shape=box,style=filled,color=".7 .3 1.0"]

"NOMINAL (2 3 2)" -> "NOUN (2 3 2 3)" "NOUN (2 3 2 3)" -> "SHIP (2 3 2 3 1)" "SHIP (2 3 2 3 1)" [shape=box,style=filled,color=".7 .3 1.0"]

"NOMINAL (2 3 2)" -> "PRONOMINAL (2 3 2 4)" "PRONOMINAL (2 3 2 4)" -> "GAP-PRONOUN (2 3 2 4 1)" "GAP-PRONOUN (2 3 2 4 1)" -> "WHICH (2 3 2 4 1 1)" "WHICH (2 3 2 4 1 1)" [shape=box,style=filled,color=".7 .3 1.0"]

"PRONOMINAL (2 3 2 4)" -> "SENTENCE (2 3 2 4 2)" "SENTENCE (2 3 2 4 2)" -> "GAP (2 3 2 4 2 1)" "GAP (2 3 2 4 2 1)" [style=filled,color=red] "SENTENCE (2 3 2 4 2)" -> "VERBAL (2 3 2 4 2 2)" "VERBAL (2 3 2 4 2 2)" -> "INTRANSITIVE (2 3 2 4 2 2 1)" "INTRANSITIVE (2 3 2 4 2 2 1)" [style=filled,color=yellow] "VERBAL (2 3 2 4 2 2)" -> "PRESENT-SIMPLE (2 3 2 4 2 2 2)" "PRESENT-SIMPLE (2 3 2 4 2 2 2)" -> "VERB-I (2 3 2 4 2 2 2 1)" "VERB-I (2 3 2 4 2 2 2 1)" -> "IS (2 3 2 4 2 2 2 1 1)" "IS (2 3 2 4 2 2 2 1 1)" [shape=box,style=filled,color=".7 .3 1.0"]

"VERBAL (2 3 2 4 2 2)" -> "ADJECTIVAL (2 3 2 4 2 2 3)" "ADJECTIVAL (2 3 2 4 2 2 3)" -> "ADJ (2 3 2 4 2 2 3 1)" "ADJ (2 3 2 4 2 2 3 1)" -> "YELLOW (2 3 2 4 2 2 3 1 1)" "YELLOW (2 3 2 4 2 2 3 1 1)" [shape=box,style=filled,color=".7 .3 1.0"]

"ADJECTIVAL (2 3 2 4 2 2 3)" -> "CONJ (2 3 2 4 2 2 3 2)" "CONJ (2 3 2 4 2 2 3 2)" -> "AND (2 3 2 4 2 2 3 2 1)" "AND (2 3 2 4 2 2 3 2 1)" [shape=box,style=filled,color=".7 .3 1.0"]

"ADJECTIVAL (2 3 2 4 2 2 3)" -> "ADJ (2 3 2 4 2 2 3 3)" "ADJ (2 3 2 4 2 2 3 3)" -> "GREEN (2 3 2 4 2 2 3 3 1)" "GREEN (2 3 2 4 2 2 3 3 1)" [shape=box,style=filled,color=".7 .3 1.0"]

} #+END_SRC

We can save the output and generate the image with the command: #+BEGIN_SRC $ dot -Tpng sentence.dot -o sentence.png #+END_SRC

The image for this particular sentence is: [[file:sentence.png]]

One last option needs explanations. With the command multiple trees the program will try to parse sentences that present ambiguity in every possible way. For example: #+BEGIN_SRC input> multiple trees input> i look at the apples on the tree ((:SENTENCE (:NOMINAL (:PRONOUN I)) (:VERBAL :INTRANSITIVE (:PRESENT-SIMPLE (:VERB-I LOOK)) (:PREPOSITIONAL (:PREP AT) (:NOMINAL (:ARTICLE THE) (:NOUN APPLES) (:PREPOSITIONAL (:PREP ON) (:NOMINAL (:ARTICLE THE) (:NOUN TREE))))))) (:SENTENCE (:NOMINAL (:PRONOUN I)) (:VERBAL :INTRANSITIVE (:PRESENT-SIMPLE (:VERB-I LOOK)) (:PREPOSITIONAL (:PREP AT) (:NOMINAL (:ARTICLE THE) (:NOUN APPLES))) (:PREPOSITIONAL (:PREP ON) (:NOMINAL (:ARTICLE THE) (:NOUN TREE)))))) #+END_SRC

Here two parse trees are returned. In the first one the prepositional syntagm "on the tree" is inside the nominal syntagm "the apples"; this means that the apples I'm looking at, are on a tree. In the second case "on the tree" is inside the verbal syntagm; this means that I'm looking while I am on the tree (not the apples).

This kind of ambiguity can not be reduced without using some semantic method (and even in that case both senses are possible, so the problem probably requires also some probabilistic method...)

Note: this modality consumes a LOT of memory, I need to switch or to write a different implementation of prolog to resolve this.

** Future Developments

There are some improvements I would like to do to in future, if I will be able to reconcile study and my social life, in the next few months.

*** Dictionary Improvements

  1. I plan to add relationships for synonyms and antonyms, to create a base for a semantic development of the parser.
  2. I should also add a database of personal nouns to improve nominal syntagms recognition.

*** Parser Improvements

  1. Apply optimizations (or even rewrite some parts of the prolog interpreter) to improve the performance. Because, really, it uses too much memory.
  2. Reduce the ambiguity of verbal syntagms when prepositions and adverbs are present in sentences (this probably requires some semantics).
  3. Add semantical parsing.

*** Applications built on top of the system

  • I would like to experiment with a languange translation tool built on top of the parser. I want to program a non-literal translator, i.e. a translator that reconstrunct sentences in different languages exploiting relationships between words and not their positions.

For example: /A red cat crossed the street while i was driving./

Can be translated literally as: /Un chat rouge a traversé la rue alors que je conduisais./

But can be also expressed as: /J'étais en voiture et un chat rouge traversé la rue./ With a minimal shift of meaning.

A non-literal translator should also be able to introduce synonyms for common words to avoid repetitions.

  • Adding an /internal knowledge representation system/, it should be possible to create an interface that can respond to user input (in a natural language), matching it against some features of the input internal representation.

More simply: developing something like Siri (http://www.apple.com/ios/siri/) or Iris (https://play.google.com/store/apps/details?id=com.dexetra.iris) using a specialized regexp language working on natural language parse trees.

** Further Reading

An excellent tutorial to familiarize with NLP is: Natural Language Processing Techniques in Prolog by Patrick Blackburn and Kristina Striegnitz (http://cs.union.edu/~striegnk/courses/nlp-with-prolog/html/). I used their tutorial as a starting point for this project.

If you also want to add semantic knowledge about the language to your program, some courses about Computational Semantics are available at http://www.let.rug.nl/bos/comsem/.