nltk
nltk copied to clipboard
Grammar fixes
Hey @stefkauf! You might've noticed that the tests are failing for this. The reason for this is explained in this comment: https://github.com/nltk/nltk/pull/2822#issuecomment-928037649. I hope this helps.
Thanks, @tomaarsen, the problem was a bit mysterious to me. I've fixed the files with pre-commit and updated the pull request.
I understand! I'm still unsure how to help new contributors with it in an intuitive way.
Upon reading your commit comment:
Conversion to Chomsky Normal Form (CNF), implemented as the '.chomsky_normal_form()' method in grammar.CFG instances, was incomplete. In particular, it could not deal with empty productions or mixed productions (with terminals and non-terminals co-occuring on the right-hand side). The new implementation removes those shortcomings. Like in the earlier version, the individual steps of conversion are carried out by calling class methods.
I presume these are related to the following Errors: https://github.com/nltk/nltk/blob/68e4e589eb62750993aabef5554548973fe1e280/nltk/grammar.py#L746-L749 https://github.com/nltk/nltk/blob/68e4e589eb62750993aabef5554548973fe1e280/nltk/grammar.py#L751-L756
Would it be possible to provide some tests that cover these new cases? For example in:
-
nltk/test/grammar.doctest (which can be tested with
pytest nltk/test/grammar.doctest
) - nltk/test/grammartestsuites.doctest
- nltk/test/parse.doctest
- or if you prefer pytest, then in a new file under
nltk/test/unit/test_grammar.py
.
That way, even people like myself who know about Chomsky Normal Form, but don't know the algorithm, can get some confidence in the correctness of the implementation.
I've been working on some doctest examples. But it is tricky because the requirements for matching outputs are so stringent. For instance, the following grammar for the language a^nb^n:
S -> 'a' S 'b'
S ->
was on one run converted to this one, which I put in the doctest file:
S0 ->
S0 -> T0 B1
S0 -> T0 T1
B0 -> S T1
B1 -> S T1
S -> T0 B0
S -> T0 T1
T0 -> 'a'
T1 -> 'b'
But sometimes I get a different (but equivalent) result, for instance this one, which differs only in the order in which the new nonterminals were created:
S0 ->
S0 -> T1 B0
S0 -> T1 T0
B0 -> S T0
B1 -> S T0
S -> T1 B1
S -> T1 T0
T0 -> 'b'
T1 -> 'a'
There is some arbitrariness in the order in which new nonterminals are created because I use sets. I'm a bit reluctant to change this just to make doctest stop complaining. Would it be acceptable to use some other way to illustrate how it works? For instance, I could put a 'demo' function at the end of grammar.py, for users to invoke.
You may also want to consider unit tests for this. Stick with the ones that give you the biggest bang for the buck: simple to write and can cover more ground. I would argue pytest
may be a better choice here than doctest.
@iliakur Actually, in my latest update I've solved the problem by using dictionaries instead of sets where the order mattered. Since dictionaries preserve order of insertion, that eliminated the arbitrary variation.
Well done on adding those doctests! I've handled a merge conflict between this PR and changes introduced from #2888. I'll have a look if I can inspect the implementation itself more thoroughly soon.
Apologies, I have not yet had time. I've been quite busy.
@stefkauf thanks for this great contribution... I've reviewed the code, and just think some of your comments can go, as noted, then we can merge this.
@stevenbird Thanks, Steven, for your messages. I've removed those comments and committed the new version of grammar.py.
What is the status of this @stevenbird?
Hi Rob,
you didn't get a reply about this, did you? I can't tell you what became of it. I have the impression that NLTK is not managed all that actively anymore, but I might be wrong. In any case, I'm not sure.
Hope things are well, Stefan
Department of Linguistics, University of Connecticut http://stefan-kaufmann.uconn.edu/
From: Rob Malouf @.> Sent: Tuesday, March 5, 2024 7:12 PM To: nltk/nltk @.> Cc: Kaufmann, Stefan @.>; Mention @.> Subject: Re: [nltk/nltk] Grammar fixes (PR #2884)
Message sent from a system outside of UConn.
What is the status of this?
— Reply to this email directly, view it on GitHubhttps://github.com/nltk/nltk/pull/2884#issuecomment-1979848617, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AWNJMKBO3GBHCAMYLHMVEX3YWZNQHAVCNFSM5H7O7B5KU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOJXHE4DIOBWGE3Q. You are receiving this because you were mentioned.Message ID: @.***>