nltk icon indicating copy to clipboard operation
nltk copied to clipboard

Grammar fixes

Open stefkauf opened this issue 3 years ago • 15 comments

stefkauf avatar Nov 14 '21 05:11 stefkauf

Hey @stefkauf! You might've noticed that the tests are failing for this. The reason for this is explained in this comment: https://github.com/nltk/nltk/pull/2822#issuecomment-928037649. I hope this helps.

tomaarsen avatar Nov 16 '21 17:11 tomaarsen

Thanks, @tomaarsen, the problem was a bit mysterious to me. I've fixed the files with pre-commit and updated the pull request.

stefkauf avatar Nov 16 '21 18:11 stefkauf

I understand! I'm still unsure how to help new contributors with it in an intuitive way.

tomaarsen avatar Nov 16 '21 18:11 tomaarsen

Upon reading your commit comment:

Conversion to Chomsky Normal Form (CNF), implemented as the '.chomsky_normal_form()' method in grammar.CFG instances, was incomplete. In particular, it could not deal with empty productions or mixed productions (with terminals and non-terminals co-occuring on the right-hand side). The new implementation removes those shortcomings. Like in the earlier version, the individual steps of conversion are carried out by calling class methods.

I presume these are related to the following Errors: https://github.com/nltk/nltk/blob/68e4e589eb62750993aabef5554548973fe1e280/nltk/grammar.py#L746-L749 https://github.com/nltk/nltk/blob/68e4e589eb62750993aabef5554548973fe1e280/nltk/grammar.py#L751-L756

Would it be possible to provide some tests that cover these new cases? For example in:

That way, even people like myself who know about Chomsky Normal Form, but don't know the algorithm, can get some confidence in the correctness of the implementation.

tomaarsen avatar Nov 17 '21 08:11 tomaarsen

I've been working on some doctest examples. But it is tricky because the requirements for matching outputs are so stringent. For instance, the following grammar for the language a^nb^n:

    S -> 'a' S 'b'
    S -> 

was on one run converted to this one, which I put in the doctest file:

    S0 -> 
    S0 -> T0 B1
    S0 -> T0 T1
    B0 -> S T1
    B1 -> S T1
    S -> T0 B0
    S -> T0 T1
    T0 -> 'a'
    T1 -> 'b'

But sometimes I get a different (but equivalent) result, for instance this one, which differs only in the order in which the new nonterminals were created:

    S0 -> 
    S0 -> T1 B0
    S0 -> T1 T0
    B0 -> S T0
    B1 -> S T0
    S -> T1 B1
    S -> T1 T0
    T0 -> 'b'
    T1 -> 'a'

There is some arbitrariness in the order in which new nonterminals are created because I use sets. I'm a bit reluctant to change this just to make doctest stop complaining. Would it be acceptable to use some other way to illustrate how it works? For instance, I could put a 'demo' function at the end of grammar.py, for users to invoke.

stefkauf avatar Nov 19 '21 05:11 stefkauf

You may also want to consider unit tests for this. Stick with the ones that give you the biggest bang for the buck: simple to write and can cover more ground. I would argue pytest may be a better choice here than doctest.

iliakur avatar Nov 19 '21 19:11 iliakur

@iliakur Actually, in my latest update I've solved the problem by using dictionaries instead of sets where the order mattered. Since dictionaries preserve order of insertion, that eliminated the arbitrary variation.

stefkauf avatar Nov 19 '21 20:11 stefkauf

Well done on adding those doctests! I've handled a merge conflict between this PR and changes introduced from #2888. I'll have a look if I can inspect the implementation itself more thoroughly soon.

tomaarsen avatar Nov 20 '21 00:11 tomaarsen

Apologies, I have not yet had time. I've been quite busy.

tomaarsen avatar Dec 01 '21 15:12 tomaarsen

@stefkauf thanks for this great contribution... I've reviewed the code, and just think some of your comments can go, as noted, then we can merge this.

stevenbird avatar Jul 05 '22 12:07 stevenbird

@stevenbird Thanks, Steven, for your messages. I've removed those comments and committed the new version of grammar.py.

stefkauf avatar Jul 06 '22 11:07 stefkauf

What is the status of this @stevenbird?

rmalouf avatar Mar 06 '24 00:03 rmalouf

Hi Rob,

you didn't get a reply about this, did you? I can't tell you what became of it. I have the impression that NLTK is not managed all that actively anymore, but I might be wrong. In any case, I'm not sure.

Hope things are well, Stefan


Department of Linguistics, University of Connecticut http://stefan-kaufmann.uconn.edu/


From: Rob Malouf @.> Sent: Tuesday, March 5, 2024 7:12 PM To: nltk/nltk @.> Cc: Kaufmann, Stefan @.>; Mention @.> Subject: Re: [nltk/nltk] Grammar fixes (PR #2884)

Message sent from a system outside of UConn.

What is the status of this?

— Reply to this email directly, view it on GitHubhttps://github.com/nltk/nltk/pull/2884#issuecomment-1979848617, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AWNJMKBO3GBHCAMYLHMVEX3YWZNQHAVCNFSM5H7O7B5KU5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOJXHE4DIOBWGE3Q. You are receiving this because you were mentioned.Message ID: @.***>

stefkauf avatar Apr 10 '24 15:04 stefkauf