Pearlli streamline model chat unit test
Patch description
- Moved all logic inside `._check_final_chat_data()` and `._check_output_key()` in the class `AbstractModelChatTest` into `._remove_non_deterministic_keys()`
- Changed 4 unit tests, `test_model_chat.py`, `test_model_image_chat.py`, `test_demo_chat.py`, and `test_qa_data_collection.py`, to use pytest-regressions, which replaces `._check_output_key()` in `AbstractParlAIChatTest`
- Deleted `expected_states` for the above unit tests and added the `.yml` files generated by pytest-regressions
- Updated the `_test_agent_states()` signature to no longer include `expected_states`, because it is no longer needed
Tests:
- `pytest ~/ParlAI/tests/crowdsourcing/tasks/qa_data_collection/test_qa_data_collection.py`
- `pytest ~/ParlAI/tests/crowdsourcing/tasks/test_chat_demo.py`
- `pytest ~/ParlAI/tests/crowdsourcing/tasks/model_chat/test_model_chat.py`
- `pytest ~/ParlAI/tests/crowdsourcing/tasks/model_chat/test_model_image_chat.py`
Hmm could you add the commands to run the 4 unit tests that you've ported over?
Ah, make sure to delete the unused imports inside parlai/crowdsourcing/utils/tests.py :)
This PR has not had activity in 30 days. Closing due to staleness.
Hmm, I'll try to look into this on Friday if I have time. The PR is all ready to go except for a crowdsourcing check that fails on the CI box but not on a devfair, and I want to understand why.
Any updates on this PR?
Not yet, will merge from main and look into this again later today, thanks for the ping
@JackUrb do you have a sense of what might be causing the tests/crowdsourcing/tasks/model_chat/test_model_image_chat.py::TestModelImageChat::test_base_task CI check to fail here?
TLDR: it's failing because there's only one message in `agent.state.get_data()` at the end of the test, instead of the expected 19-ish at the end of all turns. I've added a `print(f'NUM MESSAGES: {len(self.messages):d}')` line to `ParlAIChatAgentState.get_data()`, and if you look at the logs of the failing check here, it looks like this print statement is being called twice per turn instead of once:
(1) In "Captured stdout call", you see NUM MESSAGES print statements that stay constant at 1
(2) In "Captured stdout teardown" (not sure why some of the logs are diverted here), you see NUM MESSAGES print statements that increment to the expected number
Is this behavior familiar to you at all? Is it possible that there are 2 copies of the same agent, one that is receiving new messages and one that isn't?
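One cheap way to test the two-copies hypothesis is to tag each `get_data()` call with the Python object id of the state instance. This is a hypothetical debugging sketch, not the real ParlAI class; only the class name mirrors the one in the thread:

```python
# Hypothetical sketch: a stand-in agent state whose get_data() prints the
# object id alongside the message count. If the CI logs show two distinct
# ids, there really are two copies of the state, one receiving messages
# and one not.
class ParlAIChatAgentState:
    def __init__(self):
        self.messages = []

    def get_data(self):
        # id(self) distinguishes copies that an equality check would not
        print(f'NUM MESSAGES ({id(self):#x}): {len(self.messages):d}')
        return {'messages': list(self.messages)}

live = ParlAIChatAgentState()   # the copy the world writes to
stale = ParlAIChatAgentState()  # a second copy the test might be reading
live.messages.append({'text': 'hello'})
assert id(live) != id(stale)
```

If the ids printed in "Captured stdout call" and "Captured stdout teardown" differ, that would line up with the constant-at-1 versus incrementing counts described above.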
I'm having a hard time parsing how the above is occurring. It would appear two agents are being established, however I would need more log info to debug, like the actual agent that's present in the above print call. I haven't really seen this type of behavior though, considering the world is actually playing out I find it really strange that data isn't getting saved here, especially if this isn't reproducible on devfair and only occurs on CI.
Yeah, it sometimes but not always occurs when I run on devfair, and it always occurs on CI. Hmm - @JackUrb do you happen to know offhand if there's another way to find agents other than self.db.find_agents(), to tell if/how two different agents are being established? If not, no worries - I'll have to dig into this a lot more when I can.
@klshuster if I'm too slow on fixing this CI test, it's fine with me if you want to close the PR for now for cleanliness purposes. Pearl worked hard on this, and the streamlining that she did to the unit tests will make them much easier to maintain in the future, but since it's a relatively low-priority fix I might not be able to work on it more immediately
Hi @pearlli98!
Thank you for your pull request.
We require contributors to sign our Contributor License Agreement, and yours needs attention.
You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
@JackUrb @EricMichaelSmith is there any update here?
@JackUrb I'm finding that the test fails because the messages aren't getting saved until the test class's ._teardown() method, presumably during self.operator.force_shutdown() or self.server.shutdown_mock(), and thus the part of the test that reads the messages back in can't see the messages because they aren't saved yet. Is this familiar to you at all?
@klshuster Urrg... looking at my timeline I realistically may not have time to debug this further until mid-October-ish, given ICLR, BlenderBot integrity, EMNLP revisions, PTO, etc. Feel free to close it if you'd like, and I have a note to come back to it later when I have time!
> I'm finding that the test fails because the messages aren't getting saved until the test class's `._teardown()` method, presumably during `self.operator.force_shutdown()` or `self.server.shutdown_mock()`, and thus the part of the test that reads the messages back in can't see the messages because they aren't saved yet. Is this familiar to you at all?
Not directly familiar, but feels a little reminiscent of some of the bugs where I had to update tests to run things in async contexts to be sure that the background (async) thread was able to execute the pending jobs. Here's an example of some of the helpers we've put together for this type of test on the Mephisto side, though they're not in the public api yet.
Admittedly this was prior to me learning how to make asyncio test fixtures, so perhaps the longer-term solution will be to expose an operator._await_pending_activity() (closest to this one, but with actual task filtering) that could be used in tests, and then that could be used here.
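The general shape of such a helper, polling until pending background (async) work has flushed, might look like this. This is a minimal sketch under stated assumptions: `await_condition` is a hypothetical name, not a Mephisto or ParlAI API, and the background save here is a stand-in for the real message-saving job:

```python
import asyncio

# Hypothetical sketch of an "await pending activity" style helper: poll a
# condition, yielding control between checks so background async tasks
# (such as the job that saves messages) get a chance to run before the
# test reads the data back.
async def await_condition(condition, timeout=5.0, interval=0.05):
    deadline = asyncio.get_running_loop().time() + timeout
    while asyncio.get_running_loop().time() < deadline:
        if condition():
            return True
        await asyncio.sleep(interval)  # yield so pending tasks can execute
    return condition()

async def main():
    saved = []

    async def background_save():
        # stand-in for the async save that otherwise only completes
        # at teardown
        await asyncio.sleep(0.1)
        saved.append('message')

    asyncio.create_task(background_save())
    # Without this wait, a synchronous read of `saved` here would see
    # nothing, which mirrors the empty-messages symptom in this PR.
    return await await_condition(lambda: len(saved) > 0)

result = asyncio.run(main())
print('messages saved before read:', result)
```

The design point is that the wait happens inside the event loop, so the background save is not starved the way it would be by a blocking `time.sleep()` in the test body.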
This PR has not had activity in 30 days. Closing due to staleness.