
MemoryStore: replace mongomock with pymongo-inmemory

Open rkingsbury opened this issue 1 year ago • 6 comments

Summary

Major changes:

  • replace mongomock with pymongo-inmemory
  • eliminate a significant amount of bespoke code from MemoryStore, which can now inherit nearly all methods from MongoStore.
  • BREAKING CHANGE - MemoryStore instances created with the same collection name now point to the same collection in memory. This differs from the previous (somewhat unintuitive) behavior and leads to more intuitive __eq__ behavior (see #788 )
  • Modify tests to account for stricter behavior of MemoryStore, e.g., must call connect() again after run().
  • Closes #788
  • Closes #830
  • Some docstring cleanups

pymongo-inmemory creates a thin wrapper around MongoClient() using a downloaded instance of mongod, so it is guaranteed to work and perform exactly like "real" pymongo, unlike mongomock, which has some limitations and inconsistencies.
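To illustrate the breaking change above, here is a minimal pure-Python sketch of the shared-collection semantics. This is an illustration only, not maggma's implementation (which delegates to a real mongod via pymongo-inmemory); `MemoryStoreSketch` and its registry are made-up names.

```python
class MemoryStoreSketch:
    # Hypothetical illustration: instances created with the same
    # collection name share one underlying dict, so two stores with
    # the same name see the same documents and compare equal.
    _registry: dict = {}

    def __init__(self, collection_name="memory_db"):
        self.collection_name = collection_name
        self._data = self._registry.setdefault(collection_name, {})

    def update(self, doc, key="task_id"):
        self._data[doc[key]] = doc

    def query_one(self, task_id):
        return self._data.get(task_id)

    def __eq__(self, other):
        return (
            isinstance(other, MemoryStoreSketch)
            and self.collection_name == other.collection_name
        )


a = MemoryStoreSketch("tasks")
b = MemoryStoreSketch("tasks")
a.update({"task_id": 1, "value": 42})
print(b.query_one(1))  # both instances see the same document
print(a == b)          # True
```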

Performance impact

I did a basic timing test for connecting to a FileStore (400 files) on a system with slow (HDD) storage. On first connection, pymongo-inmemory was about 50% slower (19 s vs. 13 s), presumably because it has to download and spin up the mongod instance. On subsequent fs.connect() calls, pymongo-inmemory was at least as fast as mongomock (12 s vs. 13 s).

On a system with faster storage (SSD) and 800 files, pymongo-inmemory and mongomock had comparable speed (26-30 s for pymongo-inmemory vs. 27 s for mongomock).

Todos

If this is work in progress, what else needs to be done?

  • [ ] Pending response to issue, move configuration settings to pyproject.toml - (I need to open a PR for this; not necessary to complete before merging this PR).
  • [ ] Resolve remaining CI problems for which I need assistance. See below.
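For reference, the pyproject.toml move mentioned above could look something like the sketch below. The section and key names are assumptions based on pymongo-inmemory's configuration options and should be checked against its documentation before use.

```toml
# Hypothetical sketch of pymongo-inmemory settings in pyproject.toml.
# Key names are assumptions; verify against pymongo-inmemory's docs.
[tool.pymongo_inmemory]
mongo_version = "6.0"   # pin the mongod version to download
mongod_port = 27019     # avoid clashing with a locally running mongod
```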

Remaining test failures / CI problems

I need some help (please!) from @munrojm and maybe @gpetretto to resolve the remaining test failures. I'll break them down one by one.

test_jsonstore_orjson_options (@gpetretto )

The changes here prevent the JSONEncodeError from being raised, for reasons I haven't yet identified.

    def test_jsonstore_orjson_options(test_dir):
        class SubFloat(float):
            pass
    
        with ScratchDir("."):
            jsonstore = JSONStore("d.json", read_only=False)
            jsonstore.connect()
            with pytest.raises(orjson.JSONEncodeError):
>               jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})
E               Failed: DID NOT RAISE <class 'TypeError'>

tests/stores/test_mongolike.py:498: Failed

By updating MemoryStore to use "real" Mongo, I was able to inherit the update method from MongoStore rather than use the custom update method that had previously been used. That custom update is now inside MontyStore. I suspect that a subtle difference between the two is causing this test to fail, but I haven't been able to pin it down. I would appreciate a 2nd set of eyes. I suggest comparing the two update methods side by side as a starting point.

test_submission_patch (@munrojm )

For some reason the API now returns a 400 (Bad Request) instead of a 200. This must somehow be related to owner_store, which is a MemoryStore, but I'm not familiar enough with the API code to pin down the connection.

    def test_submission_patch(owner_store, post_query_op, patch_query_op):
        endpoint = SubmissionResource(
            store=owner_store,
            get_query_operators=[PaginationQuery()],
            post_query_operators=[post_query_op],
            patch_query_operators=[patch_query_op],
            calculate_submission_id=True,
            model=Owner,
        )
        app = FastAPI()
        app.include_router(endpoint.router)
    
        client = TestClient(app)
        update = json.dumps({"last_updated": "2023-06-22T17:32:11.645713"})
    
        assert client.get("/").status_code == 200
>       assert client.patch(f"/?name=PersonAge9&update={update}").status_code == 200
E       assert 400 == 200
E        +  where 400 = <Response [400]>.status_code
E        +    where <Response [400]> = <bound method Session.patch of <starlette.testclient.TestClient object at 0x7f75aa0f4f10>>('/?name=PersonAge9&update={"last_updated": "2023-06-22T17:32:11.645713"}')
E        +      where <bound method Session.patch of <starlette.testclient.TestClient object at 0x7f75aa0f4f10>> = <starlette.testclient.TestClient object at 0x7f75aa0f4f10>.patch

tests/api/test_submission_resource.py:118: AssertionError

ServerSelectionTimeout Errors

Occasionally some tests error out because the MemoryStore times out. I suspect this occurs because pymongo-inmemory is downloading and spinning up its own mongod before it can run the tests. It looks like this wasn't a problem after the most recent update, so perhaps it will take care of itself. If the issue persists, I will experiment with having pymongo-inmemory use the built-in mongo instance provided on the GH runner instead of downloading its own.

Checklist

  • [x] Google format doc strings added.
  • [x] Code linted with ruff. (For guidance on fixing rule violations, see the rule list.)
  • [x] Type annotations included. Check with mypy.
  • [x] Tests added for new features/fixes.
  • [x] I have run the tests locally and they passed.

rkingsbury avatar Aug 25 '23 18:08 rkingsbury

+1 I’ve observed some serious differences between mongomock and pymongo that can result in some tedious debugging! I think the idea of using real Mongo is much preferred.

mkhorton avatar Aug 25 '23 20:08 mkhorton

@munrojm @gpetretto I updated the first post on this PR with a specific request for help from each of you on one of the test failures. I'd appreciate any thoughts you have

(adding this comment b/c I'm not sure if editing the post would trigger an email notification).

rkingsbury avatar Aug 29 '23 20:08 rkingsbury

> @munrojm @gpetretto I updated the first post on this PR with a specific request for help from each of you on one of the test failures. I'd appreciate any thoughts you have
>
> (adding this comment b/c I'm not sure if editing the post would trigger an email notification).

Just committed a fix for the patch test. Was an issue with how the body data was being passed to the patch method of the test client.

munrojm avatar Aug 29 '23 21:08 munrojm

Thanks @rkingsbury for updating the MemoryStore. The limited number of features supported by mongomock was indeed an issue. Concerning the failing test, I ran a few experiments and figured out what is happening. I introduced those features and that test because updates were failing when the document contains a FloatWithUnit. I added that option since orjson does not handle subclasses of native Python types by default, but allows passing a function to deal with values it cannot serialize. With the mongomock implementation, when passing something like

jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})

the dictionary stored internally still contained an instance of SubFloat, so when update_json_file was executed, orjson needed to know how to convert it. It seems that now MongoDB stores the instance of SubFloat as a simple float, so when it is queried back to perform update_json_file, it is never an instance of SubFloat. In the new implementation I can't find a way to trigger that error anymore, as MongoDB already deals with those cases before the data gets to orjson. I am not sure whether it would be better to remove the serialization_default option at this point.
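A pure-stdlib sketch of why the subclass is lost: BSON stores any Python float (including subclasses) as an 8-byte IEEE 754 double, so the encode/decode round trip through mongod cannot preserve the Python type. `struct` serves here as a stand-in for that encode/decode step; this is an illustration, not pymongo's actual code path.

```python
import struct

class SubFloat(float):
    pass

# Encoding: what mongod stores for any float-valued field.
encoded = struct.pack("<d", SubFloat(1.1))
# Decoding: what a query returns - a plain float, subclass info gone.
decoded = struct.unpack("<d", encoded)[0]

print(type(decoded))  # <class 'float'>
```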

gpetretto avatar Aug 30 '23 10:08 gpetretto

> I added that option since orjson does not handle subclasses of native Python types by default, but allows passing a function to deal with values it cannot serialize. With the mongomock implementation, when passing something like
>
> jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})
>
> the dictionary stored internally still contained an instance of SubFloat, so when update_json_file was executed, orjson needed to know how to convert it. It seems that now MongoDB stores the instance of SubFloat as a simple float, so when it is queried back to perform update_json_file, it is never an instance of SubFloat. In the new implementation I can't find a way to trigger that error anymore, as MongoDB already deals with those cases before the data gets to orjson.

Thanks for investigating! So if I understand correctly, the original problem that prompted #791 is resolved by this PR.

I am not sure if it would be better to remove the serialization_default option at this point.

But it sounds like you're saying that if I were to do this:

a = SubFloat(1.1)
jsonstore.update({"wrong_field": a, "task_id": 3})
jsonstore.query({"task_id": 3})

I would get back a float, not a SubFloat, right? Is that a problem? It seems to me there could still be value in the options you added for users that want to be able to serialize / deserialize custom types.
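As a concrete illustration of that use case, here is a stdlib-json analogue of the orjson `default=` hook that serialization_default wraps. `Quantity` and `to_serializable` are made-up names, not maggma API; a non-float custom type is used because stdlib json (unlike orjson, per the failing test above) already accepts float subclasses natively.

```python
import json
from dataclasses import dataclass

@dataclass
class Quantity:
    # Hypothetical custom type standing in for FloatWithUnit-style objects.
    value: float
    unit: str

def to_serializable(obj):
    # Fallback called for values the encoder cannot handle natively;
    # orjson's default= keyword plays the same role.
    if isinstance(obj, Quantity):
        return {"value": obj.value, "unit": obj.unit}
    raise TypeError(f"Cannot serialize {type(obj)!r}")

doc = {"energy": Quantity(1.1, "eV"), "task_id": 3}
print(json.dumps(doc, default=to_serializable))
# → {"energy": {"value": 1.1, "unit": "eV"}, "task_id": 3}
```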

rkingsbury avatar Aug 30 '23 16:08 rkingsbury

> Thanks for investigating! So if I understand correctly, the original problem that prompted #791 is resolved by this PR.

Indeed, I think this would have avoided the need for #791. I will try a test with the new version to confirm there is no issue.

> But it sounds like you're saying that if I were to do this:
>
> a = SubFloat(1.1)
> jsonstore.update({"wrong_field": a, "task_id": 3})
> jsonstore.query({"task_id": 3})
>
> I would get back a float, not a SubFloat, right? Is that a problem? It seems to me there could still be value in the options you added for users who want to be able to serialize / deserialize custom types.

Yes, you will get back a float. A simple example is:

from maggma.stores import MemoryStore

class SubFloat(float):
    pass

ms = MemoryStore()
ms.connect()

d = {"test": SubFloat(1.1), "normal": 1.1, "task_id": 3}
ms.update(d)

out_d = ms.query_one()
print(out_d["test"].__class__, out_d["normal"].__class__)

The output with the old version is:

<class '__main__.SubFloat'> <class 'float'>

while with the new version it is:

<class 'float'> <class 'float'>

In the end it is not really a problem, since this makes the MemoryStore more similar to the MongoStore. However, any such object passed through jsanitize (e.g. MSONable objects, dataclasses, ...) will probably be converted to a simple type during the update() (a string in the worst case). So I don't know if there is any corner case where an object survives insertion into MongoDB but cannot be handled by orjson by default. I suppose leaving the option will not do much harm (at least it will not force me to make a backward-incompatible change to use the latest version of maggma); however, I couldn't find a way to construct a test case to replace test_jsonstore_orjson_options.

gpetretto avatar Aug 30 '23 22:08 gpetretto