maggma
MemoryStore: replace mongomock with pymongo-inmemory
Summary

Major changes:

- replace `mongomock` with `pymongo-inmemory`
- eliminate a significant amount of bespoke code from `MemoryStore`, which can now inherit nearly all methods from `MongoStore`
- BREAKING CHANGE: `MemoryStore` instances created with the same collection name now point to the same collection in memory. This is in contrast to the previous behavior (which was somewhat unintuitive) and leads to more intuitive `__eq__` behavior (see #788)
- Modify tests to account for stricter behavior of `MemoryStore`, e.g., `connect()` must be called again after `run()`
- Closes #788
- Closes #830
- Some docstring cleanups
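The breaking change above can be sketched with a toy registry keyed by collection name (a hypothetical illustration of the semantics, not maggma's actual implementation):

```python
# Hypothetical sketch: instances created with the same collection name
# share one backing collection, which makes __eq__ behave intuitively.
_registry: dict = {}  # maps collection name -> shared in-memory "collection"

class ToyMemoryStore:
    def __init__(self, collection_name="memory_db"):
        self.collection_name = collection_name
        # same name -> same backing list, mirroring the new shared behavior
        self._coll = _registry.setdefault(collection_name, [])

    def __eq__(self, other):
        # two stores are equal when they point at the same collection
        return isinstance(other, ToyMemoryStore) and self._coll is other._coll

a = ToyMemoryStore("tasks")
b = ToyMemoryStore("tasks")
a._coll.append({"task_id": 3})
print(a == b, b._coll)  # stores with the same name share state
```

Previously, two `MemoryStore("tasks")` instances would have held independent data, which made equality checks surprising.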
`pymongo-inmemory` creates a thin wrapper around `MongoClient()` using a downloaded instance of `mongod`, so it is guaranteed to work and perform exactly like "real" `pymongo`, unlike `mongomock`, which has some limitations and inconsistencies.
Performance impact

I did a basic timing test for connecting to a `FileStore` (400 files) on a system with slow (HDD) storage. On first connection, `pymongo-inmemory` was slower by about 50% (19s vs. 13s), presumably because it has to download and spin up the `mongod` instance. On subsequent `fs.connect` calls, `pymongo-inmemory` was at least as fast as `mongomock` (12s vs. 13s).

On a system with faster storage (SSD) and 800 files, `pymongo-inmemory` and `mongomock` had comparable speed (26-30s for `pymongo-inmemory` vs. 27s for `mongomock`).
Todos

If this is work in progress, what else needs to be done?

- [ ] Pending response to issue, move configuration settings to `pyproject.toml` (I need to open a PR for this; not necessary to complete before merging this PR).
- [ ] Resolve remaining CI problems for which I need assistance. See below.
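For context on the configuration todo: as I understand it, `pymongo-inmemory` currently reads its settings from a `[pymongo_inmemory]` section (e.g. in `setup.cfg`) or from `PYMONGOIM__*` environment variables. The fragment below is illustrative only; check the `pymongo-inmemory` README for the exact key names it supports:

```ini
; illustrative setup.cfg fragment -- verify key names against
; the pymongo-inmemory documentation
[pymongo_inmemory]
mongo_version = 6.0
mongod_port = 27019
```

Moving this section into `pyproject.toml` is what the pending upstream change would enable.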
Remaining test failures / CI problems
I need some help (please!) from @munrojm and maybe @gpetretto to resolve the remaining test failures. I'll break them down one by one.

test_jsonstore_orjson_options (@gpetretto)

The changes here prevent the `JSONEncodeError` from being raised, for reasons I haven't yet identified.
```
def test_jsonstore_orjson_options(test_dir):
    class SubFloat(float):
        pass

    with ScratchDir("."):
        jsonstore = JSONStore("d.json", read_only=False)
        jsonstore.connect()
        with pytest.raises(orjson.JSONEncodeError):
>           jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})
E           Failed: DID NOT RAISE <class 'TypeError'>

tests/stores/test_mongolike.py:498: Failed
```
By updating `MemoryStore` to use "real" Mongo, I was able to inherit the `update` method from `MongoStore` rather than use the custom `update` method that had previously been used. That custom `update` is now inside `MontyStore`. I suspect that a subtle difference between the two is causing this test to fail, but I haven't been able to pin it down. I would appreciate a 2nd set of eyes. I suggest comparing the two `update` methods side by side as a starting point.
test_submission_patch (@munrojm)

For some reason the API is now generating a Bad Request code instead of code 200. This must somehow be related to `owner_store`, which is a `MemoryStore`, but I'm not familiar enough with the API to figure out the connection.
```
def test_submission_patch(owner_store, post_query_op, patch_query_op):
    endpoint = SubmissionResource(
        store=owner_store,
        get_query_operators=[PaginationQuery()],
        post_query_operators=[post_query_op],
        patch_query_operators=[patch_query_op],
        calculate_submission_id=True,
        model=Owner,
    )
    app = FastAPI()
    app.include_router(endpoint.router)
    client = TestClient(app)
    update = json.dumps({"last_updated": "2023-06-22T17:32:11.645713"})
    assert client.get("/").status_code == 200
>   assert client.patch(f"/?name=PersonAge9&update={update}").status_code == 200
E   assert 400 == 200
E    +  where 400 = <Response [400]>.status_code
E    +    where <Response [400]> = <bound method Session.patch of <starlette.testclient.TestClient object at 0x7f75aa0f4f10>>('/?name=PersonAge9&update={"last_updated": "2023-06-22T17:32:11.645713"}')
E    +      where <bound method Session.patch of <starlette.testclient.TestClient object at 0x7f75aa0f4f10>> = <starlette.testclient.TestClient object at 0x7f75aa0f4f10>.patch

tests/api/test_submission_resource.py:118: AssertionError
```
ServerSelectionTimeout errors

Occasionally some tests error out because the `MemoryStore` times out. I suspect this occurs because `pymongo-inmemory` is downloading and spinning up its own `mongod` before it can run the tests. It looks like this wasn't a problem after the most recent update, so perhaps it will take care of itself. If the issue persists, I will experiment with having `pymongo-inmemory` use the built-in Mongo instance provided on the GH runner instead of downloading its own.
Checklist

- [x] Google format doc strings added.
- [x] Code linted with `ruff`. (For guidance in fixing rule violations, see the rule list.)
- [x] Type annotations included. Check with `mypy`.
- [x] Tests added for new features/fixes.
- [x] I have run the tests locally and they passed.
+1 I’ve observed some serious differences between mongomock and pymongo that can result in some tedious debugging! I think the idea of using real Mongo is much preferred.
@munrojm @gpetretto I updated the first post on this PR with a specific request for help from each of you on one of the test failures. I'd appreciate any thoughts you have
(adding this comment b/c I'm not sure if editing the post would trigger an email notification).
Just committed a fix for the patch test. It was an issue with how the body data was being passed to the `patch` method of the test client.
Thanks @rkingsbury for updating the MemoryStore. The limited number of features supported by mongomock was indeed an issue.

Concerning the test that fails, I ran a few tests and figured out what is happening. The reason I introduced those features and that test was that it was failing when the document contains a `FloatWithUnit`. I added that option since orjson does not handle subclasses of native Python types by default, but it allows passing a function to deal with values that it cannot serialize. With the mongomock implementation, when passing something like

```
jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})
```

the dictionary stored internally still contained an instance of `SubFloat`. So when `update_json_file` was executed, orjson needed to know how to convert it. It seems that now MongoDB already stores the instance of `SubFloat` as a simple float. So, when it is queried back to perform `update_json_file`, it is never an instance of `SubFloat`. In the new implementation I can't find a way to trigger that error anymore, as MongoDB already deals with those cases before getting to orjson. I am not sure whether it would be better to remove the `serialization_default` option at this point.
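The type coercion described above is not unique to BSON; as a rough stdlib analogy (using `json` rather than MongoDB, purely for illustration), a `float` subclass loses its type on a serialize/deserialize round trip:

```python
import json

class SubFloat(float):
    pass

doc = {"wrong_field": SubFloat(1.1), "task_id": 3}
# serializing and deserializing coerces the subclass to a plain float,
# analogous to what BSON storage now does in the real-Mongo MemoryStore
restored = json.loads(json.dumps(doc))

print(type(doc["wrong_field"]).__name__)       # SubFloat
print(type(restored["wrong_field"]).__name__)  # float
```

With mongomock, no such round trip occurred, so the `SubFloat` instance survived until orjson tried (and failed) to serialize it.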
> I added that option since orjson does not handle subclasses of native Python types by default, but it allows passing a function to deal with values that it cannot serialize. With the mongomock implementation, when passing something like `jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})` the dictionary stored internally still contained an instance of `SubFloat`. So when `update_json_file` was executed, orjson needed to know how to convert it. It seems that now MongoDB already stores the instance of `SubFloat` as a simple float. So, when it is queried back to perform `update_json_file`, it is never an instance of `SubFloat`. In the new implementation I can't find a way to trigger that error anymore, as MongoDB already deals with those cases before getting to orjson.
Thanks for investigating! So if I understand correctly, the original problem that prompted #791 is resolved by this PR.
> I am not sure if it would be better to remove the `serialization_default` option at this point.
But it sounds like you're saying that if I were to do this:

```
a = SubFloat(1.1)
jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})
jsonstore.query({"task_id": 3})
```

I would get back a `float`, not a `SubFloat`, right? Is that a problem? It seems to me there could still be value in the options you added for users that want to be able to serialize / deserialize custom types.
> Thanks for investigating! So if I understand correctly, the original problem that prompted #791 is resolved by this PR.
Indeed, I think this would have avoided the need for #791. I will try to make a test with the new version and check that there is no issue with that.
> But it sounds like you're saying that if I were to do this:
>
> ```
> a = SubFloat(1.1)
> jsonstore.update({"wrong_field": SubFloat(1.1), "task_id": 3})
> jsonstore.query({"task_id": 3})
> ```
>
> I would get back a `float`, not a `SubFloat`, right? Is that a problem? It seems to me there could still be value in the options you added for users that want to be able to serialize / deserialize custom types.
Yes, you will get back a `float`. A simple example is:

```
class SubFloat(float):
    pass

ms = MemoryStore()
ms.connect()
d = {"test": SubFloat(1.1), "normal": 1.1, "task_id": 3}
ms.update(d)
out_d = ms.query_one()
print(out_d["test"].__class__, out_d["normal"].__class__)
```

The output with the old version is:

```
<class '__main__.SubFloat'> <class 'float'>
```

while with the new version it is:

```
<class 'float'> <class 'float'>
```
In the end it is not really a problem, since this makes the `MemoryStore` more similar to the `MongoStore`. However, any object that is not handled by jsanitize (e.g. MSONable, dataclasses, ...) will probably be converted to a simple type during the `update()` (a string in the worst case). So I don't know if there is any corner case where the object survives the insertion in MongoDB but still cannot be handled by orjson by default. I suppose leaving the option will not do much harm (at least it will not force me to make a backward-incompatible change to use the latest version of maggma); however, I couldn't find a way to create a test example to replace `test_jsonstore_orjson_options`.
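For reference, the `serialization_default` option discussed above follows the same pattern as the `default=` fallback hook shared by orjson and the stdlib `json` module. A minimal stdlib sketch (the `WithUnit` class is a hypothetical stand-in for a custom type like `FloatWithUnit`, not pymatgen's actual class):

```python
import json
from dataclasses import dataclass

@dataclass
class WithUnit:
    # hypothetical stand-in for a custom type the encoder can't handle natively
    value: float
    unit: str

def fallback(obj):
    # called only for objects the encoder cannot serialize on its own,
    # mirroring the role of serialization_default in JSONStore
    if isinstance(obj, WithUnit):
        return {"value": obj.value, "unit": obj.unit}
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

s = json.dumps({"energy": WithUnit(1.1, "eV"), "task_id": 3}, default=fallback)
print(s)
```

If real-Mongo storage already coerces such values to simple types before they reach orjson, the fallback may simply never fire, which matches the observation that the error can no longer be triggered.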