marshmallow-sqlalchemy
marshmallow-sqlalchemy copied to clipboard
Memory usage creep in >=1.4.0
After a fair amount of bissecting to pinpoint this issue, I wanted to surface the fact that our CI unit test suite fails consistently when using >=1.4.0 (in Apache Superset). I'm guessing it's related to the changes around PickleType (?).
Wondering if there's memory leakage around the change, or if it's expected to use more memory, pushing us over the edge of the memory we have in CI.
Thought I'd surface it here as I'm pinning the dependency to marshmallow-sqlalchemy>=1.3.0,<1.4.0 in our repo, and will reference this issue as a comment above our pin rule.
Note that the issue could be from 1.4.1 too as I didn't bissect all the way, but given my experience with pickles I thought the issue is likely related to that.
thank you for reporting! will investigate this soon.
the PickleType fix is an unlikely culprit, since it just ensures that models with a PickleType column properly get a marshmallow fields.Raw field in the corresponding Schema (iirc previously an error was raised).
https://github.com/marshmallow-code/marshmallow-sqlalchemy/pull/657 is something we should double-check, since it adds code that runs on schema class declaration (i.e. in the metaclass).
given that the suspected PR is in 1.4.1, @mistercrunch would you mind trying marshmallow-sqlalchemy==1.4.0 to see if it's affected?
Oh wait, closed it by mistake, please reopen, turns out I can recreate consistently on 1.4.1. See it on this PR https://github.com/apache/superset/pull/32762, here's the CI job: https://github.com/apache/superset/actions/runs/13958571093/job/39075360691?pr=32762
Issue/symptom manifests as a simple Error: The operation was canceled., which from experience is what happens when GHA runs out of memory.
Here's the output of git diff 1.4.1 1.4.0 on your repo, couldn't find anything obvious in the diff...
$ git diff 1.4.1 1.4.0
diff --git a/src/marshmallow_sqlalchemy/fields.py b/src/marshmallow_sqlalchemy/fields.py
index c016eb0..2db28ca 100644
--- a/src/marshmallow_sqlalchemy/fields.py
+++ b/src/marshmallow_sqlalchemy/fields.py
@@ -160,4 +160,4 @@ def get_primary_keys(model: type[DeclarativeMeta]) -> list[MapperProperty]:
def ensure_list(value: Any) -> list:
- return list(value) if is_iterable_but_not_string(value) else [value]
+ return value if is_iterable_but_not_string(value) else [value]
diff --git a/src/marshmallow_sqlalchemy/load_instance_mixin.py b/src/marshmallow_sqlalchemy/load_instance_mixin.py
index 4d5c9ba..0353d62 100644
--- a/src/marshmallow_sqlalchemy/load_instance_mixin.py
+++ b/src/marshmallow_sqlalchemy/load_instance_mixin.py
@@ -8,9 +8,7 @@
from __future__ import annotations
-import importlib.metadata
-from collections.abc import Iterable, Mapping, Sequence
-from typing import Any, Generic, TypeVar, Union, cast
+from typing import TYPE_CHECKING, Any, Generic, TypeVar, cast
import marshmallow as ma
from sqlalchemy.ext.declarative import DeclarativeMeta
@@ -19,16 +17,10 @@ from sqlalchemy.orm.exc import ObjectDeletedError
from .fields import get_primary_keys
-_LoadDataV3 = Union[Mapping[str, Any], Iterable[Mapping[str, Any]]]
-_LoadDataV4 = Union[Mapping[str, Any], Sequence[Mapping[str, Any]]]
-_LoadDataT = TypeVar("_LoadDataT", _LoadDataV3, _LoadDataV4)
-_ModelType = TypeVar("_ModelType", bound=DeclarativeMeta)
-
+if TYPE_CHECKING:
+ from collections.abc import Iterable, Mapping
-def _cast_data(data):
- if int(importlib.metadata.version("marshmallow")[0]) >= 4:
- return cast(_LoadDataV4, data)
- return cast(_LoadDataV3, data)
+_ModelType = TypeVar("_ModelType", bound=DeclarativeMeta)
class LoadInstanceMixin:
@@ -122,7 +114,7 @@ class LoadInstanceMixin:
def load(
self,
- data: _LoadDataT,
+ data: Mapping[str, Any] | Iterable[Mapping[str, Any]],
*,
session: Session | None = None,
instance: _ModelType | None = None,
@@ -144,13 +136,13 @@ class LoadInstanceMixin:
raise ValueError("Deserialization requires a session")
self.instance = instance or self.instance
try:
- return cast(ma.Schema, super()).load(_cast_data(data), **kwargs)
+ return cast(ma.Schema, super()).load(data, **kwargs)
finally:
self.instance = None
def validate(
self,
- data: _LoadDataT,
+ data: Mapping[str, Any] | Iterable[Mapping[str, Any]],
*,
session: Session | None = None,
**kwargs,
@@ -159,7 +151,7 @@ class LoadInstanceMixin:
self._session = session or self._session
if not (self.transient or self.session):
raise ValueError("Validation requires a session")
- return cast(ma.Schema, super()).validate(_cast_data(data), **kwargs)
+ return cast(ma.Schema, super()).validate(data, **kwargs)
def _split_model_kwargs_association(self, data):
"""Split serialized attrs to ensure association proxies are passed separately.
diff --git a/src/marshmallow_sqlalchemy/schema.py b/src/marshmallow_sqlalchemy/schema.py
index 30452c0..ba7b934 100644
--- a/src/marshmallow_sqlalchemy/schema.py
+++ b/src/marshmallow_sqlalchemy/schema.py
@@ -1,11 +1,10 @@
from __future__ import annotations
-import inspect
from typing import TYPE_CHECKING, Any, cast
import sqlalchemy as sa
from marshmallow.fields import Field
-from marshmallow.schema import Schema, SchemaMeta, SchemaOpts, _get_fields
+from marshmallow.schema import Schema, SchemaMeta, SchemaOpts
from .convert import ModelConverter
from .exceptions import IncorrectSchemaTypeError
@@ -119,7 +118,7 @@ class SQLAlchemySchemaMeta(SchemaMeta):
cls_fields,
# Filter out fields generated from foreign key columns
# if include_fk is set to False in the options
- mcs._maybe_filter_foreign_keys(inherited_fields, opts=opts, klass=klass),
+ mcs._maybe_filter_foreign_keys(inherited_fields, opts=opts),
dict_cls,
)
fields.update(mcs.get_declared_sqla_fields(fields, converter, opts, dict_cls))
@@ -160,7 +159,6 @@ class SQLAlchemySchemaMeta(SchemaMeta):
fields: list[tuple[str, Field]],
*,
opts: SQLAlchemySchemaOpts,
- klass: SchemaMeta,
) -> list[tuple[str, Field]]:
if opts.model is not None or opts.table is not None:
if not hasattr(opts, "include_fk") or opts.include_fk is True:
@@ -170,39 +168,13 @@ class SQLAlchemySchemaMeta(SchemaMeta):
for column in sa.inspect(opts.model or opts.table).columns # type: ignore[union-attr]
if column.foreign_keys
}
-
- non_auto_schema_bases = [
- base
- for base in inspect.getmro(klass)
- if issubclass(base, Schema)
- and not issubclass(base, SQLAlchemyAutoSchema)
- ]
-
- def is_declared_field(field: str) -> bool:
- return any(
- field
- in [
- name
- for name, _ in _get_fields(
- getattr(base, "_declared_fields", base.__dict__)
- )
- ]
- for base in non_auto_schema_bases
- )
-
- return [
- (name, field)
- for name, field in fields
- if name not in foreign_keys or is_declared_field(name)
- ]
+ return [(name, field) for name, field in fields if name not in foreign_keys]
return fields
class SQLAlchemyAutoSchemaMeta(SQLAlchemySchemaMeta):
@classmethod
- def get_declared_sqla_fields(
- cls, base_fields, converter: ModelConverter, opts, dict_cls
- ):
+ def get_declared_sqla_fields(cls, base_fields, converter, opts, dict_cls):
fields = dict_cls()
if opts.table is not None:
fields.update(
ok, thanks for narrowing it down. will investigate further within the next few days
sorry for the delay on this. haven't been able to dedicate time to marshmallow these past couple weeks. would definitely appreciate help on this, whether if it's an incomplete PR or an MRE
On my side I likely won't be able to drive this. After running the diff above I have zero intuition as to what could have caused a memory-usage regression. I think I isolated the exact version, and from the diff there's nothing in there that looks like the culprit.
Maybe there's something I'm missing - I thought I'd see at least one thing that points in the direction of "oh this could be it", but can't seem to find it. Any idea what it could be?
On my side one thing I can do is double check that 1.4.1 is the actual issue (and that 1.4.0 is ok), but i think I did that already...
thanks! and no worries if you can't drive this. even double-checking that 1.4.1 is the culprit version is a big help.
i suspect the this PR (the only one in 1.4.1) https://github.com/marshmallow-code/marshmallow-sqlalchemy/pull/657/files#diff-f6ccdfd29f2042e6b843cb0defd082f4a01abf90cdb3d18c9aa01ff7d57e8b19R174-R191 could be the issue because it creates intermediate lists upon class declaration. don't have time atm to do a code dive for a solution but it's a place to start
Submitted a PR, a bit of a shot in the dark, but should improve perf a bit as it's preventing recomputing things in an inner loop. https://github.com/marshmallow-code/marshmallow-sqlalchemy/pull/667
Thanks for the PR! I've released it in 1.4.2. @mistercrunch wanna give it a shot in the superset tests? 👀
Oh 4.1.2 is out, let me try it 🤞 🤞
Oh no! didn't fix the issue as seen here -> https://github.com/apache/superset/pull/33166
Logs -> https://github.com/apache/superset/actions/runs/14523061692/job/40750749107?pr=33166#step:5:126
Pretty sure that Error: Process completed with exit code 143. means GHA container ran out of memory and exited :(
For the record, we have thousands of runs succeeding on 4.1.0 and still pretty certain that something on >= 4.1.0 triggers the issue.