datajoint-python
Strain on MySQL resource with "expensive" key-source
This concerns tables whose key-source is complicated/expensive to compute. When multiple workers call populate() on such a table at the same time, the initial key-source calculation by all of these workers creates spikes in resource utilization on the server side, sometimes detrimental to server performance.
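For concreteness, here is a hedged sketch (not code from this issue; Recording, QualityMetrics, and the snr attribute are invented names) of a computed table whose key_source involves a costly restriction that every worker re-evaluates when it calls populate():

import datajoint as dj

schema = dj.schema("my_schema")  # hypothetical schema and upstream tables

@schema
class SpikeSorting(dj.Computed):  # hypothetical table
    definition = """
    -> Recording
    ---
    n_units : int
    """

    @property
    def key_source(self):
        # A join/restriction like this can be expensive for MySQL to evaluate,
        # and every worker's populate() call runs it again before reserving jobs.
        return Recording & (QualityMetrics & "snr > 3")

    def make(self, key):
        ...

# If many workers run SpikeSorting.populate(reserve_jobs=True) at the same time,
# each one issues the same expensive key-source query, producing the load spikes
# described above.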
A key-source filler solution was prepared by Edgar: a decorator applied to any table with a potentially expensive key-source, which creates a corresponding key-source filler table in which the key-source is computed separately. See gist here.
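The gist itself is not reproduced in this thread. As a rough orientation only (the actual gist may differ), the decorator could take a shape like the following, which is consistent with the modified __call__ posted further down: it holds the schema plus a dict of the filler tables it has declared.

import datajoint as dj

class KeySourceFiller:
    """Sketch of the filler decorator's shell; see the modified __call__ below."""

    def __init__(self, schema):
        self._schema = schema  # dj.schema object used to declare the filler tables
        self.tables = {}       # filler tables created so far, keyed by name

    def __call__(self, cls):
        # Declares `cls` in the schema, declares a companion "<Name>KeySource"
        # table holding the keys to be populated, and points cls.key_source at it.
        ...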
This is a very interesting solution. It creates an extra "Filler" table that populates the primary key for the target table. In the future, we will find a way to make this a feature that's transparent to the user.
probably should be addressed along with #665
A consistent problem with the utility linked above is that the resulting key_source table is entirely independent of the tables it is derived from. Consider the following:
import datajoint as dj

schema = dj.schema("my_schema")
ksf = KeySourceFiller(schema)  # the decorator from the gist linked above

@schema
class Table1(dj.Lookup):
    definition = """
    id_1 : int
    ---
    value_1 : int
    """
    contents = [(0, 0), (1, 1), (2, 2)]

@schema
class Table2(dj.Lookup):
    definition = """
    id_2 : char(1)
    ---
    value_2 : int
    """
    contents = [("a", 0), ("b", 1), ("c", 2)]

@ksf
class TableN(dj.Computed):
    definition = """
    -> Table1
    -> Table2
    ---
    new_value : int
    """

    def make(self, key):
        v1, v2 = (Table1 * Table2 & key).fetch1("value_1", "value_2")
        self.insert1({**key, "new_value": v1 * v2})
i.e. we have a computed table which depends on two other tables, with a KeySourceFiller to avoid the arduous task of calculating its key_source.

If I delete the contents of Table1, I should expect all the contents of TableN to be deleted with it, which they are. However, the contents of TableN.key_source remain intact, and the subsequent command TableN.populate() will result in a large number of errors like so: DataJointError: fetch1 should only return one tuple. 0 tuples were found - i.e. there are keys to be computed which point to data (in Table1) that no longer exists.
Ideally, any rows in the key_source
table should be purged if their corresponding rows in the main table are deleted (i.e. it should have the same ephemeral behaviour as conventional, computed-on-the-fly key_sources).
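As a stop-gap sketch of that purge (not an existing DataJoint feature, and assuming the decorator keeps the original on-the-fly query reachable as base_key_source, as the modified __call__ below does), stale filler rows could be removed by an antijoin against the true key_source:

# Hypothetical manual purge of stale filler rows.
# TableN.key_source is the filler table; TableN().base_key_source is the original query.
stale = TableN.key_source - TableN().base_key_source
stale.delete_quick()  # the filler table has no dependents, so no cascade is needed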
There are two obvious ways to do this:
- Have the decorator modify the delete behaviour to also delete from the key_source table.
- Modify the key_source definition such that the key_source table has the same foreign keys as the main table, and allow the standard deletion behaviour to take care of it in parallel.
The first is probably more sensible, but the structure of DataJoint is sufficiently complex that I can't figure out how to make it work. The second might look like the code below, using the definition-parsing behaviour cribbed from datajoint.declare.prepare_declare.

The second approach has the side effect of substantially polluting the diagram, much more so than the current behaviour (as above), where the key-source tables appear as unrelated blobs off to the side. Ideally, the Diagram would have some way of excluding this sort of "administrative" table. It can be done manually as dj.Diagram(schema) - TableN.key_source, but that remains fairly user-unfriendly, especially if there are more than 1-2 so-decorated tables.
def __call__(self, cls):
    # requires `import inspect`, `import re`, and `import datajoint as dj` at module level
    context = inspect.currentframe().f_back.f_locals
    cls = self._schema(cls, context=context)
    name = cls.__name__
    filler_name = '{}KeySource'.format(name)
    # keep the original (computed-on-the-fly) key_source reachable on instances
    cls.base_key_source = cls.key_source

    # defn = '\n'.join(['{}: {}'.format(a.name, a.type) for a in cls.heading.attributes.values() if a.in_key])
    defn = ""
    for line in re.split(r'\s*\n\s*', cls.definition.strip()):
        if line.startswith("->"):
            defn = "{}{}\n".format(defn, line)

    class KeySource(dj.Computed):
        definition = defn

        @property
        def key_source(self):
            return cls().base_key_source

        def make(self, key):
            self.insert1(key)

    KeySource.__name__ = filler_name
    self.tables[filler_name] = self._schema(KeySource, context=context)
    cls.key_source = self.tables[filler_name]()
    return cls
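As an aside on the diagram-pollution point above, the manual exclusion can at least be scripted when several tables are decorated. This is only a sketch, assuming the decorator instance ksf from the example and its tables dict of filler tables:

# Hypothetical: subtract every filler table from the diagram.
diagram = dj.Diagram(schema)
for filler in ksf.tables.values():
    diagram = diagram - filler
diagram  # display in a notebook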
Quick testing code, assuming the above schema:
if __name__ == "__main__":
    TableN.key_source.populate()
    TableN.populate()
    print("Keys to populate (should be 9): {}".format(len(TableN.key_source)))
    print("Keys populated (should be 9): {}".format(len(TableN())))

    print("Now we delete one of the parent tables, which deletes the entries in the computed table")
    dj.config["safemode"] = False
    Table1.delete()
    dj.config["safemode"] = True

    print("Computed length (should be 0): {}".format(len(TableN())))
    print("key_source (should be 0): {}".format(len(TableN.key_source)))

    schema.drop()
Thanks for the detailed feedback @simon-ball. I think whatever 'official' solution is implemented would manage this automatically and hide the complexity from the user (as @dimitri-yatsenko mentions, 'In the future, we will find a way to make this a feature that's transparent to the user'). This is great real-world feedback on the workaround to include as we move forward.