datajoint-python
Strain on MySQL resource with "expensive" key-source
This concerns tables whose key-source is complicated/expensive to compute. When multiple workers call populate() on such a table at the same time, the initial key-source calculation by all of these workers creates spikes in resource utilization on the server side, sometimes detrimental to server performance.
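For concreteness, here is a hedged sketch (not code from this issue; Recording, QualityMetrics, and the snr attribute are invented names) of a computed table whose key_source involves a costly restriction that every worker re-evaluates when it calls populate():

import datajoint as dj

schema = dj.schema("my_schema")  # hypothetical schema and upstream tables

@schema
class SpikeSorting(dj.Computed):  # hypothetical table
    definition = """
    -> Recording
    ---
    n_units : int
    """

    @property
    def key_source(self):
        # A join/restriction like this can be expensive for MySQL to evaluate,
        # and every worker's populate() call runs it again before reserving jobs.
        return Recording & (QualityMetrics & "snr > 3")

    def make(self, key):
        ...

# If many workers run SpikeSorting.populate(reserve_jobs=True) at the same time,
# each one issues the same expensive key-source query, producing the load spikes
# described above.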
A key-source filler solution was prepared by Edgar: a decorator applied to any table with a potentially expensive key-source, which creates a corresponding key-source filler table in which the key-source is computed separately. See gist here.
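The gist itself is not reproduced in this thread. As a rough orientation only (the actual gist may differ), the decorator could take a shape like the following, which is consistent with the modified __call__ posted further down: it holds the schema plus a dict of the filler tables it has declared.

import datajoint as dj

class KeySourceFiller:
    """Sketch of the filler decorator's shell; see the modified __call__ below."""

    def __init__(self, schema):
        self._schema = schema  # dj.schema object used to declare the filler tables
        self.tables = {}       # filler tables created so far, keyed by name

    def __call__(self, cls):
        # Declares `cls` in the schema, declares a companion "<Name>KeySource"
        # table holding the keys to be populated, and points cls.key_source at it.
        ...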
This is a very interesting solution. It creates an extra "Filler" table that populates the primary key for the target table. In the future, we will find a way to make this a feature that's transparent to the user.
probably should be addressed along with #665
A consistent problem with the utility linked above is that the resulting key_source table is entirely independent of the tables it is derived from. Consider the following:
import datajoint as dj

schema = dj.schema("my_schema")
ksf = KeySourceFiller(schema)  # the decorator from the gist linked above

@schema
class Table1(dj.Lookup):
    definition = """
    id_1 : int
    ---
    value_1 : int
    """
    contents = [(0, 0), (1, 1), (2, 2)]

@schema
class Table2(dj.Lookup):
    definition = """
    id_2 : char(1)
    ---
    value_2 : int
    """
    contents = [("a", 0), ("b", 1), ("c", 2)]

@ksf
class TableN(dj.Computed):
    definition = """
    -> Table1
    -> Table2
    ---
    new_value : int
    """

    def make(self, key):
        v1, v2 = (Table1 * Table2 & key).fetch1("value_1", "value_2")
        self.insert1({**key, "new_value": v1 * v2})
i.e. we have a computed table which depends on two other tables, with a KeySourceFiller to avoid the arduous task of calculating its key_source.

If I delete the contents of Table1, I should expect all the contents of TableN to be deleted with it, which they are. However, the contents of TableN.key_source remain intact, and the subsequent command TableN.populate() will result in a large number of errors like so: DataJointError: fetch1 should only return one tuple. 0 tuples were found - i.e. there are keys to be computed which point to data (in Table1) that no longer exists.
Ideally, any rows in the key_source
table should be purged if their corresponding rows in the main table are deleted (i.e. it should have the same ephemeral behaviour as conventional, computed-on-the-fly key_sources).
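As a stop-gap sketch of that purge (not an existing DataJoint feature, and assuming the decorator keeps the original on-the-fly query reachable as base_key_source, as the modified __call__ below does), stale filler rows could be removed by an antijoin against the true key_source:

# Hypothetical manual purge of stale filler rows.
# TableN.key_source is the filler table; TableN().base_key_source is the original query.
stale = TableN.key_source - TableN().base_key_source
stale.delete_quick()  # the filler table has no dependents, so no cascade is needed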
There are two obvious ways to do this:
- Have the decorator modify the delete behaviour to also delete from the key_source table.
- Modify the key_source definition such that the key_source table has the same foreign keys as the main table, and allow the standard deletion behaviour to take care of it in parallel.
The first is probably more sensible, but the structure of DataJoint is sufficiently complex that I can't figure out how to make it work. The second might look like the code below, using the definition-parsing behaviour cribbed from datajoint.declare.prepare_declare.

The second approach has the side effect of substantially polluting the diagram, much more so than the current behaviour (as above), where the key-source tables appear as unrelated blobs off to the side. Ideally, the Diagram would have some way of excluding this sort of "administrative" table. It can be done manually as dj.Diagram(schema) - TableN.key_source, but that remains fairly user-unfriendly, especially if there are more than 1-2 so-decorated tables.
def __call__(self, cls):
    # requires `import inspect`, `import re`, and `import datajoint as dj` at module level
    context = inspect.currentframe().f_back.f_locals
    cls = self._schema(cls, context=context)
    name = cls.__name__
    filler_name = '{}KeySource'.format(name)
    # keep the original (computed-on-the-fly) key_source reachable on instances
    cls.base_key_source = cls.key_source

    # defn = '\n'.join(['{}: {}'.format(a.name, a.type) for a in cls.heading.attributes.values() if a.in_key])
    defn = ""
    for line in re.split(r'\s*\n\s*', cls.definition.strip()):
        if line.startswith("->"):
            defn = "{}{}\n".format(defn, line)

    class KeySource(dj.Computed):
        definition = defn

        @property
        def key_source(self):
            return cls().base_key_source

        def make(self, key):
            self.insert1(key)

    KeySource.__name__ = filler_name
    self.tables[filler_name] = self._schema(KeySource, context=context)
    cls.key_source = self.tables[filler_name]()
    return cls
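As an aside on the diagram-pollution point above, the manual exclusion can at least be scripted when several tables are decorated. This is only a sketch, assuming the decorator instance ksf from the example and its tables dict of filler tables:

# Hypothetical: subtract every filler table from the diagram.
diagram = dj.Diagram(schema)
for filler in ksf.tables.values():
    diagram = diagram - filler
diagram  # display in a notebook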
Quick testing code, assuming the above schema:
if __name__ == "__main__":
    TableN.key_source.populate()
    TableN.populate()
    print("Keys to populate (should be 9): {}".format(len(TableN.key_source)))
    print("Keys populated (should be 9): {}".format(len(TableN())))

    print("Now we delete one of the parent tables, which deletes the entries in the computed table")
    dj.config["safemode"] = False
    Table1.delete()
    dj.config["safemode"] = True

    print("Computed length (should be 0): {}".format(len(TableN())))
    print("key_source (should be 0): {}".format(len(TableN.key_source)))

    schema.drop()
Thanks for the detailed feedback @simon-ball. I think whatever 'official' solution is implemented would manage this automatically and hide the complexity from the user (as @dimitri-yatsenko mentions, 'In the future, we will find a way to make this a feature that's transparent to the user'). This is great real-world feedback on the workaround to include as we move forward.