
Indexing records with a large number of tiles (containing domain datatypes) takes an excessively long time

Open khodgkinson-he opened this issue 2 years ago • 2 comments

Describe the bug Indexing a resource with a large number of tiles (predominantly containing domain value datatypes) takes an excessive amount of time. Reindexing one record with 3,900 tiles took over eight hours.

To Reproduce Steps to reproduce the behavior:

  1. Create record with very large number of tiles (predominantly containing domain value datatypes).
  2. Reindex record.

Screenshots If applicable, add screenshots to help explain your problem.

Expected behavior Comparable reindexing times for similarly sized records, whether or not they contain Domain/DomainList datatypes.

Your Arches Information

  • Version used:
  • Operating System and version (desktop or mobile):
  • Browser Name and version:
  • Link to your Arches Install (optional):

Additional context

The append_to_document methods for DomainDataType and DomainListDataType in datatypes.py appear to contain iterations that are no longer required.

These loops scan every tile in the document to deduce the node ID and node value. Following a later refactoring, however, both are now supplied to the methods as arguments (nodeid and nodevalue).

With this unnecessary work removed, reindexing completes in a far more reasonable timeframe.
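To give a rough sense of scale, the sketch below counts the work done by each pattern for a record shaped like the one reported. It is not Arches code: it assumes append_to_document is called once per stored tile value and that document["tiles"] holds every tile of the resource, and the tile data (one domain value per tile, five possible options) is invented for illustration.

```python
from collections import Counter


def old_pattern_cost(tile_data):
    """Cost of the nested loops: every call rescans every tile.

    tile_data is a list of {nodeid: value} dicts, a hypothetical stand-in
    for tile.data on each entry of document["tiles"].
    """
    values = [v for data in tile_data for v in data.values()]
    calls = len(values)                # assumed: one call per stored value
    comparisons = calls * len(values)  # each call compares against every value
    # Each call runs one Node lookup per matching value, so a value that
    # occurs c times triggers c lookups in each of its c calls.
    lookups = sum(c * c for c in Counter(values).values())
    return comparisons, lookups


def new_pattern_cost(tile_data):
    """Cost when nodeid and nodevalue arrive as arguments: one lookup per call."""
    calls = sum(len(data) for data in tile_data)
    return 0, calls


# 3,900 tiles, each holding one domain value drawn from five options (made-up data).
tile_data = [{"node-a": f"option-{i % 5}"} for i in range(3900)]
old_comparisons, old_lookups = old_pattern_cost(tile_data)
new_comparisons, new_lookups = new_pattern_cost(tile_data)
print(old_comparisons, old_lookups, new_lookups)
```

Under these assumptions the old pattern grows quadratically with the number of tiles, and because domain values repeat often, most of that growth lands on node lookups rather than on cheap comparisons.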

Prima facie, a potential fix is shown below (commenting out the unnecessary code and refactoring to use the supplied parameters):


class DomainDataType(BaseDomainDataType):
    def append_to_document(self, document, nodevalue, nodeid, tile, provisional=False):
        # domain_text = None
        # for tile in document["tiles"]:
        #     for k, v in tile.data.items():
        #         if v == nodevalue:
        #             node = models.Node.objects.get(nodeid=k)
        #             domain_text = self.get_option_text(node, v)
        node = models.Node.objects.get(nodeid=nodeid)
        domain_text = self.get_option_text(node, nodevalue)

        if domain_text not in document["strings"] and domain_text is not None:
            document["strings"].append({"string": domain_text, "nodegroup_id": tile.nodegroup_id, "provisional": provisional})


class DomainListDataType(BaseDomainDataType):
    def append_to_document(self, document, nodevalue, nodeid, tile, provisional=False):
        domain_text_values = set([])
        # for tile in document["tiles"]:
        #     for k, v in tile.data.items():
        #         if v == nodevalue:
        node = models.Node.objects.get(nodeid=nodeid)
        for value in nodevalue:
            text_value = self.get_option_text(node, value)
            domain_text_values.add(text_value)

        for value in domain_text_values:
            if value not in document["strings"]:
                document["strings"].append({"string": value, "nodegroup_id": tile.nodegroup_id, "provisional": provisional})
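A side observation on the snippets above, separate from the performance problem: document["strings"] holds dicts, so a test such as `domain_text not in document["strings"]` compares a string against dicts and is always true. If skipping duplicates is the intent (the issue does not say), a minimal sketch could compare whole entries instead. The helper name append_unique_string and the sample values are hypothetical and not part of Arches.

```python
def append_unique_string(document, text, nodegroup_id, provisional=False):
    """Append an entry to document["strings"] unless an identical entry exists."""
    if text is None:
        return False
    entry = {"string": text, "nodegroup_id": nodegroup_id, "provisional": provisional}
    if entry in document["strings"]:  # dict-to-dict comparison, so duplicates match
        return False
    document["strings"].append(entry)
    return True


document = {"strings": []}
first = append_unique_string(document, "Listed Building", "ng-1")
second = append_unique_string(document, "Listed Building", "ng-1")

# A plain string is never "in" a list of dicts, which is why the original test always passes.
string_found = "Listed Building" in document["strings"]
print(first, second, string_found, len(document["strings"]))
```

Comparing the full entry keeps the same text under a different nodegroup or provisional flag as a separate entry; whether that is the desired behaviour would need confirming against how the index is queried.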


Ticket Background

  • Found by: @khodgkinson-he

khodgkinson-he avatar Jul 12 '22 12:07 khodgkinson-he

@chiatt do you think @khodgkinson-he's comments that these simply need refactoring are sound? If so, we can get this done and PR'd into dev/6.1.

aj-he avatar Jul 12 '22 12:07 aj-he

@aj-he Yeah, @khodgkinson-he's proposed changes look good and dev/6.1 seems like the right place.

chiatt avatar Jul 12 '22 22:07 chiatt