Performance issue

Open liquid36 opened this issue 10 months ago • 3 comments

Hi! I'm working with a synthetic dataset to test the tool. I have 1,000 patients and 50,000 conditions.

I made this basic request, and it never completes.

POST http://localhost:9090/fhir/Patient/$aggregate

{
    "resourceType": "Parameters",
    "parameter": [
        {
            "name": "aggregation",
            "valueString": "count()"
        },
        {
            "name": "grouping",
            "valueString": "reverseResolve(Condition.subject).code.coding.where(subsumedBy(http://snomed.info/sct|73211009))"
        }
    ]
}

The problem is that Pathling makes one request per Condition to check whether it belongs to <<73211009, and that is not viable. How do you deal with this?

liquid36 avatar Jan 30 '25 17:01 liquid36

Hi @liquid36,

Thanks for trying it out!

Which terminology server are you using? Pathling uses https://tx.ontoserver.csiro.au/fhir by default. Are you using something different?

There is a configuration option that might help diagnose the problem: pathling.terminology.verboseLogging (https://pathling.csiro.au/docs/server/configuration#terminology-service). Some logging with this option turned on might be helpful.

We have tried many different strategies for making terminology requests, and we found the individual request model to actually work fastest. This is because we can effectively parallelize the requests, cache the results and only make unique requests that we have not made before. Pathling has a client-side cache to facilitate this, and most terminology servers will also have a server-side cache in addition to this.
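
To illustrate the idea, here is a minimal Python sketch of that request model: de-duplicate the codes, check each unique code in parallel, and cache the answers so repeats cost nothing. It is not Pathling's actual implementation. `fake_subsumes`, `cached_subsumes`, `check_codes` and the tiny hard-coded hierarchy are hypothetical stand-ins for a real terminology server round trip.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

SNOMED = "http://snomed.info/sct"
calls = []  # records every simulated round trip to the terminology server


def fake_subsumes(system, ancestor, code):
    """Hypothetical stand-in for a single subsumes request."""
    calls.append(code)
    # Toy hierarchy (assumption): codes treated as descendants of 73211009.
    return code in {"73211009", "44054006", "46635009"}


@lru_cache(maxsize=None)
def cached_subsumes(system, ancestor, code):
    # Client-side cache: a repeated (system, ancestor, code) triple is
    # answered from memory instead of triggering another request.
    return fake_subsumes(system, ancestor, code)


def check_codes(codes, ancestor="73211009", workers=4):
    # Only unique codes are sent, and they are checked in parallel.
    unique = sorted(set(codes))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: cached_subsumes(SNOMED, ancestor, c), unique)
        lookup = dict(zip(unique, results))
    # Map the answers back onto the original (possibly repeated) codes.
    return [lookup[c] for c in codes]
```

With a list of nine codings that contains only three distinct codes, `check_codes` triggers three simulated requests on the first call and none on the second, because every answer is already cached.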

We have demonstrated that this works effectively on large datasets with tens of thousands of unique SNOMED CT codings. Ontoserver is the terminology server that we prefer to use, and it can service a subsumes request in less than 5 ms.

johngrimes avatar Jan 31 '25 04:01 johngrimes

I'm using Snowstorm, but I just tried with the default Ontoserver and it worked better.

How could I parallelize the requests? By deploying a Spark cluster?

For the amount of data I mentioned before, I'm getting a response time of 3/4 seconds with the cache enabled. Is that okay?

liquid36 avatar Jan 31 '25 14:01 liquid36

The level of parallelism is controlled by the number of worker threads that Spark uses. By default, it will use a number of threads equal to the number of CPUs on your machine. Yes, scaling out to a cluster will increase the parallelism further, but there will be some overhead as compared to scaling up to more CPUs within a single instance.
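
As a rough back-of-envelope illustration, the sketch below estimates the cold-cache time spent on terminology requests when the work is spread evenly across worker threads. It is a simplification under stated assumptions, not a measurement: the function name `estimate_cold_cache_ms` is hypothetical, the 5 ms default is the Ontoserver figure mentioned earlier in this thread, and network overhead, uneven scheduling and server-side caching are all ignored.

```python
import math
import os


def estimate_cold_cache_ms(unique_codes, workers=None, latency_ms=5):
    """Estimate milliseconds spent on requests when nothing is cached yet."""
    # Default to one worker per CPU, mirroring the default described above.
    workers = workers or os.cpu_count() or 1
    # Each worker handles its share of the unique codes one after another.
    rounds = math.ceil(unique_codes / workers)
    return rounds * latency_ms
```

For example, `estimate_cold_cache_ms(1000, workers=8)` gives 625 ms, compared with 5000 ms for a single worker. The number of unique codes in a real dataset will differ, so treat the result only as an order-of-magnitude guide.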

Without looking at the actual query that you are trying to run, I couldn't say whether that is a good response time. Is that with a warm cache? Note that Pathling will also respect the caching policies sent back by the terminology server, and tx.ontoserver.csiro.au is a shared public server.

johngrimes avatar Feb 04 '25 04:02 johngrimes

Hope you got everything working. Please feel free to re-open the ticket if not.

johngrimes avatar Jun 04 '25 01:06 johngrimes