cumulus-etl icon indicating copy to clipboard operation
cumulus-etl copied to clipboard

Investigate cumulus-etl performance

Open mikix opened this issue 2 years ago • 5 comments

Performance has not been a focus so far, but we should investigate for easy wins and to scope out larger tasks that would help.

From observing it run, it's mostly CPU-bound right now. The investigator should probably do some profiling to see where we are spending that time.

Some thoughts below.

Parallelizing

It can probably be a lot more parallelized, to take advantage of the many cores frequently available in cloud computing. Right now a task on a full set of data is quite slow.

Thoughts:

  • ~~tasks could be run simultaneously~~ (this didn't seem to speed us up as much as hoped, especially if we tried to keep to a constant memory usage)
  • converting i2b2 to FHIR doesn't have inter-dependencies between rows

I2b2 Conversion

~~Hopefully this isn't a huge factor going forward, but it's possible that creating all the FHIR objects and validating them is slowing us down~~.

This was true! We took out the FHIR object creation and got ~50% faster.

cTAKES

Look into improvements that Andy has for a rewritten cTAKES engine.

mikix avatar Dec 22 '22 15:12 mikix

I started looking at where we spend time in the code, but realized that even if I optimize it, it's pointless if we're still just using one core. So that's an obvious first step: using multiple cores for tasks. Breakout issue: https://github.com/smart-on-fhir/cumulus-etl/issues/154

mikix avatar Jan 20 '23 14:01 mikix

While investigating, I discovered that ever since we started using the MS de-id tool, we don't really need the internal validation provided by fhirclient. By skipping that de-serialization and re-serialization, we take 30% of the time as we used to (for the core CPU-bound tables).

PR here: #157

This is such a change in our performance profile, I'm going to do more timing testing. But it no longer seems as clear that using a single-core is our biggest issue. The biggest consumer of wall clock time seems to now be Delta Lake, which does use multiple cores.

So ETL is reading data as fast as it can, and shipping that to Delta Lake, which uses multiple cores. So we are kind of multi-processing already. But still worth investigating if concurrency can improve us further.

mikix avatar Jan 26 '23 14:01 mikix

Just landed #158, which does the same fhirclient purge, but for i2b2.

mikix avatar Jan 26 '23 18:01 mikix

I finished up some investigation into multi-threading in #154 and came to the conclusion that it's not worth it right now (See https://github.com/smart-on-fhir/cumulus-etl/issues/154#issuecomment-1413892945).

The next win that I think is most likely is sending multiple requests to cTAKES at once. My current thinking is that we could change ctakesclient to use asyncio and then leverage that in cumulus to send multiple requests and wait on them. But I have not done any testing there.

mikix avatar Feb 02 '23 15:02 mikix

Another idea: breakout ticket #164 to do bulk download requests in parallel.

mikix avatar Feb 02 '23 16:02 mikix