QuickUMLS installation takes too much memory

When running python -m quickumls.install on an MRCONSO.RRF file with about 7M rows, the memory footprint grows continuously until, at some point, the process is killed for using too much memory. The two main culprits I could find are the processed https://github.com/Georgetown-IR-Lab/QuickUMLS/blob/c0b5db059fbef8d70681626a34456ab3d906e5e7/quickumls/install.py#L66 and simstring https://github.com/Georgetown-IR-Lab/QuickUMLS/blob/c0b5db059fbef8d70681626a34456ab3d906e5e7/quickumls/install.py#L113 sets.

I assume they are there to prevent duplicate entries in the SimString and CuiSemType DBs. When using the unqlite database, a check for duplicate entries is already implemented in the insert call, so duplicate entries are a non-issue there. However, I am not sure whether the same is true for the SimString database. Is it safe to add duplicate terms/n-grams to the SimString database, or will that break anything? If it is safe, the large sets could be removed, which would eliminate their memory overhead for large UMLS subsets.
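
For illustration, here is a minimal sketch of one way to keep a duplicate check while shrinking the set, assuming the set currently holds full (term, CUI) strings. It stores a short hash digest per pair instead of the strings themselves. The function name dedupe_by_digest and the digest size are hypothetical and not part of QuickUMLS; a hash collision could, very rarely, drop a distinct pair.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def dedupe_by_digest(
    pairs: Iterable[Tuple[str, str]], digest_size: int = 8
) -> Iterator[Tuple[str, str]]:
    """Yield each (term, cui) pair once, remembering only a small digest.

    Hypothetical helper, not part of QuickUMLS. The set holds integers
    derived from fixed-size digests, so its per-entry size does not grow
    with the length of the terms.
    """
    seen = set()
    for term, cui in pairs:
        raw = f"{cui}\t{term}".encode("utf-8")
        digest = hashlib.blake2b(raw, digest_size=digest_size).digest()
        key = int.from_bytes(digest, "big")
        if key in seen:
            continue  # already emitted (or, very rarely, a hash collision)
        seen.add(key)
        yield term, cui
```

The set still grows with the number of distinct pairs, so this only reduces the per-entry cost; it does not bound the total.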

Oct 27 '20 15:10 fschlatt

Hi! Have you solved that problem? I have the same one :(

Jan 26 '21 16:01 CatalinaZ16

Sort of. At the cost of including some duplicates in the SimString database, I was able to reduce the RAM footprint significantly. It now runs on the whole UMLS on my machine with 16 GB of RAM. Take a look at my fork of the repository for the fixes.
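
To illustrate the general trade-off (this is not the code from the fork, which is not shown in this thread), here is a minimal sketch of a duplicate filter with bounded memory. It remembers only the most recently seen items, so duplicates that are far apart in the input slip through. The name dedupe_bounded and the max_size default are hypothetical.

```python
from collections import OrderedDict
from typing import Hashable, Iterable, Iterator


def dedupe_bounded(
    items: Iterable[Hashable], max_size: int = 100_000
) -> Iterator[Hashable]:
    """Drop duplicates among the last `max_size` distinct items seen.

    Hypothetical helper: memory is capped at `max_size` entries, at the
    cost of letting through duplicates that are far apart in the input.
    """
    recent = OrderedDict()
    for item in items:
        if item in recent:
            recent.move_to_end(item)  # mark as recently used
            continue
        recent[item] = None
        if len(recent) > max_size:
            recent.popitem(last=False)  # evict the least recently used entry
        yield item
```

This kind of filter only helps if duplicates tend to sit close together in the input; whether that holds for a given MRCONSO.RRF file is an assumption to check.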

Feb 12 '21 08:02 fschlatt

Hi Ferdinand,

Thank you so much for following up on this! Would you be willing to make a pull request for this? I would be happy to review it and merge it in the core package.

Best, Luca

Feb 12 '21 15:02 soldni

Hey Luca,

Sure thing. I've also made it return the preferred term and applied black formatting to the repo, so there are a couple of additional changes. I'll create a pull request with my entire fork, and we can discuss there which parts are necessary and which are superfluous.

Best, Ferdinand

Feb 12 '21 15:02 fschlatt

Great, I'll try to review over the weekend!

Best, Luca

Feb 13 '21 00:02 soldni

Seems like this commit from @fschlatt may be the fix: https://github.com/Georgetown-IR-Lab/QuickUMLS/commit/76513933f5a311b2d2c4da06b16314f65c646e22 . I had to drastically increase my RAM for the install as well.

Feb 22 '21 11:02 jimhavrilla

I ran into this as well. I have 16 GB of memory. Is the recommended approach implementing the changes from the comment above?

Aug 15 '21 22:08 jmugan

I got it to work by being more selective about what I extracted from UMLS.
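
One possible way to be more selective, sketched here as an assumption rather than a description of what was actually done, is to pre-filter MRCONSO.RRF by language and source vocabulary before installing. The sketch relies on the standard pipe-delimited MRCONSO layout, where the second field is the language (LAT) and the twelfth is the source abbreviation (SAB). The function name filter_mrconso and its parameters are hypothetical.

```python
from typing import Collection, Iterable, Iterator, Optional


def filter_mrconso(
    lines: Iterable[str],
    languages: Collection[str] = ("ENG",),
    sources: Optional[Collection[str]] = None,
) -> Iterator[str]:
    """Yield only MRCONSO.RRF rows matching the given languages and sources.

    Hypothetical helper, not part of QuickUMLS. Assumes pipe-delimited rows
    with LAT at index 1 and SAB at index 11.
    """
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 15:
            continue  # skip malformed or truncated rows
        lat, sab = fields[1], fields[11]
        if lat not in languages:
            continue
        if sources is not None and sab not in sources:
            continue
        yield line
```

The filtered rows can be written to a new MRCONSO.RRF in a separate directory. Any other file the installer reads from that directory, such as MRSTY.RRF, would need to be copied alongside it.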

Aug 16 '21 03:08 jmugan