QuickUMLS
Installing takes too much memory
When running python -m quickumls.install on an MRCONSO.RRF file with about 7M rows, the memory footprint grows continuously and at some point the process is killed for using too much memory. The two main culprits I could find are the processed set (https://github.com/Georgetown-IR-Lab/QuickUMLS/blob/c0b5db059fbef8d70681626a34456ab3d906e5e7/quickumls/install.py#L66) and the simstring set (https://github.com/Georgetown-IR-Lab/QuickUMLS/blob/c0b5db059fbef8d70681626a34456ab3d906e5e7/quickumls/install.py#L113).
I assume they are there to prevent duplicate entries in the SimString and CuiSemType DBs. When using the unqlite database, a check for duplicate entries is already implemented in the insert call, so duplicate entries are a non-issue there. However, I am not sure whether the same is true for the SimString database. Is it safe to add duplicate terms/n-grams to the SimString database, or will that break anything? If it is safe, the large sets could be removed, along with their memory overhead for large UMLS subsets.
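If the duplicate check has to stay, one possible way to shrink the sets is to remember a short fingerprint of each entry instead of the full strings. The sketch below is only an illustration of that idea, not the code in install.py and not necessarily what any fork does. The function name iter_unique_pairs is hypothetical, and the 8-byte digest means that two different pairs could, very rarely, collide and one of them would be skipped.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def iter_unique_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield each (term, cui) pair once, keeping only a small fingerprint per pair.

    Hypothetical helper: the set holds 64-bit integers rather than tuples of
    full strings, so each remembered entry costs less memory. The set still
    grows with the number of unique pairs.
    """
    seen = set()
    for term, cui in pairs:
        # Join the two fields with a separator that cannot occur in either one.
        raw = f"{cui}\x00{term}".encode("utf-8")
        digest = hashlib.blake2b(raw, digest_size=8).digest()
        fingerprint = int.from_bytes(digest, "big")
        if fingerprint in seen:
            continue  # already emitted (or, very rarely, a hash collision)
        seen.add(fingerprint)
        yield term, cui
```

Because it is a generator, it can be wrapped around whatever loop reads the rows, so nothing besides the fingerprints has to be held in memory at once.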
Hi! Have you solved that problem? I have the same one :(
Sort of. At the cost of including some duplicates in the SimString database, I was able to reduce the RAM footprint by a significant amount. It now runs for the whole UMLS on my 16G RAM machine. Take a look at my fork of the repository for the fixes.
Hi Ferdinand,
Thank you so much for following up on this! Would you be willing to make a pull request for this? I would be happy to review it and merge it in the core package.
Best, Luca
Hey Luca,
Sure thing. I've also changed it so that the preferred term is returned, and I applied black formatting to the repo, so there are a couple of additional changes. I'll create a pull request with my entire fork and we can discuss there which parts are necessary and which are superfluous.
Best, Ferdinand
Great, I'll try to review over the weekend!
Best, Luca
It seems like this commit from @fschlatt may be the fix: https://github.com/Georgetown-IR-Lab/QuickUMLS/commit/76513933f5a311b2d2c4da06b16314f65c646e22. I had to drastically increase my RAM for the install as well.
I ran into this as well. I have 16 GB of memory. Is the recommended approach implementing the changes from the comment above?
I got it to work by being more selective about what I extracted from UMLS.
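The comment above does not say how the subset was chosen. As a rough sketch of what being more selective could look like, assuming the standard pipe-delimited MRCONSO.RRF layout (language in the second column, source vocabulary in the twelfth), rows can be filtered before the installer ever sees them. The function name filter_mrconso and its parameters are hypothetical and not part of QuickUMLS.

```python
from typing import Iterable, Iterator, Optional, Set

# Zero-based column positions in a pipe-delimited MRCONSO.RRF row.
LAT_COL = 1   # language of the term, e.g. "ENG"
SAB_COL = 11  # source vocabulary abbreviation, e.g. "MSH"


def filter_mrconso(
    lines: Iterable[str],
    languages: Set[str],
    sources: Optional[Set[str]] = None,
) -> Iterator[str]:
    """Yield only the rows whose language (and, optionally, source) is wanted."""
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= SAB_COL:
            continue  # malformed or truncated row
        if fields[LAT_COL] not in languages:
            continue
        if sources is not None and fields[SAB_COL] not in sources:
            continue
        yield line
```

The function streams line by line, so it can be pointed at an open file and its output written to a smaller MRCONSO.RRF without loading the whole file into memory.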