starcode
Large Count File Dataset -- Memory Issues
I am attempting to process a relatively large count file. It is around 10 GB and contains ~170 million unique sequences, each 60 bp in length.
The computer I am using has 64 GB of RAM and an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz. When I try to run starcode, the process is killed without any error message being output. I am guessing that the program exceeds the available memory, because when I subset my count file to 25 million sequences, starcode is able to process it but uses almost 100% of the computer's memory.
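(A quick way to confirm that a silent kill like this came from the Linux out-of-memory killer, assuming dmesg is accessible inside the WSL2 instance, is to check the kernel log right after the process dies:

    sudo dmesg | grep -iE "out of memory|killed process"

If starcode shows up there, the process was indeed terminated for exceeding available memory.)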
I am wondering if there is any solution to this, or how much RAM you believe would be needed to process a dataset of this size.
Thank you!
Thanks for reporting the issue. Starcode should terminate gracefully when it runs out of memory, but maybe we made a mistake somewhere. Are you running Starcode in a Docker container or on a cluster? We have sometimes observed strange behavior in those contexts. If you are interested, I would like to try running your sample on our machine to better understand which statement fails.
As for the required memory, it depends on the parameters you use for the run. Allowing more errors means keeping more branches of the prefix tree in memory, so setting the distance to a small value is not only faster, it also requires less memory. Depending on your goals, you could try lowering the distance and see whether that works for you.
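For reference, a run restricted to a small distance might look something like this (a minimal sketch; the exact flag names should be checked against starcode --help, and the file names are placeholders):

    # 4 threads, maximum Levenshtein distance of 1; a smaller -d keeps
    # fewer branches of the prefix tree in memory.
    starcode -d 1 -t 4 -i counts.txt -o clusters.txt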
Also, if the sequences consist of a constant region and a barcode, you could try to isolate the barcodes and cluster only those. Reducing the sequence length can give a spectacular improvement in performance because it significantly prunes the prefix tree.
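As a sketch of what isolating the barcodes could look like, assuming the barcode occupies the first 20 bp of each sequence and the count file is tab-separated sequence/count pairs (both of these are assumptions about your data):

    # Keep only the hypothetical 20 bp barcode and sum the counts of
    # sequences that collapse to the same barcode.
    awk 'BEGIN { OFS = "\t" }
         { counts[substr($1, 1, 20)] += $2 }
         END { for (b in counts) print b, counts[b] }' counts.txt > barcode_counts.txt

The smaller file can then be clustered with starcode as above.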
Let me know if any of this helps.
Hello,
Thank you for your reply. I am not running this in a Docker container or on a cluster; we are doing everything in WSL2.
I would be interested to see if running these samples on your machine results in a successful run.
Unfortunately, the sequences are already trimmed down as much as they can be and consist only of the variable 60 bp region. For my application I also only need a --distance of 1, so I don't think we can cut down memory use any further with that parameter.
Let me know if there are any other suggestions!
Thank you very much, Andrew
Thanks for clarifying. I'd be happy to look at the data on my machine. Can you contact me by email so that we can set up a way to transfer the data? My address is easy to find on the Internet (here, for instance).