CITE-seq-Count
Runtime >96h for 500M reads vs. ~2h for 150M read downsample
Currently trying to run a very large ADT library (~500M reads) for a CITE-Seq project.
First, I tried running on all 500M reads at once, with 8 cores. I got all the messages that would indicate things were going well, including a number of "Processed 1,000,000 reads in xx seconds" messages as well as 8 different messages like so:
Mapping done for process 11476. Processed 63,019,228 reads
Mapping done for process 11477. Processed 63,019,228 reads
Mapping done for process 11478. Processed 63,019,228 reads
Mapping done for process 11479. Processed 63,019,228 reads
Mapping done for process 11480. Processed 63,019,228 reads
Mapping done for process 11481. Processed 63,019,228 reads
Mapping done for process 11482. Processed 63,019,228 reads
Mapping done for process 11483. Processed 63,019,229 reads
After that, though, there was no additional progress for an extremely long time (>96h). Eventually the job failed because it exceeded the runtime limit I had specified to the cluster.
After this, I tried downsampling to a still relatively large number of reads (150M). These jobs ran relatively quickly (~2h) and the output gave sensible results.
Do you have any suggestions for how to get the full set of reads running in a reasonable runtime? The log messages suggest that the issue occurs downstream of the step where each chunk of reads is processed separately, so it seems like just adding more cores will not help. I am assuming it is an issue in checking for UMI uniqueness, which would require the full set of UMIs from all 500M reads. But if that is the case, it is strange that roughly tripling the number of reads would more than triple the runtime.
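To illustrate why I suspect the UMI uniqueness check could scale worse than linearly: if correction compares every UMI against every other UMI (e.g. within one mismatch), the work grows roughly with the square of the number of UMIs. A minimal sketch of that kind of naive pairwise check, purely hypothetical and not CITE-seq-Count's actual code:

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def count_close_umi_pairs(umis, max_dist=1):
    """Count UMI pairs within max_dist mismatches of each other.

    Purely illustrative: comparing all pairs is O(n^2) in the number of
    distinct UMIs, which is one way a ~3x increase in reads could cost
    far more than ~3x the runtime.
    """
    return sum(1 for a, b in combinations(umis, 2) if hamming(a, b) <= max_dist)

# Example: three toy UMIs, two of them one mismatch apart.
print(count_close_umi_pairs(["AACGT", "AACGA", "TTTTT"]))  # -> 1
```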
I have the same problem. The first dataset finished successfully in ~2 hours.
Processed 1,000,000 reads in 15.72 seconds. Total reads: 93,000,000 in child 42921
Processed 1,000,000 reads in 15.85 seconds. Total reads: 94,000,000 in child 42921
Processed 1,000,000 reads in 13.67 seconds. Total reads: 95,000,000 in child 42921
Processed 1,000,000 reads in 18.88 seconds. Total reads: 96,000,000 in child 42921
Processed 1,000,000 reads in 14.08 seconds. Total reads: 97,000,000 in child 42921
Processed 1,000,000 reads in 17.5 seconds. Total reads: 98,000,000 in child 42921
Processed 1,000,000 reads in 13.27 seconds. Total reads: 99,000,000 in child 42921
Processed 1,000,000 reads in 13.84 seconds. Total reads: 100,000,000 in child 42921
Processed 1,000,000 reads in 18.59 seconds. Total reads: 101,000,000 in child 42921
Processed 1,000,000 reads in 15.98 seconds. Total reads: 102,000,000 in child 42921
Processed 1,000,000 reads in 13.32 seconds. Total reads: 103,000,000 in child 42921
Mapping done for process 42921. Processed 103,021,844 reads
Mapping done
Correcting cell barcodes
Looking for a whitelist
Collapsing cell barcodes
Correcting umis
The second dataset, which is slightly larger, hung after mapping:
Processed 1,000,000 reads in 12.52 seconds. Total reads: 142,000,000 in child 253987
Processed 1,000,000 reads in 12.58 seconds. Total reads: 143,000,000 in child 253987
Processed 1,000,000 reads in 12.24 seconds. Total reads: 144,000,000 in child 253987
Processed 1,000,000 reads in 12.22 seconds. Total reads: 145,000,000 in child 253987
Processed 1,000,000 reads in 12.29 seconds. Total reads: 146,000,000 in child 253987
Processed 1,000,000 reads in 12.58 seconds. Total reads: 147,000,000 in child 253987
Processed 1,000,000 reads in 12.66 seconds. Total reads: 148,000,000 in child 253987
Mapping done for process 253987. Processed 148,408,557 reads
I'm currently working on a new branch that lets you choose the chunk size for the mapping stage and implements parallelization for UMI correction.
@heathergeiger I actually don't need all the UMIs for the correction, as I do a per-barcode/TAG correction.
This is highly unstable right now, so expect errors and bugs, but maybe we can do something with this one.
The chunk size choice might also fix the other issue with too many reads per core.
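To sketch what the per-barcode/TAG correction mentioned above means in practice (hypothetical helper and input names, not the branch's actual code): UMIs are grouped by (cell barcode, TAG), so correction only ever sees the small set of UMIs belonging to one pair instead of all UMIs from all reads at once.

```python
from collections import defaultdict

def group_umis_per_barcode_tag(records):
    """Group UMIs by (cell barcode, TAG).

    `records` is an iterable of (barcode, tag, umi) tuples (an assumed
    input format). Each resulting set can be corrected independently,
    so the full set of UMIs from all reads is never needed at once and
    the per-group corrections are easy to parallelize.
    """
    groups = defaultdict(set)
    for barcode, tag, umi in records:
        groups[(barcode, tag)].add(umi)
    return groups

# Example with toy records:
reads = [("ACGTACGT", "CD3", "AAAAAA"),
         ("ACGTACGT", "CD3", "AAAAAT"),
         ("ACGTACGT", "CD19", "CCCCCC")]
print(group_umis_per_barcode_tag(reads))
```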
Looking at your log files, this looks more like a problem where the child processes are sending the results back to the main process but they never arrive or take too long. I would expect more cores to actually help there.
The branch I mentioned before is also changing the way reads are read from disk.
First, it writes uncompressed chunk files to disk and creates one process per file to read it in and then send back the results as usual. The chunk size also sets a limit on the number of reads per file, so new chunks keep being dispatched until all the data has been processed, even if there are not enough cores to handle every chunk at once.
One of the big bottlenecks is UMI correction, and this is now parallelized, which should speed up the second part by a lot.
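A minimal sketch of the kind of chunked dispatch described above, using only the Python standard library; the chunk size, worker count, file name, and function names are assumptions for illustration, not the branch's actual implementation:

```python
import gzip
import itertools
from multiprocessing import Pool

CHUNK_SIZE = 10_000_000  # reads per chunk; an assumed value for illustration

def read_chunks(fastq_path, chunk_size=CHUNK_SIZE):
    """Yield lists of FASTQ lines, at most chunk_size reads (4 lines each) per chunk."""
    with gzip.open(fastq_path, "rt") as handle:
        while True:
            chunk = list(itertools.islice(handle, 4 * chunk_size))
            if not chunk:
                break
            yield chunk

def map_chunk(chunk):
    """Placeholder for the per-chunk mapping step; here it just counts reads."""
    return len(chunk) // 4

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        # Chunks keep being submitted until the file is exhausted, even if
        # there are more chunks than cores; results are collected in order.
        total = sum(pool.imap(map_chunk, read_chunks("ADT_R2.fastq.gz")))
        print("total reads processed:", total)
```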
Hello @heathergeiger the latest develop branch should offer a big improvement in performance. Can you please try it out?
@Hoohm asking one of the sys admins to help install now!
Update: Installed, submitted with 8 cores. Should I also be specifying a chunk size, or will it do that automatically?
Deleted previous comment about new version not working by the way. I had an error in my path so was still running with the old version. Retrying with the actual new version now.
OK, the actual new version has now been running for over 24 hours on a 371M read sample. This is more than would be expected proportionally based on the runtime for the 150M read downsample. Any ideas?
Just checking in on this again. Like I said, the new version doesn't seem to have helped.
Yeah sorry, difficult times right now.
I'm going to start testing on large datasets as well.
I really want this to be possible because I want to let users keep up with the new sequencing platforms offering higher depth.
Still on it, but as usual, slow :(
@heathergeiger yes, the chunking is automatic
Just a quick check, how much of this big dataset is unmapped?
Ok, new stuff coming in.
Now I don't do any corrections on unmapped reads, plus a whole bunch of other speedups. Could you try the feature/barcode_translation branch?
I had a similar issue with 150M reads. It seems the bottleneck was the amount of available memory in my case. The program hangs with 32GB, but works fine with 64GB (took 4-6 hours). Adding more CPUs while keeping the memory at 32GB didn't help (it still hangs). I'm still using the 1.4.2-develop branch (a snapshot taken around Sep 2019).
I also tried the feature/barcode_translation branch (a snapshot taken on Oct 21, 2020). It runs really fast (approx. 1 hour for the same dataset), but the output is very weird/unusable.
- Percentage mapped: 100
- UMIs corrected: 0
- the count matrix output is essentially empty (the compressed files are only a few dozen bytes):
-rw-r--r-- 1 chunj 33 Oct 25 14:10 barcodes.tsv.gz
-rw-r--r-- 1 chunj 114 Oct 25 14:10 features.tsv.gz
-rw-r--r-- 1 chunj 84 Oct 25 14:10 matrix.mtx.gz
I would love to try out the improved versions, but it would be really helpful if you could create a tag (e.g. 1.5.0-beta.3) so that people like me can track experiments properly and keep the results reproducible.
That's a good idea! Thanks for testing. Happy to see it goes faster, sad that it doesn't output anything :(
I got into a tangled mess of multiple different changes that I have to push through. I'll tag the next beta that I hope works properly. Thanks for the advice
I'm having the same issue as you. The read count is full though.
Working on this bug atm.
ok! I think it's fixed. Can you try out the 1.5.0-alpha tag?
https://github.com/Hoohm/CITE-seq-Count/tags
Sorry, I was on a leave of absence for a while. I'm back and trying out 1.5.0-alpha, but I encountered two issues:
Looks like the following two libraries are not being installed. I manually installed them via pip, but I guess they should be part of setup.py?
- requests
- pooch
After taking care of the dependencies, I was able to run CSC, but it exited out immediately with this error.
Loading whitelist
The header is missing feature_name,sequence. Exiting
The same command line arguments (thus, same dataset, same tag, ...) worked fine with 1.4.2-develop (which I've been using), but it failed with 1.5.0-alpha. Any idea?
Thanks, I forgot to add the packages.
Part of the new big release is going to be more structure being enforced on user inputs.
In this case, CSV files now require headers so that CSC can check if anything is missing or wrongly defined.
In your specific case, the feature_name and sequence headers are missing.
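If it helps anyone hitting the same error, here is a quick standard-library check you can run on a tags file before launching CSC. The required header names are taken from the error message above; the column order is not checked because I'm not certain what order CSC expects:

```python
import csv
import sys

REQUIRED = {"feature_name", "sequence"}  # header names from the error message above

def check_tags_header(path):
    """Warn if a tags CSV is missing the header columns that 1.5.0-alpha complains about."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle), [])
    missing = REQUIRED - {col.strip() for col in header}
    if missing:
        print(f"{path}: missing header column(s): {', '.join(sorted(missing))}")
    else:
        print(f"{path}: header looks OK: {header}")

if __name__ == "__main__":
    check_tags_header(sys.argv[1])
```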
Never mind. Unlike the previous version, it looks like this version requires the tag file to have the header feature_name,sequence.
Anyway, the results were the same as before. It took one hour to finish, but the matrix contains nothing:
cat run_report.yaml
Date: 2021-06-15
Running time: 1.0 hour, 27.68 seconds
CITE-seq-Count Version: 1.5.0
Reads processed: 444151085
Percentage mapped: 95
Percentage unmapped: 5
The outputs are less than 100 bytes:
-rw-r--r-- 1 root root 33 Jun 15 01:32 barcodes.tsv.gz
-rw-r--r-- 1 root root 91 Jun 15 01:32 features.tsv.gz
-rw-r--r-- 1 root root 84 Jun 15 01:32 matrix.mtx.gz
Literally, it contains nothing:
$ gunzip -c matrix.mtx.gz
%%MatrixMarket matrix coordinate integer general
%
4 0 0
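For anyone who wants to sanity-check the output without unzipping by hand, here is a small standard-library snippet that reads the MatrixMarket size line (the file name matches the listing above; interpreting rows as features and columns as barcodes is an assumption based on the accompanying file names):

```python
import gzip

def mtx_summary(path="matrix.mtx.gz"):
    """Print the size line of a MatrixMarket .mtx.gz file.

    The first non-comment line holds: rows, columns, non-zero entries.
    For the run above this prints 4 x 0 with 0 entries, i.e. no
    barcodes and no counts made it into the matrix.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("%"):
                continue  # skip the MatrixMarket banner and comment lines
            rows, cols, entries = (int(x) for x in line.split())
            print(f"{rows} rows x {cols} columns, {entries} non-zero entries")
            return

mtx_summary()
```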
The last changes I made broke everything again because I'm redesigning the way I deal with translation references and filtered subsets of cells, mainly to clear up the confusion around the 10x V3 chemistry.
What would you want to test out atm? I can focus on a smaller fix if you want something specific.
As the title of this issue says, I've been having issues with large FASTQ files (400M+ reads). CSC just gets stuck at some point and never terminates (2 days, 3 days, ...).
I'm still using the 1.4.2-develop branch (a snapshot taken around Sep 2019). My current solution is to add a huge amount of memory (e.g. 192 GB) just to get it working...
The recent versions (including 1.5.0-alpha) all gave me an empty count matrix, so they are unusable at the moment. I guess I will just stick with the snapshot version I have.
Let me know when you have a fix for this issue. I will definitely try. Thanks!
Hi all, thank you very much for the tool and the support!
I have the same problem as described in the issue: one sample with 330 million reads, another with 430 million.
@Hoohm, is there a version of the development branch I can try?
@hisplan, when you run your large dataset with the 1.4.2-develop branch and a huge amount of memory, does the analysis finish successfully?
Thanks!
@aricht Yes, I'm still using 1.4.2-develop (a snapshot taken around Sep 2019), and CSC does finish successfully. My solution was adding more memory and being patient :-)
Thank you very much for the info! I will keep on testing, then, and try to be really patient :)