CITE-seq-Count
Runtime >96h for 500M reads vs. ~2h for 150M read downsample
Currently trying to run a very large ADT library (~500M reads) for a CITE-Seq project.
First, I tried running on all 500M reads at once, with 8 cores. I got all the messages that would indicate things were going well, including a number of "Processed 1,000,000 reads in xx seconds" messages as well as 8 different messages like so:
Mapping done for process 11476. Processed 63,019,228 reads
Mapping done for process 11477. Processed 63,019,228 reads
Mapping done for process 11478. Processed 63,019,228 reads
Mapping done for process 11479. Processed 63,019,228 reads
Mapping done for process 11480. Processed 63,019,228 reads
Mapping done for process 11481. Processed 63,019,228 reads
Mapping done for process 11482. Processed 63,019,228 reads
Mapping done for process 11483. Processed 63,019,229 reads
After that, though, there was no additional progress for an extremely long time (>96h). Eventually the job failed because it exceeded the runtime limit I had specified to the cluster.
After this, I tried downsampling to a still relatively large number of reads (150M). These jobs ran relatively quickly (~2h) and the output gave sensible results.
Do you have any suggestions for how to get the full set of reads running in a reasonable runtime? The log messages suggest that the issue occurs downstream of the step where each chunk of reads is processed separately, so it seems like just adding more cores will not help. I am assuming it is an issue in checking for UMI uniqueness, which would require the full set of UMIs from all 500M reads. But if that is the case, it is strange that roughly tripling the number of reads would more than triple the runtime.
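To illustrate why I suspect the UMI uniqueness check could scale worse than linearly: if correction compares every UMI against every other UMI (e.g. within one mismatch), the work grows roughly with the square of the number of UMIs. A minimal sketch of that kind of naive pairwise check, purely hypothetical and not CITE-seq-Count's actual code:

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def count_close_umi_pairs(umis, max_dist=1):
    """Count UMI pairs within max_dist mismatches of each other.

    Purely illustrative: comparing all pairs is O(n^2) in the number of
    distinct UMIs, which is one way a ~3x increase in reads could cost
    far more than ~3x the runtime.
    """
    return sum(1 for a, b in combinations(umis, 2) if hamming(a, b) <= max_dist)

# Example: three toy UMIs, two of them one mismatch apart.
print(count_close_umi_pairs(["AACGT", "AACGA", "TTTTT"]))  # -> 1
```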
I have the same problem. The first dataset finished successfully in ~2 hours.
Processed 1,000,000 reads in 15.72 seconds. Total reads: 93,000,000 in child 42921
Processed 1,000,000 reads in 15.85 seconds. Total reads: 94,000,000 in child 42921
Processed 1,000,000 reads in 13.67 seconds. Total reads: 95,000,000 in child 42921
Processed 1,000,000 reads in 18.88 seconds. Total reads: 96,000,000 in child 42921
Processed 1,000,000 reads in 14.08 seconds. Total reads: 97,000,000 in child 42921
Processed 1,000,000 reads in 17.5 seconds. Total reads: 98,000,000 in child 42921
Processed 1,000,000 reads in 13.27 seconds. Total reads: 99,000,000 in child 42921
Processed 1,000,000 reads in 13.84 seconds. Total reads: 100,000,000 in child 42921
Processed 1,000,000 reads in 18.59 seconds. Total reads: 101,000,000 in child 42921
Processed 1,000,000 reads in 15.98 seconds. Total reads: 102,000,000 in child 42921
Processed 1,000,000 reads in 13.32 seconds. Total reads: 103,000,000 in child 42921
Mapping done for process 42921. Processed 103,021,844 reads
Mapping done
Correcting cell barcodes
Looking for a whitelist
Collapsing cell barcodes
Correcting umis
The second dataset, which is slightly larger, hung after mapping:
Processed 1,000,000 reads in 12.52 seconds. Total reads: 142,000,000 in child 253987
Processed 1,000,000 reads in 12.58 seconds. Total reads: 143,000,000 in child 253987
Processed 1,000,000 reads in 12.24 seconds. Total reads: 144,000,000 in child 253987
Processed 1,000,000 reads in 12.22 seconds. Total reads: 145,000,000 in child 253987
Processed 1,000,000 reads in 12.29 seconds. Total reads: 146,000,000 in child 253987
Processed 1,000,000 reads in 12.58 seconds. Total reads: 147,000,000 in child 253987
Processed 1,000,000 reads in 12.66 seconds. Total reads: 148,000,000 in child 253987
Mapping done for process 253987. Processed 148,408,557 reads
I'm currently working on a new branch that lets you choose the chunk size for the mapping stage and implements parallelization for UMI correction.
@heathergeiger I actually don't need all the UMIs for the correction, as I do a per-barcode/TAG correction.
This is highly unstable right now, so expect errors and bugs, but maybe we can do something with this one.
The chunk size choice might also fix the other issue with too many reads per core.
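To sketch what the per-barcode/TAG correction mentioned above means in practice (hypothetical helper and input names, not the branch's actual code): UMIs are grouped by (cell barcode, TAG), so correction only ever sees the small set of UMIs belonging to one pair instead of all UMIs from all reads at once.

```python
from collections import defaultdict

def group_umis_per_barcode_tag(records):
    """Group UMIs by (cell barcode, TAG).

    `records` is an iterable of (barcode, tag, umi) tuples (an assumed
    input format). Each resulting set can be corrected independently,
    so the full set of UMIs from all reads is never needed at once and
    the per-group corrections are easy to parallelize.
    """
    groups = defaultdict(set)
    for barcode, tag, umi in records:
        groups[(barcode, tag)].add(umi)
    return groups

# Example with toy records:
reads = [("ACGTACGT", "CD3", "AAAAAA"),
         ("ACGTACGT", "CD3", "AAAAAT"),
         ("ACGTACGT", "CD19", "CCCCCC")]
print(group_umis_per_barcode_tag(reads))
```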
Looking at your log files, this looks more like a problem where the child processes are sending the results back to the main process but they never arrive or take too long. I would expect more cores to actually help there.
The branch I mentioned before is also changing the way reads are read from disk.
First, it writes uncompressed chunk files to disk and creates one process per file to read it in and then send back the results as usual. The chunk size also sets a limit on the number of reads per file, so new chunks keep being dispatched until all the data has been processed, even if there are not enough cores to handle every chunk at once.
One of the big bottlenecks is UMI correction, and this is now parallelized, which should speed up the second part by a lot.
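A minimal sketch of the kind of chunked dispatch described above, using only the Python standard library; the chunk size, worker count, file name, and function names are assumptions for illustration, not the branch's actual implementation:

```python
import gzip
import itertools
from multiprocessing import Pool

CHUNK_SIZE = 10_000_000  # reads per chunk; an assumed value for illustration

def read_chunks(fastq_path, chunk_size=CHUNK_SIZE):
    """Yield lists of FASTQ lines, at most chunk_size reads (4 lines each) per chunk."""
    with gzip.open(fastq_path, "rt") as handle:
        while True:
            chunk = list(itertools.islice(handle, 4 * chunk_size))
            if not chunk:
                break
            yield chunk

def map_chunk(chunk):
    """Placeholder for the per-chunk mapping step; here it just counts reads."""
    return len(chunk) // 4

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        # Chunks keep being submitted until the file is exhausted, even if
        # there are more chunks than cores; results are collected in order.
        total = sum(pool.imap(map_chunk, read_chunks("ADT_R2.fastq.gz")))
        print("total reads processed:", total)
```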
Hello @heathergeiger the latest develop branch should offer a big improvement in performance. Can you please try it out?
@Hoohm asking one of the sys admins to help install now!
Update: Installed, submitted with 8 cores. Should I also be specifying a chunk size, or will it do that automatically?
Deleted previous comment about new version not working by the way. I had an error in my path so was still running with the old version. Retrying with the actual new version now.
OK, the actual new version has now been running for over 24 hours on a 371M read sample. This is more than would be expected proportionally based on the runtime for the 150M read downsample. Any ideas?
Just checking in on this again. Like I said, the new version doesn't seem to have helped.
Yeah sorry, difficult times right now.
I'm going to start testing on large datasets as well.
I really want this to be possible because I want to let users keep up with the new sequencing platforms offering higher depth.
Still on it, but as usual, slow :(
@heathergeiger yes, the chunking is automatic
Just a quick check, how much of this big dataset is unmapped?
Ok, new stuff coming in.
Now I don't do any corrections on unmapped reads, plus a whole bunch of other speedups. Could you try the feature/barcode_translation branch?
I had a similar issue with 150M reads. It seems the bottleneck was the amount of available memory in my case. The program hangs with 32GB, but works fine with 64GB (took 4-6 hours). Adding more CPUs while keeping the memory at 32GB didn't help (it still hangs). I'm still using the 1.4.2-develop branch (a snapshot taken around Sep 2019).
I also tried the feature/barcode_translation branch (a snapshot taken on Oct 21, 2020). It runs really fast (approx. 1 hour for the same dataset), but the output is very weird/unusable.
- Percentage mapped: 100
- UMIs corrected: 0
- the count matrix output is essentially empty (the compressed files are only a few dozen bytes):
-rw-r--r-- 1 chunj 33 Oct 25 14:10 barcodes.tsv.gz
-rw-r--r-- 1 chunj 114 Oct 25 14:10 features.tsv.gz
-rw-r--r-- 1 chunj 84 Oct 25 14:10 matrix.mtx.gz
I would love to try out the improved versions, but it would be really helpful if you could create a tag (e.g. 1.5.0-beta.3) so that people like me can track experiments properly and keep the results reproducible.
That's a good idea! Thanks for testing. Happy to see it goes faster, sad that it doesn't output anything :(
I got into a tangled mess of multiple different changes that I have to push through. I'll tag the next beta that I hope works properly. Thanks for the advice
I'm having the same issue as you. The read count is full though.
Working on this bug atm.
ok! I think it's fixed. Can you try out the 1.5.0-alpha tag?
https://github.com/Hoohm/CITE-seq-Count/tags
Sorry, I was on a leave of absence for a while. I'm back and trying out 1.5.0-alpha, but I encountered two issues:
Looks like the following two libraries are not being installed. I manually installed them via pip, but I guess they should be part of setup.py?
- requests
- pooch
After taking care of the dependencies, I was able to run CSC, but it exited out immediately with this error.
Loading whitelist
The header is missing feature_name,sequence. Exiting
The same command line arguments (thus, same dataset, same tag, ...) worked fine with 1.4.2-develop (which I've been using), but it failed with 1.5.0-alpha. Any idea?
Thanks, I forgot to add the packages.
Part of the new big release is going to be more structure being enforced on user inputs.
In this case, CSV files now require headers so that CSC can check if anything is missing or wrongly defined.
In your specific case, the feature_name and sequence headers are missing.
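If it helps anyone hitting the same error, here is a quick standard-library check you can run on a tags file before launching CSC. The required header names are taken from the error message above; the column order is not checked because I'm not certain what order CSC expects:

```python
import csv
import sys

REQUIRED = {"feature_name", "sequence"}  # header names from the error message above

def check_tags_header(path):
    """Warn if a tags CSV is missing the header columns that 1.5.0-alpha complains about."""
    with open(path, newline="") as handle:
        header = next(csv.reader(handle), [])
    missing = REQUIRED - {col.strip() for col in header}
    if missing:
        print(f"{path}: missing header column(s): {', '.join(sorted(missing))}")
    else:
        print(f"{path}: header looks OK: {header}")

if __name__ == "__main__":
    check_tags_header(sys.argv[1])
```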
Never mind. Unlike the previous version, it looks like this version requires the tag file to have the header feature_name,sequence.
Anyway, the results were the same as before. It took one hour to finish, but the matrix contains nothing:
cat run_report.yaml
Date: 2021-06-15
Running time: 1.0 hour, 27.68 seconds
CITE-seq-Count Version: 1.5.0
Reads processed: 444151085
Percentage mapped: 95
Percentage unmapped: 5
The outputs are less than 100 bytes:
-rw-r--r-- 1 root root 33 Jun 15 01:32 barcodes.tsv.gz
-rw-r--r-- 1 root root 91 Jun 15 01:32 features.tsv.gz
-rw-r--r-- 1 root root 84 Jun 15 01:32 matrix.mtx.gz
Literally, it contains nothing:
$ gunzip -c matrix.mtx.gz
%%MatrixMarket matrix coordinate integer general
%
4 0 0
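For anyone who wants to sanity-check the output without unzipping by hand, here is a small standard-library snippet that reads the MatrixMarket size line (the file name matches the listing above; interpreting rows as features and columns as barcodes is an assumption based on the accompanying file names):

```python
import gzip

def mtx_summary(path="matrix.mtx.gz"):
    """Print the size line of a MatrixMarket .mtx.gz file.

    The first non-comment line holds: rows, columns, non-zero entries.
    For the run above this prints 4 x 0 with 0 entries, i.e. no
    barcodes and no counts made it into the matrix.
    """
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("%"):
                continue  # skip the MatrixMarket banner and comment lines
            rows, cols, entries = (int(x) for x in line.split())
            print(f"{rows} rows x {cols} columns, {entries} non-zero entries")
            return

mtx_summary()
```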
The last changes I made broke everything again because I'm redesigning the way I deal with translation references and filtered subsets of cells, mainly to clear up the confusion around the 10x V3 chemistry.
What would you want to test out atm? I can focus on a smaller fix if you want something specific.
As the title of this issue says, I've been having issues with large FASTQ files (400M+ reads). CSC just gets stuck at some point and never terminates (2 days, 3 days, ...).
I'm still using the 1.4.2-develop branch (a snapshot taken around Sep 2019). My current solution is to add a huge amount of memory (e.g. 192 GB) just to get it working...
The recent versions (including 1.5.0-alpha) all gave me an empty count matrix, so they are unusable at the moment. I guess I will just stick with the snapshot version I have.
Let me know when you have a fix for this issue. I will definitely try. Thanks!
Hi all, thank you very much for the tool and the support!
I have the same problem as described in the issue: one sample with 330 million reads, another with 430 million.
@Hoohm, is there a version of the development branch I can try?
@hisplan, when you run your large dataset with the 1.4.2-develop branch and a huge amount of memory, does the analysis finish successfully?
Thanks!
@aricht Yes, I'm still using 1.4.2-develop (a snapshot taken around Sep 2019), and CSC does finish successfully. My solution was adding more memory and being patient :-)
Thank you very much for the info! I will keep on testing, then, and try to be really patient :)