Open chromatin predict
Thanks for this great tool. I am trying to use open chromatin predict on our data. I am running on a GPU using the candle version.
I first tried running it on the BAM (40G) directly, but it got stuck and didn't write anything - I waited more than 48 hours. I also tried specifying just a chromosome, and then a small BED, with the same issue each time. I then downsampled to a smaller BAM (2.7G) and was able to run on a chromosome-by-chromosome basis, though some chromosomes also got stuck. This does not seem to depend on the size of the chromosome.
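For context, the downsampling was along these lines - a rough sketch with samtools, where the seed, fraction, and file names are placeholders rather than the exact values used:
# subsample the 40G BAM to a fraction of the reads (42 is the seed, .10 the fraction)
samtools view -s 42.10 -b -o downsampled.bam full.bam
samtools index downsampled.bam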
I am trying to see which reads cause it to hang, but it is not proceeding in sequential order through the BAM, at least in terms of what is written out to the bedGraph.
We are also interested in unmapped reads, but the tool finished and returned nothing for them even though there are 6mA calls on those reads - is there a flag I need to pass to make sure it includes unmapped reads?
Hello @nchernia,
Sorry that the command isn't working.
Regarding the program getting stuck: could you run it with --log <log-filename> and attach the log here (or send it via email)? From that I may be able to tell where it's getting stuck. Does the program consume RAM or GPU resources while it appears hung?
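Something along these lines should do it - everything below except --log is a placeholder for whatever you're currently running:
# re-run your existing command with --log added
modkit <your open chromatin predict subcommand and usual arguments> --log open_chromatin.log
# in a second terminal, watch host and GPU memory while it appears stuck
top
nvidia-smi -l 5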
I have one other example internally that sounds like it causes a similar problem. Do you think it's possible to subset the BAM to a single small-ish region that reproduces the problem? If so, could you send it to me? Feel free to email me at art.rand [at] nanoporetech.com and we can arrange how to transfer the BAM.
Regarding the unmapped reads, the open chromatin algorithm requires that the reads are mapped to the reference to determine if that region is accessible to the MTase or not. I'm not sure how you would use unmapped reads in this case. Maybe if you elaborate on what you're trying to learn from these reads I can help you more.
Thanks.
Thanks for your response!
I'm now trying on just chr1 with a file subsetted using samtools (it's 3.2GB). It also appears stuck:
using device Cuda(CudaDevice { device: CudaDevice(DeviceId(1)), index: 0 })
loaded model config { "num_features": 12, "num_classes": 2, "hidden_size": 256, "chunk_size": 100, "modified_bases": { "A": [ "a" ] } }
loading weights from "/home/neva/dist_modkit_v0.5.0_5120ef7/models/[email protected]/model.mpk"
collecting regions of 25675bp (100 bp chunks), super batches of 100 (2567500bp). Stepping 25 bp at a time.
0 records written
The log is also not being written to anymore. Usually when it works, I see records being written. Attached is the log file.
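For reference, the chr1 subset used in the run above was made roughly like this (file names are placeholders):
samtools view -b -o chr1.bam full.bam chr1
samtools index chr1.bam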
Regarding the unmapped reads, we are interested in tandem repeats, which often differ between individuals and do not map well to the reference.
Hello @nchernia
Would you be willing to share the BAM that causes the problem with me? I've been testing with 30-40x coverage on the whole genome and can't reproduce this problem. You can email me at art.rand[at]nanoporetech.com and I can set up a way to share if you don't have one already.
@nchernia
I may have found where the problem is. Could you try adding the following options to your command:
--super-batch-size 10 --batch-size 64
If that works, you can also try dropping --batch-size 64 and running with just --super-batch-size 10. I should have a fix/warning soon.
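For example, keeping everything else in your command the same (the subcommand and arguments below are placeholders; only the two flags are the actual suggestion):
modkit <your open chromatin predict subcommand and usual arguments> \
    --super-batch-size 10 \
    --batch-size 64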
Thank you - I ran it with these parameters and it seemed to work. I will try without --batch-size and report back. For this run, there's an unusual message at the end:
collecting regions of 1675bp (100 bp chunks), super batches of 10 (16750bp). Stepping 25 bp at a time.
9349538 records written
> model received receiving on an empty and disconnected channel
> write handle got receiving on an empty and disconnected channel
Using just --super-batch-size worked as well. The results are slightly different: there are 8 more lines written in the version with --batch-size 64, and its log file is much bigger (95M vs 6.2M). I got the same message at the end:
> model received receiving on an empty and disconnected channel
> write handle got receiving on an empty and disconnected channel
Head of the original bedGraph (run with --batch-size 64):
chr1 3400 3425 0.6411874
chr1 3425 3450 0.363223
chr1 3450 3475 0.26683313
chr1 3475 3500 0.22840261
chr1 3500 3525 0.095497906
chr1 3525 3550 0.26477274
chr1 3550 3575 0.49079007
chr1 3575 3600 0.708526
chr1 3600 3625 0.92767143
Head of the version without --batch-size:
chr1 3400 3425 0.63424516
chr1 3425 3450 0.37120336
chr1 3450 3475 0.27269077
chr1 3475 3500 0.2238231
chr1 3500 3525 0.0817063
chr1 3525 3550 0.23102966
chr1 3550 3575 0.45643026
chr1 3575 3600 0.68326914
chr1 3600 3625 0.913419
Hello @nchernia,
Ok good, glad it seems to have worked.
there are 8 more lines written in the version with --batch-size 64
Could you expand on this? Are they at the end, beginning, or interspersed?
Could you give me an idea of what all the extra log lines are?
Hi,
They are interspersed; attached are the first 100 lines of running diff on the first 3 fields (diffs.txt). I'm also attaching the two log files, with the bigger one cut to the first 100K lines.
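The comparison was done roughly like this - a sketch with placeholder file names:
# compare only the first 3 (coordinate) fields of the two bedGraphs, keep the first 100 diff lines
diff <(cut -f1-3 with_batch_size.bedGraph) <(cut -f1-3 without_batch_size.bedGraph) | head -n 100 > diffs.txt
# trim the larger log to its first 100K lines before attaching
head -n 100000 with_batch_size.log > with_batch_size_first100k.log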
Hi @ArtRand
This is happening again on a different file, even with the --super-batch-size 10 --batch-size 64 flags. I will email you to see about sending the file for testing.
Thanks Neva
Sorry, one update. I tried running it on a chromosome-by-chromosome basis to isolate the error and that didn't work, but then I tried only the flag --super-batch-size 10 (without --batch-size) and it seems to be working.