DCC icon indicating copy to clipboard operation
DCC copied to clipboard

Question about circFilter.py

Open JunmingH opened this issue 5 years ago • 14 comments

Hi,

I am trying to run DCC but always stuck at circFilter.py The error message is below. Traceback (most recent call last): File "/circu_RNA/DCC-0.4.8/DCC/main.py", line 842, in main() File "/circu_RNA/DCC-0.4.8/DCC/main.py", line 375, in main filt.filter_nonrep(rep_file, indx0, count0) File "/circu_RNA/DCC-0.4.8/DCC/circFilter.py", line 110, in filter_nonrep nonrep = np.column_stack((indx0, count0)) File "/share/pkg.7/python2/2.7.16/install/lib/python2.7/site-packages/numpy/lib/shape_base.py", line 640, in column_stack return _nx.concatenate(arrays, 1) MemoryError

Could you please give me some ideas about this?

BTW is that possible it spends two days on combining those results together? Since each time it takes a very long time to combine, but still, have a memory error. I only use 4 threads and request 252 GB.

Best

JunmingH avatar Mar 16 '20 13:03 JunmingH

Dear @JunmingH,

would you have the complete command line that triggered this error?

Regarding the time constraints - two days seems pretty long. But the time required depends on data and command line used.

Cheers, Tobias

tjakobi avatar Mar 19 '20 22:03 tjakobi

Hi Tobias,

Attached is the script. I ran it with around 300 samples. But there is no chance to run it completly.

/DCC-0.4.7/DCC/main.py @DCC_InputFiles/samplesheet_BM10 -T 2 -D -N -R /ref/GRCh38_Repeats_simpleRepeats_RepeatMasker.gtf -an /ref/gencode.v26.primary_assembly.annotation.gtf -F -M -Nr 1 1 -fg -G -A /ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa -O /MSBB_rep/BM10 -B @DCC_InputFiles/bam_files_BM10 -t MSBB_rep/tmp_dir/BM10

JunmingH avatar Mar 19 '20 22:03 JunmingH

Hi @JunmingH,

I don't think I ever ran DCC on such a large number of sample - therefore the merging might indeed take a lot of time.

However, you might tune some of the parameters to reduce the time:

  • -T does not apply to the merging step, that part is single threaded, but you might want to increase the number of threads to speed up the rest of the computation

  • -Nr 1 1 is a very low threshold that will yield a very high number of false positives, i.e. its enough if a circle is only seen once with one BSJ read in one sample. I would recommend something like --Nr 2,5 or even -Nr 5,10 - meaning 2 reads in at least 5 samples or 5 reads in at least 10. THis also should speed up the merging.

  • The memory error is probably related to the extremely huge spare matrix that contains mostly zeros holding circRNA counts

Let me know if the new parameters can address your issue.

Cheers, Tobias

tjakobi avatar Mar 20 '20 09:03 tjakobi

Hi Tobias,

One of my scripts is successful. It used 400 Gb memory for filter function. But it still stuck at a place.

I am not sure which part it is:

The error message is : "[E::faidx_adjust_position] The sequence "chr1" not found [E::faidx_adjust_position] The sequence "chr1" not found [E::faidx_adjust_position] The sequence "chr1" not found [E::faidx_adjust_position] The sequence "chr1" not found [E::faidx_adjust_position] The sequence "chr1" not found ........." Have you ever see this before?

JunmingH avatar Mar 20 '20 14:03 JunmingH

BTW very appreciate for your help!

JunmingH avatar Mar 20 '20 14:03 JunmingH

Dear @JunmingH,

You may want to create a new index for your genome fasta file.

Try samtools faidx genome.fasta in the folder of the genome file. Maybe the index is corrupted or incomplete.

Cheers, Tobias

tjakobi avatar Mar 21 '20 10:03 tjakobi

Do you mean -A /ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa this file?

JunmingH avatar Mar 21 '20 11:03 JunmingH

Yes, exactly.

tjakobi avatar Mar 21 '20 18:03 tjakobi

Hi Tobias,

Is this warning normal?, I got this when doing linear counting

2020-03-24 09:52:35,105 WARNING: circRNA start position ('chr18', '45196357') does not have mapped read counts, treated as 0

JunmingH avatar Mar 24 '20 14:03 JunmingH

Hi, @tjakobi , I'm stuck with the same part. My error information is as follows:

Filtering by read counts
Count CircSkip junctions
Traceback (most recent call last):
  File "/usr/bin/DCC", line 11, in <module>
    load_entry_point('DCC==0.4.7', 'console_scripts', 'DCC')()
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 472, in main
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 669, in findCircSkipJunction
  File "build/bdist.linux-x86_64/egg/DCC/Circ_nonCirc_Exon_Match.py", line 281, in findcircAdjacent
  File "build/bdist.linux-x86_64/egg/DCC/Circ_nonCirc_Exon_Match.py", line 224, in getAdjacent
ValueError: invalid literal for int() with base 10: '1_3182'
started circRNA detection from file 

I have no idea that whether if my gtf annotation, which was downloaded from NCBI, is not right, my gtf:

##gff-version 3
##source-version rtracklayer 1.46.0
##date 2020-04-13
NC_003977.2     RefSeq  region  1       3182    .       +       .       ID=NC_003977.2:1_3182;Dbxref=taxon:10407;Is_circular=true;gbkey=Src;genome=genomic;mol_type=genomic DNA;strain=ayw;
NC_003977.2     RefSeq  gene    1376    1840    .       +       .       ID=gene-HBVgp3;Dbxref=GeneID:944566;gbkey=Gene;Name=X;gene=X;gene_biotype=protein_coding;locus_tag=HBVgp3;
NC_003977.2     RefSeq  CDS     1376    1840    .       +       0       ID=cds-YP_009173867.1;Dbxref=Genbank:YP_009173867.1,GeneID:944566;gbkey=CDS;Name=YP_009173867.1;gene=X;locus_tag=HBVgp3;Parent=gene-HBVgp3;product=X protein;protein_id=YP_009173867.1;
NC_003977.2     RefSeq  regulatory_region       1592    1602    .       +       .       ID=id-NC_003977.2:1592_1602;gbkey=regulatory;Note=direct repeat sequence 2 (DR2)%3b cis-acting sequence which participates in synthesis of minus-strand DNA
NC_003977.2     RefSeq  regulatory_region       1775    1794    .       +       .       ID=id-NC_003977.2:1775_1794;gbkey=regulatory;Note=phi%3b cis-acting sequence which participates in synthesis of minus-strand DNA
NC_003977.2     RefSeq  sequence_difference     1775    1775    .       -       .       ID=id-NC_003977.2:1775_1775;gbkey=misc_difference;Note=G is A in ref[2]%3b~conflict
NC_003977.2     RefSeq  gene    1816    2454    .       +       .       ID=gene-HBVgp4;Dbxref=GeneID:944568;gbkey=Gene;Name=C;gene=C;gene_biotype=protein_coding;locus_tag=HBVgp4;

But CircCoordinates and CircCount were still generated. Two separate chunks not affect each other ?

Thanks for your time.

gnilihzeux avatar Apr 13 '20 07:04 gnilihzeux

Try ucsc annotation file. This one did not have chromosome number. Junming

JunmingH avatar Apr 13 '20 13:04 JunmingH

@JunmingH Thanks. It's a genome of Hepatitis B virus and there is no information in UCSC.

gnilihzeux avatar Apr 14 '20 00:04 gnilihzeux

Hi, sorry for the delayed response.

There might be an issue with the format of the 9th column, and DCC running into problems gathering data from it. DCC normally uses GTF files and you are providing a GFF3 file, which might result in problems. You might want to try using a GTF formatted file just to rule this possibility out. See https://www.biostars.org/p/99462/ for details on the differences.

tjakobi avatar Apr 17 '20 09:04 tjakobi

Thanks, it is indeed a problem of GTF.

gnilihzeux avatar Apr 23 '20 00:04 gnilihzeux