find_circ

What's the maximum number of scaffolds (".fa" files) for the find_circ.py "-G" option?

Open tw7649116 opened this issue 7 years ago • 7 comments

Hello, Marvin Jens,

I'm using your software for plant circRNA identification. However, the "-G (--genome)" option fails with Error 24 "too many open files", because my draft reference genome has tens of thousands of scaffolds, which results in tens of thousands of files for this option. I also found that providing a single genome FASTA file does not work either. It looks like the option only accepts separate per-chromosome FASTA files.

Is there any way to solve this? Do you have any suggestions?

Thank you.

tw7649116 avatar Aug 05 '17 02:08 tw7649116

Hi,

for a large number of scaffolds a single fasta file is absolutely preferred. This should definitely work! Could you please elaborate on how this also "doesn't work"? What is the exact command you run and what are the errors you run into? Also, how is the fasta file you provide formatted? Could you give me the first few lines as an example?

Best, -Marvin

View it on GitHub: https://github.com/marvin-jens/find_circ/issues/12

marvin-jens avatar Aug 07 '17 01:08 marvin-jens

Dear Marvin, Thanks for your kind reply. If I use the single reference file with "-G /path/ref.fa", the error is "WARNING:root:Could not access '/path/ref.fa/Sc0000540.fa'. Switching to dummy mode (only Ns)". It looks like the command treated "ref.fa" as a directory and so could not access the files. I am stuck at this step: I don't know the right way to provide a single reference file instead of multiple FASTA files in the command.

Once I split the reference into many ".fa" files, the command works if I provide only ~3000 FASTA files. However, with more than ~3400 FASTA files, find_circ.py fails with "mmap.error: [Errno 24] Too many open files" and bowtie2 fails with "(ERR): bowtie2-align died with signal 13 (PIPE)".

The reference is in the normal FASTA format, like this:

Sc0000000
CTGAAATCGCGAGGTCCGCTCGAGCGTGGGAATAGACGCTCGAGTGGGGACTCTACATTG
Sc0000002
CGCTCGAACACTCGATGATCATCCATGTTTTGTGTTCGTCTTAAGTTTCCGGAGGCCTCA

The whole command follows the examples in the manual:

bowtie2 -p 20 --reorder --mm -M 20 --score-min=C,-15,0 -q -x path/ref -U unmapped.fq 2> test.log | python find_circ.py -G ~/path/ref.fa (or path to the scaffold fasta files) -p test -s test.sites.log > test.sites.bed 2> test.sites.reads

Thanks very much!

tw7649116 avatar Aug 07 '17 05:08 tw7649116
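
Errno 24 in the post above is the operating system's "too many open files" error, which a process gets when it exceeds its limit on open file descriptors. The following is a minimal sketch for comparing that limit with the number of scaffold files in a directory. The helper names `count_fasta_files` and `check_fd_limit` are hypothetical and not part of find_circ, and the sketch assumes one ".fa" file per scaffold, all in a single directory.

```shell
# Count the ".fa" files directly inside a directory (hypothetical helper).
count_fasta_files() {
    find "$1" -maxdepth 1 -type f -name '*.fa' | wc -l | tr -d ' '
}

# Compare the number of ".fa" files with the current soft limit on
# open file descriptors, as reported by "ulimit -n".
check_fd_limit() {
    n_files=$(count_fasta_files "$1")
    limit=$(ulimit -n)
    echo "fasta files: $n_files, open-file limit: $limit"
    if [ "$limit" != "unlimited" ] && [ "$n_files" -ge "$limit" ]; then
        echo "limit is likely too low for one open file per scaffold"
        return 1
    fi
    return 0
}
```

Running `ulimit -n <number>` raises the soft limit for the current shell, up to the hard limit shown by `ulimit -Hn`. Whether that is sufficient for find_circ.py with tens of thousands of files has not been verified here.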

Dear Marvin, I am having the exact same issue. To quote the post above: "If I use the single reference file with "-G /path/ref.fa", the error is "WARNING:root:Could not access '/path/ref.fa/Sc0000540.fa'. Switching to dummy mode (only Ns)". It looks like the command treated "ref.fa" as a directory and so could not access the files." Even when I download individual chromosome FASTA files, it still treats them as directories. How can I provide a single FASTA file here?

marvel479 avatar Mar 24 '20 11:03 marvel479

Hi Marvel,

Could it also be the missing '>' symbol in front of the fasta headers? Please see the earlier messages in this thread for a simple way to add them with sed.

Best, Marvin

marvin-jens avatar Mar 24 '20 12:03 marvin-jens
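
The sed command referred to in the reply above is not reproduced in this thread. The following is only a sketch of the idea, under the assumption that each header is a bare scaffold ID of the form `Sc<digits>` alone on its own line, as in the example earlier in the thread. The function name `add_fasta_headers` is hypothetical.

```shell
# Prefix bare scaffold-ID lines (e.g. "Sc0000000") with ">".
# Assumes each ID sits alone on its own line and matches Sc<digits>;
# lines that already start with ">" and sequence lines are unchanged.
add_fasta_headers() {
    sed 's/^\(Sc[0-9][0-9]*\)$/>\1/' "$1"
}
```

A possible use is `add_fasta_headers ref.fa > ref.fixed.fa`, followed by `grep -c '^>' ref.fixed.fa` to count the header lines. The file names here are placeholders.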

That worked. Thanks Marvin.

marvel479 avatar Mar 25 '20 03:03 marvel479

Dear Marvin, @marvin-jens

Thank you so much for providing such a great tool.

Recently I tried to use it to analyze some of my own circ sequencing data, but I encountered the same problem as above:

WARNING:root:Could not access 'home/yj2/lab_project/00_reference/01_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa'. Switching to dummy mode (only Ns)

The code I use is as follows:

bowtie2 -p 16 --score-min=C,-15,0 --reorder --mm \
  -q -U $sample \
  -x ~/lab_project/00_reference/01_human/01_index/02_bowtie2/GRCh38 | \
  ~/softwave/find_circ-master/find_circ.py \
  --genome=~/lab_project/00_reference/01_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  --prefix="$base". \
  --name="$base" \
  --stats=./"$base".stats.txt \
  --reads=./"$base".spliced_reads.fa \
  > "$base".splice_site.bed

The format of the reference genome is as follows:

>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

I also tried splitting the reference genome into one file per chromosome, naming each file in the "chr*.fa" format. The format of each FASTA file is as follows:

>1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

When the entire folder is used as the input for '-G', the same error occurs.

Is there any way to solve this? Do you have any suggestions?

Thank you.

jingyu9603 avatar Dec 01 '23 02:12 jingyu9603
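
The warning quoted earlier in this thread ('/path/ref.fa/Sc0000540.fa') suggests that, when "-G" points to a directory, find_circ.py looks for a file named after the sequence ID plus ".fa". This is an inference from that message, not something confirmed against the find_circ source. If it holds, a file named "chr1.fa" would not be found for a sequence called "1". The sketch below splits a multi-FASTA file into one file per sequence, each named after the first word of its header. The function name `split_fasta` is hypothetical.

```shell
# Split a multi-FASTA file into <outdir>/<id>.fa, one file per sequence.
# $1: input FASTA, $2: output directory. Assumes sequence IDs are unique
# and contain no "/" characters.
split_fasta() {
    mkdir -p "$2" || return 1
    awk -v outdir="$2" '
        /^>/ {
            if (out != "") close(out)   # keep only one output file open
            name = substr($1, 2)        # first word of the header, minus ">"
            out = outdir "/" name ".fa"
            header = ">" name
            print header > out
            next
        }
        out != "" { print > out }       # sequence lines go to the current file
    ' "$1"
}
```

For many scaffolds, the maintainer's first reply in this thread prefers a single FASTA file over a folder of files, so this split is mainly relevant when a folder is required.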

Dear jingyu,

It looks like the '~' may somehow not be resolved properly, as 'home/yj2/lab_project/00_reference/01_human/Homo_sapiens.GRCh38.dna.primary_assembly.fa' should start with a slash: '/home/...'. Can you try calling find_circ.py with the full, absolute path (--genome=/home/...)? Let me know if this helps. A folder with contigs is also possible and should work very similarly. The number of contigs is practically unlimited.

Best,   Marvin

marvin-jens avatar Dec 04 '23 19:12 marvin-jens
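
The reply above recommends an absolute path. As background, bash expands a tilde only at the start of a word (and in variable assignments), not after the "=" in an option such as --genome=~/..., so relying on "~" inside an option value is fragile. The following sketch builds and checks an absolute path explicitly. The helpers `expand_tilde` and `is_absolute` are hypothetical, and this is not a confirmed diagnosis of the warning reported in this thread.

```shell
# Replace a leading "~/" with "$HOME/" (hypothetical helper);
# any other path is printed unchanged.
expand_tilde() {
    case "$1" in
        '~/'*)
            rest=${1#\~/}
            printf '%s\n' "$HOME/$rest"
            ;;
        *)
            printf '%s\n' "$1"
            ;;
    esac
}

# Succeed only if the path is absolute, i.e. starts with "/".
is_absolute() {
    case "$1" in
        /*) return 0 ;;
        *)  return 1 ;;
    esac
}
```

Writing the option value with "$HOME" instead of "~", as in --genome="$HOME/...", has the same effect, because parameter expansion does happen after the "=".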