
Finalizing: tail: +: invalid number of bytes

Open xinkwu opened this issue 5 years ago • 27 comments

Hi,

I used 3d-dna to assemble a genome and got some strange output:

1. The *FINAL.fasta file contains a lot of empty contigs (nothing but Ns), such as:

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
HiC_scaffold_7
HiC_scaffold_8
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

2. Some warnings appeared during the sealing step.

3. In the finalizing step, there were a lot of warnings like "tail: +: invalid number of bytes". I think this may be the cause of the first problem.

Thanks.

...
############### Starting sealing:
... -s flag was triggered, will attempt to place back only singleton debris contigs/scaffolds and those less than 15000
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:) -p flag was triggered. Running with GNU Parallel support parameter set to true.
:) -q flag was triggered, starting calculations for 1 threshold mapping quality
:) -i flag was triggered, building mapq without
...Remapping contact data from the original contig set to assembly
:( Assembly file does not match cprops file. Exiting!
:( Assembly file does not match cprops file. Exiting!
...Building track files
:( Assembly file does not match cprops file. Exiting!
...Building the hic file
temp.DpsePacBio.rawchrom.asm_mnd.txt does not exist or does not contain any reads.
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
############### Finilizing output:
... -s flag was triggered, treating all contigs/scaffolds shorter than 15000 as unattempted
... -l flag was triggered. Output will appear with headers of the form DpsePacBio_hic_scaffold_#
Analyzing the merged assembly
...trimming N overhangs
...adding gaps
tail: +: invalid number of bytes
tail: +: invalid number of bytes
tail: +: invalid number of bytes
...

xinkwu avatar Sep 20 '18 13:09 xinkwu
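
To put a number on the "empty contigs" symptom described above, a small check along these lines may help. It is only a sketch: the function name report_blank_records and the demo record names are made up for illustration and are not part of 3d-dna. It lists every FASTA record that has no sequence at all or consists solely of N characters.

```shell
# Report FASTA records that are empty or contain only N/n characters.
# Output columns: record name, total length, reason.
report_blank_records() {
    awk '
        function report() {
            if (name == "") return
            if (len == 0)        print name, len, "empty"
            else if (non_n == 0) print name, len, "all-N"
        }
        /^>/ { report(); name = substr($1, 2); len = 0; non_n = 0; next }
        {
            seq = $0
            len += length(seq)
            gsub(/[Nn]/, "", seq)    # whatever is left is real sequence
            non_n += length(seq)
        }
        END { report() }
    ' "$1"
}

# Tiny demonstration on a throwaway file (record names are invented).
demo=$(mktemp)
printf '>scaf_1\nACGTNNNN\n>scaf_2\nNNNNNNNN\nNNNN\n>scaf_3\n>scaf_4\nacgt\n' > "$demo"
blank_report=$(report_blank_records "$demo")
echo "$blank_report"
rm -f "$demo"
```

Running report_blank_records on the *FINAL.fasta file and counting the lines gives a quick before/after measure when trying the fixes discussed further down the thread.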

The command I ran:

bash ${dir}/run-asm-pipeline.sh ${juicer}/references/DpsePacBio.fasta ${juicer}/aligned/merged_nodups.txt

xinkwu avatar Sep 20 '18 13:09 xinkwu

Hi,

Can you please check your bash version: it needs to be >=4.

Best, Olga


dudcha avatar Sep 20 '18 14:09 dudcha
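
The bash requirement mentioned above (version 4 or later) can be checked with a short sketch like the one below. The helper name bash_major_at_least is invented for this example; BASH_VERSION is the variable bash itself sets, and it is empty when the script is run by another shell, which the sketch treats as "too old".

```shell
# Succeed if a bash version string has a major version >= the wanted one.
# $1: version string such as "4.2.46(1)-release"; $2: minimum major version.
bash_major_at_least() {
    major=${1%%.*}            # keep everything before the first dot
    [ "$major" -ge "$2" ]
}

# BASH_VERSION is set by bash; fall back to 0.0 under any other shell.
current=${BASH_VERSION:-0.0}
if bash_major_at_least "$current" 4; then
    echo "bash $current: meets the >= 4 requirement"
else
    echo "bash $current: too old (or not bash); version 4 or later is needed" >&2
fi
```

Note that the shell that matters is the one that actually executes the pipeline scripts, which may differ from the interactive login shell.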

Hi,

Actually never mind: I see another error upstream that is more important:

:( Assembly file does not match cprops file. Exiting!

I will need to investigate this. Do you think you could share a few files?

Thanks, Olga


dudcha avatar Sep 20 '18 14:09 dudcha

Thank you for your reply. My bash version is "GNU bash, version 4.2.46(1)-release (x86_64-redhat-linux-gnu)". Which files do you need me to provide?

Is it a problem caused by my draft genome?

xinkwu avatar Sep 20 '18 14:09 xinkwu

Hi,

I am going to try to roll out an updated version which includes some changes to seal I have been meaning to put out for a while: let us see if the problem persists. I will write when it is available.

Thanks, Olga

dudcha avatar Sep 21 '18 09:09 dudcha

Hi,

Try the 180922 branch: it has some bug fixes in seal that might resolve this issue. You can test if the problem gets resolved by running the -s seal on your split output.

Let me know if the issue persists, Olga

dudcha avatar Sep 23 '18 00:09 dudcha

Hi Olga,

I tried your new branch. This time there were no warnings like ":( Assembly file does not match cprops file. Exiting!". But the log file still shows some errors: it seems that "finalize" was not triggered, and the FINAL.fasta file still contains many empty contigs. I ran 3d-dna on another genome a few days ago and it went well. Could you tell me what characteristics of a genome would cause this problem?

Thank you, Kai

Here are the outputs:

########################## version: 180922
-s|--stage flag was triggered, fast-forwarding to "seal" pipeline section.
############### Starting sealing:
... -i flag was triggered, will attempt to place back only debris contigs/scaffolds and those less than 15000
:| Warning: no explicit bundle size was listed. Will use the same one as listed for false positive size threshold: this is the most typical scenario.
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:) -p flag was triggered. Running with GNU Parallel support parameter set to true.
:) -q flag was triggered, starting calculations for 1 threshold mapping quality
:) -i flag was triggered, building mapq without
:) -c flag was triggered, will remove temporary files after completion
...Remapping contact data from the original contig set to assembly
...Building track files
...Building the hic file
Not including fragment map
Start preprocess
Writing header
Writing body
..
Writing footer

Finished preprocess
HiC file version: 8

Calculating norms for zoom BP_2500000
Calculating norms for zoom BP_1000000
Calculating norms for zoom BP_500000
Calculating norms for zoom BP_250000
Calculating norms for zoom BP_100000
Calculating norms for zoom BP_50000
Calculating norms for zoom BP_25000
Calculating norms for zoom BP_10000
Calculating norms for zoom BP_5000
Calculating norms for zoom BP_1000
Writing expected
Writing norms
Finished writing norms
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
############### Finilizing output:
... -s flag was triggered, treating all contigs/scaffolds shorter than 15000 as unattempted.
... -l flag was triggered. Output will appear with headers of the form DpsePacBio_hic_scaffold_#.
3d-dna_0922/3d-dna/finalize/finalize-output.sh: illegal option -- g

./finalize-output.sh -c <number_of_chromosomes> -s <tiny_threshold> -g <gap_size> -l


xinkwu avatar Sep 24 '18 06:09 xinkwu
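
The "illegal option -- g" line in the log above is the kind of complaint the shell's getopts builtin prints when a script is called with a flag that is missing from its option string. The following sketch reproduces that behaviour with a made-up parser (parse_opts is hypothetical and is not the code of finalize-output.sh); whether this is what happened in that particular commit is an assumption, not something the thread confirms.

```shell
# Hypothetical option parser that knows -c, -s and -l, but not -g.
parse_opts() {
    OPTIND=1
    while getopts "c:s:l" opt; do
        case $opt in
            c) echo "chromosomes=$OPTARG" ;;
            s) echo "tiny_threshold=$OPTARG" ;;
            l) echo "labels=on" ;;
            *) return 1 ;;    # getopts has already printed its complaint
        esac
    done
}

# Call it the way a newer caller might, with a -g flag it does not know.
getopts_status=0
getopts_msg=$(parse_opts -s 15000 -g 500 2>&1) || getopts_status=$?
echo "$getopts_msg"
echo "exit status: $getopts_status"
```

If a mismatch like this is suspected, making sure that every script in the checkout comes from the same commit is a reasonable first step.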

Hey Kai,

There were a few more small commits concerning this from today. Please pull and run -s finalize.

Thanks, Olga

dudcha avatar Sep 24 '18 06:09 dudcha

Hi Olga,

I tried your new version and the result looked good. But the N50 of the output contigs was not as high as I expected. It finally generated 700+ contigs, three times as many as before. In addition, 3d-dna produced a lot of short fragments. I think the default parameters may be too sensitive for my genome (200M) and detect too many misjoins. Do you have any suggestions for this problem?

Thanks, Kai

xinkwu avatar Sep 26 '18 02:09 xinkwu

Hey Kai,

Since this is not a bug, if you can, please post to aidenlab.org/forum.html where we do our best to offer general user support. We have a few threads to this effect there that might be of help.

Briefly, you want to take a look at the .0.wig and .0.bed tracks in JBAT to see what the problem might be and whether the suspected misjoin annotations make sense: the Cookbook lists a few typical scenarios and what the tracks should look like.

Hope this helps, Olga


dudcha avatar Sep 26 '18 03:09 dudcha

Hi Olga, I get the same problems and errors when I run run-asm-pipeline-post-review.sh to update an assembly modified in Juicebox. The FINAL.fasta file contains only Ns, and final.fasta contains the error message (inside the .fasta file itself!) "The length of fasta does not match that suggested by the cprops file. Exiting!"

The .hic and .assembly files were produced. I can load the .hic file in Juicebox, but not its associated .assembly file.

In the log file I get the same error as above:

Analyzing the merged assembly
...trimming N overhangs
...adding gaps
tail: +: invalid number of bytes
tail: +: invalid number of bytes
tail: +: invalid number of bytes

I'm working with the latest version of 3d-dna. Any insights?

atigano avatar Feb 15 '19 01:02 atigano

Hi atigano,

I am not sure which of the issues in this thread you are referring to. One common (as it turns out) reason for the "not matching" error is carriage return characters added during review on Windows. (Later JBAT code safeguards against this, but no new build is currently available.) Try:

cat assembly_file | sed 's/\r$//' > new_assembly_file

You can check whether this is worthwhile with something like:

diff assembly_file <(cat assembly_file | sed 's/\r$//')

Let me know if this changes things for you. If not, I will ask more questions to investigate further.

Best, Olga

dudcha avatar Feb 16 '19 19:02 dudcha
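
The carriage-return check suggested above can be wrapped into two small helpers, sketched here. The function names count_cr_lines and strip_cr are invented for this example. One deliberate difference from the sed command in the post: the sketch passes a literal carriage return to grep and sed instead of the \r escape, because not every sed implementation understands \r inside a pattern (GNU sed does).

```shell
# Count the lines of a file that end in a carriage return.
count_cr_lines() {
    cr=$(printf '\r')
    grep -c "${cr}\$" "$1" || true    # grep exits 1 on zero matches; still prints 0
}

# Write the file to stdout with any trailing carriage return removed per line.
strip_cr() {
    cr=$(printf '\r')
    sed "s/${cr}\$//" "$1"
}

# Demonstration on a made-up three-line assembly file with Windows line endings.
asm=$(mktemp)
fixed=$(mktemp)
printf '>ctg1 1 1000\r\n>ctg2 2 500\r\n1 -2\r\n' > "$asm"
before=$(count_cr_lines "$asm")
strip_cr "$asm" > "$fixed"
after=$(count_cr_lines "$fixed")
echo "CR-terminated lines before: $before, after: $after"
rm -f "$asm" "$fixed"
```

A non-zero count on the reviewed .assembly file suggests the file was saved with Windows line endings and is worth cleaning before rerunning the post-review script.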

Hi Olga,

I am facing the same problem. However, in my case the final.fasta is made properly, whereas the FINAL.fasta terminates prematurely after 1 chromosome and spits out the 'tail:...' error. Could you please advise what can be done to debug this? I am using the latest version of Juicebox as well as 3d-dna.

Thanks in advance! Regards Rahul

rahulvrane avatar Mar 02 '19 04:03 rahulvrane

Not sure if this helps - but managed to track it down to the following line in finalize-output.sh

bash ${pipeline}/finalize/remove-N-overhangs-from-asm.sh ${cprops} ${asm} ${fasta}

${pipeline}/edit/edit-fasta-according-to-new-cprops.awk

The no_overhangs.fasta generated here was severely truncated. If I skip this part, it works (obviously with loads of Ns, but I can fix that with another round of long-read polishing). However, I did observe that this caused a downstream problem in

bash ${pipeline}/finalize/construct-fasta-from-asm.sh temp.cprops temp.asm temp.fasta | awk -f ${pipeline}/utils/wrap-fasta-sequence.awk - > ${label}.FINAL.fasta

The tail error was emerging in wrap-fasta-sequence.awk; I avoided it by using (samtools faidx FASTA CTG# | seqtk seq -r) or (samtools faidx FASTA CTG#).

If there is anything else I can do to help with this, please do let me know.

rahulvrane avatar Mar 04 '19 00:03 rahulvrane

Hi Rahul,

I do not know if it relates to you, but some issues with finalize output may be related to SIGPIPE signal handler, see discussion here: http://aidenlab.org/forum.html?place=msg%2F3d-genomics%2FcqAmJHbXCzE%2FeaXK63syBwAJ

I cannot understand from your comments what is going on and where you originally see a problem, so I cannot comment. I am happy to help but we'd need to start from the beginning, with full stdout and stderr, check of line separators in .review.assembly etc.

Thanks! Olga

dudcha avatar Mar 04 '19 01:03 dudcha
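
For readers wondering what the SIGPIPE concern looks like in practice: when tail feeds a reader that exits at the next FASTA header, tail may still be writing and can be terminated by SIGPIPE. The sketch below contrasts that piped form with a variant that lets tail finish into a temporary file first. The function names record_piped and record_via_tmp are invented here; the awk program is the one quoted elsewhere in this thread. It is a sketch of the idea only, and at least one poster below reports that this change did not remove the tail error for them.

```shell
# Print the sequence starting at byte offset $2 of FASTA $1, stopping at the
# next header. awk exits early, so on large files tail may receive SIGPIPE.
record_piped() {
    tail -c "+$2" "$1" | awk '$0~/>/{exit}1'
}

# Same result, but tail finishes writing to a temporary file before awk starts.
record_via_tmp() {
    tmp=$(mktemp)
    tail -c "+$2" "$1" > "$tmp" && awk '$0~/>/{exit}1' "$tmp"
    rm -f "$tmp"
}

# Demonstration file: byte 4 is the first base of record "a"
# (">a" plus the newline occupy bytes 1 to 3).
fa=$(mktemp)
printf '>a\nACGT\nTTGA\n>b\nGGCC\n' > "$fa"
piped=$(record_piped "$fa" 4)
via_tmp=$(record_via_tmp "$fa" 4)
echo "$piped"
rm -f "$fa"
```

Both functions print the same two sequence lines on this small file; the difference only matters when the writer is still producing output at the moment the reader quits.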

@dudcha Hi Olga:

I hit the same error in the finalizing step:

tail: invalid number of bytes: ‘+’

I tried the dd insertion into lines 62 and 64 of 3d-dna/finalize/construct-fasta-from-asm.sh (https://github.com/theaidenlab/3d-dna/issues/29), and I also tried changing the code to

tail -c +${index[${contig}]} ${fasta} | awk '$0~/>/{exit}1' | awk -f ${pipeline}/utils/reverse-fasta.awk -

tail -c +${index[${contig}]} ${fasta} > tmp.fasta && awk '$0~/>/{exit}1' tmp.fasta | awk -f ${pipeline}/utils/reverse-fasta.awk

(http://aidenlab.org/forum.html?place=msg%2F3d-genomics%2FcqAmJHbXCzE%2FeaXK63syBwAJ). Neither attempt solved the tail error. I checked the output: the hic file looks fine in Juicebox, but FINAL.fasta is smaller than the input genome.fasta (if I understand correctly, after scaffolding and gap filling FINAL.fasta should be larger than the input). Is this decrease in size related to the tail error, and how should I fix the tail error?

The exact versions of the software I am using:

3ddna, 180922
GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
sort (GNU coreutils) 8.31
GNU Awk 5.0.1, API: 2.0
GNU parallel 20190822

Thx!

zijiangyang avatar Sep 20 '19 09:09 zijiangyang
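
The commands quoted above hint at where the message comes from: tail is called as tail -c +${index[${contig}]}, so if the lookup yields nothing, tail receives a bare "+" and rejects it as an invalid number of bytes. The sketch below illustrates this under the assumption that index is a bash associative array keyed by contig name (which would also explain the bash 4 requirement); the array contents and the helper names offset_arg and checked_offset are invented for the example.

```shell
#!/usr/bin/env bash
# Assumption: byte offsets are stored per contig name in an associative array.
declare -A index=( [ctg1]=7 )    # associative arrays need bash >= 4

# Echo the argument that would be handed to `tail -c` for a given contig.
offset_arg() {
    echo "+${index[$1]:-}"
}

good=$(offset_arg ctg1)            # name present in the index
bad=$(offset_arg ctg1_renamed)     # name absent: only the "+" survives

echo "known contig   -> tail -c $good"
echo "unknown contig -> tail -c $bad"

# A guard that fails loudly instead of passing a bare "+" on to tail.
checked_offset() {
    [[ -n ${index[$1]:-} ]] || { echo "no offset recorded for '$1'" >&2; return 1; }
    echo "+${index[$1]}"
}
```

Seen this way, the error means that a contig name in the assembly was not found among the sequence names of the FASTA being read, which fits the explanations given in this thread (mismatched input files, or stray carriage returns glued onto names).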

Hi zijiangyang,

Are you running after review? Note that an earlier version of JBAT on Win machine added ^M carriage return which could cause this (https://github.com/theaidenlab/3d-dna/issues/23#issuecomment-464373556). Given your comments it seems you've explored the possibility of this being related to SIGPIPE.

Olga

dudcha avatar Sep 23 '19 23:09 dudcha

Hi Olga,

I am also facing similar issue and getting the below error

tail: +: invalid number of bytes

This error gets printed only once and the program exits.

Should I replace lines 62 and 64 of 3d-dna/finalize/construct-fasta-from-asm.sh with the lines below?

tail -c +${index[${contig}]} ${fasta} | awk '$0~/>/{exit}1' | awk -f ${pipeline}/utils/reverse-fasta.awk -

tail -c +${index[${contig}]} ${fasta} > tmp.fasta && awk '$0~/>/{exit}1' tmp.fasta | awk -f ${pipeline}/utils/reverse-fasta.awk

And then run the finalize stage again? Will that solve the problem?

Regards, Karthic

Karthickrishnan avatar Oct 13 '19 07:10 Karthickrishnan

Same issue. Is this still not resolved?

ptranvan avatar Jan 07 '21 12:01 ptranvan

Hello Patrick,

Pretty much all of these issues boil down to people not using the right fasta or the right assembly when running post-review. Please check your input.

Olga

dudcha avatar Jan 07 '21 20:01 dudcha

Hi all,

I was having exactly the same issue when finalizing the pipeline. The hic file was ok, but error "tail: +: invalid number of bytes" and truncated fasta file. I was using latest commits, bash version was fine... I was about to check on the "tail" command on the script when I realised I was using the wrong input fasta file. I was using the final.fasta coming out of 3D-DNA. Instead, the correct fasta to use is the same input fasta that is given as input for Juicer and 3D. With that, no errors and final.fasta written correctly.

Hope this helps, Alessia

matryoskina avatar Jan 13 '21 21:01 matryoskina
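
The mix-up described above (passing a 3D-DNA output FASTA instead of the original draft) can be caught before running post-review with a consistency check along these lines. This is a sketch under stated assumptions: it assumes .assembly header lines have the form ">name index length" and that edited contigs carry ":::fragment_N" and ":::debris" suffixes, as the labels in the logs of this thread suggest. The function name check_fasta_vs_assembly and the demo data are invented.

```shell
# Compare per-contig lengths in a draft FASTA ($1) with those implied by a
# .assembly file ($2). Prints one line per problem; prints nothing if all agree.
check_fasta_vs_assembly() {
    awk '
        { sub(/\r$/, "") }                   # tolerate Windows line endings
        # First file: the FASTA. Accumulate sequence length per record.
        FNR == NR {
            if ($0 ~ /^>/) { name = substr($1, 2); fa_len[name] += 0 }
            else           { fa_len[name] += length($0) }
            next
        }
        # Second file: header lines of the assumed form ">name index length".
        /^>/ {
            base = substr($1, 2)
            sub(/:::.*/, "", base)           # drop :::fragment_N / :::debris
            asm_len[base] += $3
        }
        END {
            for (n in asm_len) {
                if (!(n in fa_len)) {
                    print n, "missing from FASTA"
                } else if (fa_len[n] != asm_len[n]) {
                    print n, "FASTA=" fa_len[n], "assembly=" asm_len[n]
                }
            }
        }
    ' "$1" "$2"
}

# Demonstration with invented data: a draft, a scaffolded output, an assembly.
draft=$(mktemp); wrong=$(mktemp); asmfile=$(mktemp)
printf '>ctg1\nACGTACGT\n>ctg2\nACGT\n' > "$draft"
printf '>HiC_scaffold_1\nACGTACGTNNNNACGT\n' > "$wrong"
printf '>ctg1:::fragment_1 1 5\n>ctg1:::fragment_2:::debris 2 3\n>ctg2 3 4\n1 -3\n2\n' > "$asmfile"
ok_report=$(check_fasta_vs_assembly "$draft" "$asmfile")
bad_report=$(check_fasta_vs_assembly "$wrong" "$asmfile")
echo "$bad_report"
rm -f "$draft" "$wrong" "$asmfile"
```

With the original draft the report is empty; with the scaffolded FASTA every contig named in the assembly is reported as missing, which is the situation that would leave the pipeline without a valid offset to give to tail.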

Hi, I ran into the same problem. Here is the log:

<<<<< Running 3d-dna review ...
/public1/home/chenwy/app/3d-dna/3d-dna/run-asm-pipeline-post-review.sh -r /public1/home/chenwy/app/juicer/work/New/new_file.assembly /public1/home/chenwy/app/juicer/references/test.asm.fasta.fasta /public1/home/chenwy/app/juicer/work/test1/aligned/merged_nodups.txt
-r|--review flag was triggered, treating file /public1/home/chenwy/app/juicer/work/New/new_file.assembly as a JB4A review file for draft fasta in arguments.
############### Finilizing output:
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
:) -p flag was triggered. Running with GNU Parallel support parameter set to true.
:) -q flag was triggered, starting calculations for 1 threshold mapping quality
:) -i flag was triggered, building mapq without
:) -c flag was triggered, will remove temporary files after completion
...Remapping contact data from the original contig set to assembly
...Building track files
...Building the hic file
Not including fragment map
Start preprocess
Writing header
Writing body
..
Writing footer

Finished preprocess
HiC file version: 8

Calculating norms for zoom BP_2500000
Calculating norms for zoom BP_1000000
Calculating norms for zoom BP_500000
Calculating norms for zoom BP_250000
Calculating norms for zoom BP_100000
Calculating norms for zoom BP_50000
Calculating norms for zoom BP_25000
Calculating norms for zoom BP_10000
Calculating norms for zoom BP_5000
Calculating norms for zoom BP_1000
Writing expected
Writing norms
Finished writing norms
:| Warning: No input for label1 was provided. Default for label1 is ":::fragment_"
:| Warning: No input for label2 was provided. Default for label2 is ":::debris"
... -s flag was triggered, treating all contigs/scaffolds shorter than 15000 as unattempted.
... -l flag was triggered. Output will appear with headers of the form test.asm.fasta_hic_scaffold_#.
... -g flag was triggered, making gap size between scaffolded draft sequences to be equal to 500.
Analyzing the merged assembly
...trimming N overhangs
...adding gaps
tail: +: invalid number of bytes

What can I do about this? Thanks.

phil622 avatar Jul 31 '21 09:07 phil622

I am getting the same 'tail: +: invalid number of bytes' error, and the fasta file is several MB smaller than the original fasta. I am using bash 4.4.23(1)-release to run run-asm-pipeline-post-review.sh. Do you have any suggestions for debugging this?

aclum avatar Feb 28 '22 23:02 aclum


See my comment on top, maybe that helps :)

matryoskina avatar Mar 04 '22 10:03 matryoskina

Hi Rahul,

I am facing the same problem you described above (final.fasta is made properly, whereas FINAL.fasta terminates prematurely after 1 chromosome with the 'tail:...' error). Have you solved this problem? I would appreciate any advice you could provide.

Regards, wy

wy1150685961 avatar Aug 16 '22 04:08 wy1150685961

Usually this happens when you pass the wrong edit file after JBAT. Are you running this after JBAT? -Olga

dudcha avatar Aug 19 '22 20:08 dudcha

Hi all,

I was having exactly the same issue when finalizing the pipeline. The hic file was ok, but error "tail: +: invalid number of bytes" and truncated fasta file. I was using latest commits, bash version was fine... I was about to check on the "tail" command on the script when I realised I was using the wrong input fasta file. I was using the final.fasta coming out of 3D-DNA. Instead, the correct fasta to use is the same input fasta that is given as input for Juicer and 3D. With that, no errors and final.fasta written correctly.

Hope this helps, Alessia

I think this should be made more obvious in the documentation! Thanks!

hazmup avatar Jun 07 '24 17:06 hazmup