TransPi
Error in process busco4_dist
Hi, I apologize for contacting you so often.
While running SOS_busco.py in process busco4_dist, I got the following error:
Command error:
Traceback (most recent call last):
File "/mnt/data/software/TransPi/bin/SOS_busco.py", line 38, in <module>
busco_df = pd.read_csv(input_busco_file, sep=',',header=0,names=['Busco_id','Status','Sequence','Score','Length'])
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 686, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 458, in _read
data = parser.read(nrows)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 1186, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python3.8/site-packages/pandas/io/parsers.py", line 2145, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 918, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 7 fields in line 51, saw 8
I think this is a problem with the input file of SOS_busco.py (in my case, Read_R_all_busco4.tsv).
Most lines of my Read_R_all_busco4.tsv have 6 commas (7 columns), like this:
0at38820,Duplicated,SOAP.k25.scaffold27258,8202.3,4167,https://www.orthodb.org/v10?query=0at38820,sacsin
However, some lines of my file have 7 or 8 commas (8 or 9 columns), like this:
121at38820,Complete,SOAP.k25.scaffold11722,3027.5,1446,https://www.orthodb.org/v10?query=121at38820,Zinc finger, RING-type
I think that this difference in the number of commas (columns) is the cause of this pandas error.
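To check which lines are affected, a minimal Python sketch like the one below could count the comma-separated fields per line. The helper name `find_irregular_rows` and the `expected=7` default are my own assumptions for illustration; they are not part of TransPi or SOS_busco.py. The sample rows are the two lines quoted above.

```python
def find_irregular_rows(lines, expected=7, sep=","):
    """Return (line_number, field_count) for rows whose field count differs from `expected`.

    Hypothetical helper for illustration only; not part of TransPi.
    """
    irregular = []
    for line_number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        if not stripped:
            continue  # skip blank lines
        field_count = len(stripped.split(sep))
        if field_count != expected:
            irregular.append((line_number, field_count))
    return irregular


# The two example rows quoted in this issue.
sample = [
    "0at38820,Duplicated,SOAP.k25.scaffold27258,8202.3,4167,"
    "https://www.orthodb.org/v10?query=0at38820,sacsin",
    "121at38820,Complete,SOAP.k25.scaffold11722,3027.5,1446,"
    "https://www.orthodb.org/v10?query=121at38820,Zinc finger, RING-type",
]

print(find_irregular_rows(sample))  # [(2, 8)]
```

The second row is reported with 8 fields because the description "Zinc finger, RING-type" contains a comma of its own.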
SOS_busco.py doesn't seem to use columns 6 onwards of the input file. If so, we can remove columns 6 onwards before running SOS_busco.py. https://github.com/PalMuc/TransPi/blob/899d16028e2d84e746c8c0dda1c6ba9ebcca050e/TransPi.nf#L1591-L1592
Here is an example of my suggested revision:
cat $transpi_tsv | grep -v "#" | tr "\\t" "," >>$all_busco
awk -F',' 'BEGIN{OFS=","} {print $1,$2,$3,$4,$5}' $all_busco > some.csv
SOS_busco.py -input_file_busco some.csv -input_file_fasta $assembly -min ${params.minPerc} -kmers ${params.k}
rm -rf some.csv
I hope this helps you. Thank you.
Hello @HarukiNakamura,
No worries. Thanks for finding issues and providing suggestions to TransPi. We appreciate it.
You are right, the last column will cause issues since the name has a comma and SOS_busco.py will fail. I think the easiest solution is what you suggested. I will do a test and modify the code. Thanks!
Best, Ramón
Pinging @n-conci
this works:
1517 cat full_table_*.tsv | grep -v "#" | tr "\t" "," | cut -d ',' -f1-5 >.busco_names.txt
1591 cat $transpi_tsv | grep -v "#" | tr "\t" "," | cut -d ',' -f1-5 >>$all_busco
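For anyone who wants to check what the `cut -d ',' -f1-5` step keeps, here is a rough Python equivalent applied to one of the problem rows from this issue. The function name `keep_first_fields` is hypothetical and only for illustration; it is not part of TransPi.

```python
def keep_first_fields(line, n=5, sep=","):
    """Keep only the first `n` separator-delimited fields of a line.

    Hypothetical helper mirroring `cut -d ',' -f1-5`; not part of TransPi.
    """
    fields = line.rstrip("\n").split(sep)
    return sep.join(fields[:n])


# Row from this issue whose description contains an extra comma.
row = (
    "121at38820,Complete,SOAP.k25.scaffold11722,3027.5,1446,"
    "https://www.orthodb.org/v10?query=121at38820,Zinc finger, RING-type"
)

print(keep_first_fields(row))
# 121at38820,Complete,SOAP.k25.scaffold11722,3027.5,1446
```

Since the commas inside the description only appear after column 5, dropping everything past the fifth field leaves every row with the five columns that SOS_busco.py names in its `read_csv` call.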