xpore
xpore copied to clipboard
Genome mapping for Arabidopsis in Xpore
Dear Ploy,
We are stuck at the genome mapping part in Xpore. I know this is a big favor to ask but would it be possible to help us to look at the script and the error message? Any advice would be greatly appreciated. Thank you so much.
xpore-dataprep \
--eventalign ../R00122/nanopolish/R00122.nanopolish.xpore.tsv
--summary ../R00122/nanopolish/R00122.nanopolish.xpore.summary.txt
--out_dir ../R00122/dataprep
--customised_genome
--reference_name Arabidopsis_thaliana.TAIR10.dna.toplevel
--annotation_name Arabidopsis_thaliana.TAIR10.49.gtf
--gtf_path_or_url ../ref/Arabidopsis_thaliana.TAIR10.49.gtf
--n_processes 32
--genome
The error message:
/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:112: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
Hi Annie,
I think that is only a warning thrown out by pandas and not an error message, the program should still be running. Did you see an increase in the number of lines / file size from data.json?
Hi Chris,
Thank you but the final data.json file is actually empty. We are not sure what is going on. Do you spot any mistakes in the command?
Kind regards, Annie
so during the daraprep xpore will index your eventalign.txt first before it writes something to data.json. Can you check the eventalign.index file? Try wc -l eventalign.index to see if it has anything in there and maybe do it a couple of times to see if the number of lines increases
Hi Chris, We have re-run the analysis following the command posted previously. We can obtain the eventalign.index file (51.7Mb) and this is the message that we get :
xx/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:112: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
chunk_split['line_length'] = np.array(lines)
Traceback (most recent call last):
File "xx/envs/xpore/bin/xpore-dataprep", line 8, in
Hi Annie, thanks for running it again
It seems that the indexing works successfully, and the error seems to indicate that you did not provide the fasta file using the option --transcript_fasta_paths_or_urls
Also in the next run, since you have generated the index file, you can supply the --skip_eventalign_indexing argument as well to skip the indexing
Thanks, Chris. We have re-run the analysis according to your suggestion. The resulting data.json file is only 235 kb containing 40+ genes! When we did the mapping to transcriptome, the file size for data.json is 3.4 GB. Here is the log for the run:
xpore-dataprep
--eventalign xx/R00122/nanopolish/R00122.nanopolish.xpore.tsv
--summary xx/R00122/nanopolish/R00122.nanopolish.xpore.summary.txt
--out_dir xx/R00122/dataprep
--customised_genome
--reference_name Arabidopsis_thaliana.TAIR10.dna.toplevel
--annotation_name Arabidopsis_thaliana.TAIR10.49.gtf
--gtf_path_or_url /ref/Arabidopsis_thaliana.TAIR10.49.gtf
--transcript_fasta_paths_or_urls /ref-transcript-fa/reference_transcriptome.fa
--n_processes 32
--genome
running customised genome /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:112: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy chunk_split['line_length'] = np.array(lines) INFO:pyensembl.sequence_data:Loaded sequence dictionary from /ref-transcript-fa/reference_transcriptome.fa.pickle
Hi @Annie-GW, thanks for showing the log of the mapping to genome (using gene id and genome coordinates)! To map to transcriptome (using transcript id and transcriptome coordinates) you don't need the --genome
parameter and the --customised_genome
and related fasta and gtf parameters:
xpore-dataprep \
--eventalign ../R00122/nanopolish/R00122.nanopolish.xpore.tsv \
--summary ../R00122/nanopolish/R00122.nanopolish.xpore.summary.txt \
--out_dir ../R00122/dataprep \
--n_processes 32
Thanks!
Thanks, Yuk Kei. We have no problem mapping to transcriptome. It is the mapping to genome part that we are stuck.
Hi Annie, which reference fasta and annotation gtf are you using? I'm not sure whether non-ensembl references and annotations could cause the problem.
There are several things helpful to check upstream of xpore-dataprep
for the mapping to genome:
- [ ] using a cDNA fasta for
minimap2
alignment
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-50/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
- [ ] using the alignment parameters
minimap2 -ax map-ont -uf -t 3 --secondary=no
indicated in thexpore
manual: - [ ] checking whether
nanopolish eventalign.txt
is truncated
hi yuukiiwa. thanks for your suggestion. we tried it again followed your 3 steps. we find when the index finished, the program just output an empty data.json file and exit.
log shows below.
running customised genome
INFO:pyensembl.sequence_data:Loaded sequence dictionary from Arabidopsis_thaliana.TAIR10.cdna.all.fa.pickle
command shows below
xpore-dataprep \
--eventalign nanopolish/R001.nanopolish.xpore.tsv \
--summary nanopolish/R001.nanopolish.xpore.summary.txt \
--out_dir dataprep \
--skip_eventalign_indexing \
--n_processes 32 \
--customised_genome \
--reference_name Arabidopsis_thaliana.TAIR10.cdna.all.fa \
--annotation_name Arabidopsis_thaliana.TAIR10.50.gtf \
--gtf_path_or_url ref/Arabidopsis_thaliana.TAIR10.50.gtf \
--transcript_fasta_paths_or_urls ref/Arabidopsis_thaliana.TAIR10.cdna.all.fa \
--genome
looking forward to get your response. thanks
Hi @6ge6, sorry for the delayed reply! I think I know where the problem is now:
Here is a line from the Arabidopsis gtf
file (transcript_id "AT1G01010.1"
):
1 araport11 transcript 3631 5899 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
and here is a line from the Arabidopsis cDNA fasta
file (transcript ID includes version):
>AT1G19090.1 cdna chromosome:TAIR10:1:6589835:6592762:1 gene:AT1G19090 gene_biotype:nontranslating_CDS transcript_biotype:nontranslating_CDS gene_symbol:RKF2 description:receptor-like serine/threonine kinase 2 [Source:TAIR;Acc:AT1G19090]
Here is a line from the Human gtf
file (transcript_id "ENST00000456328"; transcript_version "2"
):
1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
and here is a line from the Human cDNA fasta
file (transcript ID includes version):
>ENST00000434970.2 cdna chromosome:GRCh38:14:22439007:22439015:1 gene:ENSG00000237235.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD2 description:T-cell receptor delta diversity 2 [Source:HGNC Symbol;Acc:HGNC:12255]
As xPore
was developed using human samples, to facilitate the transcriptome to genome mapping, we removed the transcript version, which does not make a difference in processing human samples. However, in your case, our removing the transcript version causes it not being to match with the gtf
transcript_id
entry, which contains the transcript version. We will definitely have to change this, and hopefully we can give you an update a week from now.
Thank you so much for using xPore
on Arabidopsis! Or not we wouldn't have known that its incompatibility with the Arabidopsis reference/annotation.
Hi @yuukiiwa , Great, thanks for your help. looking forward your new version. Regards,
Hi @6ge6, thank you for waiting!
I have updated the xpore implementation on my fork, which is now compatible with Arabidopsis and Human Ensembl references. It is now pull requesting into the GoekeLab/xpore
, but will take a while to get the changes reviewed. I have tested the new implementation on my branch, and it produces the same results on the demo test-dataset. @chrishendra93 and I think that it should be fine if you use the xpore on my fork for now as it will eventually be merged into GoekeLab/xpore
.
To install the xpore
on my fork, first you have to clone it:
git clone https://github.com/yuukiiwa/xpore.git
Then, install it:
sudo python3 setup.py install
To run, xpore-dataprep
using Arabidopsis references:
xpore-dataprep \
--eventalign nanopolish/arabidopsis_eventalign.txt \
--summary nanopolish/summary.txt \
--out_dir arabidopsis_dataprep \
--genome --gtf_path_or_url Arabidopsis_thaliana.TAIR10.50.gtf --transcript_fasta_paths_or_urls Arabidopsis_thaliana.TAIR10.cdna.all.fa
I have tried making a mock demo eventalign.txt
using Arabidopsis transcript ids to test whether using Arabidopsis fasta
and gtf
works, and here is the dataprep
directory:
ls -lh arabidopsis_dataprep
total 2024
-rw-r--r-- 1 yukkei staff 104B May 27 11:33 data.index
-rw-r--r-- 1 yukkei staff 979K May 27 11:33 data.json
-rw-r--r-- 1 yukkei staff 109B May 27 11:33 data.log
-rw-r--r-- 1 yukkei staff 62B May 27 11:33 data.readcount
-rw-r--r-- 1 yukkei staff 5.3K May 27 11:33 eventalign.index
Thank you! Hope this helps!
Hi, I encounter the same problem while performing version 2.1.
So, only the version of @yuukiiwa 's fork can deal with the Arabidopsis data?
Thanks!