xpore icon indicating copy to clipboard operation
xpore copied to clipboard

Genome mapping for Arabidopsis in Xpore

Open Annie-GW opened this issue 3 years ago • 14 comments

Dear Ploy,

We are stuck at the genome mapping part in Xpore. I know this is a big favor to ask but would it be possible to help us to look at the script and the error message? Any advice would be greatly appreciated. Thank you so much.

xpore-dataprep \ --eventalign ../R00122/nanopolish/R00122.nanopolish.xpore.tsv
--summary ../R00122/nanopolish/R00122.nanopolish.xpore.summary.txt
--out_dir ../R00122/dataprep
--customised_genome
--reference_name Arabidopsis_thaliana.TAIR10.dna.toplevel
--annotation_name Arabidopsis_thaliana.TAIR10.49.gtf
--gtf_path_or_url ../ref/Arabidopsis_thaliana.TAIR10.49.gtf
--n_processes 32
--genome

The error message:

/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:112: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

Annie-GW avatar Apr 23 '21 00:04 Annie-GW

Hi Annie,

I think that is only a warning thrown out by pandas and not an error message, the program should still be running. Did you see an increase in the number of lines / file size from data.json?

chrishendra93 avatar Apr 23 '21 03:04 chrishendra93

Hi Chris,

Thank you but the final data.json file is actually empty. We are not sure what is going on. Do you spot any mistakes in the command?

Kind regards, Annie

Annie-GW avatar Apr 23 '21 04:04 Annie-GW

so during the daraprep xpore will index your eventalign.txt first before it writes something to data.json. Can you check the eventalign.index file? Try wc -l eventalign.index to see if it has anything in there and maybe do it a couple of times to see if the number of lines increases

chrishendra93 avatar Apr 23 '21 05:04 chrishendra93

Hi Chris, We have re-run the analysis following the command posted previously. We can obtain the eventalign.index file (51.7Mb) and this is the message that we get :

xx/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:112: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy chunk_split['line_length'] = np.array(lines) Traceback (most recent call last): File "xx/envs/xpore/bin/xpore-dataprep", line 8, in sys.exit(main()) File "xx/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py", line 653, in main reference_name=reference_name, UnboundLocalError: local variable 'reference_name' referenced before assignment

Annie-GW avatar Apr 23 '21 07:04 Annie-GW

Hi Annie, thanks for running it again

It seems that the indexing works successfully, and the error seems to indicate that you did not provide the fasta file using the option --transcript_fasta_paths_or_urls

Also in the next run, since you have generated the index file, you can supply the --skip_eventalign_indexing argument as well to skip the indexing

chrishendra93 avatar Apr 23 '21 09:04 chrishendra93

Thanks, Chris. We have re-run the analysis according to your suggestion. The resulting data.json file is only 235 kb containing 40+ genes! When we did the mapping to transcriptome, the file size for data.json is 3.4 GB. Here is the log for the run:

xpore-dataprep
--eventalign xx/R00122/nanopolish/R00122.nanopolish.xpore.tsv
--summary xx/R00122/nanopolish/R00122.nanopolish.xpore.summary.txt
--out_dir xx/R00122/dataprep
--customised_genome
--reference_name Arabidopsis_thaliana.TAIR10.dna.toplevel
--annotation_name Arabidopsis_thaliana.TAIR10.49.gtf
--gtf_path_or_url /ref/Arabidopsis_thaliana.TAIR10.49.gtf
--transcript_fasta_paths_or_urls /ref-transcript-fa/reference_transcriptome.fa
--n_processes 32
--genome

running customised genome /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:61: PerformanceWarning: indexing past lexsort depth may impact performance. pos_end += eventalign_result.loc[index]['line_length'].sum() /anaconda3/envs/xpore/lib/python3.8/site-packages/xpore/scripts/dataprep.py:112: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy chunk_split['line_length'] = np.array(lines) INFO:pyensembl.sequence_data:Loaded sequence dictionary from /ref-transcript-fa/reference_transcriptome.fa.pickle

Annie-GW avatar Apr 24 '21 08:04 Annie-GW

Hi @Annie-GW, thanks for showing the log of the mapping to genome (using gene id and genome coordinates)! To map to transcriptome (using transcript id and transcriptome coordinates) you don't need the --genome parameter and the --customised_genome and related fasta and gtf parameters:

xpore-dataprep \
--eventalign ../R00122/nanopolish/R00122.nanopolish.xpore.tsv \
--summary ../R00122/nanopolish/R00122.nanopolish.xpore.summary.txt \ 
--out_dir ../R00122/dataprep \
--n_processes 32

Thanks!

yuukiiwa avatar Apr 27 '21 01:04 yuukiiwa

Thanks, Yuk Kei. We have no problem mapping to transcriptome. It is the mapping to genome part that we are stuck.

Annie-GW avatar Apr 27 '21 02:04 Annie-GW

Hi Annie, which reference fasta and annotation gtf are you using? I'm not sure whether non-ensembl references and annotations could cause the problem. There are several things helpful to check upstream of xpore-dataprep for the mapping to genome:

  • [ ] using a cDNA fasta for minimap2 alignment
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-50/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz
  • [ ] using the alignment parameters minimap2 -ax map-ont -uf -t 3 --secondary=no indicated in the xpore manual:
  • [ ] checking whether nanopolish eventalign.txt is truncated

yuukiiwa avatar Apr 27 '21 03:04 yuukiiwa

hi yuukiiwa. thanks for your suggestion. we tried it again followed your 3 steps. we find when the index finished, the program just output an empty data.json file and exit.

log shows below.

running customised genome
INFO:pyensembl.sequence_data:Loaded sequence dictionary from Arabidopsis_thaliana.TAIR10.cdna.all.fa.pickle

command shows below

xpore-dataprep \
            --eventalign nanopolish/R001.nanopolish.xpore.tsv \
            --summary nanopolish/R001.nanopolish.xpore.summary.txt \
            --out_dir dataprep \
            --skip_eventalign_indexing \
            --n_processes 32 \
            --customised_genome \
            --reference_name Arabidopsis_thaliana.TAIR10.cdna.all.fa \
            --annotation_name Arabidopsis_thaliana.TAIR10.50.gtf \
            --gtf_path_or_url ref/Arabidopsis_thaliana.TAIR10.50.gtf \
            --transcript_fasta_paths_or_urls ref/Arabidopsis_thaliana.TAIR10.cdna.all.fa \
            --genome

looking forward to get your response. thanks

6ge6 avatar May 18 '21 07:05 6ge6

Hi @6ge6, sorry for the delayed reply! I think I know where the problem is now:

Here is a line from the Arabidopsis gtf file (transcript_id "AT1G01010.1"):

1	araport11	transcript	3631	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";

and here is a line from the Arabidopsis cDNA fasta file (transcript ID includes version):

>AT1G19090.1 cdna chromosome:TAIR10:1:6589835:6592762:1 gene:AT1G19090 gene_biotype:nontranslating_CDS transcript_biotype:nontranslating_CDS gene_symbol:RKF2 description:receptor-like serine/threonine kinase 2 [Source:TAIR;Acc:AT1G19090]

Here is a line from the Human gtf file (transcript_id "ENST00000456328"; transcript_version "2"):

1	havana	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";

and here is a line from the Human cDNA fasta file (transcript ID includes version):

>ENST00000434970.2 cdna chromosome:GRCh38:14:22439007:22439015:1 gene:ENSG00000237235.2 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRDD2 description:T-cell receptor delta diversity 2 [Source:HGNC Symbol;Acc:HGNC:12255]

As xPore was developed using human samples, to facilitate the transcriptome to genome mapping, we removed the transcript version, which does not make a difference in processing human samples. However, in your case, our removing the transcript version causes it not being to match with the gtf transcript_id entry, which contains the transcript version. We will definitely have to change this, and hopefully we can give you an update a week from now.

Thank you so much for using xPore on Arabidopsis! Or not we wouldn't have known that its incompatibility with the Arabidopsis reference/annotation.

yuukiiwa avatar May 20 '21 09:05 yuukiiwa

Hi @yuukiiwa , Great, thanks for your help. looking forward your new version. Regards,

6ge6 avatar May 21 '21 06:05 6ge6

Hi @6ge6, thank you for waiting! I have updated the xpore implementation on my fork, which is now compatible with Arabidopsis and Human Ensembl references. It is now pull requesting into the GoekeLab/xpore, but will take a while to get the changes reviewed. I have tested the new implementation on my branch, and it produces the same results on the demo test-dataset. @chrishendra93 and I think that it should be fine if you use the xpore on my fork for now as it will eventually be merged into GoekeLab/xpore. To install the xpore on my fork, first you have to clone it:

git clone https://github.com/yuukiiwa/xpore.git

Then, install it:

sudo python3 setup.py install

To run, xpore-dataprep using Arabidopsis references:

xpore-dataprep \
--eventalign nanopolish/arabidopsis_eventalign.txt \
--summary nanopolish/summary.txt \
--out_dir arabidopsis_dataprep \
--genome --gtf_path_or_url Arabidopsis_thaliana.TAIR10.50.gtf --transcript_fasta_paths_or_urls Arabidopsis_thaliana.TAIR10.cdna.all.fa

I have tried making a mock demo eventalign.txt using Arabidopsis transcript ids to test whether using Arabidopsis fasta and gtf works, and here is the dataprep directory:

ls -lh arabidopsis_dataprep 
total 2024
-rw-r--r--  1 yukkei  staff   104B May 27 11:33 data.index
-rw-r--r--  1 yukkei  staff   979K May 27 11:33 data.json
-rw-r--r--  1 yukkei  staff   109B May 27 11:33 data.log
-rw-r--r--  1 yukkei  staff    62B May 27 11:33 data.readcount
-rw-r--r--  1 yukkei  staff   5.3K May 27 11:33 eventalign.index

Thank you! Hope this helps!

yuukiiwa avatar May 27 '21 03:05 yuukiiwa

Hi, I encounter the same problem while performing version 2.1.

So, only the version of @yuukiiwa 's fork can deal with the Arabidopsis data?

Thanks!

YCCHEN23 avatar Jan 22 '22 11:01 YCCHEN23