create_cisTarget_databases icon indicating copy to clipboard operation
create_cisTarget_databases copied to clipboard

pig cisTarget database issues!

Open Samaneh14 opened this issue 3 years ago • 15 comments

Hi,

Regarding the @tropfenameimer comment on the build of Gallus gallus cisTarget database issue #4, I tried to take similar steps for building pig cisTarget database.

First, I got the regulatory fasta file through Ensembl/BioMart Ensembl_biomart Then, I made the TF motifs (JASPAR2022_CORE_vertebrates_non-redundant_pfms_jaspar.txt) in clusterbuster format and run the create_cistarget_motif_databases.py by the inputs. However, the job stopped with Traceback (most recent call last): File "~/stg_000??/SCENIC/create_cisTarget_databases/create_cistarget_motif_databases.py", line 504, in <module> main() File "~/stg_000??/SCENIC/create_cisTarget_databases/create_cistarget_motif_databases.py", line 289, in main region_or_gene_ids = RegionOrGeneIDs.get_region_or_gene_ids_from_fasta( File "~/stg_000??/SCENIC/create_cisTarget_databases/cistarget_db.py", line 150, in get_region_or_gene_ids_from_fasta region_id = line[1:].split(maxsplit=1)[0] IndexError: list index out of range

I don't know the reason for the upper errors!!! Would you please help me to fix the issue?

and also the last error ValueError: Error: region ID "SEPTIN3|ENSSSCG00000000040" is not unique in FASTA file

I prepared the fasta file by both gene name and gene ensemble ID, because some genes are not annotated in pig and just have the ID. So, just to unify the gene names as the input for SCENIC in where Gene ID is considered when the gene name is absent, I used both gene name and gene ID. However, by this most of the headers have both of them. How can I remove part of the headers to have just gene name in case of both gene name and gene ID.

Thanks for your time and help.

Best regards Samaneh

Samaneh14 avatar Jan 10 '22 13:01 Samaneh14

You need to add -g, --genes argument: -g '\|ENSSSCG[0-9]+$', but the original FASTA IDs still need to be unique, so probably you need to include the transcript ID too to get a unique FASTA ID. In that case, something like this might work (if pig Ensembl transcript IDs start with ENSSSCT):

-g '\|ENSSSCG[0-9]+\|ENSSSCT[0-9.]+$'

ghuls avatar Jan 10 '22 14:01 ghuls

Thanks @ghuls for your very fast reply. I will try it soon.

Samaneh14 avatar Jan 10 '22 14:01 Samaneh14

Hi, Another question, My rnaseq data are from S.scrofa11.1.97 while through Ensembl/BioMart just dataset105 is available for fasta file. I wonder if liftover can be applicable in this case! Thanks in advance!

Samaneh14 avatar Jan 10 '22 15:01 Samaneh14

You don't need liftOver as long as your expression matrix you generated for your RNAseq data has the same gene names as the ones you are generating now in the current FASTA file. Else you will need to convert gene symbols from the S.scrofa11.1.97 annotation to the Ensembl v105 version.

ghuls avatar Jan 10 '22 15:01 ghuls

Hi, Not sure to discuss this issue here, but just as following this already open issue, please let me ask how to solve the challenge to translate the motif-to-TF annotations for pig, I usedJASPAR2022_CORE_vertebrates_non-redundant_pfms_jaspar.txtand motifs-v9-nr.hgnc-m0.001-o0.0.tbl, but running SCENIC with vsn stopped within the last step (single_sample_scenic:SCENIC_APPEND:scenic:CISTARGET__MOTIF), because "None of ['features'] are in the columns". Thanks in advance. Samaneh

Samaneh14 avatar Jan 18 '22 17:01 Samaneh14

You need to convert the HGNC gene symbols in motifs-v9-nr.hgnc-m0.001-o0.0.tbl to the pig gene symbols that are in your database.

ghuls avatar Jan 19 '22 09:01 ghuls

They are partially the same (i.e, most of the human genes with the same symbol are found in pig). Should they be exactly replaced?

Samaneh14 avatar Jan 19 '22 10:01 Samaneh14

Are the gene IDs in your database exactly the same as in human (also same casing)?

import cistarget_db

ctx_rankings_genes_vs_motifs_feather_file = 'your_file.*.feather'

gene_ids = cistarget_db.CisTargetDatabase.get_all_region_or_gene_ids_and_motif_or_track_ids_from_db(ctx_rankings_genes_vs_motifs_feather_file)[1]

print(gene_ids)

gene_ids.ids

ghuls avatar Jan 19 '22 11:01 ghuls

Thanks for the hint. Some pig genes don't have orthologs in humans. Seems, I should back to the pig fasta file and just include the orthologs with human and redo the downstream steps.

Samaneh14 avatar Jan 19 '22 12:01 Samaneh14

That is probably not the best idea, unless you also want to change your input expression matrix every time.

ghuls avatar Jan 19 '22 12:01 ghuls

You need to convert the HGNC gene symbols in motifs-v9-nr.hgnc-m0.001-o0.0.tbl to the pig gene symbols that are in your database.

Thank you o much for the suggestion. I also face the problem. Do you know how to convert the HGNC gene symbols in motifs-v9-nr.hgnc-m0.001-o0.0.tbl to the other species' gene symbols. Sorry for the stupid question. I tried several times but did not find a good solution. I will very very appreciate you if you can help me. Thanks in advance lee

frucelee avatar Mar 06 '22 07:03 frucelee

You can do that with Ensembl Biomart.

Example for human to mouse mapping:

# Map Ensembl gene IDs to HGNC or MGI symbols.

    # Filename: ensg_to_hgnc.ensembl${ensembl_version}.tsv
    #   * Dataset:
    #       - Ensembl Genes 105
    #       - Human genes (GRCh38.p13)
    #   * Attributes:
    #       - Gene stable ID
    #       - HGNC symbol
    #   * Unique results only
    #   * Dataset 68005 / 68005 Genes
    #   * URL: http://www.ensembl.org/biomart/martview/92afc86ff2bede11ba9cdae04563ea92?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.hgnc_symbol&FILTERS=&VISIBLEPANEL=resultspanel
    ENSG_TO_HGNC_FILENAME = os.path.join(GENE_CONVERSION_DIR, 'ensg_to_hgnc.ensembl105.tsv')


    # Filename: ensmusg2mgi_ensembl${ensembl_version}.tsv
    #   * Dataset:
    #       - Ensembl Genes 105
    #       - Mouse genes (GRCm39)
    #   * Attributes:
    #       - Gene stable ID
    #       - MGI symbol
    #   * Unique results only
    #   * Dataset 55414 / 55414 Genes
    #   * URL: http://www.ensembl.org/biomart/martview/92afc86ff2bede11ba9cdae04563ea92?VIRTUALSCHEMANAME=default&ATTRIBUTES=mmusculus_gene_ensembl.default.feature_page.ensembl_gene_id|mmusculus_gene_ensembl.default.feature_page.mgi_symbol&FILTERS=&VISIBLEPANEL=resultspanel
    ENSMUSG_TO_MGI_FILENAME = os.path.join(GENE_CONVERSION_DIR, 'ensmusg_to_mgi.ensembl105.tsv')

# Get Human Ensemble Gene ID to Mouse Ensemble Gene ID mapping and take results with confidence = 1.
    # Orthology Tables between species in Ensembl gene IDs:
    #   - Ensembl BioMart homepage:  http://www.ensembl.org/biomart/martview/
    #   - Version:                   105
    #   - Downloaded on:             2022/01/25
    #
    # Orthology Table files can be downloaded manually:
    #
    #   Filename: ensg_to_ensmusg.ensembl${ensembl_version}.tsv
    #     * Dataset:
    #         - Ensembl Genes 105
    #         - Human genes (GRCh38.p13)
    #     * Filters:
    #         - MULTI SPECIES COMPARISONS:
    #             * Homologue filters: Orthologous Mouse Genes
    #             * Only
    #     * Attributes:
    #         - Homologues
    #         - GENE: Ensembl:
    #             * Gene stable ID
    #         - ORTHOLOGUES: Mouse Orthologues:
    #             * Mouse gene stable ID
    #             * Mouse homology type
    #             * Last common ancestor with Mouse
    #             * %id. target Mouse gene identical to query gene
    #             * %id. query gene identical to target Mouse gene
    #             * Mouse orthology confidence [0 low, 1 high]
    #     * Unique results only
    #     * Dataset 19621 / 68005 Genes
    #     * URL: http://www.ensembl.org/biomart/martview/92afc86ff2bede11ba9cdae04563ea92?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.homologs.ensembl_gene_id|hsapiens_gene_ensembl.default.homologs.mmusculus_homolog_ensembl_gene|hsapiens_gene_ensembl.default.homologs.mmusculus_homolog_orthology_type|hsapiens_gene_ensembl.default.homologs.mmusculus_homolog_subtype|hsapiens_gene_ensembl.default.homologs.mmusculus_homolog_perc_id|hsapiens_gene_ensembl.default.homologs.mmusculus_homolog_perc_id_r1&FILTERS=hsapiens_gene_ensembl.default.filters.with_mmusculus_homolog.only&VISIBLEPANEL=attributepanel

ghuls avatar Mar 07 '22 12:03 ghuls

You need to convert the HGNC gene symbols in motifs-v9-nr.hgnc-m0.001-o0.0.tbl to the pig gene symbols that are in your database.

Thank you o much for the suggestion. I also face the problem. Do you know how to convert the HGNC gene symbols in motifs-v9-nr.hgnc-m0.001-o0.0.tbl to the other species' gene symbols. Sorry for the stupid question. I tried several times but did not find a good solution. I will very very appreciate you if you can help me. Thanks in advance lee

Hello, did you succeed with your endeavor? I am not sure how ENSEMBL biomart can help us since the JASPAR DB contains different species and no ENSEMBL names. I am trying to set up the DB for scenic in Zebrafish. Did you find another solution? Would be highly grateful for an update. Best, Jo

JoGraesslin avatar Nov 22 '22 16:11 JoGraesslin

@JoGraesslin Hi, were you able to set up the database for zebrafish? I am trying to run scenic+ with a zebrafish dataset and I would appreciate the help a lot. Thanks in advance.

mtrebelo avatar May 22 '23 11:05 mtrebelo

Hi, did you completed pig cisTargetdb?

WhenFlyWang avatar Jun 13 '24 13:06 WhenFlyWang