Dark_and_Camouflaged_genes

Problems while running 05_CREATE_BED_FILE

Open LiShuhang-gif opened this issue 1 year ago • 4 comments

Hello, I tried to use your script to detect the camo regions, but I encountered the following error when I ran 05_CREATE_BED_FILE (extract_camo_regions.py):

Wed Jul 26 20:08:19 CST 2023 python extract_camo_regions.py
Traceback (most recent call last):
  File "extract_camo_regions.py", line 113, in <module>
    main(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5])       
  File "extract_camo_regions.py", line 94, in main
    group_pos = [regions[region_id]]
KeyError: 'DDX11L1_1::chr1:11869-12227'

I'm not sure what is causing this. Does the KeyError point to a problem with the format of one of my input files? Looking forward to your reply!

LiShuhang-gif avatar Jul 27 '23 14:07 LiShuhang-gif

Hi @LiShuhang-gif ,

A couple of questions: which reference genome(s) are you working with? Is it human data?

It is possible that the reference you are using and the format your data is in are incompatible. This could be the case if, for example, your data names regions like 1:11869-12227 while the reference genome names them like chr1:11869-12227. For the pipeline to work, the formats of certain files must match.
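One way to check for this kind of mismatch is to compare the chromosome naming style across the files involved. The following is only a minimal sketch, not part of the pipeline: the helper names (`parse_region_id`, `naming_style`, `bed_chroms`) are hypothetical, and the `name::chrom:start-end` layout is assumed purely from the key shown in the KeyError above.

```python
import re

# Assumed ID layout, inferred from the KeyError message: name::chrom:start-end
REGION_ID = re.compile(
    r"^(?P<name>.+)::(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$"
)


def parse_region_id(region_id):
    """Split an ID such as 'DDX11L1_1::chr1:11869-12227' into its parts.

    Returns (name, chrom, start, end), or None if the ID has another layout.
    """
    match = REGION_ID.match(region_id)
    if match is None:
        return None
    return match["name"], match["chrom"], int(match["start"]), int(match["end"])


def bed_chroms(lines):
    """Collect the chromosome names (first column) from BED-formatted lines."""
    chroms = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chroms.add(line.split()[0])
    return chroms


def naming_style(chroms):
    """Classify chromosome names as 'chr', 'bare', 'mixed', or 'empty'."""
    chroms = list(chroms)
    if not chroms:
        return "empty"
    prefixed = sum(name.startswith("chr") for name in chroms)
    if prefixed == len(chroms):
        return "chr"
    if prefixed == 0:
        return "bare"
    return "mixed"
```

Running `naming_style(bed_chroms(open(path)))` on each input file and comparing the results would show whether one file uses `chr1` while another uses `1`. If the styles agree, the mismatch is more likely in the region ID itself, and `parse_region_id` can help compare the failing key against the keys the script actually built.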

We also have a good number of the .bed files already created. You can see if they match your data. They are located on the nextflow-pipeline branch (https://github.com/mebbert/Dark_and_Camouflaged_genes/tree/nextflow-pipeline/camo_bed_files)

This pipeline is not under active development or support, but we will try to help as much as we can.

Thank you! Maddy

mpage21 avatar Jul 28 '23 13:07 mpage21

Hi, I'm using the hg38 human genome as the reference. I will follow your suggestion and check my current format, and I will leave a message here if there is any progress. Thanks!

LiShuhang-gif avatar Aug 28 '23 02:08 LiShuhang-gif

Hello, I have another question now. Do the dark and camouflaged regions vary greatly between populations? Can I merge the bed file I got with the bed file you provided to get a more complete set of dark regions? Or, to be more specific, can I use the bed file of dark regions obtained from one population to screen for SNPs in another population? Thanks!

LiShuhang-gif avatar Dec 21 '23 03:12 LiShuhang-gif

Hello @LiShuhang-gif,

That's a great question.

Since the camouflaged regions are mostly determined by the reference (and not the population), it shouldn't make a big difference, but we haven't systematically assessed this by population. There may be more variation in the dark-by-depth regions (e.g., if for some reason certain populations really don't have that gene/region present in their genome), but camouflaged regions occur when there are duplications in the reference genome (regardless of whether there are duplications in the individual's/population's genome).
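On the mechanical side of merging two bed files, taking the union of the intervals is straightforward. Below is a minimal sketch, assuming plain BED input with at least three columns, half-open coordinates, and the same chromosome naming scheme in both files. The helper names `read_bed` and `merge_intervals` are hypothetical and not part of the pipeline. In practice, sorting the concatenated files and running `bedtools merge` does the same job.

```python
from collections import defaultdict


def read_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples, skipping headers."""
    intervals = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        if fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        intervals.append((fields[0], int(fields[1]), int(fields[2])))
    return intervals


def merge_intervals(*interval_sets):
    """Return the union of all interval sets, merged per chromosome."""
    by_chrom = defaultdict(list)
    for intervals in interval_sets:
        for chrom, start, end in intervals:
            by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):  # plain string sort of chromosome names
        spans = sorted(by_chrom[chrom])
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or directly adjacent: extend the current span.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged
```

For example, `merge_intervals(read_bed(open("mine.bed")), read_bed(open("provided.bed")))` (both file names are placeholders) yields one non-overlapping interval list covering both inputs. Whether the merged set is biologically meaningful for a given population is a separate question, as discussed above.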

This is why we say this method is really a band-aid solution for genomics. What we really need is to construct each individual's genome structure rather than imposing a single genome's (or even a pangenome's) structure on the individual(s).

I hope that helps. Conceptually, the idea is simple, but it gets a bit complicated as you get into the weeds.

mebbert avatar Dec 21 '23 16:12 mebbert