xpclr icon indicating copy to clipboard operation
xpclr copied to clipboard

The format of gdistkey file?

Open xiekunwhy opened this issue 2 years ago • 5 comments

Hi,

Could you explain what is --gdistkey and provide an example format? And what is the difference between --gdistkey file and --map file?

Best, Kun

xiekunwhy avatar Sep 09 '21 07:09 xiekunwhy

The map file is the same format as required by the original tool: there is an example in the fixtures folder: https://github.com/hardingnj/xpclr/blob/master/fixture/mapfile.snp

The --gdistkey is used when reading from VCF or zarr (via scikit-allele vcf2zarr). For example you may have a key "gen_dist" that is in the INFO section.

hardingnj avatar Sep 09 '21 08:09 hardingnj

Hey,

I have added the "gen_dist" in the INFO section of my VCF file, but it still said "No genetic distance provided; using rrate of 1e-08/bp". Do you know what's wrong with my VCF? my cmd is:

xpclr -Sa ../a.txt -Sb ../b.txt -C A01 -I ../bra_only_phased_gendist.vcf -O a_A01.txt --phased --gdistkey gen_dist

and my vcf looks like:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sam1 sam2 sam3
A01     2977    A01:2977        C       T       .       PASS    gen_dist=0.111     GT      0|0     0|0     0|1 

anhong11 avatar Apr 08 '22 23:04 anhong11

Hey all,

I ran into the same issue. It looks like gdistkey is set to None when the VCF is loaded, regardless of what key is passed. In other words, the load_vcf_format_data() function in util.py doesn't use the gdistkey passed at the command line, unlike the hdf5 and zarr loading.

James-S-Santangelo avatar Apr 26 '22 14:04 James-S-Santangelo

Yes, I found this too. I change util.py a little bit, so the script can load gen_dist from the column QUAL:

def load_vcf_wrapper(path, seqid, samples, samples_path):

    callset = allel.read_vcf(
        path,
        region=seqid,
        fields=['variants/POS', 'variants/QUAL','calldata/GT', 'samples'],
        tabix="tabix",
        samples=samples)

    assert "samples" in callset.keys(), "None of the samples provided in {0!r} are found in {1!r}".format(
        samples_path, path)

    p = allel.SortedIndex(callset["variants/POS"])
    g = allel.GenotypeArray(callset['calldata/GT'])
    m = allel.SortedIndex(callset["variants/QUAL"])

    return p, g, m


def load_vcf_format_data(vcf_fn, chrom, s1, s2, gdistkey=None):

    #    geno1, geno2, pos = q, q, q
    samples1 = get_sample_ids(s1)
    samples2 = get_sample_ids(s2)
    pos1, geno1, cm1 = load_vcf_wrapper(vcf_fn, chrom, samples1, s1)
    pos2, geno2, cm2 = load_vcf_wrapper(vcf_fn, chrom, samples2, s2)

    assert np.array_equal(pos1, pos2), "POS fields not the same"
    assert geno1.shape[0] == pos1.shape[0], "For samples 1, genotypes do not match positions"
    assert geno2.shape[0] == pos2.shape[0], "For samples 2, genotypes do not match positions"
    assert geno1.shape[1] == len(samples1)
    assert geno2.shape[1] == len(samples2)
    
    if gdistkey is not None:
        gdist = cm1
    else:
        gdist = None
    
    return geno1, geno2, pos1, cm1

anhong11 avatar Apr 26 '22 14:04 anhong11

Thanks all. Agree this seems like a bug- should be fixable by incorporating the code from the load hdf5 function into the VCF function.

hardingnj avatar Apr 27 '22 08:04 hardingnj