tskit icon indicating copy to clipboard operation
tskit copied to clipboard

Reinstate check for chromosome name in read_hapmap

Open hyanwong opened this issue 4 years ago • 0 comments

For simplicity, code to check for consistency of the chromosome name was removed from the new msprime.read_hapmap() code. However, it seems a good idea to check, by default, that if there is a column in the hapmap file with the chromosome name, it contains the same string. The reason I want this is that (a) it is common to munge multiple chromosomes together in the same file, and we should check that the user is not doing this by mistake (normally this would be caught because the new chromosome would start again at a low value, so break the ordering constraint, but this isn't guaranteed, and is fragile to rely upon for e.g. microchromosomes in birds) (b) it is useful to have the chromosome name, and we had thought that we might use it in future (although we are not doing so at the moment) and (c) @grahamgower suggests (and I agree) that we should try to be pretty strict about the format we are defaulting to in read_hapmap(), so that we don't get into the bind we had before of making files that will "work" in stdpopsim, but are not in any conventional format.

So that I don't lose the code, this was previously done in intervals.py: "read_hapmap" as follows

    if 0 not in column:
        column[0] = ("chrom", object)
    ... do the np.readtxt() thing
    if "chrom" in data and len(np.unique(data["chrom"])) > 1:
        warnings.warn(
            f"Different chromosome names found in the first column of {fileobj}. "
            "Check that the file contains data for a single chromosome."
        )

and the test case was

    def test_read_hapmap_string_warning(self):
        hapfile = io.StringIO(
            """\
            HEADER
            chr1 0 x 0
            chr1 1 x 0.000001 x
            chr10 2 x 0.000006 x x x"""
        )
        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            rm = msprime.read_hapmap(hapfile)
            assert len(w) == 1
            assert "chromosome names" in str(w[0].message)
        np.testing.assert_array_equal(rm.position, [0, 1, 2])
        assert np.allclose(rm.rate, [1e-8, 5e-8])

hyanwong avatar Jan 15 '21 13:01 hyanwong