bigsnpr icon indicating copy to clipboard operation
bigsnpr copied to clipboard

fix rsid-based `snp_asGeneticPos()`

Open timo-cpr opened this issue 3 years ago • 13 comments

spline() will complain about non-monotonous inputs otherwise. Tested on the provided list of hm3 variants map.rds (https://ndownloader.figshare.com/files/25503788)

timo-cpr avatar May 20 '21 13:05 timo-cpr

Hi Florian

Thanks for your ongoing work on the package, we're using ldpred2 quite a bit (manuscripts forthcoming).

I was glad to see the added functionality for estimating genetic distances based on rsIDs, as I'd been working on hg38 builds for a while and wondering about the hg19-based genetic maps.

However, when trying to evaluate and implement the rsID-based mapping, I ran into some issues. I'll try and demonstrate with a hopefully reproducible example based on the list of hapmap3 variants you provide on figshare. I hope you agree that this is a suitable example dataset as a) it is already available online and b) it's your recommended set of variants that any analysis should filter on anyway.

library(dplyr)
library(ggplot2)
library(bigsnpr)

# Feel free to change these two. This is the fully reproducilbe version.
test_variants <- readRDS(url("https://ndownloader.figshare.com/files/25503788"))
gen_map_dir <- "./genetic_map_testing"

if (! dir.exists(gen_map_dir)) {
  dir.create(gen_map_dir)
}

dist_rsid <- snp_asGeneticPos(test_variants$chr,
                              test_variants$pos_hg38,
                              rsid = test_variants$rsid,
                              dir = gen_map_dir)

# Error in { : task 1 failed - "'y' must be increasing or decreasing"

A look around the function code for snp_asGeneticPos() revealed that this error message is likely to come from the stats::spline() function used to extrapolate genetic distances for variants whose rsID could not be matched.

There is a bit more code below, but the gist of it is that adding ties = "ordered" to the stats::spline() call solves the issue. You're welcome to read on, but I'm afraid there is no big eureka moment, I basically stumbled upon the solution and think it works as a fix.

Code continued: As the error message is a bit vague, I was curious whether the error could be narrowed down to just one chromosome (snp_asGeneticPos() splits the chromosomes internally anyway):

for (current_chromosome in 1:22) {
  cat("Current bigsnpr function - Chromosome", current_chromosome, "\n")
  
  sub_df <- filter(test_variants, chr == current_chromosome)
  
  try(expr = 
        {
          sub_df$dist_rsid <- snp_asGeneticPos(sub_df$chr,
                                               sub_df$pos_hg38,
                                               rsid = sub_df$rsid,
                                               dir = gen_map_dir)
        }
  )
}

# Current bigsnpr function - Chromosome 1 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function - Chromosome 2 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function - Chromosome 3 
# Current bigsnpr function - Chromosome 4 
# Current bigsnpr function - Chromosome 5 
# Current bigsnpr function - Chromosome 6 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function - Chromosome 7 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function - Chromosome 8 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function - Chromosome 9 
# Current bigsnpr function - Chromosome 10 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function - Chromosome 11 
# Current bigsnpr function - Chromosome 12 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function - Chromosome 13 
# Current bigsnpr function - Chromosome 14 
# Current bigsnpr function - Chromosome 15 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function - Chromosome 16 
# Current bigsnpr function - Chromosome 17 
# Current bigsnpr function - Chromosome 18 
# Current bigsnpr function - Chromosome 19 
# Current bigsnpr function - Chromosome 20 
# Current bigsnpr function - Chromosome 21 
# Current bigsnpr function - Chromosome 22 

The error does not occur every single time, but only for chromosomes 1, 2, 6, 7, 8, 10, 12, and 15. Since the error is caused by input vectors for stats::spline() not being monotonous, my first attempt at fixing things was to sort the genomic positions before running snp_asGeneticPos():

for (current_chromosome in 1:22) {
  cat("Current bigsnpr function with sorted input - Chromosome", current_chromosome, "\n")
  
  sub_df <- filter(test_variants, chr == current_chromosome) %>% arrange(pos_hg38)
  
  try(expr = 
        {
          sub_df$dist_rsid <- snp_asGeneticPos(sub_df$chr,
                                               sub_df$pos_hg38,
                                               rsid = sub_df$rsid,
                                               dir = gen_map_dir)
        }
  )
}

# Current bigsnpr function with sorted input - Chromosome 1 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function with sorted input - Chromosome 2 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function with sorted input - Chromosome 3 
# Current bigsnpr function with sorted input - Chromosome 4 
# Current bigsnpr function with sorted input - Chromosome 5 
# Current bigsnpr function with sorted input - Chromosome 6 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function with sorted input - Chromosome 7 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function with sorted input - Chromosome 8 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function with sorted input - Chromosome 9 
# Current bigsnpr function with sorted input - Chromosome 10 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function with sorted input - Chromosome 11 
# Current bigsnpr function with sorted input - Chromosome 12 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function with sorted input - Chromosome 13 
# Current bigsnpr function with sorted input - Chromosome 14 
# Current bigsnpr function with sorted input - Chromosome 15 
# Error in { : task 1 failed - "'y' must be increasing or decreasing"
# Current bigsnpr function with sorted input - Chromosome 16 
# Current bigsnpr function with sorted input - Chromosome 17 
# Current bigsnpr function with sorted input - Chromosome 18 
# Current bigsnpr function with sorted input - Chromosome 19 
# Current bigsnpr function with sorted input - Chromosome 20 
# Current bigsnpr function with sorted input - Chromosome 21 
# Current bigsnpr function with sorted input - Chromosome 22

No changes here. I figured I'd have to dig a bit deeper, so I tried to directly run the internal code from snp_asGeneticPos() and see what's going on:

# Rename variables to match names inside snp_asGeneticPos() function
pos <- test_variants$pos_hg38
rsid <- test_variants$rsid

for (chr in 1:22) {
  cat("Code snippet from bigsnpr function - Chromosome", chr, "\n")
  
  try(expr = 
        {
          ind.chr <- which(test_variants$chr == chr)
          
          # Files need to be downloaded! The earlier `snp_asGeneticPos()` call with `dir = gen_map_dir` 
          # should have taken care of the download
          map.chr <- bigreadr::fread2(paste0(gen_map_dir, "/chr", chr, ".OMNI.interpolated_genetic_map"), showProgress = FALSE)
          
          # by rsid, lines 296-305
          ind <- match(rsid[ind.chr], map.chr$V1)
          new_pos <- map.chr$V3[ind]
          
          indNA <- which(is.na(ind))
          if (length(indNA) > 0) {
            pos.chr <- pos[ind.chr]
            new_pos[indNA] <- suppressWarnings(
              stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "hyman")$y)
          }        
        }
  )
}

# Code snippet from bigsnpr function - Chromosome 1 
# Error in stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "hyman") : 
#   'y' must be increasing or decreasing
# Code snippet from bigsnpr function - Chromosome 2 
# Error in stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "hyman") : 
#   'y' must be increasing or decreasing
# Code snippet from bigsnpr function - Chromosome 3 
# Code snippet from bigsnpr function - Chromosome 4 
# Code snippet from bigsnpr function - Chromosome 5 
# Code snippet from bigsnpr function - Chromosome 6 
# Error in stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "hyman") : 
#   'y' must be increasing or decreasing
# Code snippet from bigsnpr function - Chromosome 7 
# Error in stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "hyman") : 
#   'y' must be increasing or decreasing
# Code snippet from bigsnpr function - Chromosome 8 
# Error in stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "hyman") : 
#   'y' must be increasing or decreasing
# Code snippet from bigsnpr function - Chromosome 9 
# Code snippet from bigsnpr function - Chromosome 10 
# Error in stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "hyman") : 
#   'y' must be increasing or decreasing
# Code snippet from bigsnpr function - Chromosome 11 
# Code snippet from bigsnpr function - Chromosome 12 
# Error in stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "hyman") : 
#   'y' must be increasing or decreasing
# Code snippet from bigsnpr function - Chromosome 13 
# Code snippet from bigsnpr function - Chromosome 14 
# Code snippet from bigsnpr function - Chromosome 15 
# Error in stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "hyman") : 
#   'y' must be increasing or decreasing
# Code snippet from bigsnpr function - Chromosome 16 
# Code snippet from bigsnpr function - Chromosome 17 
# Code snippet from bigsnpr function - Chromosome 18 
# Code snippet from bigsnpr function - Chromosome 19 
# Code snippet from bigsnpr function - Chromosome 20 
# Code snippet from bigsnpr function - Chromosome 21 
# Code snippet from bigsnpr function - Chromosome 22

This confirmed that the error comes from stats::spline(). Looking at the function code for stats::spline(), I was able to see how monotony is checked:

stats::spline

# function (x, y = NULL, n = 3 * length(x), method = "fmm", xmin = min(x), 
#     xmax = max(x), xout, ties = mean) 
# {
#     method <- pmatch(method, c("periodic", "natural", "fmm", 
#         "hyman"))
#     if (is.na(method)) 
#         stop("invalid interpolation method")
#     x <- regularize.values(x, y, ties, missing(ties))
#     y <- x$y
#     x <- x$x
#     nx <- length(x)
#     if (is.na(nx)) 
#         stop(gettextf("invalid value of %s", "length(x)"), domain = NA)
#     if (nx == 0) 
#         stop("zero non-NA points")
#     if (method == 1L && y[1L] != y[nx]) {
#         warning("spline: first and last y values differ - using y[1] for both")
#         y[nx] <- y[1L]
#     }
#     if (method == 4L) {
#         dy <- diff(y)
#         if (!(all(dy >= 0) || all(dy <= 0))) 
#             stop("'y' must be increasing or decreasing")
#     }
#     if (missing(xout)) 
#         xout <- seq.int(xmin, xmax, length.out = n)
#     else n <- length(xout)
#     if (n <= 0L) 
#         stop("'spline' requires n >= 1")
#     xout <- as.double(xout)
#     z <- .Call(C_SplineCoef, min(3L, method), x, y)
#     if (method == 4L) 
#         z <- spl_coef_conv(hyman_filter(z))
#     list(x = xout, y = .Call(C_SplineEval, xout, z))
# }
# <bytecode: 0x7fc5a7b29fd8>
# <environment: namespace:stats>

The relevant snippet is

if (method == 4L) {
    dy <- diff(y)
    if (!(all(dy >= 0) || all(dy <= 0))) 
        stop("'y' must be increasing or decreasing")
}

However, the values for x and y provided to stats::spline() are parsed through the function call x <- regularize.values(x, y, ties, missing(ties)). I found it really difficult to find documentation for this function, but either way, running the whole thing with ties = "ordered" added fixes the errors:

tracked_results <- tibble(chr = as.integer(),
                          pos = as.integer(),
                          rsid = as.character(),
                          na_rsid = as.logical(),
                          dist_rsid = as.numeric())

for (chr in 1:22) {
  cat("Fixed snippet from bigsnpr function - Chromosome", chr, "\n")
  
  try(expr = 
        {
          ind.chr <- which(test_variants$chr == chr)
          
          map.chr <- bigreadr::fread2(paste0(gen_map_dir, "/chr", chr, ".OMNI.interpolated_genetic_map"), showProgress = FALSE)
          
          # by rsid, lines 296-305
          ind <- match(rsid[ind.chr], map.chr$V1)
          new_pos <- map.chr$V3[ind]
          
          indNA <- which(is.na(ind))
          if (length(indNA) > 0) {
            pos.chr <- pos[ind.chr]
            new_pos[indNA] <- suppressWarnings(
              stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "hyman", ties = "ordered")$y)
          }
          
          # Log results and missing rsids
          tracked_results <- bind_rows(tracked_results, tibble(chr = chr,
                                                               pos = pos[ind.chr],
                                                               rsid = rsid[ind.chr],
                                                               na_rsid = is.na(ind),
                                                               dist_rsid = new_pos))
        }
  )
}

# Fixed snippet from bigsnpr function - Chromosome 1 
# Fixed snippet from bigsnpr function - Chromosome 2 
# Fixed snippet from bigsnpr function - Chromosome 3 
# Fixed snippet from bigsnpr function - Chromosome 4 
# Fixed snippet from bigsnpr function - Chromosome 5 
# Fixed snippet from bigsnpr function - Chromosome 6 
# Fixed snippet from bigsnpr function - Chromosome 7 
# Fixed snippet from bigsnpr function - Chromosome 8 
# Fixed snippet from bigsnpr function - Chromosome 9 
# Fixed snippet from bigsnpr function - Chromosome 10 
# Fixed snippet from bigsnpr function - Chromosome 11 
# Fixed snippet from bigsnpr function - Chromosome 12 
# Fixed snippet from bigsnpr function - Chromosome 13 
# Fixed snippet from bigsnpr function - Chromosome 14 
# Fixed snippet from bigsnpr function - Chromosome 15 
# Fixed snippet from bigsnpr function - Chromosome 16 
# Fixed snippet from bigsnpr function - Chromosome 17 
# Fixed snippet from bigsnpr function - Chromosome 18 
# Fixed snippet from bigsnpr function - Chromosome 19 
# Fixed snippet from bigsnpr function - Chromosome 20 
# Fixed snippet from bigsnpr function - Chromosome 21 
# Fixed snippet from bigsnpr function - Chromosome 22 

For a quick sanity check, I plot hg38 positions versus the estimated genetic distances:

tracked_results %>% 
  ggplot(aes(x = pos, y = dist_rsid)) +
  geom_point() +
  facet_wrap(facets = vars(chr), scales = "free", nrow = 5) +
  labs(x = "hg38 position", y = "distance based on rsIDs")

Rplot

There is something weird going on in chromosome 7, this is due to the liftover to hg38 returning nucleotide positions for an alternative scaffold (I see the same in my own liftover results of your previous hm3 list). While this looks off when plotting against hg38 positions, the genetic distances here are correct I'd say. We can have a look at these outliers:

test_variants %>% 
  mutate(liftover_diff = abs(pos_hg38 - pos)) %>% 
  arrange(desc(liftover_diff))

## A tibble: 1,054,330 x 11
#     chr       pos a0    a1    rsid      af_UKBB    ld  pos_hg17  pos_hg18 pos_hg38 liftover_diff
#   <int>     <int> <chr> <chr> <chr>       <dbl> <dbl>     <int>     <int>    <int>         <int>
# 1     7 142275743 C     T     rs949425    0.102 16.7  141531051 141724336   339802     141935941
# 2     7 142274854 G     A     rs2854536   0.297 16.0  141531940 141725225   340691     141934163
# 3     7 142274092 A     G     rs2854540   0.768 17.9  141532702 141725987   341453     141932639
# 4     7 142274025 T     C     rs2854541   0.739 18.0  141532769 141726054   341520     141932505
# 5     7 142273430 C     T     rs361487    0.132  6.33 141533702 141726987   342115     141931315
# 6     7 142273128 T     C     rs361488    0.736 18.1  141534004 141727289   342417     141930711
# 7     7 142272619 A     C     rs2855882   0.227 15.5  141534513 141727798   342926     141929693
# 8     7 142261439 G     A     rs2734060   0.169 20.8  141545697 141738982   354112     141907327
# 9     7 142261012 A     G     rs2734062   0.831 20.8  141546124 141739409   354539     141906473
#10     7 142260736 G     A     rs2734063   0.904 19.0  141546400 141739685   354815     141905921
## … with 1,054,320 more rows

When checking the first rsID listed here, we can see that the hg38 position in dbsnp is more in line with the positions in all the other builds: https://www.ncbi.nlm.nih.gov/snp/rs949425

timo-cpr avatar May 20 '21 13:05 timo-cpr

Thanks for this investigation. If I understand correctly, the problem arises because the genetic positions are not monotonic with the hg38 positions.

Then the question is: do we really want to use the hg38 positions at all? Another solution would be to simply do linear interpolation. But there is a bit of an issue with the edges.

privefl avatar May 20 '21 14:05 privefl

Hi again! Yes you are correct, the underlying reason must be non-monotony of the genetic positions and the hg38 positions.

I have looked at this issue a bit more and decided to implement the suggested fix (adding ties = "ordered" to the spline() call) in my own pipeline, however it turns out that the suggested fix does not actually fix the problem when I tested it on my real data (I'm not sure why that is, to be honest, as this is after hm3 filtering).

As for alternatives to the spline interpolation - what do you think about using a k-NN solution for the variants that need it? Find the closest variant (in hg38 position space) that was successfully matched via rsID and use the genetic distance of that closest neighbor. Unless I'm missing something, it seems like estimating the genetic distances without rsID works exactly like this already (albeit directly on the hg19 positions in the genetic map files).

Also, maybe worth mentioning that the number of variants any fix would apply to is rather small; the majority of variants can be succesfully matched through their rsID (chr 6 has the most non-matching variants):

library(tidyverse)
library(bigsnpr)

# Reproducible version, can point to local copies instead
test_variants <- readRDS(url("https://ndownloader.figshare.com/files/25503788"))
gen_map_dir <- "./genetic_map_testing"

all_gen_pos <- tibble(V1 = as.character(),
                      V2 = as.numeric(),
                      V3 = as.numeric())

# Loading all genetic map files, takes 2-3 minutes with just one core
for (chr in 1:22) {
  all_gen_pos <- bind_rows(all_gen_pos,
                           bigreadr::fread2(paste0(gen_map_dir, "/chr", chr, ".OMNI.interpolated_genetic_map")))
}

(check_rsid_matches <- test_variants %>% left_join(all_gen_pos, by = c("rsid" = "V1")) %>% 
  rename(gen_dist = V3) %>% 
  select(-V2))

# # A tibble: 1,054,330 x 11
#      chr    pos a0    a1    rsid       af_UKBB    ld pos_hg17 pos_hg18 pos_hg38 gen_dist
#    <int>  <int> <chr> <chr> <chr>        <dbl> <dbl>    <int>    <int>    <int>    <dbl>
#  1     1 752721 A     G     rs3131972    0.841  3.69   792584   742584   817341   0.0157
#  2     1 754182 A     G     rs3131969    0.870  3.73   794045   744045   818802   0.0167
#  3     1 760912 C     T     rs1048488    0.840  3.69   800775   750775   825532   0.0212
#  4     1 768448 G     A     rs12562034   0.106  1.40   808311   758311   833068   0.0263
#  5     1 779322 A     G     rs4040617    0.128  3.68   819185   769185   843942   0.0336
#  6     1 838555 C     A     rs4970383    0.245  3.57   878418   828418   903175   0.160 
#  7     1 846808 C     T     rs4475691    0.198  5.20   886671   836671   911428   0.188 
#  8     1 853954 C     A     rs1806509    0.608  4.45   893817   843817   918574   0.194 
#  9     1 854250 A     G     rs7537756    0.210  5.70   894113   844113   918870   0.195 
# 10     1 864938 G     A     rs2340587    0.762  5.10   904801   854801   929558  NA     
# # … with 1,054,320 more rows

check_rsid_matches %>% 
  group_by(chr) %>% 
  summarise(matched = sum(!is.na(gen_dist)),
            no_match = sum(is.na(gen_dist))) %>% 
  ungroup() %>% 
  mutate(match_ratio = matched / (matched + no_match)) %>% 
  print(n = 22)

#   # A tibble: 22 x 4
#      chr matched no_match match_ratio
#    <int>   <int>    <int>     <dbl>
#  1     1   86924      221     0.997
#  2     2   87281      240     0.997
#  3     3   73257      157     0.998
#  4     4   65215      275     0.996
#  5     5   66174      156     0.998
#  6     6   66194     4852     0.932
#  7     7   57888      148     0.997
#  8     8   56962      122     0.998
#  9     9   48355      120     0.998
# 10    10   56412      137     0.998
# 11    11   53825      140     0.997
# 12    12   51375      117     0.998
# 13    13   40019      111     0.997
# 14    14   35023       70     0.998
# 15    15   31678       90     0.997
# 16    16   32321      110     0.997
# 17    17   28442      329     0.989
# 18    18   31410       87     0.997
# 19    19   19748       55     0.997
# 20    20   27688       40     0.999
# 21    21   15066       47     0.997
# 22    22   15411       38     0.998

timo-cpr avatar May 26 '21 10:05 timo-cpr

The thing with kNN is that it is okay because the data is so dense in the genetic map, which might not be the case when considering only the variants matched by rsid. Give me a few days to think about it (I'll get back to you by Friday max).

In the meantime, could you try your code with linear interpolation instead? (function approx() I think).

Thanks.

privefl avatar May 26 '21 11:05 privefl

Yes that's true, I was about to look at the location of the variants that cannot be matched by rsID. If they are all in the same region (suspect that might be the case for chr 6) then kNN is even more problematic.

Will report some findings!

timo-cpr avatar May 26 '21 11:05 timo-cpr

What about using method = "fmm"? It does not need monotonic positions.

privefl avatar May 28 '21 08:05 privefl

My apologies, I've been preoccupied with other tasks. I can test spline(method = "fmm") this afternoon if you like, but obviously it should fix the non-monotony problem.

So the only thing I would actually test then is whether or not the interpolated results look reasonable.

timo-cpr avatar May 28 '21 09:05 timo-cpr

Yes, please tell me if results with this look reasonable.

privefl avatar May 28 '21 10:05 privefl

Hi again! I have now been able to test linear interpolation as well as the "fmm" spline interpolation. Both of them fix the issue with non-monotony that is triggered by using the "hyman" spline interpolation.

Quick code snippet for running the "fmm" interpolation adn inspecting the results:

library(tidyverse)
library(bigsnpr)

test_variants <- readRDS(url("https://ndownloader.figshare.com/files/25503788"))
gen_map_dir <- "../ldpred2_resources/genetic_maps/" # change to point to your local copies

# Rename variables to match names inside snp_asGeneticPos() function
pos <- test_variants$pos_hg38
rsid <- test_variants$rsid

fmm_res <- tibble(chr = as.integer(),
                  pos = as.integer(),
                  rsid = as.character(),
                  na_rsid = as.logical(),
                  dist_rsid = as.numeric())

for (chr in 1:22) {
  cat("Testing 'fmm' interpolation - Chromosome", chr, "\n")
  
  try(expr = 
        {
          ind.chr <- which(test_variants$chr == chr)
          
          map.chr <- bigreadr::fread2(paste0(gen_map_dir, "/chr", chr, ".OMNI.interpolated_genetic_map"), showProgress = FALSE)
          
          # by rsid, lines 296-305
          ind <- match(rsid[ind.chr], map.chr$V1)
          new_pos <- map.chr$V3[ind]
          
          indNA <- which(is.na(ind))
          if (length(indNA) > 0) {
            pos.chr <- pos[ind.chr]
            new_pos[indNA] <- suppressWarnings(
              stats::spline(pos.chr, new_pos, xout = pos.chr[indNA], method = "fmm")$y)
          }
          
          # Log results and missing rsids
          fmm_res <- bind_rows(fmm_res, tibble(chr = chr,
                                               pos = pos[ind.chr],
                                               rsid = rsid[ind.chr],
                                               na_rsid = is.na(ind),
                                               dist_rsid = new_pos))
        }
  )
}

fmm_res %>% 
  ggplot(aes(x = pos, y = dist_rsid)) +
  geom_segment(data = filter(fmm_res, na_rsid), aes(xend = pos, yend = 0), alpha = 0.1, colour = "red") +
  geom_line() +
  facet_wrap(facets = vars(chr), scales = "free", nrow = 5) +
  labs(x = "hg38 position", y = "genetic distance", title = "Estimating genetic distances via rsIDs & 'fmm' interpolation", subtitle = "red areas represent interpolated variants")

genmap_interpol

I don't know how critically you'd like to inspect the estimated distances, but from a high level, I would say these are fine - although two data points stand out. I'm showing the "fmm" results because they present these two outlying data points: 1) the very first variant on chromosome 9 ends up with an estimated genetic distance of -50; 2) the very last variant on chromosome 19 has a weird hg38 position (way too high) and ends up being estimated lower than the other rightmost variants on chr 19 (it's this SNP: https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?do_not_redirect&rs=rs34501900).

For the pipeline we are using, I think I'm likely to either a) use the linear interpolation, as approx() allows handling cases that fall outside the interpolation interval with the rule argument (return either NAs or the values at interval edge) or b) hack a solution where I just use the corresponding hg19 positions for the hg38 variants I want to convert to genetic distances (they are right there in the hm3 variant list, after all...)

timo-cpr avatar May 31 '21 13:05 timo-cpr

Thanks for reporting.

I'll try when I have some time this week.

privefl avatar May 31 '21 13:05 privefl

Dear developer and users,

I am having the same issue:

POS2 <- snp_asGeneticPos(CHR, POS, rsid = RSID, dir=".") Error in { : task 4 failed - "'y' must be increasing or decreasing"

How should I modify the code?

Many thanks!!

Paloma

pjordab avatar Aug 31 '21 21:08 pjordab

I'll try to look at this again next week.

privefl avatar Jul 30 '22 04:07 privefl

So, I think there are two okay solutions:

  1. Use a linear interpolation: approx(pos.chr, new_pos, xout = pos.chr[indNA], rule = 2)$y
  2. Do not try to interpolate positions corresponding to rsids that have not been matched; these variants may not be easy to map and therefore we may not have a high confidence in them -> maybe better just to remove them (getting NAs for the genetic pos).

Honestly I would play the safe card by picking option 2. @timo-cpr What's your opinion on this?

privefl avatar Jul 31 '22 07:07 privefl