Join results in more than 2^31 rows for format_sumstats

Open Snigireva opened this issue 1 year ago • 3 comments

1. Bug description

Hi! I ran this code to standardize the summary statistics:

library(data.table)  # provides fread()
data <- fread("C:/Folder/trait_qc.sumstats.csv.gz")
reformatted <- MungeSumstats::format_sumstats(path = data, ref_genome = "GRCh37", compute_z = TRUE, return_data = TRUE)

It fails with the error shown at the end of the console output below. Any idea what to do about it?

Console output


Formatted summary statistics will be saved to ==>  C:\Users\P70~1\AppData\Local\Temp\RtmpQNDQJX\file371020c95a84.tsv.gz
Standardising column headers.
First line of summary statistics file: 
SNP	CHR	BP	PVAL	A1	A2	N	Z	BETA	SE	NSTUDY	
Summary statistics report:
   - 45,984,943 rows
   - 23,134,502 unique variants
   - 114,938 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
1,391,650 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Found  Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 24,304,912 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 240 seconds.
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
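
The error message points at duplicate key values, and the report above lists 45,984,943 rows but only 23,134,502 unique variants. Below is a minimal sketch of how one might count repeated SNP IDs with data.table and estimate how many rows a join on that key would produce. It is not the MungeSumstats internals: the toy table `sumstats` and the names `dup_counts` and `n_join_rows` are made up for illustration, and the self-join stands in for whatever join the package actually performs.

```r
library(data.table)

# Toy stand-in for the summary statistics (hypothetical data):
# three SNP IDs appear twice, one appears once.
sumstats <- data.table(
  SNP = c("rs54", "rs54", "rs564", "rs564", "rs2758118", "rs2758118", "rs7"),
  BP  = c(13110L, 47159L, 15644L, 18643L, 15274L, 15274L, 14599L)
)

# Count occurrences per ID and keep only the repeated ones.
dup_counts <- sumstats[, .N, by = SNP][N > 1]
print(dup_counts)

# When both sides of a join share the same duplicated keys, each ID with
# N copies contributes N * N rows. Convert to numeric so the sum cannot
# overflow R's 32-bit integers on real data.
n_join_rows <- sumstats[, .N, by = SNP][, sum(as.numeric(N)^2)]

# Confirm the estimate against an actual self-join on SNP.
joined <- merge(sumstats, sumstats, by = "SNP", allow.cartesian = TRUE)
stopifnot(nrow(joined) == n_join_rows)

# One possible clean-up: keep the first row per ID. Which duplicate to keep
# is a data-dependent decision; this is only an illustration.
deduped <- unique(sumstats, by = "SNP")
```

On the toy table the estimate is 13 rows from 7 input rows. On real data, comparing `n_join_rows` with 2^31 before calling `format_sumstats()` would show whether duplicated IDs alone could explain the failure.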

Data (for the first 50 rows)

df = structure(list(SNP = c("rs367896724", "rs145", "rs534229142", "rs537182", "rs376342519", "rs5586", "rs575272151", "rs544419", "rs5611", "rs54", "rs62635286", "rs62", "rs53173", "rs538791886", "rs558318514", "rs55476", "rs574697788", "rs554", "rs546169444", "rs7", "rs54194", "rs6682385", "rs199856693", "rs3982632", "rs576", "rs2758118", "rs2758118", "rs53363", "rs564", "rs374", "rs2691317", "rs2691315", "rs5575142", "rs541172944", "rs548165136", "rs755466349", "rs539235482", "rs199745162", "rs578", "rs564", "rs533", "rs8", "rs545414834", "rs54", "rs532819925", "rs1", "rs5677884", "rs553572247", "rs539322794", "rs542415"), CHR = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), BP = c(10177L, 10352L, 10511L, 10539L, 10616L, 10642L, 11008L, 11012L, 11063L, 13110L, 13116L, 13118L, 13273L, 13289L, 13445L, 13483L, 13494L, 13550L, 14464L, 14599L, 14604L, 14930L, 14933L, 15211L, 15245L, 15274L, 15274L, 15585L, 15644L, 15774L, 15777L, 15820L, 15903L, 16071L, 16142L, 16226L, 16542L, 16949L, 17641L, 18643L, 18849L, 30923L, 46285L, 47159L, 47267L, 49298L, 49315L, 49343L, 49554L, 50891L ), PVAL = c(0.942, 0.682, 0.891, 0.393, 0.383, 0.297, 0.474, 0.474, 0.848, 0.729, 0.545, 0.545, 0.778, 0.0499, 0.109, 0.00465, 0.591, 0.0709, 0.643, 0.328, 0.328, 0.333, 0.901, 0.141, 0.116, 0.201, 0.259, 0.289, 0.689, 0.836, 0.35, 0.0248, 0.333, 0.565, 0.46, 0.497, 0.206, 0.595, 0.773, 0.197, 0.205, 0.684, 0.155, 0.69, 0.821, 0.311, 0.806, 0.745, 0.972, 0.394), A1 = c("AC", "TA", "A", "A", "CCGCCGTTGCAAAGGCGCGCCG", "A", "G", "G", "G", "A", "G", "G", "C", "C", "G", "C", "G", "A", "T", "A", "G", "A", "A", "T", "T", "A", "G", "A", "A", "A", "G", "T", "GC", "A", "A", "A", "A", "C", "A", "A", "C", "G", "A", "C", "G", "T", "A", "C", "G", "C"), A2 = c("A", "T", "G", "C", "C", "G", "C", "C", "T", "G", "T", "A", "G", "CCT", "C", "G", "A", "G", "A", 
"T", "A", "G", "G", "G", "C", "T", "A", "G", "G", "G", "A", "G", "G", "G", "G", "AG", "C", "A", "G", "G", "G", "T", "ATAT", "T", "T", "C", "T", "T", "A", "T"), N = c(8160L, 8160L, 361237L, 16026L, 372627L, 361266L, 8160L, 8160L, 357928L, 363969L, 8160L, 8160L, 3701L, 378761L, 357928L, 357928L, 358181L, 367239L, 6832L, 8160L, 8160L, 8160L, 358725L, 8160L, 362555L, 3701L, 3701L, 369481L, 362738L, 364049L, 362923L, 2373L, 8160L, 375575L, 367282L, 26547L, 357680L, 364788L, 357928L, 361989L, 368762L, 3701L, 359800L, 364512L, 361256L, 10040L, 362387L, 362834L, 6832L, 367281L), Z = c(0.0727563581760374, -0.409735480321281, 0.137038959961148, -0.854189500094597, 0.872382030909752, 1.04288836267464, -0.715985989610205, -0.715985989610205, 0.19167090224842, 0.346456061065837, -0.605269414941509, -0.605269414941509, 0.281926329587061, -1.96082020683793, 1.60270409055176, -2.83033010490082, 0.537387465090095, 1.80611742223106, -0.463508393356937, 0.978150286262472, 0.978150286262472, -0.968088845878538, -0.124398198069055, 1.47207731715937, 1.57178681650986, 1.27870772031991, 1.1287578451833, 1.06031789670761, 0.400212511707879, -0.207012623385187, -0.93458929107348, -2.24450387316539, 0.968088845878538, -0.575430768607773, -0.738846849185214, 0.679217595655219, 1.26464113566108, 0.531604424103706, 0.288453003564521, -1.29014591650869, -1.26743441691691, 0.407010876264466, -1.42209043212232, 0.398855065642337, -0.226258980439831, 1.01312595979589, 0.245589523422081, -0.325239256402395, 0.0351000017727088, 0.852385797957575), BETA = c(0.00198916, -0.0109805, 0.00765789, -0.149708, 0.0225852, 0.148159, -0.0281357, -0.028136, 0.103634, 0.00314893, -0.0212581, -0.0212581, 0.0161786, -0.0745136, 0.139501, -0.0774387, 0.0209628, 0.0577324, -0.0191033, 0.0330887, 0.0330901, -0.025562, -0.00126148, 0.0439155, 0.0906229, 0.0540921, 0.0478291, 0.0255675, 0.0135413, -0.00585945, -0.0164868, -0.119141, 0.0259418, -0.183099, -0.0257248, 0.0400081, 0.182568, 0.00773019, 0.0147548, 
-0.0327346, -0.0154651, 0.0315515, -0.0640722, 0.0034205, -0.0238865, 0.0309572, 0.0157055, -0.0169812, 0.00182556, 0.0274896), SE = c(0.0274895, 0.0268163, 0.0558682, 0.175335, 0.0258707, 0.141956, 0.0392787, 0.0392787, 0.542386, 0.00908721, 0.0351191, 0.0351191, 0.0574542, 0.0380054, 0.0869389, 0.0273598, 0.0389586, 0.0319694, 0.0412681, 0.0338204, 0.0338204, 0.0264114, 0.0100911, 0.0298549, 0.0576995, 0.0423158, 0.0423857, 0.0241328, 0.033891, 0.0282659, 0.0176259, 0.0530988, 0.0268215, 0.317943, 0.0348059, 0.0589221, 0.144412, 0.0145595, 0.0512095, 0.0253839, 0.0122108, 0.0776434, 0.0450702, 0.00857457, 0.105857, 0.0305461, 0.0639575, 0.0521867, 0.0527002, 0.0322444), NSTUDY = c(5L, 5L, 2L, 5L, 7L, 2L, 5L, 5L, 2L, 5L, 5L, 5L, 4L, 8L, 2L, 2L, 3L, 4L, 4L, 5L, 5L, 5L, 4L, 5L, 3L, 4L, 4L, 4L, 3L, 4L, 4L, 3L, 5L, 2L, 4L, 7L, 2L, 6L, 2L, 4L, 6L, 4L, 3L, 6L, 2L, 7L, 3L, 3L, 4L, 4L)), row.names = c(NA, -50L), class = c("data.table", "data.frame"))

3. Session info

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomeInfoDb_1.34.9 IRanges_2.32.0      S4Vectors_0.36.2
[4] BiocGenerics_0.44.0 data.table_1.14.8

Snigireva, Aug 17 '23 10:08