mtag ValueError: cannot reindex from a duplicate axis

Not sure if this is on my end or if this (perhaps?) has to do with the new n_value or p_name options. This is my first time not adding an N column or renaming the P column from the BOLT-LMM output. After the 3 trait files get munged, emitting mean chi^2, GC estimates, etc, the script panics due to ValueError: cannot reindex from a duplicate axis.

Does any clear issue stand out? If not, I can paste the full log and start manipulating columns to try to narrow down whether this is actually related to the new n_value or p_name settings.

Error:

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Munging of Trait 3 complete. SNPs remaining:	 8647669
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Trait 3: Dropped 9225 SNPs for duplicate values in the "snp_name" column
Dropped 1351905 SNPs due to strand ambiguity, 7286539 SNPs remain in intersection after merging trait1
Dropped 0 SNPs due to strand ambiguity, 7286539 SNPs remain in intersection after merging trait2
Dropped 0 SNPs due to strand ambiguity, 7286539 SNPs remain in intersection after merging trait3
... Merge of GWAS summary statistics complete. Number of SNPs:	 7286539
cannot reindex from a duplicate axis
Traceback (most recent call last):
  File "mtag.py", line 1557, in <module>
    mtag(args)
  File "mtag.py", line 1330, in mtag
    Zs , Ns ,Fs, res_temp, DATA, N_raw = extract_gwas_sumstats(DATA,args,list(np.arange(args.P)))
  File "mtag.py", line 526, in extract_gwas_sumstats
    Ns = DATA.filter(items=n_cols).as_matrix()
  File "/Users/jamesp/anaconda2/envs/mtag/lib/python2.7/site-packages/pandas/core/generic.py", line 3900, in filter
    **{name: [r for r in items if r in labels]})
  File "/Users/jamesp/anaconda2/envs/mtag/lib/python2.7/site-packages/pandas/util/_decorators.py", line 187, in wrapper
    return func(*args, **kwargs)
  File "/Users/jamesp/anaconda2/envs/mtag/lib/python2.7/site-packages/pandas/core/frame.py", line 3566, in reindex
    return super(DataFrame, self).reindex(**kwargs)
  File "/Users/jamesp/anaconda2/envs/mtag/lib/python2.7/site-packages/pandas/core/generic.py", line 3689, in reindex
    fill_value, copy).__finalize__(self)
  File "/Users/jamesp/anaconda2/envs/mtag/lib/python2.7/site-packages/pandas/core/frame.py", line 3496, in _reindex_axes
    fill_value, limit, tolerance)
  File "/Users/jamesp/anaconda2/envs/mtag/lib/python2.7/site-packages/pandas/core/frame.py", line 3521, in _reindex_columns
    allow_dups=False)
  File "/Users/jamesp/anaconda2/envs/mtag/lib/python2.7/site-packages/pandas/core/generic.py", line 3810, in _reindex_with_indexers
    copy=copy)
  File "/Users/jamesp/anaconda2/envs/mtag/lib/python2.7/site-packages/pandas/core/internals.py", line 4414, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/Users/jamesp/anaconda2/envs/mtag/lib/python2.7/site-packages/pandas/core/indexes/base.py", line 3576, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
Analysis terminated from error at Thu Nov 29 20:44:11 2018
Total time elapsed: 12.0m:16.14s

Command:

python mtag.py \
  --sumstats trait1,trait2,trait3 \
  --out mtag.out \
  --n_value 100,100,100 \
  --p_name P_BOLT_LMM \
  --snp_name SNP \
  --chr_name CHR \
  --bpos_name BP \
  --beta_name BETA \
  --se_name SE \
  --a1_name ALLELE1 \
  --a2_name ALLELE0 \
  --eaf_name A1FREQ \
  --n_min 0.0 \
  --info_min 0.3 \
  --cores 1 \
  --use_beta_se \
  --n_approx \
  --stream_stdout \
  --fdr

Nov 30 '18 01:11 carbocation

Hi @carbocation ,

Do you still have the N column present in the input sumstats even though you're using --n_value?

I will tweak the code so that it prioritizes the --n_value flag and ignores any existing N column in the input. But I just wanted to make sure this is indeed the issue. Let me know if removing the N column solves the problem (temporarily).

Thanks, Hui

Dec 03 '18 18:12 huilisabrina
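
For anyone landing here with the same error: the traceback ends in DATA.filter(items=n_cols), and pandas raises this ValueError when the axis being selected from carries repeated labels. The sketch below is a minimal reproduction under one assumption that the thread does not confirm, namely that a second sample-size column ends up with the same label as an existing one. The frame, the column names, and the try_filter helper are made up for illustration, and the exact error wording differs between pandas versions.

```python
import pandas as pd

# Hypothetical frame for one trait that already carries a sample-size column.
data = pd.DataFrame({"snpid": ["rs1", "rs2"], "n": [80610, 80610]})

# Assumption for illustration only: a uniform N is appended under the same
# label, so the columns become ['snpid', 'n', 'n'].
uniform_n = pd.DataFrame({"n": [100, 100]})
merged = pd.concat([data, uniform_n], axis=1)


def try_filter(frame, items):
    """Return (ok, message) for frame.filter(items=...), catching the reindex error."""
    try:
        frame.filter(items=items)
        return True, ""
    except ValueError as err:
        return False, str(err)


has_duplicates = bool(merged.columns.duplicated().any())
ok, message = try_filter(merged, ["n"])
print(has_duplicates, ok, message)

# Dropping the repeated label (keeping the first occurrence) lets the call succeed.
deduped = merged.loc[:, ~merged.columns.duplicated()]
ok_after, _ = try_filter(deduped, ["n"])
print(ok_after)
```

Checking columns.duplicated() on the merged data is a cheap way to confirm or rule out this cause before digging further.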

In the above code, the input file had no N column (it was pure BOLT-LMM output).

Dec 03 '18 18:12 carbocation

Could you quickly check and send the first few lines of each sumstats you're using (just via head in bash)?

Thanks, Hui

Dec 03 '18 18:12 huilisabrina

Sure thing. Each file:

1

SNP     CHR     BP      GENPOS  ALLELE1 ALLELE0 A1FREQ  INFO    CHISQ_LINREG    P_LINREG        BETA    SE      CHISQ_BOLT_LMM_INF      P_BOLT_LMM_INF  CHISQ_BOLT_LMM  P_BOLT_LMM
rs367896724     1       10177   0       A       AC      0.599484        0.467935        0.0815533       7.8E-01 0.146417        0.417527        0.122974        7.3E-01 0.187453        6.7E-01
rs201106462     1       10352   0       T       TA      0.607518        0.447895        2.38133 1.2E-01 0.660126        0.430198        2.35459 1.2E-01 2.37463 1.2E-01
1:10616_CCGCCGTTGCAAAGGCGCGCCG_C        1       10616   0       CCGCCGTTGCAAAGGCGCGCCG  C       0.00537706      0.468098        1.22665 2.7E-01 3.37521 2.94865 1.31025 2.5E-01 1.29    2.6E-01

2

SNP     CHR     BP      GENPOS  ALLELE1 ALLELE0 A1FREQ  INFO    CHISQ_LINREG    P_LINREG        BETA    SE      CHISQ_BOLT_LMM_INF      P_BOLT_LMM_INF  CHISQ_BOLT_LMM  P_BOLT_LMM
rs367896724     1       10177   0       A       AC      0.599484        0.467935        0.0661935       8.0E-01 -0.0153067      0.245892        0.00387504      9.5E-01 0.0001763       9.9E-01
rs201106462     1       10352   0       T       TA      0.607518        0.447895        3.64409 5.6E-02 0.516429        0.253354        4.15493 4.2E-02 3.98652 4.6E-02
1:10616_CCGCCGTTGCAAAGGCGCGCCG_C        1       10616   0       CCGCCGTTGCAAAGGCGCGCCG  C       0.00537706      0.468098        0.186749        6.7E-01 1.13393 1.73653 0.426389        5.1E-01  0.389551        5.3E-01

3

SNP     CHR     BP      GENPOS  ALLELE1 ALLELE0 A1FREQ  INFO    CHISQ_LINREG    P_LINREG        BETA    SE      CHISQ_BOLT_LMM_INF      P_BOLT_LMM_INF  CHISQ_BOLT_LMM  P_BOLT_LMM
rs367896724     1       10177   0       A       AC      0.599484        0.467935        0.641061        4.2E-01 0.000588464     0.000945475     0.387382        5.3E-01 0.379562        5.4E-01
rs201106462     1       10352   0       T       TA      0.607518        0.447895        0.663884        4.2E-01 -0.000941531    0.000974169     0.934116        3.3E-01 0.907045        3.4E-01
1:10616_CCGCCGTTGCAAAGGCGCGCCG_C        1       10616   0       CCGCCGTTGCAAAGGCGCGCCG  C       0.00537706      0.468098        0.20568 6.5E-01 0.00148604      0.00667712      0.049531        8.2E-01 0.0552681       8.1E-01

Dec 03 '18 19:12 carbocation

Thanks! Hmm, I still can't replicate your problem. I'll track it down eventually, but if this is urgent, could you try removing all the unused columns from the input (i.e. P_BOLT_LMM_INF, CHISQ_BOLT_LMM_INF, CHISQ_BOLT_LMM, CHISQ_LINREG, P_LINREG)? I'm sure this will work. The --p_name and --n_value options function fine as long as the inputs are well formatted. I'll let you know once I've tried more things.

Dec 03 '18 19:12 huilisabrina

Thanks for your help! For now, I've just gone back to the old approach which still works fine (adding an N column, renaming the P_BOLT_LMM column to P, and discarding irrelevant fields).

Dec 03 '18 19:12 carbocation
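
The workaround described above (keep only the needed fields, rename P_BOLT_LMM to P, add an N column) can be sketched with the standard library alone. This is an illustration, not the script anyone in the thread actually used: trim_bolt is a hypothetical helper, the input is assumed to be tab-delimited, the kept columns are the ones named in the command above plus INFO, and the sample size passed as n is a placeholder that must be replaced with the real value for each trait.

```python
import csv
import io

# Columns kept from the BOLT-LMM output; P_BOLT_LMM is renamed to P on output.
KEEP = ["SNP", "CHR", "BP", "ALLELE1", "ALLELE0", "A1FREQ", "INFO", "BETA", "SE", "P_BOLT_LMM"]
RENAME = {"P_BOLT_LMM": "P"}


def trim_bolt(src, dst, n):
    """Copy only the KEEP columns from a tab-delimited stream and append a constant N."""
    reader = csv.DictReader(src, delimiter="\t")
    missing = [c for c in KEEP if c not in reader.fieldnames]
    if missing:
        raise ValueError("missing expected columns: %s" % ", ".join(missing))
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow([RENAME.get(c, c) for c in KEEP] + ["N"])
    for row in reader:
        writer.writerow([row[c] for c in KEEP] + [n])


# One row taken from the first file posted in this thread.
sample = (
    "SNP\tCHR\tBP\tGENPOS\tALLELE1\tALLELE0\tA1FREQ\tINFO\tCHISQ_LINREG\tP_LINREG\tBETA\tSE\t"
    "CHISQ_BOLT_LMM_INF\tP_BOLT_LMM_INF\tCHISQ_BOLT_LMM\tP_BOLT_LMM\n"
    "rs367896724\t1\t10177\t0\tA\tAC\t0.599484\t0.467935\t0.0815533\t7.8E-01\t0.146417\t0.417527\t"
    "0.122974\t7.3E-01\t0.187453\t6.7E-01\n"
)
out = io.StringIO()
trim_bolt(io.StringIO(sample), out, n=100)  # n=100 is a placeholder, not a real sample size
print(out.getvalue())
```

For real files, pass open file handles instead of the in-memory streams. Because the output has only one p-value column and one N column, the default column names can then be used.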

I seem to recall an issue previously where the function sometimes mis-identifies which column is which when they are not specified and when there are extraneous columns in the file. (e.g., if there is a column that has an N in it but is not the sample size column.) I don't remember if we ever resolved this.

Dec 04 '18 00:12 paturley

@paturley This hasn't come up during my time with mtag... but I think I just replicated the error that @carbocation is seeing. The problem is the presence of multiple p-value columns: in addition to the target P_BOLT_LMM, the input also contains P_BOLT_LMM_INF, so both are identified and the software cannot tell which one to use. I will look into how to resolve this. I think this didn't pop up sooner because input files usually have more distinguishable column names. A general enhancement I should consider from now on is how to make BOLT-LMM output and mtag work together more smoothly!

Dec 04 '18 20:12 huilisabrina
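
To see why the BOLT-LMM header is ambiguous, the sketch below lists every column whose name begins with a requested name. The prefix rule is an assumption made purely for illustration, since the thread does not show how mtag actually matches column names, and prefix_matches is a hypothetical helper.

```python
def prefix_matches(header, requested):
    """Return the header columns whose upper-cased name starts with the requested name."""
    target = requested.upper()
    return [col for col in header if col.upper().startswith(target)]


# Header copied from the BOLT-LMM files posted earlier in this thread.
bolt_header = (
    "SNP CHR BP GENPOS ALLELE1 ALLELE0 A1FREQ INFO CHISQ_LINREG P_LINREG "
    "BETA SE CHISQ_BOLT_LMM_INF P_BOLT_LMM_INF CHISQ_BOLT_LMM P_BOLT_LMM"
).split()

candidates = prefix_matches(bolt_header, "P_BOLT_LMM")
print(candidates)
```

More than one candidate for a requested name is a hint that the unused columns should be dropped or renamed before the run.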

Hello,

Need help for a similar problem.

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Munging of Trait 2 complete. SNPs remaining: 921802
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Dropped 4 SNPs due to strand ambiguity, 933905 SNPs remain in intersection after merging trait1
Dropped 0 SNPs due to strand ambiguity, 908163 SNPs remain in intersection after merging trait2
... Merge of GWAS summary statistics complete. Number of SNPs: 908163
cannot reindex from a duplicate axis
Traceback (most recent call last):
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/mtag.py", line 1557, in <module>
    mtag(args)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/mtag.py", line 1330, in mtag
    Zs , Ns ,Fs, res_temp, DATA, N_raw = extract_gwas_sumstats(DATA,args,list(np.arange(args.P)))
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/mtag.py", line 526, in extract_gwas_sumstats
    Ns = DATA.filter(items=n_cols).as_matrix()
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/generic.py", line 2389, in filter
    [r for r in items if r in axis_values]})
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/frame.py", line 2741, in reindex
    **kwargs)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/generic.py", line 2229, in reindex
    fill_value, copy).__finalize__(self)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/frame.py", line 2682, in _reindex_axes
    limit, tolerance)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/frame.py", line 2707, in _reindex_columns
    allow_dups=False)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/generic.py", line 2341, in _reindex_with_indexers
    copy=copy)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/internals.py", line 3586, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/indexes/base.py", line 2293, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
Analysis terminated from error at Tue Apr 9 15:48:14 2019
Total time elapsed: 19.15s

My sumstats look fine to me (I formatted them according to the mtag GitHub page):

sumstats 1

snpid      chr  bpos    a1  a2  freq    z                pval    n
rs3094315  1    752566  A   G   0.8313  -0.912087912088  0.3617  80610
rs3131972  1    752721  A   G   0.1712  1.01470588235    0.3102  80610
rs3131969  1    754182  A   G   0.1456  1.0635451505     0.2875  80610
rs1048488  1    760912  T   C   0.8237  -0.741007194245  0.4587  80610

sumstats 2

snpid      chr  bpos    a1  a2  freq               z                 pval    n
rs3094315  1    752566  A   G   0.849630143319     -0.268320891765   0.787   150064
rs3131972  1    752721  A   G   0.157940360610264  0.226337945774    0.8229  150064
rs3131969  1    754182  A   G   0.143377253814147  -0.0127401275584  0.9903  150064
rs1048488  1    760912  T   C   0.8275543227       -0.399863453576   0.6871  150064

At this stage I have no idea where it went wrong, and I hope you can point me in the right direction. (At least, I don't think it is a column-name problem like the one discussed above.)

Thank you very much,

Restu

Apr 09 '19 05:04 restuadi311

Hi @restuadi311 ,

The sumstats look fine to me, so as long as the specification of the column names (implied in your command) aligns with the format of your data, mtag should run through. Can you attach the full log file or send the commands that you used?

Thanks, Hui

Apr 09 '19 14:04 huilisabrina

Hi Hui,

Thank you very much for your speedy reply. (If you think you'll need some dummy files to check, please let me know and I'll upload them to Google Drive.)

Here is the log file:

(virtualenv_mtag) (base) [r.restuadi@delta008 pipe_auto2]$ python "$software_path"/mtag/mtag.py --sumstats "$inputs_dir"/"$ref_trait"_mtag_hm3_maf01.txt,"$inputs_dir"/"$i"_mtag_hm3_maf01.txt --out "$temp_dir"/"$i"_mtag_hm3_maf01 --n_min 0.0 --stream_stdout

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<>
<> MTAG: Multi-trait Analysis of GWAS
<> Version: 1.0.8
<> (C) 2017 Omeed Maghzian, Raymond Walters, and Patrick Turley
<> Harvard University Department of Economics / Broad Institute of MIT and Harvard
<> GNU General Public License v3
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<> Note: It is recommended to run your own QC on the input before using this program.
<> Software-related correspondence: [email protected]
<> All other correspondence: [email protected]
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Calling ./mtag.py
--p-name pval
--stream-stdout
--n-min 0.0
--n-value 80610,257828
--sumstats /shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/pipe_auto2/als_mtag_hm3_maf01.txt,/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/pipe_auto2/CP_mtag_hm3_maf01.txt
--out /shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/pipe_auto2/CP_mtag_hm3_maf01

Beginning MTAG analysis...
MTAG will use the Z column for analyses.
Read in Trait 1 summary statistics (933910 SNPs) from /shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/pipe_auto2/als_mtag_hm3_maf01.txt ...
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Munging Trait 1 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Interpreting column names as follows:
snpid: Variant ID (e.g., rs number)
n: Sample size
a1: a1, interpreted as ref allele for signed sumstat.
pval: p-Value
a2: a2, interpreted as non-ref allele for signed sumstat.
z: Directional summary statistic as specified by --signed-sumstats.

Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time.
Read 933910 SNPs from --sumstats file.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= None.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with SE <0 or NaN values.
Removed 0 SNPs with out-of-bounds p-values.
Removed 1 variants that were not SNPs. Note: strand ambiguous SNPs were not dropped.
933909 SNPs remain.
Adding uniform sample size 80610 to summary statistics.
Removed 0 SNPs with duplicated rs numbers (933909 SNPs remain).
Removed 0 SNPs with N < 0.0 (933909 SNPs remain).
Median value of SIGNED_SUMSTAT was 0.0, which seems sensible.
Dropping snps with null values

Metadata:
Mean chi^2 = 1.082
Lambda GC = 1.05
Max chi^2 = 130.165
42 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Tue Apr 9 15:48:00 2019
Total time elapsed: 4.55s
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Munging of Trait 1 complete. SNPs remaining: 933909
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Read in Trait 2 summary statistics (921802 SNPs) from /shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/pipe_auto2/CP_mtag_hm3_maf01.txt ...
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Munging Trait 2 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Interpreting column names as follows:
snpid: Variant ID (e.g., rs number)
n: Sample size
a1: a1, interpreted as ref allele for signed sumstat.
pval: p-Value
a2: a2, interpreted as non-ref allele for signed sumstat.
z: Directional summary statistic as specified by --signed-sumstats.

Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time.
Read 921802 SNPs from --sumstats file.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= None.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with SE <0 or NaN values.
Removed 0 SNPs with out-of-bounds p-values.
Removed 0 variants that were not SNPs. Note: strand ambiguous SNPs were not dropped.
921802 SNPs remain.
Adding uniform sample size 257828 to summary statistics.
Removed 0 SNPs with duplicated rs numbers (921802 SNPs remain).
Removed 0 SNPs with N < 0.0 (921802 SNPs remain).
Median value of SIGNED_SUMSTAT was 0.0, which seems sensible.
Dropping snps with null values

Metadata:
Mean chi^2 = 2.175
Lambda GC = 1.827
Max chi^2 = 124.961
2868 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Tue Apr 9 15:48:08 2019
Total time elapsed: 4.81s
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Munging of Trait 2 complete. SNPs remaining: 921802
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Dropped 4 SNPs due to strand ambiguity, 933905 SNPs remain in intersection after merging trait1
Dropped 0 SNPs due to strand ambiguity, 908163 SNPs remain in intersection after merging trait2
... Merge of GWAS summary statistics complete. Number of SNPs: 908163
cannot reindex from a duplicate axis
Traceback (most recent call last):
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/mtag.py", line 1557, in <module>
    mtag(args)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/mtag.py", line 1330, in mtag
    Zs , Ns ,Fs, res_temp, DATA, N_raw = extract_gwas_sumstats(DATA,args,list(np.arange(args.P)))
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/mtag.py", line 526, in extract_gwas_sumstats
    Ns = DATA.filter(items=n_cols).as_matrix()
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/generic.py", line 2389, in filter
    [r for r in items if r in axis_values]})
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/frame.py", line 2741, in reindex
    **kwargs)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/generic.py", line 2229, in reindex
    fill_value, copy).__finalize__(self)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/frame.py", line 2682, in _reindex_axes
    limit, tolerance)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/frame.py", line 2707, in _reindex_columns
    allow_dups=False)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/generic.py", line 2341, in _reindex_with_indexers
    copy=copy)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/core/internals.py", line 3586, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/shares/compbio/Group-Wray/restuadi/project/Multitraits_prediction/Jan2019run/software/mtag/virtualenv_mtag/lib/python2.7/site-packages/pandas/indexes/base.py", line 2293, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
Analysis terminated from error at Tue Apr 9 15:48:14 2019
Total time elapsed: 19.15s
(virtualenv_mtag) (base) [r.restuadi@delta008 pipe_auto2]$

Thank you very much

Apr 10 '19 00:04 restuadi311

Hi @restuadi311 ,

I tried but still could not replicate the error you're getting. Are you using the latest version of the software (i.e. have you re-pulled the repo just to make sure)? If this still doesn't work, feel free to upload your data to a shareable location and I'll take a look.

Thanks, Hui

Apr 11 '19 16:04 huilisabrina

Hi @huilisabrina

Yep, I've tried the newest MTAG version and still get the same error. I've uploaded the dummy files here: https://drive.google.com/open?id=1G9555DjSweGHw8yIVneDYh0KZePKI24m

Please let me know, if there is a problem.

Thank you very much,

Restu

Apr 15 '19 00:04 restuadi311

Hi @restuadi311 ,

Thanks for sharing your files! Sorry for the delay. This was due to a bug that was a bit hard to find. I just fixed it in the latest edits. Please re-pull the repo and try again. Let me know if this still doesn't work!

Best, Hui

Apr 17 '19 20:04 huilisabrina

Hi Hui,

Very sorry for the late reply; I just came back to work from a nice, long sabbatical. Thank you very much for the fix, it's working well now.

Restu

May 08 '19 04:05 restuadi311

Hi @huilisabrina ,

I have also encountered a similar problem. I don't know why it happened.

Error:

<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<>
<> MTAG: Multi-trait Analysis of GWAS
<> Version: 1.0.8
<> (C) 2017 Omeed Maghzian, Raymond Walters, and Patrick Turley
<> Harvard University Department of Economics / Broad Institute of MIT and Harvard
<> GNU General Public License v3
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<> Note: It is recommended to run your own QC on the input before using this program.
<> Software-related correspondence: [email protected]
<> All other correspondence: [email protected]
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Calling ./mtag.py
--p-name pval
--stream-stdout
--n-min 0.0
--sumstats MAGIC1000G_FI_EUR_MTAG1.tsv,MAGIC1000G_FG_EUR_MTAG1.tsv
--out ./new

Beginning MTAG analysis...
MTAG will use the Z column for analyses.
Read in Trait 1 summary statistics (32635792 SNPs) from MAGIC1000G_FI_EUR_MTAG1.tsv ...
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Munging Trait 1 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Interpreting column names as follows:
snpid: Variant ID (e.g., rs number)
n: Sample size
a1: a1, interpreted as ref allele for signed sumstat.
pval: p-Value
a2: a2, interpreted as non-ref allele for signed sumstat.
z: Directional summary statistic as specified by --signed-sumstats.
se: Standard errors of BETA coefficients

Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time.
Read 32635792 SNPs from --sumstats file.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= None.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with SE <0 or NaN values.
Removed 0 SNPs with out-of-bounds p-values.
Removed 2398490 variants that were not SNPs. Note: strand ambiguous SNPs were not dropped.
30237302 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (30237302 SNPs remain).
Removed 0 SNPs with N < 0.0 (30237302 SNPs remain).
Median value of SIGNED_SUMSTAT was 0.0, which seems sensible.
Dropping snps with null values

Metadata:
Mean chi^2 = 1.032
Lambda GC = 0.998
Max chi^2 = 169.73
2179 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Thu Sep 1 10:26:53 2022
Total time elapsed: 6.0m:43.97s
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Munging of Trait 1 complete. SNPs remaining: 30237332
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Trait 1: Dropped 30 SNPs for duplicate values in the "snp_name" column
Read in Trait 2 summary statistics (34064006 SNPs) from MAGIC1000G_FG_EUR_MTAG1.tsv ...
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Munging Trait 2 <><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Interpreting column names as follows:
snpid: Variant ID (e.g., rs number)
n: Sample size
a1: a1, interpreted as ref allele for signed sumstat.
pval: p-Value
a2: a2, interpreted as non-ref allele for signed sumstat.
z: Directional summary statistic as specified by --signed-sumstats.
se: Standard errors of BETA coefficients

Reading sumstats from provided DataFrame into memory 10000000 SNPs at a time.
WARNING: 6 SNPs had P outside of (0,1]. The P column may be mislabeled.
Read 34064006 SNPs from --sumstats file.
Removed 0 SNPs with missing values.
Removed 0 SNPs with INFO <= None.
Removed 0 SNPs with MAF <= 0.01.
Removed 0 SNPs with SE <0 or NaN values.
Removed 6 SNPs with out-of-bounds p-values.
Removed 2442984 variants that were not SNPs. Note: strand ambiguous SNPs were not dropped.
31621016 SNPs remain.
Removed 0 SNPs with duplicated rs numbers (31621016 SNPs remain).
Removed 0 SNPs with N < 0.0 (31621016 SNPs remain).
Median value of SIGNED_SUMSTAT was -0.00151976, which seems sensible.
Dropping snps with null values

Metadata:
Mean chi^2 = 1.044
Lambda GC = 0.998
Max chi^2 = 1477.304
6354 Genome-wide significant SNPs (some may have been removed by filtering).

Conversion finished at Thu Sep 1 10:37:45 2022
Total time elapsed: 7.0m:8.03s
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Munging of Trait 2 complete. SNPs remaining: 31621050
<><><<>><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

Trait 2: Dropped 34 SNPs for duplicate values in the "snp_name" column
Dropped 4623481 SNPs due to strand ambiguity, 25613821 SNPs remain in intersection after merging trait1
Dropped 2 SNPs due to inconsistent allele pairs from phenotype 2. 25449696 SNPs remain.
Dropped 0 SNPs due to strand ambiguity, 25449696 SNPs remain in intersection after merging trait2
... Merge of GWAS summary statistics complete. Number of SNPs: 25449696
cannot reindex from a duplicate axis
Traceback (most recent call last):
  File "/home/gyu/mtag/mtag.py", line 1577, in <module>
    mtag(args)
  File "/home/gyu/mtag/mtag.py", line 1346, in mtag
    Zs , Ns ,Fs, res_temp, DATA, N_raw = extract_gwas_sumstats(DATA,args,list(np.arange(args.P)))
  File "/home/gyu/mtag/mtag.py", line 530, in extract_gwas_sumstats
    Ns = DATA.filter(items=n_cols).as_matrix()
  File "/home/gyu/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/core/generic.py", line 4570, in filter
    **{name: [r for r in items if r in labels]})
  File "/home/gyu/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/util/_decorators.py", line 197, in wrapper
    return func(*args, **kwargs)
  File "/home/gyu/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/core/frame.py", line 3809, in reindex
    return super(DataFrame, self).reindex(**kwargs)
  File "/home/gyu/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/core/generic.py", line 4356, in reindex
    fill_value, copy).__finalize__(self)
  File "/home/gyu/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/core/frame.py", line 3736, in _reindex_axes
    fill_value, limit, tolerance)
  File "/home/gyu/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/core/frame.py", line 3761, in _reindex_columns
    allow_dups=False)
  File "/home/gyu/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/core/generic.py", line 4490, in _reindex_with_indexers
    copy=copy)
  File "/home/gyu/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/core/internals/managers.py", line 1224, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/home/gyu/anaconda3/envs/python27/lib/python2.7/site-packages/pandas/core/indexes/base.py", line 3087, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
Analysis terminated from error at Thu Sep 1 10:44:04 2022
Total time elapsed: 25.0m:5.0s

The form of my GWAS summary data is like this:

trait1

snpid        chr  bpos       a1  a2  freq   beta     se      pval     sample_size  n       z
rs147324274  10   100000012  A   G   NA     -0.0553  0.1863  0.8544   11047        196991  -0.296833
rs571272521  10   10000011   A   G   NA     0.14     0.1724  0.4606   7428         196991  0.812065
rs144804129  10   100000122  A   T   0.003  -0.1099  0.1202  0.2684   28015        196991  -0.914309
rs6602381    10   10000018   A   G   0.6    -0.0054  0.0022  0.03577  124123       196991  -2.45455
rs539340063  10   100000259  A   G   NA     0.6904   0.3295  0.03067  7428         196991  2.0953
rs147936544  10   100000274  A   G   0.001  0.1089   0.1737  0.6802   8834.04      196991  0.626943
rs189891329  10   10000033   A   G   NA     0.872    0.6428  0.3275   444.998      196991  1.35657
rs188626770  10   100000430  T   G   NA     0.1228   0.2166  0.5612   9556.95      196991  0.566944
rs547178188  10   100000439  T   C   NA     -0.0356  0.2591  0.9092   7428         196991  -0.137399

trait2

snpid             chr  bpos       a1  a2  freq   beta     se      pval    sample_size  n       z
rs147324274       10   100000012  A   G   NA     -0.0257  0.1854  0.9613  26201.1      196991  -0.138619
rs571272521       10   10000011   A   G   NA     0.032    0.1405  0.8781  8729         196991  0.227758
rs144804129       10   100000122  A   T   0.003  -0.0077  0.1082  0.9647  30955        196991  -0.0711645
rs6602381         10   10000018   A   G   0.6    1e-04    0.0019  0.6219  165515       196991  0.0526316
rs539340063       10   100000259  A   G   NA     -0.0393  0.3007  0.9276  8729         196991  -0.130695
rs147936544       10   100000274  A   G   0.001  -0.179   0.1683  0.5381  16684        196991  -1.06358
rs188626770       10   100000430  T   G   NA     0.4309   0.2266  0.1228  9556.95      196991  1.90159
rs547178188       10   100000439  T   C   NA     -0.0388  0.2368  0.9578  8729         196991  -0.163851
10_100000554_D_I  10   100000554  D   I   NA     -0.0041  0.0083  0.6165  8729         196991  -0.493976

I have also updated the MTAG using git pull

Thank you very much!

Sep 01 '22 03:09 CharlesLambert70

Judging from Stack Overflow, the most common cause of this error is duplicate index values. Do you have duplicate rsIDs in your data?

MTAG does some filtering and attempts some data cleaning, but it's not 100% comprehensive. If it's not duplicate rsIDs, the data may simply need more QC. Your example data has a lot of NA frequencies, and the log shows MTAG dropping many items that it has trouble with ("Removed 2442984 variants that were not SNPs", for example).

I'd try checking for duplicate rsIDs first, but if that doesn't work, then maybe some other data QC could help.

Sep 02 '22 14:09 JonJala
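
The checks suggested above can be run before handing a file to mtag. The sketch below counts repeated variant IDs, NA frequencies, and rows whose alleles are not single A/C/G/T bases. It assumes a whitespace-delimited file with a header row and columns named snpid, a1, a2 and freq, as in the example data posted above; qc_summary is a hypothetical helper, and the last sample row is duplicated on purpose so that the check has something to report.

```python
import io
from collections import Counter

VALID_BASES = {"A", "C", "G", "T"}


def qc_summary(lines):
    """Count duplicate IDs, NA frequencies and non-SNP alleles in a sumstats stream."""
    rows = iter(lines)
    header = next(rows).split()
    idx = {name: header.index(name) for name in ("snpid", "a1", "a2", "freq")}
    id_counts = Counter()
    na_freq = 0
    non_snp = 0
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        id_counts[fields[idx["snpid"]]] += 1
        if fields[idx["freq"]] == "NA":
            na_freq += 1
        alleles = {fields[idx["a1"]].upper(), fields[idx["a2"]].upper()}
        if not alleles <= VALID_BASES:
            non_snp += 1
    duplicates = {snp: n for snp, n in id_counts.items() if n > 1}
    return {"duplicates": duplicates, "na_freq": na_freq, "non_snp": non_snp}


# Rows taken from the trait2 example above; the final row is repeated deliberately.
sample = io.StringIO(
    "snpid chr bpos a1 a2 freq beta se pval sample_size n z\n"
    "rs147324274 10 100000012 A G NA -0.0257 0.1854 0.9613 26201.1 196991 -0.138619\n"
    "rs6602381 10 10000018 A G 0.6 1e-04 0.0019 0.6219 165515 196991 0.0526316\n"
    "10_100000554_D_I 10 100000554 D I NA -0.0041 0.0083 0.6165 8729 196991 -0.493976\n"
    "rs6602381 10 10000018 A G 0.6 1e-04 0.0019 0.6219 165515 196991 0.0526316\n"
)
summary = qc_summary(sample)
print(summary)
```

For a real file, pass an open file handle instead of the in-memory sample. Any non-empty duplicates entry is worth resolving before the run.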

Many thanks!

Sep 08 '22 02:09 CharlesLambert70
