KeyError in PCA.py

Open skgttjo opened this issue 5 years ago • 3 comments

Hi Brent,

I cloned the peddy package yesterday and have got this error:

Traceback (most recent call last):
  File "/home/torme/anaconda2/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/torme/anaconda2/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/torme/Desktop/TOOLS/peddy/peddy/__main__.py", line 14, in <module>
    sys.exit(cli())
  File "/home/torme/anaconda2/lib/python2.7/site-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/home/torme/anaconda2/lib/python2.7/site-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/home/torme/anaconda2/lib/python2.7/site-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/torme/anaconda2/lib/python2.7/site-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "peddy/cli.py", line 207, in peddy
    in ("ped_check", "het_check", "sex_check")]):
  File "peddy/cli.py", line 43, in run
    prefix=prefix, **kwargs)
  File "peddy/peddy.py", line 878, in het_check
    pca_df, background_pca_df = pca(pca_plot, sitesfile, gt_types, sites)
  File "peddy/pca.py", line 46, in pca
    idxs = np.array([kgsites_index[s] for s in sites])
KeyError: u'1:900505:G:C'

I get output files ped_check and ped_check_rel-difference, but no files for sex or PCA.

The error is at position 1:900505. I'm not sure in what order peddy processes the variants in the file, but this is not the first variant position in my VCF. (Perhaps, for speed, the program does not run through the sites sequentially, so this may not be relevant anyway.) This position looks fine in my VCF: it is present and has the same ref and alt alleles as in the 1000 Genomes data. Is there anything obvious that would cause this error?

Sorry if this is an obvious oversight on my part... The output that I did get looks beautiful, so thank you!

Many thanks, Tatiana

skgttjo avatar Jul 20 '18 16:07 skgttjo

I wonder if you are getting a mix of peddy modules. The new version has a different set of sites, so maybe you're somehow getting the old PCA module? I suspect this is the case because you're running it out of the source directory.

This should be fixed with local imports in peddy but, pending that fix, you can also check that you have only 1 peddy module available.
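
One way to run that check is to list every directory on the import path that holds a module or package named peddy. This is only a minimal sketch and not part of peddy itself: `find_module_copies` is a hypothetical helper, and it only looks for plain `peddy/__init__.py` or `peddy.py` entries, so zipped or other unusual install formats would be missed.

```python
import os
import sys


def find_module_copies(name, search_path=None):
    """Return each directory on the search path that contains `name`.

    Hypothetical helper: it only detects `<dir>/<name>/__init__.py` and
    `<dir>/<name>.py`, not zipped or namespace-package installs.
    """
    hits = []
    for entry in (sys.path if search_path is None else search_path):
        # An empty entry on sys.path means the current directory.
        base = os.path.realpath(entry or os.getcwd())
        is_package = os.path.isfile(os.path.join(base, name, "__init__.py"))
        is_module = os.path.isfile(os.path.join(base, name + ".py"))
        if (is_package or is_module) and base not in hits:
            hits.append(base)
    return hits


if __name__ == "__main__":
    copies = find_module_copies("peddy")
    for path in copies:
        print(path)
    if len(copies) > 1:
        print("More than one peddy found; the first path listed wins at import time.")
```

Running this from inside the cloned source directory and again from somewhere else should show whether the working directory is shadowing an installed copy.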

thanks for reporting.

brentp avatar Jul 20 '18 16:07 brentp

Hi Brent,

Thank you for getting back to me. Just to let you know (in case it helps troubleshoot), I removed all things peddy, reinstalled, and got the same error as above. I then ran it on a different VCF, in case the problem was with the original VCF file, and got the same traceback up until the last frame. Instead of the error at line 46 in pca, I got:

  File "peddy/pca.py", line 60, in pca
    clf = make_pipeline(PCA(n_components=4, whiten=True, copy=True, svd_solver="randomized"),
TypeError: __init__() got an unexpected keyword argument 'svd_solver'

Does this seem like a problem at my end (with the installation or the VCF)? I don't want to waste your time!

All the best

skgttjo avatar Jul 23 '18 15:07 skgttjo

you must have a very old version of scikit-learn. I would update that, re-run and report the error (if there is one).
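
A quick way to confirm this before re-running is to print the installed scikit-learn version and test whether its `PCA` accepts `svd_solver`. The sketch below is an illustration rather than anything peddy ships: `accepts_keyword` is a hypothetical helper, and it relies on `inspect.signature`, so it assumes Python 3.

```python
import inspect


def accepts_keyword(func, keyword):
    """Return True if `func` can be called with the named keyword argument."""
    named = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    for param in inspect.signature(func).parameters.values():
        # A **kwargs catch-all accepts any keyword.
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            return True
        if param.name == keyword and param.kind in named:
            return True
    return False


if __name__ == "__main__":
    try:
        import sklearn
        from sklearn.decomposition import PCA
    except ImportError:
        print("scikit-learn is not installed in this environment")
    else:
        print("scikit-learn version:", sklearn.__version__)
        print("PCA accepts svd_solver:", accepts_keyword(PCA, "svd_solver"))
```

If the second line prints False, the environment is picking up a scikit-learn release that predates the `svd_solver` parameter.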

brentp avatar Jul 23 '18 15:07 brentp

Which version of scikit-learn should I choose to fix this bug?

ruizgo avatar May 18 '23 03:05 ruizgo

@ruizgo which error? the svd_solver error? Any recent version is fine.

If you are seeing an error like in the first message (KeyError: u'1:900505:G:C'), then you must be using a set of sites that doesn't match the ones used to create the thousand genomes labeled set. Can you share the command you ran along with the full error message?
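
To see how large such a mismatch is, the lookup that fails in the first traceback can be replaced by a membership test that collects every missing key instead of stopping at the first one. This is a sketch under assumptions: the names `sites` and `kgsites_index` are borrowed from the traceback, `missing_sites` is a hypothetical helper, and the data below is a toy example rather than real peddy input.

```python
def missing_sites(sites, kgsites_index):
    """Return the site keys that are absent from the reference index, in order."""
    return [s for s in sites if s not in kgsites_index]


if __name__ == "__main__":
    # Toy data only; real keys would come from the sites file and the reference set.
    kgsites_index = {"1:100:A:G": 0, "1:200:C:T": 1}
    sites = ["1:100:A:G", "1:900505:G:C", "1:200:C:T"]
    absent = missing_sites(sites, kgsites_index)
    print("%d of %d sites missing from the reference index" % (len(absent), len(sites)))
    print(absent[:10])
```

A handful of missing keys would point at a stale sites file, while nearly all keys missing would suggest the two sets were never built from the same list.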

brentp avatar May 18 '23 03:05 brentp

2023-05-18 03:06:57 01a458450dfc peddy.cli[786] INFO Running Peddy version 0.4.8
2023-05-18 03:06:57 01a458450dfc peddy.cli[786] INFO ped_check
2023-05-18 03:06:57 01a458450dfc peddy.cli[786] INFO ran in 0.3 seconds
2023-05-18 03:06:57 01a458450dfc peddy.cli[786] INFO het_check
2023-05-18 03:06:58 01a458450dfc peddy.pca[786] INFO loaded and subsetted thousand-genomes genotypes (shape: (2504, 1)) in 0.4 seconds
Traceback (most recent call last):
  File "/usr/local/bin/peddy", line 33, in <module>
    sys.exit(load_entry_point('peddy', 'console_scripts', 'peddy')())
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/peddy-0.4.8/peddy/cli.py", line 208, in peddy
    for check, df, background_df in map(run, [(check, ped, vcf, plot, prefix, each, procs, sites) for check
  File "/opt/peddy-0.4.8/peddy/cli.py", line 42, in run
    df = getattr(p, check)(vcf, plot=plot, each=each, ncpus=ncpus,
  File "/opt/peddy-0.4.8/peddy/peddy.py", line 879, in het_check
    pca_df, background_pca_df = pca(pca_plot, sitesfile, gt_types, sites)
  File "/opt/peddy-0.4.8/peddy/pca.py", line 64, in pca
    clf.fit(genos1kg, background_target)
  File "/usr/local/lib/python3.10/site-packages/sklearn/pipeline.py", line 401, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/usr/local/lib/python3.10/site-packages/sklearn/pipeline.py", line 359, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/usr/local/lib/python3.10/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/sklearn/pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/usr/local/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/sklearn/decomposition/_pca.py", line 462, in fit_transform
    U, S, Vt = self._fit(X)
  File "/usr/local/lib/python3.10/site-packages/sklearn/decomposition/_pca.py", line 514, in _fit
    return self._fit_truncated(X, n_components, self._fit_svd_solver)
  File "/usr/local/lib/python3.10/site-packages/sklearn/decomposition/_pca.py", line 587, in _fit_truncated
    raise ValueError(
ValueError: n_components=4 must be between 1 and min(n_samples, n_features)=1 with svd_solver='randomized'
None

This is the error output. It seems to occur when reading certain VCF files.

ruizgo avatar May 18 '23 03:05 ruizgo

This means that only 1 SNP in the thousand genomes set was also in your set. So, most likely, either your data is too sparse or you're using the wrong genome build.
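
The log line "shape: (2504, 1)" already shows the problem: 2504 reference samples but a single shared site, while the PCA asks for 4 components. A rough pre-flight check along those lines might look like the sketch below. `pca_shape_problem` is a hypothetical helper, not a peddy function, and it only mirrors the constraint quoted in the ValueError.

```python
def pca_shape_problem(shape, n_components):
    """Return a message if PCA cannot fit a matrix of this shape, else None.

    Mirrors the constraint in the error above:
    1 <= n_components <= min(n_samples, n_features).
    """
    n_samples, n_features = shape
    limit = min(n_samples, n_features)
    if n_components < 1 or n_components > limit:
        return "n_components=%d is not between 1 and min(n_samples, n_features)=%d" % (
            n_components,
            limit,
        )
    return None


if __name__ == "__main__":
    # Shape taken from the log above: 2504 samples, 1 overlapping site.
    print(pca_shape_problem((2504, 1), 4))
```

Passing this check only means the fit can run; a few dozen overlapping sites would still be far too few for a meaningful ancestry projection.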

brentp avatar May 18 '23 03:05 brentp

I will check the upstream steps. Thank you for your reply!

ruizgo avatar May 18 '23 03:05 ruizgo