
Data Column Adjusted PCA

Open BradKML opened this issue 1 year ago • 6 comments

Sometimes the base columnar data might not be normally distributed (bell curve) or continuously uniform (quantile-esque).

Guides on Power Transform (Yeo-Johnson vs Box-Cox):

  • https://en.wikipedia.org/wiki/Power_transform
  • https://statisticaloddsandends.wordpress.com/2021/02/19/the-box-cox-and-yeo-johnson-transformations-for-continuous-variables/
  • https://jsmp.dk/posts/2019-08-23-transformingdata/

Library Choice (a usage sketch of the sklearn option follows the list):

  • https://feature-engine.readthedocs.io/en/1.0.x/transformation/YeoJohnsonTransformer.html
  • https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.yeojohnson.html
  • https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html
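
As a minimal usage sketch of the sklearn option from the list above (Yeo-Johnson applied column-wise; the toy data is just a placeholder for a skewed column):

import numpy as np
from sklearn.preprocessing import PowerTransformer

# toy right-skewed data; replace with your own column(s)
rng = np.random.default_rng(0)
X_raw = rng.exponential(scale=2.0, size=(1000, 1))

# Yeo-Johnson also handles zero and negative values, unlike Box-Cox
pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_trans = pt.fit_transform(X_raw)
print(pt.lambdas_)   # fitted lambda per column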

Demo on how scaling helps with PCA visualization: https://towardsdatascience.com/feature-scaling-and-normalisation-in-a-nutshell-5319af86f89b

For non-transforms, Robust Scaling > Normalization > Standardization: https://stats.stackexchange.com/questions/476394/impact-of-different-scaling-methods-on-pca-for-clustering

Q: is Yeo-Johnson good for quantile-esque columns?

Side note: this looks useful? https://github.com/erdogant/distfit

BradKML avatar Sep 11 '22 08:09 BradKML

Currently testing this to see which combinations make a difference

# save_unzip helper taken from https://gist.github.com/regtm/5be7337561215c4e107b393312a17f2e
ace = save_unzip('https://files.catbox.moe/2nik4u.zip')

from pandas import read_csv, concat

# load the Big Five responses and keep the 50 item columns plus the last 3 metadata columns
df = read_csv(ace + '/data-final.csv', sep='\t')
df = concat([df.iloc[:, :50], df.iloc[:, -3:]], axis=1, sort=False).dropna()
X = df.iloc[:, :50]   # the 50 questionnaire items (df.iloc[:, 50] would select only a single column)
y = df.iloc[:, -3:]

from pca import pca
from sklearn.preprocessing import RobustScaler, PowerTransformer, StandardScaler, MaxAbsScaler

fit_set = [RobustScaler, PowerTransformer, StandardScaler, MaxAbsScaler]

# try each scaler on its own and save the explained-variance plot
for i in fit_set:
  model = pca(n_components=0.8, verbose=2)
  results = model.fit_transform(i().fit_transform(X))
  fig, _ = model.plot(title=str(model.results['pcp'] * 100)[0:5]+'% '+i.__name__)
  fig.savefig(i.__name__+'.png')

from itertools import permutations

# try every ordered pair of scalers applied back to back
for i in list(permutations(fit_set, 2)):
  model = pca(n_components=0.8, verbose=2)
  results = model.fit_transform(i[1]().fit_transform(i[0]().fit_transform(X)))
  fig, _ = model.plot(title=str(model.results['pcp'] * 100)[0:5]+' '+i[0].__name__+' '+i[1].__name__)
  fig.savefig(i[0].__name__+'_'+i[1].__name__+'.png')

BradKML avatar Sep 28 '22 08:09 BradKML

So far I have found that PowerTransformer followed by RobustScaler (or just RobustScaler on its own) performs best compared to the others: Curves.zip
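
If that combination holds up, chaining the two steps in a sklearn Pipeline keeps the preprocessing reproducible and avoids the nested fit_transform calls; a minimal sketch (assuming X is the item DataFrame loaded above):

from pca import pca
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer, RobustScaler

# Yeo-Johnson power transform first, then robust scaling, in one object
prep = make_pipeline(PowerTransformer(), RobustScaler())
X_prep = prep.fit_transform(X)

model = pca(n_components=0.8, verbose=2)
results = model.fit_transform(X_prep)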

Questions:

  1. How can these methods be compared against one another for usefulness/accuracy? By comparing them against each other with a static N, or by comparing the N needed against a minimum cumulative variance? (a rough sketch follows this list)
  2. Would automated feature engineering produce similar speedups? If so, how would one benchmark it across tables with different column counts? https://github.com/feature-engine/feature_engine
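
For question 1, a rough sketch (using sklearn's PCA directly rather than the pca package, just to count components): fix a cumulative explained-variance target and see how many components each scaler needs to reach it; fewer components for the same target suggests a more compact representation.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler, PowerTransformer, StandardScaler, MaxAbsScaler

target = 0.80   # cumulative explained-variance target

for scaler in [RobustScaler, PowerTransformer, StandardScaler, MaxAbsScaler]:
    X_s = scaler().fit_transform(X)                            # X = the 50 item columns from above
    evr = PCA().fit(X_s).explained_variance_ratio_
    n_needed = int(np.argmax(np.cumsum(evr) >= target)) + 1    # first index reaching the target
    print(scaler.__name__, n_needed, 'components for', target, 'cumulative variance')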

BradKML avatar Sep 28 '22 13:09 BradKML

True. Normalizing your data can be very beneficial for the end result. However, it is not always easy to describe (or show) what the "best" normalization is. But you can get a feeling for it.

Let me share my thoughts about this. An idea could be to transform your data in such a manner that it most resembles the normal distribution. In that case you can indeed use the distfit library to find the normalization strategy that leads to a distribution that best fits the normal distribution. However, when you push your data too hard (let's say by chaining multiple normalizations after each other), you may even lose information.
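
As a rough sketch of that idea (the distfit result keys 'name' and 'score' are taken from its docs, so treat them as an assumption): fit distributions to one item column before and after a Yeo-Johnson transform and check whether the normal distribution fits better afterwards.

from distfit import distfit
from sklearn.preprocessing import PowerTransformer

x = X['EXT1'].values                                      # one item column from the data loaded below

# restrict to a few candidate distributions, including the normal
dfit_raw = distfit(distr=['norm', 'lognorm', 'uniform'])
dfit_raw.fit_transform(x)
print(dfit_raw.model['name'], dfit_raw.model['score'])    # best-fitting distribution + goodness-of-fit (RSS)

# Yeo-Johnson transform, then fit again
x_yj = PowerTransformer().fit_transform(x.reshape(-1, 1)).ravel()
dfit_yj = distfit(distr=['norm', 'lognorm', 'uniform'])
dfit_yj.fit_transform(x_yj)
print(dfit_yj.model['name'], dfit_yj.model['score'])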

I would proceed as follows:

This data contains solely integers. I will assume that these variables are measurements and not categorical.

from pandas import read_csv

df = read_csv('data-final.csv', sep='\t')
df = df.iloc[0:10000, :50]   # first 10,000 responses, 50 item columns
X = df.dropna()


     EXT1  EXT2  EXT3  EXT4  EXT5  EXT6  ...  OPN5  OPN6  OPN7  OPN8  OPN9  OPN10
0      4.0   1.0   5.0   2.0   5.0   1.0  ...   4.0   1.0   5.0   3.0   4.0    5.0
1      3.0   5.0   3.0   4.0   3.0   3.0  ...   3.0   1.0   4.0   2.0   5.0    3.0
2      2.0   3.0   4.0   4.0   3.0   2.0  ...   4.0   2.0   5.0   3.0   4.0    4.0
3      2.0   2.0   2.0   3.0   4.0   2.0  ...   3.0   1.0   4.0   4.0   3.0    3.0
4      3.0   3.0   3.0   3.0   5.0   3.0  ...   5.0   1.0   5.0   3.0   5.0    5.0
   ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...    ...
9995   2.0   3.0   2.0   3.0   2.0   4.0  ...   4.0   4.0   4.0   1.0   3.0    3.0
9996   2.0   3.0   3.0   4.0   4.0   2.0  ...   2.0   3.0   4.0   1.0   4.0    4.0
9997   4.0   4.0   3.0   3.0   3.0   3.0  ...   4.0   2.0   4.0   3.0   3.0    4.0
9998   2.0   2.0   2.0   3.0   3.0   3.0  ...   2.0   2.0   3.0   3.0   4.0    4.0
9999   1.0   1.0   3.0   4.0   2.0   4.0  ...   4.0   1.0   5.0   4.0   4.0    4.0

[10000 rows x 50 columns]

Let's have a look at the distribution of these values.

import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10)); plt.boxplot(X.values)
plt.figure(); plt.hist(X.values.ravel(), bins=5)

image image

So, there are no big outliers, higher values appear more often, and it is not normally distributed.

Let's start with the StandardScaler because of its simplicity, scaling the data per feature (over the samples). This decision depends, among other things, on the aim and the data. In this manner we preserve the underlying distribution across the samples. We should expect to see an average close to zero and a standard deviation of one per feature.

from sklearn.preprocessing import StandardScaler

X_norm = StandardScaler().fit_transform(X)

# sanity check: mean ~0 and std ~1 per feature
X_norm.mean(axis=0)
X_norm.std(axis=0)

plt.figure(figsize=(20, 10)); plt.boxplot(X_norm)
plt.figure(); plt.hist(X_norm.ravel(), bins=20)

image image

This looks good. I do not see huge skewness. Maybe the data is a bit bimodal but that can be underlying (hidden) structure (?)

model = pca(n_components=0.95, verbose=2)   # assumed: 95% explained-variance target, matching the flameplot label below
results = model.fit_transform(X_norm)
fig, _ = model.plot(title=str(model.results['pcp'] * 100)[0:5]+'% StandardScaler')
model.scatter(legend=False)

image image

The question still is: "how well does this represent the original space?" Let's make some plots. Based on the explained variance, there does not seem to be a lot of variation in the data. We can see that in the scatterplot too. To get a bit of a feeling for how good this is, I would expect to see some known groupings, and hopefully some new ones.

We can also quantify how well the PCA mapping preserves the high-dimensional space. The flameplot library can help quantify this. Check out this blog for more information. This may also answer one of your questions.

pip install flameplot

import numpy as np
import flameplot

scores = flameplot.compare(X_norm, results['PC'], n_steps=25)
fig = flameplot.plot(scores, xlabel='X normalized', ylabel='PCA (95% expl.var)')
np.mean(scores['scores'])

image

If the results are off, then maybe you need a different normalization approach, or you can set with_std=False, because standardizing the spread can sometimes skew the distribution too much and lose information. Or maybe you need sample-wise normalization, or one-hot encoding. Or maybe there is simply no structure in the data.
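
A minimal sketch of those alternatives (centering only, sample-wise normalization, and one-hot encoding of the discrete answers), assuming X is the item DataFrame from above:

from sklearn.preprocessing import StandardScaler, Normalizer, OneHotEncoder

# center each feature but keep its original spread
X_centered = StandardScaler(with_std=False).fit_transform(X)

# sample-wise normalization: scale each row (respondent) to unit norm
X_samplewise = Normalizer().fit_transform(X)

# treat the 1-5 answers as categories and one-hot encode them
X_onehot = OneHotEncoder().fit_transform(X).toarray()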

Note that I do not know what the variables in the input dataset exactly mean. The decisions I made may therefore be incorrect.

erdogant avatar Sep 29 '22 15:09 erdogant

For reference, this is the personality dataset, and these are the 50 questions regarding oneself (Likert scale, which tends to follow a skewed bell curve). Some of the items should be strongly correlated, but the inter-cluster correlations may be fuzzy. The original idea is to reduce this into clusters of "major personalities" rather than "personality meta-factors" (the three-letter acronyms are well-researched meta-factors), since some questions may have extra characteristics.

Also, nice blog on the comparison; having PHATE, TriMap, PaCMAP, LargeVis, TOPAOE, ATSNE, UMATO, and others would make a good primer on local vs. global structure preservation.

P.S. I might consider other good datasets as examples of some of these issues: multi-modal non-normal distributions, noisy distributions, de-correlation of columns, "distribution" based on ranking (e.g. HDI nation rankings), and distance ranking vs. relative distance consistency in dimensionality reduction.

BradKML avatar Oct 02 '22 02:10 BradKML

ok! This one on kaggle?

erdogant avatar Oct 02 '22 08:10 erdogant

Yes, and one thing special about this is that most survey data:

  • are somewhat correlated at the facet level (correlations between dimensional cluster aggregates exist, e.g. Plasticity and Stability)
  • can often be correlated at the item level (correlations between questions of different clusters exist, e.g. Correlational Heat Map; see the sketch at the end of this comment)
  • are difficult to cluster with regard to data points (t-SNE attempt and YellowBrick)
  • have discrete values in each column instead of continuous ones, which makes normalization and scaling difficult

If one wants to try these techniques on other similar datasets, see OpenPsychometrics at http://openpsychometrics.org/_rawdata/; or, if something can be done between two sets of inventories, try https://www.kaggle.com/datasets/mathurinache/machivallianism-test or https://www.kaggle.com/lucasgreenwell/datasets
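
Since the items are discrete 1-5 answers, a rank-based (Spearman) correlation heat map is one way to look at the item-level correlations without assuming continuous, normally distributed data; a minimal sketch over the 50 item columns (the plotting details are just illustrative):

import matplotlib.pyplot as plt
from pandas import read_csv

df = read_csv('data-final.csv', sep='\t')
items = df.iloc[:, :50].dropna()          # the 50 questionnaire items

corr = items.corr(method='spearman')      # rank-based correlation, suited to ordinal answers

plt.figure(figsize=(12, 10))
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90, fontsize=6)
plt.yticks(range(len(corr.columns)), corr.columns, fontsize=6)
plt.title('Item-level Spearman correlations')
plt.show()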

BradKML avatar Oct 03 '22 02:10 BradKML