MHKiT-Python
Sea states clustering GM vs KM
I was working with the PacWave_resource_characterization_example, but tried using sklearn.cluster.KMeans instead of sklearn.mixture.GaussianMixture to cluster the data. Note that KMeans requires normalization of the data prior to clustering.
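To be explicit about the normalization point: KMeans is Euclidean-distance based, so Hm0 (a few meters) and Te (tens of seconds) need to be brought to a comparable scale before fitting, whereas GaussianMixture fits a covariance matrix per component and can handle the raw units. A minimal sketch with synthetic data (the variable names and the min-max scaling are illustrative, not taken from the example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic Hm0 [m] and Te [s] with very different scales
rng = np.random.default_rng(0)
Hm0 = rng.gamma(2.0, 1.0, 1000)
Te = 6.0 + 4.0 * rng.random(1000)
X = np.column_stack((Te, Hm0))

# GaussianMixture fits a covariance per component, so raw units are fine
gmm = GaussianMixture(n_components=8).fit(X)

# KMeans is distance based: scale each feature first (min-max here),
# then scale the cluster centers back to physical units
X_norm = X / X.max(axis=0)
km = KMeans(n_clusters=8, n_init=10, random_state=1).fit(X_norm)
centers = km.cluster_centers_ * X.max(axis=0)
```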
I found that, compared to GM, KMeans requires a smaller number of clusters to converge to the total amount of energy in the data. This is illustrated here as the ratio between the energy of the representative sea states and the total energy (as in the example):
| Nr clusters | ratio GM | ratio KM |
|---|---|---|
| 4 | 0.9099 | 0.9331 |
| 6 | 0.9126 | 0.9679 |
| 8 | 0.9370 | 0.9859 |
| 12 | 0.9657 | 1.0006 |
| 16 | 0.9822 | 1.0078 |
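For reference, the ratio in the table is computed as in the example: the energy flux of each representative sea state, weighted by its cluster weight, scaled by the number of hourly sea states, and divided by the total energy of the full data set. With my notation ($J_i$ the energy flux of cluster $i$, $w_i$ its weight, $N_{hr}$ the number of hourly records, and $J_{total}$ the `Total` variable from the example):

$$\mathrm{ratio} = \frac{N_{hr} \sum_{i=1}^{N} w_i J_i}{J_{total}}$$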
I also sorted the clusters by weight and visualized the weights with the color scale in the plots.
The two clustering algorithms give different results, but I don't know whether we should consider other criteria apart from how well the clusters represent the total amount of energy.
The quick and dirty code to extend the PacWave_resource_characterization_example and reproduce the figure and table is:
```python
# Compute Gaussian Mixture Model and KMeans clusters for each number of clusters
import string

from sklearn.cluster import KMeans

Ns = [4, 6, 8, 12, 16]
X = np.vstack((data_clean.Te.values, data_clean.Hm0.values)).T

# Normalize the data for the KMeans comparison
Hm0_max = data_clean.Hm0.max()
Te_max = data_clean.Te.max()
Hm0_, Te_ = [], []
Hm0_.append(data.Hm0 / Hm0_max)
Te_.append(data.Te / Te_max)
Hm0_ = pd.concat(Hm0_, axis=0)
Te_ = pd.concat(Te_, axis=0)
data_norm = pd.concat([Hm0_, Te_], axis=1).dropna().sort_index()

results = {}
results_km = {}
fig, axs = plt.subplots(len(Ns), 2, figsize=(16, 24), sharex=True)

for N in Ns:
    gmm = GaussianMixture(n_components=N).fit(X)

    # Save centers and weights
    result = pd.DataFrame(gmm.means_, columns=['Te', 'Hm0'])
    result['weights'] = gmm.weights_
    result['Tp'] = result.Te / 0.858
    labels = gmm.predict(X)

    # Sort clusters by weight and map each sample to its cluster's position in the sorted table
    result.sort_values('weights', inplace=True, ascending=False)
    result['labels'] = list(string.ascii_uppercase[0:N])
    idx = result.index
    idx = [int(np.where(idx == j)[0]) for j in np.arange(N)]
    idx = [idx[lab] for lab in labels]
    result.reset_index(drop=True, inplace=True)
    results[N] = result

    # Plot the GM clusters, colored by cluster weight
    i = Ns.index(N)
    gm_sc = axs[i, 0].scatter(data_clean.Te.values, data_clean.Hm0.values,
                              c=result['weights'][idx], s=0.5, cmap='viridis')
    axs[i, 0].plot(result.Te, result.Hm0, 'm+')
    axs[i, 0].title.set_text(f'{N} GM Clusters')
    cbar = fig.colorbar(gm_sc, ticks=result['weights'], label='Weight [ ]')
    cbar.ax.set_yticklabels([result['labels'][k] + ': ' + str(round(result['weights'][k], 3))
                             for k in range(N)])
    for x, y, lbl in zip(result['Te'], result.Hm0, result['labels']):
        axs[i, 0].text(x + 0.1, y + 0.1, lbl)
    plt.setp(axs[i, 0], ylabel='Sig. wave height, $Hm0$ [m]')

    # KMeans comparison (fit on the normalized data)
    km = KMeans(n_clusters=N, random_state=1).fit(data_norm[['Hm0', 'Te']])

    # Un-normalize the cluster centers
    km.cluster_centers_[:, 0] = km.cluster_centers_[:, 0] * Hm0_max
    km.cluster_centers_[:, 1] = km.cluster_centers_[:, 1] * Te_max
    result_km = pd.DataFrame(km.cluster_centers_, columns=['Hm0', 'Te'])
    result_km['weights'] = [(km.labels_ == j).sum() / len(km.labels_) for j in range(N)]
    result_km['Tp'] = result_km.Te / 0.858

    # Sort clusters by weight and map each sample to its cluster's position in the sorted table
    result_km.sort_values('weights', inplace=True, ascending=False)
    result_km['labels'] = list(string.ascii_uppercase[0:N])
    idx_km = result_km.index
    idx_km = [int(np.where(idx_km == j)[0]) for j in np.arange(N)]
    idx_km = [idx_km[lab] for lab in km.labels_]
    result_km.reset_index(drop=True, inplace=True)
    results_km[N] = result_km

    # Plot the KMeans clusters, colored by cluster weight
    km_sc = axs[i, 1].scatter(data_clean.Te, data_clean.Hm0, c=result_km['weights'][idx_km],
                              s=0.5, cmap='viridis', rasterized=True, vmin=0)
    axs[i, 1].scatter(result_km['Te'], result_km['Hm0'], s=40, marker='x', color='w')
    axs[i, 1].title.set_text(f'{N} KMeans Clusters')
    cbar = fig.colorbar(km_sc, ticks=result_km['weights'], label='Weight [ ]')
    cbar.ax.set_yticklabels([result_km['labels'][k] + ': ' + str(round(result_km['weights'][k], 3))
                             for k in range(N)])
    for x, y, lbl in zip(result_km['Te'], result_km.Hm0, result_km['labels']):
        axs[i, 1].text(x + 0.1, y + 0.1, lbl)

plt.setp(axs[len(Ns) - 1, 0], xlabel='Energy Period, $T_e$ [s]')
plt.setp(axs[len(Ns) - 1, 1], xlabel='Energy Period, $T_e$ [s]')

# Compute the energy flux of each representative sea state using a JONSWAP spectrum
w = ndbc_data[year].columns.values
f = w / (2 * np.pi)  # convert angular frequency [rad/s] to frequency [Hz]

for N in results:
    result = results[N]
    J = []
    for i in range(len(result)):
        b = resource.jonswap_spectrum(f, result.Tp[i], result.Hm0[i])
        J.extend([resource.energy_flux(b, h=399.).values[0][0]])
    result['J'] = J
    results[N] = result

for N in results_km:
    result_km = results_km[N]
    J = []
    for i in range(len(result_km)):
        b = resource.jonswap_spectrum(f, result_km.Tp[i], result_km.Hm0[i])
        J.extend([resource.energy_flux(b, h=399.).values[0][0]])
    result_km['J'] = J
    results_km[N] = result_km

# Ratio of the cluster-weighted energy to the total energy (as in the example)
ratios = {}
for N in results:
    J_hr = results[N].J * len(data_clean)
    total_weighted_J = (J_hr * results[N].weights).sum()
    normalized_weighted_J = total_weighted_J / Total
    ratios[N] = np.round(normalized_weighted_J, 4)

ratios_km = {}
for N in results_km:
    J_hr = results_km[N].J * len(data_clean)
    total_weighted_J = (J_hr * results_km[N].weights).sum()
    normalized_weighted_J = total_weighted_J / Total
    ratios_km[N] = np.round(normalized_weighted_J, 4)

(pd.Series(ratios), pd.Series(ratios_km))
```
@ssolson, @ryancoe, @cmichelenstrofer, @jtgrasb just tagging you because at some point we probably all have talked about this.
In my mind, what you want to do is study how your quantities of interest (e.g., AEP) change when you vary the number of clusters and change the algorithm. @ssolson - Didn't you write a paper where you did this? I can't seem to find it.
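As a concrete (and entirely hypothetical) version of that study: with some device power model, each cluster set could be propagated to an annual energy estimate and checked for convergence with N for both algorithms. The power function below is a placeholder, not a real device model, and the helper names are mine:

```python
import numpy as np

HOURS_PER_YEAR = 8766  # average hours per year

def device_power(Hm0, Te):
    """Placeholder power model [kW]; swap in a real power matrix or WEC model."""
    return 50.0 * Hm0**2 * Te / (1.0 + 0.05 * Te**2)

def annual_energy(result):
    """Cluster-weighted annual energy [kWh] from a table with Hm0, Te, and weights."""
    P = device_power(result.Hm0.values, result.Te.values)
    return HOURS_PER_YEAR * np.sum(result.weights.values * P)

# Convergence of the two algorithms with the number of clusters, using the
# `results` and `results_km` dictionaries from the code above:
# aep_gm = {N: annual_energy(results[N]) for N in results}
# aep_km = {N: annual_energy(results_km[N]) for N in results_km}
```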
@ryancoe the study I wrote up stopped at the sea states. I did forward you some preliminary results I found using WecOptTool, although I don't think they will be of much help.
The useful part of the paper I wrote is captured in https://github.com/MHKiT-Software/MHKiT-Python/blob/master/examples/PacWave_resource_characterization_example.ipynb
@dtgaebe that is a cool idea to use an idealized spectrum and then compare the clusters to the original sea state.
Based on what you show here, it looks like it makes sense to, at a minimum, extend the PacWave example to include your comparison using the ratio, and potentially to summarize a couple of steps into a function for MHKiT.
I'm interested to hear your thoughts on next steps for you and perhaps where MHKiT could help with the data processing.
Note also that @cmichelenstrofer is working on something that intersects with this, where he's considering expanding the parameterization of sea states based on machine learning.
@ssolson I think that cool idea came from you, or whoever wrote the PacWave resource assessment example, because the ratio is already in there. I just used the same methodology for the KMeans-clustered data.
Personally, I find it very useful to sort and visualize the clusters by weight, independent of the algorithm, so this might be a nice addition to MHKiT.
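Something like the sketch below is what I have in mind (the function name and signature are just a suggestion, not existing MHKiT API): sort a cluster table by weight, assign letter labels, and return the per-sample positions needed to color a scatter plot by weight.

```python
import string
import numpy as np

def sort_clusters_by_weight(result, labels):
    """Sort a cluster table (with a 'weights' column) by descending weight,
    assign letter labels, and remap per-sample cluster labels to the new order.
    Suggested helper, not an existing MHKiT function."""
    result = result.sort_values('weights', ascending=False)
    result['labels'] = list(string.ascii_uppercase[:len(result)])
    # Position of each original cluster index in the sorted table
    order = {old: new for new, old in enumerate(result.index)}
    sample_idx = np.array([order[lab] for lab in labels])
    return result.reset_index(drop=True), sample_idx
```

With the GM results above this would be used as, e.g., `result, idx = sort_clusters_by_weight(result, gmm.predict(X))`, and `result['weights'][idx]` then gives the per-sample colors for the scatter plot.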
I would think that developers would use sea state clusters for gain scheduling. So, as Ryan suggested, it would be interesting to see how the different clusters impact performance and predicted performance, and how sensitive the results are to the number of clusters and the algorithm. We did a study with a sensitivity analysis to bulk spectral parameters here: OCEAN2021. We could potentially do something similar here.
Close #375