DEKM
clustering optimization iteration
Hi, I want to ask: did you really use 140*100 = 14,000 iterations in the clustering optimization step? When applying a similar approach to my own dataset, which comprises over 30,000 texts, the iteration process did not converge to below a 0.1% n_change_assignment. Consequently, the entire algorithm required approximately 3 hours to complete, which, in my opinion, is quite long. Would you be able to provide some insight or clarification?
Thanks.
Oh, and in the paper you mentioned that the minimum n_change_assignment to stop the iteration is 0.1%, but in the code you use 0.5%. So which one is valid?
Did you really use 140*100 = 14,000 iterations in the clustering optimization step?
No, in all of my experiments, I halt at line 123 of DEKM.py. The default setting of '140*100 = 14000 iterations' is borrowed from DEC.
Optimal threshold
Both 0.5% and 0.1% result in comparable clustering performance. The optimal threshold may vary across datasets. I have updated this threshold, but I still employ the same one for all datasets.
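For anyone else reading: below is a minimal, runnable sketch of the stopping rule being discussed, which halts once fewer than `tol` of the samples change cluster assignment between iterations. The random data, the k-means call, and the noisy "update" step are stand-ins for illustration, not code from DEKM.py.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the early-stopping rule: stop the clustering optimization once
# fewer than `tol` of the samples change their cluster assignment between
# iterations. The data and the noisy "update" step below are placeholders
# for the real embeddings and network update.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))       # stand-in for the learned embeddings h
tol = 0.005                           # 0.5%; the paper mentions 0.1% (tol = 0.001)
max_iters = 140 * 100                 # upper bound borrowed from DEC

y_pred_last = None
for it in range(max_iters):
    y_pred = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
    if y_pred_last is not None:
        n_change_assignment = np.sum(y_pred != y_pred_last)
        if n_change_assignment / X.shape[0] < tol:
            print(f"stopped after {it} iterations")
            break
    y_pred_last = y_pred
    X = X + 0.01 * rng.normal(size=X.shape)   # placeholder for the network update
```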
ahh I see, thanks!
By the way, do you have a formula for the gradient of $L_4$ with respect to $h$? I'm asking because I'm a mathematics student and I want to use your DEKM method for my undergraduate thesis.
Also, I want to make sure: does y - y' result in a vector with the same dimension as y, but with zero values in all dimensions except the last one? Since y' is a replica of y, except that its last dimension comes from m_i, right?
Thanks!
does y - y' result in a vector with the same dimension as y, but with zero values in all dimensions except the last one?
Yes, it is.
I mean this $L_4$:
Because $\mathbf{y}^\prime$ is a constant, we have $$\frac{\partial L_4}{\partial \mathbf{h}}=\frac{\partial L_4}{\partial \mathbf{y}}\frac{\partial \mathbf{y}}{\partial \mathbf{h}}=\sum_{i=1}^k \sum_{\mathbf{y} \in \mathcal{C}_i}2\mathbf{V}(\mathbf{y}-\mathbf{y}^\prime)$$
Sorry, but I don't get why $y'$ is a constant. Isn't it supposed to be $\frac{\partial L_4}{\partial y}\frac{\partial y}{\partial h} + \frac{\partial L_4}{\partial y'}\frac{\partial y'}{\partial h}$, because $y'$ is a vector that depends on $h$ just like $y$?
In the DEKM context, we interpret $L_4=\sum_i \sum_y ||y-y'||^2$ as a regression task, with $y'$ representing a predetermined constant target value. The objective of regression here is to adjust $y$ to approximate $y'$.
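To make the constant-target interpretation concrete, here is a small numpy illustration; the numbers are made up, and `m_i` stands for the last component of the sample's cluster centroid, as discussed above.

```python
import numpy as np

# Illustration of the constant target y': it is a copy of y whose last
# dimension is replaced by the last component m_i of the sample's cluster
# centroid, so y - y' is zero everywhere except the last dimension.
# The values are made up for illustration.
y = np.array([0.7, -1.2, 0.4])   # transformed embedding of one sample, y = V h
m_i = 0.1                        # last component of this sample's cluster centroid

y_prime = y.copy()
y_prime[-1] = m_i                # treated as a fixed regression target

print(y - y_prime)                 # approx. [0. 0. 0.3]: nonzero only in the last dim
print(np.sum((y - y_prime) ** 2))  # this sample's contribution to L_4 (approx. 0.09)
```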
Ah, I see, thanks. By the way, can you explain step by step how the equation at the top becomes the equation at the bottom? Thanks!
This derivation uses two properties of the trace function: (1) Cyclic property $Tr(ABCD)=Tr(BCDA)$, and (2) trace additivity $\sum_i Tr(A_i)=Tr(\sum_i A_i)$.
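Since the two equations are in an image that isn't reproduced here, the following is only a sketch of the kind of manipulation those two properties enable, assuming the top equation is the within-cluster sum of squares of the transformed embeddings $\mathbf{y}=\mathbf{V}\mathbf{h}$ and the bottom equation is $Tr(\mathbf{V}\mathbf{S}_w\mathbf{V}^T)$, with $\mathbf{S}_w$ the within-cluster scatter matrix:

$$
\begin{aligned}
\sum_{i=1}^{k}\sum_{\mathbf{h}\in\mathcal{C}_i}\left\|\mathbf{V}\mathbf{h}-\mathbf{V}\boldsymbol{\mu}_i\right\|^2
&=\sum_{i=1}^{k}\sum_{\mathbf{h}\in\mathcal{C}_i}Tr\left(\mathbf{V}(\mathbf{h}-\boldsymbol{\mu}_i)(\mathbf{h}-\boldsymbol{\mu}_i)^{T}\mathbf{V}^{T}\right)
&&\text{since }\|\mathbf{a}\|^2=Tr(\mathbf{a}\mathbf{a}^{T})\text{, plus the cyclic property}\\
&=Tr\left(\mathbf{V}\Big(\sum_{i=1}^{k}\sum_{\mathbf{h}\in\mathcal{C}_i}(\mathbf{h}-\boldsymbol{\mu}_i)(\mathbf{h}-\boldsymbol{\mu}_i)^{T}\Big)\mathbf{V}^{T}\right)
&&\text{trace additivity}\\
&=Tr\left(\mathbf{V}\mathbf{S}_w\mathbf{V}^{T}\right),
&&\text{where }\mathbf{S}_w=\sum_{i=1}^{k}\sum_{\mathbf{h}\in\mathcal{C}_i}(\mathbf{h}-\boldsymbol{\mu}_i)(\mathbf{h}-\boldsymbol{\mu}_i)^{T}
\end{aligned}
$$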
Sorry if I'm asking too much, but regarding the Rayleigh-Ritz theorem version that you mentioned in the paper, is it the version in the image? The most closely related theorem I found in the Handbook of Matrices is the one in this image: since $X$ and $A$ in the context of DEKM are $e \times e$ matrices, $Tr(X^T A X) = \lambda_1 + \dots + \lambda_e$, where $X$ is real and $X^T X=I$. Hence, $X=[v_1,\dots,v_e]$.
is it the version in the image?
Yes, it is. Any question is welcome.
I just realized, doesn't the theorem say that each eigenvector is a column vector of $X$ (since it uses the notation $X=[v_1,\dots,v_e]$), not a row vector? Because you stacked the eigenvectors as row vectors for the orthonormal transformation matrix $V$.
In DEKM, we use $Tr(VS_wV^T)$, but not $Tr(V^TS_wV)$. Thus, $V$ consists of the row eigenvectors.
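If it helps, here is a tiny numpy check of that convention; the `S_w` below is just a random symmetric matrix, not one produced by DEKM.

```python
import numpy as np

# Numerical check of the row-vs-column convention: stacking the eigenvectors
# as columns of X gives Tr(X^T S_w X), stacking them as rows of V gives
# Tr(V S_w V^T), and both equal the sum of the eigenvalues.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
S_w = A @ A.T                           # symmetric PSD stand-in for the scatter matrix

eigvals, eigvecs = np.linalg.eigh(S_w)  # eigh returns eigenvectors as columns
X = eigvecs                             # columns are eigenvectors: X = [v_1, ..., v_e]
V = eigvecs.T                           # rows are eigenvectors, as in DEKM's V

print(np.trace(X.T @ S_w @ X))          # column convention
print(np.trace(V @ S_w @ V.T))          # row convention, same value
print(eigvals.sum())                    # sum of the eigenvalues; all three agree
```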
Ohh I see, missed that part😄. Thank you so much!