glosim
How to handle similarity/distance matrix with sketchmap ?
As far as I understand, the glosim tool generates a similarity or distance matrix between molecules or structures, while the sketchmap tool handles high-dimensional input data. How can I feed the similarity or distance matrix into sketchmap? Is there a mode like sklearn.manifold.MDS for handling pre-computed dissimilarities? (sklearn.manifold.MDS can handle them with the dissimilarity='precomputed' option.) Regards,
Hi Yiino, Yes. Use the utility script https://github.com/cosmo-epfl/sketchmap/blob/master/utils/sketch-map.sh . It asks whether your input file is a similarity matrix or not. Provide the distance matrix as the input file and answer yes (there is a problem with the naming convention here: it actually expects a distance matrix). If you are using the dimred executable directly, you can use the --similarity tag to indicate that the input is the distance matrix. Thank you Best Regards Sandip
Thank you, I'll try it.
Does 'dimlandmark' work with similarity matrix input? I tried it as follows:
dimlandmark -similarity -n 1000 -mode minmax -w -lowmem < dist_rematch-564_peratom > output
The output looks strange, since it contains many identical numbers. 'dist_rematch-564_peratom' is a 564x564 matrix. The generated output file contains 1000 data lines, and each line contains 1001 numbers, of which the 4th to the 1000th are the same. Is this the expected behavior? Are the options for the program wrong?
For selecting landmarks from a kernel matrix, you have https://github.com/cosmo-epfl/glosim/blob/master/tools/select_landmarks.py
Thank you Best Regards Sandip
Hi! Sandip, Thank you for your quick response.
Sorry for the frequent questions, but I'm stuck on projecting the whole data set using the high-d and low-d landmarks. Is there anything to note when using landmark files generated from a similarity matrix?
dimproj -P dist-landmark100.k -p dist100.gmds -similarity -fun-hd 0.05,3,2 -fun-ld 0.05,1,1 < dist_rematch-564_peratom > dist_rematch-564_peratom.lowd

Error in main: HD and LD point list mismatch
'dist100.gmds' holds the low-d landmarks, while 'dist-landmark100.k' holds the high-d landmarks. The former contains 100 lines with 3 columns each (comment lines removed). The latter is generated with the 'select_landmarks.py' utility and contains 100x100 numbers. 'dist_rematch-564_peratom' is the similarity matrix, which contains 564x564 numbers.
'select_landmarks.py' is invoked as follows:
python select_landmarks.py --mode fps --output kernel --nland 100 --prefix dist dist_rematch-564_peratom
I get the 'dist-landmark100.k' file, and then I can get the low-dimensional representation using 'sketch-map.sh':
./sketch-map.sh
Please enter the dimensionality of input data 100
Are we reading the similarity matrix? y
Please enter the input data file name dist-landmark100.k
Please enter the output data prefix dist100
Please enter high dimension sigma, a, b [e.g. 6.0 2 6 ] 0.05 3 2
Please enter low dimension sigma, a, b [e.g. 6.0 2 6 ] 0.05 1 1
regard, Y.Iino
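A note on the `--mode fps` option used above: it presumably stands for farthest point sampling, which picks each new landmark as the point farthest from all landmarks chosen so far. The following is only a minimal sketch of that general technique on a precomputed distance matrix; it is not the code of select_landmarks.py, and the function names and the toy data are hypothetical.

```python
def farthest_point_sampling(dist, n_landmarks, start=0):
    """Pick landmark indices from a square distance matrix (list of lists)."""
    n = len(dist)
    if not 0 < n_landmarks <= n:
        # One cannot select more landmarks than there are points.
        raise ValueError("n_landmarks must be between 1 and the number of points")
    selected = [start]
    # min_dist[i] = distance from point i to its nearest selected landmark
    min_dist = list(dist[start])
    while len(selected) < n_landmarks:
        # The next landmark is the point farthest from the current landmark set.
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(a, b) for a, b in zip(min_dist, dist[nxt])]
    return selected


def submatrix(dist, idx):
    """Reduced landmark-by-landmark distance matrix."""
    return [[dist[i][j] for j in idx] for i in idx]


# Toy data (hypothetical): four points on a line at x = 0, 1, 2, 10.
xs = [0.0, 1.0, 2.0, 10.0]
toy_dist = [[abs(a - b) for b in xs] for a in xs]
landmarks = farthest_point_sampling(toy_dist, 3)
print(landmarks)                       # [0, 3, 2]
print(submatrix(toy_dist, landmarks))
```

The sketch refuses a landmark count larger than the number of points, since the selection is a subset of the input rows.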
Hi, I mentioned in one of my first comments that although sketchmap asks for a similarity matrix, it actually needs a distance (dissimilarity) matrix. It is easy to check what you have by looking at the diagonal of the matrix: for a distance matrix, all diagonal elements should be zero.
So when using the landmark selection Python code you need to use --output distance. The naming convention is that .K files are similarity kernels and .Sim files are distance matrices (I know this is a bit misleading, but it is due to an early naming convention).
Now, regarding your HD/LD mismatch problem: I believe you have a comment line in the high-d landmark matrix, which needs to be removed for this part. Also, just remove the 3rd column of the .gmds file. But beware that you need to do all of this with the distance matrix and not the kernel.
Best regards Sandip
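The diagonal check described above can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, the file is assumed to be whitespace-separated text with optional comment lines starting with '#', and the "similarity" branch assumes a normalized kernel with ones on the diagonal.

```python
def load_matrix(text):
    """Parse a whitespace-separated matrix, skipping blank and '#' lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rows.append([float(x) for x in line.split()])
    return rows


def classify_matrix(rows, tol=1e-8):
    """Guess the matrix type from its diagonal."""
    n = len(rows)
    if any(len(r) != n for r in rows):
        raise ValueError("matrix is not square")
    diag = [rows[i][i] for i in range(n)]
    if all(abs(d) <= tol for d in diag):
        return "distance"      # zero diagonal: distance (dissimilarity) matrix
    if all(abs(d - 1.0) <= tol for d in diag):
        return "similarity"    # unit diagonal: normalized kernel (assumption)
    return "unknown"


# Toy inputs (hypothetical); real files would be read with open(path).read().
example_distance = "# comment line\n0.0 0.3\n0.3 0.0\n"
example_kernel = "1.0 0.955\n0.955 1.0\n"
print(classify_matrix(load_matrix(example_distance)))  # distance
print(classify_matrix(load_matrix(example_kernel)))    # similarity
```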
Hi! Sandip, Thank you for your kind advice. Finally, I think I can get the sketch-map image. As you pointed out, I was confusing similarity (distance) with kernel.
For selecting landmarks, the input is kernel data ("sim_rematch-564_peratom.ssv" is kernel data).
$ python select_landmarks.py --mode fps --output distance --nland 100 --prefix dist sim_rematch-564_peratom.ssv
This generates "dist-landmark100.sim" (the reduced distance matrix) and "dist-landmark100-OOS.sim" (the coordinates of all points based on the coordinate system of the landmarks).
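How the kernel is turned into a distance is not spelled out in this thread. A common choice is the kernel-induced distance d_ij = sqrt(k_ii + k_jj - 2 k_ij), which has a zero diagonal by construction. Whether select_landmarks.py uses exactly this formula is an assumption; the sketch below only illustrates the general relation, with a hypothetical function name and toy values.

```python
import math


def kernel_to_distance(kernel):
    """Kernel-induced distance: d_ij = sqrt(k_ii + k_jj - 2 * k_ij)."""
    n = len(kernel)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = kernel[i][i] + kernel[j][j] - 2.0 * kernel[i][j]
            # Clamp tiny negative values caused by floating-point round-off.
            dist[i][j] = math.sqrt(max(d2, 0.0))
    return dist


# Toy normalized kernel (hypothetical values).
toy_kernel = [[1.0, 0.5],
              [0.5, 1.0]]
toy_distance = kernel_to_distance(toy_kernel)
print(toy_distance)  # [[0.0, 1.0], [1.0, 0.0]]
```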
For dimensionality reduction, "sketch-map.sh" takes "dist-landmark100.sim" as input.
$ ./sketch-map.sh
Please enter the dimensionality of input data 100
Are we reading the similarity matrix? y
Please enter the input data file name dist-landmark100.sim
Please enter the output data prefix dist100
Please enter high dimension sigma, a, b [e.g. 6.0 2 6 ] 0.05 3 2
Please enter low dimension sigma, a, b [e.g. 6.0 2 6 ] 0.05 1 1
The projection of all points is done as follows. The key point seems to be using "dist-landmark100-OOS.sim" as input. (Is that right?) Also, it is necessary to remove the comment lines and the third column from the file "dist100.gmds" generated by sketch-map.sh.
$ dimproj -D 100 -d 2 -P dist-landmark100.sim -p lowd-landmarks_ -similarity -fun-hd 0.05,3,2 -fun-ld 0.05,1,1 -cgmin 3 < dist-landmark100-OOS.sim > dist-landmark100-OOS.sim.lowd
Thank you. yiino
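For later readers, the clean-up of the low-d landmark file mentioned in this thread (dropping comment lines and the third column) could be scripted roughly as follows. This is a sketch under assumptions: comment lines are taken to start with '#', the helper names are hypothetical, and the sample text is made up rather than real sketch-map output.

```python
def clean_lowd_landmarks(text, n_dims=2):
    """Drop comment/blank lines and keep only the first n_dims columns."""
    cleaned = []
    for line in text.splitlines():
        s = line.strip()
        if not s or s.startswith("#"):
            continue
        fields = s.split()
        if len(fields) < n_dims:
            raise ValueError("too few columns in line: " + s)
        cleaned.append(" ".join(fields[:n_dims]))
    return cleaned


def count_data_rows(text):
    """Number of non-comment, non-blank lines (e.g. rows of the HD landmark file)."""
    return sum(1 for line in text.splitlines()
               if line.strip() and not line.strip().startswith("#"))


# Made-up sample standing in for a .gmds file with a third column.
sample_gmds = "# header comment\n0.1 0.2 0.9\n-0.3 0.4 0.8\n"
sample_hd = "# header comment\n0.0 0.5\n0.5 0.0\n"
lowd = clean_lowd_landmarks(sample_gmds)
print(lowd)  # ['0.1 0.2', '-0.3 0.4']
# The HD and LD landmark lists should have the same number of points.
print(len(lowd) == count_data_rows(sample_hd))  # True
```

Comparing the two row counts before running the projection is a cheap way to catch a leftover comment line or a wrong file.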