RecursiveHierarchicalClustering

Just to verify my test result

Open ywu-stats opened this issue 4 years ago • 14 comments

Hi,

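I was finally able to run it with the sample data. I just wanted to verify that my result looks as expected (see the attached visualization). Very cool visualization, by the way! [image: test] https://user-images.githubusercontent.com/56888960/67532216-54c2b400-f67a-11e9-9525-4df95579f4a2.png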

ywu-stats avatar Oct 24 '19 23:10 ywu-stats

Yes, the results look the way they are supposed to! Glad you figured it out by yourself. Nice work!

xychang avatar Oct 24 '19 23:10 xychang

Now you can tell how badly I wanted to make use of this. I'm now trying to figure out, for each cluster, which userids are in it and which top n-grams are shared within it. Is that something I can easily get from result.json, or do I have to print it out in an earlier step?

ywu-stats avatar Oct 28 '19 19:10 ywu-stats

It's actually fairly easy to get the userids for each cluster. If you look at visulization.py, you will see a function called allUser, which takes a tree or sub-tree and returns all the users in it. For ways to traverse the tree structure, look at lines 59-79 of visulization.py.

As for the n-grams, look at lines 90-91 of visulization.py.

You might need a bit of Python knowledge to modify the code to suit your own needs, but it should be fairly straightforward.
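
As a rough illustration of the traversal idea (this is not the repository's code), the sketch below walks a nested tree and collects the users under every node. The node layout is an assumption made up for the example (dicts with hypothetical `children` and `users` keys), and `all_users` is only a stand-in for the `allUser` function mentioned above; the real structure of result.json should be checked in visulization.py.

```python
# Hypothetical tree layout, NOT the actual result.json schema:
#   internal node: {"children": [node, ...]}
#   leaf node:     {"users": [user_id, ...]}

def all_users(node):
    """Return every user id found under this node (stand-in for allUser)."""
    if "children" not in node:
        return list(node["users"])
    users = []
    for child in node["children"]:
        users.extend(all_users(child))
    return users


def users_per_cluster(node, path="root", out=None):
    """Map a path-like label for each node to the users beneath it."""
    if out is None:
        out = {}
    out[path] = all_users(node)
    for index, child in enumerate(node.get("children", [])):
        users_per_cluster(child, f"{path}/{index}", out)
    return out


# Tiny made-up tree, for illustration only.
tree = {
    "children": [
        {"users": ["u1", "u2"]},
        {"children": [{"users": ["u3"]}, {"users": ["u4", "u5"]}]},
    ]
}

clusters = users_per_cluster(tree)
for label, members in clusters.items():
    print(label, members)
```

Labeling each node by its path is just one convenient choice here; any identifier that is unique per cluster would do.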

xychang avatar Oct 29 '19 06:10 xychang

Thank you for the information! I am indeed learning Python at the moment :) Another question: where can I change the length of the n-grams? I remember your publication mentioned 5, but I only see one action per feature in the different test cases I have.

ywu-stats avatar Oct 29 '19 06:10 ywu-stats

Yes, you can indeed use any length you want. It's just part of the preprocessing, which is not included in this code, so you will need to write some preprocessing code yourself. The link below has a more detailed description: https://github.com/xychang/RecursiveHierarchicalClustering#frequently-asked-questions

xychang avatar Oct 29 '19 06:10 xychang

Hmm, so it seems the n-grams are part of the predefined input structure? Then I do want to clarify something about the methodology in the publication. My understanding was that the whole feature space is the union of all possible n-grams, and the values are the counts of each n-gram appearing in the whole path, at the userid level.

For example, for the path ABCDEFG with N=3, I should look at features = {ABC, BCD, CDE, DEF, EFG}, right? So you are saying that the n-gram extraction is part of the data processing step, and ABC etc. are predefined in the input data. I'm confused about how I should format my input. Is it ABC()BCD()CDE()...? Or do I only need to modify the way the line is split for sid_seq?

ywu-stats avatar Oct 29 '19 17:10 ywu-stats

Yes, the input format should be ABC()BCD()CDE(). This is because this GitHub repo is intended to be more general-purpose than what is described in the paper. Hope this answers your question!
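
A minimal preprocessing sketch for this step (not part of the repository): it slides a window of length n over a path and writes each n-gram followed by a value in parentheses. That the parentheses hold the n-gram's count is an assumption, based on the count-per-n-gram description above and the `A(5)B(7)` example elsewhere in this thread; the function names are made up for illustration.

```python
from collections import Counter


def sliding_ngrams(actions, n):
    """Return all contiguous n-grams of a list of actions, in order."""
    return ["".join(actions[i:i + n]) for i in range(len(actions) - n + 1)]


def format_features(actions, n):
    """Count each n-gram and render it as NGRAM(count), concatenated."""
    counts = Counter(sliding_ngrams(actions, n))
    # Counter keeps first-seen order, so the output follows the path order.
    return "".join(f"{gram}({count})" for gram, count in counts.items())


path = list("ABCDEFG")
grams = sliding_ngrams(path, 3)
line = format_features(path, 3)
print(grams)  # ['ABC', 'BCD', 'CDE', 'DEF', 'EFG']
print(line)   # ABC(1)BCD(1)CDE(1)DEF(1)EFG(1)
```

If actions are longer than one character, pass them as a list of strings rather than splitting a single string, and consider a separator inside each n-gram so that the boundaries stay unambiguous.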

xychang avatar Oct 29 '19 21:10 xychang

I see, thanks!

ywu-stats avatar Oct 29 '19 21:10 ywu-stats

Hi @xychang, the paper mentions 5 as the optimal k value for creating k-grams. If I understand correctly, this is how actions are defined from the repo's point of view:

Ex: A(5)B(7)

A => S1(g3)S2(g1)S3(g2)S4(g1)S2(g2)

B => S3(g1)S3(g1)S3(g2)S8(g1)S6(g1)

akhildevelops avatar Nov 20 '19 07:11 akhildevelops

Hi @Enforcer007, when we say 5-gram, it actually includes the time gaps. Following your example:

A => S1(g3)S2(g1)S3

B => S3(g1)S3(g1)S3

xychang avatar Nov 20 '19 07:11 xychang

Hi @xychang

Thanks for responding. I have 2 questions:

Q1: Suppose we go for 3-grams and the click stream is: Sequence = S1g1S2g2S1g1S3g1S4g2S2g3S4g1S1

Then what would T3(Sequence) be?

T3(Sequence) = {(S1,g1,S2),(g1,S2,g2),(S2,g2,S1),......}

OR

T3(Sequence) = {(S1,g1,S2),(S2,g2,S1),(S1,g1,S3),......}

Q2: You say it's a 5-gram, but I see a 3-gram pattern in the visualisation. Can you please explain?

[image: doubt] https://user-images.githubusercontent.com/6951100/69217376-8e60df00-0b94-11ea-8db9-85246448de06.png

Thanks

akhildevelops avatar Nov 20 '19 07:11 akhildevelops

So, in our implementation, we actually included both 3-grams and 5-grams. We found it to be helpful in practice.

xychang avatar Nov 20 '19 07:11 xychang

OK, that's great. Can you please confirm Q1?

Thanks

akhildevelops avatar Nov 20 '19 07:11 akhildevelops

For Q1, the answer would be the latter.
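
To make "the latter" concrete, here is a small sketch (not the repository's code) that splits the click stream from Q1 into state and gap tokens and builds k-grams that always start on a state, so the window advances two tokens at a time. The token pattern (`S` or `g` followed by digits) and the function names are assumptions made for this example.

```python
import re


def tokenize(stream):
    """Split a stream like 'S1g1S2' into tokens ['S1', 'g1', 'S2']."""
    return re.findall(r"[Sg]\d+", stream)


def state_anchored_kgrams(tokens, k):
    """Build k-grams that start (and, for odd k, end) on a state token.

    Tokens alternate state, gap, state, ..., so stepping by 2 skips
    windows that would begin on a time gap.
    """
    return [tuple(tokens[i:i + k]) for i in range(0, len(tokens) - k + 1, 2)]


sequence = "S1g1S2g2S1g1S3g1S4g2S2g3S4g1S1"
tokens = tokenize(sequence)
t3 = state_anchored_kgrams(tokens, 3)
t5 = state_anchored_kgrams(tokens, 5)

print(t3[:3])  # [('S1', 'g1', 'S2'), ('S2', 'g2', 'S1'), ('S1', 'g1', 'S3')]
print(t5[0])   # ('S1', 'g1', 'S2', 'g2', 'S1')
```

Calling the same function with k=3 and k=5 and merging the results is one way to obtain both pattern lengths mentioned earlier in the thread.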

xychang avatar Nov 20 '19 07:11 xychang