RecursiveHierarchicalClustering

Failed to reproduce the results with sample data 'input.txt' provided

Open AnthonyruihChen opened this issue 2 years ago • 0 comments

Hi xychang,

Thank you for sharing your great work! I would greatly appreciate it if you could help me resolve the issue below.

I first tried the CLI interface and was able to generate 'results.json' and 'vis.json'. However, I could not open http://localhost:8000/multi_color.html?json=vis.json, so I decided to give the Python interface a try.

I am using the below code and parameter configuration to reproduce 'results.json' and 'vis.json'.

import recursiveHierarchicalClustering as rhc
import recursiveHierarchicalClusteringFast as rhcFast
data = rhc.getSidNgramMap(inputPath)
treeData = rhcFast.run(inputPath, data, outPath)

environment: Jupyter Notebook

inputPath: I added your input.txt file to one directory and set inputPath = '/home/chenruihao/test_clustering/input.txt'

outPath: I didn't find a description of outPath, but I did find outputPath, which is described as "The directory to place all temporary files as well as the final result." I assume outPath and outputPath both refer to the directory that stores the output files, so I set outPath = '/home/chenruihao/test_clustering/output/'

I got the error below when I tried to run the code above:

/home/chenruihao/test_clustering/recursiveHierarchicalClustering.py:247: FutureWarning: `rcond` parameter will change to the default of machine precision times ``max(M, N)`` where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass `rcond=None`, to keep using the old, explicitly pass `rcond=-1`.
  result = np.linalg.lstsq(A, y)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-57-3760f95de317> in <module>
----> 1 treeData = rhcFast.run(inputPath, data, outPath)

~/test_clustering/recursiveHierarchicalClusteringFast.py in run(ngramPath, sid_seq, outPath)
    416 
    417     hc = HCClustering(
--> 418         matrix, sid_seq, outPath, [], idxToSid,
    419         sizeThreshold=0.05 * len(sid_seq), idfMap=idfMap)
    420     result = hc.runDiana()

~/test_clustering/recursiveHierarchicalClusteringFast.py in runDiana(self)
    337                     matrix = calculateDistance.partialMatrix(
    338                         sids,
--> 339                         rhc.excludeFeatures(rhc.getIdf(self.sid_seq, sids),
    340                                             newExclusions),
    341                         ngramPath,

NameError: name 'ngramPath' is not defined

Q1: How can I fix this error? What I tried so far: 'ngramPath' is referenced in 'recursiveHierarchicalClusteringFast.py', so I hard-coded it as follows:

  1. It looks like ngramPath in the run function is the same as sys.argv[1], and by definition ngramPath is the path to the computed pattern dataset, so I hard-coded ngramPath = '/home/chenruihao/test_clustering/input.txt', the same as inputPath, but I still got the error above. I would love to hear your thoughts.
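One possible reason the hard-coded value has no effect, offered as an assumption based only on the traceback: runDiana refers to ngramPath as a free name, so Python looks it up in the globals of the module where runDiana is defined, not in the notebook namespace and not in the local variables of run. The sketch below reproduces that behaviour with a made-up stand-in module (`fake_rhc_fast` and `run_diana` are hypothetical names, not part of the real package).

```python
import types

# Hypothetical stand-in for recursiveHierarchicalClusteringFast:
# a module whose function reads `ngramPath` as a module-level global.
fake = types.ModuleType("fake_rhc_fast")
exec(
    "def run_diana():\n"
    "    return ngramPath\n",
    fake.__dict__,
)

# Defining the name in the caller's namespace (e.g. a notebook cell)
# does not make it visible inside the other module.
ngramPath = "/tmp/input.txt"  # placeholder path for illustration

try:
    fake.run_diana()
    outcome = "ok"
except NameError as err:
    outcome = f"NameError: {err}"
print(outcome)

# Setting the attribute on the module itself puts the name into the
# globals that run_diana actually consults.
fake.ngramPath = ngramPath
resolved = fake.run_diana()
print(resolved)
```

If the same mechanism applies here, setting `rhcFast.ngramPath = inputPath` before calling `rhcFast.run(inputPath, data, outPath)` might be worth trying; this has not been checked against the real module.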

Q2: I also want to understand which user_ids were clustered, their cluster membership, and their corresponding action-gap-action patterns, similar to the issue discussed in another thread. Would it be possible to answer my question, as well as the question in that thread, using only the result.json file rather than modifying the code?

My understanding is that, in result.json, for each level of the cluster hierarchy:

  • key = 1 stores the user_ids that were clustered at that level;
  • key = 2, exclusions, stores the action-gap-action/token members of the cluster.
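One way to check this reading without touching the clustering code is to print the nesting of the file and compare it with the interpretation above. This is a minimal sketch that assumes only that result.json is valid JSON; the `describe` helper and the sample data are hypothetical and do not reflect the real output format.

```python
import json


def describe(node, depth=0, max_depth=3):
    """Return lines summarizing the type and size of each nested level."""
    indent = "  " * depth
    if isinstance(node, dict):
        lines = [f"{indent}dict with keys {sorted(node)}"]
        children = list(node.values())
    elif isinstance(node, list):
        lines = [f"{indent}list of {len(node)} items"]
        children = node
    else:
        # Leaf value: report only its type.
        return [f"{indent}{type(node).__name__}"]
    if depth < max_depth:
        for child in children:
            lines.extend(describe(child, depth + 1, max_depth))
    return lines


# Made-up sample standing in for the real file contents.
sample = json.loads('["t", [1, 2], {"exclusions": ["a"]}]')
summary = describe(sample)
print("\n".join(summary))

# For the real file, something like the following could be used:
# with open("result.json") as f:
#     print("\n".join(describe(json.load(f))))
```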

Thanks! Anthony

AnthonyruihChen, Aug 18 '21 19:08