
De-duplication of nodes via similarity & community detection

Open cassinius opened this issue 5 years ago • 0 comments

Use something along the lines of the ingredient de-duplication pipeline demonstrated in a neo4j tutorial that uses the BBC goodfood ingredients.

Pipeline

  • [ ] download the goodfood dataset (scraping required!) - or something equivalent
  • [ ] standard NLP preprocessing steps
    • [ ] character encodings
    • [ ] tokenization
    • [ ] stemming (plurals)
    • [ ] stopword removal / minimum token length etc.
  • [ ] connect tokens to an ingredient
    • ingredient: cherry tomato => parts: cherry and tomato
  • [ ] Use string distance to create similarity edges
    • [ ] sorensenDiceSimilarity ??
  • [ ] Use phonetic similarity to create similarity edges
    • [ ] doubleMetaphone ??
  • [ ] Run a community detection algorithm (like Louvain) to cluster similar ingredients together
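
As a rough illustration of how the steps in this checklist fit together, here is a minimal Python sketch. It makes several assumptions that are not part of the issue: the stopword list, the plural rules, and the `0.8` threshold are made up for illustration; the Sørensen-Dice score is computed on sets of character bigrams; and plain connected components (union-find) stand in for Louvain, so this is not a real community detection run. Phonetic similarity (doubleMetaphone) is left out entirely. All function names are hypothetical.

```python
import re
from itertools import combinations

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"of", "the", "and", "a", "fresh"}


def tokenize(name):
    """Lowercase, split on non-letters, drop stopwords/short tokens, strip plurals."""
    tokens = []
    for t in re.findall(r"[a-z]+", name.lower()):
        if t in STOPWORDS or len(t) < 3:
            continue
        # Naive plural handling, standing in for a real stemmer.
        if t.endswith("ies"):
            t = t[:-3] + "y"
        elif t.endswith("oes"):
            t = t[:-2]
        elif t.endswith("s") and not t.endswith("ss"):
            t = t[:-1]
        tokens.append(t)
    return tokens


def bigrams(s):
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Sørensen-Dice coefficient on character-bigram sets (set-based variant)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def similarity_edges(names, threshold=0.8):
    """Return (a, b, score) for every pair whose normalized names are similar enough."""
    norm = {n: " ".join(tokenize(n)) for n in names}
    edges = []
    for a, b in combinations(names, 2):
        score = dice(norm[a], norm[b])
        if score >= threshold:
            edges.append((a, b, score))
    return edges


def clusters(names, edges):
    """Group names by connected components (a simple stand-in for Louvain)."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b, _ in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())
```

Calling `clusters(names, similarity_edges(names))` on a list of raw ingredient strings would group spelling variants such as `cherry tomato` and `Cherry tomatoes` together. Connected components merge anything linked by a chain of edges, so a modularity-based method like Louvain should split loosely connected groups more sensibly on a real dataset.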

cassinius • Apr 28 '20 08:04