graphinius De-duplicating of nodes via similarity & community detection

Open cassinius opened this issue 5 years ago • 0 comments

Use something along the lines of the ingredient de-duplication pipeline demonstrated in a Neo4j tutorial that uses the BBC Good Food ingredients.

[ ] download the goodfood dataset (scraping required!) - or something equivalent
[ ] normal NLP preprocessing steps
- [ ] character encodings
- [ ] tokenization
- [ ] stemming (plurals)
- [ ] stopwords / length etc.
[ ] connect tokens to an ingredient
- ingredient: cherry tomato => parts: cherry and tomato
[ ] Use string distance to create similarity edges
- [ ] sorensenDiceSimilarity ??
[ ] Use phonetic similarity to create similarity edges
- [ ] doubleMetaphone ??
[ ] Run a community detection algorithm (like Louvain) to cluster similar ingredients together
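
The preprocessing and string-distance steps could be prototyped roughly as follows. This is only a sketch in plain Python, not graphinius code: `STOPWORDS`, `naive_stem`, `tokenize`, `sorensen_dice`, `similarity_edges` and the `0.8` threshold are all hypothetical names and values chosen for illustration, and the plural stripping is a crude stand-in for a real stemmer.

```python
import re
import unicodedata
from itertools import combinations

# Hypothetical, deliberately tiny stopword list; a real one would be larger.
STOPWORDS = {"of", "and", "the", "a"}


def normalize(text):
    """Fold accents (NFKD, drop combining marks) and lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def naive_stem(token):
    """Crude plural stripping; an assumption, not a proper stemmer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("oes") and len(token) > 4:
        return token[:-2]
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def tokenize(name, min_length=2):
    """Split an ingredient name into stemmed tokens, dropping stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", normalize(name))
    return [naive_stem(t) for t in tokens
            if t not in STOPWORDS and len(t) >= min_length]


def bigrams(text):
    """Set of character bigrams (set-based, so repeated bigrams count once)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def sorensen_dice(a, b):
    """Sorensen-Dice coefficient over character bigram sets, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def similarity_edges(names, threshold=0.8):
    """Return (name_a, name_b, score) for every pair scoring at or above the threshold."""
    canonical = {name: " ".join(tokenize(name)) for name in names}
    edges = []
    for a, b in combinations(names, 2):
        score = sorensen_dice(canonical[a], canonical[b])
        if score >= threshold:
            edges.append((a, b, score))
    return edges
```

For example, `similarity_edges(["cherry tomato", "Cherry Tomatoes", "olive oil"])` links only the two tomato variants, because both normalize to the same token string.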

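Before wiring up Louvain, the clustering step can be sanity-checked with something much simpler. The sketch below is not Louvain: it groups ingredients by connected components of the similarity graph using a union-find, which is a coarser stand-in that merges everything reachable through any similarity edge. The function name `cluster_by_components` and the `(node_a, node_b, weight)` edge format are assumptions made for illustration.

```python
def cluster_by_components(nodes, edges):
    """Group nodes into connected components of the similarity graph.

    nodes: iterable of ingredient names
    edges: iterable of (name_a, name_b, weight); weights are ignored here,
           and every edge endpoint is assumed to appear in nodes.
    """
    nodes = list(nodes)
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Union the two endpoints of every similarity edge.
    for a, b, _weight in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each component under its root.
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)

    # Sort for a deterministic result.
    return sorted(sorted(group) for group in groups.values())
```

Each returned group is a candidate set of duplicates. Unlike Louvain, this approach cannot split a long chain of weakly related ingredients, so it is mainly useful for checking that the similarity edges look reasonable.
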
Apr 28 '20 08:04 cassinius