graphinius
De-duplicating nodes via similarity & community detection
Follow the approach of the ingredient de-duplication pipeline demonstrated in a Neo4j tutorial that uses the BBC Good Food ingredients.
Pipeline
- [ ] download the goodfood dataset (scraping required!) or something equivalent
- [ ] normal NLP preprocessing steps
  - [ ] character encodings
  - [ ] tokenization
  - [ ] stemming (plurals)
  - [ ] stopwords / token length etc.
- [ ] connect tokens to an ingredient
  - ingredient: `cherry tomato` => parts: `cherry` and `tomato`
- [ ] Use string distance to create similarity edges
  - [ ] sorensenDiceSimilarity ??
- [ ] Use phonetic similarity to create similarity edges
  - [ ] doubleMetaphone ??
- [ ] Run a community detection algorithm (like Louvain) to cluster similar ingredients together
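
The pipeline above can be sketched end to end in a few lines of Python. This is only a rough illustration under several assumptions, not the tutorial's implementation: the stopword list, the naive plural rules, the `0.8` threshold and the helper names (`tokenize`, `dice`, `cluster`) are all made up for this sketch. The phonetic step is left out, and plain connected components (union-find) stand in for Louvain, which would normally come from a graph library.

```python
import re
import unicodedata
from collections import Counter, defaultdict
from itertools import combinations

# Assumption: a tiny hand-picked stopword list; a real run would use a fuller one.
STOPWORDS = {"of", "the", "and", "a", "fresh"}


def tokenize(name: str) -> list[str]:
    """Normalize encoding, lowercase, tokenize, drop stopwords, strip plurals."""
    # Character encodings: decompose accents and drop the non-ASCII marks.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    cleaned = []
    for tok in re.findall(r"[a-z]+", ascii_name.lower()):
        if tok in STOPWORDS or len(tok) < 2:
            continue
        # Naive plural stemming (assumption): cherries -> cherry, tomatoes -> tomato.
        if tok.endswith("ies") and len(tok) > 4:
            tok = tok[:-3] + "y"
        elif tok.endswith("oes") and len(tok) > 4:
            tok = tok[:-2]
        elif tok.endswith("s") and not tok.endswith("ss") and len(tok) > 3:
            tok = tok[:-1]
        cleaned.append(tok)
    return cleaned


def bigrams(text: str) -> list[str]:
    return [text[i:i + 2] for i in range(len(text) - 1)]


def dice(a: str, b: str) -> float:
    """Sorensen-Dice coefficient over character-bigram multisets."""
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    return 2 * sum((x & y).values()) / total


def cluster(names: list[str], threshold: float = 0.8) -> list[set[str]]:
    """Add a similarity edge when dice >= threshold, then group connected names.

    Connected components are a simple stand-in here, NOT Louvain.
    """
    canon = {n: " ".join(tokenize(n)) for n in names}
    parent = {n: n for n in names}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if dice(canon[a], canon[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in names:
        groups[find(n)].add(n)
    return list(groups.values())
```

Calling `cluster(["cherry tomato", "Cherry Tomatoes", "olive oil"])` groups the two tomato spellings together because they reduce to the same tokens, while `olive oil` stays on its own. Swapping the union-find step for a real community detection algorithm only changes the last stage; the similarity edges stay the same.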