trafilatura
Thoroughly implement and test duplicate detection
- [x] Least-recently-used (LRU) cache
- [x] Maximum number of occurrences allowed?
- [ ] Line / sentence / paragraph / document level?
- [ ] Concurrency: thread-safety / multiprocessing
Useful test case: https://github.com/miso-belica/jusText/issues/42
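
To make the checked items concrete, here is a minimal sketch of how an LRU cache combined with a maximum number of allowed occurrences could flag repeated segments. It is an illustration under assumptions, not the library's actual code: the names `LRUCounter`, `is_duplicate`, `maxsize` and `max_repetitions` are hypothetical, and the lock only addresses thread-safety within one process (sharing across multiple processes would need a different mechanism).

```python
import threading
from collections import OrderedDict


class LRUCounter:
    """Bounded LRU map from a text segment to how often it was seen.

    Hypothetical helper for illustration only.
    """

    def __init__(self, maxsize=1024):
        self.maxsize = maxsize
        self._counts = OrderedDict()
        self._lock = threading.Lock()  # guards every access to _counts

    def increment(self, key):
        """Record one occurrence of key and return the updated count."""
        with self._lock:
            count = self._counts.pop(key, 0) + 1
            self._counts[key] = count  # re-insert as most recently used
            if len(self._counts) > self.maxsize:
                self._counts.popitem(last=False)  # evict least recently used
            return count


def is_duplicate(segment, cache, max_repetitions=2):
    """Return True once segment has been seen more than max_repetitions times."""
    return cache.increment(segment.strip()) > max_repetitions


# The same segment is tolerated twice, then flagged on the third occurrence.
cache = LRUCounter(maxsize=2)
results = [is_duplicate("Cookie notice", cache) for _ in range(3)]
```

The granularity question (line, sentence, paragraph or document) is left open here: the sketch treats whatever string it receives as one segment, so the caller decides how the text is split before it is checked.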