datashare
fix: remove artificial line breaks for `.eml` NER
The current Tika pipeline keeps the line breaks that email servers add to fit the 78/998-character maximum line length limits set by the RFC.
Ideally, emails inside DS should display without these artificial line breaks.
At a minimum, the NER should get rid of these line breaks.
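
To make the idea concrete, here is a minimal Python sketch of one possible heuristic for undoing hard wrapping before text is passed to NER. It is not Datashare or Tika code. The function names, the `WRAP_WIDTH` value, and the rule itself (a break is treated as artificial when the first word of the next line would not have fit on the previous line) are assumptions for illustration only.

```python
WRAP_WIDTH = 78  # assumed wrap column; real mail clients may wrap earlier


def looks_wrapped(prev_line: str, next_line: str, width: int) -> bool:
    """Guess whether the break between two physical lines was forced by wrapping."""
    if not prev_line.strip() or not next_line.strip():
        return False  # blank lines are treated as real paragraph boundaries
    first_word = next_line.split()[0]
    # If the next word would have overflowed the width, the break was likely forced.
    return len(prev_line.rstrip()) + 1 + len(first_word) > width


def unwrap_hard_wrapped(text: str, width: int = WRAP_WIDTH) -> str:
    """Join lines whose break looks artificial; keep all other breaks."""
    out = []
    prev_physical = ""  # last original line, used for the length check
    for line in text.splitlines():
        if out and looks_wrapped(prev_physical, line, width):
            out[-1] = out[-1].rstrip() + " " + line.lstrip()
        else:
            out.append(line)
        prev_physical = line
    return "\n".join(out)
```

The heuristic will occasionally join lines that were broken on purpose (for example, a long signature line), so it would suit the NER input better than the displayed text unless it is tuned further.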
This issue is stale because it has been open for 40 days with no activity.
Waiting for the spaCy NER to be implemented, which should allow faster prototyping and easier text processing: https://github.com/ICIJ/datashare/issues/1452
This issue is stale because it has been open for 40 days with no activity.