fix: remove artificial line breaks for `.eml` NER

Open ClemDoum opened this issue 1 year ago • 6 comments

The current Tika pipeline keeps the line breaks that email servers add to fit the RFC line-length limits (78 characters recommended, 998 maximum). Ideally, emails inside DS should display without these artificial line breaks. At a minimum, the NER should strip them before processing the text.
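As a rough illustration of the minimal option, here is a Python sketch of a pre-NER cleanup step. It is a heuristic, not what Datashare or Tika does: the function name `unwrap_hard_wrapped` and the `min_wrapped_len` threshold are hypothetical, and it assumes a line break is artificial when the preceding line is long enough to have been wrapped and neither line is blank.

```python
def unwrap_hard_wrapped(text: str, min_wrapped_len: int = 60) -> str:
    """Join lines that look hard-wrapped by a mail server.

    Heuristic (an assumption, not a Datashare/Tika behavior): a line break is
    treated as artificial when the previous original line is at least
    `min_wrapped_len` characters long and both lines are non-blank.
    Blank lines (paragraph breaks) and short lines are left untouched.
    """
    out: list[str] = []
    prev_len = 0  # length of the previous *original* line, before any joining
    for line in text.splitlines():
        stripped = line.strip()
        if out and out[-1] and stripped and prev_len >= min_wrapped_len:
            # Previous line was long enough to have been wrapped: merge.
            out[-1] = out[-1].rstrip() + " " + stripped
        else:
            out.append(line)
        prev_len = len(line)
    return "\n".join(out)


body = (
    "The quick brown fox jumps over the lazy dog and keeps running through the\n"
    "forest until it reaches Paris.\n"
    "\n"
    "Regards,\n"
    "Alice"
)
cleaned = unwrap_hard_wrapped(body)
```

With this input, the first two lines are merged into one sentence, while the paragraph break and the short signature lines are kept. A length threshold will misfire on text that is legitimately made of long lines, so it would need tuning against real `.eml` samples.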

ClemDoum avatar Feb 13 '24 10:02 ClemDoum

This issue is stale because it has been open for 40 days with no activity.

github-actions[bot] avatar Mar 25 '24 00:03 github-actions[bot]

This issue is stale because it has been open for 40 days with no activity.

github-actions[bot] avatar May 09 '24 00:05 github-actions[bot]

This issue is stale because it has been open for 40 days with no activity.

github-actions[bot] avatar Jun 23 '24 00:06 github-actions[bot]

Waiting for the spaCy NER to be implemented, which should allow faster prototyping and easier text processing: https://github.com/ICIJ/datashare/issues/1452

ClemDoum avatar Jul 09 '24 08:07 ClemDoum

This issue is stale because it has been open for 40 days with no activity.

github-actions[bot] avatar Aug 19 '24 00:08 github-actions[bot]