ace2005-preprocessing
exact sentence which caused 'end_idx = -1' issue
Hi there! Sorry to bother you again. I am using the ace_2005_td_v7_LDC2006T06.tgz dataset with the latest version of this GitHub repo.
While processing the training data, an assertion error occurred: assert end_idx != -1, "end_idx: {}, end_pos: {}, phrase: {}, tokens: {}, chars:{}".format(end_idx, end_pos, phrase, tokens, chars) AssertionError: end_idx: -1, end_pos: 133, phrase: Doctors Without Borders/Médecins Sans Frontières (MSF, tokens: [{'index': 1, 'word': '', 'originalText': '"', 'lemma': '', 'characterOffsetBegin': 0,
I simply commented out the assertion, and main.py then finished running without an exception.
Here is what I found in the output file:
"sentence": ""Doctors Without Borders/M\u8305decins Sans Fronti\u732bres (MSF) has received an extraordinary outpouring of support for the people of South Asia and we are extremely grateful.", "golden-entity-mentions": [
{
"text": "Doctors Without Borders/M\u00e9decins Sans Fronti\u00e8res (MSF",
"entity-type": "ORG:Non-Governmental",
"start": 12,
**"end": -1**
},...]
How can I solve this "end": -1 problem? As it stands, the entity annotations could be incomplete.
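A note on the escapes in the output above: the "sentence" field contains \u8305 and \u732b where the entity "text" has \u00e9 (é) and \u00e8 (è). One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the UTF-8 source text was decoded with a GBK-family codec at some point in the pipeline. The sketch below uses only Python's standard codecs to show that such a mis-decoding yields exactly those two code points; the function name is made up for illustration.

```python
def mis_decode(ch: str) -> str:
    """Encode a character as UTF-8, then (wrongly) decode the bytes as GBK.

    Assumption: this imitates what may have happened to the sentence text;
    the thread does not confirm where or whether this occurs in the repo.
    """
    return ch.encode("utf-8").decode("gbk")


for ch in ("\u00e9", "\u00e8"):  # é and è
    garbled = mis_decode(ch)
    print(f"U+{ord(ch):04X} -> U+{ord(garbled):04X}")
```

If that is what happened, the sentence and the annotated phrase no longer contain the same characters, which would be consistent with the phrase's end position not being found among the tokens.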
I have run into the same problem!
Same problem here, with the same data.
You can edit the raw data in English/un/timex2norm/alt.vacation.las-vegas_20050109.0133.apf.xml and alt.vacation.las-vegas_20050109.0133.sgm. In these two files, search for "Doctors Without" and change the é that follows to e. That solves the problem.
Hi,
I am doing research on information extraction and urgently need the ACE2005 dataset. Unfortunately, my university does not have an LDC licence for ACE2005. Could you by any chance share the dataset for research purposes?
Many thanks, Regards, kc
Hi there,
Sorry for the late response. Are you still in need of the dataset? If so, contact me by email ([email protected]).
Regards, Feng Yao
In addition to changing é to e, you also need to change è to e to solve the problem.
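For anyone who prefers not to edit the files by hand, the replacement described above could be scripted roughly as follows. This is a minimal sketch, not part of the repo: `strip_accents` and `patch_file` are hypothetical names, and it assumes the two files are UTF-8 encoded. Each accented character is swapped for exactly one plain character, so the text length (and with it any character offsets) stays the same.

```python
# Hypothetical helpers for illustration; not part of ace2005-preprocessing.
# Maps only the two characters mentioned in this thread: é and è.
ACCENT_MAP = str.maketrans({"\u00e9": "e", "\u00e8": "e"})


def strip_accents(text: str) -> str:
    """Replace é and è with plain e, one character in, one character out."""
    return text.translate(ACCENT_MAP)


def patch_file(path: str, encoding: str = "utf-8") -> None:
    """Rewrite a file in place with é and è replaced.

    Assumption: the file is UTF-8. newline="" keeps the original line
    endings untouched so character offsets are not shifted.
    """
    with open(path, encoding=encoding, newline="") as f:
        original = f.read()
    patched = strip_accents(original)
    # A 1:1 replacement must not change the length of the text.
    assert len(patched) == len(original)
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(patched)
```

You would call `patch_file` once on the .apf.xml file and once on the .sgm file named above. Back up the originals first, since the script overwrites them.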