ace2005-preprocessing

exact sentence which caused 'end_idx = -1' issue

Open yaof20 opened this issue 5 years ago • 6 comments

Hi there! Sorry to bother you again. I am using the ace_2005_td_v7_LDC2006T06.tgz dataset and have downloaded the latest version of this GitHub repo.

While processing the training data, an assertion error occurred:

assert end_idx != -1, "end_idx: {}, end_pos: {}, phrase: {}, tokens: {}, chars:{}".format(end_idx, end_pos, phrase, tokens, chars)
AssertionError: end_idx: -1, end_pos: 133, phrase: Doctors Without Borders/Médecins Sans Frontières (MSF, tokens: [{'index': 1, 'word': '', 'originalText': '"', 'lemma': '', 'characterOffsetBegin': 0,

I simply commented out the assertion, and main.py finished running without an exception.

Here is what I found in the output file:

"sentence": ""Doctors Without Borders/M\u8305decins Sans Fronti\u732bres (MSF) has received an extraordinary outpouring of support for the people of South Asia and we are extremely grateful.",
"golden-entity-mentions": [

  {
    "text": "Doctors Without Borders/M\u00e9decins Sans Fronti\u00e8res (MSF",
    "entity-type": "ORG:Non-Governmental",
    "start": 12,
    **"end": -1**
  },...]

How can I solve this end: -1 problem? As it stands, the entity recognition could be incomplete.
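One possible reading of the output above, offered as a hypothesis rather than a confirmed diagnosis: the "sentence" field contains \u8305 and \u732b where the entity "text" contains \u00e9 (é) and \u00e8 (è). That pattern is what UTF-8 bytes look like when decoded as GBK, so the phrase can never be found in the sentence. The minimal sketch below checks that hypothesis on the two strings from the output; the variable names are illustrative and nothing here comes from the repo's code.

```python
# Hypothesis check (an assumption, not confirmed in this thread):
# were the odd characters produced by decoding UTF-8 bytes as GBK?
garbled = "M\u8305decins Sans Fronti\u732bres"    # as seen in "sentence"
expected = "M\u00e9decins Sans Fronti\u00e8res"   # as seen in the entity "text"

# Undo the suspected wrong decoding: back to bytes via GBK, then read as UTF-8.
repaired = garbled.encode("gbk").decode("utf-8")

# True means the hypothesis is consistent with these two strings.
print(repaired == expected)
```

If the check holds, reading the source files with an explicit UTF-8 encoding might avoid the mismatch, but that has not been verified against this repo.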

yaof20 avatar Dec 10 '19 11:12 yaof20

I ran into the same problem!

Hanlard avatar Dec 30 '19 02:12 Hanlard

I ran into the same problem with the same data.

scarydemon2 avatar Aug 01 '20 08:08 scarydemon2

You can edit the raw data in English/un/timex2norm/alt.vacation.las-vegas_20050109.0133.apf.xml and alt.vacation.las-vegas_20050109.0133.sgm. In these two files, search for "Doctors Without" and change the é that follows to e. That solves the problem.

scarydemon2 avatar Aug 03 '20 01:08 scarydemon2

Hi,

I am doing research on information extraction and urgently need to use the ACE2005 dataset. Unfortunately, my university does not hold the LDC licence for ACE2005. Could you by any chance share the dataset for research purposes?

Many thanks, Regards, kc

daviddongkc avatar Dec 25 '20 15:12 daviddongkc

> Hi,
>
> I am doing research on information extraction and urgently need to use the ACE2005 dataset. Unfortunately, my university does not hold the LDC licence for ACE2005. Could you by any chance share the dataset for research purposes?
>
> Many thanks, Regards, kc

Hi there,

Sorry for the late response. I am wondering if you are still in need of the dataset. Contact me by email ([email protected]) if you are still interested.

Regards, Feng Yao

yaof20 avatar Feb 26 '21 03:02 yaof20

> You can edit the raw data in English/un/timex2norm/alt.vacation.las-vegas_20050109.0133.apf.xml and alt.vacation.las-vegas_20050109.0133.sgm. In these two files, search for "Doctors Without" and change the é that follows to e. That solves the problem.

In addition to changing é to e, one should also change è to e to solve the problem.

zyz0000 avatar Aug 07 '22 09:08 zyz0000
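For anyone who prefers to script the manual edit described in this thread, here is a minimal sketch. It assumes the two files are UTF-8 encoded, and `fold_text`, `fold_file`, and `data_dir` are hypothetical names that are not part of the repo. Note that it replaces every é and è in a file, not only the ones after "Doctors Without", so back up the data first.

```python
from pathlib import Path

# Only the two replacements named in this thread. Each maps one character
# to one character, so character offsets in the annotations do not shift.
FOLD = str.maketrans({"\u00e9": "e", "\u00e8": "e"})  # é -> e, è -> e


def fold_text(text: str) -> str:
    """Return text with é and è replaced by plain e."""
    return text.translate(FOLD)


def fold_file(path: Path, encoding: str = "utf-8") -> None:
    """Rewrite one file in place; the UTF-8 default is an assumption."""
    # newline="" keeps the original line endings untouched.
    with path.open("r", encoding=encoding, newline="") as f:
        text = f.read()
    with path.open("w", encoding=encoding, newline="") as f:
        f.write(fold_text(text))


# Usage (data_dir is a placeholder for wherever the dataset was extracted):
# stem = "alt.vacation.las-vegas_20050109.0133"
# for suffix in (".apf.xml", ".sgm"):
#     fold_file(data_dir / "English" / "un" / "timex2norm" / (stem + suffix))
```

After running it, main.py has to be re-run so the output is regenerated from the edited files.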