unitxt Add relation extraction

Adding support for the relation-extraction task.

May 21 '24 13:05 pklpriv

Hi. I added my comments. I think you should create a card that uses the task, loads the raw data from the file, and converts it to the format required by the task.

May 21 '24 13:05 yoavkatz

Since this is an important NLP task, I suggest we try to get it merged ASAP.

My suggestion is to follow the conventions and naming of the TACRED dataset, as representative of the consensus terminology:

The simple naming for the task output fields is:

subjects: List[str]
relations: List[str]
objects: List[str]

The more verbose version:

subjects_mentions: List[str]
relations_types: List[str]
objects_mentions: List[str]

I personally prefer the simple one.

Now to the second observation: there are really two types of task: (1) producing only mentions and relations, and (2) producing mentions with their exact locations in the text. IMO they have different metrics and different use cases.

So my practical suggestion here is to actually have two different tasks:

tasks.relation_extraction
tasks.relation_extraction.with_positions

The second should of course also have:

subjects_starts: List[int]
subjects_ends: List[int]
objects_starts: List[int]
objects_ends: List[int]
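
To make the proposal concrete, here is a rough sketch of the two output schemas in plain Python. The class names, the `check_aligned` helper, the example sentence, and the reading of positions as character offsets with an exclusive end are all assumptions for illustration; none of this is existing unitxt code.

```python
from typing import List, TypedDict


class RelationExtractionOutput(TypedDict):
    # Parallel lists: index i across all fields describes one triple.
    subjects: List[str]
    relations: List[str]
    objects: List[str]


class RelationExtractionWithPositionsOutput(RelationExtractionOutput):
    # Assumed to be character offsets into the text, end exclusive.
    subjects_starts: List[int]
    subjects_ends: List[int]
    objects_starts: List[int]
    objects_ends: List[int]


def check_aligned(output) -> bool:
    """Return True if every field list has the same length."""
    lengths = {len(values) for values in output.values()}
    return len(lengths) <= 1


# Hypothetical example instance.
text = "John works for Acme."
example: RelationExtractionWithPositionsOutput = {
    "subjects": ["John"],
    "relations": ["employed_by"],
    "objects": ["Acme"],
    "subjects_starts": [0],
    "subjects_ends": [4],
    "objects_starts": [15],
    "objects_ends": [19],
}
```

Under this reading, `tasks.relation_extraction` would only need the first three fields, and the positional fields could be checked against the text by slicing.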

What do you think, @pklpriv and @yoavkatz?

Jun 10 '24 18:06 elronbandel

I agree that the short names are better, and that there are two tasks (with positions and without positions). They differ not only in the input but also in the prediction type (e.g. the positions need to be checked). Initially we can support only tasks.relation_extraction. Even if the data comes with positions, it should be converted to lists of strings in the pre-processing phase of the card (and not in the template, as is done today in NER).
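
As an illustration of that pre-processing step, the conversion from positions to strings could look roughly like the sketch below. It is plain Python, not a unitxt operator; the function names, the `text` field name, and the assumption that positions are character offsets with an exclusive end are all hypothetical.

```python
from typing import Dict, List


def positions_to_mentions(text: str, starts: List[int], ends: List[int]) -> List[str]:
    """Slice the text at each (start, end) pair; end is assumed exclusive."""
    return [text[start:end] for start, end in zip(starts, ends)]


def to_relation_extraction_fields(instance: Dict) -> Dict:
    """Convert a position-annotated instance to the string-only task fields."""
    text = instance["text"]
    return {
        "text": text,
        "subjects": positions_to_mentions(
            text, instance["subjects_starts"], instance["subjects_ends"]
        ),
        "relations": list(instance["relations"]),
        "objects": positions_to_mentions(
            text, instance["objects_starts"], instance["objects_ends"]
        ),
    }


# Hypothetical raw instance with positions only.
raw = {
    "text": "John works for Acme.",
    "relations": ["employed_by"],
    "subjects_starts": [0],
    "subjects_ends": [4],
    "objects_starts": [15],
    "objects_ends": [19],
}
converted = to_relation_extraction_fields(raw)
```

Doing this in the card keeps the template free of position handling, so the same template can serve datasets with and without positions.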

Jun 11 '24 05:06 yoavkatz

@elronbandel @yoavkatz

I agree with you; I think this is a good way to implement it. Additionally, I am currently implementing a task/metric for datasets that do not have a specified, named relation like (obj1, relation, obj2). Instead, for a tuple (obj1, obj2, ..., obj_n), we treat membership in the same tuple as the relation itself.

Jun 12 '24 15:06 pklpriv

@pklpriv Can you explain more about the n-ary tuple with unnamed relation? Can you give one input and required output example?

Jun 12 '24 18:06 elronbandel

@elronbandel

I will allow myself to cite Yoav:

"As of December 31, 2011, we had 4,260 employees of whom 2,155, or 51%, are employed in the U.S.; 1,165, or 27%, are employed in Europe; 615, or 14%, are employed in Asia and 325, or 8%, are employed in the Middle East."

This is the expected output:

[
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "4,260"},
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "2155", "EMPLOYEE_PERCENT": "51%", "GEOGRAPHY": "U.S."},
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "1165", "EMPLOYEE_PERCENT": "27%", "GEOGRAPHY": "Europe"},
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "615", "EMPLOYEE_PERCENT": "14%", "GEOGRAPHY": "Asia"},
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "325", "EMPLOYEE_PERCENT": "8%", "GEOGRAPHY": "Middle East"}
]

As you can see, there is no object that represents the relation itself (as opposed to, e.g., (John, employedBy, Hannah), where 'employedBy' is the relation stated as an object). Instead, the relation is defined by the sentence itself.
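
One possible way to score this kind of output, shown only as an illustration and not necessarily the metric being implemented in this PR, is exact-match precision/recall/F1 over whole tuples. The function names and the choice of exact matching are assumptions; a partial-credit scheme would behave differently.

```python
from typing import Dict, FrozenSet, List, Tuple


def _as_key(tuple_fields: Dict[str, str]) -> FrozenSet[Tuple[str, str]]:
    """Make a tuple hashable and independent of field order."""
    return frozenset(tuple_fields.items())


def tuple_f1(
    predictions: List[Dict[str, str]], references: List[Dict[str, str]]
) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over sets of n-ary tuples."""
    predicted = {_as_key(t) for t in predictions}
    expected = {_as_key(t) for t in references}
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Two reference tuples taken from the example above.
references = [
    {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "4,260"},
    {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "615",
     "EMPLOYEE_PERCENT": "14%", "GEOGRAPHY": "Asia"},
]
# Hypothetical prediction: the first tuple is right, the second has the wrong GEOGRAPHY.
predictions = [
    {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "4,260"},
    {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "615",
     "EMPLOYEE_PERCENT": "14%", "GEOGRAPHY": "Europe"},
]
scores = tuple_f1(predictions, references)
```

Here a tuple counts as correct only if every field matches, so a single wrong field makes the whole tuple a miss.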

Jun 13 '24 15:06 pklpriv