metadata icon indicating copy to clipboard operation
metadata copied to clipboard

Change the order of local tags

Open SaulLu opened this issue 3 years ago • 7 comments

I propose in this PR to change the way in which local tags are added to the text.

Motivation

HTML defines a tree. In this sense, it is important that the rules for opening and closing tags are respected. For example, the following html is not valid "<h1> test </h1></a><a>". With the current way of adding local tags, this rule may not be respected because of self closing tags which can have their opening and closing tags on the same character index.

As a side note, with the current metadata format, one level of information is lost for HTM tags. Indeed, when two tags have the same index for their opening and closing tag, we don't know which one is the parent.

Implementation details

  • I have modified the way local tags are added
  • I added a test
  • I also modified the HTMLProcessor because HTML tags have a tag value but also attributes.

Review

I would be very happy to have your opinion on this PR

SaulLu avatar Sep 06 '21 09:09 SaulLu

As a side note, with the current metadata format, one level of information is lost for HTM tags. Indeed, when two tags have the same index for their opening and closing tag, we don't know which one is the parent.

Just curious, is this (same index) a limitation of the source dataset?

tianjianjiang avatar Sep 06 '21 14:09 tianjianjiang

Just curious, is this (same index) a limitation of the source dataset?

No no, I compile the dataset myself from Natural Questions and put it in the format proposed here.

SaulLu avatar Sep 06 '21 14:09 SaulLu

Just curious, is this (same index) a limitation of the source dataset?

No no, I compile the dataset myself from Natural Questions and put it in the format proposed here.

Ah, I see. I misunderstood. Thank you for the clarification! I suppose if we truly want to resolve the nested-tag-situation then there may be a bit too much work for your HTML parser?

tianjianjiang avatar Sep 06 '21 15:09 tianjianjiang

I suppose if we truly want to resolve the nested-tag-situation then there may be a bit too much work for your HTML parser?

:slightly_smiling_face:

I suppose if we truly want to resolve the nested-tag-situation then there may be a bit too much work for your HTML parser?

On the parser side I think it's really feasible, what's more blocking in my opinion is where to store the information in the jsonl format we have defined and how to use it in the add_local_metadata_to_text method without it being something specific to HTML.

SaulLu avatar Sep 06 '21 15:09 SaulLu

what's more blocking in my opinion is where to store the information in the jsonl format we have defined and how to use it in the add_local_metadata_to_text method without it being something specific to HTML.

I see. Funny you should mention that, because my first comment was gonna be "having a new property or even structure of jsonl for it" but then I didn't send it for exactly the same reason. lol

That being said, I wonder if entities will also need that kind of nested construct.

tianjianjiang avatar Sep 08 '21 23:09 tianjianjiang

Hi @SaulLu, as discussed yesterday, I fully agree that this is an important issue that needs to be fixed. However, I'm not sure I fully understand how your proposed solution works - would you perhaps have 15 minutes or so sometime tomorrow (ideally, between 15:00 and 17:00 CEST) to go through this pull request with me? :)

timoschick avatar Sep 09 '21 15:09 timoschick

Summary of an offline discussion: unfortunately this PR does not solve all the concerns about the order of the local HTML metadata to be added.

For example, @timoschick has rightly shown that <tr><th>elephant</th><th></th></tr> would be transformed into <tr><th>elephant</th></tr><th></th> with this PR (which is still not correct).

Another pathological case is:

<table>
  <tr>
    <th>event_id</th>
    <th>year</th>
    <th>month</th>
  </tr>
  <tr>
    <td>idghtu</td>
    <td>1998</td>
    <th>may</th>
  </tr>
  <tr>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</table>

I'm converting this PR into a draft, for now, the time to think about another workaround.

SaulLu avatar Sep 10 '21 14:09 SaulLu