metadata
metadata copied to clipboard
Change the order of local tags
I propose in this PR to change the way in which local tags are added to the text.
Motivation
HTML defines a tree. In this sense, it is important that the rules for opening and closing tags are respected. For example, the following html is not valid "<h1> test </h1></a><a>"
. With the current way of adding local tags, this rule may not be respected because of self closing tags which can have their opening and closing tags on the same character index.
As a side note, with the current metadata format, one level of information is lost for HTM tags. Indeed, when two tags have the same index for their opening and closing tag, we don't know which one is the parent.
Implementation details
- I have modified the way local tags are added
- I added a test
- I also modified the HTMLProcessor because HTML tags have a tag value but also attributes.
Review
I would be very happy to have your opinion on this PR
As a side note, with the current metadata format, one level of information is lost for HTM tags. Indeed, when two tags have the same index for their opening and closing tag, we don't know which one is the parent.
Just curious, is this (same index) a limitation of the source dataset?
Just curious, is this (same index) a limitation of the source dataset?
No no, I compile the dataset myself from Natural Questions and put it in the format proposed here.
Just curious, is this (same index) a limitation of the source dataset?
No no, I compile the dataset myself from Natural Questions and put it in the format proposed here.
Ah, I see. I misunderstood. Thank you for the clarification! I suppose if we truly want to resolve the nested-tag-situation then there may be a bit too much work for your HTML parser?
I suppose if we truly want to resolve the nested-tag-situation then there may be a bit too much work for your HTML parser?
:slightly_smiling_face:
I suppose if we truly want to resolve the nested-tag-situation then there may be a bit too much work for your HTML parser?
On the parser side I think it's really feasible, what's more blocking in my opinion is where to store the information in the jsonl
format we have defined and how to use it in the add_local_metadata_to_text
method without it being something specific to HTML.
what's more blocking in my opinion is where to store the information in the
jsonl
format we have defined and how to use it in theadd_local_metadata_to_text
method without it being something specific to HTML.
I see. Funny you should mention that, because my first comment was gonna be "having a new property or even structure of jsonl for it" but then I didn't send it for exactly the same reason. lol
That being said, I wonder if entities will also need that kind of nested construct.
Hi @SaulLu, as discussed yesterday, I fully agree that this is an important issue that needs to be fixed. However, I'm not sure I fully understand how your proposed solution works - would you perhaps have 15 minutes or so sometime tomorrow (ideally, between 15:00 and 17:00 CEST) to go through this pull request with me? :)
Summary of an offline discussion: unfortunately this PR does not solve all the concerns about the order of the local HTML metadata to be added.
For example, @timoschick has rightly shown that <tr><th>elephant</th><th></th></tr>
would be transformed into <tr><th>elephant</th></tr><th></th>
with this PR (which is still not correct).
Another pathological case is:
<table>
<tr>
<th>event_id</th>
<th>year</th>
<th>month</th>
</tr>
<tr>
<td>idghtu</td>
<td>1998</td>
<th>may</th>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</table>
I'm converting this PR into a draft, for now, the time to think about another workaround.