Doc about parsing wiki tables

Open lmorillas opened this issue 10 years ago • 8 comments

The docs say that the new release can parse wiki tables, but this isn't documented. How can I parse a wiki table? Is there a special filter?

lmorillas avatar Jan 10 '15 17:01 lmorillas

There's no special filter (right now) – tables are just parsed as a special kind of HTML tag, so you can use filter_tags. For example:

>>> import mwparserfromhell
>>> text = """{|
... |-
... | testing
... |}"""
>>> code = mwparserfromhell.parse(text)
>>> code.filter_tags(matches=lambda node: node.tag == "table")
['{|\n|-\n| testing\n|}']
>>> print(code.get_tree())
<
      table
>
      <
            tr
      >
            <
                  td
            >
                   testing\n
            </
                  td
            >
      </
            tr
      >
</
      table
>

It's a little clunky if you actually want to manipulate the tables... I'm not sure what proper methods would even look like. At any rate, I'm leaving this open as a reminder to document how these less-obvious features work.

earwig avatar Jan 11 '15 04:01 earwig

I want to extract data from the wikitable at https://en.wikipedia.org/wiki/OHL_Classic_at_Mayakoba, but only from the rows with colspan = 10, so that I get the current and all previous names of the tournament, e.g. 1) OHL Classic at Mayakoba 2) Mayakoba Golf Classic 3) Mayakoba Golf Classic at Riviera Maya-Cancun. Is this possible using filter_tags?

I also want to do some validation, i.e. I only want to look at the winners table; there can be other tables on the page that I don't want to look at. Within that table, I only want to look at rows that span all the columns and get their text.

Let me know the approach using the code.filter_<>() methods, or whether you think it's easier to do this with a Python regex on the whole page's wiki markup.

shrikantp-vbt avatar Jan 09 '17 22:01 shrikantp-vbt

Have the wiki table manipulation methods been updated or documented? Is it the same situation for lists? I am looking for methods that can access individual table cells or list elements.

suhassumukh avatar Apr 30 '19 11:04 suhassumukh

The only methods we currently have for this are the normal HTML tag traversal methods. What you want to do should be possible with those, but it’s not ideal. I would like to add more tailored things in the future, but this hasn’t happened yet.

earwig avatar Apr 30 '19 13:04 earwig

I would also second a feature like this.

TheSandDoctor avatar May 05 '19 06:05 TheSandDoctor

For manipulation, the right way will probably be to implement something like smart_list, but for tables.

MeitarR avatar Dec 23 '21 22:12 MeitarR

The filter_tags approach above actually works fairly well, but it runs into problems with nested tables. smart_list is incredible, and I would love to see something similar implemented in the package to handle tables. Currently I am recursing on my own; the code is becoming ugly, but I think it's manageable.

Thanks for the package @earwig.

ryandward avatar Jan 20 '24 20:01 ryandward

I'll leave this here in case anyone wants it. It saves having to clean up the HTML elements that get split up.

import mwparserfromhell

def wiki_link_to_html(node):
    # Render a wikilink as a placeholder anchor that shows the link target.
    text = str(node.title)
    return f'<a href="#">{text}</a>'

def wiki_table_to_html(node):
    result = ['<table>']
    for row in node.contents.nodes:
        if isinstance(row, mwparserfromhell.nodes.Tag) and row.tag == 'tr':
            result.append('<tr>')
            for cell in row.contents.nodes:
                if isinstance(cell, mwparserfromhell.nodes.Tag) and cell.tag in ['td', 'th']:
                    result.append(f'<{cell.tag}>')
                    for content in cell.contents.nodes:
                        if isinstance(content, mwparserfromhell.nodes.Text):
                            result.append(str(content))
                        elif isinstance(content, mwparserfromhell.nodes.Wikilink):
                            result.append(wiki_link_to_html(content))
                    result.append(f'</{cell.tag}>')
            result.append('</tr>')
    result.append('</table>')
    return ''.join(result)

wiki_text = """
{| class="eoTable2 sortable" style="text-align:center" 
|-
! Spell !! Level !! Component A !! Component B !! Component C !! Trivial !! Mana Efficiency (Damage per Mana) Assumes 4 Targets & No Resists
|-
| [[Pillar of Fire]] || 16 || [[Rune of Nagafen]] || [[Rune of Proximity]] || || 22 || 3.6
|-
| [[Project Lightning]] || 16 || [[Rune of Fulguration]] || [[Rune of Periphery]] || || 21? || PBAoE
|}"""

wikicode = mwparserfromhell.parse(wiki_text)
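# Select by tag name: a plain string passed to matches is treated as a regex
# searched against each node's text, so 'table' only matched here via "eoTable2".
html_text = wiki_table_to_html(wikicode.filter_tags(matches=lambda node: node.tag == "table")[0])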

print(html_text)

ryandward avatar Jan 24 '24 06:01 ryandward