
[Feature Request]: Capture Paragraph Heading information

Open dnk8n opened this issue 4 years ago • 1 comments

If we were able to split paragraphs by double newline characters, and somehow tag a paragraph heading to be distinct from a paragraph, then we could retain useful paragraph heading information for NLP tasks.

e.g. often there are paragraph headings such as:

  • Aftermath and cause
  • History and infrastructure
  • Accomplishment
  • Mission

They could all exist across multiple articles. If we could tag a paragraph heading with a symbol in the output, this could potentially be useful as doc2vec tags, for example.

e.g. output for dummy text, assuming # as the symbol to tag paragraph headings:

Here is a sentence. Here is a second sentence. This paragraph doesn't have a paragraph heading.

#History and infrastructure
Here is a second paragraph. It had a paragraph heading and can be tagged as such. Sentences from other articles could likely also have the same paragraph heading. It is useful to capture this information.
Here is a third paragraph. It is also attributed to the same paragraph heading as the paragraph before.

Here is another paragraph without paragraph heading.

#Flight
Another paragraph, this time to do with flights. Just a stupid example

If text were output in this format, we could split paragraphs by a single newline character and groups of paragraphs by a double newline character. We could also introduce a tag to associate with the one or more paragraphs that fall under such a heading.
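To make the proposed format concrete, here is a minimal sketch of how a consumer might read it back. It assumes `#` as the heading symbol (as in the example above), groups separated by a blank line, and one paragraph per line. The function name `parse_tagged_text` is hypothetical and is not part of wikiextractor.

```python
from typing import List, Optional, Tuple

HEADING_MARK = "#"  # assumed tag symbol, taken from the example in this issue


def parse_tagged_text(text: str) -> List[Tuple[Optional[str], List[str]]]:
    """Split text into (heading, paragraphs) groups.

    Groups are separated by a blank line. Inside a group, every line is
    one paragraph. If the first line of a group starts with HEADING_MARK,
    it is treated as the heading for all paragraphs in that group.
    """
    groups = []
    for block in text.strip().split("\n\n"):
        lines = [line for line in block.split("\n") if line.strip()]
        if not lines:
            continue
        heading = None
        if lines[0].startswith(HEADING_MARK):
            heading = lines[0][len(HEADING_MARK):].strip()
            lines = lines[1:]
        groups.append((heading, lines))
    return groups


sample = (
    "Intro paragraph.\n\n"
    "#History and infrastructure\nSecond paragraph.\nThird paragraph.\n\n"
    "No heading here.\n\n"
    "#Flight\nAbout flights."
)
for heading, paragraphs in parse_tagged_text(sample):
    print(heading, paragraphs)
```

Each `(heading, paragraphs)` pair could then be mapped to whatever a downstream tool expects, for instance using the heading as a document tag when it is not `None`.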

Open to any other suggestions. If the '#' symbol were to be used, we would need to escape any text that includes it at the start of a line (I am unsure that is even allowed, so not a big issue, but worth covering as an edge case anyway).
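One possible way to handle that edge case is sketched below. It assumes a backslash prefix as the escape, which is purely an illustrative choice and not something wikiextractor does today. All function names are hypothetical.

```python
HEADING_MARK = "#"  # assumed heading symbol from this issue
ESCAPE = "\\"       # assumed escape prefix (hypothetical choice)


def escape_paragraph(paragraph: str) -> str:
    """Prefix a paragraph with ESCAPE if it could be mistaken for a heading.

    Paragraphs that already start with ESCAPE are escaped as well, so that
    unescaping is unambiguous.
    """
    if paragraph.startswith(HEADING_MARK) or paragraph.startswith(ESCAPE):
        return ESCAPE + paragraph
    return paragraph


def unescape_paragraph(line: str) -> str:
    """Reverse escape_paragraph by dropping a single leading ESCAPE."""
    if line.startswith(ESCAPE):
        return line[len(ESCAPE):]
    return line


def is_heading(line: str) -> bool:
    """A line is a heading only if it starts with the unescaped mark."""
    return line.startswith(HEADING_MARK)


original = "#1 hit single of the year."
written = escape_paragraph(original)
print(is_heading(written))                      # False
print(unescape_paragraph(written) == original)  # True
```

The round trip keeps ordinary paragraphs unchanged, so only the rare lines that begin with the symbol pay the cost of the extra character.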

dnk8n avatar Aug 04 '21 08:08 dnk8n

Currently the above example would look like this:

Here is a sentence. Here is a second sentence. This paragraph doesn't have a paragraph heading.
History and infrastructure.
Here is a second paragraph. It had a paragraph heading and can be tagged as such. Sentences from other articles could likely also have the same paragraph heading. It is useful to capture this information.
Here is a third paragraph. It is also attributed to the same paragraph heading as the paragraph before.
Here is another paragraph without paragraph heading.
Flight.
Another paragraph, this time to do with flights. Just a stupid example

dnk8n avatar Aug 04 '21 08:08 dnk8n