
Extractor misses on certain congressional sites

Open dwillis opened this issue 10 years ago • 6 comments

First, thank you for this library - it's really useful and an impressive achievement. I'll try digging into the code to see if I can pinpoint where this happens, but wanted to bring it to your attention. For some congressional sites (example), eatiht extracts "Non-breaking space within span tags - &nbsp; - is required for WYSIWYG." as the text from the page.

dwillis avatar Jan 12 '15 03:01 dwillis

Hi, thanks for the feedback! I'll try giving this a proper inspection as soon as I can. From what I can tell after a brief look at the html of the example you gave, there are span tags like crazy: in between each p tag there are double span tags (for styling purposes) wrapping around the actual text. This is no doubt breaking the algorithm in some way.

Anyways, thanks for bringing this up and I'll try to come up with a solution.

rodricios avatar Jan 12 '15 03:01 rodricios

Thank you - congressional sites are pretty much a worst-case scenario for clean HTML.

dwillis avatar Jan 12 '15 03:01 dwillis

Hi @dwillis, I'm super sorry for not having addressed this issue yet! I've been busy with another project and it's been taking much longer than I anticipated. The good news is that I'm almost done :)

rodricios avatar Jan 18 '15 21:01 rodricios

Hi,

I was trying to use your tool for my undergraduate thesis, but as pointed out above, it misses on a lot of websites, like these:

http://www.bollywoodhungama.com/moviemicro/criticreview/id/570145
http://www.who.int/malaria/areas/diagnosis/en/

Hope you fix them and make your tool a success.

Thanks, Faisal

IndianShifu avatar Feb 19 '15 18:02 IndianShifu

Hi @FaisalCoder. First, I'm flattered that you've chosen to use eatiht in your undergrad thesis (I'm definitely interested in reading/hearing about your thesis's subject)!

I'm going to try to describe the algorithm's steps before I respond to your and dwillis's issue.

Currently, eatiht builds a histogram (frequency distribution) measuring the occurrence of text-heavy nodes (text nodes above some string-length threshold); the keys (AKA bins, buckets, etc.) are xpaths leading to those text-heavy nodes. What I eventually count up is the frequency of the parents (after filtering out nodes that are not "text-heavy").

Here's what I mean:

I first use an xpath query to get a list of text nodes kinda like so:


[
    '//body/div/p/text()',
    '//body/div/div/div/p[1]/text()',
    '//body/div/div/div/p[2]/text()',
    ...
    '//body/article/p/text()',
    '//body/div/footer/p/text()'
]

# In the actual code, the text-node step ( "/text()" ) is stripped from the above xpaths by the query itself.
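To make that concrete, here's a minimal sketch (not eatiht's actual code; the 20-character threshold and the local file name are assumptions for illustration) of how such a list of text-node xpaths could be collected with lxml:

import lxml.html

# Hypothetical local copy of the page being extracted
tree = lxml.html.parse('page.html')

# Collect the xpath of every "text-heavy" text node under <body>;
# lxml's xpath text results are "smart strings" that know their parent.
text_xpaths = [
    tree.getpath(text_node.getparent()) + '/text()'
    for text_node in tree.xpath('//body//text()')
    if len(text_node.strip()) > 20  # assumed string-length threshold
]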

The next step is to create a list of tuples where the first element is the xpath; as I'm setting the first element of each tuple to the xpath, I'm also taking note of the text's string length (among other statistics). So, in a simplified way, we end up with the following:

[
    ( '//body/div/p/text()', 25 ),
    ( '//body/div/div/div/p[1]/text()', 30 ),
    ( '//body/div/div/div/p[2]/text()', 40 ),
    ...
    ( '//body/article/p/text()', 60 ),
    ( '//body/div/footer/p/text()', 10 )
]
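Continuing the sketch from above (again, an illustration rather than the real implementation), pairing each xpath with its text's string length might look like this:

# Pair each text-heavy text node's parent xpath with the text's length
xpath_length_pairs = [
    (tree.getpath(text_node.getparent()) + '/text()', len(text_node.strip()))
    for text_node in tree.xpath('//body//text()')
    if len(text_node.strip()) > 20  # same assumed threshold as before
]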

The "other statistics" I also calculate are the average string-length and standard deviation. This is so that I can then apply what I liken to low- and high-pass filters. Eventually I end up filtering obvious non-content elements.

Finally (and sorry, but I have to simplify the process to keep this short), I make another pass through the list, rstrip'ing the xpaths so that they now describe the path to the parent of the text nodes. After this happens, I count up the occurrences of the parents (a sketch of this counting pass follows the example below):

[
    ( '//body/div', 1 ),
    ( '//body/div/div/div', 2 ),
    ...
    ( '//body/article', 1 ),
    ( '//body/div/footer', 1 )
]
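Here's a sketch of that counting pass (assuming the filtered_pairs list from the earlier sketches; the rsplit trick is just one way to drop the trailing path steps):

from collections import Counter

# Strip the final element and the '/text()' step so sibling text nodes
# (e.g. p[1] and p[2] under the same div) collapse into the same bucket,
# then count how many text-heavy children each bucket has.
parent_counts = Counter(
    xpath.rsplit('/', 2)[0]  # '.../div/p[1]/text()' -> '.../div'
    for xpath, _ in filtered_pairs
)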

Now, if you can imagine the above structure as a frequency distribution or a histogram, operations like "ARGMAX" and "MAX" are useful because what you're essentially doing is calculating the most likely location of the content, based on the number of times the parent nodes of "text-heavy" text nodes appear.
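In code, that "ARGMAX"-style step is just picking the bucket with the largest value (here the raw counts; in practice you would weight by the aggregated string lengths):

# Pick the parent path whose bucket has the highest frequency
best_parent = max(parent_counts, key=parent_counts.get)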

Now, in reality, eatiht goes through a few more passes over the list, all while calculating some statistics (like the average child node's string-length and the standard deviation away from the mean), and this helps bring down the number of candidate nodes. But look again at the above example and take note of the tuple containing the largest string-length.

In the above example, imagine that our desired extracted node was actually the "//body/article" element. But the current version of eatiht will predict "//body/div/div/div": its two paragraphs' string lengths sum to 30 + 40 = 70, and 70 > 60 after all.

Now, there are hacks I could apply to eatiht in a short time frame, but I'm afraid they won't provide better results in the long run. A proper fix would require a decomposition of my algorithm, a better understanding of tree theory on my part, among other things.

I would love to be able to fix this and have started, but I'm afraid that I don't have the time or resources to do so, at least not in any reasonably short amount of time :|

Edit*: I cleaned up what I wrote and made the examples I brought up a little closer to the actual implementation.

rodricios avatar Feb 20 '15 01:02 rodricios

Thanks for replying,

I will surely send you the link to the online implementation of my thesis as soon as I complete it (around 10 April 2015).

IndianShifu avatar Feb 20 '15 09:02 IndianShifu