python-goose icon indicating copy to clipboard operation
python-goose copied to clipboard

Do we need to download images to file?

Open atlithorn opened this issue 11 years ago • 24 comments

Aren't we only looking for width,height and number of bytes?

A head request gives you the content-length and something like https://gist.github.com/atlithorn/6155288 for width and height could remove the need for /tmp access.

If you are interested I'll create a pull request.

Atli.

atlithorn avatar Aug 05 '13 11:08 atlithorn

Imho, it should be an option to not download to a file

dzen avatar Aug 05 '13 12:08 dzen

Still new to this project, but I think there is the configuration: enable_image_fetching

Which sounds like it can be set to False for outright disabling of image downloading. Haven't actually tried this yet though.

kcolton avatar Aug 05 '13 13:08 kcolton

Not looking to disable image fetching - I need image fetching - just don't think we need to download to disk if we only use bytes,width and height.

atlithorn avatar Aug 05 '13 13:08 atlithorn

Actually you can disable image fetching if not needed. I do agree that I would be better not to download the whole image file if it's not needed. However, image extractor is not tested at all. I wish to have this part of goose tested before changing how it work to be sure the implemented solution works as well.

grangier avatar Aug 05 '13 14:08 grangier

@atlithorn could you source you piece of code ... it seems to be a Zope code right ?

grangier avatar Aug 05 '13 14:08 grangier

You're right, looks like Zope. I have been using this snippet for a while with no idea where it came from :)

atlithorn avatar Aug 05 '13 14:08 atlithorn

Hi all... This is also a Pain Point on my side, although rather from another perspective: right now there are no unit tests for the image stuff. I'd like to do Dependency Injections for http_client.

In fact the variable http_client exists but isn't in use. So I think to do this:

  • turn network.HtmlFetcher into network.HttpClient
  • move the method fetch(...) from images.image.ImageUtils into network.HttpClient
  • actually use the http_client variable. Also change the constructor of Goose to init(self, config, http_client)

In a later step, one could make the function fetch(...) return an image.Image object. So either changing the existing algorithm to HEAD or just implementing a custom http_client would become a piece of cake...

Let me know what you think, I'm keen on creating a patch for this ;)

Cheers, Philip

psilva261 avatar Aug 08 '13 15:08 psilva261

Couldn't wait, I've created a pull request for the DI stuff ;)

psilva261 avatar Aug 08 '13 16:08 psilva261

Hello,

I don't see why we need to change the Goose.init constructor, nor why we need to pass the http client around if the variable is not used.

xav

grangier avatar Aug 08 '13 16:08 grangier

Same here, as in my comment is the PR

Le jeudi 8 août 2013, Xavier Grangier a écrit :

Hello,

I don't see why we need to change the Goose.init constructor, nor why we need to pass the http client around if the variable is not used.

xav

— Reply to this email directly or view it on GitHubhttps://github.com/grangier/python-goose/issues/31#issuecomment-22336779 .

Benoit Calvez.

dzen avatar Aug 08 '13 17:08 dzen

+1 for the mocking/unit testing of the image stuff

psilva261 avatar Aug 18 '13 16:08 psilva261

@psilva261 the mocking testing stuff is now implemented. Did you check it out ?

grangier avatar Aug 18 '13 18:08 grangier

@grangier yeah, looks great, I couldn't have imagined an easier way to write unit tests for that.. ;) I've also seen you already solved the problem with the mentioned page on timeforkids.com. so right now I don't have other issues for the top image extractor on the radar, but this could change any day...

psilva261 avatar Aug 18 '13 18:08 psilva261

What was the timeforkids.com issue ? I don't remember.

grangier avatar Aug 18 '13 18:08 grangier

In this article: http://www.timeforkids.com/news/gaga-goo-goo-clusters/54591 The author's image (bottom right) was extracted instead of the large image above the article... I was writing a unit test but then I realized the correct image is already extracted

psilva261 avatar Aug 18 '13 18:08 psilva261

@psilva261 Hum weird. In my case the correct image is not extracted. It is still the author's image.

>>> import goose
>>> url = 'http://www.timeforkids.com/news/gaga-goo-goo-clusters/54591'
>>> g = goose.Goose()
>>> a = g.extract(url=url)
>>> a.top_image.src
'http://www.timeforkids.com/files/Web_Faye.jpg'

Don't you have the same result ?

grangier avatar Aug 18 '13 19:08 grangier

Problem is with this article the main image is not in the article top node. The only image in the article top_node is the author's image

grangier avatar Aug 18 '13 19:08 grangier

Odd... In the project I work on, on my laptop the correct image is extracted. But on our server it seems still the author's image is extracted.

psilva261 avatar Aug 18 '13 19:08 psilva261

@psilva261 i guess you don't have the same html on both side.

grangier avatar Aug 18 '13 20:08 grangier

To solve this issue, the only way it to grab all the images of the document and not only the ones on the detected top_node. Side effects would be that image extraction would take more time and results may be worst than expected. For instance we could have a large image that is used as a sprit for icons and goose will detect it as main image.

grangier avatar Aug 18 '13 20:08 grangier

I see this a lot too and the only solution I came up with was to add to KNOWN_IMG_DOM_NAMES.

Would sprite files be dangerous, can you do huge sprite images with just the img tag? That's usually done via css which goose would not catch anyway, right?

But I know what you're saying, the danger is images that are not connected to the article.

Maybe some kind of scoring mechanism based on how close to the top node the image is? A big image just outside the top node would score higher than a small image inside it?

Atli.

On Sun, Aug 18, 2013 at 8:42 PM, Xavier Grangier [email protected]:

To solve this issue, the only way it to grab all the images of the document and not only the ones on the detected top_node. Side effects would be that image extraction would take more time and results may be worst than expected. For instance we could have a large image that is used as a sprit for icons and goose will detect it as main image.

— Reply to this email directly or view it on GitHubhttps://github.com/grangier/python-goose/issues/31#issuecomment-22838975 .

atlithorn avatar Aug 18 '13 21:08 atlithorn

@atlithorn you are right sprites were a bad exemple as they usually loaded by the CSS file. I like the idea of scoring for images outside the top node .. But this means fetching all the document images. We could also use a configuration settings for a maximum parent node distance.

grangier avatar Aug 18 '13 22:08 grangier

I haded some more test file regarding KNOWN_IMG_DOM_NAMES and known-image-css.txt e008ac18287974e7fd979675ca15d0534b65353c

to solve @psilva261 issue you could add timeforkids.com^content to known-image-css.txt file

This part of the code need to be refactor because knwon class are only taking in account parent node and not the image class it self.

grangier avatar Aug 18 '13 23:08 grangier

@grangier Alright, I'll send you a merge request soon. Ok ok

psilva261 avatar Aug 19 '13 15:08 psilva261