node-unfluff

Implement domainExtractor for image and title, with a single implementation for Wikipedia

Open • danielgranat opened this issue 10 years ago • 1 comment

First, thanks for sharing this code. It's exactly what I needed, and I hadn't found anything I liked in Node.js.

This is more a request for feedback than an actual pull request. The problem I encountered is that Wikipedia does not work with the image and title extraction as currently implemented. Image: it is not in the header; it is the first image in the '.infobox'. Title: when splitting the title, the longest part is usually not the important part, as in 'Thomas Edison - Wikipedia, the free encyclopedia'.

Trying to tackle this problem, I saw 2 options:

  1. Refactor the current extraction implementation to support Wikipedia's structure. I don't think it's a good option. First, it will make the code less readable. Second, what happens when I need more customization?
  2. Have something like domain-specific plugins.

Obviously, I decided to go with the second option.
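
To make the plugin idea concrete, here is a minimal sketch of how a per-domain extractor could sit in front of a generic fallback. It is an illustration only, not the code in the PR and not node-unfluff's API: the names `genericTitle`, `domainExtractors`, `findExtractor`, and `extractTitle` are hypothetical, the generic "longest part" rule is a simplified stand-in for the existing title logic, and the `image` hook assumes the caller supplies a `select` function that runs a CSS selector against the page.

```javascript
// Hypothetical sketch: none of these names come from node-unfluff itself.

// Simplified generic fallback: split the <title> text on common separators
// and keep the longest part.
function genericTitle(rawTitle) {
  const parts = rawTitle.split(/\s+[-|:]\s+/);
  return parts
    .reduce((best, part) => (part.length > best.length ? part : best), "")
    .trim();
}

// Registry of domain-specific extractors, keyed by hostname suffix.
const domainExtractors = {
  "wikipedia.org": {
    // On Wikipedia the article name comes first, before " - Wikipedia, ...".
    title: (rawTitle) => rawTitle.split(/\s+-\s+/)[0].trim(),
    // Assumed hook: `select` is supplied by the caller and returns whatever
    // the first match of the CSS selector yields (for example an image URL).
    image: (select) => select(".infobox img"),
  },
};

// Find the plugin whose suffix matches the hostname, or null if none does.
function findExtractor(hostname) {
  const key = Object.keys(domainExtractors).find(
    (suffix) => hostname === suffix || hostname.endsWith("." + suffix)
  );
  return key ? domainExtractors[key] : null;
}

// Use the domain plugin when one exists, otherwise the generic rule.
function extractTitle(hostname, rawTitle) {
  const plugin = findExtractor(hostname);
  return plugin && plugin.title ? plugin.title(rawTitle) : genericTitle(rawTitle);
}
```

With this shape, supporting another site means adding one entry to the registry, and the generic path stays untouched for every domain without a plugin.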

There is still work to be done and issues to address, but I would like to get your input on the proposed solution.

Thank you for your time!

danielgranat, Feb 24 '15 23:02

Hey, thanks for the PR. I do agree that domain-specific plugins are a better path here than hacks on top of hacks in the main code.

I'll take a look at the PR in detail when I have a little free time and let you know what I think.

Thanks!

ageitgey, Feb 24 '15 23:02