hyphe issues

paginate tree page in Web Entity folder view

With a large number of pages, the web entity folder view can take a very long time to display. In my case I...
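One way to picture the request is a helper that hands the folder view a single slice of pages at a time. This is only a minimal sketch: the `paginate` function, its 1-based `page` argument and the default `page_size` of 50 are assumptions made for illustration, not part of Hyphe's code.

```python
from math import ceil


def paginate(items, page, page_size=50):
    """Return (items on the 1-based `page`, total number of pages).

    Hypothetical helper: names and defaults are illustrative only.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # An empty folder still has one (empty) page to display.
    total_pages = max(1, ceil(len(items) / page_size))
    if not 1 <= page <= total_pages:
        raise ValueError("page out of range")
    start = (page - 1) * page_size
    return items[start:start + page_size], total_pages
```

With such a helper the view would only render `page_size` entries per request, whatever the size of the folder.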

paulgirard

fine tuning

web interface

[IMPORT] bug in URL detection

1

![image](https://user-images.githubusercontent.com/193478/55246038-ab7f2100-5244-11e9-874f-b67f08a87452.png) The trailing " is kept when a URL is parsed from href="url".
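The symptom can be reproduced and avoided with a pattern that stops at quote characters. The sketch below is an assumption about how URL detection could be written, not the code the import feature actually uses; `URL_PATTERN` and `extract_urls` are hypothetical names.

```python
import re

# Hypothetical pattern: a URL ends at whitespace, a quote or an angle bracket,
# so the closing " of href="url" is never captured.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")


def extract_urls(text):
    """Return the URLs found in plain text or HTML, without surrounding quotes."""
    return URL_PATTERN.findall(text)
```

A pattern that only stops at whitespace, such as `https?://\S+`, would capture the closing quote along with the URL, which matches the behaviour described in the report.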

paulgirard

web interface

bug

JS Crawling via chrome headless

TODO:
- [x] download chrome headless + driver
- [x] remove phantom binary
- [x] plug chrome within selenium in scrapyd spider
- [x] include install in build docker
- ...

arnaudmolo

Add SiteMaps as a complementary StartPage mode ?

boogheta

Add a recrawl corpus feature

1

The recrawl process first asks for the crawl limits (see #158). There is also an option to avoid re-downloading pages that were already downloaded (set to true by default). This process will...

paulgirard

Content extraction features

4

It would be useful to extract clean textual content from each web page. We could use Boilerpipe, for instance: https://github.com/kohlschutter/boilerpipe

jacomyma

discussion

feature

core

[Front] Settings page : add some documentation

such as tooltips, or text for each box and entry

boogheta

fine tuning

web interface

[crawler] Implement form of crawl size limit

We need three settings for web entity crawls:
- *depth*: integer or infinity (only if *number of pages* is not infinite)
- *number of pages*: integer or infinity...
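The constraint between the two settings visible in this excerpt can be sketched as a small validated structure. This is an illustration under assumptions: the `CrawlLimits` class, the `max_pages` name, the default values and the use of `None` for infinity are all hypothetical, and the third setting is left out because the excerpt is truncated before it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrawlLimits:
    """Hypothetical container for two of the crawl limits; None means infinity."""

    depth: Optional[int] = 1
    max_pages: Optional[int] = None  # stands for *number of pages*

    def __post_init__(self):
        for name in ("depth", "max_pages"):
            value = getattr(self, name)
            if value is not None and value < 0:
                raise ValueError(name + " must be a non-negative integer or None")
        # Infinite depth is only allowed when the number of pages is finite.
        if self.depth is None and self.max_pages is None:
            raise ValueError("infinite depth requires a finite number of pages")
```

For example, `CrawlLimits(depth=None, max_pages=500)` is accepted, while `CrawlLimits(depth=None, max_pages=None)` raises `ValueError`.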

boogheta

discussion

crawler

Lighten mongodb disk space by merging queue & pages

boogheta

core

[TAGS] Tag selected amount starts at 0 instead of 1

2

Which could be justified (as in "0 elements in the elements selected were tagged with that particular tag before you clicked"), but it gets awkward as you can tag the...

Guillaume-Levrier

web interface

bug
