hyphe issues

Allow crawler to parse and extract links from PDFs

crawler

feature

[lookup] check redirections pointing to same page

core

Handle js redirections

Example: https://www.google.com/url?rct=j&sa=t&url=http://www.out-law.com/en/articles/2016/june/ico-sees-jump-in-number-of-self-reported-data-breaches/&ct=ga&cd=CAIyHDBlNzRhZjQ0NzBhYjBhZDI6Y29tOmVuOkdCOlI&usg=AFQjCNGr_2Yci_y5pbA222T3bmLPb5dNmg&utm_source=twitterfeed&utm_medium=twitter returns HTTP 200, but the page content redirects via JavaScript to http://www.out-law.com/en/articles/2016/june/ico-sees-jump-in-number-of-self-reported-data-breaches/
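One possible way to catch such cases is to scan the body of a 200 response for common client-side redirect patterns. The sketch below is only an illustration: the function name `find_client_side_redirect` and both regular expressions are hypothetical, not part of the crawler, and they cover only simple literal forms (`location = "..."`, `location.replace("...")`, and `<meta http-equiv="refresh">`).

```python
import re
from urllib.parse import urljoin

# Illustrative patterns only: real pages may build the target dynamically
# or escape it (e.g. "\x3d"), which these expressions do not handle.
JS_REDIRECT_RE = re.compile(
    r"""(?:window\.|document\.|top\.|self\.)?location(?:\.href)?\s*=\s*["']([^"']+)["']"""
    r"""|location\.(?:replace|assign)\(\s*["']([^"']+)["']\s*\)""",
    re.IGNORECASE,
)
META_REFRESH_RE = re.compile(
    r"""<meta[^>]+http-equiv=["']?refresh["']?[^>]*"""
    r"""content=["']?\s*\d+\s*;\s*url=\s*["']?([^"'>\s]+)""",
    re.IGNORECASE,
)


def find_client_side_redirect(html, base_url):
    """Return the absolute redirect target found in `html`, or None."""
    match = META_REFRESH_RE.search(html) or JS_REDIRECT_RE.search(html)
    if match is None:
        return None
    # Exactly one capturing group is set, depending on which form matched.
    target = next(group for group in match.groups() if group)
    # Resolve relative targets against the URL of the fetched page.
    return urljoin(base_url, target)
```

A crawler using this approach would treat a non-None result like an HTTP redirect and follow it, ideally with a cap on the number of hops to avoid loops.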

boogheta

crawler

bug

Change URL/LRU rule to put t: (port) after h: (host) instead of before
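To make the proposal concrete, here is a rough sketch of a URL-to-LRU conversion with a switch for the stem order. It assumes the pipe-separated stem layout (`s:` scheme, `h:` reversed host labels, `t:` port, `p:` path segments, `q:` query, `f:` fragment); the function `url_to_lru` and its `port_after_host` parameter are hypothetical and are not the project's actual code.

```python
from urllib.parse import urlsplit


def url_to_lru(url, port_after_host=True):
    """Convert a URL to an LRU string (assumed stem layout, see lead-in)."""
    parts = urlsplit(url)
    # Host labels are reversed so that the most generic part comes first.
    host_stems = ["h:" + label for label in reversed(parts.hostname.split("."))]
    # Only an explicitly written port produces a t: stem in this sketch.
    port_stems = ["t:%d" % parts.port] if parts.port is not None else []

    stems = ["s:" + parts.scheme]
    if port_after_host:
        stems += host_stems + port_stems   # proposed order
    else:
        stems += port_stems + host_stems   # order with the port first
    # Drop the empty string that precedes the leading "/".
    stems += ["p:" + segment for segment in parts.path.split("/")[1:]]
    if parts.query:
        stems.append("q:" + parts.query)
    if parts.fragment:
        stems.append("f:" + parts.fragment)
    return "|".join(stems) + "|"


print(url_to_lru("http://example.com:8080/a/b"))
print(url_to_lru("http://example.com:8080/a/b", port_after_host=False))
```

With the port after the host, a prefix such as `s:http|h:com|h:example|` would cover every port of the domain, which is presumably the motivation for the change.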

2

boogheta

discussion

fine tuning

core

memory structure

Add backward functions to cancel completed crawls?

This requires marking, in the memory structure, the elements that come from a specific crawl

boogheta

discussion

crawler

fine tuning

feature

core

Handle accented URLs in prefixes

Examples:
- http://www.nosdéputés.fr -> http://www.xn--nosdputss-e4ad.fr/
- http://identità.com -> http://xn--identit-fwa.com/
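The conversion from an accented host name to its ASCII form can be sketched with Python's built-in `idna` codec. The helper `to_ascii_url` is a hypothetical name used for illustration, and the sketch deliberately ignores user credentials in the URL.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ascii_url(url):
    """Return `url` with its host name converted to the ASCII (xn--) form."""
    parts = urlsplit(url)
    # The standard "idna" codec encodes each label and leaves ASCII labels as-is.
    host = parts.hostname.encode("idna").decode("ascii")
    # Rebuild the network location; userinfo is dropped in this sketch.
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)
    return urlunsplit(parts._replace(netloc=netloc))


print(to_ascii_url("http://identità.com"))
```

Normalizing prefixes this way before they are stored or compared would let both spellings of the same host resolve to one entry.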

boogheta

Display the number of crawled pages per crawl depth in the crawl list

boogheta

crawler

feature

core

web interface

Investigate why some subdomains are not automatically set as such when using the related web entity creation rule (WECR)

Example with skyblogs on web archives

boogheta

Download internal hyperlinks

9

Is there an access point or button to download all the internal hyperlinks as GEXF files at the same time? For instance, I have 100 URLs to crawl, so for...

SeyedAlirezaMalih
