DarkSpider
Extractor should use proper mechanism to extract and store URLs
Is your feature request related to a problem? Please describe.
The Extractor takes the maximum file name length into consideration and creates sub-directories based on the URL.
http://a.com/b.ext?x=&y=$%z2 -> a.com/b.extxyz2_.html (a.com folder with b.extxyz2_.html file in it)
This is fine for storage purposes, but it does not act like a database.
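For illustration, here is a rough Python sketch of the kind of URL-to-path mapping described above (the function name, output directory, and length limit are assumptions for the example, not the actual Extractor code):

```python
import os
import re
from urllib.parse import urlparse

MAX_NAME_LEN = 255  # typical filesystem limit; assumed value


def url_to_path(url, output_dir="output"):
    """Mimic the described behaviour: one folder per host, a sanitized
    file name derived from the rest of the URL (illustrative only)."""
    parsed = urlparse(url)
    # Strip characters that are unsafe in file names.
    name = re.sub(r"[^A-Za-z0-9.]", "", parsed.path + parsed.query) or "index"
    # Truncate to respect the file-name length limit, then add the suffix.
    name = name[: MAX_NAME_LEN - len("_.html")] + "_.html"
    return os.path.join(output_dir, parsed.netloc, name)


print(url_to_path("http://a.com/b.ext?x=&y=$%z2"))
# -> output/a.com/b.extxyz2_.html
```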
Issues:
- File retrieval and merging of data for URL classification are complex.
- A URL can be very long, but file names have length constraints.
Describe the solution you'd like
A linear architecture where each folder contains files whose names are the SHA1 hash of the respective URL.
$ ls output/github.com/extracted/
00d1fbae77557ec45b3bfb3bdebfee49fd155cf9
b615c769e688dd83b2845ea0f32e2ee0c125c366
9b76fbceb3abd3423318ee37fd9ec1073961c14d
The links.txt file is renamed to links.json with the following content:
{
  "00d1fbae77557ec45b3bfb3bdebfee49fd155cf9": "http://github.com",
  "b615c769e688dd83b2845ea0f32e2ee0c125c366": "http://github.com/about/careers",
  "9b76fbceb3abd3423318ee37fd9ec1073961c14d": "http://github.com/sponsors"
}
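A minimal sketch of how pages could be written under this layout, assuming a `store_page` helper and a per-domain `links.json` index (both names are illustrative, not an existing DarkSpider API):

```python
import hashlib
import json
import os


def store_page(url, html, domain_dir="output/github.com/extracted"):
    """Store a page under the SHA1 of its URL and record the mapping
    in links.json (illustrative sketch, not the actual Extractor API)."""
    os.makedirs(domain_dir, exist_ok=True)
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()

    # The content lives in a file named by the hash; no length problems.
    with open(os.path.join(domain_dir, digest), "w", encoding="utf-8") as fh:
        fh.write(html)

    # links.json maps hash -> original URL for later lookups.
    index_path = os.path.join(domain_dir, "links.json")
    index = {}
    if os.path.exists(index_path):
        with open(index_path, encoding="utf-8") as fh:
            index = json.load(fh)
    index[digest] = url
    with open(index_path, "w", encoding="utf-8") as fh:
        json.dump(index, fh, indent=2)

    return digest


store_page("http://github.com", "<html>...</html>")
```

Retrieval then becomes: hash the URL, open the file with that name, and consult links.json only when the original URL is needed.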
Describe alternatives you've considered
Storing URLs in one big flat directory incurs a performance overhead as well (O(N) lookups).
Possible options:
- SQL DB (see the SQLite sketch below)
- Neo4j
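For comparison, a hypothetical sketch of the "SQL DB" option using SQLite (the schema, file path, and function names are assumptions), where lookups go through an index instead of scanning a directory:

```python
import hashlib
import sqlite3

# Hypothetical schema: the SHA1 of the URL is the primary key, so lookups
# are indexed instead of O(N) directory scans.
conn = sqlite3.connect("links.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS urls (sha1 TEXT PRIMARY KEY, url TEXT NOT NULL)"
)


def add_url(url):
    """Insert a URL, returning the SHA1 used as its key."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO urls (sha1, url) VALUES (?, ?)", (digest, url)
    )
    conn.commit()
    return digest


def lookup_url(digest):
    """Return the original URL for a given SHA1, or None if unknown."""
    row = conn.execute("SELECT url FROM urls WHERE sha1 = ?", (digest,)).fetchone()
    return row[0] if row else None
```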