
Extractor should use proper mechanism to extract and store URLs

Open · PROxZIMA opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe.

The Extractor takes the maximum file-name length into consideration and creates sub-directories based on the URL.

http://a.com/b.ext?x=&y=$%z2 -> a.com/b.extxyz2_.html (a.com folder with b.extxyz2_.html file in it)

This is fine for storage purposes but does not behave like a database.

Issues:

  • File retrieval and merging of data for URL classification are complex.
  • A URL can be very long, but file names have length constraints.

Describe the solution you'd like

A linear architecture where a folder consists of files whose names are the SHA1 hashes of the respective URLs.

$ ls output/github.com/extracted/

00d1fbae77557ec45b3bfb3bdebfee49fd155cf9
b615c769e688dd83b2845ea0f32e2ee0c125c366
9b76fbceb3abd3423318ee37fd9ec1073961c14d

The links.txt file is renamed to links.json with the following content:

{
    "00d1fbae77557ec45b3bfb3bdebfee49fd155cf9": "http://github.com",
    "b615c769e688dd83b2845ea0f32e2ee0c125c366": "http://github.com/about/careers",
    "9b76fbceb3abd3423318ee37fd9ec1073961c14d": "http://github.com/sponsors"
}
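
A minimal sketch of what this hash-keyed layout could look like in Python is given below; the function name (store_page), the directory layout, and the links.json handling are illustrative assumptions, not the existing Extractor API:

import hashlib
import json
import os

# Sketch only: file names are the SHA1 hashes of the URLs, and
# links.json maps hash -> URL. Paths mirror the example above.
def store_page(url, html, out_dir="output/github.com/extracted"):
    os.makedirs(out_dir, exist_ok=True)
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()

    # Write the page body under its fixed-length, hash-based name.
    with open(os.path.join(out_dir, digest), "w", encoding="utf-8") as f:
        f.write(html)

    # Update the hash -> URL index (links.json).
    index_path = os.path.join(out_dir, "links.json")
    index = {}
    if os.path.exists(index_path):
        with open(index_path, encoding="utf-8") as f:
            index = json.load(f)
    index[digest] = url
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f, indent=4)

    return digest

Since a SHA1 digest is always 40 hex characters, the file-name length constraint disappears no matter how long the URL is, and locating the file for a given URL is a single hash computation instead of re-applying the sanitisation rules.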

Describe alternatives you've considered

Storing URLs in one big flat directory is also a performance overhead (O(N) lookups).

Possible options:

  • SQL DB (sketched below)
  • Neo4j
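
For reference, a rough sketch of the SQL-DB option using SQLite; the database path, table name, and columns are assumptions for illustration:

import hashlib
import os
import sqlite3

os.makedirs("output", exist_ok=True)
conn = sqlite3.connect("output/darkspider.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS pages (
        sha1 TEXT PRIMARY KEY,   -- SHA1 of the URL (fixed length)
        url  TEXT NOT NULL,      -- original, arbitrarily long URL
        html TEXT                -- extracted page content
    )
    """
)

url, html = "http://github.com/sponsors", "<html>...</html>"
digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
conn.execute(
    "INSERT OR REPLACE INTO pages (sha1, url, html) VALUES (?, ?, ?)",
    (digest, url, html),
)
conn.commit()

# Indexed lookup by primary key instead of an O(N) directory scan.
row = conn.execute("SELECT url FROM pages WHERE sha1 = ?", (digest,)).fetchone()

A single SQLite file keeps the hash-to-URL mapping, the page content, and any later classification labels in one indexed place, at the cost of losing the plain-file layout that cat/grep can work on directly.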

PROxZIMA · Feb 19 '23 13:02