
Improve the URL and path extractor

Open · mzfr opened this issue 3 years ago · 2 comments

Not sure if it's the regex or something else, but we get loads of binary/unicode-looking characters in the file.

Also, there are loads of URLs pointing to websites like googleapi or w3school. We should keep the output clean.

mzfr · Oct 08 '20 20:10

I think the way to do this is to build a blacklist and check whether the domain is in that list before writing it to the file here.

Something like:

var BadURLs = map[string]bool{
	"URL HERE": true,
}

for _, d := range data {
	// Skip anything that matches the blacklist; write the rest.
	if BadURLs[d] {
		continue
	}
	_, _ = datawriter.WriteString(d + "\n")
}

But the thing is we don't want to compare the exact URLs, just the root domains. We might have to use a regex for each value, or parse each URL and take out its root domain.
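A minimal sketch of the parsing approach, assuming we go through net/url rather than a regex. The BadDomains map, the example entries, and rootDomain are hypothetical names for illustration, not the project's actual code, and the "last two labels" heuristic would need a public-suffix list to handle TLDs like .co.uk correctly:

package main

import (
	"fmt"
	"net/url"
	"strings"
)

var BadDomains = map[string]bool{
	"googleapis.com": true,
	"w3schools.com":  true,
}

// rootDomain keeps only the last two labels of the host, e.g.
// "ajax.googleapis.com" -> "googleapis.com".
func rootDomain(raw string) string {
	u, err := url.Parse(raw)
	if err != nil || u.Host == "" {
		return ""
	}
	host := strings.ToLower(u.Hostname())
	parts := strings.Split(host, ".")
	if len(parts) < 2 {
		return host
	}
	return strings.Join(parts[len(parts)-2:], ".")
}

func main() {
	data := []string{
		"https://ajax.googleapis.com/ajax/libs/jquery.js",
		"https://example.com/login",
	}
	for _, d := range data {
		if BadDomains[rootDomain(d)] {
			continue
		}
		fmt.Println(d) // in the extractor this would be datawriter.WriteString(d + "\n")
	}
}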

mzfr · Oct 20 '20 08:10

I spent quite a lot of time trying to figure out a way but couldn't. There really isn't one: changing the regex would sometimes miss a lot of stuff (which could be important).

What should we do then?

Once URL.txt and path.txt are generated, run strings on the files. Simple.
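A rough Go sketch of what that post-processing step could look like, assuming we filter in-process instead of shelling out to strings; it drops any line containing non-printable characters, and the filenames are placeholders:

package main

import (
	"bufio"
	"os"
	"unicode"
)

func main() {
	in, err := os.Open("URL.txt")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	out, err := os.Create("URL.clean.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	w := bufio.NewWriter(out)
	defer w.Flush()

	sc := bufio.NewScanner(in)
	for sc.Scan() {
		line := sc.Text()
		clean := true
		for _, r := range line {
			if !unicode.IsPrint(r) {
				clean = false
				break
			}
		}
		if clean {
			w.WriteString(line + "\n")
		}
	}
}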

mzfr · May 23 '21 18:05