
Improve the URL and path extractor

Open · mzfr opened this issue 3 years ago · 2 comments

Not sure if it's the regex or something else, but we get loads of binary/unicode-looking characters in the file.

Also, there are loads of URLs pointing to websites like googleapi or w3school. We should keep the output clean.

mzfr · Oct 08 '20 20:10

I think the way to do this is to build a blacklist and check whether the domain is in that list before writing it to the file here.

Something like:

var BadURLs = map[string]bool{
	"URL HERE": true,
}

for _, d := range data {
	// Skip anything that matches the blacklist; write the rest.
	if BadURLs[d] {
		continue
	}
	_, _ = datawriter.WriteString(d + "\n")
}

But the thing is we don't want to compare the exact URLs, just the root domains. We might have to use a regex for each value, or parse each URL and take out its root domain.
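A minimal sketch of the parsing approach, assuming we go through net/url rather than a regex. The BadDomains map, the example entries, and rootDomain are hypothetical names for illustration, not the project's actual code, and the "last two labels" heuristic would need a public-suffix list to handle TLDs like .co.uk correctly:

package main

import (
	"fmt"
	"net/url"
	"strings"
)

var BadDomains = map[string]bool{
	"googleapis.com": true,
	"w3schools.com":  true,
}

// rootDomain keeps only the last two labels of the host, e.g.
// "ajax.googleapis.com" -> "googleapis.com".
func rootDomain(raw string) string {
	u, err := url.Parse(raw)
	if err != nil || u.Host == "" {
		return ""
	}
	host := strings.ToLower(u.Hostname())
	parts := strings.Split(host, ".")
	if len(parts) < 2 {
		return host
	}
	return strings.Join(parts[len(parts)-2:], ".")
}

func main() {
	data := []string{
		"https://ajax.googleapis.com/ajax/libs/jquery.js",
		"https://example.com/login",
	}
	for _, d := range data {
		if BadDomains[rootDomain(d)] {
			continue
		}
		fmt.Println(d) // in the extractor this would be datawriter.WriteString(d + "\n")
	}
}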

mzfr · Oct 20 '20 08:10

I spent quite a lot of time trying to figure out a way but couldn't. There really isn't one: changing the regex would sometimes miss a lot of stuff (which could be important).

What should we do then?

Once URL.txt and path.txt are generated, run strings on the files. Simple.
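A rough Go sketch of what that post-processing step could look like, assuming we filter in-process instead of shelling out to strings; it drops any line containing non-printable characters, and the filenames are placeholders:

package main

import (
	"bufio"
	"os"
	"unicode"
)

func main() {
	in, err := os.Open("URL.txt")
	if err != nil {
		panic(err)
	}
	defer in.Close()

	out, err := os.Create("URL.clean.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	w := bufio.NewWriter(out)
	defer w.Flush()

	sc := bufio.NewScanner(in)
	for sc.Scan() {
		line := sc.Text()
		clean := true
		for _, r := range line {
			if !unicode.IsPrint(r) {
				clean = false
				break
			}
		}
		if clean {
			w.WriteString(line + "\n")
		}
	}
}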

mzfr · May 23 '21 18:05