slicer
Improve the URL and path extractor
Not sure if it's the regex or something else, but we get loads of binary/Unicode-looking characters in the file.
Also, there are loads of URLs pointing to websites like googleapi or w3school. We should keep the output clean.
I think the way to do this is to make a blacklist and check whether the domain is in that list before writing it to the file here
Something like:
var BadURLs = map[string]bool{
	"URL HERE": true,
}

for _, d := range data {
	// Skip blacklisted entries; write everything else.
	if BadURLs[d] {
		continue
	}
	_, _ = datawriter.WriteString(d + "\n")
}
But the thing is we don't want to compare the exact URLs, just the root domains. We might have to use a regex for each value, or parse the URL and take out the root domain.
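A rough sketch of the parse-the-URL option, in case it helps. The names `badDomains`, `rootDomain` and `isBlacklisted` are made up for illustration, and the two blacklist entries are my guess at the real domains behind "googleapi" and "w3school". The root domain here is just the last two labels of the host, which is an assumption that breaks for suffixes like `co.uk`; `EffectiveTLDPlusOne` from `golang.org/x/net/publicsuffix` would probably be the more robust choice.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// badDomains is a hypothetical blacklist keyed by root domain.
// The entries are assumed examples, not a vetted list.
var badDomains = map[string]bool{
	"googleapis.com": true,
	"w3schools.com":  true,
}

// rootDomain returns the last two labels of the URL's host,
// e.g. "fonts.googleapis.com" becomes "googleapis.com".
// It returns "" when the input has no parsable host.
// Naive on purpose: multi-label suffixes such as "co.uk" come out wrong.
func rootDomain(raw string) string {
	u, err := url.Parse(raw)
	if err != nil || u.Hostname() == "" {
		return ""
	}
	labels := strings.Split(strings.ToLower(u.Hostname()), ".")
	if len(labels) < 2 {
		return labels[0]
	}
	return strings.Join(labels[len(labels)-2:], ".")
}

// isBlacklisted reports whether raw points at a blacklisted root domain.
func isBlacklisted(raw string) bool {
	return badDomains[rootDomain(raw)]
}

func main() {
	for _, d := range []string{
		"https://fonts.googleapis.com/css?family=Roboto",
		"https://example.com/api/v1/users",
	} {
		fmt.Println(d, isBlacklisted(d))
	}
}
```

In the write loop, the check would then be `isBlacklisted(d)` instead of `BadURLs[d]`. Entries without a scheme (for example `example.com/foo`) have no host after parsing, so this sketch lets them through untouched.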
I spent quite a lot of time trying to figure out a way but couldn't. Actually, there is no way; changing the regex would sometimes miss a lot of stuff (which could be important).
What should we do then?
Once URL.txt and path.txt are generated, run strings on the files. Simple.