created test for valid selector that does not increase time

Open brandonscholet opened this issue 2 years ago • 4 comments

My repo wappybird implements your library. I started pulling the updated Wappalyzer technology files. They have had issues with invalid JSON, so I started pulling the current release, but one of its selectors is malformed. I talked to the maintainer of soupsieve, and they provided a function to check for valid selectors and skip any that are not. This replaces the crude try/except code.

I can update your repo to pull the current technologies if you would like. Or feel free to pull from wappybird.

Also, the pip package is out of date and incompatible with the updated technologies files.

brandonscholet avatar Jan 13 '23 00:01 brandonscholet

Thanks @brandonscholet. Can you provide a test for an invalid selector, please?

tristanlatr avatar Jan 14 '23 18:01 tristanlatr

The current release of npm-Wappalyzer has this broken selector:

iframe[scr*='//airtable.com/'], a[href*='//airtable.com/][target='_blank']
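The function provided by the soupsieve maintainer is not shown in this thread. As a minimal sketch of what a test for the selector above could look like, the block below assumes that `soupsieve.compile` raises `soupsieve.SelectorSyntaxError` for a malformed selector. The helper name `is_valid_selector` and the test function names are hypothetical and are not part of python-Wappalyzer.

```python
import soupsieve

# The broken selector quoted in this thread: the closing quote after
# '//airtable.com/' is missing in the second attribute selector.
BROKEN_SELECTOR = "iframe[scr*='//airtable.com/'], a[href*='//airtable.com/][target='_blank']"


def is_valid_selector(selector: str) -> bool:
    """Return True if soupsieve can compile the selector (hypothetical helper)."""
    try:
        soupsieve.compile(selector)
    except soupsieve.SelectorSyntaxError:
        return False
    return True


def test_valid_selector_is_accepted():
    assert is_valid_selector("iframe[src*='//airtable.com/']")


def test_broken_selector_is_rejected():
    assert not is_valid_selector(BROKEN_SELECTOR)


test_valid_selector_is_accepted()
test_broken_selector_is_rejected()
```

Selectors that fail this check could then be skipped when the technologies are loaded, rather than caught later during page analysis.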

brandonscholet avatar Jan 17 '23 22:01 brandonscholet

This will pull the latest technologies into the technologies file. They have had broken selectors for the past two releases.

import io
import json
import os
from zipfile import ZipFile

import requests
from Wappalyzer import Wappalyzer, WebPage

# defined at module level so the updater and Wappalyzer.latest() share the same path
technologies_file = os.path.expanduser('~/.python-Wappalyzer/technologies.json')

def update_technologies_from_latest():
    print("updating technologies")
    technologies = {}
    categories = {}

    # get the latest release metadata
    latest_release = requests.get('https://api.github.com/repos/wappalyzer/wappalyzer/releases/latest').json()
    # download the release zip
    zip_response = requests.get(latest_release['zipball_url'])
    myzip = ZipFile(io.BytesIO(zip_response.content))

    # parse files
    for listed_file in myzip.namelist():
        # merge every technology file into one dict
        if "src/technologies" in listed_file and listed_file.endswith(".json"):
            tech_json = json.loads(myzip.read(listed_file).decode('UTF-8'))
            technologies.update(tech_json)
        # load the categories
        if listed_file.endswith("src/categories.json"):
            categories = json.loads(myzip.read(listed_file).decode('UTF-8'))

    # merge into one object
    combined_object = {'categories': categories, 'technologies': technologies}

    # write to file, creating the directory first if it does not exist
    os.makedirs(os.path.dirname(technologies_file), exist_ok=True)
    with open(technologies_file, 'w', encoding='utf-8') as tfile:
        json.dump(combined_object, tfile)
    print("done!\n")

update_technologies_from_latest()
webpage = WebPage.new_from_url("https://example.com", verify=False, timeout=60)
wappalyzer = Wappalyzer.latest(technologies_file=technologies_file)
techs = wappalyzer.analyze_with_versions_and_categories(webpage)
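
The thread mentions that upstream releases have sometimes shipped invalid JSON. As a rough sketch of how the merge step above might tolerate that, the hypothetical helper `merge_technologies` below skips any file that fails to parse and reports which ones were skipped. It runs against an in-memory archive with made-up file names and contents, so it can be tried without network access. None of these names come from python-Wappalyzer or wappybird.

```python
import io
import json
from zipfile import ZipFile


def merge_technologies(archive: ZipFile):
    """Merge technology files from an archive, skipping files with invalid JSON.

    Returns (combined_object, skipped_file_names). Hypothetical helper.
    """
    technologies = {}
    categories = {}
    skipped = []
    for name in archive.namelist():
        is_tech = "src/technologies/" in name and name.endswith(".json")
        is_categories = name.endswith("src/categories.json")
        if not (is_tech or is_categories):
            continue
        try:
            data = json.loads(archive.read(name))
        except ValueError:  # covers json.JSONDecodeError and bad encodings
            skipped.append(name)
            continue
        if is_tech:
            technologies.update(data)
        else:
            categories = data
    return {"categories": categories, "technologies": technologies}, skipped


# Build a small in-memory archive that mimics a release zipball (made-up data).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("repo-abc/src/categories.json", '{"1": {"name": "CMS"}}')
    demo_zip.writestr("repo-abc/src/technologies/a.json", '{"Airtable": {"cats": [1]}}')
    demo_zip.writestr("repo-abc/src/technologies/b.json", '{"Broken": ')  # invalid JSON

combined, skipped = merge_technologies(ZipFile(buffer))
print(sorted(combined["technologies"]), skipped)
```

Returning the skipped names separately keeps the written file in the same `categories`/`technologies` shape while still letting the caller log what was dropped.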

brandonscholet avatar Jan 17 '23 22:01 brandonscholet

Looking back, the print statements should probably be removed.

brandonscholet avatar Jan 17 '23 22:01 brandonscholet