linkedin_scraper
Profile scraping error: `res = description.find_element(By.TAG_NAME,"a").find_elements(By.XPATH,"*")`
This error occurred when scraping a person who has worked multiple times at the same organization. I checked the page structure and it should work fine, but for some reason it fails.
This part of the code causes the problem:

```python
if position_summary_text and len(position_summary_text.find_element(By.CLASS_NAME,"pvs-list").find_elements(By.XPATH,"li")) > 1:  #.find_element(By.CLASS_NAME,"pvs-list")
    descriptions = position_summary_text.find_element(By.CLASS_NAME,"pvs-list").find_element(By.CLASS_NAME,"pvs-list").find_elements(By.XPATH,"li")
    for description in descriptions:
        res = description.find_element(By.TAG_NAME,"a").find_elements(By.XPATH,"*")
        position_title_elem = res[0] if len(res) > 0 else None
        work_times_elem = res[1] if len(res) > 1 else None
        location_elem = res[2] if len(res) > 2 else None
```
It can't find `res` by tag name `a`.
As far as I understand, this code tries to find the top part of the job description (title, duration at the position, location), all of which is located under an `a` tag on the web page. @joeyism do you have any insights on that? Am I referring correctly to the part of the page that this code is trying to analyse?
The whole error message:

```
Traceback (most recent call last):
  File "C:\Users\User\PycharmProjects\pythonProject\pythonProject\lists_check.py", line 23, in
Process finished with exit code 1
```
Can you provide the code that you've used please?
Sorry, I forgot to include the failing account in the first place. This bug occurred on this profile: https://www.linkedin.com/in/sheanahamill/. The error occurs with any basic person scraping. This is the code I used to discover the bug:
```python
from selenium.common.exceptions import WebDriverException
from selenium import webdriver
from linkedin_scraper import Person, actions, Company
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
import time, pickle
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("user-data-dir=C:\\Users\\User\\AppData\\Local\\Google\\Chrome\\User Data\\Profile 3")
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)

person = Person("https://www.linkedin.com/in/sheanahamill", driver=driver, scrape=False)
time.sleep(3)
person.scrape(close_on_complete=False)

name = person.name
title = person.job_title
now_company = person.company
print(name, title, now_company)

experience = person.experiences
print(experience)
current_company = experience[0]
print(current_company)
link_to_company = current_company.linkedin_url
print(link_to_company)
location = current_company.location
print(location)

company = Company(link_to_company, driver=driver, get_employees=False, close_on_complete=False)
company_name = company.name
company_size = company.company_size
company_website = company.website
about = company.about_us
print(company_name, company_size, company_website, about)
```
This code works fine with other accounts (other than the log1 problem from #173).
Hey - I only updated the two functions I needed: get_experiences() and get_name_and_location(). In addition to the UI updates, I also fixed the scraper issue where it gets confused when a person has held multiple positions at the same company over time.
You can selectively scrape by doing this:

```python
person = Person("https://www.linkedin.com/in/sheanahamill", driver=driver, scrape=False)
person.get_experiences()
print(person.experiences)
```
```python
def get_name_and_location(self):
    main = self.wait_for_element_to_load(by=By.TAG_NAME, name="main")
    top_panels = main.find_elements(By.CLASS_NAME, "pv-text-details__left-panel")
    self.name = top_panels[0].find_elements(By.XPATH, "*")[0].text
    self.location = top_panels[1].find_element(By.TAG_NAME, "span").text

def get_experiences(self):  # modified
    url = os.path.join(self.linkedin_url, "details/experience")
    self.driver.get(url)
    self.focus()
    main = self.wait_for_element_to_load(by=By.TAG_NAME, name="main")
    self.scroll_to_half()
    self.scroll_to_bottom()
    main_list = self.wait_for_element_to_load(name="pvs-list", base=main)
    for position in main_list.find_elements(By.XPATH, "li"):
        position = position.find_element(By.CLASS_NAME, "pvs-entity")
        company_logo_elem, position_details = position.find_elements(By.XPATH, "*")

        # company elem
        company_linkedin_url = company_logo_elem.find_element(By.XPATH, "*").get_attribute("href")

        # position details
        position_details_list = position_details.find_elements(By.XPATH, "*")
        position_summary_details = position_details_list[0] if len(position_details_list) > 0 else None
        position_summary_text = position_details_list[1] if len(position_details_list) > 1 else None  # skills OR list of positions

        outer_positions = position_summary_details.find_element(By.XPATH, "*").find_elements(By.XPATH, "*")
        if len(outer_positions) == 4:
            position_title = outer_positions[0].find_elements(By.XPATH, "*")[0].find_elements(By.XPATH, "*")[0].find_elements(By.XPATH, "*")[0].find_elements(By.XPATH, "*")[0].text
            company = outer_positions[1].find_element(By.TAG_NAME, "span").text
            work_times = outer_positions[2].find_element(By.TAG_NAME, "span").text
            location = outer_positions[3].find_element(By.TAG_NAME, "span").text
        elif len(outer_positions) == 3:
            if "·" in outer_positions[2].text:
                position_title = outer_positions[0].find_elements(By.XPATH, "*")[0].find_elements(By.XPATH, "*")[0].find_elements(By.XPATH, "*")[0].find_elements(By.XPATH, "*")[0].text
                company = outer_positions[1].find_element(By.TAG_NAME, "span").text
                work_times = outer_positions[2].find_element(By.TAG_NAME, "span").text
                location = ""
            else:
                position_title = ""
                company = outer_positions[0].find_elements(By.XPATH, "*")[0].find_elements(By.XPATH, "*")[0].find_elements(By.XPATH, "*")[0].find_elements(By.XPATH, "*")[0].text
                work_times = outer_positions[1].find_element(By.TAG_NAME, "span").text
                location = outer_positions[2].find_element(By.TAG_NAME, "span").text
        elif len(outer_positions) == 2:  # this is for when a person has multiple positions over time at one company
            company_div, work_times_div = outer_positions
            company = company_div.find_element(By.TAG_NAME, "span").text
            company_linkedin_url = ""
            print(colored(company, 'yellow'))
            positions_list = position_summary_text.find_element(By.CLASS_NAME, "pvs-list").find_element(By.CLASS_NAME, "pvs-list")
            for position in positions_list.find_elements(By.XPATH, "*"):
                print(colored('count position', "yellow"))
                position = position.find_element(By.CLASS_NAME, "pvs-entity")
                position_details_list = position.find_elements(By.XPATH, "*")[1].find_elements(By.XPATH, "*")
                position_summary_details = position_details_list[0] if len(position_details_list) > 0 else None
                position_summary_text = position_details_list[1] if len(position_details_list) > 1 else None  # skills OR list of positions
                outer_positions = position_summary_details.find_element(By.XPATH, "*").find_elements(By.XPATH, "*")
                if len(outer_positions) == 3:
                    position_title = outer_positions[0].find_elements(By.XPATH, "*")[0].find_elements(By.XPATH, "*")[0].find_elements(By.XPATH, "*")[0].find_elements(By.XPATH, "*")[0].text
                    print(colored(position_title, 'yellow'))
                    work_times = outer_positions[1].find_element(By.TAG_NAME, "span").text
                    location = outer_positions[2].find_element(By.TAG_NAME, "span").text
                else:
                    print('need fix.')
                if 'work_times' not in locals() and 'work_times' not in globals():
                    work_times = None  # modified
                times = work_times.split("·")[0].strip() if work_times else ""
                duration = work_times.split("·")[1].strip() if times != "" and len(work_times.split("·")) > 1 else None  # modified
                from_date = " ".join(times.split(" ")[:2]) if times else ""
                to_date = " ".join(times.split(" ")[3:]) if times else ""
                if position_summary_text and len(position_summary_text.find_element(By.CLASS_NAME, "pvs-list").find_element(By.CLASS_NAME, "pvs-list").find_elements(By.XPATH, "li")) > 1:
                    descriptions = position_summary_text.find_element(By.CLASS_NAME, "pvs-list").find_element(By.CLASS_NAME, "pvs-list").find_elements(By.XPATH, "li")
                    for description in descriptions:
                        res = description.find_element(By.TAG_NAME, "a").find_elements(By.XPATH, "*")
                        position_title_elem = res[0] if len(res) > 0 else None
                        work_times_elem = res[1] if len(res) > 1 else None
                        location_elem = res[2] if len(res) > 2 else None
                        location = location_elem.find_element(By.XPATH, "*").text if location_elem else None
                        position_title = position_title_elem.find_element(By.XPATH, "*").find_element(By.TAG_NAME, "*").text if position_title_elem else ""
                        work_times = work_times_elem.find_element(By.XPATH, "*").text if work_times_elem else ""
                        times = work_times.split("·")[0].strip() if work_times else ""
                        duration = work_times.split("·")[1].strip() if len(work_times.split("·")) > 1 else None
                        from_date = " ".join(times.split(" ")[:2]) if times else ""
                        to_date = " ".join(times.split(" ")[3:]) if times else ""
                        experience = Experience(
                            position_title=position_title,
                            from_date=from_date,
                            to_date=to_date,
                            duration=duration,
                            location=location,
                            description=description,
                            institution_name=company if 'company' in locals() or 'company' in globals() else "Not provided",  # modified
                            linkedin_url=company_linkedin_url
                        )
                        self.add_experience(experience)
                else:
                    description = position_summary_text.text if position_summary_text else ""
                    experience = Experience(
                        position_title=position_title,
                        from_date=from_date,
                        to_date=to_date,
                        duration=duration,
                        location=location,
                        description=description,
                        institution_name=company,
                        linkedin_url=company_linkedin_url
                    )
                    self.add_experience(experience)
            return
        if 'work_times' not in locals() and 'work_times' not in globals():
            work_times = None
        times = work_times.split("·")[0].strip() if work_times else ""
        duration = work_times.split("·")[1].strip() if times != "" and len(work_times.split("·")) > 1 else None
        from_date = " ".join(times.split(" ")[:2]) if times else ""
        to_date = " ".join(times.split(" ")[3:]) if times else ""
        if position_summary_text and len(position_summary_text.find_element(By.CLASS_NAME, "pvs-list").find_element(By.CLASS_NAME, "pvs-list").find_elements(By.XPATH, "li")) > 1:
            descriptions = position_summary_text.find_element(By.CLASS_NAME, "pvs-list").find_element(By.CLASS_NAME, "pvs-list").find_elements(By.XPATH, "li")
            for description in descriptions:
                res = description.find_element(By.TAG_NAME, "a").find_elements(By.XPATH, "*")
                position_title_elem = res[0] if len(res) > 0 else None
                work_times_elem = res[1] if len(res) > 1 else None
                location_elem = res[2] if len(res) > 2 else None
                location = location_elem.find_element(By.XPATH, "*").text if location_elem else None
                position_title = position_title_elem.find_element(By.XPATH, "*").find_element(By.TAG_NAME, "*").text if position_title_elem else ""
                work_times = work_times_elem.find_element(By.XPATH, "*").text if work_times_elem else ""
                times = work_times.split("·")[0].strip() if work_times else ""
                duration = work_times.split("·")[1].strip() if len(work_times.split("·")) > 1 else None
                from_date = " ".join(times.split(" ")[:2]) if times else ""
                to_date = " ".join(times.split(" ")[3:]) if times else ""
                experience = Experience(
                    position_title=position_title,
                    from_date=from_date,
                    to_date=to_date,
                    duration=duration,
                    location=location,
                    description=description,
                    institution_name=company if 'company' in locals() or 'company' in globals() else "Not provided",  # modified
                    linkedin_url=company_linkedin_url
                )
                self.add_experience(experience)
        else:
            description = position_summary_text.text if position_summary_text else ""
            experience = Experience(
                position_title=position_title,
                from_date=from_date,
                to_date=to_date,
                duration=duration,
                location=location,
                description=description,
                institution_name=company,
                linkedin_url=company_linkedin_url
            )
            self.add_experience(experience)
```
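The `work_times` parsing above can be exercised in isolation; this sketch wraps the same split-on-`·` logic in a hypothetical `parse_work_times` helper (not part of the library) so it can be tested without a browser:

```python
def parse_work_times(work_times):
    """Split a LinkedIn date string like 'Jan 2020 - Mar 2022 · 2 yrs 3 mos'
    into (from_date, to_date, duration), mirroring the logic above."""
    # Everything before the '·' is the date range; after it, the duration.
    times = work_times.split("·")[0].strip() if work_times else ""
    duration = work_times.split("·")[1].strip() if len(work_times.split("·")) > 1 else None
    # The range is "<Mon> <Year> - <Mon> <Year>" (or "- Present"):
    from_date = " ".join(times.split(" ")[:2]) if times else ""
    to_date = " ".join(times.split(" ")[3:]) if times else ""
    return from_date, to_date, duration


print(parse_work_times("Jan 2020 - Mar 2022 · 2 yrs 3 mos"))
# ('Jan 2020', 'Mar 2022', '2 yrs 3 mos')
print(parse_work_times("Jun 2021 - Present · 1 yr"))
# ('Jun 2021', 'Present', '1 yr')
```

Note this parsing assumes the LinkedIn UI is set to English; other locales format the date range differently.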
This is from ~ a week ago, hopefully still working.
Still facing the same issue, even with this update.
I just deployed a fix. Please try with v2.11.2.
Thanks, it works. I tested it on 2 profiles but haven't tested at scale yet. I'm new to git and don't know how to submit a PR, but company.py doesn't work either. The updates needed are:
- Change the class name to mb6 in line 210: `grid = driver.find_element(By.CLASS_NAME, "mb6")  # used to be artdeco-card.p5.mb4`
- Change the class name to mb1 in line 241: `grid = driver.find_element(By.CLASS_NAME, "mb1")  # used to be mt1`

And now it works for me.
Hi, it is not working for me even after changing the properties as shown. I was checking https://www.linkedin.com/company/google.
The way I troubleshot it is:
- Try to identify the part of the scraping that is failing
- See the error
- Try to understand what it is doing and what its function is
- Find the block that this part is trying to find (might be challenging, as sometimes it's not clear)
- Find the new element name, etc.

You can send the full error message and I can try to help out.