linkedin-public-dir-companies
Crawler and scraper of the public directory of companies on LinkedIn.
[Crawler + Scraper] LinkedIn Public Directory Companies
Prerequisites
- Python 3.7
sudo apt-get install python3.7
- Pip
sudo apt-get install python3-pip
- VirtualEnv
sudo pip3 install virtualenv
- MongoDB with the collections linkedin_companies, linkedin_crawlers and linkedin_scrapers
- Write permission in the app directory, so that cookies can be saved
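The write-permission prerequisite can be checked before the first run. This is a minimal sketch, assuming cookies are written directly into the directory you pass in; `is_writable_dir` is a hypothetical helper and not part of the project:

```python
import os


def is_writable_dir(path: str) -> bool:
    """Return True if `path` is an existing directory the current user can write to."""
    # W_OK covers write access; X_OK is also needed to create files inside a directory.
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)
```

For example, `is_writable_dir(os.getcwd())` run from the project root should return True before you start the crawler.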
Considerations
To run the crawler and scraper at scale, you will need a residential proxy server.
Installation
Clone the project:
git clone [email protected]:robertoarruda/linkedin-public-dir-companies.git
Enter the project directory:
cd ./linkedin-public-dir-companies
Create the Environment:
Within the project root, run the command below:
virtualenv venv --python=python3.7
Activate the environment:
Run the command below to activate it:
source venv/bin/activate
Install dependencies:
Run the command below to install the project dependencies:
pip install -r requirements.txt
Configure MongoDB
Enter the database connection settings in the client_db.py file.
class ClientDB():
    __MONGO = 'mongodb://root:[email protected]:80'
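If the MongoDB password contains characters such as `@` or `:`, they must be percent-encoded inside the connection string. The sketch below shows one way to build the value for `__MONGO`; `build_mongo_uri` is a hypothetical helper, and the default port 27017 is MongoDB's standard port rather than a value taken from this project:

```python
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, host: str, port: int = 27017) -> str:
    """Build a mongodb:// connection string with percent-encoded credentials."""
    # quote_plus escapes reserved characters such as '@', ':' and '/'.
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}"
```

The returned string can then be pasted into client_db.py in place of the example value.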
[Optional step] Configure the residential proxy
Enter the host of your residential proxy server in the main.py file.
class Main():
    __PROXIES = {
        'http': 'http://127.0.0.1:80'
    }
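HTTP clients such as requests and urllib pick a proxy by the scheme of the target URL, so a mapping with only an `http` key would not apply to `https://` pages. The sketch below assumes the project hands `__PROXIES` to such a client, which this README does not confirm; `build_proxies` and `make_opener` are hypothetical helpers:

```python
import urllib.request


def build_proxies(host: str, port: int) -> dict:
    """Return a proxy mapping that covers both http and https targets."""
    proxy_url = f"http://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def make_opener(proxies: dict) -> urllib.request.OpenerDirector:
    """Build a standard-library opener that routes requests through the proxies."""
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
```

Building the opener does not send any traffic; requests go through the proxy only when `opener.open(url)` is called.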
Execute the crawler:
Execute the command below to run the crawler:
python main.py crawler
The crawler data is saved in the linkedin_crawlers collection. The crawled companies are saved in the linkedin_companies collection.
Execute the scraper:
Execute the command below to run the scraper:
python main.py scraper
The scraper data is saved in the linkedin_scrapers collection. The scraped companies are updated in the linkedin_companies collection.
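The two commands suggest that main.py selects its mode from the first command-line argument. The following is only a sketch of that dispatch pattern using argparse; the project's actual main.py may be organized differently, and `run_crawler`, `run_scraper` and `MODES` are placeholder names:

```python
import argparse


def run_crawler() -> str:
    # Placeholder: the real crawler would walk the public directory here.
    return "crawler"


def run_scraper() -> str:
    # Placeholder: the real scraper would update the crawled companies here.
    return "scraper"


# Map each command-line mode to the function that handles it.
MODES = {"crawler": run_crawler, "scraper": run_scraper}


def main(argv=None) -> str:
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("mode", choices=sorted(MODES), help="stage to run")
    args = parser.parse_args(argv)
    return MODES[args.mode]()
```

With this layout, any argument other than crawler or scraper is rejected by argparse with a usage message.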
Deactivate the environment:
Execute the command below to deactivate:
deactivate