Web-Crawler
A multithreaded web crawler implemented with two synchronization mechanisms: a single lock and thread-safe data structures
Multi Threaded Web-Crawler
Description
The goal of this project is to create a multi-threaded web crawler. A web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. Search engines use these indexes, web graphs, and an appropriate algorithm (such as PageRank) to rank pages. The main focus of the project is to implement a multi-threaded downloader that can download multiple websites at the same time. The project is implemented in C/C++.
Features/Deliverables
- Part 1: Multi-threaded web-crawler
- Part 2: (Extended scope) Web Ranking
Simple Crawler Flowchart
Table of contents
- Single Threaded Web Crawler
  - Single Threaded Components
  - Single Threaded Pseudocode
  - How to run single threaded web crawler
- MULTITHREADED Web Crawler
  - Multithreaded Components
  - Multithreaded Pseudocode
- Different locking techniques
  - Using SINGLE LOCK technique
  - Using THREAD SAFE DATA STRUCTURE technique
  - How to run multi threaded web crawler
- Website domain ranking algorithms
  - Simple counter based
  - Sampling based PageRank algorithm
  - Iterative based PageRank algorithm
- Graphs
- Libraries used
- Future extension of project
- Repo structure
- How to reuse this repo
- Credits
- Contributors
Single Threaded Web Crawler
Single Threaded Components
- HTTP website downloader
  - using socket library
- HTTPS website downloader
  - using openssl library
- HTML file parser
  - using regex
- Domain extractor
  - using regex (a minimal sketch follows this list)
- Crawler loop
- Website ranker
  - using a simple counter
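Both the HTML parser and the domain extractor are built on regular expressions. Below is a minimal sketch of a regex based domain extractor in C++; it is only an illustration of the technique, not the code used in this repo, and the function name extractDomain and the regex pattern are assumptions.

#include <iostream>
#include <regex>
#include <string>

// Extract the domain name (host) from a URL,
// e.g. "https://www.mygov.in/covid-19" -> "www.mygov.in".
std::string extractDomain(const std::string &url) {
    // scheme://host/...  -- capture everything between "://" and the next '/', ':' or '?'
    static const std::regex domainRegex(R"(^[A-Za-z]+://([^/:?]+))");
    std::smatch match;
    if (std::regex_search(url, match, domainRegex)) {
        return match[1].str();
    }
    return "";
}

int main() {
    std::cout << extractDomain("https://www.mygov.in/covid-19") << "\n";   // www.mygov.in
    std::cout << extractDomain("http://tourism.gov.in/about-us") << "\n";  // tourism.gov.in
    return 0;
}

The link extractor works on the same principle, with a pattern that matches href attributes in the downloaded HTML.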
Single Threaded Pseudocode
...
while(!mainQueue.empty() && totalVisitedPages < pagelimit){
    // pop the next website, download it, parse links/domains and update the ranking
    ...
}
How to run single threaded web crawler
- Use make to compile the program.
- maxlinks and pagelimit can be given as arguments with the make command. For e.g.
  make maxlinks=100 pagelimit=100
- Here the arguments are:
  - maxlinks: Maximum number of links to be extracted from a website
  - pagelimit: Maximum number of websites to be downloaded while crawling
Back to Table of Contents
MULTITHREADED Web Crawler
Multithreaded Components
- Crawler as a thread controller
- Child thread
  - HTML downloader
  - Link extractor
  - Domain extractor
  - Ranking using different algorithms
Multithreaded Pseudocode
Crawler loop code
...
while(1){
    if(pagesLimitReached || visitedpages >= pagesLimit){
        pagesLimitReached = true;
        if(w_threads){                                  // some child threads are still working
            gotosleep();                                // wait for them to finish
        }
        else {
            break;                                      // limit reached and no workers left
        }
    }
    else{
        if(w_threads < maxThreads && queueSize > 0){    // room for another worker and work to do
            createThread();
        }
        else if(w_threads == 0){
            break;                                      // queue empty and no workers left
        }
        else{
            gotosleep();                                // wait until a child wakes the crawler
        }
    }
}
...
Child Thread code
...
download(url);                          // fetch the HTML of the assigned website
parse(url);                             // extract links and domain names
update(queue, visitedLinks, ranking);   // push new links, mark as visited, update ranks
if(pagesLimitReached){
    if(workingThreads == 0){
        wake_parent();                  // last child: wake the crawler so it can exit
    }
}
else{
    wake_parent();                      // let the crawler schedule the next website
}
...
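The gotosleep() / wake_parent() hand-off between the crawler and its child threads maps naturally onto a mutex plus a condition variable (std::condition_variable is listed in the credits). The sketch below is a minimal illustration of that pattern under assumed names (crawlerMutex, crawlerCV, childFinished); it is not the repo's exact implementation.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex crawlerMutex;
std::condition_variable crawlerCV;
bool childFinished = false;    // set by a child thread, consumed by the crawler

// Called by the crawler (controller) thread: block until some child reports completion.
void gotosleep() {
    std::unique_lock<std::mutex> lock(crawlerMutex);
    crawlerCV.wait(lock, [] { return childFinished; });
    childFinished = false;     // reset so that the next gotosleep() blocks again
}

// Called by a child thread once it has downloaded and parsed its website.
void wake_parent() {
    {
        std::lock_guard<std::mutex> lock(crawlerMutex);
        childFinished = true;
    }
    crawlerCV.notify_one();
}

int main() {
    std::thread child([] {
        // download(url); parse(url); update(...); would happen here
        wake_parent();
    });
    gotosleep();               // the crawler sleeps until the child wakes it
    child.join();
    std::cout << "child finished, crawler resumes\n";
    return 0;
}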
Different locking techniques
- Using SINGLE LOCK technique
  - Having one lock for all of our shared variables
  - Pros: Easy to implement
  - Cons: The lock has to be held for longer stretches of time, which can degrade into effectively serial processing
- Using THREAD SAFE DATA STRUCTURE technique
  - Each data structure gets its own lock (or a reader-writer lock where needed)
  - Waiting time is distributed over the different locks
  - Different thread safe data structures (a minimal thread-safe queue sketch follows this list):
    - Thread safe integer
    - Thread safe queue
    - Thread safe map
  - Pros: No lock needs to be held for long, so concurrency can be exploited efficiently on multi-processor CPUs
  - Cons: Overhead due to multiple locks
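As a concrete illustration of the thread safe data structure technique described above, the sketch below wraps std::queue in a class whose every operation takes a single internal lock. The class name ThreadSafeQueue and its interface are assumptions for illustration, not the repo's exact code; the thread safe integer and map follow the same wrapping idea.

#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <string>

// A URL queue whose individual operations are protected by one internal lock.
class ThreadSafeQueue {
public:
    void push(const std::string &url) {
        std::lock_guard<std::mutex> lock(mtx_);
        q_.push(url);
    }

    // Returns std::nullopt instead of blocking when the queue is empty.
    std::optional<std::string> pop() {
        std::lock_guard<std::mutex> lock(mtx_);
        if (q_.empty()) return std::nullopt;
        std::string url = q_.front();
        q_.pop();
        return url;
    }

    bool empty() {
        std::lock_guard<std::mutex> lock(mtx_);
        return q_.empty();
    }

private:
    std::mutex mtx_;
    std::queue<std::string> q_;
};

int main() {
    ThreadSafeQueue queue;
    queue.push("https://www.mygov.in");
    if (auto url = queue.pop()) std::cout << *url << "\n";
    return 0;
}

Because the lock is held only for the duration of a single push or pop, no thread keeps any lock for long, which is exactly the trade-off described in the pros and cons above.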
How to run multi threaded web crawler
- Use make to compile the program.
- maxlinks, pagelimit and threads can be given as arguments with the make command. For e.g.
  make maxlinks=100 pagelimit=100 threads=20
- Here the arguments are:
  - maxlinks: Maximum number of links to be extracted from a website
  - pagelimit: Maximum number of websites to be downloaded while crawling
  - threads: Maximum number of threads to be created
  - rankerFlag: Flag to choose which ranking algorithm is executed (n = simple counter based web ranker, sp = pagerank with sampling, ip = pagerank with iteration)
Back to Table of Contents
Website domain ranking algorithms
- Simple counter based: The intuition behind this approach is to increase the rank of a domain name every time we visit it.
- PageRank algorithm: The intuition behind our pagerank algorithm is as follows. Suppose a random surfer randomly chooses a starting website. It then chooses the next website from the websites linked to by the current one with probability 0.85, or from all available websites with probability 0.15.
  In this way, the importance of a website is measured by how often the random surfer visits it. Hence, a website is important if it is linked to by many important websites.
  There are two ways to implement this algorithm:
  - Iterative based PageRank algorithm
  - Sampling based PageRank algorithm
Simple counter based
...
corpus = read(csv_file)      # corpus[website] = list of domains linked from that website
for website in corpus.keys():
    for x in corpus[website]:
        rank[x] += 1         # a domain's rank grows every time it is encountered while crawling
...
Demo run
------------------------------------------------
Domain Name rankings using counter
------------------------------------------------
................................................
Domain Name Rank
................................................
1 . mygov.in 43
2 . main.ayush.gov.in 36
3 . tourism.gov.in 24
4 . digitalindia.gov.in 19
5 . asi.nic.in 16
------------------------------------------------------------
Back to Table of Contents
Sampling based PageRank algorithm
In this approach, the random surfer chooses the next website a fixed number of times (the number of samples); at every step the choice is made over all websites, weighted as per the pagerank intuition above.
...
DAMPING = 0.85
SAMPLES = 10000
corpus = read(csv_file)                         # corpus[website] = set of websites it links to
page = random.choice(list(corpus.keys()))       # random starting website
for i in range(SAMPLES):
    for x in corpus.keys():
        if x in corpus[page]:
            model[x] = (DAMPING * (1/len(corpus[page]))) + ((1-DAMPING) * (1/total_websites))
        else:
            model[x] = (1-DAMPING) * (1/total_websites)
    page = random.choices(list(corpus.keys()), weights=model.values())[0]
    pagerank[page] += (1/SAMPLES)
return pagerank
...
Demo run
-------------------------------------------------------------
Domain Name ranking using PageRank: Sampling (n = 10000)
-------------------------------------------------------------
................................................
Domain Name Rank
................................................
1 . haryana.mygov.in 0.1290000000000021
2 . chhattisgarh.mygov.in 0.11000000000000212
3 . blog.mygov.in 0.07730000000000119
4 . mygov.in 0.07260000000000105
5 . aatmanirbharbharat.mygov.in 0.04840000000000036
------------------------------------------------------------
Back to Table of Contents
Iterative based PageRank algorithm
The intuition behind ranking using the iterative pagerank algorithm is as follows. We initialize the probability of the surfer visiting a given website to 1/total_websites, and then repeatedly update the probability of every website according to the formula below:

PR(x) = (1 - DAMPING)/total_websites + DAMPING * sum over every website v linking to x of ( PR(v) / number_of_links(v) )

We stop iterating when the difference between the old and updated probabilities is less than a certain threshold.
...
DAMPING = 0.85
THRESHOLD = 0.001
corpus = read(csv_file)                               # corpus[website] = set of websites it links to
while(1):
    before = [pagerank[v] for v in pagerank.keys()]
    for x in corpus.keys():
        new_rank = (1-DAMPING) / number_of_websites
        for v in websites_linking_to(x):              # every website v that has a link to x
            new_rank += DAMPING * (pagerank[v] / len(corpus[v]))
        pagerank[x] = new_rank
    if change(before, [pagerank[v] for v in pagerank.keys()]) < THRESHOLD:
        break
...
Demo run
----------------------------------------------
Domain Name ranking using PageRank: Iteration
----------------------------------------------
................................................
Domain Name Rank
................................................
1 . india.gov.in 0.01762736346840192
2 . digitalindia-gov.zoom.us 0.017587054793058533
3 . tourism.gov.in 0.016866581191734113
4 . digitalindiaawards.gov.in 0.014974653859083446
5 . mygov.in 0.0122561916194452
------------------------------------------------------------
Back to Table of Contents
Graphs
- We have plotted the number of threads vs time. From this graph, we can infer the following:
  - When the number of threads is very low, the time required for crawling is large
  - Time also increases when the number of threads becomes very large, because of locking overhead
  - With an optimal number of threads, concurrent crawling pays off

** The above graph varies whenever you run the code. The inferences are based on the general trend seen in most runs.
- We have also plotted time against the pagelimit parameter. From this graph, we can infer the following:
  - As pagelimit increases, crawler time increases
  - Multithreaded with a single lock works better than single threaded because of concurrency
  - Multithreaded with thread safe data structures performs worse than the other two approaches because of locking overhead: each data structure has its own lock, and while crawling these locks must be acquired and released back to back to update every structure, which adds significant overhead and increases the time noticeably

Back to Table of Contents
Libraries used
- Sockets
- OpenSSL
- Pthread library
  - For concurrency and synchronization techniques
    - Locks
      - Single locks
      - Reader Writer locks
    - Condition Variables
- Matplotlib
  - Plotting the graphs
Future extension of project
- Efficient downloader to download all websites
- Efficient parser to parse large websites
- Creating a search engine by using web indexing and web scraping
Repo structure
- Three main branches:
  - main: contains the multithreaded web crawler using thread safe data structures
  - MT_singlelock: contains the multithreaded web crawler using a single lock
  - single_threaded: contains the single threaded web crawler
Back to Table of Contents
How to reuse this repo
- Fork the repo from the top right corner of this page
- Run the following in your terminal:
  $ git clone https://github.com/[USERNAME]/Web-Crawler
  $ cd Web-Crawler
  $ make
- Pull requests are highly appreciated
- All the code is well commented; in case of queries, feel free to contact us
Presentation link
Credits
- How to write a multi-threaded webcrawler
- DOWNLOADING A WEB PAGE IN C USING A SOCKET
- std::condition_variable
- Scrapy
- PageRank
Contributors
- Anupam Kumar
- Preeti Chiluveru
- Shruti Katpara
- Vivek Modi