Web-Crawler

A multithreaded web crawler using two mechanisms: a single lock and thread-safe data structures


Multi Threaded Web-Crawler

Description

The goal of this project is to create a multi-threaded web crawler. A web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A search engine uses these indexes, web graphs, and an appropriate algorithm (such as PageRank) to rank the pages. The main focus of the project is to implement a multi-threaded downloader that can download multiple websites at the same time. The plan is to implement this in C or C++.

Features/Deliverables

  • Part 1: Multi-threaded web-crawler
  • Part 2: (Extended scope) Web Ranking

Simple Crawler Flowchart

flowchart

Table of contents

  • Single Threaded Web Crawler
    • Single Threaded Components
    • Single Threaded Pseudocode
    • How to run single threaded web crawler
  • MULTITHREADED Web Crawler
    • Multithreaded Components
    • Multithreaded Pseudocode
    • Different locking techniques
      • Using SINGLE LOCK technique
      • Using THREAD SAFE DATA STRUCTURE technique
    • How to run multi threaded web crawler
  • Website domain ranking algorithms
    • Simple counter based
    • Sampling based PageRank algorithm
    • Iterative based PageRank algorithm
  • Graphs
  • Libraries used
  • Future extension of project
  • Repo structure
  • How to reuse this repo
  • Credits
  • Contributors

Single Threaded Web Crawler

Single Threaded Components

  • HTTP website downloader
    • using socket library
  • HTTPs websites downloader
    • using openssl library
  • HTML file parser
    • using regex (a short regex sketch follows this list)
  • Domain extractor
    • using regex
  • Crawler loop
  • Website ranker
    • using a simple counter
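
As a rough illustration of the regex-based parser and domain extractor above, here is a minimal, self-contained C++ sketch; the patterns and identifiers are illustrative assumptions, not the project's exact code:

#include <iostream>
#include <regex>
#include <set>
#include <string>

int main() {
    // A downloaded page (normally produced by the socket/OpenSSL downloader)
    std::string html =
        "<a href=\"https://mygov.in/home\">a</a>"
        "<a href=\"http://tourism.gov.in/\">b</a>";

    std::regex linkRe("href=\"(https?://[^\"]+)\"");   // link extractor
    std::regex domainRe("https?://([^/\"]+)");         // domain extractor

    std::set<std::string> domains;
    for (auto it = std::sregex_iterator(html.begin(), html.end(), linkRe);
         it != std::sregex_iterator(); ++it) {
        std::string url = (*it)[1];
        std::smatch m;
        if (std::regex_search(url, m, domainRe))
            domains.insert(m[1]);
    }
    for (const std::string& d : domains)
        std::cout << d << "\n";                        // prints the unique domains
}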

Single Threaded Pseudocode

    ...
    while(!mainQueue.empty() && totalVisitedPages < pagelimit){
        ...   // download, parse, and rank the next website in the queue
    }

How to run single threaded web crawler

  • use make to compile the program
  • maxlinks and pagelimit can be given as arguments with the make command.
    • For e.g. make maxlinks=100 pagelimit=100
    • Here the arguments are:
      • maxlinks: Maximum number of links to be extracted from a website
      • pagelimit: Maximum number of websites to be downloaded while crawling

Back to Table of Contents

MULTITHREADED Web Crawler

Multithreaded Components

  • Crawler as a thread controller
  • Child thread
    • HTML downloader
    • Link extractor
    • Domain extractor
    • Ranking using different algorithms

Multithreaded Pseudocode

Crawler loop code

...
while(1){
    if(pagesLimitReached || visitedpages >= pagesLimit){
        pagesLimitReached = true;
        if(w_threads){
            gotosleep();      // wait for the working threads to finish
        }
        else {
            break;
        }
    }
    else{
        if(w_threads < maxThreads && queue_size > 0){
            createThread();   // pages left to crawl and thread budget available
        }
        else if(w_threads == 0){
            break;            // nothing left to crawl and no child thread running
        }
        else{
            gotosleep();      // wait for a child thread to wake the crawler
        }
    }
}
...

Child Thread code


...
download(url);                         // fetch the page (HTTP/HTTPS downloader)
parse(url);                            // extract links and the domain name
update(queue, visitedLinks, ranking);  // push new links, mark visited, update ranks
if(pagesLimitReached){
    if(workingThreads == 0){
        wake_parent();                 // last running child wakes the crawler
    }
}
else{
    wake_parent();
}
...
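
The gotosleep() / wake_parent() handshake used in the pseudocode above maps naturally onto a pthread condition variable (pthreads and condition variables are listed under Libraries used). The following is a minimal, illustrative reduction with assumed names, not the project's actual controller:

#include <pthread.h>
#include <cstdio>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
int workingThreads = 0;

void* childThread(void*) {
    // ... download(url); parse(url); update(queue, visitedLinks, ranking); ...
    pthread_mutex_lock(&lock);
    workingThreads--;
    pthread_cond_signal(&cond);            // wake_parent()
    pthread_mutex_unlock(&lock);
    return nullptr;
}

int main() {
    pthread_t tid;
    pthread_mutex_lock(&lock);
    workingThreads++;
    pthread_create(&tid, nullptr, childThread, nullptr);   // createThread()
    while (workingThreads > 0)
        pthread_cond_wait(&cond, &lock);   // gotosleep(): releases the lock while waiting
    pthread_mutex_unlock(&lock);
    pthread_join(tid, nullptr);
    std::puts("all child threads finished");
}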

Different locking techniques

  • Using SINGLE LOCK technique

    • One lock guards all of the shared variables
    • Pros: Easy to implement
    • Cons: The lock has to be held for longer stretches of time, which can degrade the crawler to essentially serial processing
  • Using THREAD SAFE DATA STRUCTURE technique

    • Each data structure gets its own lock, or an RW lock where required (a minimal sketch follows this list)
    • Waiting time is distributed over the different locks
    • Different thread safe data structures:
      • Thread safe integer
      • Thread safe queue
      • Thread safe map
    • Pros: No lock has to be held for long, so concurrency can be exploited efficiently on multi-processor CPUs
    • Cons: Overhead due to multiple locks
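
A minimal sketch of the per-structure locking idea, written with std::mutex for brevity (the project itself uses the pthread library; the class and member names below are invented for illustration):

#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Each shared structure owns its own lock, so threads only contend on the
// structure they are actually touching.
class ThreadSafeQueue {
    std::queue<std::string> q;
    std::mutex m;
public:
    void push(const std::string& url) {
        std::lock_guard<std::mutex> g(m);
        q.push(url);
    }
    bool tryPop(std::string& url) {
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return false;
        url = q.front();
        q.pop();
        return true;
    }
};

class ThreadSafeCounter {
    int value = 0;
    std::mutex m;
public:
    void increment() { std::lock_guard<std::mutex> g(m); ++value; }
    int  get()       { std::lock_guard<std::mutex> g(m); return value; }
};

int main() {
    ThreadSafeQueue queue;
    ThreadSafeCounter visited;
    std::thread a([&]{ queue.push("https://mygov.in"); visited.increment(); });
    std::thread b([&]{ queue.push("https://tourism.gov.in"); visited.increment(); });
    a.join(); b.join();
    std::string url;
    while (queue.tryPop(url)) std::cout << url << "\n";
    std::cout << "visited: " << visited.get() << "\n";
}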

How to run multi threaded web crawler

  • use make to compile the program
  • maxlinks, pagelimit, threads, and rankerFlag can be given as arguments with the make command.
    • For e.g. make maxlinks=100 pagelimit=100 threads=20 rankerFlag=n
    • Here the arguments are:
      • maxlinks: Maximum number of links to be extracted from a website
      • pagelimit: Maximum number of websites to be downloaded while crawling
      • threads: Maximum number of threads to be created
      • rankerFlag: Flag to choose which ranking algorithm is executed (n = simple counter based web ranker, sp = PageRank with sampling, ip = PageRank with iteration)

Back to Table of Contents

Website domain ranking algorithms

  • Simple counter based: The intuition behind this approach is to increase the rank of a domain name every time we visit it.

  • PageRank algorithm: The intuition behind our PageRank algorithm is as follows. Suppose a random surfer starts at a randomly chosen website. From there, with probability 0.85 it picks the next website from the websites linked by the current one, and with probability 0.15 it picks the next website from all available websites.

    In this way, the importance of a website is measured by how often the random surfer visits it. Hence, a website is important if it is linked from many important websites.

    There are two ways to implement this algorithm:

    • Iterative based PageRank algorithm
    • Sampling based PageRank algorithm

Simple counter based


...
corpus = read(csv_file)                 # domain -> list of links recorded for it
for website in corpus.keys():
    for link in corpus[website]:
        rank[website] += 1              # one count per link recorded under this domain
...

Demo run

------------------------------------------------
  Domain Name rankings using counter
------------------------------------------------

................................................
  Domain Name            Rank
................................................

1 .  mygov.in                                  43
2 .  main.ayush.gov.in                         36
3 .  tourism.gov.in                            24
4 .  digitalindia.gov.in                       19
5 .  asi.nic.in                                16
------------------------------------------------------------

Back to Table of Contents

Sampling based PageRank algorithm

In this approach, the random surfer takes a fixed number of samples; at each step it chooses the next website from all websites, weighted according to the PageRank intuition above.
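
Written out, the weight (transition probability) that the sampler assigns to a candidate website x when the surfer is currently on page p is, with damping factor d = 0.85, N the total number of websites, and links(p) the set of websites linked from p:

P(x \mid p) = \frac{d}{|\text{links}(p)|} + \frac{1-d}{N}, \quad x \in \text{links}(p)
P(x \mid p) = \frac{1-d}{N}, \quad \text{otherwise}

This is exactly the model[x] computed in the pseudocode below.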


...
DAMPING = 0.85
SAMPLES = 10000

corpus = read(csv_file)                          # domain -> list of linked domains
page = random.choice(list(corpus.keys()))        # random starting website

for i in range(SAMPLES):
    # build the transition model for the current page
    for x in corpus.keys():
        if (x in corpus[page]):
            model[x] = (DAMPING * (1/len(corpus[page]))) + ((1-DAMPING) * (1/total_websites))
        else:
            model[x] = (1-DAMPING) * (1/total_websites)
    # sample the next page according to the model and record the visit
    page = random.choices(list(corpus.keys()), weights=model.values())[0]
    pagerank[page] += (1/SAMPLES)
return pagerank
...

Demo run

-------------------------------------------------------------
  Domain Name ranking using PageRank: Sampling (n = 10000)
-------------------------------------------------------------

................................................
  Domain Name            Rank
................................................

1 .  haryana.mygov.in                          0.1290000000000021
2 .  chhattisgarh.mygov.in                     0.11000000000000212
3 .  blog.mygov.in                             0.07730000000000119
4 .  mygov.in                                  0.07260000000000105
5 .  aatmanirbharbharat.mygov.in               0.04840000000000036
------------------------------------------------------------

Back to Table of Contents

Iterative based PageRank algorithm

The intuition behind ranking with the iterative PageRank algorithm is as follows. We initialize the probability of the surfer visiting a given website to 1/total_websites, and then repeatedly update the probability of every website according to the formula below.

Pagerank equation
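
For reference, the description above corresponds to the standard PageRank update, with damping factor d = 0.85, N the total number of websites, M(x) the set of websites that link to x, and L(v) the number of links on website v:

PR(x) = \frac{1-d}{N} + d \sum_{v \in M(x)} \frac{PR(v)}{L(v)}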

We stop iterating when the difference between the old and the updated probabilities falls below a certain threshold.


...
DAMPING = 0.85
THRESHOLD = 0.001
corpus = read(csv_file)                          # domain -> list of linked domains

# start from a uniform distribution over all websites
for x in corpus.keys():
    pagerank[x] = 1/total_websites

while(1):
    before = [pagerank[v] for v in pagerank.keys()]
    for x in corpus.keys():
        # every website v that links to x passes on a share of its own rank
        rank = (1-DAMPING)/total_websites
        for v in websites_linking_to(x):
            rank += DAMPING * (pagerank[v]/len(corpus[v]))
        pagerank[x] = rank
    if (change(before, [pagerank[v] for v in pagerank.keys()]) < THRESHOLD):
        break
...

Demo run

----------------------------------------------
  Domain Name ranking using PageRank: Iteration
----------------------------------------------

................................................
  Domain Name            Rank
................................................

1 .  india.gov.in                              0.01762736346840192
2 .  digitalindia-gov.zoom.us                  0.017587054793058533
3 .  tourism.gov.in                            0.016866581191734113
4 .  digitalindiaawards.gov.in                 0.014974653859083446
5 .  mygov.in                                  0.0122561916194452
------------------------------------------------------------

Back to Table of Contents

Graphs

  • We have made a graph of number of threads vs. time. From this graph, we can infer the following:
    • When the number of threads is very low, the time required for crawling is large
    • Time also increases when the number of threads becomes very large, because of locking overhead
    • With an optimal number of threads, concurrent crawling pays off
Crawler Analytics

** The above graph varies from run to run. The inferences above are drawn from the general trend seen in the graph most of the time

  • We have also made a graph varying the pagelimit parameter vs. time. From this graph, we can infer the following:
    • As pagelimit increases, crawling time increases
    • Multithreaded crawling with a single lock works better than single threaded crawling because of concurrency
    • Multithreaded crawling with thread safe data structures performs worse than the other two approaches because of locking overhead: each data structure has its own lock, and while crawling the threads have to acquire and release these locks back to back to update every structure, which adds significant overhead and increases the time noticeably
Page limit vs time elapsed

Back to Table of Contents

Libraries used

  • Sockets
  • OpenSSL
  • Pthread library
    • For concurrency and synchronization techniques
      • Locks
        • Single locks
        • Reader Writer locks
      • Condition Variables
  • Matplotlib
    • Plotting the graphs

Future extension of project

  • Efficient downloader to download all websites
  • Efficient parser to parse large websites
  • Creating a search engine by using web indexing and web scraping

Repo structure

  • Three main branches:
    • main: contains the multithreaded web crawler using thread safe data structures
    • MT_singlelock: contains the multithreaded web crawler using a single lock
    • single_threaded: contains the single threaded web crawler

Back to Table of Contents

How to reuse this repo

  • Fork the repo from the top right corner of this page
  • Run following in your terminal

    $ git clone https://github.com/[USERNAME]/Web-Crawler
    $ cd Web-Crawler
    $ make

Presentation link

Credits

Contributors

  • Anupam Kumar
  • Preeti Chiluveru
  • Shruti Katpara
  • Vivek Modi
