Robots.txt parser
An easy-to-use, extensible robots.txt parser library with full support for every directive and specification.
Use cases:
- Permission checks
- Fetch crawler rules
- Sitemap discovery
- Host preference
- Dynamic URL parameter discovery
- robots.txt rendering
Advantages
(compared to most other robots.txt libraries)
- Automatic robots.txt download (optional).
- Integrated caching system (optional).
- Crawl-delay handler (see the sketch after this list).
- Documentation available.
- Support for every directive, from every specification.
- HTTP status code handler, according to Google's spec.
- Dedicated User-Agent parser and group determiner library, for maximum accuracy.
- Provides additional data like preferred host, dynamic URL parameters, Sitemap locations, etc.
- Protocols supported: HTTP, HTTPS, FTP, SFTP and FTP/S.
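For instance, the crawl-delay and status-code handlers are exposed through the same client API introduced under Getting started below. Here is a minimal sketch built only from calls shown elsewhere in this README; the 503 behaviour in the final comment is an assumption based on Google's 5xx rule, not a documented return value.
<?php
require __DIR__ . '/vendor/autoload.php';

$robotsTxtContent = "User-agent: MyBot\nDisallow: /private/\nCrawl-delay: 5\n";

// Parse an already fetched robots.txt together with the HTTP status code of the response
$client = new vipnytt\RobotsTxtParser\TxtClient('http://example.com', 200, $robotsTxtContent);

// Crawl-delay handler: the delay (in seconds) declared for this user-agent
$delay = $client->userAgent('MyBot')->crawlDelay()->getValue(); // 5

// Permission check against the rules above
$allowed = $client->userAgent('MyBot')->isAllowed('http://example.com/private/page.html'); // false

// Status-code handling (assumption): per Google's spec, a 5xx response means crawling
// should be deferred, so a check like this is expected to report the URI as disallowed.
$unavailable = new vipnytt\RobotsTxtParser\TxtClient('http://example.com', 503, '');
$deferred = $unavailable->userAgent('MyBot')->isDisallowed('http://example.com/'); // expected: true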
Requirements:
Installation
The recommended way to install the robots.txt parser is through Composer. Add this to your composer.json file:
{
  "require": {
    "vipnytt/robotstxtparser": "^2.1"
  }
}
Then run: composer update
Getting started
Basic usage example
<?php
require __DIR__ . '/vendor/autoload.php';

// UriClient fetches and parses http://example.com/robots.txt automatically
$client = new vipnytt\RobotsTxtParser\UriClient('http://example.com');

if ($client->userAgent('MyBot')->isAllowed('http://example.com/somepage.html')) {
    // Access is granted
}
if ($client->userAgent('MyBot')->isDisallowed('http://example.com/admin')) {
    // Access is denied
}
A small excerpt of basic methods
<?php
// Syntax: $baseUri, [$statusCode:int|null], [$robotsTxtContent:string], [$encoding:string], [$byteLimit:int|null]
// $robotsTxtContent is the raw robots.txt body you have already fetched
$client = new vipnytt\RobotsTxtParser\TxtClient('http://example.com', 200, $robotsTxtContent);
// Permission checks
$allowed = $client->userAgent('MyBot')->isAllowed('http://example.com/somepage.html'); // bool
$denied = $client->userAgent('MyBot')->isDisallowed('http://example.com/admin'); // bool
// Crawl delay rules
$crawlDelay = $client->userAgent('MyBot')->crawlDelay()->getValue(); // float | int
// Dynamic URL parameters
$cleanParam = $client->cleanParam()->export(); // array
// Preferred host
$host = $client->host()->export(); // string | null
$host = $client->host()->getWithUriFallback(); // string
$host = $client->host()->isPreferred(); // bool
// XML Sitemap locations
$sitemaps = $client->sitemap()->export(); // array
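Building on the excerpt above, here is a short usage sketch that combines the host and sitemap calls; it assumes sitemap()->export() returns a flat list of sitemap URLs (check the cheat-sheet for the exact shape).
<?php
require __DIR__ . '/vendor/autoload.php';

$robotsTxtContent = "User-agent: *\nDisallow: /admin\nHost: example.com\nSitemap: http://example.com/sitemap.xml\n";
$client = new vipnytt\RobotsTxtParser\TxtClient('http://example.com', 200, $robotsTxtContent);

// Preferred host, falling back to the host of the base URI when none is declared
$preferredHost = $client->host()->getWithUriFallback(); // string

// Seed a crawl queue with any sitemaps advertised in robots.txt
// (assumes export() returns a plain array of sitemap URI strings)
foreach ($client->sitemap()->export() as $sitemapUri) {
    echo $preferredHost . ' advertises sitemap: ' . $sitemapUri . PHP_EOL;
}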
The above is just a taste of the basics; a whole range of more advanced and specialized methods is available for almost any purpose. Visit the cheat-sheet for the technical details.
Visit the Documentation for more information.
Directives
Specifications
- Google robots.txt specifications
- Yandex robots.txt specifications
- W3C Recommendation HTML 4.01 specification
- Sitemaps.org protocol
- Sean Conner: "An Extended Standard for Robot Exclusion"
- Martijn Koster: "A Method for Web Robots Control"
- Martijn Koster: "A Standard for Robot Exclusion"
- RFC 7231, ~~2616~~
- RFC 7230, ~~2616~~
- RFC 5322, ~~2822~~, ~~822~~
- RFC 3986, ~~1808~~
- RFC 1945
- RFC 1738
- RFC 952

