Minicrawler
A multiplexing web client with HTTP/2 support and a WHATWG-compliant URL parser, written in C.
Minicrawler parses URLs and executes HTTP (including HTTP/2) requests while handling cookies, network connection management, and SSL/TLS protocols. By default it follows redirect locations and returns the full response, the final URL, parsed cookies, and more. It is designed to handle many requests in parallel in a single thread: it multiplexes connections and runs the read/write communication asynchronously. The whole Minicrawler suite is licensed under the AGPL license.
URL Library (libminicrawler-url)
WHATWG URL Standard compliant parsing and serializing library written in C. It is fast and has only one external dependency – libicu. The library is licensed under the AGPL license.
Usage
#include <stdio.h>
#include <stdlib.h>
#include <minicrawler/minicrawler-url.h>

/**
 * First argument: input URL, second (optional): base URL
 */
int main(int argc, char *argv[]) {
    if (argc < 2) return 2;

    char *input = argv[1];
    char *base = NULL;
    if (argc > 2) {
        base = argv[2];
    }

    mcrawler_url_url url, *base_url = NULL;
    if (base) {
        base_url = (mcrawler_url_url *)malloc(sizeof(mcrawler_url_url));
        if (mcrawler_url_parse(base_url, base, NULL) == MCRAWLER_URL_FAILURE) {
            printf("Invalid base URL\n");
            return 1;
        }
    }

    if (mcrawler_url_parse(&url, input, base_url) == MCRAWLER_URL_FAILURE) {
        printf("Invalid URL\n");
        return 1;
    }

    printf("Result: %s\n", mcrawler_url_serialize_url(&url, 0));
    return 0;
}
More in test/url.c.
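Assuming the URL library installs a pkg-config file named libminicrawler-url-4 (an assumption, by analogy with the libminicrawler-4 name used below) and the example above is saved as url_example.c, it can be built and run like this:
cc url_example.c $(pkg-config --cflags --libs libminicrawler-url-4) -o url_example
./url_example /foo/bar http://example.com
Result: http://example.com/foo/bar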
Minicrawler Library (libminicrawler) Usage
#include <stdio.h>
#include <string.h>
#include <minicrawler/minicrawler.h>

static void onfinish(mcrawler_url *url, void *arg) {
    printf("%d: Status: %d\n", url->index, url->status);
}

int main(void) {
    mcrawler_url url[2];
    mcrawler_url *urls[] = {&url[0], &url[1], NULL};
    mcrawler_settings settings;
    memset(&url[0], 0, sizeof(mcrawler_url));
    memset(&url[1], 0, sizeof(mcrawler_url));
    mcrawler_init_url(&url[0], "http://example.com");
    url[0].index = 0;
    mcrawler_init_url(&url[1], "http://example.com");
    url[1].index = 1;
    mcrawler_init_settings(&settings);
    mcrawler_go(urls, &settings, &onfinish, NULL);
    return 0;
}
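mcrawler_go processes all URLs in parallel over multiplexed connections, invoking the onfinish callback for each URL as it completes. Using the pkg-config name from the linking section below, the example (saved as crawl.c) can be built like this:
cc crawl.c $(pkg-config --cflags --libs libminicrawler-4) -o crawl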
Minicrawler Binary Usage
minicrawler [options] [urloptions] url [[url2options] url2]...
Options
options:
-2 disable HTTP/2
-6 resolve host to IPv6 address only
-8 convert from page encoding to UTF-8
-A STRING custom user agent (max 255 bytes)
-b STRING cookies in the netscape/mozilla file format (max 20 cookies)
-c convert content to text format (with UTF-8 encoding)
-DMILIS set delay time in milliseconds when downloading more pages from the same IP (default is 100 ms)
-g accept gzip encoding
-h enable output of HTTP headers
-i enable impatient mode (minicrawler exits a few seconds earlier if it doesn't make enough progress)
-k disable SSL certificate verification (allow insecure connections)
-l do not follow redirects
-mINT maximum page size in MiB (default 2 MiB)
-pSTRING password for HTTP authentication (basic or digest, max 31 bytes)
-S disable SSL/TLS support
-tSECONDS set timeout (default is 5 seconds)
-u STRING username for HTTP authentication (basic or digest, max 31 bytes)
-v verbose output (to stderr)
-w STRING write this custom header to all requests (max 4095 bytes)
urloptions:
-C STRING parameter which replaces '%' in the custom header
-P STRING HTTP POST parameters
-X STRING custom request HTTP method, no validation performed (max 15 bytes)
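For example, a hypothetical invocation that accepts gzip, prints HTTP headers, uses a 10-second timeout, sets a custom user agent, and sends POST parameters to a single URL (all values are illustrative) could look like this:
minicrawler -g -h -t10 -A "my-agent/1.0" -P "q=hello" http://example.com/search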
Output header
Minicrawler prepends its own header to the output; its fields have the following meaning:
- URL: Requested URL
- Redirected-To: Final absolute URL
- Redirect-info: Info about each redirect
- Status: HTTP status of the final response (negative in case of error)
  - -10 Invalid input
  - -9, -8 DNS error
  - -7, -6 Connection error
  - -5 SSL/TLS error
  - -4, -3 Error during sending an HTTP request
  - -2 Error during receiving an HTTP response
  - -1 Decoding or converting error
- Content-length: Length of the downloaded content in bytes
- Timeout: Reason for the timeout, if the request timed out
- Error-msg: Error message in case of error (negative Status)
- Content-type: Correct content type of the outputted content
- WWW-Authenticate: WWW-Authenticate header
- Cookies: Number of cookies followed by that number of lines of parsed cookies in Netscape/Mozilla file format
- Downtime: Length of the interval between the time of the first connection and the time of the last received byte; followed by the start time of the first connection
- Timing: Timing of the request (DNS lookup, Initial connection, SSL, Request, Waiting, Content download, Total)
- Index: Index of URL from command line
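An illustrative sketch of such an output header, using the fields above (the values and exact formatting are assumptions, not verbatim minicrawler output):
URL: http://example.com/
Redirected-To: http://example.com/index.html
Status: 200
Content-length: 1256
Content-type: text/html; charset=UTF-8
Cookies: 0
Index: 0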
Dependencies
- Asynchronous hostname resolving – c-ares
- Gzip decoding – zlib
- TLS/SSL – OpenSSL
- HTTP/2 – nghttp2
- Unicode processing – ICU
Build on Linux
Tested platforms: Debian Linux, Red Hat Linux, OS X.
Install the following dependencies (including header files, i.e. the dev packages). On Linux with apt-get, run:
apt install libc-ares-dev zlib1g-dev libicu-dev libssl-dev libnghttp2-dev
GNU Autotools and the GNU Compiler Collection are also needed; they can be installed with:
apt install make autoconf automake autotools-dev libtool gcc
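After the dependencies are in place, a native build follows the standard Autotools flow. A minimal sketch (the configure options shown in the Docker section below can be added as needed):
./autogen.sh
./configure
make
make install
make check # for tests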
Link libminicrawler to your project
On macOS with Homebrew, CFLAGS and LDFLAGS need to contain the proper paths. You can assign them directly as configure script options:
./configure CFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/opt -L/usr/local/lib"
After installation, you can link libminicrawler by adding this to your Makefile:
CFLAGS += $(shell pkg-config --cflags libminicrawler-4)
LDFLAGS += $(shell pkg-config --libs libminicrawler-4)
Build minicrawler with docker
First create a .env file containing COMPOSE_PROJECT_NAME=minicrawler, then build the Docker image and run a container:
docker compose build minicrawler
docker compose run --rm minicrawler
Then run:
./autogen.sh
./configure --prefix=$PREFIX --with-ca-bundle=/var/lib/certs/ca-bundle.crt --with-ca-path=/etc/ssl/certs
make
make install
make check # for tests
Unit Tests
Unit tests are run with make check. They require php-cli to be installed.
Integration Tests
Integration tests require a running instance of httpbin. You can use a public one, such as the one at nghttp2.org, or install it locally, for example as a library from PyPI, and run it using Gunicorn:
Running httpbin locally
apt install -y python3-pip
pip install httpbin
gunicorn httpbin:app
Then run the following command:
make -C integration-tests check
Running httpbin using Docker
docker compose up -d httpbin
make -C integration-tests check
Install minicrawler into your image
COPY --from=minicrawler:latest /var/lib/minicrawler/usr /usr
Users
- Testomato – A simple website monitoring tool
- add me here