tosback2
tosback2 copied to clipboard
Reimplementing TOSBack using git as a database layer!
This is TOSBack version 2, a clean redesign & reimplementation of EFF's TOSBack project.
It uses Git as an inherently and efficiently versioned backend storage database.
After cloning the git repository, you need to execute this command:
git submodule update --init --recursive
That will fetch a recent version of the GitPython code, which we depend upon.
BUGS IN WGET
If you want to actually run the crawler yourself (not really necessary unless you're testing something), be aware that TOSBack2 also exposes a number of bugs in common versions of wget. As of December 2011, there are two bugs you might need to patch yourself!
(FOR YOUR CONVENIENCE, a patched version of the wget source can be found in lib/wget-1.13.4/ . There is also a binary .deb that Debian and Ubuntu users can try in lib/. More hints on building from source below)
-
Versions of wget built against gnutls may suffer from fatal memory leaks https://lists.gnu.org/archive/html/bug-wget/2011-10/msg00050.html (so apply that patch, or build against openssl using ./configure --with-ssl=openssl).
-
You should also apply the following patch https://savannah.gnu.org/support/download.php?file_id=24473 to fix this bug: https://savannah.gnu.org/bugs/?21714
HINTS FOR BUILDING WGET FROM SOURCE ON DEBIAN OR UBUNTU
sudo apt-get build-dep wget cd lib/wget-1.13.4/ fakeroot debian/rules binary an installable .deb file should be written to the lib/ directory