DataCollection
Data collection, alignment and TAUS repository
Hi DataCollection team, I am trying to install DataCollection on a virtual machine with Ubuntu 16.04 LTS and I am getting this error after this command: `pip install -r requirements.txt`...
If the splitters specified on the command line do not exist, the script fails silently and still runs through the entire corpus download.
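One way to surface this early is to check the splitters before any download starts. The following is only a minimal sketch: it assumes each splitter is passed as a command string whose first token is an executable, and the helper names `missing_splitters` and `check_splitters_or_exit` are hypothetical, not part of DataCollection.

```python
import shlex
import shutil
import sys


def missing_splitters(commands):
    """Return the splitter commands whose executable cannot be found.

    Assumption: each entry is a command line such as
    "/path/to/split-sentences.perl -l en"; only the first token is checked.
    """
    missing = []
    for command in commands:
        parts = shlex.split(command)
        # shutil.which handles both bare names (looked up on PATH) and
        # explicit paths, and requires the file to be executable.
        if not parts or shutil.which(parts[0]) is None:
            missing.append(command)
    return missing


def check_splitters_or_exit(commands):
    """Abort with a clear message instead of failing silently."""
    missing = missing_splitters(commands)
    if missing:
        sys.exit("Splitter not found or not executable: " + ", ".join(missing))
```

Calling `check_splitters_or_exit` right after argument parsing would stop the run before the corpus download begins.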
For large language pairs with about 1.2 million candidate pairs this script takes days to run. While in this case 2.4 million web pages get downloaded and processed, it would...
When running `baseline/filter_hunalign_bitext.py`, e.g. like this
```
nohup cat en-de.sent | ~/DataCollection/baseline/filter_hunalign_bitext.py - en-de.filtered --lang1 en --lang2 de -cld2 -deleted en-de.deleted 2> filter.log &
```
and the process runs...
Should be modifiable with a command-line option, so that one can run one's own index server without editing the file.
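A possible shape for such an option, sketched with `argparse`. The flag name `--index-server` and the function `build_parser` are hypothetical, and the default URL is an assumption (the public CommonCrawl index server); the affected script may name and wire these differently.

```python
import argparse

# Assumption: the value currently hard-coded in the file is the public
# CommonCrawl index server.
DEFAULT_INDEX_SERVER = "http://index.commoncrawl.org"


def build_parser():
    """Build a parser that lets the user override the index server URL."""
    parser = argparse.ArgumentParser(
        description="Query a CommonCrawl-style index server.")
    parser.add_argument(
        "--index-server",
        default=DEFAULT_INDEX_SERVER,
        help="base URL of the index server (default: %(default)s)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.index_server)
```

Running the script with `--index-server http://localhost:8080` would then point it at a local server without touching the source.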
`locate_candidates_cc_index_api.py` doesn't rate-limit its queries to the CommonCrawl index server (http://index.commoncrawl.org). The server is reported to be under heavy load frequently (https://groups.google.com/forum/#!topic/common-crawl/o_MuZViu0O0). We should be nice and rate-limit our...
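A simple client-side limiter could look like the sketch below. It enforces a minimum interval between consecutive calls; the class name `RateLimiter` and the one-second interval in the usage note are assumptions for illustration, not values taken from the script or from CommonCrawl.

```python
import time


class RateLimiter:
    """Allow at most one call every `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Usage would be to create one limiter, for example `limiter = RateLimiter(1.0)`, and call `limiter.wait()` immediately before each request to the index server.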
`langstat2candidates.py`, particularly when used with the `-candidates` parameter, uses large amounts of RAM (32-64 GB for large language pairs). This is because it reads the entire...
``` achim 28910 0.0 0.0 14404 1448 ? SN 14:05 0:00 /bin/bash /home/achim/DataCollection/metadata/extract_monolingual.sh https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541518.17/wet/CC-MAIN-20161202170901-00350-ip-10-31-129-80.ec2.internal.warc.wet.gz achim 28914 0.3 0.0 169336 4736 ? SN 14:05 0:00 curl -s https:...