DataCollection issues

error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Hi DataCollection team, I am trying to install DataCollection on a virtual machine with Ubuntu 16.04 LTS and I am getting this error after this command: `pip install -r requirements.txt`...

mzeidhassan

candidates2corpus.py: verify that splitters exist

4

If the splitters specified on the command line do not exist, the script fails silently and still runs through the entire corpus download.

achimr

enhancement
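A minimal sketch of the check requested for candidates2corpus.py, assuming Python 3 and assuming each splitter is passed as a plain path or program name without extra arguments. The function names `missing_executables` and `require_splitters` are hypothetical and do not come from the script itself. The idea is simply to fail before any download starts.

```python
import shutil
import sys


def missing_executables(paths):
    """Return the entries of `paths` that cannot be resolved to a runnable program."""
    missing = []
    for path in paths:
        # shutil.which looks up bare names on PATH and checks explicit paths
        # directly; it returns None when nothing executable is found.
        if shutil.which(path) is None:
            missing.append(path)
    return missing


def require_splitters(source_splitter, target_splitter):
    """Abort with a clear message if either splitter is unavailable."""
    missing = missing_executables([source_splitter, target_splitter])
    if missing:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit("Splitter not found or not executable: " + ", ".join(missing))
```

Calling `require_splitters(...)` right after argument parsing would turn the silent failure into an immediate, visible error.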

Script candidates2corpus.py needs days to run for large language pairs

4

For large language pairs with about 1.2 million candidate pairs, this script takes days to run. While in this case 2.4 million web pages get downloaded and processed, it would...

achimr

enhancement

Cleaning script filter_hunalign_bitext.py silently fails when running out of memory

When running `baseline/filter_hunalign_bitext.py`, e.g. like this

```
nohup cat en-de.sent | ~/DataCollection/baseline/filter_hunalign_bitext.py - en-de.filtered --lang1 en --lang2 de -cld2 -deleted en-de.deleted 2> filter.log &
```

and the process runs...

achimr

bug

Index server domain hardcoded in locate_candidates_cc_index_api.py

The index server domain should be configurable through a command-line option, so that users can run their own index server without editing the file.

achimr

enhancement
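One way the hardcoded domain in locate_candidates_cc_index_api.py could be exposed is sketched below. The option name `--index-server`, the helper `index_url`, and the `<crawl>-index` endpoint layout are assumptions made for illustration; the script's actual options and URL construction are not shown in this listing.

```python
import argparse

# Default taken from the server URL mentioned in this issue list.
DEFAULT_INDEX_SERVER = "http://index.commoncrawl.org"


def build_parser():
    parser = argparse.ArgumentParser(description="Query a CommonCrawl index server")
    # Hypothetical option name; falls back to the public server when omitted.
    parser.add_argument("--index-server", default=DEFAULT_INDEX_SERVER,
                        help="base URL of the index server to query")
    return parser


def index_url(server, crawl):
    """Build the query endpoint for one crawl, tolerating a trailing slash."""
    # Assumed endpoint layout: <server>/<crawl>-index
    return "{}/{}-index".format(server.rstrip("/"), crawl)
```

With this in place, a locally hosted server could be selected with something like `--index-server http://localhost:8080` while existing invocations keep their current behaviour.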

Add rate-limiting for index server queries to locate_candidates_cc_index_api.py

locate_candidates_cc_index_api.py does not rate-limit its queries to the CommonCrawl index server http://index.commoncrawl.org. The server is reported to be under heavy load frequently (see https://groups.google.com/forum/#!topic/common-crawl/o_MuZViu0O0). We should be nice and rate-limit our...

achimr

enhancement
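The rate limiting asked for in locate_candidates_cc_index_api.py could look roughly like the sketch below, which enforces a minimum interval between consecutive queries. The class name `RateLimiter` and its parameters are hypothetical, and the appropriate request rate for the index server is not stated in the issue, so the value must be chosen by the operator.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock      # injectable for testing
        self.sleep = sleep      # injectable for testing
        self.last_call = None   # time of the previous wait(), if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

A caller would create one limiter, for example `limiter = RateLimiter(max_per_second=1)`, and call `limiter.wait()` immediately before each request to the index server.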

langstat2candidates.py requires large amounts of RAM

1

langstat2candidates.py, particularly when used with the `-candidates` parameter, uses large amounts of RAM (32-64 GB for large language pairs). This is because it reads the entire...

achimr

enhancement

Compression, not language tagging seems to be the bottleneck in extract_monolingual.sh

```
achim 28910 0.0 0.0 14404 1448 ? SN 14:05 0:00 /bin/bash /home/achim/DataCollection/metadata/extract_monolingual.sh https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541518.17/wet/CC-MAIN-20161202170901-00350-ip-10-31-129-80.ec2.internal.warc.wet.gz
achim 28914 0.3 0.0 169336 4736 ? SN 14:05 0:00 curl -s https:...
```

achimr

enhancement
