Error when reconstructing aligned data

Open starshine360 opened this issue 1 year ago • 7 comments

When I use the wet_lines script to download and gather the aligned text described in the metadata, it fails with the error message below. What should I do to fix it?

preprocess/preprocess/warcstream.hh:37 in bool preprocess::WARCStream::GiveBytes(const char*, std::size_t, Callback&) [with Callback = DocumentCallback; std::size_t = long unsigned int] threw util::Exception because `ret != 0 && ret != 1'. zlib inflate returned unexpected code -3

starshine360 avatar Aug 28 '23 04:08 starshine360
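
For context, the -3 in the message above is zlib's Z_DATA_ERROR, the code inflate returns when its input is not a valid compressed stream (corrupted data, or data that is not gzip at all). The Python sketch below reproduces that code outside of wet_lines. The helper name try_inflate and both sample payloads are illustrative assumptions, not part of the preprocess tool.

```python
import gzip
import zlib


def try_inflate(data: bytes) -> str:
    """Inflate data as gzip; return "ok", "incomplete stream", or zlib's error text."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)
    try:
        inflater.decompress(data)
    except zlib.error as exc:
        return str(exc)
    # eof is True only once the end-of-stream marker has been seen.
    return "ok" if inflater.eof else "incomplete stream"


# A well-formed gzip stream inflates cleanly.
good = gzip.compress(b"WARC/1.0\r\n" * 100)
# Hypothetical bad payload: bytes that are not gzip at all.
bad = b"this is not gzip data"

print(try_inflate(good))
print(try_inflate(bad))
```

The second call reports a zlib error whose text begins with "Error -3", the same code that wet_lines surfaces. This says the bytes reaching inflate were not valid gzip; it does not say why.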

Has this problem been resolved? I ran into the same error.

yangsuxia avatar Sep 04 '23 07:09 yangsuxia

Hi @starshine360, thank you for your bug report. Could you please send the exact command line you ran? Thanks.

Celebio avatar Sep 07 '23 07:09 Celebio

I run it like this: zcat seamless.dataset.metadata.public.arb-enA.tsv.gz |egrep "^crawl"|tr '\t' ' '|./wet_lines

yangsuxia avatar Sep 08 '23 06:09 yangsuxia

Are you running it on macOS? Do you get clean text when you run zcat seamless.dataset.metadata.public.arb-enA.tsv.gz |egrep "^crawl"|tr '\t' ' '| head ?

Celebio avatar Sep 12 '23 09:09 Celebio

I run it on CentOS, like this: cat seamless.dataset.metadata.public.arb-enA.tsv | egrep "^crawl" | tr '\t' ' ' | wet_lines

And when I run cat seamless.dataset.metadata.public.arb-enA.tsv | egrep "^crawl" | tr '\t' ' ' | head, the output is as follows.

crawl-data/CC-MAIN-2020-10/segments/1581875145648.56/wet/CC-MAIN-20200222023815-20200222053815-00545.warc.wet.gz sha1:MXL3UB4M4CQW7ZMNJQL35JF6Q47Q6QUC https://antolgy.com/%D8%AA%D8%AD%D8%A7%D9%88%D9%84-%D8%A7%D9%84%D8%B3%D8%A8%D8%A7%D8%AD%D8%A9-%D9%81%D9%8A-%D9%85%D8%B9%D9%8A%D9%91%D8%A9-%D8%A7%D9%84%D9%84%D9%87-%D9%88%D8%B1%D8%B3%D9%8E%D9%86-%D8%B4%D9%90%D8%B1%D9%8A/ 46 3345549219250573319 3345549219250573319 -1.0 1.17822 arb-enA arb 2
crawl-data/CC-MAIN-2021-31/segments/1627046153739.28/wet/CC-MAIN-20210728154442-20210728184442-00056.warc.wet.gz sha1:AK3V7K3GXJZ5GCJ22ED6MK3PTTLPUPIO https://www.ruqayah.net/books/index.php?id=1900 536 10827430569158463649 7094195376714998582 -1.0 1.1769682 arb-enA arb 3
crawl-data/CC-MAIN-2021-25/segments/1623487612537.23/wet/CC-MAIN-20210614135913-20210614165913-00474.warc.wet.gz sha1:QZNUQ6WIPZP56IOZFCCPHMYPILUBRU27 https://govirall.net/%D8%A7%D9%84%D9%86%D8%B4%D9%88%D8%A9-%D8%A7%D9%84%D8%A7%D9%83%D9%84%D9%8A%D9%84-%D8%A7%D9%84%D8%B1%D9%88%D8%B3%D9%8A-%D8%A8%D9%88%D9%84%D9%8A%D8%AA%D9%8A%D9%83%D9%88/1491/ 42 13481499645586975066 13481499645586975066 -1.0 1.1768104 arb-enA arb 4

starshine360 avatar Sep 13 '23 07:09 starshine360
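
Each line in the output above starts with the path of a Common Crawl WET shard, so the shards that wet_lines has to fetch can be listed and inspected one at a time. The sketch below pulls out the distinct paths from such lines. The function name shard_paths is hypothetical, and the CC_BASE prefix is an assumption about where the shards are served from; it is not taken from the wet_lines source.

```python
from typing import Iterable, Iterator

# Assumption: public HTTP endpoint for Common Crawl data.
CC_BASE = "https://data.commoncrawl.org/"


def shard_paths(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct WET shard path (first field) in first-seen order."""
    seen = set()
    for line in lines:
        fields = line.split()
        # Skip blank lines and lines that do not describe a crawl shard.
        if not fields or not fields[0].startswith("crawl-data/"):
            continue
        if fields[0] not in seen:
            seen.add(fields[0])
            yield fields[0]


# One record copied from the thread, repeated to show de-duplication.
sample = [
    "crawl-data/CC-MAIN-2021-31/segments/1627046153739.28/wet/"
    "CC-MAIN-20210728154442-20210728184442-00056.warc.wet.gz "
    "sha1:AK3V7K3GXJZ5GCJ22ED6MK3PTTLPUPIO "
    "https://www.ruqayah.net/books/index.php?id=1900 536 "
    "10827430569158463649 7094195376714998582 -1.0 1.1769682 arb-enA arb 3",
] * 2

for path in shard_paths(sample):
    print(CC_BASE + path)
```

Passing the lines of the filtered metadata file instead of sample gives the full list of shards, which makes it easier to find the one that triggers the inflate error.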

OK, thanks. Then it may be failing to download or decompress the Common Crawl shards correctly. Kenneth might be able to assist you with this, but you can also wait for someone to rebuild the dataset if you are not in a hurry.

Celebio avatar Sep 13 '23 07:09 Celebio
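
One way to test the hypothesis above is to decompress a downloaded shard end to end, outside of wet_lines. The sketch below assumes the shard has already been saved to local disk. The name check_gzip and the example file name are hypothetical, and Python's gzip module stands in for the zlib calls that preprocess makes, so this is a rough check rather than a reproduction of the tool's behavior.

```python
import gzip
import zlib
from typing import Optional


def check_gzip(path: str, chunk_size: int = 1 << 20) -> Optional[str]:
    """Return None if the whole file decompresses, else a short error description."""
    try:
        # gzip.open reads multi-member files, so every member gets inflated.
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        # OSError covers a missing file and a bad gzip header (BadGzipFile),
        # EOFError a truncated file, zlib.error corrupt compressed data.
        return f"{type(exc).__name__}: {exc}"
    return None


if __name__ == "__main__":
    # Hypothetical local file name; replace it with a shard you downloaded.
    problem = check_gzip("shard.warc.wet.gz")
    print("file is readable" if problem is None else problem)
```

A truncated or non-gzip file shows up here as an error description, which would point to the download step; a shard that passes this check suggests the problem lies elsewhere.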

Bump on this. I'm having the same issue and have reproduced the same behavior as @yangsuxia.

petehaha avatar Oct 31 '23 22:10 petehaha