ir_datasets Clueweb22

I'd like to keep this PR as a way of tracking progress of the ir_datasets integration for ClueWeb22. Of course, the implementation is far from finished (as you can see from the numerous TODOs :laughing:). But I figure that keeping the process open to other contributors might encourage valuable feedback and discussion.

And of course, this PR would close #210 :wink:

Oct 19 '22 22:10 janheinrichmerker

Wow-- thanks! Seems to be coming along nicely. The vdom structure is a bit complicated, but I guess it needs to be in order to properly represent the data.

Oct 19 '22 22:10 seanmacavaney

Yep, I haven't started with the VDOM type yet, but will as soon as the documentation is up.

Oct 19 '22 22:10 janheinrichmerker

@seanmacavaney I think everything is ready now. We only have the category B subset, so we cannot test the A and L categories. I've therefore temporarily hidden those IDs until we get a chance to test them as well (the parsers for them are already in place).

Added IDs:

clueweb22
clueweb22/b
clueweb22/b/de
clueweb22/b/en
clueweb22/b/es
clueweb22/b/fr
clueweb22/b/it
clueweb22/b/ja
clueweb22/b/nl
clueweb22/b/po
clueweb22/b/pt
clueweb22/b/zh
clueweb22/b/other-languages
clueweb22/b/as-a
clueweb22/b/as-l

Nov 29 '22 03:11 janheinrichmerker

Just applied some minor fixes for earlier Python versions.

Nov 29 '22 11:11 janheinrichmerker

Awesome, thanks! A few other nits:

Looks like 'cached_property' isn't supported in Python 3.7 or below.
It seems something requires that the clueweb22 directory exists, which we cannot assume. When trying to generate the documentation page, I get an error that the directory doesn't exist. (And when I create the directory, it expects there to be a version file or something.)
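
For the first nit, one common workaround is a small fallback descriptor. The following is only a minimal sketch, not the code from this PR: it assumes the standard-library `functools.cached_property` is used where available, and the `Example` class is purely illustrative.

```python
try:
    # Part of the standard library from Python 3.8 onwards.
    from functools import cached_property
except ImportError:
    class cached_property:
        """Minimal fallback for Python 3.7 and below (no locking)."""

        def __init__(self, func):
            self.func = func
            self.__doc__ = func.__doc__

        def __get__(self, instance, owner=None):
            if instance is None:
                return self
            value = self.func(instance)
            # Store the result on the instance. Because this is a non-data
            # descriptor, later lookups find the instance attribute first.
            instance.__dict__[self.func.__name__] = value
            return value


class Example:
    """Hypothetical class, only here to show the caching behaviour."""

    def __init__(self):
        self.calls = 0

    @cached_property
    def answer(self):
        self.calls += 1
        return 42
```

Accessing `Example().answer` repeatedly runs the method body only once per instance.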

Nov 29 '22 16:11 seanmacavaney

Thinking a bit more about this... I sorta feel that the primary case will be the _Txt version. Might it make sense to have the alternate formats as separate datasets, like so:

Provides txt data

clueweb22/b
clueweb22/b/touche-2023-task-1 [and other tasks where the default case is text]
clueweb22/b/de
clueweb22/b/en
clueweb22/b/es
clueweb22/b/fr
clueweb22/b/it
clueweb22/b/ja
clueweb22/b/nl
clueweb22/b/po
clueweb22/b/pt
clueweb22/b/zh
clueweb22/b/other-languages
clueweb22/a
clueweb22/l

Provides html

clueweb22/html/b
clueweb22/html/a

Provides vdom

clueweb22/vdom/b
clueweb22/vdom/a

Provides links

clueweb22/links/b
clueweb22/links/a

I don't think this is so far off from what's there now. It should also be faster for the primary case, since it won't need to pull a bunch of data from multiple sources and collate it.

Nov 29 '22 17:11 seanmacavaney

I disagree about text being the default use case. Users will probably expect the types listed in the table from the ClueWeb22 website and paper to be present in the ir_datasets records. So, clueweb22/l should have text, clueweb22/a should have HTML, and clueweb22/b should have screenshots (at least once they're released).

And the Touché 2023 shared task is a good example here. Participants can use whatever they want from the B category. In my opinion, having text documents as the default in ir_datasets would lead to confusion here.

But you're right about the performance and that's exactly the use case I imagined for the "subset views". If I only need the text, then I can "view" only the text part of clueweb22/b, for instance.

What about renaming them like this:

clueweb22
clueweb22/b
clueweb22/b/de
clueweb22/b/en
...
clueweb22/b/zh
clueweb22/b/other-languages
clueweb22/b/no-png
clueweb22/b/no-png/no-html
clueweb22/a
clueweb22/a/de
clueweb22/a/en
...
clueweb22/a/zh
clueweb22/a/other-languages
clueweb22/a/no-html
clueweb22/l
clueweb22/l/de
clueweb22/l/en
...
clueweb22/l/zh
clueweb22/l/other-languages

I think this way it is more explicit that the clueweb22/b/no-png subset view would actually exclude data that is normally part of the category B subset.
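
To make the proposed naming scheme concrete, here is a rough sketch that enumerates the IDs. It assumes that the elided `...` entries stand for the same language codes as in the "Added IDs" list earlier in this thread; the names `LANGUAGES`, `VIEWS`, and `proposed_ids` are made up for illustration and are not part of ir_datasets.

```python
# Assumption: the "..." in the proposal covers the same languages as the
# earlier "Added IDs" list.
LANGUAGES = [
    "de", "en", "es", "fr", "it", "ja", "nl", "po", "pt", "zh",
    "other-languages",
]

# Subset views per category, as listed in the proposal above.
VIEWS = {
    "b": ["no-png", "no-png/no-html"],
    "a": ["no-html"],
    "l": [],
}


def proposed_ids():
    """Enumerate the dataset IDs of the proposed scheme."""
    ids = ["clueweb22"]
    for category, views in VIEWS.items():
        base = f"clueweb22/{category}"
        ids.append(base)
        ids.extend(f"{base}/{lang}" for lang in LANGUAGES)
        ids.extend(f"{base}/{view}" for view in views)
    return ids
```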

Nov 29 '22 20:11 janheinrichmerker

All right, I'll just backport the @cached_property bits then. And it should be fine to defer the version compatibility check (and hence the expectation that the directory exists) until the doc iterator is created. Consider it done :wink:
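
Deferring the check could look roughly like the sketch below. This is an assumption about the general shape, not the PR's actual code: `LazyDocs` and `_check_source` are hypothetical names, and the real version check is left as a comment.

```python
from pathlib import Path
from typing import Iterator


class LazyDocs:
    """Hypothetical docs handler that touches the disk only when iterated."""

    def __init__(self, source_dir):
        # Only remember the path; no file system access happens here, so
        # importing the package or building the docs page cannot fail.
        self._source_dir = Path(source_dir)

    def _check_source(self) -> None:
        if not self._source_dir.is_dir():
            raise FileNotFoundError(
                f"ClueWeb22 directory not found: {self._source_dir}"
            )
        # A version compatibility check would go here as well.

    def docs_iter(self) -> Iterator[str]:
        # This is a generator function, so the check below runs on the
        # first next() call rather than when docs_iter() is called.
        self._check_source()
        for path in sorted(self._source_dir.iterdir()):
            yield path.name
```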

Nov 29 '22 21:11 janheinrichmerker

I see your points and I think I agree with some of them. I could probably be convinced. However, let me make a more complete case in favor of a text-only default:

Most (existing) systems currently rely only on the text, so nearly any baseline will end up just tossing out all the other information.
Let's say there's a new system that uses some of the other structured data -- let's say the vdom. This often relies on a format that's specific to the dataset itself. So a system would need to be built specifically for handling the CW22 format, which inherently won't transfer to other datasets. Not saying it won't happen, it's just less common. There are many examples of this, e.g., wapo, cord19, etc. Lots of rich structured data, but it's almost always ignored in favour of a simple text format.
There's overhead in pulling data from additional sources to create records. I don't yet have my copy, but I imagine that the text data will be much smaller than vdom/html/etc. So less data overall. Further, reading a single file sequentially from disk should be faster than reading from multiple files, which will need to jump around more.
All 3 splits of CW22 include text, so it's a uniform default format across the splits.
The other formats will be available through their respective clueweb22/vdom etc, so they're still there and easy for folks to use if they want them. To me, it seems more straightforward to ask for what data you want rather than what data you don't want.
Alternative formats could be added more easily that fit nicely into this structure. E.g., if somebody runs a doc2query over all the documents, it could be added under a new clueweb22/d2q namespace, rather than always loading it for every record.
Would probably simplify the code, if there's no need for composite records, merging, etc.

cord19 is an example where we made a similar decision -- most folks work with the title+abstract text, which is easy and fast to load. There's a separate cord19/fulltext that includes the full article text, which is considerably more expensive to load (and would otherwise just be tossed out by most users).

Nov 30 '22 10:11 seanmacavaney

To refute some of your points:

Your first point might just be a "self-fulfilling prophecy". If text is the default in ir_datasets, then people are just going to use that. Then fewer people use rich data. Then we say, let's have text the default. :wink:
The documentation is a perfect place to encourage users to select just the text version if they feel that's all they need. It is also a great place to highlight that iterating over all WARCs (but also over all text JSONL files) comes with serious computational cost.
The kind of user who is willing to iterate over the whole ClueWeb22 is (hopefully) well aware that this requires more computational resources than cord19 and will therefore read the documentation.
Others, say students, who just want to play around with a smaller sample, e.g., for re-ranking or just the Polish documents, often do not care about maximum performance. (Especially since, with random access by ID, we could easily enrich a run file.)
Indeed the records stem from multiple data sources (folders on disk). So there is overhead. But it just comes down to the speed of the "slowest" format which would be WARC/HTML anyway.
Unfortunately, we cannot always just split by type. For example, the node IDs for headings, tables, etc. are stored in the WARC headers. So to use the VDOM we always need the WARC. Consequently, merging records must still be done.
I disagree with your argument that third-party formats would fit better. Take Touché as an example again. If all record types were separate IDs, we would not be able to list Touché 2023 anywhere. The clueweb22/b/touche-2023 ID wouldn't work because we don't want to restrict participants to just plain text. The clueweb22/b/html/touche-2023 ID wouldn't work either because then participants couldn't use the VDOM etc.
My previous argument still stands. If you look at the official paper and documentation from CMU, you would expect ir_datasets to return the same format. Implicitly overruling the official "default" can lead to confusion. So that would need to be documented very clearly. But the same place in the documentation can better be used to advertise the opinionated, non-official, but more performant "views" for only text or only html.
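
The merging mentioned in these points can be sketched as follows. This is a simplified illustration under stated assumptions: both sources are assumed to yield records in the same document order, and `PartialRecord`, `MergedRecord`, and `merge_records` are hypothetical names rather than the PR's classes.

```python
import logging
from typing import Iterable, Iterator, NamedTuple

logger = logging.getLogger("clueweb22-sketch")


class PartialRecord(NamedTuple):
    """One record as read from a single source (e.g. txt or html)."""
    doc_id: str
    url_hash: str
    payload: str


class MergedRecord(NamedTuple):
    doc_id: str
    text: str
    html: str


def merge_records(
    txt: Iterable[PartialRecord],
    html: Iterable[PartialRecord],
) -> Iterator[MergedRecord]:
    # Assumption: both sources are sorted by doc_id, so zip() pairs them.
    # The combined iterator can only advance as fast as its slowest source.
    for t, h in zip(txt, html):
        if t.doc_id != h.doc_id:
            raise ValueError(f"ID mismatch: {t.doc_id} vs. {h.doc_id}")
        if t.url_hash != h.url_hash:
            # Inconsistent sources are reported but not treated as fatal.
            logger.warning(
                "URL hash mismatch for %s: txt URL hash was %s "
                "but html URL hash was %s",
                t.doc_id, t.url_hash, h.url_hash,
            )
        yield MergedRecord(t.doc_id, t.payload, h.payload)
```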

Nov 30 '22 14:11 janheinrichmerker

Fixed the issues with version assertions and @cached_property.

Nov 30 '22 14:11 janheinrichmerker

Thanks!

Looks like there are still some py36 incompatibilities: ImportError: cannot import name 'Final' from 'typing'.
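
One possible fix for this import error is a version-guarded import. The sketch below assumes the `typing_extensions` backport package is installed on older interpreters; `DEFAULT_CATEGORY` is a made-up constant for illustration only.

```python
import sys

if sys.version_info >= (3, 8):
    # typing.Final exists from Python 3.8 onwards.
    from typing import Final
else:
    # Assumption: the typing_extensions backport is installed.
    from typing_extensions import Final

# Hypothetical constant; Final only informs type checkers and has no
# effect at runtime.
DEFAULT_CATEGORY: Final = "b"
```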

My main hesitation remains that in my experience so far with the package, it seems that most users just care about having an easy way to get the text, even when loads of other nice structured data are available. So I'd like to make that case as easy and optimised as possible for folks. You make some reasonable counter-points, though, and I think I'm inclined to agree on the current path forward. But maybe it's worth getting some additional input before committing to it.

Nov 30 '22 14:11 seanmacavaney

> My main hesitation remains that in my experience so far with the package, it seems that most users just care about having an easy way to get the text, even when loads of other nice structured data are available. So I'd like to make that case as easy and optimised as possible for folks.

Well, with the current approach it is already easy (just use clueweb22/b/text instead of clueweb22/b) and optimized (for clueweb22/b/text we would only look at the text files, no WARC is touched).

So why is it a problem to have users explicitly choose clueweb22/b/text if they only care about the text?

I'm now going to test everything with a Python 3.6 interpreter, just to be sure.

Dec 01 '22 12:12 janheinrichmerker

I'd like to add that there are also datasets in ir_datasets where the derived datasets are a suffix to the original dataset:

argsme/2020-04-01/processed is derived from argsme/2020-04-01
clueweb12/touche-2022-task-2/expanded-doc-t5-query is derived from clueweb12/touche-2022-task-2
cord19 is derived from cord19/fulltext

So I don't see a general pattern of preferring shorter IDs for the text-only version.

Dec 01 '22 12:12 janheinrichmerker

That should have been the last few 3.7-incompatible things.

Dec 01 '22 14:12 janheinrichmerker

Awesome, thanks!

Dec 01 '22 17:12 seanmacavaney

Maybe I'd feel a bit more comfortable if we had some performance benchmarks. E.g., how fast is it to iterate the first 100k documents for the combined vs text-only versions?

Dec 02 '22 13:12 seanmacavaney

These might not be too accurate as I'm accessing the files remotely via CephFS but here you go:

[INFO] [starting] first 100k docs, just text
100000it [00:07, 12524.57it/s]
[INFO] [finished] first 100k docs, just text [8.06s]
[INFO] [starting] first 100k docs, with html, txt, vdom, inlink, outlink
[WARNING] URL hash mismatch for clueweb22-de0000-00-13406: txt URL hash was 9D5A53C6ACCB07B2C2319A4E5E44AB76 but html URL hash was B6956297B5EBBDFEAABF458F2FA5EADC
[WARNING] URL mismatch for clueweb22-de0000-00-13406: outlink URL was https://www.jovanovic.com/quotidien.htm but html URL was https://www.jovanna.de/
[WARNING] URL hash mismatch for clueweb22-de0000-00-13406: outlink URL hash was 9D5A53C6ACCB07B2C2319A4E5E44AB76 but html URL hash was B6956297B5EBBDFEAABF458F2FA5EADC
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: txt URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
[WARNING] URL mismatch for clueweb22-de0000-01-14834: inlink URL was https://simon.ccbcmd.edu/pls/PROD/bwskalog.p_disploginnew?in_id=&cpbl=&newid= but html URL was https://simon-transporte.com/
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: inlink URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
[WARNING] URL mismatch for clueweb22-de0000-01-14834: outlink URL was https://simon.ccbcmd.edu/pls/PROD/bwskalog.p_disploginnew?in_id=&cpbl=&newid= but html URL was https://simon-transporte.com/
[WARNING] URL hash mismatch for clueweb22-de0000-01-14834: outlink URL hash was 612691A107701D76AD36FD32F8608F3C but html URL hash was 825E120CE7F82C8B0268440A59107D04
100000it [03:04, 541.70it/s]
[INFO] [finished] first 100k docs, with html, txt, vdom, inlink, outlink [03:05]

As expected, parsing the WARC files is roughly 23x slower than just reading the JSONL file.
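
A measurement like this can be reproduced with a small helper along the following lines. It is a sketch, not the script used for the numbers above: `docs_per_second` is a hypothetical helper that works on any iterator, and the two rates are copied from the progress bar output in the log.

```python
import itertools
import time


def docs_per_second(docs, limit):
    """Consume up to `limit` items and return (count, items per second)."""
    start = time.perf_counter()
    count = sum(1 for _ in itertools.islice(docs, limit))
    elapsed = time.perf_counter() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return count, rate


# Throughput values as reported by the progress bars in the log above.
text_only_rate = 12524.57  # it/s, just text
combined_rate = 541.70     # it/s, html, txt, vdom, inlink, outlink

slowdown = text_only_rate / combined_rate
print(f"combined iteration is about {slowdown:.1f}x slower")
```

With a loaded dataset, the helper could be called on the docs iterator for each variant, with `limit=100_000`.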

Dec 08 '22 17:12 janheinrichmerker

Great news, my copy of the CW22 drive arrived.

Jan 11 '23 11:01 seanmacavaney

Great to hear that!

Jan 11 '23 11:01 janheinrichmerker

I've updated the branch to reflect upstream changes and added default_text() implementations.
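
For readers unfamiliar with the convention: a `default_text()` method lets generic tooling ask a record for its main textual content without knowing the field names. The record type below is a hypothetical sketch with made-up fields, not the type from this branch.

```python
from typing import NamedTuple, Optional


class SketchDoc(NamedTuple):
    """Hypothetical record type; the field names are assumptions."""
    doc_id: str
    url: str
    text: str
    html: Optional[str] = None

    def default_text(self) -> str:
        # Return the plain text as the record's default representation.
        return self.text


doc = SketchDoc("clueweb22-example", "https://example.org/", "Hello world")
print(doc.default_text())
```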

Mar 14 '23 10:03 janheinrichmerker

Is anything still blocking the merge?

May 02 '23 11:05 janheinrichmerker

Sorry -- the only thing blocking is finding the time to run through the tests on my end.

May 02 '23 11:05 seanmacavaney

Hey @seanmacavaney, have you found time to run the tests? Now that ClueWeb22 is used in a number of research papers, I really think it would be worth adding it to ir_datasets. If there is anything I can help with, please let me know.

Feb 19 '24 09:02 janheinrichmerker

Closing this PR in favor of the new ir-datasets-clueweb22 extension.

Apr 19 '24 13:04 janheinrichmerker
