pyndri

Get only id2token dictionary (memory problem)

Open AvihayLevi opened this issue 5 years ago • 0 comments

Hi, I'm working with ClueWeb09 Category A, a large index (about 2 TB). I need to extract the textual content of documents. I can get each document's tuple of token ids with index.document(doc_id), but to translate those ids back into text I need the id2token dictionary.
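To make the translation step concrete, here is a minimal sketch of what I mean. It uses toy stand-in data instead of a real index, and it assumes that index.document(doc_id) yields a pair of (external document id, tuple of token ids). The helper name decode_tokens and the "<unk>" placeholder are made up for illustration and are not part of pyndri.

```python
from typing import Mapping, Sequence


def decode_tokens(
    token_ids: Sequence[int],
    id2token: Mapping[int, str],
    unknown: str = "<unk>",
) -> str:
    """Join token ids back into text.

    Ids that are missing from the mapping are replaced by a placeholder
    (hypothetical choice; they could also simply be skipped).
    """
    return " ".join(id2token.get(token_id, unknown) for token_id in token_ids)


# Toy stand-ins. With a real index these would come from
# index.document(doc_id) and index.get_dictionary() (assumed shapes).
toy_id2token = {1: "hello", 2: "world"}
toy_document = ("clueweb09-example-id", (1, 2, 0, 1))

ext_id, token_ids = toy_document
text = decode_tokens(token_ids, toy_id2token)
print(ext_id, text)
```

The function only needs something that behaves like a read-only mapping, so in principle any lookup that does not hold the whole vocabulary in RAM could be passed in place of the dict.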

The problem is that, as far as I can see, the only way to get id2token is index.get_dictionary(), which loads pretty much everything into memory. Even though my machine has over 100 GB of RAM, the process gets killed.

Can I get only the id2token dictionary? (Hopefully that won't be too big.) Do you have any other solution for my problem?

Thanks, Avihay

AvihayLevi, May 12 '19 15:05