pyndri
Get only id2token dictionary (memory problem)
Hi,
I'm working with ClueWeb09 Category A, which is a big index (about 2TB).
I need to extract the textual content of documents. I can get each document's tuple of token ids using index.document(doc_id), but in order to "translate" it to text, I need the id2token dictionary.
The problem is that, as far as I can see, I can get id2token only through index.get_dictionary(), which loads pretty much everything into memory; even though my machine has over 100GB of RAM, the process gets killed.
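For reference, here is a minimal sketch of the translation step I have in mind. It runs on toy stand-in data rather than a real index. The helper name tokens_to_text is hypothetical, and skipping id 0 is an assumption that 0 marks stopwords or out-of-vocabulary tokens. The commented lines show where the pyndri calls mentioned above would go.

```python
from typing import Iterable, Mapping


def tokens_to_text(token_ids: Iterable[int], id2token: Mapping[int, str]) -> str:
    """Map token ids back to tokens and join them with spaces.

    Assumption: id 0 marks a stopword / out-of-vocabulary token
    and has no entry in id2token, so it is skipped.
    """
    return " ".join(id2token[token_id] for token_id in token_ids if token_id > 0)


# With a real index these would come from pyndri (not run here):
#   token2id, id2token, id2df = index.get_dictionary()
#   ext_doc_id, token_ids = index.document(doc_id)
# Toy stand-ins so the sketch is self-contained:
id2token = {1: "hello", 2: "world"}
ext_doc_id, token_ids = "doc-1", (1, 0, 2)

text = tokens_to_text(token_ids, id2token)
print(ext_doc_id, text)
```

The helper only needs something that behaves like a mapping from id to token, so in principle it would work just as well with a disk-backed lookup as with an in-memory dict.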
Can I get only the id2token dictionary (hopefully that won't be too big)? Do you have any other solution for my problem?
Thanks, Avihay