ConvoKit
Partially loading utterances from a selected dataset
The documentation states on multiple pages:
However, it is possible to partially load utterances from a dataset to carry out processing of large corpora sequentially.
Alas, the provided link leads to a 404.
Is this still possible? For individual conversation summarization problems, loading just a single conversation would be invaluable, as currently, datasets for large subreddits take significant computational power to load.
Hi @wwwidonja, thanks for raising this. We did some restructuring of the repo recently, so that broke some links.
Here is the link.
And here is the documentation for Corpus. The initialization parameters utterance_start_index and utterance_end_index are what you're looking for.
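A minimal sketch of such a partial load, assuming the corpus has already been downloaded to disk. The function name and its arguments are hypothetical; only the filename, utterance_start_index and utterance_end_index parameters of Corpus are taken from the documentation. Whether the end index is inclusive is worth checking in the Corpus documentation before relying on exact counts.

```python
def load_utterance_slice(corpus_path, start, end):
    """Load only a positional slice of a corpus's utterances from disk.

    corpus_path: directory of an already-downloaded ConvoKit corpus (assumed).
    start, end:  positions of the first and last utterance to read.
    """
    # Imported lazily so the sketch can be defined without ConvoKit installed.
    from convokit import Corpus

    return Corpus(
        filename=corpus_path,
        utterance_start_index=start,
        utterance_end_index=end,
    )
```

Calling it in a loop with consecutive index windows would process a large corpus one chunk at a time.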
Thanks for the response. This does not, however, solve my problem entirely. What I'm trying to achieve is fetching a single conversation (or a computationally acceptable small subset of conversations) with all of its corresponding utterances. As far as I understand, this only lets me fetch a subset of utterances, with no guarantee that all utterances of a given conversation have been fetched.
I hope you'll be able to consider my issue as a feature request.
Thank you very much!
We've thought about this before, but simply put, there is no way to do this given that the corpus is loaded from simple JSON Lines files, which do not allow any kind of indexing other than line-by-line indexing. If you'd like to work with a smaller subset of the corpus, I'd recommend loading the full corpus, using filter_conversations_by() and then dumping it so you have the smaller corpus to work with.
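The load, filter, dump workflow could look roughly like this. It is a sketch, not a tested recipe: the function name, its arguments and the output location are hypothetical, and it still pays the full loading cost once.

```python
def dump_conversation_subset(corpus_path, keep_ids, out_name, out_dir):
    """Load a full corpus once, keep selected conversations, and dump the result."""
    # Imported lazily so the sketch can be defined without ConvoKit installed.
    from convokit import Corpus

    corpus = Corpus(filename=corpus_path)  # the expensive full load
    keep_ids = set(keep_ids)

    # filter_conversations_by() mutates the corpus in place, keeping only the
    # conversations for which the selector returns True.
    corpus.filter_conversations_by(lambda convo: convo.id in keep_ids)

    # Write the reduced corpus under out_dir/out_name for later fast loading.
    corpus.dump(out_name, base_path=out_dir)
    return corpus
```

Later sessions can then load the dumped directory directly instead of the full corpus.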
Of course, if you have other ideas for how we might implement your feature request, we're happy to hear them.
EDIT: I should clarify that there is no way to do this elegantly, but it would be possible to filter utterances.jsonl and conversations.json for a specific conversation_id. (This just requires iterating through the whole file.) It's not clear this is a common enough use case to include in the package, but you could implement it yourself by filtering utterances.jsonl and conversations.json programmatically.
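One way to do that filtering without loading the corpus into memory is to stream utterances.jsonl line by line using only the standard library. This is a sketch under assumptions, not part of ConvoKit: the function name is hypothetical, each utterance is assumed to store its conversation under the key conversation_id (with root as a fallback for older dumps), and conversations.json is assumed to be a single JSON object keyed by conversation ID.

```python
import json
from pathlib import Path


def extract_conversation(corpus_dir, conversation_id, out_dir):
    """Copy one conversation's utterances and metadata into out_dir.

    Returns the number of utterances written.
    """
    corpus_dir, out_dir = Path(corpus_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    kept = 0
    with open(corpus_dir / "utterances.jsonl", encoding="utf-8") as src, \
            open(out_dir / "utterances.jsonl", "w", encoding="utf-8") as dst:
        for line in src:  # one utterance per line; only one is held in memory
            if not line.strip():
                continue
            utt = json.loads(line)
            # Assumption: "conversation_id" in current dumps, "root" in older ones.
            if utt.get("conversation_id", utt.get("root")) == conversation_id:
                dst.write(line if line.endswith("\n") else line + "\n")
                kept += 1

    # Assumption: conversations.json is one JSON object keyed by conversation ID.
    with open(corpus_dir / "conversations.json", encoding="utf-8") as f:
        conversations = json.load(f)
    subset = {k: v for k, v in conversations.items() if k == conversation_id}
    with open(out_dir / "conversations.json", "w", encoding="utf-8") as f:
        json.dump(subset, f)

    return kept
```

The output directory holds only the two filtered files. Any other files that the corpus directory contains would still need to be copied or filtered before the result can be loaded as a Corpus.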