
Feature request: split result jsons for archiving purposes

Open FlxVctr opened this issue 2 years ago • 1 comments

First a question: Is it safe to cut output files from the head and archive away JSON while the app is running?

Second: With huge datasets, it would be great to have a function that splits off and compresses old raw data. Maybe a good idea for a future feature. We have > 24 GB in a single file by now 😅

FlxVctr, Apr 06 '22 16:04

I think it may cause an error to move / archive / compress the file in the middle of a search. I'm not an expert on Python file writing, but the file is only opened once at the start, and I assume that's when Python creates it. And so I'd also assume it'd hit an error when it tried to write if suddenly the file did not exist.

Yes, that's a good point; I always imagined using the database as the primary data source and keeping the JSON as backup, so I didn't put a lot of thought into how it was saved. It would make sense to have a parameter like n_tweets_per_file and a running counter file_n that writes n_tweets_per_file to a JSON file, closes it, increments file_n, and then opens the next file. It may even help clean up some of the --get_counts code because I did some messy stuff to make that output multiple files rather than one.
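A minimal sketch of that counter idea, not the actual focalevents code. It assumes tweets are written one JSON object per line; the class name `RotatingJsonWriter`, the `out_dir` and `prefix` arguments, and the file naming scheme are all hypothetical. Only `n_tweets_per_file` and `file_n` come from the description above.

```python
import json
import os


class RotatingJsonWriter:
    """Hypothetical helper: one JSON object per line, a new file every N tweets."""

    def __init__(self, out_dir, prefix, n_tweets_per_file):
        self.out_dir = out_dir
        self.prefix = prefix
        self.n_tweets_per_file = n_tweets_per_file
        self.file_n = 0        # running index of the current output file
        self.n_in_file = 0     # tweets written to the current file so far
        self._f = None         # opened lazily on the first write

    def _path(self):
        # Assumed naming scheme: <prefix>_00000.json, <prefix>_00001.json, ...
        return os.path.join(self.out_dir, f"{self.prefix}_{self.file_n:05d}.json")

    def write(self, tweet):
        if self._f is None:
            self._f = open(self._path(), "w", encoding="utf-8")
        self._f.write(json.dumps(tweet) + "\n")
        self.n_in_file += 1
        if self.n_in_file >= self.n_tweets_per_file:
            # File is full: close it and point at the next one.
            self.close()
            self.file_n += 1

    def close(self):
        if self._f is not None:
            self._f.close()
            self._f = None
        self.n_in_file = 0
```

Because every full file is closed before the next one is opened, anything other than the highest-numbered file should be safe to compress or move while the search keeps running.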

I probably won't have the capacity to do this soon, unfortunately. If you want to give it a shot, you'd want to modify the class in listener.py, and probably make the change in the manage_writing() function. You can see that the class already keeps track of some other tweet counts too (https://github.com/ryanjgallagher/focalevents/blob/main/twitter/listener.py#L249)

One thing I worry about is how this may work with the streaming class. It uses two different Python processes to do the reading and the writing, and I've had some trouble modifying what are supposed to be shared class attributes across the processes (i.e. you try to change it in one process, but the other keeps going as if it still had the old attribute).
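One common cause of that symptom, which may or may not be what happens in the streaming class, is that each process gets its own copy of ordinary attributes, so a change in one is never seen by the other. The sketch below illustrates this with a hypothetical `Writer` class (not the focalevents class) and contrasts a plain attribute with a `multiprocessing.Value`, which lives in shared memory.

```python
import multiprocessing as mp


class Writer:
    """Hypothetical stand-in for a class used from two processes."""

    def __init__(self):
        self.file_n = 0                         # plain attribute: copied per process
        self.shared_file_n = mp.Value("i", 0)   # integer stored in shared memory

    def bump(self):
        # Runs in the child process.
        self.file_n += 1                        # only changes the child's copy
        with self.shared_file_n.get_lock():
            self.shared_file_n.value += 1       # visible to the parent as well


def demo():
    w = Writer()
    p = mp.Process(target=w.bump)
    p.start()
    p.join()
    # Read both counters back in the parent process.
    return w.file_n, w.shared_file_n.value


if __name__ == "__main__":
    print(demo())
```

After the child exits, the parent should still see `file_n == 0` while the shared counter reads 1. A rotation counter like `file_n` would therefore need to live either entirely inside the writing process or in a shared object such as `multiprocessing.Value`.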

ryanjgallagher, Apr 08 '22 13:04