Scaling Kaiaulu to Larger Projects and Temporal information on Project Configuration File
When comparing Codeface and Kaiaulu scenarios with @nicolehoess, it became clearer that Codeface's BatchJobPool, a Python script, greatly helped scale the analysis, because the entire Codeface pipeline applies the analysis per time window at every step.
The re-imagination of this in Kaiaulu would likely look something like this:
- A separate project in Sailuh would contain a simple Python script file that used BatchJobPool.
- BatchJobPool would fire R scripts, much like Codeface does (the dispatchers in Codeface that are in R).
- However, in Kaiaulu's case, these dispatchers would be the existing CLI in R/exec.
- To do so, Kaiaulu CLI scripts have to be parameterized by time window. This is less practical in Kaiaulu as we do not require a database, but it is still doable. Specifically, we can:
- Guarantee downloaders will store data in monthly files.
- Request the time window from the project configuration file as an optional field.
- Use the specified time ranges to search the names of the files that store monthly data, to decide what minimal combination of files needs to be loaded in memory (to then be subset more precisely).
- Apply the subsequent analysis.
- Note that the Kaiaulu CLI reuses the Kaiaulu API. Therefore, if this is done right, the Kaiaulu architecture with Notebooks remains the same. The API can be used for "normal sized" projects, where Notebooks can be analyzed and digested more slowly, especially by newcomers to the tool and MSR. This would also help test data assumptions, refine the project configuration file, etc. Meanwhile, the server-side large scale projects would require the project configuration file. Users could run the commands sequentially if they do not have a multi-threaded server, or download the BatchJobPool script to parallelize everything.
- The bottom line is that Kaiaulu would have 3 layers a user could approach (the API through Notebooks, the CLI, and the BatchJobPool script), while still maintaining an R package architecture.
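To make the dispatch idea above a bit more concrete, here is a minimal sketch in Python. It is not Codeface's BatchJobPool: it swaps in the standard library's `concurrent.futures` as a stand-in, and the script path and the `--start`/`--end` flags are hypothetical placeholders, not the actual Kaiaulu CLI. It only shows the shape: split the range into monthly windows, build one `Rscript` command per window, then run them sequentially or in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date
import subprocess


def monthly_windows(start, end):
    """Split the half-open range [start, end) into month-aligned windows."""
    windows = []
    cur = date(start.year, start.month, 1)
    while cur < end:
        # First day of the following month (December rolls over to January).
        nxt = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
        windows.append((max(cur, start), min(nxt, end)))
        cur = nxt
    return windows


def build_command(script, config, window):
    """Build one Rscript call for one time window.

    Hypothetical CLI shape: the --start/--end flags are an assumption made
    for illustration, not the real Kaiaulu CLI interface.
    """
    start, end = window
    return ["Rscript", script, config,
            "--start", start.isoformat(), "--end", end.isoformat()]


def dispatch(commands, workers=1, run=subprocess.run):
    """Run the commands one by one (workers=1) or through a thread pool.

    Threads are enough here because each job is its own R process.
    """
    if workers <= 1:
        return [run(cmd) for cmd in commands]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))
```

With `workers=1` this is the "commands sequentially" case; a larger `workers` value is the parallel case. The `run` argument is there so the dispatch logic can be tried out without R installed.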
This is, however, future work :^) There are still a lot of refinements to the API that need to be done. One day we will get there. Maybe by then we will also have an optional database, so we do not have to do these ad-hoc file name "queries", since I still want a minimal server-side exec script on a CRON to keep downloading data.
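For reference, the file name "query" could look roughly like the sketch below. It assumes, purely for illustration, that each monthly file carries a `YYYY-MM` stamp in its name (e.g. `2021-03.csv`); the naming scheme and the function name are hypothetical, nothing here exists in Kaiaulu yet. Given a time window, it returns the minimal set of monthly files that overlap it, which would then be loaded and subset more precisely.

```python
from datetime import date
import re

# Assumed naming convention: a YYYY-MM stamp somewhere in the file name.
MONTH_STAMP = re.compile(r"(\d{4})-(\d{2})")


def select_monthly_files(filenames, start, end):
    """Return the files whose month overlaps the half-open window [start, end)."""
    first_month = (start.year, start.month)
    selected = []
    for name in sorted(filenames):
        match = MONTH_STAMP.search(name)
        if match is None:
            continue  # not a monthly data file, skip it
        year, month = int(match.group(1)), int(match.group(2))
        # Keep the file if its month is not before the window's first month
        # and its first day falls before the window's end.
        if (year, month) >= first_month and date(year, month, 1) < end:
            selected.append(name)
    return selected
```

For example, a window from 2021-01-15 to 2021-03-01 would pick only the January and February files, and the rows before January 15 would still need to be dropped after loading.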
Thanks Nicole for helping me understand this scenario from Codeface leading to a path forward to scale Kaiaulu 👍