# Add file storage option

## Summary
To allow us to use a cloud backend for file storage in Django, this adds a value to options.py that defaults to Django's `FileSystemStorage` (which was already the implicit default, but now we make sure of it).
This should make it configurable on BCK: if there is a module providing a Google Cloud backend class that implements the Django `Storage` interface, it can be set as the value for this setting.
For example, if we add a new class `GCloudStorage` in `kolibri.core.storage`, then we would use that class by setting the option added here to `kolibri.core.storage.GCloudStorage`.
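For context, Django resolves a storage backend from a dotted path like the one above. A minimal stdlib-only analogue of that resolution (Django's real helper is `django.utils.module_loading.import_string`) might look like this:

```python
import importlib


def import_string(dotted_path):
    # Minimal analogue of django.utils.module_loading.import_string, which is
    # how Django turns a dotted path such as the DEFAULT_FILE_STORAGE setting
    # into a class object.
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# "django.core.files.storage.FileSystemStorage" (the default) and the
# hypothetical "kolibri.core.storage.GCloudStorage" would both resolve
# the same way.
```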
This is very much a first whack. One thing I'm not clear on: since the option is named `DEFAULT_FILE_STORAGE`, the same name Django looks for in the env vars, does the Kolibri options.py machinery automatically apply that setting because of the matching name?
## References
Fixes #9441 (or at least begins to address it)
## Reviewer guidance
- Run Kolibri without changing the options.ini file and do a file-storagey thing (I generate a Facility Data CSV)
- Uncomment `STORAGE_BACKEND` and change its value to `gcs`, then restart Kolibri. You should get an error message that the module is not available. Run `pip install -r requirements/storages.txt`, then try again; there should be no error and Kolibri should start up.
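For reference, the options.ini change being tested would look roughly like this; the section name is an assumption here, so check options.py for the real section:

```ini
# Hypothetical options.ini sketch -- section name is an assumption
[FileStorage]
STORAGE_BACKEND = gcs
```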
## Next steps
- Work out a way to test this in a BCK environment - will chat with @jredrejo for guidance from his experience w/ KDP.
- Test it that way too :)
## Testing checklist
- [ ] Contributor has fully tested the PR manually
- [ ] If there are any front-end changes, before/after screenshots are included
- [ ] Critical user journeys are covered by Gherkin stories
- [ ] Critical and brittle code paths are covered by unit tests
## PR process
- [ ] PR has the correct target branch and milestone
- [ ] PR has 'needs review' or 'work-in-progress' label
- [ ] If PR is ready for review, a reviewer has been added. (Don't use 'Assignees')
- [ ] If this is an important user-facing change, PR or related issue has a 'changelog' label
- [ ] If this includes an internal dependency change, a link to the diff is provided
## Reviewer checklist
- PR is fully functional
- PR has been tested for accessibility regressions
- External dependency files were updated if necessary (`yarn` and `pip`)
- Documentation is updated
- Contributor is in AUTHORS.md
## Build Artifacts
| Asset type | Download link |
|---|---|
| PEX file | kolibri-0.18.0b2.dev0_git.35.g0254a7f2.pex |
| Windows Installer (EXE) | kolibri-0.18.0b2.dev0+git.35.g0254a7f2-windows-setup-unsigned.exe |
| Debian Package | kolibri_0.18.0b2.dev0+git.35.g0254a7f2-0ubuntu1_all.deb |
| Mac Installer (DMG) | kolibri-0.18.0b2.dev0+git.35.g0254a7f2.dmg |
| Android Package (APK) | kolibri-0.18.0b2.dev0+git.35.g0254a7f2-0.1.4-debug.apk |
| Raspberry Pi Image | kolibri-pi-image-0.18.0b2.dev0+git.35.g0254a7f2.zip |
| TAR file | kolibri-0.18.0b2.dev0+git.35.g0254a7f2.tar.gz |
| WHL file | kolibri-0.18.0b2.dev0+git.35.g0254a7f2-py2.py3-none-any.whl |
@jredrejo thank you for retargeting - I've rebased and restructured the PR a bit so the commit history is cleaner and hopefully easier to follow. Ready for code review when you can.
Also, could you try it locally w/ your gcloud credentials? I think I lost something that I had gotten working previously but cannot figure it out right now.
EDIT: I figured it out - had to set up my local ADC creds per https://cloud.google.com/docs/authentication/set-up-adc-local-dev-environment
@rtibbles this failing test may be related to the original issue itself, because it is ultimately trying to get hold of the DB file using `os.path.join(KOLIBRI_HOME...)` instead of `DefaultStorage()`.
The repair_db code calls that to get the default backup folder.
The local storage uses the MEDIA_ROOT by default.
Not sure the best path forward so would appreciate your thoughts.
There seems to be a problem with how files are being handled that is leaving file handles open. This is why things are erroring on Windows, which is much more sensitive to files not being closed:

```
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'D:\\a\\kolibri\\kolibri\\.pytest_kolibri_home\\media\\UserImportCommandTestCase.csv'
```
I haven't looked through the code in detail to see where this might be happening, but it suggests that something isn't being handled entirely properly.
@nucleogenesis and I had a quick sync up to see how to consolidate the file handling here - seems like there are some deficiencies in the Django Storage conformance to the Python file object spec https://forum.djangoproject.com/t/file-open-to-support-different-encodings/21491 - but an idea there to wrap the file object in a TextIOWrapper to ensure we generate with the correct encoding and newlines seems promising!
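The `TextIOWrapper` idea mentioned above could be sketched like this; the helper name and the encoding/newline defaults are illustrative assumptions, not the actual Kolibri implementation:

```python
import csv
import io


def open_csv_for_reading(binary_file, encoding="utf-8-sig", newline=""):
    # Wrap a binary file object (e.g. the result of storage.open(name, "rb"))
    # in a TextIOWrapper so the csv module sees text with a known encoding
    # and newline handling, regardless of the storage backend.
    return io.TextIOWrapper(binary_file, encoding=encoding, newline=newline)
```

With `newline=""`, `csv.reader` then parses rows correctly whether the backend hands back `\r\n` or `\n` line endings.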
Hi @nucleogenesis - on my end, while regression testing the .csv generation, I noticed that it's no longer possible to generate the same .csv file twice. For example, if I go to Facility > Data and generate a session logs file for the current day, it gets generated correctly, but if I attempt to do the same thing 5 minutes later it still says "Generated 5 minutes ago."
The same applies when generating a new user .csv file - the first time it works fine, but if I go ahead and add several new users, go back to Facility > Data, and click "Generate new user CSV file" again, it will seem to generate correctly, but when you open it you don't see the newly added users. There are no errors in the console.
Here are the logs from one of my devices though: logs.zip
https://github.com/user-attachments/assets/eb497396-a39e-4e53-8214-28ea1a15ec1d
@pcenov I've made some significant changes here and I've been able to test it locally in both the Google Cloud & the local context.
Can you go through the various CSV downloads in Kolibri again to confirm?
I will connect with you once it is testable in a cloud context if needed.
@rtibbles things are ready for re-review code-wise, I'll annotate a few areas w/ thoughts & questions as well
@rtibbles updating the google-storages module and removing the GS_CREDENTIALS handling altogether just worked locally <3
Hi @nucleogenesis, I confirm that the generation of the session and summary logs is working correctly now. However, when I try to import a user CSV file I repeatedly get the following error, even without having modified the just-exported CSV file in any way:
```
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/kolibri/core/tasks/job.py", line 326, in execute
    result = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/kolibri/core/tasks/registry.py", line 237, in __call__
    return self.func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/kolibri/core/auth/tasks.py", line 188, in importusersfromcsv
    delete=delete,
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/management/__init__.py", line 181, in call_command
    return command.execute(*args, **defaults)
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/usr/lib/python3/dist-packages/kolibri/core/tasks/management/commands/base.py", line 28, in handle
    return self.handle_async(*args, **options)
  File "/usr/lib/python3/dist-packages/kolibri/core/auth/management/commands/bulkimportusers.py", line 873, in handle_async
    self.number_lines = self.get_number_lines(filepath)
  File "/usr/lib/python3/dist-packages/kolibri/core/auth/management/commands/bulkimportusers.py", line 717, in get_number_lines
    with open_csv_for_reading(filepath) as f:
  File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/lib/python3/dist-packages/kolibri/core/utils/csv.py", line 50, in open_csv_for_reading
    with default_storage.open(filename, "rb") as f:
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/files/storage.py", line 38, in open
    return self._open(name, mode)
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/files/storage.py", line 243, in _open
    return File(open(self.path(name), mode))
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/core/files/storage.py", line 338, in path
    return safe_join(self.location, name)
  File "/usr/lib/python3/dist-packages/kolibri/dist/django/utils/_os.py", line 31, in safe_join
    'component ({})'.format(final_path, base_path))
django.core.exceptions.SuspiciousFileOperation: The joined path (/home/osboxes/.kolibri/temp/tmpv3rj_54v.upload.csv) is located outside of the base path component (/home/osboxes/.kolibri/media)
```
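For anyone following along, the failing check at the bottom of that traceback can be sketched as follows; this is a simplified stand-in for Django's `safe_join`, not its actual implementation, raising `ValueError` where Django raises `SuspiciousFileOperation`:

```python
import os


def safe_join_sketch(base, *paths):
    # Simplified sketch of Django's safe_join: the joined path must stay
    # inside the base directory. A path under $KOLIBRI_HOME/temp joined
    # against the media root fails this check, producing the error above.
    final = os.path.abspath(os.path.join(base, *paths))
    base = os.path.abspath(base)
    if final != base and not final.startswith(base + os.sep):
        raise ValueError(
            "The joined path ({}) is located outside of the base path "
            "component ({})".format(final, base)
        )
    return final
```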
Here's a video as well:
https://github.com/user-attachments/assets/a1f7378c-3086-461c-810a-74a76690883c
Logs: logs.zip
@pcenov I've updated the code so that you should be able to import a user by CSV. I've tested in the cloud env, can you confirm it is resolved locally as well?
Hi @nucleogenesis - I confirm that I can import users now, however for some reason nothing happens when I click the link 'Generate new user CSV file':
https://github.com/user-attachments/assets/1db852b7-b7a6-4d63-a061-7b417f99c580
@pcenov thank you for catching that - I've resolved the issue and tested it in cloud & local envs.
@rtibbles in https://github.com/learningequality/kolibri/pull/12590/commits/15838260d102bb891968f0ae762243bf5b4292ee I removed some code I added in hopes of allowing for non-overwrite operations.
I ran into errors where it said the file wasn't open for writing... so I changed `rb+` to `wb+` and still got "file not open for writing", and I got to wondering whether the wrapping & encoding of a file for write+read was causing the issues, but I couldn't work out anything to substantiate that.
In any case, the `bulkuserexport` overwrite flag is simply a way to exit the task early if the file already exists, to avoid overwriting; so it seems fair for `open_csv_for_writing` to always overwrite when given the same filename.
Just wanted to get your thoughts.
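The behaviour described above could be sketched like this (function and parameter names are illustrative, not Kolibri's actual code): the overwrite flag only gates an early exit, while the write itself always truncates.

```python
import os


def export_csv(path, rows, overwrite=False):
    # The overwrite flag just exits early if the file already exists;
    # the write itself always truncates, so the writer helper can safely
    # always overwrite when given the same filename.
    if not overwrite and os.path.exists(path):
        return False  # bail out early instead of clobbering the file
    with open(path, "w", encoding="utf-8", newline="") as f:
        for row in rows:
            f.write(",".join(row) + "\r\n")
    return True
```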
@nucleogenesis - I wanted to verify that the issue is fixed but it's still the same in the latest build asset.
@pcenov I've updated this to make sure the CLI export/import works as well.
Could you please test the regular local-filesystem CSV file export & import as you did previously, checking for regressions? I have tested it locally and in the cloud environment successfully.
In addition to this, could you please test out the following CLI commands (note that if you put -h at the end, it will explain the usage):
- `kolibri manage exportlogs` - This exports Summary & Session logs. Note the `-l` flag, which determines which of these you're running. Additionally, please test with the `-O` output file for the export, putting the file in a few places on your filesystem.
- `kolibri manage bulkuserexport` - This will export all of the default facility's users to a CSV in the current working directory (where you are in your terminal). You should be able to export to a particular file using the `-O` flag.
- `kolibri manage bulkuserimport` - This will import a CSV - please test valid and invalid CSVs and try the `--dryrun` flag, which will only validate the data.
- For export commands, please test them with the `-s` flag. This is newly added and should result in your files being placed into your `$KOLIBRI_HOME/media/` folder (or a subfolder of it).
In general, don't worry about testing flags that aren't related to files, whether reading or writing them. Things like overwrite, errorlines, etc. are not crucial here.
Thank you in advance - I've tried these all out locally so I'm hoping you see the same!
Hi @nucleogenesis - unfortunately it seems that there is a problem with the build as it says that linting is failing. Tried rerunning the jobs but it keeps failing.
Testing with the available build asset is not possible because both the generation of new session/summary logs and the generation of a new user CSV file are failing:
https://github.com/user-attachments/assets/ebd0b105-7d95-48cc-8a77-9fca69a6a2fe
@pcenov :thinking:
I downloaded the assets and I'm able to generate & import/export the files. Our version hashes are the same so we're definitely on the same asset.
Could you share your logs?
@nucleogenesis I did a quick manual test here by generating a users CSV, downloading it, and then adding a new user, and generating a new users CSV.
When I went to download this new users CSV file, it did not contain the new user at all; but looking in my media folder, I do see another CSV file, so it seems that it has generated a new file rather than overwriting the existing one.
I assume this is probably what is happening in @pcenov's case as well? The logs probably won't show anything out of the ordinary, because no error has occurred, but the exported CSV files are not overwriting the previous versions.
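The symptom matches how Django's storage layer avoids name collisions: `Storage.save()` asks `get_available_name()` for a fresh name when the target exists, rather than overwriting. A simplified stdlib-only analogue of that behaviour (not Django's exact suffixing scheme):

```python
import itertools
import os


def available_name(directory, name):
    # Simplified analogue of Django's Storage.get_available_name(): if the
    # target file exists, pick a suffixed name instead of overwriting --
    # which would explain the extra CSV file appearing in the media folder.
    root, ext = os.path.splitext(name)
    candidate = name
    for i in itertools.count(1):
        if not os.path.exists(os.path.join(directory, candidate)):
            return candidate
        candidate = "{}_{}{}".format(root, i, ext)
```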
@nucleogenesis - yes, there are no errors in the logs as mentioned by Richard but adding them just in case: logs.zip
Thanks @rtibbles for testing!
I've updated the code and tested the issues reported and things seem to be working. The elapsed time is reporting correctly and I have generated a CSV, updated it, imported it, then downloaded it again and saw the user I added.
@pcenov please give it another shot when you can
Hi @nucleogenesis - Happy to let you know that this time I didn't observe any regressions while testing the .csv file export & import and that newly generated session/summary/user csv files are correctly replacing the previously existing files.
Also no issues observed while testing the following CLI commands:
- `kolibri manage exportlogs` - I confirm that it correctly exports the summary/session logs as specified. Here's an example of a working command with multiple flags: `kolibri manage exportlogs -O session_logs.csv -l session --start_date 2025-04-02T00:00:00 --end_date 2025-04-03T23:59:59 --facility 618d5168ae7eeaf1ab6c448c92c4e7e9 -w`
- `kolibri manage bulkexportusers` - no issues observed while exporting with or without flags; the new `-s` flag is also working as specified
- `kolibri manage bulkimportusers` - works as specified
Something that's probably known to everyone involved but mentioning it just in case is that the documentation at https://kolibri.readthedocs.io/en/latest/manage/command_line.html will have to be updated to include the new changes.
Also, the following commands listed there are not working:
- `kolibri manage exportlogs --log-type summary`
- `kolibri manage exportlogs --log-type session`