Use TextIOWrapper for opening file from repository in text mode
In the current implementation, the contents of a file are wrapped in a StringIO if a user tries to open a file from the repository in text mode (i.e. with folderdata.open('somefile.txt', 'r')). This means the entire file is loaded into memory, which can cause a significant increase in memory usage for large text files. This PR proposed to instead wrap the binary stream in a TextIOWrapper for streaming based access in text mode to avoid having to load the entire file into memory.
A small change was also necessary for the cif-file interface, since apparently the CifParser from pymatgen only accepts filenames or StringIO streams.
This PR depends on a PR for the disk-objectstore dependency (see https://github.com/aiidateam/disk-objectstore/pull/192) that adds some attributes to custom streaming types that are required by TextIOWrapper.
See also https://aiida.discourse.group/t/memory-usage-folderdata-open/524/22 for a discussion on this issue.
I've added a small test that for the reading in text mode. I think that should cover the changes, right? The small change to the cif-interface is already covered by an existing test (that is also how I discovered that that needed to be changed).
Note btw that the changes in the disk-objectstore PR are needed for the tests to pass. I guess a bump of the version number is required in pyproject.toml once a new version of disk-objectstore with these changes has been released?
Thanks a lot @ahkole,
To link the correct dependency to your PR, in
pyproject.tomlremove the version:
'disk-objectstore~=1.2'->'disk-objectstore'And add a hook to your fork and branch
[tool.uv.sources] disk-objectstore = {git = "https://github.com/ahkole/disk-objectstore.git", branch = "XX"}and for
enviroment.ymljust get rid of the version:'disk-objectstore~=1.2'->'disk-objectstore'In the end, using
pre-commit runit automatically updates theuvfile
Ah, I wasn't aware this was possible. Thanks for the suggestion! I think this should be set correctly now.
Unrelated to this PR, but can I propose btw to update the "Setting up your development environment" page of the Contributor wiki to mention using uv as your Python package manager (for example as a recommendation)? As far as I can see that wiki still mostly mentions pip even though uv is much more convenient (as I have now learned).
Thanks a lot @ahkole,
Changes looks good to me, I run the workflow now to see if tests pass.
Ok, we're hittinh this github bug again, workflow will not run because I forgot to click on "Approve and run", before pushing :smiling_face_with_tear: I do a dirty trick
Good! the trick worked, now let's see the tests..
Codecov Report
:white_check_mark: All modified and coverable lines are covered by tests.
:white_check_mark: Project coverage is 79.26%. Comparing base (3a7d440) to head (ec4c7da).
:warning: Report is 1 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #6847 +/- ##
==========================================
- Coverage 79.26% 79.26% -0.00%
==========================================
Files 566 566
Lines 43794 43795 +1
==========================================
Hits 34711 34711
- Misses 9083 9084 +1
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
:rocket: New features to boost your workflow:
- :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
@ahkole, thanks a gain for your contribution,
One of the tests is failing for sqlite storage backend.
To reproduce run this:
pytest --db-backend sqlite tests/restapi/test_statistics.py
While for psql gonna pass:
pytest --db-backend psql tests/restapi/test_statistics.py
It appears that restapi server has problems to properly read and load the data from a sqlite backend.
To debug, I'd suggest check out this file:
'src/aiida/storage/sqlite_dos/backend.py'
____________________________ test_count_consistency ____________________________
[gw1] linux -- Python 3.9.22 /home/runner/work/aiida-core/aiida-core/.venv/bin/python3
restapi_server = <function restapi_server.<locals>._restapi_server at 0x7f43030aff70>
server_url = <function server_url.<locals>._server_url at 0x7f43002acc10>
@pytest.mark.usefixtures('populate_restapi_database')
def test_count_consistency(restapi_server, server_url):
"""Test the consistency in values between full_type_count and statistics"""
server = restapi_server()
server_thread = Thread(target=server.serve_forever)
_server_url = server_url(port=server.server_port)
try:
server_thread.start()
type_count_response = requests.get(f'{_server_url}/nodes/full_types_count', timeout=10)
statistics_response = requests.get(f'{_server_url}/nodes/statistics', timeout=10)
finally:
server.shutdown()
> type_count_dict = linearize_namespace(type_count_response.json()['data'])
tests/restapi/test_statistics.py:51:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/restapi/test_statistics.py:31: in linearize_namespace
linearize_namespace(subspace, linear_namespace)
tests/restapi/test_statistics.py:31: in linearize_namespace
linearize_namespace(subspace, linear_namespace)
tests/restapi/test_statistics.py:31: in linearize_namespace
linearize_namespace(subspace, linear_namespace)
tests/restapi/test_statistics.py:31: in linearize_namespace
linearize_namespace(subspace, linear_namespace)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tree_namespace = {'counter': 5, 'full_type': None, 'label': 'Unregistered', 'namespace': '~no-entry-point~', ...}
linear_namespace = {'data.': 156, 'data.Data.': 20, 'data.core.': [136](https://github.com/aiidateam/aiida-core/actions/runs/14864011669/job/41740571178?pr=6847#step:7:137), 'data.core.array.': 5, ...}
def linearize_namespace(tree_namespace, linear_namespace=None):
"""Linearize a branched namespace with full_type, count, and subspaces"""
if linear_namespace is None:
linear_namespace = {}
full_type = tree_namespace['full_type']
> while full_type[-1] != '.':
E TypeError: 'NoneType' object is not subscriptable
tests/restapi/test_statistics.py:23: TypeError
Hi @khsrali ,
I have tried to run the test as you suggested but on my system the test passes, see below:
> pytest --verbose --db-backend sqlite tests/restapi/test_statistics.py
===================================================================================== test session starts =====================================================================================
platform linux -- Python 3.11.9, pytest-7.4.4, pluggy-1.5.0 -- .venv/bin/python3
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: .
configfile: pyproject.toml
plugins: asyncio-0.16.0, xdist-3.6.1, timeout-2.3.1, datadir-1.5.0, rerunfailures-12.0, cov-4.1.0, regressions-2.7.0, benchmark-4.0.0
timeout: 240.0s
timeout method: thread
timeout func_only: False
collected 1 item
tests/restapi/test_statistics.py::test_count_consistency PASSED [100%]
====================================================================================== warnings summary =======================================================================================
tests/restapi/test_statistics.py::test_count_consistency
.venv/lib/python3.11/site-packages/CifFile/CifFile_module.py:356: SyntaxWarning: "is not" with a literal. Did you mean "!="?
if do_imports is not 'No':
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================== slowest 50 durations =====================================================================================
(3 durations < 1s hidden. Use -vv to show these durations.)
================================================================================ 1 passed, 1 warning in 1.71s =================================================================================
I seem to be using a different Python version though (3.11 vs 3.9 in the results you shared). Don't know if that can make the difference?
yeah, strange, I tried to run locally with 3.9, also passes for me.
I don't understand why it fails on the CI :thinking:
Is there a way to look more into this type_count_response = requests.get(f'{_server_url}/nodes/full_types_count', timeout=10) request and figure out why full_type if none when this gets executed on the CI host?
Is there a way to look more into this type_count_response = requests.get(f'{_server_url}/nodes/full_types_count', timeout=10) request and figure out why full_type if none when this gets executed on the CI host?
Yes, that would be the way to go when debugging. The pain is that we cannot reproduce it locally, so it's more difficult to understand it and resolve it.
@eimrek, we don't understand why test fails here. I thought since you are an expert of restspi, maybe you can have a look (?) :smile:
hi! sorry, super busy with other stuff, so i won't be able to take a detailed look.
But this seems to be related to #6747 and #6763. Do you have #6763 included in this PR? maybe if you rebase on main this issue will be fixed?
@eimrek Thanks for the reply! If I check the history of my branch (https://github.com/ahkole/aiida-core/commits/objectstore-textio/), it does seem to include https://github.com/aiidateam/aiida-core/pull/6763. It was also last synced with main on May 1st, which was after the merge of https://github.com/aiidateam/aiida-core/pull/6763 I believe.
Is it somehow possible to run interactively on the CI host to try to debug the issue?
@ahkole tests pass now :man_shrugging: I didn't understand in the end what caused the issue. I added this incident to our flaky test collector.
For me this PR looks good already. Let me have a look at your other PR in diskobjectstore. Once that's merged, we'll need to make a release. Later we can come back here and merge this.
@khsrali that's great to hear! Let me know what you think of the disk-objectstore PR.