aiida-core icon indicating copy to clipboard operation
aiida-core copied to clipboard

Use TextIOWrapper for opening file from repository in text mode

Open ahkole opened this issue 7 months ago • 18 comments

In the current implementation, the contents of a file are wrapped in a StringIO if a user tries to open a file from the repository in text mode (i.e. with folderdata.open('somefile.txt', 'r')). This means the entire file is loaded into memory, which can cause a significant increase in memory usage for large text files. This PR proposed to instead wrap the binary stream in a TextIOWrapper for streaming based access in text mode to avoid having to load the entire file into memory.

A small change was also necessary for the cif-file interface, since apparently the CifParser from pymatgen only accepts filenames or StringIO streams.

This PR depends on a PR for the disk-objectstore dependency (see https://github.com/aiidateam/disk-objectstore/pull/192) that adds some attributes to custom streaming types that are required by TextIOWrapper.

ahkole avatar Apr 30 '25 13:04 ahkole

See also https://aiida.discourse.group/t/memory-usage-folderdata-open/524/22 for a discussion on this issue.

ahkole avatar Apr 30 '25 14:04 ahkole

I've added a small test that for the reading in text mode. I think that should cover the changes, right? The small change to the cif-interface is already covered by an existing test (that is also how I discovered that that needed to be changed).

Note btw that the changes in the disk-objectstore PR are needed for the tests to pass. I guess a bump of the version number is required in pyproject.toml once a new version of disk-objectstore with these changes has been released?

ahkole avatar May 01 '25 11:05 ahkole

Thanks a lot @ahkole,

To link the correct dependency to your PR, in pyproject.toml remove the version:

'disk-objectstore~=1.2' -> 'disk-objectstore'

And add a hook to your fork and branch

[tool.uv.sources]
disk-objectstore = {git = "https://github.com/ahkole/disk-objectstore.git", branch = "XX"}

and for enviroment.yml just get rid of the version: 'disk-objectstore~=1.2' -> 'disk-objectstore'

In the end, using pre-commit run it automatically updates the uv file

Ah, I wasn't aware this was possible. Thanks for the suggestion! I think this should be set correctly now.

ahkole avatar May 01 '25 18:05 ahkole

Unrelated to this PR, but can I propose btw to update the "Setting up your development environment" page of the Contributor wiki to mention using uv as your Python package manager (for example as a recommendation)? As far as I can see that wiki still mostly mentions pip even though uv is much more convenient (as I have now learned).

ahkole avatar May 01 '25 19:05 ahkole

Thanks a lot @ahkole,

Changes looks good to me, I run the workflow now to see if tests pass.

khsrali avatar May 06 '25 15:05 khsrali

Ok, we're hittinh this github bug again, workflow will not run because I forgot to click on "Approve and run", before pushing :smiling_face_with_tear: I do a dirty trick

khsrali avatar May 06 '25 15:05 khsrali

Good! the trick worked, now let's see the tests..

khsrali avatar May 06 '25 15:05 khsrali

Codecov Report

:white_check_mark: All modified and coverable lines are covered by tests. :white_check_mark: Project coverage is 79.26%. Comparing base (3a7d440) to head (ec4c7da). :warning: Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6847      +/-   ##
==========================================
- Coverage   79.26%   79.26%   -0.00%     
==========================================
  Files         566      566              
  Lines       43794    43795       +1     
==========================================
  Hits        34711    34711              
- Misses       9083     9084       +1     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar May 06 '25 16:05 codecov[bot]

@ahkole, thanks a gain for your contribution,

One of the tests is failing for sqlite storage backend.

To reproduce run this: pytest --db-backend sqlite tests/restapi/test_statistics.py While for psql gonna pass: pytest --db-backend psql tests/restapi/test_statistics.py

It appears that restapi server has problems to properly read and load the data from a sqlite backend. To debug, I'd suggest check out this file: 'src/aiida/storage/sqlite_dos/backend.py'

____________________________ test_count_consistency ____________________________
[gw1] linux -- Python 3.9.22 /home/runner/work/aiida-core/aiida-core/.venv/bin/python3

restapi_server = <function restapi_server.<locals>._restapi_server at 0x7f43030aff70>
server_url = <function server_url.<locals>._server_url at 0x7f43002acc10>

    @pytest.mark.usefixtures('populate_restapi_database')
    def test_count_consistency(restapi_server, server_url):
        """Test the consistency in values between full_type_count and statistics"""
        server = restapi_server()
        server_thread = Thread(target=server.serve_forever)
    
        _server_url = server_url(port=server.server_port)
    
        try:
            server_thread.start()
            type_count_response = requests.get(f'{_server_url}/nodes/full_types_count', timeout=10)
            statistics_response = requests.get(f'{_server_url}/nodes/statistics', timeout=10)
        finally:
            server.shutdown()
    
>       type_count_dict = linearize_namespace(type_count_response.json()['data'])

tests/restapi/test_statistics.py:51: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/restapi/test_statistics.py:31: in linearize_namespace
    linearize_namespace(subspace, linear_namespace)
tests/restapi/test_statistics.py:31: in linearize_namespace
    linearize_namespace(subspace, linear_namespace)
tests/restapi/test_statistics.py:31: in linearize_namespace
    linearize_namespace(subspace, linear_namespace)
tests/restapi/test_statistics.py:31: in linearize_namespace
    linearize_namespace(subspace, linear_namespace)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

tree_namespace = {'counter': 5, 'full_type': None, 'label': 'Unregistered', 'namespace': '~no-entry-point~', ...}
linear_namespace = {'data.': 156, 'data.Data.': 20, 'data.core.': [136](https://github.com/aiidateam/aiida-core/actions/runs/14864011669/job/41740571178?pr=6847#step:7:137), 'data.core.array.': 5, ...}

    def linearize_namespace(tree_namespace, linear_namespace=None):
        """Linearize a branched namespace with full_type, count, and subspaces"""
        if linear_namespace is None:
            linear_namespace = {}
    
        full_type = tree_namespace['full_type']
>       while full_type[-1] != '.':
E       TypeError: 'NoneType' object is not subscriptable

tests/restapi/test_statistics.py:23: TypeError

khsrali avatar May 07 '25 08:05 khsrali

Hi @khsrali ,

I have tried to run the test as you suggested but on my system the test passes, see below:

> pytest --verbose --db-backend sqlite tests/restapi/test_statistics.py
===================================================================================== test session starts =====================================================================================
platform linux -- Python 3.11.9, pytest-7.4.4, pluggy-1.5.0 -- .venv/bin/python3
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: .
configfile: pyproject.toml
plugins: asyncio-0.16.0, xdist-3.6.1, timeout-2.3.1, datadir-1.5.0, rerunfailures-12.0, cov-4.1.0, regressions-2.7.0, benchmark-4.0.0
timeout: 240.0s
timeout method: thread
timeout func_only: False
collected 1 item                                                                                                                                                                              

tests/restapi/test_statistics.py::test_count_consistency PASSED                                                                                                                         [100%]

====================================================================================== warnings summary =======================================================================================
tests/restapi/test_statistics.py::test_count_consistency
  .venv/lib/python3.11/site-packages/CifFile/CifFile_module.py:356: SyntaxWarning: "is not" with a literal. Did you mean "!="?
    if do_imports is not 'No':

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================== slowest 50 durations =====================================================================================

(3 durations < 1s hidden.  Use -vv to show these durations.)
================================================================================ 1 passed, 1 warning in 1.71s =================================================================================

I seem to be using a different Python version though (3.11 vs 3.9 in the results you shared). Don't know if that can make the difference?

ahkole avatar May 07 '25 09:05 ahkole

yeah, strange, I tried to run locally with 3.9, also passes for me. I don't understand why it fails on the CI :thinking:

khsrali avatar May 07 '25 10:05 khsrali

Is there a way to look more into this type_count_response = requests.get(f'{_server_url}/nodes/full_types_count', timeout=10) request and figure out why full_type if none when this gets executed on the CI host?

ahkole avatar May 08 '25 10:05 ahkole

Is there a way to look more into this type_count_response = requests.get(f'{_server_url}/nodes/full_types_count', timeout=10) request and figure out why full_type if none when this gets executed on the CI host?

Yes, that would be the way to go when debugging. The pain is that we cannot reproduce it locally, so it's more difficult to understand it and resolve it.

@eimrek, we don't understand why test fails here. I thought since you are an expert of restspi, maybe you can have a look (?) :smile:

khsrali avatar May 08 '25 14:05 khsrali

hi! sorry, super busy with other stuff, so i won't be able to take a detailed look.

But this seems to be related to #6747 and #6763. Do you have #6763 included in this PR? maybe if you rebase on main this issue will be fixed?

eimrek avatar May 12 '25 12:05 eimrek

@eimrek Thanks for the reply! If I check the history of my branch (https://github.com/ahkole/aiida-core/commits/objectstore-textio/), it does seem to include https://github.com/aiidateam/aiida-core/pull/6763. It was also last synced with main on May 1st, which was after the merge of https://github.com/aiidateam/aiida-core/pull/6763 I believe.

ahkole avatar May 12 '25 13:05 ahkole

Is it somehow possible to run interactively on the CI host to try to debug the issue?

ahkole avatar May 12 '25 13:05 ahkole

@ahkole tests pass now :man_shrugging: I didn't understand in the end what caused the issue. I added this incident to our flaky test collector.

For me this PR looks good already. Let me have a look at your other PR in diskobjectstore. Once that's merged, we'll need to make a release. Later we can come back here and merge this.

khsrali avatar May 14 '25 15:05 khsrali

@khsrali that's great to hear! Let me know what you think of the disk-objectstore PR.

ahkole avatar May 14 '25 16:05 ahkole