Distributed Compressed Sparse Column Matrix
Due Diligence
- General:
  - [x] title of the PR is suitable to appear in the Release Notes
- Implementation:
  - [x] unit tests: all split configurations tested
  - [x] unit tests: multiple dtypes tested
  - [x] documentation updated where needed
Description
A new format, similar to the one implemented in https://github.com/helmholtz-analytics/heat/pull/1028. Most of the code from the previous implementation is shared between the two formats; a short usage sketch follows the steps below.
Steps:
- [x] Implement a common base class
- [x] Factory methods
- [x] Sparse and Dense formats interconversion
- [x] Arithmetic Operations
- [x] Tests
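For reference, a minimal usage sketch, assuming the CSC factory mirrors the CSR one from #1028. The `ht.sparse.sparse_csc_matrix` call appears in the tests below; the `split` value and the `todense` call are assumptions here, not confirmed API:

```python
import torch
import heat as ht

# Build a torch CSC tensor and wrap it in the distributed class.
t = torch.tensor([[0.0, 1.0], [2.0, 0.0]]).to_sparse_csc()
dcsc = ht.sparse.sparse_csc_matrix(t, split=1)  # assumed: CSC distributes along columns

# Round-trip back to a dense DNDarray (method name assumed to match the CSR class).
dense = dcsc.todense()
```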
NOTE: In the future, when arithmetic operations are supported by `torch` for `sparse_csc`, they can be enabled for `DCSC_matrix` by restoring the code removed in commit https://github.com/helmholtz-analytics/heat/pull/1377/commits/6d727affd5c9cff63f84896e2c8e77fcfe042335. The binary operator has already been modified to be generic for both `DCSR_matrix` and `DCSC_matrix`.
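For illustration only, a hypothetical sketch of what such format-generic dispatch could look like; the function and its name are invented for this sketch and do not reflect heat's actual internals:

```python
import torch

def binary_op_sparse(op, a, b):
    """Apply an elementwise torch op to two sparse compressed tensors of the same layout."""
    if a.layout != b.layout:
        raise ValueError("operands must share the same sparse layout")
    if a.layout == torch.sparse_csc:
        # This is where the removed code would be restored once torch
        # implements elementwise operations for sparse_csc.
        raise NotImplementedError("torch does not implement elementwise ops for sparse_csc yet")
    return op(a, b)  # CSR path: torch supports elementwise add/mul here
```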
Type of change
New feature
I just discovered something that I should have seen a long time ago.
`torch` only officially supports the `matmul` operation for its CSR and CSC formats (https://pytorch.org/docs/stable/sparse.html#csr-tensor-operations).
The elementwise operations (`add` and `mul`) are implemented for CSR (although not mentioned in their docs) but not for CSC.
It is still possible for us to implement `matmul` for the `ht.sparse` module. Just wanted to know how we should proceed.
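A quick check of that support matrix, as a sketch; exact error types and messages vary across torch versions, and the claims themselves come from the comment above:

```python
import torch

a = torch.eye(3).to_sparse_csr()
b = torch.eye(3).to_sparse_csr()
print((a @ b).to_dense())  # matmul: officially supported for CSR
print((a + b).to_dense())  # elementwise add: works for CSR, though undocumented

c = torch.eye(3).to_sparse_csc()
d = torch.eye(3).to_sparse_csc()
print((c @ d).to_dense())  # matmul: supported for CSC per the linked docs
try:
    c + d                  # elementwise ops are not implemented for CSC
except (RuntimeError, NotImplementedError) as e:
    print("CSC add not supported:", e)
```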
Thank you for the PR!
Codecov Report
Attention: Patch coverage is 92.85714% with 9 lines in your changes missing coverage. Please review.
Project coverage is 91.74%. Comparing base (ee0d72a) to head (90dc1d6). Report is 325 commits behind head on main.
Files | Patch % | Lines |
---|---|---|
heat/sparse/_operations.py | 78.94% | 4 Missing :warning: |
heat/sparse/dcsx_matrix.py | 93.02% | 3 Missing :warning: |
heat/sparse/factories.py | 96.96% | 1 Missing :warning: |
heat/sparse/manipulations.py | 95.00% | 1 Missing :warning: |
Additional details and impacted files
```diff
@@ Coverage Diff @@
##             main    #1377      +/-  ##
=========================================
- Coverage   91.76%   91.74%   -0.03%
=========================================
  Files          80       80
  Lines       11640    11683      +43
=========================================
+ Hits        10682    10719      +37
- Misses        958      964       +6
```
Flag | Coverage Δ | |
---|---|---|
unit | 91.74% <92.85%> (-0.03%) | :arrow_down: |
@Mystic-Slice thank you so much for this, and apologies for the delay.
It's OK not to provide the element-wise operations for CSC if torch doesn't implement them yet.
The basic operation we want to be able to provide with this class is matrix multiplication CSR @ CSC, where the CSR is distributed (split=0) and the CSC is not distributed (sketched below). This operation requires no MPI communication and returns a distributed CSR. It would be a great starting point, and it doesn't need to be part of this PR.
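For illustration, a minimal sketch in plain torch (not the heat API) of why this needs no communication: with split=0, each process owns a contiguous block of rows of the left operand and a full copy of the right one, so every process can compute its block of rows of the product locally. A dense right-hand side stands in for the CSC operand here to keep the sketch runnable:

```python
import torch

# Two "ranks", each owning a contiguous block of rows of A (split=0);
# B is fully replicated on every rank.
A = torch.randn(6, 4)
B = torch.randn(4, 5)

local_blocks = [A[:3].to_sparse_csr(), A[3:].to_sparse_csr()]

# Each rank computes its block of rows of A @ B with purely local work.
local_results = [blk @ B for blk in local_blocks]  # sparse CSR @ dense -> dense

# Concatenating the local results reproduces the global product, row block
# by row block; no data from other ranks was ever needed.
assert torch.allclose(torch.cat(local_results), A @ B)
```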
Is this PR, the CSC class, ready for review @Mystic-Slice?
Thanks again, we highly appreciate your contribution!
Hi @ClaudiaComito, sorry for the delay from my side too. The implementation of DCSC_matrix is complete; I just need to write a few tests, which I will finish ASAP.
If tests aren't a blocker for review, we can proceed with that.
Hi @Mystic-Slice, thanks, the tests make the review so much easier, so please go ahead and sketch the tests first. Thanks a lot!
@ClaudiaComito The tests are done. The PR is ready for review.
Waiting on #1443 (solves the `test_random` failure on the AMD runner, among other things).
@Mystic-Slice on the CUDA runner, with PyTorch 2.2, we get the following error:
```
___________ ERROR collecting heat/sparse/tests/test_manipulations.py ___________
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/runner.py:341: in from_call
result: Optional[TResult] = func()
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/runner.py:372: in <lambda>
call = CallInfo.from_call(lambda: list(collector.collect()), "collect")
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/python.py:531: in collect
self._inject_setup_module_fixture()
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/python.py:545: in _inject_setup_module_fixture
self.obj, ("setUpModule", "setup_module")
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/python.py:310: in obj
self._obj = obj = self._getobj()
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/python.py:528: in _getobj
return self._importtestmodule()
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/python.py:617: in _importtestmodule
mod = import_path(self.path, mode=importmode, root=self.config.rootpath)
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/pathlib.py:565: in import_path
importlib.import_module(module_name)
/opt/conda/envs/heat_dev/lib/python3.10/importlib/__init__.py:126: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1050: in _gcd_import
???
<frozen importlib._bootstrap>:1027: in _find_and_load
???
<frozen importlib._bootstrap>:992: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:241: in _call_with_frames_removed
???
<frozen importlib._bootstrap>:1050: in _gcd_import
???
<frozen importlib._bootstrap>:1027: in _find_and_load
???
<frozen importlib._bootstrap>:1006: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:688: in _load_unlocked
???
<frozen importlib._bootstrap_external>:883: in exec_module
???
<frozen importlib._bootstrap>:241: in _call_with_frames_removed
???
heat/sparse/tests/__init__.py:1: in <module>
from .test_arithmetics_csr import *
<frozen importlib._bootstrap>:1027: in _find_and_load
???
<frozen importlib._bootstrap>:1006: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:688: in _load_unlocked
???
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
heat/sparse/tests/test_arithmetics_csr.py:4: in <module>
import heat as ht
heat/__init__.py:20: in <module>
from . import utils
heat/utils/__init__.py:5: in <module>
from . import data
heat/utils/data/__init__.py:7: in <module>
from . import mnist
heat/utils/data/mnist.py:7: in <module>
from torchvision import datasets
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/torchvision/__init__.py:6: in <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/torchvision/_meta_registrations.py:26: in <module>
def meta_roi_align(input, rois, spatial_scale, pooled_height, pooled_width, sampling_ratio, aligned):
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/torchvision/_meta_registrations.py:18: in wrapper
if torchvision.extension._has_ops():
E AttributeError: partially initialized module 'torchvision' has no attribute 'extension' (most likely due to a circular import)
```
Many test files return the same error. But we can't reproduce it with other PRs. Can you check it out? The ROCm runner runs fine, by the way.
I was able to recreate this error on my local machine.
It occurs when the torch and torchvision versions do not match.
In my case, specifically, `torch==2.2` and `torchvision==0.18.0` (which corresponds to `torch==2.3`).
When I made the versions match, everything worked fine.
I believe the torch version on the CUDA runner is 2.2 and not 2.3, but I am not sure.
Is there a way to check that?
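One way to check the installed pair (the comment above implies a mismatched torch/torchvision combination triggers the circular-import error):

```python
# Print the installed torch/torchvision versions, e.g. inside the runner's env.
import torch
import torchvision

print(torch.__version__, torchvision.__version__)
```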
~~@mtar can you look into this? thanks~~ never mind
@Mystic-Slice this looks amazing. Will you add your name to the CITATION.cff file, under the `# release contributors - add as needed` header? Thanks!
Done!
@Mystic-Slice one single test fails with `torch<2`, see below. From my side it's absolutely OK to skip the test when `torch<2`.
```
_________________________ TestDCSC_matrix.test_astype __________________________
self = <heat.sparse.tests.test_dcscmatrix.TestDCSC_matrix testMethod=test_astype>
def test_astype(self):
heat_sparse_csc = ht.sparse.sparse_csc_matrix(self.ref_torch_sparse_csc)
# check starting invariant
self.assertEqual(heat_sparse_csc.dtype, ht.float32)
# check the copy case for uint8
> as_uint8 = heat_sparse_csc.astype(ht.uint8)
heat/sparse/tests/test_dcscmatrix.py:273:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = , dtype = <class 'heat.core.types.uint8'>, copy = True
def astype(self, dtype, copy=True) -> __DCSX_matrix:
"""
Returns a casted version of this matrix.
Casted matrix is a new matrix of the same shape but with given type of this matrix. If copy is ``True``, the
same matrix is returned instead.
Parameters
----------
dtype : datatype
HeAT type to which the matrix is cast
copy : bool, optional
By default the operation returns a copy of this matrix. If copy is set to ``False`` the cast is performed
in-place and this matrix is returned
"""
dtype = canonical_heat_type(dtype)
> casted_matrix = self._array.type(dtype.torch_type())
E RuntimeError: torch.copy_: only sparse compressed tensors with the same number of specified elements are supported.
heat/sparse/dcsx_matrix.py:309: RuntimeError
```
@ClaudiaComito I referred to https://discuss.pytorch.org/t/why-does-pytorch-needs-the-three-functions-to-type-and-type-as/20164. It seems like `.type()` doesn't support conversion of a lot of Tensor types; I should have used the `.to()` method instead.
Made the change.
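A sketch of the change described above, on a plain torch tensor rather than the actual `heat/sparse/dcsx_matrix.py` code. This works on recent torch; as the next comment notes, it still failed for torch < 2.0:

```python
import torch

csc = torch.eye(3).to_sparse_csc()

# Original call, which raised the "torch.copy_: only sparse compressed
# tensors ..." error shown in the traceback above:
# casted = csc.type(torch.uint8)

# Replacement call described in the comment:
casted = csc.to(torch.uint8)
print(casted.dtype)  # torch.uint8
```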
Thank you for the PR!
Seems like it didn't work.
I will just skip the test for `torch < 2.0`.
Sorry for the delay; I thought the previous fix would work.
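One way such a version gate could look; the decorator placement and the version check are illustrative, not the actual heat test code:

```python
import unittest
import torch

# Skip dtype-casting tests on torch < 2.0, where casting sparse compressed
# tensors is not supported (see the failure above).
TORCH_LT_2 = int(torch.__version__.split(".")[0]) < 2

class TestDCSC_matrix(unittest.TestCase):
    @unittest.skipIf(TORCH_LT_2, "sparse CSC astype requires torch >= 2.0")
    def test_astype(self):
        ...
```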