
Consider creating a new package for usage/research

Open photonbit opened this issue 9 months ago • 10 comments

There is a mismatch between the pytdml package definition:

Python library for TrainingDML-AI encode/decode

pytdml is a pure python parser and encoder for training datasets based on OGC Training Data Markup Language for AI standard.

And the content of the code in the repository.

In order to allow broader adoption of the package and the standard, it would be optimal to have only the model definitions, the converters to and from other formats, and import from string and import from file.

That would allow easier dependency management, a smaller package, and more freedom for users.

The purpose of this issue is to discuss the different sets of responsibilities currently in the package, and where they should belong. Most teams or people using the package will probably have their own data access layer, which could make having specific libraries like minio a source of dependency problems.

For example, in the current code we can observe that some files in pytdml.ml and pytdml.io require functions from datalibrary, which seems to be purpose/architecture-specific, separate code.

photonbit avatar Mar 13 '25 12:03 photonbit

I fully agree with the concerns you raised. The core focus of PyTDML should strictly adhere to the OGC standard's model definitions and encoding/decoding functionalities. The integrated data access layer (e.g., the datalibrary module) was initially developed to support the LuojiaSet project's example data services, which may have blurred responsibility boundaries and introduced unnecessary dependencies. In our team meeting this Friday, we will discuss the scope of the pytdml package, and we will give an answer afterwards.

Relifest avatar Mar 20 '25 04:03 Relifest

I will probably struggle to make Friday's meeting as I'm travelling, so I'd like to agree with this proposed approach.

pixalytics avatar Mar 20 '25 08:03 pixalytics

IMO, it would be fine to leave the package-specific functionalities centralized within this project, but they should be contained within their own importable submodules (e.g.: pytdml.io.minio, pytdml.io.datalibrary, pytdml.ml.torch, pytdml.ml.tensorflow, etc.), where dependencies for them are only loaded dynamically if available and invoked.

This is a fairly common practice in AI frameworks that dynamically enable distinct features based on the backends/capabilities that are installed (e.g.: https://github.com/pytorch/pytorch/tree/main/torch/backends, https://github.com/tensorflow/tensorflow/tree/master/tensorflow/python/distribute). Most users employing them would be familiar with concepts such as auto-detecting a backend or explicitly passing a backend parameter to select a specific IO parser or ML framework implementation, with the proposed file import being a reasonable default if no explicit backend is specified.

The top level pytdml.ml, pytdml.io, etc. should only define abstract classes common to all backends, such that new ones can be contributed and extended over time.

Package extra dependencies could be configured to help facilitate installation by users, such as doing pip install pytdml[minio,torch] if that specific combination was desired.
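A minimal sketch of the dynamic-loading pattern described above; the module paths (pytdml.ml.torch, pytdml.ml.tensorflow) follow the naming suggested in this comment and are hypothetical, not an existing pytdml API:

```python
# Sketch of lazy backend loading: the heavy dependency is only imported
# when the corresponding backend is requested or auto-detected.
# Module paths below are illustrative assumptions, not the real package.
import importlib
import importlib.util
from typing import Optional

_BACKENDS = {
    "torch": "pytdml.ml.torch",
    "tensorflow": "pytdml.ml.tensorflow",
}


def load_backend(name: Optional[str] = None):
    """Return the requested backend module, auto-detecting one if unspecified."""
    if name is not None:
        if name not in _BACKENDS:
            raise ValueError(f"Unknown backend: {name!r}")
        if importlib.util.find_spec(name) is None:
            # Optional dependency missing; point users at the matching extra.
            raise ImportError(f"{name!r} not installed; try: pip install pytdml[{name}]")
        return importlib.import_module(_BACKENDS[name])
    # Auto-detect: use the first backend whose dependency is installed.
    for dep, module in _BACKENDS.items():
        if importlib.util.find_spec(dep) is not None:
            return importlib.import_module(module)
    # Fall back to the plain file-based IO proposed as the default.
    return importlib.import_module("pytdml.io")
```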

fmigneault avatar Mar 20 '25 16:03 fmigneault

Thank you for the thoughtful suggestions! In fact, we've already implemented preliminary submodule isolation in the repository (e.g., separating core functionalities from PyTorch/TensorFlow integrations). However, we overlooked architectural concerns in the data access layer. The current implementation (e.g., tdml_torch_data_pipe.py) relies on TrainingDML-encoded datasets from our internal sample platform; without access to that platform's database, these modules are non-functional. Exposing internal database logic in the repository poses security and maintainability risks. I think we can remove the data access layer code from the public repository and, in the future, encapsulate these functionalities as service APIs, enabling secure and platform-agnostic data access. For other components (e.g., ML framework support), your proposal for submodule isolation and dynamic dependency loading is great; I think we can prioritize implementing this pattern.

Relifest avatar Mar 21 '25 09:03 Relifest

My concern lies in the scope of the package, and whether it should be an AI framework or restrict the base package to the package description. Under that description, what should be included are the pydantic classes and tools to load from files or strings: it could be pytdml.from_json to load from a pytdml JSON file, and something like pytdml.stac.from_json, pytdml.coco.from_json or pytdml.yaml.from_json for the conversion formats, while leaving the handling of networking outside of the package. Both datalibrary and ml include applications of the standard that are not part of the standard itself, but example usages of what the standard can be used for.
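For illustration, a minimal sketch of that restricted surface, assuming pydantic models; the TrainingDataset fields and the from_json helper here are placeholders, not the actual TrainingDML-AI schema or package API:

```python
# Sketch of a minimal, network-free pytdml surface; class fields and the
# from_json helper are illustrative assumptions, not the real package API.
import json
from pathlib import Path
from typing import Union

from pydantic import BaseModel


class TrainingDataset(BaseModel):
    """Placeholder stand-in for the TrainingDML-AI dataset model."""
    id: str
    name: str
    description: str = ""


def from_json(source: Union[str, Path]) -> TrainingDataset:
    """Load a TrainingDataset from a JSON string, or from a file if given a Path."""
    if isinstance(source, Path):
        source = source.read_text()
    return TrainingDataset(**json.loads(source))
```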

photonbit avatar Mar 24 '25 14:03 photonbit

That seems like a viable approach as well if the scope is considered too big for this package, but the end-user experience should be considered to make sure pytdml can be properly leveraged. If the end application of pytdml always requires importing another AI framework plugin, say pytdml-pytorch, and causes friction from misaligned definitions, it will negatively impact pytdml anyway.

On the other hand, if no such plugin is provided to help users employ pytdml swiftly with their ML framework, they might just decide it's not worth the effort, and discard pytdml entirely.

It's a fine line that must be dealt with carefully for good adoption.

fmigneault avatar Mar 24 '25 17:03 fmigneault

I see more fine lines in the case at hand. The most important being: is the current step a step forward? When thinking too many steps ahead, we might end up adding more assumptions than necessary for the current scenario, making it easy to mistake them for ground truths.

It is my understanding that the current team taking care of the package does not have the capacity to design and implement a full-blown machine learning framework. Designing the package with that specific goal in mind might be an interesting thought experiment. However, I do believe that the separation approved by the SWG is a step forward that could perhaps lead to building a bridge towards integrating with existing machine learning frameworks, or even creating its own.

photonbit avatar Apr 04 '25 09:04 photonbit

I agree with you. I think the full-blown machine learning framework proposed in the previous meeting is work that needs more time and energy; for now, it is just an idea. All we have to do now is separate the dependency groups. We can solve the existing issue in this way first.
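For the dependency-group separation, the standard mechanism is packaging extras; a sketch using setuptools (the group names and version pins below are assumptions, not the project's actual configuration):

```python
# Illustrative setup.py fragment declaring optional dependency groups, so
# heavy backends are only installed on request (pip install pytdml[minio,torch]).
from setuptools import find_packages, setup

setup(
    name="pytdml",
    packages=find_packages(),
    extras_require={
        "torch": ["torch>=2.0"],             # assumed pin
        "tensorflow": ["tensorflow>=2.12"],  # assumed pin
        "minio": ["minio>=7.0"],             # assumed pin
    },
)
```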

Relifest avatar Apr 08 '25 08:04 Relifest

I'm not sure I follow where the "full-blown machine learning framework" proposal comes from. What I think pytdml should provide (and limit itself to) is sufficient converter tools to obtain, for example, a torch.utils.data.Dataset similar to how MNIST or COCO interfaces are provided (https://pytorch.org/tutorials/beginner/basics/data_tutorial.html), and maybe some very common dataset entries such as EuroSAT, ImageNet, etc. (https://pytorch.org/vision/stable/datasets.html). Once the Dataset is converted from PyTDML JSON/XML representations, all the rest is entirely in the hands of the relevant ML framework.

If the PyTDML package limits itself "only" to loading JSON/XML, everyone has to repeat the tedious task of interpreting the parameters TDML offers into the necessary representation of the ML framework. In a sense, there is no real advantage to using PyTDML in such a case, since one can already parse the JSON/XML manually and do this job on their own without the extra, unnecessary package dependency. For PyTDML to be relevant and used by the community, it has to offer "something more" that makes it easier for developers to integrate it seamlessly into their AI pipelines. If one can simply call pytdml.ml.to_torch (or whatever) without having to deal with PyTDML versioning/parsing, then there is added value. Once the Dataset instance is obtained, it is irrelevant that it originated from PyTDML to run a "typical AI pipeline"; therefore, PyTDML should not be involved in the rest of that process/ML framework (nor does it need to define anything more for them).
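To make that boundary concrete, a converter could be as thin as the following sketch; the tdml_dataset.data attribute and its (path, label) entries are assumptions about the parsed model, not the actual pytdml schema:

```python
# Minimal sketch of a TDML-to-PyTorch adapter; everything past this class
# stays in the hands of the ML framework, as argued above.
from PIL import Image
from torch.utils.data import Dataset


class TDMLImageDataset(Dataset):
    """Expose a parsed TrainingDML-AI dataset as a torch.utils.data.Dataset."""

    def __init__(self, tdml_dataset, transform=None):
        # Assumed shape: a list of (image_path, label) pairs on the model.
        self.samples = tdml_dataset.data
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```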

fmigneault avatar Apr 09 '25 15:04 fmigneault

I am sorry if my choice of words could be interpreted as opposition to your proposal. My intention is to frame the effort with some considerations in mind:

  • Even if we do have the 1.0 version of the standard, it is currently under discussion, and I believe those discussions will bring many changes to the currently approved version, likely making it more aligned with STAC, for example.
  • I do believe that your proposals are indeed very reasonable, and that they should be the shape of how to use the standard in the future. I was particularly amazed by the proposal you made for the code sprint.
  • It’s important to address the state of the library, which was developed with a particular use case in mind.
  • I do not think it is uncommon to have to install different packages to support different backends.
  • And we should acknowledge the current capacity of the team maintaining the package.

With all of that in mind, I see the proposal to limit the functionality to the package description as a strategic move toward implementing the functionality you suggested.

photonbit avatar Apr 29 '25 18:04 photonbit