Eliminating Manual Class Registration in Unitxt with Import Paths

Open elronbandel opened this issue 10 months ago • 14 comments

Problem Statement

In Unitxt, every artifact in the catalog includes a __type__ field in its JSON representation. This field stores the class that was used to instantiate the artifact, which is necessary for loading it back into a Python instance.

Currently, Unitxt relies on a class registry that maps a prettified class name to its actual class. The __type__ field stores the prettified name, and when an artifact is loaded, this name is used to look up the original class in the registry.

However, this approach introduces several challenges:

  1. Manual Class Registration – Any class that might appear in the catalog must be registered in advance.
  2. Import Dependencies – Users must explicitly import all custom classes used in the catalog within any code accessing it. This can be difficult to debug and communicate to users.
  3. Ongoing Maintenance – Users frequently encounter this issue and must manually maintain the solution.
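
To make the failure mode concrete, here is a minimal sketch of a name-based registry. It is not Unitxt's actual code: `CLASS_REGISTRY`, `register`, `to_snake_case`, and `load_artifact` are hypothetical names, and the snake_case conversion is a simplification of whatever prettifying Unitxt really applies.

```python
import re

# Hypothetical registry: prettified (snake_case) name -> class.
CLASS_REGISTRY = {}


def to_snake_case(name):
    """Simplified CamelCase -> snake_case conversion (illustrative only)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def register(cls):
    """Class decorator that records the class under its prettified name."""
    CLASS_REGISTRY[to_snake_case(cls.__name__)] = cls
    return cls


def load_artifact(data):
    """Instantiate an artifact from its JSON-like dict via the registry."""
    type_name = data["__type__"]
    if type_name not in CLASS_REGISTRY:
        # This is the error users hit when the defining module was never
        # imported, so the class was never registered.
        raise KeyError(f"Class for __type__ {type_name!r} is not registered")
    kwargs = {k: v for k, v in data.items() if k != "__type__"}
    return CLASS_REGISTRY[type_name](**kwargs)


@register
class MyOperator:
    def __init__(self, factor=1):
        self.factor = factor
```

Loading `{"__type__": "my_operator", "factor": 3}` works only because the module defining `MyOperator` has already been imported; the JSON itself carries no hint of where the class lives.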

Proposed Solution

Instead of storing a prettified name, we propose changing the __type__ field to store:

  • A full import path (e.g., "unitxt.loaders.LoadHF") for globally available classes.
  • A relative import path (e.g., ".MyOperator") based on a registered folder.

By default, the current working directory will be automatically registered, making the system more intuitive for small projects running locally.
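
A minimal sketch of how such paths might be resolved with `importlib`. The names `resolve_type` and `REGISTERED_DIRS` are hypothetical and not part of Unitxt. The sketch also assumes that a relative path names a module file inside a registered folder (for example ".my_ops.MyOperator"); how the shorter ".MyOperator" form would locate its file is left open here.

```python
import importlib
import importlib.util
from pathlib import Path

# Hypothetical list of registered folders; per the proposal, the current
# working directory would be registered by default.
REGISTERED_DIRS = [Path.cwd()]


def resolve_type(type_path, registered_dirs=None):
    """Return the class named by a full or relative import path."""
    if registered_dirs is None:
        registered_dirs = REGISTERED_DIRS
    module_path, _, class_name = type_path.rpartition(".")

    if not type_path.startswith("."):
        # Full import path, e.g. "collections.OrderedDict".
        module = importlib.import_module(module_path)
        return getattr(module, class_name)

    # Relative path (assumed form: ".my_ops.MyOperator").
    relative_module = module_path.lstrip(".")
    if not relative_module:
        raise ImportError(f"{type_path!r} does not name a module file")
    for folder in registered_dirs:
        candidate = Path(folder, *relative_module.split(".")).with_suffix(".py")
        if candidate.is_file():
            # Load the module directly from its file in the registered folder.
            spec = importlib.util.spec_from_file_location(relative_module, candidate)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            return getattr(module, class_name)
    raise ImportError(f"Cannot resolve {type_path!r} in the registered folders")
```

With this shape, the loader imports only the module an artifact actually needs, at the moment it is loaded, so nothing has to be registered or imported in advance.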

Benefits of the Proposed Change

  1. No More Manual Class Registration – Libraries using Unitxt will no longer need to register their classes manually.
  2. Improved Usability for Small Projects – Projects operating within a single working directory will work seamlessly using relative imports.
  3. Support for Larger Projects – Projects without a formal package structure can register their main directories and use relative imports.

This change will make Unitxt more user-friendly, reduce setup complexity, and improve error handling.

elronbandel avatar Feb 04 '25 20:02 elronbandel

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Apr 11 '25 02:04 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 25 '25 02:04 github-actions[bot]

This is still being worked on, right?

yoavkatz avatar Apr 26 '25 04:04 yoavkatz

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar May 27 '25 02:05 github-actions[bot]

This is still important.

yoavkatz avatar May 27 '25 08:05 yoavkatz

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Jun 27 '25 03:06 github-actions[bot]

Still being worked on in #1713.

yoavkatz avatar Jun 30 '25 04:06 yoavkatz

@yoavkatz, the look of the catalog (as suggested in PR #1713) is evidently not backward compatible: the __type__ field is defined differently, as module and class (in the PR) versus snake_case (thus far). @elronbandel identified this as a problem that prevents the PR from being accepted. The PR contains a utility that runs all the prepare files to 'face-lift' the catalog. The utility needs to be run once per project (over the project's prepare files), so backward compatibility is resolved within minutes. Please share your view on this issue.

dafnapension avatar Jul 14 '25 14:07 dafnapension

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Aug 23 '25 02:08 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Sep 07 '25 02:09 github-actions[bot]

This is still an important issue that people struggle with.

yoavkatz avatar Sep 07 '25 06:09 yoavkatz

Hi @yoavkatz, I have already prepared a PR that solves this for all unitxt classes. That version uses a new unitxt catalog, where each __type__ is expressed as a dict of module and class. However, being backward compatible, it can also work with the current unitxt catalog, where each __type__ is expressed as the snake_case of the relevant class (with no mention of any module). The way the PR finds the module is (mainly) a simple grep over the files under src/unitxt.
In other words: we can offer users a version that does not invoke register_all_artifacts upfront (in the __init__) and instead runs a grep, only for the classes that are actually needed. No other change to the code or the catalog. Would that be of interest to you until @elronbandel decides what to do with the full PR mentioned above?
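
As a rough illustration only (this is not the PR's code), a grep-style lookup could look like the sketch below. `find_module_for_type` is a hypothetical name, and matching class names by dropping underscores and lowercasing is an assumption that ignores possible name collisions.

```python
import re
from pathlib import Path


def find_module_for_type(snake_name, source_root):
    """Scan .py files under source_root for a class matching snake_name.

    Returns (module path relative to source_root, class name), or None.
    """
    # Matches top-level definitions such as "class MyOperator(Base):".
    class_def = re.compile(r"^class\s+(\w+)\s*[(:]", re.MULTILINE)
    # Assumed matching rule: compare names with underscores and case removed.
    target = snake_name.replace("_", "").lower()
    root = Path(source_root)
    for path in sorted(root.rglob("*.py")):
        for class_name in class_def.findall(path.read_text(encoding="utf-8")):
            if class_name.lower() == target:
                module = ".".join(path.relative_to(root).with_suffix("").parts)
                return module, class_name
    return None
```

The returned module name could then be imported on demand, so only the classes a given artifact needs are ever loaded.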

dafnapension avatar Sep 18 '25 12:09 dafnapension

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Oct 21 '25 02:10 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Nov 04 '25 02:11 github-actions[bot]

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Dec 05 '25 03:12 github-actions[bot]