Global DataPack

Open zhanyuanucb opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe.
Entries are currently stored on a per-DataPack basis, which means that a DataPack corresponds to a single data point (for example, one document). This local information is sometimes useful, but global information across the whole pipeline, or even across multiple pipelines, is often more useful. For instance, we usually care about the BLEU score on the whole corpus rather than the scores for a few documents.
We need a way to collect and store data across the pipeline.
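
To make the gap concrete, here is a minimal sketch of why per-document scores cannot simply be averaged into a corpus-level score. It uses plain token accuracy as a stand-in for BLEU (an assumption made for brevity), and the counts are invented for illustration only.

```python
from typing import List, Tuple

# Hypothetical per-document results: (correct tokens, total tokens).
doc_counts: List[Tuple[int, int]] = [(9, 10), (1, 2), (50, 100)]


def per_document_scores(counts: List[Tuple[int, int]]) -> List[float]:
    """Score each document on its own (what a single DataPack can see)."""
    return [correct / total for correct, total in counts]


def corpus_score(counts: List[Tuple[int, int]]) -> float:
    """Score the whole corpus by pooling the raw counts first."""
    total_correct = sum(correct for correct, _ in counts)
    total_tokens = sum(total for _, total in counts)
    return total_correct / total_tokens


local = per_document_scores(doc_counts)
macro = sum(local) / len(local)   # mean of per-document scores
micro = corpus_score(doc_counts)  # score over pooled counts

print(f"mean of per-document scores: {macro:.4f}")
print(f"corpus-level score:          {micro:.4f}")
```

The two numbers differ because short documents weigh as much as long ones in the per-document mean. Computing the pooled score needs the raw counts from every document, which is exactly the data that has no home today.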

Describe the solution you'd like
Introduce Global DataPack and Reducer:

  • Global DataPack collects all the necessary data across the pipeline
  • There will be a reserved key for Global DataPack in the pipeline resources.
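
A rough sketch of how this could look, under stated assumptions: `GlobalDataPack`, its `add`/`get` methods, the key name `GLOBAL_PACK_KEY`, and the `process` function are all hypothetical names invented here, and a plain dict stands in for the pipeline resources. None of this is existing Forte API.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical reserved key; the proposal does not name one.
GLOBAL_PACK_KEY = "__global_data_pack__"


class GlobalDataPack:
    """Hypothetical container that accumulates values across data points."""

    def __init__(self) -> None:
        self._store: Dict[str, List[Any]] = defaultdict(list)

    def add(self, name: str, value: Any) -> None:
        """Append one value under `name`."""
        self._store[name].append(value)

    def get(self, name: str) -> List[Any]:
        """Return a copy of everything collected under `name`."""
        return list(self._store.get(name, []))


# A plain dict stands in for the pipeline resources (assumption).
resources: Dict[str, Any] = {GLOBAL_PACK_KEY: GlobalDataPack()}


def process(doc_text: str, shared: Dict[str, Any]) -> None:
    """Stand-in for a processor handling one per-document pack."""
    global_pack: GlobalDataPack = shared[GLOBAL_PACK_KEY]
    global_pack.add("token_count", len(doc_text.split()))


for doc in ["a b c", "d e"]:
    process(doc, resources)

total_tokens = sum(resources[GLOBAL_PACK_KEY].get("token_count"))
print(total_tokens)
```

Keeping the global pack behind a single reserved key means every processor can reach it through the resources it already receives, without changing the per-document DataPack interface.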

zhanyuanucb, Feb 04 '22 16:02

Nice issue! But I feel like it is a bit too large. We could consider implementing Global DataPack and Reducer as separate issues (or split it some other way you prefer)?

Btw, for the issue title, maybe we can talk about the general global data pack instead of focusing on metrics. At the end of the day, this utility is useful for other cases too.

hunterhector, Feb 04 '22 17:02

> We could consider implementing Global DataPack and Reducer as separate issues (or split it some other way you prefer)?

Agreed. I'll narrow this issue to Global DataPack only, open a separate one for Reducer, and update the issue title accordingly.

zhanyuanucb, Feb 04 '22 21:02