sktime icon indicating copy to clipboard operation
sktime copied to clipboard

[ENH] Add `@targetlabel` identifier for `.ts` files.

Open achieveordie opened this issue 2 years ago • 1 comments

Reference Issues/PRs

Fixes #3412.

What does this implement/fix? Explain your changes.

Add an elif block to read for @targetlabel in two places - _read_header() and load_from_tsfile_to_dataframe(). The implementation is very similar to @classlabel but it doesn't accept additional values (since regression data has no class labels).

Does your contribution introduce a new dependency? If yes, which one?

No.

What should a reviewer concentrate their feedback on?

@TonyBagnall, I found a minor typo in one of the keys of meta_data of _read_header(). It probably should be has_class_label instead of class_label. Since it was written by you, please check if I misunderstood.

  • [x] Remove a possible silent bug due to a typo.
  • [x] Add @targetlabel as a valid identifier.
  • [ ] Add a test for regression data. For this, I want to add Covid19 Death Rate Dataset from tseregression. I need confirmation if the licence associated with it doesn't clash with our usage. This dataset would go inside datasets/data/. If this is not possible, I could also create a dummy regression dataset for testing purposes, similar to UnitTest.

Any other comments?

There is a major issue with our current implementation. When timestamps are present, code requires that we have class labels present(a list of possible labels like [0, 1, 2]). This is contrary to regression data. Since this might require major changes throughout, we could add it into the workflow as mentioned here. This PR only solves those cases where there is regression data without timestamps.

PR checklist

For all contributions
  • [ ] I've added unit tests and made sure they pass locally.
  • [x] The PR title starts with either [ENH], [MNT], [DOC], or [BUG] indicating whether the PR topic is related to enhancement, maintenance, documentation, or bug.

achieveordie avatar Sep 17 '22 08:09 achieveordie

this looks good and is funnily enough exactly what I need for another project. I was looking to put a TSR dataset in, perhaps this one? https://zenodo.org/record/3902690#.Yyxln3bMIQ8 I think its smallest in the Monash archive http://tseregression.org/ there is a guide to adding a new dataset in the developers guide https://github.com/alan-turing-institute/sktime/blob/main/docs/source/developer_guide/add_dataset.rst

TonyBagnall avatar Sep 22 '22 13:09 TonyBagnall

Yes, we both are referring to the same dataset. I wanted to keep the size small as well, so I chose the covid dataset.

This PR only solves those cases where there is regression data without timestamps.

I could extend this PR to correct this via an additional variable. Let me explain the issue:

  • Issue: Current loader does not work with regression data that have timestamps.
  • Reason: When timestamps are present, the code does an additional check to see if a label is present in the list of labels (they are provided with @classlabel for classification data). There are no such labels for regression data, resulting in a bug.
  • Solution: The easiest way to resolve this is to add another boolean variable to skip this step if the data is regressional. A more complete solution might include reworking the entire control flow.

I could add another variable to resolve it since there doesn't seem to be any concrete need to choose the latter solution. What do you think?

achieveordie avatar Sep 23 '22 08:09 achieveordie

thanks, I think the best way forward is to get this PR in, then raise an issue about loading regression problems with time stamps. You can add the step you mention on this PR as well if you want.

TonyBagnall avatar Sep 23 '22 10:09 TonyBagnall

will just wait until the new release before putting this in, there is a feature freeze in place

TonyBagnall avatar Sep 23 '22 12:09 TonyBagnall

dataset license is fine, is creative commons

fkiraly avatar Oct 01 '22 20:10 fkiraly