
*: framework agnostic loaders

Open tharvik opened this issue 1 year ago • 4 comments

on the road to supporting ONNX, we first need to extract TFJS from the codebase, making discojs framework agnostic.

  • adding a Dataset type, wrapping a standard AsyncGenerator with some helpers (map, zip, split, batch, ...); a rough sketch follows this list
    • for now, disco.fit and validator.{assess,predict} require more type information (via TypedDataset) as these support any dataset for any task. it should be enforced at some point via Task splitting so that we can drop it.
    • now, any user of disco has to use it, keeping the remaining DataSplit, Data and tf.* internal to disco (and removing these when preprocessing and task typing are reworked)
  • also exposes trivial types: Image, Tabular & Text
  • redo loaders as small functions rather than cumbersome-to-use classes
  • added some convertors as a potential new way to do preprocessing
    • flat functions that encode type changes instead of wrapping everything in tf.TensorContainer (which is pretty much equivalent to any)
  • the webapp now types its datasets, with a clear separation between Image & Tabular and the Data components emitting changes
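
to make this concrete, here is roughly how such a wrapper and a functional loader could look. this is a sketch only; the names and signatures below are assumptions, not the actual discojs API:

```typescript
// minimal sketch only; names and signatures are assumptions, not the actual discojs API
class Dataset<T> {
  constructor(private readonly generate: () => AsyncGenerator<T>) {}

  [Symbol.asyncIterator](): AsyncGenerator<T> {
    return this.generate();
  }

  // lazily transform every element
  map<U>(fn: (item: T) => U): Dataset<U> {
    const source = this.generate;
    return new Dataset(async function* () {
      for await (const item of source()) yield fn(item);
    });
  }

  // group elements into fixed-size batches (the last batch may be smaller)
  batch(size: number): Dataset<T[]> {
    const source = this.generate;
    return new Dataset(async function* () {
      let current: T[] = [];
      for await (const item of source()) {
        current.push(item);
        if (current.length === size) {
          yield current;
          current = [];
        }
      }
      if (current.length > 0) yield current;
    });
  }

  // deterministic position-based split: roughly `ratio` of the elements go left, the rest right
  split(ratio: number): [Dataset<T>, Dataset<T>] {
    const source = this.generate;
    const side = (left: boolean) =>
      new Dataset(async function* () {
        let index = 0;
        for await (const item of source())
          if ((index++ % 100 < ratio * 100) === left) yield item;
      });
    return [side(true), side(false)];
  }
}

// loaders become small functions returning a Dataset instead of classes
function loadLines(text: string): Dataset<string> {
  return new Dataset(async function* () {
    for (const line of text.split("\n")) yield line;
  });
}
```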

tharvik avatar Jun 05 '24 12:06 tharvik

  • Clicking next on a task without connecting data doesn't display anything anymore at the Training step (while it used to display the training board)

as no dataset was entered (it's undefined), one can't train anything. I added a small message telling users to enter some data. a more functional way would be to rework the steps to disable the next step until data is entered.

  • Connecting data, clicking next to the Training board now displays the training things correctly but if I go back I can't clear the connected files anymore.

good catch! I indeed forgot to propagate the "clear file" event

  • Connecting the LUS COVID dataset and training fails with error Error: Based on the provided shape, [224,224,3], the tensor should have 150528 values but has 200704, maybe linked to the new image loading function? A similar error occurs on simple face: Error: Based on the provided shape, [200,200,3], the tensor should have 120000 values but has 160000

indeed, I mixed up RGBA & RGB images (an RGBA buffer has 4 bytes per pixel, so 224×224×4 = 200704 values instead of the expected 224×224×3 = 150528); added a real Image type and a related remove_alpha convertor.
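
roughly what such a convertor has to do, assuming the image carries a flat byte buffer (a sketch, not the actual discojs Image type or code):

```typescript
// sketch only; the image shape here is an assumption, not the actual discojs type
interface RawImage {
  width: number;
  height: number;
  data: Uint8Array; // 3 (RGB) or 4 (RGBA) bytes per pixel
}

function removeAlpha(image: RawImage): RawImage {
  const pixels = image.width * image.height;
  if (image.data.length === pixels * 3) return image; // already RGB

  // e.g. a 224x224 RGBA buffer has 224*224*4 = 200704 bytes; RGB needs 224*224*3 = 150528
  const rgb = new Uint8Array(pixels * 3);
  for (let i = 0; i < pixels; i++) {
    rgb[i * 3] = image.data[i * 4]; // R
    rgb[i * 3 + 1] = image.data[i * 4 + 1]; // G
    rgb[i * 3 + 2] = image.data[i * 4 + 2]; // B, the alpha byte at i*4+3 is dropped
  }
  return { ...image, data: rgb };
}
```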

  • Training on titanic works well, however the webapp fails to display the testing page correctly afterward and the Test button doesn't appear.

right, thanks for catching that, I misplaced a toRaw in there. NB: this is needed because VueJS proxies every value, but that doesn't work with JS private fields (#-prefixed values).
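
for readers unfamiliar with the gotcha, a small standalone illustration (not the actual webapp code) of why toRaw is needed when reactive state holds class instances with private fields:

```typescript
// small illustration of the Vue proxy vs. JS private field gotcha; not the actual webapp code
import { reactive, toRaw } from "vue";

class Model {
  #weights = [0.1, 0.2]; // private field, only reachable on the original instance
  get size(): number {
    return this.#weights.length;
  }
}

const state = reactive({ model: new Model() });

// state.model is a Proxy: calling `size` on it fails, because the private field
// check runs against the proxy, which never declared #weights.
// const broken = state.model.size; // throws a TypeError

// toRaw unwraps the proxy back to the original instance, so the access works
const size = toRaw(state.model).size;
console.log(size); // 2
```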

  • Training on the Skin condition and CIFAR10 tasks fails with error TypeError: valueScalar is undefined, you can find the sample skin condition dataset here.

oh, oops, I'm generating an empty dataset when uploading a folder; I should have added a throwing TODO, thanks for catching that.

  • There is no TextDatasetInput.vue so there is no way to connect data to the wikitext task in the webapp

correct, there isn't any. we can add one in a next iteration but that's out of scope for this PR. from what I remember, a student was supposed to work on that but it wasn't clear whether it happened in the end.

I've been terribly confused with the naming of Dataset, Data, DataSplit and tf.data.Dataset. A list of specific things I don't like about the current abstractions:

I totally agree with you, it's not understandable. that's also why this PR is introducing a new abstraction that should replace all of these in its next iterations (once I've reworked the convertors).

  • Dataset is not a data structure composed of Data elements

that would be the central (and only) way to have a series of data (image/tabular/text/...) with transformations applied to it.

  • Data is actually a dataset (and it is Data that wraps tf.data.Dataset rather than Dataset)

it will be dropped, replaced by Dataset<Image>, Dataset<Tabular> & Dataset<Text> with some convertors applied to them.

  • DataSplit represents a train-test split of Data rather than a dataset

it will be replaced by the return type of Dataset<T>.split(), i.e. [Dataset<T>, Dataset<T>].

  • dataset ends up being an attribute of data like in data.train.preprocess().batch().dataset

it will be real values, not fields, obtained by shaping Dataset<T> via convertor functions. for example, Dataset<Tabular> -> extract_columns -> convert_to_numbers -> Dataset<[features: number[], label: number]>; a sketch follows below.
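
for instance, reusing the Dataset sketch from earlier (names and shapes are hypothetical, not the final API):

```typescript
// hypothetical convertors as flat functions; reuses the Dataset<T> sketch from above
type Tabular = Record<string, string>;

// pick the feature columns and the label column out of each row
const extractColumns =
  (features: string[], label: string) =>
  (dataset: Dataset<Tabular>): Dataset<[string[], string]> =>
    dataset.map((row): [string[], string] => [features.map((name) => row[name]), row[label]]);

// parse the selected strings into numbers
const convertToNumbers = (
  dataset: Dataset<[string[], string]>,
): Dataset<[number[], number]> =>
  dataset.map(([features, label]): [number[], number] => [features.map(Number), Number(label)]);

// Dataset<Tabular> -> extractColumns -> convertToNumbers -> Dataset<[number[], number]>
// const prepared = convertToNumbers(extractColumns(["Age", "Fare"], "Survived")(rows));
```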

Can we either add more class documentation to clarify the relationships or rework the object-oriented architecture in a way that makes things self-explanatory? This is a bit beyond the scope of this PR but it feels like the right time to address it. Feel free to ignore this otherwise; I'm also happy to create an additional PR myself.

as I'm hoping to remove most of these soon, it seems a bit premature to update the doc now. I'll update the doc accordingly in the next one, but feel free to set up a PR if you feel it's taking too long.

tharvik avatar Jun 11 '24 08:06 tharvik

Sounds good! Is there a quick fix to support text dataset? Merging this PR means that wikitext will not be available on discolab.ai until a new PR implements text datasets.

JulienVig avatar Jun 11 '24 08:06 JulienVig

as no dataset was entered (it's undefined), one can't train anything. I added a small message telling users to enter some data. a more functional way would be to rework the steps to disable the next step until data is entered.

I quite liked simply displaying an error message when training without data ("input a dataset first"). That allowed new users to walk around and see things when they don't want to train an actual model. What do you think of it?

JulienVig avatar Jun 11 '24 14:06 JulienVig

Is there a quick fix to support text dataset? Merging this PR means that wikitext will not be available on discolab.ai until a new PR implements text datasets.

hmm, I think that's doable; not for the webapp Tester (it's not really supported now anyway), but for the other parts, yes.

as no dataset was entered (it's undefined), one can't train anything. I added a small message telling users to enter some data. a more functional way would be to rework the steps to disable the next step until data is entered.

I quite liked simply displaying an error message when training without data ("input a dataset first"). That allowed new users to walk around and see things when they don't want to train an actual model. What do you think of it?

hmm, users are weird, but yeah, that makes sense, it's very demo-y; adding it back.

tharvik avatar Jun 12 '24 15:06 tharvik

so, it's alive again! way smaller to review thanks to the parts that were extracted out. description updated.

tharvik avatar Aug 12 '24 12:08 tharvik

There seems to be something wrong with the image data loaders: connecting images (by group for lus_covid and by csv for mnist) results in an error (Something went wrong on our side) when clicking train (both alone and collaboratively). Titanic and the llm task work fine.

JulienVig avatar Aug 13 '24 09:08 JulienVig

There seems to be something wrong with the image data loaders

right, I misplaced a toRaw call. should be fixed, and I added a test for lus_covid's training to catch this earlier.

tharvik avatar Aug 13 '24 12:08 tharvik

Impressive work! You've virtually rewritten most of the codebase haha

haha, slowly getting there! when changing something as fundamental as tfjs, a good amount of rewriting is indeed involved.

The evaluation seems to have an issue with data loading:

that should be fixed now, with some cypress tests to ensure it stays that way. we still have a memory leak of one tensor every round; I didn't want to sink too much time into it.

tharvik avatar Aug 15 '24 10:08 tharvik

  • Training on lus_covid and then testing the model displays 0% accuracy and 0 samples visited
  • Trying to test/predict a model on titanic data by connecting the titanic_test.csv or titanic_train.csv doesn't display enough and there is a toaster error
  • Idem when trying to test a language model

haa, one of the dark corners of disco. so I finally took hold of it and reworked the whole discojs validation stack (rough sketch after this list):

  • predict yields the prediction for every line of the dataset
  • test yields a boolean indicating whether the prediction is correct
  • no more duplicated logic between the two, no more materialisation of the dataset's values in the results, no more mutability :sun_with_face:
  • specialized Tester into PredictSteps & TestSteps
  • simplified the validation store
    • we can even remove it down the road, if we rework the URL to actually encode the model ID in it
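
a rough sketch of the shape this takes, on top of the Dataset sketch from earlier (hypothetical signatures, not the exact discojs code):

```typescript
// rough sketch of the reworked validation flow; signatures are assumptions
interface Predictor<Features, Label> {
  predict(features: Features): Promise<Label>;
}

// predict lazily yields one prediction per element of the dataset
async function* predict<Features, Label>(
  model: Predictor<Features, Label>,
  dataset: Dataset<[Features, Label]>,
): AsyncGenerator<Label> {
  for await (const [features] of dataset) yield await model.predict(features);
}

// test lazily yields whether each prediction matches the label,
// so nothing gets materialised and nothing is mutated
async function* test<Features, Label>(
  model: Predictor<Features, Label>,
  dataset: Dataset<[Features, Label]>,
): AsyncGenerator<boolean> {
  for await (const [features, label] of dataset)
    yield (await model.predict(features)) === label;
}
```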

  • The Stop training button doesn't work when the training fails (for example, if you train collaboratively on mnist with a single user, the user times out waiting for other contributions; after the timeout, hitting stop training doesn't work)

arf, too bad. it comes from the AbortController only being observed between await points, and as Disco never returns when an error occurs (it should, but I was already kind of yak shaving), it hangs on the for await. throwing allows the failure to propagate. I'm reintroducing it and we can change it again when Disco behaves a bit more nicely.
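
to make the failure mode concrete, a small standalone illustration (not Disco's actual training loop):

```typescript
// standalone illustration, not Disco's actual code: the abort is only observed
// when execution reaches the check after an await point, so if the generator
// neither returns nor throws after a failure, the consumer's for-await just hangs
async function* trainingRounds(signal: AbortSignal): AsyncGenerator<number> {
  for (let round = 0; ; round++) {
    // throwing here propagates out of the for-await below and ends training
    if (signal.aborted) throw new Error("training aborted");
    yield round;
    await new Promise((resolve) => setTimeout(resolve, 1_000)); // await point where the abort gets noticed
  }
}

const controller = new AbortController();
setTimeout(() => controller.abort(), 3_500); // the "Stop training" button

try {
  for await (const round of trainingRounds(controller.signal))
    console.log("finished round", round);
} catch {
  console.log("training stopped");
}
```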

  • (At some point I also got the error "multiple dataset entered info=render function thrown undefined" but I can't reproduce it...)

hmm, that shouldn't happen. it occurs when entering multiple types of datasets in test/predict, which shouldn't be possible. I tried a bit to reproduce it but wasn't able to.

tharvik avatar Aug 20 '24 12:08 tharvik

huge thanks, massive change here!

martinjaggi avatar Aug 21 '24 13:08 martinjaggi

Yay we've done it! 🎉

JulienVig avatar Aug 22 '24 10:08 JulienVig

congrats!

martinjaggi avatar Aug 22 '24 12:08 martinjaggi