*: framework agnostic loaders
On the road to supporting ONNX, we first need to extract TFJS from the codebase, making discojs framework agnostic.
- adding a `Dataset` type, wrapping a standard `AsyncGenerator` with some helpers (`map`, `zip`, `split`, `batch`, ...); see the sketch after this list
  - for now, `disco.fit` & `validator.{assess,predict}` require more type information (via `TypedDataset`) as these support any dataset for any task. it should be enforced at some point via `Task` splitting so that we can drop it.
    - for now, any user of disco has to use it, keeping the remaining `DataSplit`, `Data` & `tf.*` internal to disco (and removing these when preprocessing and task typing are reworked)
  - also exposes trivial types: `Image`, `Tabular` & `Text`
- redo loaders to be small functions rather than cumbersome-to-use classes
- added some `convertors` as a potential new way to do preprocessing
  - flat functions that encode type changes instead of wrapping everything in `tf.TensorContainer` (pretty much equivalent to `any`)
- the webapp now types its datasets, with a clear separation between Image & Tabular, and with the `Data` component emitting changes
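To give reviewers a feel for the new type, here is a minimal sketch of the `Dataset` idea, only showing `map` & `batch` (illustrative, not the actual implementation):

```ts
// Minimal sketch of a framework-agnostic Dataset wrapping an AsyncGenerator.
// The real class has more helpers (zip, split, ...); this only shows the shape.
class Dataset<T> implements AsyncIterable<T> {
  constructor(private readonly content: () => AsyncGenerator<T>) {}

  [Symbol.asyncIterator](): AsyncGenerator<T> {
    return this.content();
  }

  /** lazily apply a transformation to every element */
  map<U>(mapper: (_: T) => U): Dataset<U> {
    const content = this.content;
    return new Dataset(async function* () {
      for await (const element of content()) yield mapper(element);
    });
  }

  /** group elements into arrays of `size` (the last one might be smaller) */
  batch(size: number): Dataset<T[]> {
    const content = this.content;
    return new Dataset(async function* () {
      let accumulated: T[] = [];
      for await (const element of content()) {
        accumulated.push(element);
        if (accumulated.length === size) {
          yield accumulated;
          accumulated = [];
        }
      }
      if (accumulated.length > 0) yield accumulated;
    });
  }
}
```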
- Clicking next on a task without connecting data doesn't display anything anymore at the Training step (while it used to display the training board)
as no dataset was entered (it's undef), one can't train anything. I added a small message to tell users to enter some. one more functional way would be to rework the steps to disable the next step until data are entered.
- Connecting data and clicking next to the Training board now displays the training things correctly, but if I go back I can't clear the connected files anymore.
good catch! I indeed forgot to propagate the "clear file" event
- Connecting the LUS COVID dataset and training fails with error `Error: Based on the provided shape, [224,224,3], the tensor should have 150528 values but has 200704`, maybe linked to the new image loading function? A similar error occurs on simple face: `Error: Based on the provided shape, [200,200,3], the tensor should have 120000 values but has 160000`
indeed, I mixed up RGBA & RGB images; added a real `Image` type and a related `remove_alpha` convertor.
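For context, here is roughly what such a convertor does; the `Image` shape below is a hypothetical stand-in for the real type added by this PR:

```ts
// Hypothetical Image shape, standing in for the real type.
interface Image {
  readonly width: number;
  readonly height: number;
  readonly depth: 3 | 4; // RGB or RGBA
  readonly data: Uint8Array; // width * height * depth bytes
}

// Drop the alpha channel of an RGBA image; RGB images pass through untouched.
function remove_alpha(image: Image): Image {
  if (image.depth === 3) return image;

  const pixels = image.width * image.height;
  const data = new Uint8Array(pixels * 3);
  for (let i = 0; i < pixels; i++)
    data.set(image.data.subarray(i * 4, i * 4 + 3), i * 3);

  return { ...image, depth: 3, data };
}
```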
- Training on titanic works well, however the webapp fails to display the testing page correctly afterward and the Test button doesn't appear.
right, thanks for catching that, I misplaced a `toRaw` in there. NB: this is needed as VueJS proxies every value, but that doesn't work with JS private fields (`#`-prefixed values).
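To illustrate the Vue pitfall, a minimal standalone reproduction (not code from this PR):

```ts
import { reactive, toRaw } from "vue";

class Trainer {
  #rounds = 0; // JS private field: only readable on the original instance

  nextRound(): void {
    this.#rounds += 1;
  }
}

const proxied = reactive(new Trainer());

// Throws a TypeError: inside nextRound, `this` is the Vue Proxy, not the
// Trainer instance, so the private-field access fails its brand check.
// proxied.nextRound();

// Unwrapping with toRaw restores the original instance, so this works.
toRaw(proxied).nextRound();
```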
- Training on the Skin condition and CIFAR10 tasks fails with error `TypeError: valueScalar is undefined`; you can find the sample skin condition dataset here.
oh, oops, I'm generating an empty dataset when uploading a folder; I should have added a throwing TODO. thanks for catching that.
- There is no TextDatasetInput.vue so there is no way to connect data to the wikitext task in the webapp
correct, there isn't any. we can add one in a next iteration but that's out of scope for this PR. from what I remember, a student was supposed to work on that but it's unclear whether it happened in the end.
I've been terribly confused with the naming of `Dataset`, `Data`, `DataSplit` and `tf.data.Dataset`. A list of specific things I don't like about the current abstractions:
I totally agree with you, it's not understandable. that's also why this PR is introducing a new abstraction that should replace all of these in its next iterations (once the convertors are reworked).
- `Dataset` is not a data structure composed of `Data` elements
that would be the central (and only) way to have a series of data (image/tabular/text/...) with transformations applied to it.
- `Data` is actually a dataset (and it is `Data` that wraps `tf.data.Dataset` rather than `Dataset`)
it will be dropped, replaced by `Dataset<Image>`, `Dataset<Tabular>` & `Dataset<Text>` with some convertors applied to them.
- `DataSplit` represents a train-test split of `Data` rather than a dataset
it will be replaced by the return type of `Dataset<T>.split()`, i.e. `[Dataset<T>, Dataset<T>]`.
- dataset ends up being an attribute of data, as in `data.train.preprocess().batch().dataset`
it will be real values, not fields, by shaping `Dataset<T>` via convertor functions. for example: `Dataset<Tabular>` -> `extract_columns` -> `convert_to_numbers` -> `Dataset<[features: number[], label: number]>`
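Roughly, the chain above could look like the following sketch; `extract_columns` & `convert_to_numbers` are the hypothetical names from the example, not committed API, and `Dataset.map` is assumed to exist as described in the PR summary:

```ts
// Hypothetical flat convertors over the new Dataset type.
type Tabular = Partial<Record<string, string>>;

function extract_columns(
  dataset: Dataset<Tabular>,
  features: string[],
  label: string,
): Dataset<[features: string[], label: string]> {
  return dataset.map((row): [string[], string] => [
    features.map((feature) => row[feature] ?? ""),
    row[label] ?? "",
  ]);
}

function convert_to_numbers(
  dataset: Dataset<[features: string[], label: string]>,
): Dataset<[features: number[], label: number]> {
  return dataset.map(([features, label]): [number[], number] => [
    features.map(Number),
    Number(label),
  ]);
}

// titanic-style usage, assuming `rows: Dataset<Tabular>`
// const prepared = convert_to_numbers(
//   extract_columns(rows, ["Age", "Fare"], "Survived"),
// );
```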
Can we either add more class documentation to clarify the relationships or rework the object-oriented architecture in a way that makes things self-explanatory? This is a bit beyond the scope of this PR, but it feels like the right time to address it? Feel free to ignore, otherwise I'm also happy to create an additional PR myself.
as I'm hoping to remove most of these soon, it seems a bit premature to update the doc now. I'll update it accordingly in the next one, but feel free to set up a PR if you feel it's taking too long.
Sounds good! Is there a quick fix to support text datasets? Merging this PR means that wikitext will not be available on discolab.ai until a new PR implements text datasets.
> as no dataset was entered (it's `undefined`), one can't train anything. I added a small message telling users to enter some. a more functional way would be to rework the steps to disable the next step until data is entered.
I quite liked simply displaying an error message when training without data ("input a dataset first"). That allowed new users to walk around and see things when they don't want to train an actual model. What do you think of it?
> Is there a quick fix to support text datasets? Merging this PR means that wikitext will not be available on discolab.ai until a new PR implements text datasets.
hum, I think that's doable, not for the webapp Tester (it's not really supported now anyway) but for the other parts, yes.
> > as no dataset was entered (it's `undefined`), one can't train anything. [...]
>
> I quite liked simply displaying an error message when training without data ("input a dataset first"). That allowed new users to walk around and see things when they don't want to train an actual model. What do you think of it?
hum, users are weird, but yeah, makes sense, it's very demo-y; adding it back.
so, it's alive again! way smaller to review thanks to extracted scraps. description updated.
There seems to be something wrong with the image data loaders: connecting images (by group for lus_covid and by csv for mnist) results in an error ("Something went wrong on our side") when clicking train (both alone and collaboratively). Titanic and the llm task work fine.
> There seems to be something wrong with the image data loaders
right, I misplaced a `toRaw` call. should be fixed, and I added a test for lus_covid's training to catch this earlier.
Impressive work! You've virtually rewritten most of the codebase haha
haha, slowly getting there! when changing something as fundamental as tfjs, there is indeed a good deal of rewriting.
The evaluation seems to have an issue with data loading:
that should be fixed now, with some cypress tests to ensure it stays that way. we still have a memory leak of one tensor every round; I didn't want to sink too much time into it.
- Training on lus_covid and then testing the model displays 0% accuracy and 0 samples visited
- Trying to test/predict a model on titanic data by connecting the titanic_test.csv or titanic_train.csv doesn't display enough and there is a toaster error
- Idem when trying to test a language model
haa, one of the dark corners of disco. so I finally took hold of it and reworked the whole discojs validation stack:
- `predict` yields the prediction for every line of the dataset
- `test` yields a boolean indicating whether the prediction is correct
- no more duplicated logic between the two, no more materialisation of the dataset's values in the results, no more mutability :sun_with_face:
- specialize `Tester` into `PredictSteps` & `TestSteps`
- simplify the validation store
  - we can even remove it down the road, if we rework the URL to actually encode the model ID in it
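In spirit, the two generators now look something like this simplified sketch (the `Model` interface and the signatures are assumptions, and the real implementation likely zips instead of walking the dataset twice):

```ts
// What the model is assumed to expose for this sketch.
interface Model {
  predict(features: number[]): Promise<number>;
}

// yield the model's prediction for every line of the dataset
async function* predict(
  model: Model,
  dataset: Dataset<[features: number[], label: number]>,
): AsyncGenerator<number> {
  for await (const [features] of dataset) yield await model.predict(features);
}

// yield whether each prediction matches the expected label,
// reusing predict instead of duplicating its logic
async function* test(
  model: Model,
  dataset: Dataset<[features: number[], label: number]>,
): AsyncGenerator<boolean> {
  const predictions = predict(model, dataset);
  for await (const [, label] of dataset) {
    const prediction = await predictions.next();
    if (prediction.done) return;
    yield prediction.value === label;
  }
}
```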
- The Stop training button doesn't work when the training fails (for example, if you train collaboratively on mnist with a single user, the user times out waiting for other contributions. After the timeout, hitting stop training doesn't work)
arf, too bad. it comes from the `AbortController` only reacting between await points, and as Disco never returns when an error occurs (it should, but I was already deep in yak shaving), it hangs on the `for await`. throwing allows the failure to propagate. I'm reintroducing it and we can change it again when Disco behaves a bit more nicely.
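A small illustration of why cooperative cancellation behaves this way (standalone sketch, not Disco code):

```ts
// hypothetical long-running step, standing in for a Disco training round
async function trainOneRound(): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 1_000));
}

// Cancellation via AbortSignal is cooperative: nothing gets interrupted
// unless code between await points checks the signal, or the awaited
// producer itself rejects when aborted.
async function* rounds(signal: AbortSignal): AsyncGenerator<number> {
  for (let round = 0; ; round++) {
    signal.throwIfAborted(); // cancellation is only honored here
    await trainOneRound(); // if this hangs without ever rejecting, the
    // consumer's `for await` hangs with it, whatever the signal says
    yield round;
  }
}

const controller = new AbortController();
setTimeout(() => controller.abort(), 3_500);

try {
  for await (const round of rounds(controller.signal))
    console.log(`round ${round} done`);
} catch {
  console.log("aborted"); // reached only because rounds() throws
}
```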
- (At some point I also got the error "multiple dataset entered info=render function thrown undefined" but I can't reproduce it...)
hum, that shouldn't happen. it comes up when entering multiple types of dataset in test/predict, which shouldn't be possible. I tried a bit to reproduce it but wasn't able to.
huge thanks, massive change here!
Yay we've done it! 🎉
congrats!