What's the best route to save the predictions into a csv file (using Tribuo classes)?

Open neomatrix369 opened this issue 4 years ago • 9 comments

Ask the question: What's the best route to save the predictions into a csv file (using Tribuo classes)? Say I have a List<Prediction<Regressor>>.

One way could be to iterate through the list of items and write it to disk via some FileXxxx() class.

Is your question about a specific Tribuo class? List<Prediction<Regressor>> and Dataset (one of its concrete subclasses)

neomatrix369 avatar Jan 28 '21 18:01 neomatrix369

There isn't a helper to write out a csv file of predictions. You can save the dataset back out using CSVSaver, but that won't have the predicted values in it.

It's roughly two lines by converting the list into a stream.

First write out the dimension headers from the output info inside the model, and then predictions.stream().map(Prediction::getOutput).map(Regressor::getValues).map(Arrays::toString).map(s -> s.substring(1,s.length()-1)).forEach(writer::println). Admittedly that's a little ugly, as it has to strip off the [ and ] that Arrays.toString() puts on. There is a cleaner way, with a slightly more complex lambda that combines those two operations.
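
The cleaner variant mentioned above could look something like the sketch below. It is plain Java with no Tribuo dependency: each double[] stands in for the array returned by Regressor::getValues, and the class and method names (PredictionCsvSketch, toCsvRow, writeCsv) as well as the DIM-0/DIM-1 header names are made up for illustration.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: each double[] stands in for the values of one regression prediction.
class PredictionCsvSketch {

    // Joins one row of values with commas, replacing the
    // Arrays.toString() + substring() pair with a single step.
    static String toCsvRow(double[] values) {
        return Arrays.stream(values)
                     .mapToObj(Double::toString)
                     .collect(Collectors.joining(","));
    }

    // Writes a header line followed by one line per prediction.
    static void writeCsv(List<String> dimensionNames, List<double[]> rows, PrintWriter writer) {
        writer.println(String.join(",", dimensionNames));
        rows.stream().map(PredictionCsvSketch::toCsvRow).forEach(writer::println);
        writer.flush();
    }

    public static void main(String[] args) {
        // A StringWriter keeps the example self-contained; a real run would wrap a file.
        StringWriter buffer = new StringWriter();
        writeCsv(List.of("DIM-0", "DIM-1"),
                 List.of(new double[] {1.5, 2.0}, new double[] {3.25, 4.0}),
                 new PrintWriter(buffer));
        System.out.print(buffer);
    }
}
```

With real predictions, the rows would presumably come from predictions.stream().map(Prediction::getOutput).map(Regressor::getValues), as in the one-liner above, and the PrintWriter would wrap a file instead of a StringWriter.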

Craigacp avatar Jan 28 '21 19:01 Craigacp

Alternatively, there is Regressor.getSerializableForm(), which produces an output string of the form DIM-0=<value>,...,DIM-N=<value>, depending on how exactly you want the output to look. This format is the one that's easily consumed by RegressionFactory.generateOutput.

Craigacp avatar Jan 28 '21 19:01 Craigacp

It would be nice to have a method that allows this, because it's something we all probably want to do as part of a pipeline. I can think of many use cases, and I'm already in the middle of one.

neomatrix369 avatar Jan 28 '21 19:01 neomatrix369

Ok. I'm not sure where such a method should live. We have done this in the past when writing out classification outputs for comparison against other systems, but it lives in the main method - https://github.com/oracle/tribuo/blob/main/Classification/Experiments/src/main/java/org/tribuo/classification/experiments/ConfigurableTrainTest.java#L169.

Any suggestions on where it should go? It needs to be specialised to each Output type, so I guess it could be a method on the OutputFactory?

Craigacp avatar Jan 28 '21 19:01 Craigacp

Let me try to work out a workflow from a user perspective. I think some of the low-level (granular) calls could be wrapped in higher-level functions, so we don't have to chain a lot of x.y.z() calls to get to the results. There is also a bit of cognitive overload when getting from one part of the flow to the other.

neomatrix369 avatar Jan 28 '21 19:01 neomatrix369

Also, another question sort of related to this one, say I have this block of code:

var mutableValidationDataset =  new MutableDataset(wineSource);
for (var i: mutableValidationDataset.getData()) {
     System.out.println(i); 
}

I'm not able to get hold of each of the examples in the mutableValidationDataset. I tried mutableValidationDataset.getData().get(0), but this does not give me any method I can make use of. I'm referring to https://tribuo.org/learn/4.0/javadoc/org/tribuo/impl/ArrayExample.html. It would be nice to be able to iterate through the features and target fields.

neomatrix369 avatar Jan 28 '21 19:01 neomatrix369

Assuming that's the complete snippet, it's because you forgot the type parameter on MutableDataset (it should probably be MutableDataset<Regressor>, but it might also infer it properly from the source, so MutableDataset<> could work). Because the type parameter is missing, the compiler treats it as a raw type and drops all the generics, so the Dataset implements Iterable rather than Iterable<Example<T>>, and type inference picks Object as the type for i.

You won't get ArrayExample back; the contract is for Example, but there aren't many methods that exist only on ArrayExample.
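
The raw-type effect can be reproduced without Tribuo. The sketch below uses a made-up ToyDataset class (a stand-in, not a Tribuo type) to show how `var` in a for-each loop is inferred as the element type when the type parameter is present, and as Object when it is left off.

```java
import java.util.Iterator;
import java.util.List;

class RawTypeSketch {

    // Hypothetical stand-in for a generic dataset; this is not a Tribuo class.
    static class ToyDataset<T> implements Iterable<List<T>> {
        private final List<List<T>> rows;

        ToyDataset(List<List<T>> rows) { this.rows = rows; }

        @Override
        public Iterator<List<T>> iterator() { return rows.iterator(); }
    }

    // With the type parameter present, `var row` is inferred as List<Integer>.
    static int sumFirstColumn(ToyDataset<Integer> data) {
        int total = 0;
        for (var row : data) {
            total += row.get(0);
        }
        return total;
    }

    // With a raw type, `var row` is inferred as Object, so a cast is needed.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static int sumFirstColumnRaw(List<List<Integer>> rows) {
        ToyDataset raw = new ToyDataset(rows);
        int total = 0;
        for (var row : raw) {
            // row.get(0) would not compile here because row is an Object.
            total += (Integer) ((List<?>) row).get(0);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Integer>> rows = List.of(List.of(1, 2), List.of(3, 4));
        System.out.println(sumFirstColumn(new ToyDataset<>(rows)));
        System.out.println(sumFirstColumnRaw(rows));
    }
}
```

Both methods compute the same sum; the only difference is whether the compiler knows the element type inside the loop. Adding the type parameter (or the diamond) to the constructor call removes the need for the cast.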

Craigacp avatar Jan 28 '21 20:01 Craigacp

I used your tips and some workarounds to get to my solutions, but ideally it would be good to have cleaner methods (flows) for this, i.e. class- or instance-level methods that get to the things we need from the input data as well as from the prediction classes.

neomatrix369 avatar Jan 28 '21 22:01 neomatrix369

What else did you need apart from the regression outputs? The features and ground truth outputs should be simple to access.

Craigacp avatar Jan 28 '21 22:01 Craigacp