About csvLoader.loadDataSource

Open pablo3p opened this issue 2 years ago • 4 comments

Hi there,

From this tutorial on regression: https://github.com/oracle/tribuo/blob/main/tutorials/regression-tribuo-v4.ipynb

var wineSource = csvLoader.loadDataSource(Paths.get("winequality-red.csv"),"quality");

This wineSource is a data structure, but I don't see much documentation for it. I am assuming that wineSource is a tabular data structure, and I am hoping it is similar to a Python pandas DataFrame.

If that is the case, is there a print method, so one can print the data to the terminal?

There is not much out there on this.

Kind Regards,

Pablo

pablo3p avatar May 16 '23 15:05 pablo3p

CSVLoader returns a CSVDataSource. The DataSource interface doesn't have much in the way of accessor methods, so you should construct a MutableDataset from that data source, which will populate the feature and output information objects that you can query. If you want to print out the examples, you can iterate the data source and print each Example object.
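As a rough sketch of the "iterate and print" suggestion: a Tribuo DataSource is an Iterable of Example objects, so a small generic helper can collect the first few items as strings. The class `DataSourcePreview` and its `head` method are hypothetical names made up for this illustration and are not part of Tribuo. The `main` method uses a stand-in list so the block runs without Tribuo on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tribuo.
final class DataSourcePreview {

    private DataSourcePreview() {}

    /** Collects the string form of the first {@code limit} items of any Iterable. */
    static List<String> head(Iterable<?> source, int limit) {
        List<String> rows = new ArrayList<>();
        if (limit <= 0) {
            return rows;
        }
        for (Object item : source) {
            // Relies on the item's own toString() for a readable form.
            rows.add(String.valueOf(item));
            if (rows.size() == limit) {
                break;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in data; with Tribuo available, pass wineSource here instead.
        List<String> standIn = List.of("row-1", "row-2", "row-3");
        head(standIn, 2).forEach(System.out::println);
    }
}
```

With Tribuo available, `DataSourcePreview.head(wineSource, 5).forEach(System.out::println)` should print the first five examples from the tutorial's data source. For feature and output summaries, `new MutableDataset<>(wineSource)` followed by `getFeatureMap()` and `getOutputInfo()` is the route described above.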

Tribuo has a row-wise view of data, and doesn't provide a data frame style interface. If you want something more like a dataframe in Java then I think JTablesaw is supposed to be good for that, but I've not used it much.

Craigacp avatar May 17 '23 01:05 Craigacp

Hi there, thanks for your quick reply. When passing in data, I want to make sure that it is correct, and it looks like there is no way to check that once it has been loaded into a CSVDataSource. I would prefer to load the data from CSV into something like JTablesaw, and then pass it from JTablesaw into a Tribuo DataSource. Is this possible? Hope you can let me know.

P.

pablo3p avatar May 17 '23 01:05 pablo3p

You can inspect the examples after they have been loaded to make sure the pipeline is valid. I recommend looking at CSVDataSource rather than using CSVLoader as it's more flexible. There's a columnar data tutorial which explains the mechanisms - https://tribuo.org/learn/4.3/tutorials/columnar-tribuo-v4.html.

We don't currently support loading from JTablesaw into Tribuo because we can't capture the necessary provenance & reproducibility information out of a tablesaw dataset. It would be pretty useful to have, but due to the provenance issues we've not got around to it.
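One possible workaround, sketched here under assumptions, is to sanity-check the CSV file separately and then let Tribuo load the same unmodified file, so the provenance still points at the original CSV. The class `CsvPreCheck` and its `findProblems` method are hypothetical names for this illustration. The split is naive (it does not handle separators inside quoted fields), so treat it as a starting point and not as a full CSV parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical pre-check for illustration only; not part of Tribuo.
final class CsvPreCheck {

    private CsvPreCheck() {}

    /** Returns a list of human-readable problems; an empty list means none were found. */
    static List<String> findProblems(List<String> lines, char separator, String responseName) {
        List<String> problems = new ArrayList<>();
        if (lines.isEmpty()) {
            problems.add("file is empty");
            return problems;
        }
        String splitRegex = Pattern.quote(String.valueOf(separator));

        // The first line is assumed to be the header.
        String[] header = lines.get(0).split(splitRegex, -1);
        boolean hasResponse = false;
        for (String name : header) {
            if (unquote(name).equals(responseName)) {
                hasResponse = true;
            }
        }
        if (!hasResponse) {
            problems.add("header has no column named " + responseName);
        }

        // Every data row should have as many fields as the header.
        for (int i = 1; i < lines.size(); i++) {
            int fields = lines.get(i).split(splitRegex, -1).length;
            if (fields != header.length) {
                problems.add("line " + (i + 1) + ": expected " + header.length
                        + " fields, found " + fields);
            }
        }
        return problems;
    }

    /** Strips one pair of surrounding double quotes, if present. */
    private static String unquote(String field) {
        String trimmed = field.trim();
        if (trimmed.length() >= 2 && trimmed.startsWith("\"") && trimmed.endsWith("\"")) {
            return trimmed.substring(1, trimmed.length() - 1);
        }
        return trimmed;
    }

    public static void main(String[] args) {
        // Made-up sample rows; the third line is deliberately short.
        List<String> sample = List.of("\"alcohol\";\"quality\"", "9.4;5", "9.8");
        findProblems(sample, ';', "quality").forEach(System.out::println);
    }
}
```

For a real file, the lines could come from `Files.readAllLines(Paths.get("winequality-red.csv"))`, with the separator and response name matching whatever is passed to the Tribuo loader.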

Craigacp avatar May 17 '23 02:05 Craigacp

Hi, thanks again. The link you provided seems to have a lot of useful concepts etc.

Yes, having something like JTablesaw load the CSV first and then pass it on to something like the CSVDataSource would be really good, I think. It puts responsibility for the integrity of the data on the data scientist, who is the subject matter expert and should be able to look at the data frame (in this case JTablesaw) and decide whether the data is in proper shape to pass into the CSVDataSource. Allowing human intervention, especially at the data-source stage of the pipeline, is valuable because it gives the data scientist more control over data quality. This should be available as an option in Tribuo. I just wanted to elaborate on my thinking on this. Thanks again for all your great help, I really appreciate it. Best Regards,

P

pablo3p avatar May 17 '23 03:05 pablo3p