Andrei Zhabinski comments

Results: 180 comments of Andrei Zhabinski

Large datasets

Basically, the code for reading a particular dataset is unlikely to change often, so we can write the testing code, run it locally, and comment it out before merging. This way it will...

Large datasets

Using environment variables sounds like a perfect solution! Regarding subdatasets and custom CI, I guess that's much more involved and may not work for all users. Or maybe I...

Large datasets

Sorry for the infrequent replies - I've been overwhelmed with other projects I'm in charge of. I agree that `UInt8` is rarely useful in practice, but it's unclear what it use...

Large datasets

Does it mean a new dataset will need to implement `traintensor` in addition to `traindata`? > For Food-101 I don't think it matters because the data doesn't seem to be...

Large datasets

> What do you think about the idea that after download we repack the data into an HDF5 file? HDF5 supports compression as well as reading individual "datasets" (in our...

DataFrames.jl to Spark dataframe

DataFrames.jl is definitely the way to go, but the integration isn't done yet. In the simplest case, you can convert rows of `DataFrames.DataFrame` to `Spark.Row`s and use `Spark.createDataFrame(...)` to convert...

No SparkContext defined

Can you point to the page with this example? `SparkContext` has been removed, and I can't find any mentions of it in the docs.

No SparkContext defined

For whatever reason, JuliaHub doesn't update the project's README and still points to the old documentation. I tried to fix it by re-triggering TagBot, but...

No SparkContext defined

> Given the changes that I can see now in the docs, it looks like the SparkContext was taken out of the project. Is that correct? Yes.

Use Apache Arrow for interprocess communication

Yes! From my observations, interprocess communication is the main performance killer for the RDD API, so switching to Arrow should be the most important improvement in a while. Although, I didn't...