Romain Beaumont

2294 comments by Romain Beaumont

tried this:

```python
spark = (
    SparkSession.builder.config("spark.driver.memory", "16G")
    .config("spark.task.resource.gpu.amount", "1")
    .config("spark.executor.resource.gpu.amount", "2")
    .config("spark.worker.resource.gpu.amount", "2")
    .config("spark.driver.resource.gpu.amount", "2")
    .config("spark.driver.resourcesFile", "/home/ubuntu/gpufile")
    .config("spark.executor.resourcesFile", "/home/ubuntu/gpufile")
    .config("spark.worker.resourcesFile", "/home/ubuntu/gpufile")
    .config("spark.executor.resource.gpu.discoveryScript", "/home/ubuntu/clip-retrieval/getGpusResources.sh")
    .config("spark.worker.resource.gpu.discoveryScript", "/home/ubuntu/clip-retrieval/getGpusResources.sh")
    .config("spark.driver.resource.gpu.discoveryScript", "/home/ubuntu/clip-retrieval/getGpusResources.sh")
    .master("local[" +...
```

Related: https://github.com/rom1504/img2dataset/issues/56. I'm thinking of implementing the download+resize inside img2dataset since these features are already there. I think a good way to pass it to PyTorch would be to add...

The filtering / retrieving-from-an-index part would, however, make more sense living here, so clip-retrieval could depend on img2dataset and use its UrlStreamingDataset to provide a FilteredUrlStreamingDataset...
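A minimal sketch of that split, where `UrlStreamingDataset` and `FilteredUrlStreamingDataset` are the names proposed in the comment (not an existing img2dataset API), and plain Python iterators stand in for PyTorch datasets:

```python
# Hypothetical sketch: class names follow the comment's proposal, not a real API.
class UrlStreamingDataset:
    """Streams (url, metadata) pairs; img2dataset would own download+resize."""

    def __init__(self, urls):
        self.urls = urls

    def __iter__(self):
        for url, meta in self.urls:
            yield url, meta


class FilteredUrlStreamingDataset:
    """Wraps a UrlStreamingDataset, keeping only entries passing a predicate,
    e.g. a clip-retrieval index lookup done on the clip-retrieval side."""

    def __init__(self, dataset, predicate):
        self.dataset = dataset
        self.predicate = predicate

    def __iter__(self):
        for url, meta in self.dataset:
            if self.predicate(url, meta):
                yield url, meta
```

The point of the wrapper is that clip-retrieval only needs the streaming interface, so the filtering logic stays decoupled from the download machinery.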

https://github.com/rom1504/img2dataset/issues/82 It could be interesting to investigate this path: 1. img2dataset is a (multi-instance-per-machine) REST service that takes as input a path to a URL shard, and return...

New idea: rethink all these tools as dataflow/stream transformers, taking as input a collection and producing an output collection, with optional caching and back-pressure. Reader: url/meta in parquet,...
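A minimal sketch of that dataflow idea using plain Python generators (stage names and wiring are illustrative assumptions, not an existing API):

```python
def reader(rows):
    # Reader stage: yields records one by one
    # (e.g. url/meta rows loaded from a parquet file).
    for row in rows:
        yield row


def transformer(records, fn):
    # Transformer stage: applies fn lazily, so back-pressure comes for free --
    # nothing is computed until the downstream consumer pulls an item.
    for record in records:
        yield fn(record)


def writer(records, sink):
    # Writer stage: drains the pipeline into an output collection.
    for record in records:
        sink.append(record)
    return sink


# Wiring: reader -> transformer -> writer, each stage a lazy generator,
# so arbitrarily large collections stream through without buffering.
```

Caching would slot in as another pass-through stage that tees records to disk.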

related https://github.com/webdataset/webdataset/blob/main/notebooks/openimages.ipynb

Let's first try and check how to read a large file in parallel with fsspec.

Reading a large file with fsspec works by seeking to an offset and reading up to a given length; it's much faster.
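The pattern is: split the file into byte ranges, then seek to each offset and read only that length. fsspec file objects expose the same seek/read interface as the stdlib sketch below, which uses a local file and a thread pool (both illustrative assumptions):

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_range(path, offset, length):
    # Open, seek to the byte offset, and read at most `length` bytes.
    # An fsspec file object (fs.open(path, "rb")) supports the same calls.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def parallel_read(path, num_chunks=4):
    # Split the file into num_chunks byte ranges and fetch them concurrently.
    size = os.path.getsize(path)
    chunk = -(-size // num_chunks)  # ceiling division
    ranges = [(i * chunk, chunk) for i in range(num_chunks)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        parts = pool.map(lambda r: read_range(path, *r), ranges)
    return b"".join(parts)
```

Against a remote object store the win is larger, since each range becomes an independent HTTP range request instead of one serial stream.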

Next step will be implementing a clean embedding-reader package.

Independently, I think that https://towardsdatascience.com/data-pipelines-with-apache-beam-86cd8eb55fd8 looks good.