Romain Beaumont
basically done now, it's possible to send an arbitrary embedding to the backend
could be fun to figure out how to do it in the front end as well
The error is aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host github.com:443 ssl:default [Connection reset by peer]. Is GitHub available where you're running this? On Fri, May 27, 2022, 10:42 takusaitoh ***@***.***>...
Do you still have that problem? All this function is doing is downloading these small files.
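One quick way to check whether the host is reachable from that machine is a plain TCP connection test. This is only a minimal sketch: `can_connect` is a hypothetical helper, not part of clip-retrieval or aiohttp, and it only opens a TCP connection, so a reset that happens later during the TLS handshake would not show up here.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and tries each address in turn
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers DNS failures, refused or reset connections, and timeouts
        return False
```

Calling `can_connect("github.com")` on the machine that shows the error should tell you whether the problem is network access rather than the download function itself.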
yeah indeed, see https://github.com/rom1504/clip-retrieval/issues/2. The current implementation of clip-retrieval filter does not scale to large datasets. There are several ways to implement it, with different trade-offs that may make...
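To give an idea of one possible direction (not how clip-retrieval filter is implemented, just a sketch under assumptions): stream over (key, embedding) pairs and keep the keys whose cosine similarity to a query embedding passes a threshold, so memory use stays constant whatever the dataset size. The names `filter_by_similarity`, `cosine` and the `threshold` parameter are hypothetical.

```python
import math
from typing import Iterable, Iterator, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    items: Iterable[Tuple[str, Sequence[float]]],
    query: Sequence[float],
    threshold: float,
) -> Iterator[str]:
    """Yield the keys of items whose embedding is close enough to the query."""
    for key, embedding in items:
        # one item at a time: nothing is accumulated in memory
        if cosine(embedding, query) >= threshold:
            yield key
```

The trade-off is that every embedding still gets read once, so this is linear in the dataset size. An index-based approach would avoid the full scan at the cost of building and storing the index.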
yeah I definitely intend to make this happen. I won't have time for now, but I'll probably get it done in a week or so
Interesting! Filtering the tars is going to be pretty slow, but that's indeed one possibility. I've been working on #31 recently, I will probably include filtering options in that too
For datasets like laion (very large), it doesn't matter much if we lose some % of the data, that's why redownloading subsets seems to make sense...
https://github.com/rom1504/clip-retrieval/blob/main/notebook/simple_filter.ipynb added a v0 going towards this. This is just a POC, I will release a better version later on, but it works already
still intend to make batch metadata retrieval much faster so this is more viable