Andrea Francis Soria Jimenez

Results 30 comments of Andrea Francis Soria Jimenez

Looks like it is an issue with duckdb con itself; it also happened for filter: ![image](https://github.com/huggingface/dataset-viewer/assets/5564745/f2509160-43d9-40de-bd0d-b9ae1ecdd221)

> storing the indexes and additional columns into the .duckdb file, I think that could be possible if we create the data table and then the index and finally remove...

I tried the Arabic and Russian stemmers as described in the DuckDB [doc](https://duckdb.org/docs/extensions/full_text_search.html), but I wasn't able to perform a simple query using FTS. I posted an issue here: https://github.com/duckdb/duckdb/issues/10254

https://github.com/duckdb/duckdb/issues/10254 has been fixed, but I think we will need to solve https://github.com/huggingface/datasets-server/issues/1914 and find a way to not break search when updating the DuckDB version.

The https://pypi.org/project/duckdb/0.9.3.dev2934/ pre-release looks to have fixed FTS for non-ASCII characters. Is this a version we can currently use, or should we wait for an official release?

Do we still need to work on this? I have seen that the porter stemmer works for other languages like [Arabic](https://huggingface.co/datasets/s3h/gec-arabic/viewer/default/train?q=+%D8%A8%D9%85%D9%86%D8%B7%D9%82%D8%A9+%D8%A7%D9%84%D8%B3%D9%88%D8%A7%D8%AF%D9%8A+%D9%88%D8%A8%D8%AF%D8%B9%D9%85+.&row=5718) and [Russian](https://huggingface.co/datasets/sberquad/viewer/sberquad/train?q=%D0%BF%D1%80%D0%BE%D1%82%D0%B5%D1%80%D0%BE%D0%B7%D0%BE%D0%B9%D1%81%D0%BA%D0%B8%D1%85).

> expose assets and cached-assets? Should not be served by API anymore, right?

Yes, it should be served by S3+CloudFront, if I am not wrong.

For image/audio, maybe compute statistics on their metadata? For example, size and dimensions (image) -> numerical statistics
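To make the idea concrete, here is a minimal sketch of treating image metadata as plain numerical columns. Everything in it is an assumption for illustration: the `ImageMeta` record, the `numerical_stats` and `image_metadata_stats` helpers, and the sample values are hypothetical and not part of the dataset viewer code.

```python
import statistics
from dataclasses import dataclass


@dataclass
class ImageMeta:
    """Hypothetical per-image metadata record (not an existing viewer type)."""
    width: int
    height: int
    size_bytes: int


def numerical_stats(values):
    """Summary statistics for one numerical column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        # Sample standard deviation needs at least two values.
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def image_metadata_stats(images):
    """Compute numerical statistics for each metadata field of the images."""
    return {
        "width": numerical_stats([im.width for im in images]),
        "height": numerical_stats([im.height for im in images]),
        "size_bytes": numerical_stats([im.size_bytes for im in images]),
    }


# Made-up sample values, for illustration only.
sample = [
    ImageMeta(640, 480, 51_200),
    ImageMeta(800, 600, 76_800),
    ImageMeta(1024, 768, 131_072),
]
stats = image_metadata_stats(sample)
print(stats["width"])
```

The same shape would presumably work for audio by swapping the fields for duration or sample rate.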

Hi @samansmink, I loved the native HF implementation and am preparing some documents to share on the Dataset Viewer page: https://huggingface.co/docs/datasets-server/duckdb (still in progress). I was trying the credential_chain provider,...

I did a high-level but detailed investigation of our FTS feature ([internal document](https://docs.google.com/document/d/1-6sbntvpoitg2Cn_aXUbzFun4pfvZSBUeR4pACB33hw/edit?usp=sharing)). Summary: **Is the query not performant enough, and can it be improved?** Using a new approach with...