Sylvain Lesage

Results 175 issues of Sylvain Lesage

See https://huggingface.slack.com/archives/C039P47V1L5/p1713172703779839 > Am I correct in assuming that if you specify a "config" in a dataset, only the given config is downloaded, but if you specify a split, all...

question
P2

It was useful to do management of the NFS, then of the EFS. But now, it only accesses the parquet metadata EFS. Not sure we need it anymore, we can...

good first issue
blocked-by-upstream
infra
P2

Reported here: https://huggingface.co/datasets/Asap7772/persona_gpt4_paired_margin1_allsplit/discussions/1#66187c367d8c2cd6fbce4a19 Error is: ``` {"error": "Could not read the parquet files: Some index in row_group_indices is 0, which is either < 0 or >= num_row_groups(0)"} ``` I think...

bug
P1

It would remove a single point of failure. Also: we could have one healthcheck per service (API, admin). Currently the `ELB-HealthChecker/2.0` healthcheck is responded by the API (/healthcheck), and it...

infra
P2

We need to change: - here - in moon-landing - in datasets? - in blog posts, observable, notion, google colabs?...

documentation
P1

Now that we call the project "Dataset viewer", we should make it clear in the docs and READMEs that: - that the project powers the Hub's dataset viewer - people...

documentation
P1

Today, it's 8. Let's try increasing it and see if it speeds up the backfill job. The current throughput is 577 datasets/minute.

question
P2
prod

From https://huggingface.slack.com/archives/C04HZ32QV17/p1711698013265029 > Is there a time-out for webhooks? I take about 1 min to respond and I'm getting 500. > yes 30 seconds > you should respond immediately and...

improvement / optimization
P2

See the `schema` column on https://huggingface.co/datasets/motherduckdb/duckdb-text2sql-25k. Clicking on any of the 'classes' leads to an error The erroneous URL is: https://datasets-server.huggingface.co/filter?dataset=motherduckdb%2Fduckdb-text2sql-25k&config=default&split=train&offset=0&length=100&where=schema%3D%27CREATE+TABLE+%22venue%22+%28%0A++%22venueId%22+INTEGER+NOT+NULL%2C%0A++%22venueName%22+VARCHAR%28100%29%2C%0A++%22venueInfo%22+JSON%2C%0A++PRIMARY+KEY+%28%22venueId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22author%22+%28%0A++%22authorId%22+INTEGER+NOT+NULL%2C%0A++%22authorName%22+VARCHAR%2850%29%2C%0A++%22authorPublications%22+INT%5B%5D%2C%0A++PRIMARY+KEY+%28%22authorId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22dataset%22+%28%0A++%22datasetId%22+INTEGER+NOT+NULL%2C%0A++%22datasetName%22+VARCHAR%2850%29%2C%0A++%22datasetInfo%22+STRUCT%28v+VARCHAR%2C+i+INTEGER%29%2C%0A++PRIMARY+KEY+%28%22datasetId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22journal%22+%28%0A++%22journalId%22+INTEGER+NOT+NULL%2C%0A++%22journalName%22+VARCHAR%28100%29%2C%0A++%22journalInfo%22+MAP%28INT%2C+DOUBLE%29%2C%0A++PRIMARY+KEY+%28%22journalId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22keyphrase%22+%28%0A++%22keyphraseId%22+INTEGER+NOT+NULL%2C%0A++%22keyphraseName%22+VARCHAR%2850%29%2C%0A++%22keyphraseInfo%22+VARCHAR%2850%29%5B%5D%2C%0A++PRIMARY+KEY+%28%22keyphraseId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22paper%22+%28%0A++%22paperId%22+INTEGER+NOT+NULL%2C%0A++%22title%22+VARCHAR%28300%29%2C%0A++%22venueId%22+INTEGER%2C%0A++%22year%22+INTEGER%2C%0A++%22numCiting%22+INTEGER%2C%0A++%22numCitedBy%22+INTEGER%2C%0A++%22journalId%22+INTEGER%2C%0A++%22paperInfo%22+UNION%28num+INT%2C+str+VARCHAR%29%2C%0A++PRIMARY+KEY+%28%22paperId%22%29%2C%0A++FOREIGN+KEY%28%22journalId%22%29+REFERENCES+%22journal%22%28%22journalId%22%29%2C%0A++FOREIGN+KEY%28%22venueId%22%29+REFERENCES+%22venue%22%28%22venueId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22cite%22+%28%0A++%22citingPaperId%22+INTEGER+NOT+NULL%2C%0A++%22citedPaperId%22+INTEGER+NOT+NULL%2C%0A++%22citeInfo%22+INT%5B%5D%2C%0A++PRIMARY+KEY+%28%22citingPaperId%22%2C%22citedPaperId%22%29%2C%0A++FOREIGN+KEY%28%22citedpaperId%22%29+REFERENCES+%22paper%22%28%22paperId%22%29%2C%0A++FOREIGN+KEY%28%22citingpaperId%22%29+REFERENCES+%22paper%22%28%22paperId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22paperDataset%22+%28%0A++%22paperId%22+INTEGER%2C%0A++%22datasetId%22+INTEGER%2C%0A++%22paperDatasetInfo%22+JSON%2C%0A++PRIMARY+KEY+%28%22datasetId%22%2C+%22paperId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22paperKeyphrase%22+%28%0A++%22paperId%22+INTEGER%2C%0A++%22keyphraseId%22+INTEGER%2C%0A++%22paperKeyphraseInfo%22+JSON%2C%0A++PRIMARY+KEY+%28%22keyphraseId%22%2C%22paperId%22%29%2C%0A++FOREIGN+KEY%28%22paperId%22%29+REFERENCES+%22paper%22%28%22paperId%22%29%2C%0A++FOREIGN+KEY%28%22keyphraseId%22%29+REFERENCES+%22keyphrase%22%28%22keyphraseId%22%29%0A%29%3B%0A%0ACREATE+TABLE+%22writes%22+%28%0A++%22paperId%22+INTEGER%2C%0A++%22authorId%22+INTEGER%2C%0A++%22writesInfo%22+JSON%2C%0A++PRIMARY+KEY+%28%22paperId%22%2C%22authorId%22%29%2C%0A++FOREIGN+KEY%28%22paperId%22%29+REFERENCES+%22paper%22%28%22paperId%22%29%2C%0A++FOREIGN+KEY%28%22authorId%22%29+REFERENCES+%22author%22%28%22authorId%22%29%0A%29%3B%27 ```json {"error":"Parameter 'where' contains invalid symbols"} ``` It's because...

question
api
P2

Requires https://github.com/huggingface/datasets/issues/6438, to support GeoParquet. We could support more formats. Possibly requires geopandas as a dependency.

blocked-by-upstream
P2
modality