
Improve search functionality

Open AndreaFrancis opened this issue 11 months ago • 13 comments

Given that we have already implemented the search feature on the API and hub, let's discuss here what improvements can be made. Some suggestions:

  • [ ] Investigate whether it is really used (maybe the idea needs a redesign)
  • [ ] Investigate execution time

AndreaFrancis avatar Mar 14 '24 14:03 AndreaFrancis

Maybe we can improve execution time once more (hf_transfer + more CPU), do some comms, and check the usage evolution? I'll also ask viewer users at HF about it.

lhoestq avatar Mar 14 '24 14:03 lhoestq

I also think it would be beneficial to improve execution time, but I don't have any ideas on how to do this... Still, I think it's worth working on before adding new improvements like support for other languages.

polinaeterna avatar Mar 14 '24 14:03 polinaeterna

I did a high-level but detailed investigation of our FTS feature (internal document). Summary:

Is the query not performant enough, and can it be improved? Using a new approach with a stage table reduced memory usage by 2.44% and execution time by 7.9%. It might not be very impactful for a 9.1 GB dataset index file but could help for big datasets, so yes, we can use this approach (see the sketch after the quotes below). I also tried adding a primary key on the __hf_index_id field to speed up the join with the stage table, but looking at the query plan, it always gets a full table scan. I found the reason: using a primary key won't help, as mentioned on the official page https://duckdb.org/docs/guides/performance/schema#microbenchmark-the-effect-of-primary-keys:

“In this case, primary keys will only have a (small) positive effect on highly selective queries such as when filtering on a single identifier. They do not have an effect on join and aggregation operators.”

“For best bulk load performance, avoid defining primary key constraints if possible.”
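For illustration, here is a minimal sketch of what such a staged query could look like (the `data` table name, the BIGINT type of `__hf_index_id`, the `fts_main_data` schema, and the read-only temp-table usage are assumptions for the sketch, not the production code):

```python
# Hedged sketch of the "stage table" approach, not the exact production query.
# Assumes the index file has a `data` table with a `__hf_index_id` column and
# an FTS index built with DuckDB's `fts` extension, which exposes the
# `fts_main_data.match_bm25` macro.
import duckdb

def search_with_stage_table(index_path: str, query: str, limit: int = 100) -> list:
    # read_only=True is an assumption; if temp tables are not allowed on a
    # read-only connection in your DuckDB version, drop it or use a CTE instead.
    con = duckdb.connect(index_path, read_only=True)
    try:
        # Stage 1: materialize only the matching row ids and their BM25 scores.
        con.execute(
            "CREATE TEMP TABLE __hf_fts_stage (__hf_index_id BIGINT, __hf_fts_score DOUBLE)"
        )
        con.execute(
            """
            INSERT INTO __hf_fts_stage
            SELECT * FROM (
                SELECT __hf_index_id,
                       fts_main_data.match_bm25(__hf_index_id, ?) AS __hf_fts_score
                FROM data
            ) WHERE __hf_fts_score IS NOT NULL
            """,
            [query],
        )
        # Stage 2: join the small stage table back to `data` to fetch the rows.
        return con.execute(
            f"""
            SELECT d.*
            FROM data AS d
            JOIN __hf_fts_stage AS s USING (__hf_index_id)
            ORDER BY s.__hf_fts_score DESC
            LIMIT {int(limit)}
            """
        ).fetchall()
    finally:
        con.close()
```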

Are the resources enough, and can they be improved?

Does the current machine need more memory? I don't think so: running on isolated m6a.4xlarge and c4.8xlarge instances with 64 GiB of RAM wasn't very different from the 30 GiB limit we have in the search pod.

Does the current machine need more CPU? Running the experiments on a machine with more CPUs did not help, so I would say no.

Does the current machine's storage have an impact? Yes: the current machine with shared storage took much more time (3 to 4 times more) than running the same queries on the same AWS instance type with local storage. See also https://duckdb.org/docs/guides/performance/environment#disk:

“In general, network-based storage will result in slower DuckDB workloads than using local disks. This includes network disks such as NFS, network drives such as SMB and Samba, and network-backed cloud disks such as AWS EBS. However, different network disks can have vastly varying IO performance, ranging from very slow to almost as fast as local. Therefore, for optimal performance, only use network disks that can provide high IO performance.”

At this point, I think we could ask the infra team whether it is possible to have a different storage type with better I/O capacity, or try another architecture so that we can avoid shared storage. Another suggestion is to keep the connection open: after running each query, the query plan is cached in memory, which helps speed up subsequent results (it is just an idea). From the official DuckDB page:

“DuckDB will perform best when reusing the same database connection many times. Disconnecting and reconnecting on every query will incur some overhead, which can reduce performance when running many small queries. DuckDB also caches some data and metadata in memory, and that cache is lost when the last open connection is closed. Frequently, a single connection will work best, but a connection pool may also be used.”
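For illustration, a minimal sketch of this connection-reuse idea (the path is a placeholder; a real service would also need to handle several index files and concurrency):

```python
# Minimal sketch of "keep the connection open": connect once per process and
# reuse the connection, instead of reconnecting on every request, so DuckDB
# can keep its cached data and metadata.
import duckdb

INDEX_PATH = "/storage/duckdb-index/index.duckdb"  # placeholder path

# Opened once at process startup and kept open for the lifetime of the worker.
_SHARED_CON = duckdb.connect(INDEX_PATH, read_only=True)

def run_search(sql: str, params: list) -> list:
    # Use a cursor (duplicate connection) per request so concurrent requests
    # don't share statement state.
    return _SHARED_CON.cursor().execute(sql, params).fetchall()
```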

Finally, the most important piece for speeding up the FTS feature is to improve the way DuckDB connects to the shared EFS. I'm not sure how to achieve this in terms of infra; if it is not possible, I would say it is time to start thinking about another architecture.

WDYT @huggingface/datasets-server ?

AndreaFrancis avatar Apr 04 '24 23:04 AndreaFrancis

Good analysis, it gives us ideas for improvements to try. If I understood correctly:

  • don't close the connection after every request
  • change the type of network storage (or use local storage? we would lose the ability to share the "downloaded" index files though)

severo avatar Apr 05 '24 10:04 severo

don't close the connection after every request

Maybe we should have a pool of active connections. This would need more memory in all pods, and we could close idle connections from time to time. It is probably possible, but as I said, we would need more memory. How much? I don't know. If we consider having 10 datasets open at the same time, each one 10 GB, we should provision about 100 GB per pod and apply mechanisms to avoid OOM.
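A sketch of what such a pool could look like, with idle-connection eviction and a cap on the number of open indexes (the sizes, TTLs, and eviction policy are arbitrary placeholders, and a real implementation would need locking for thread safety):

```python
# Hedged sketch of a small DuckDB connection cache with idle eviction.
import time
import duckdb

MAX_OPEN = 10            # e.g. at most ~10 index files open at once
IDLE_TTL_SECONDS = 600   # close connections unused for 10 minutes

# index file path -> (open connection, last-used timestamp)
_pool: dict[str, tuple[duckdb.DuckDBPyConnection, float]] = {}

def _evict_idle() -> None:
    # Close connections that haven't been used recently to free memory.
    now = time.monotonic()
    for path in [p for p, (_, last) in _pool.items() if now - last > IDLE_TTL_SECONDS]:
        _pool.pop(path)[0].close()

def get_connection(index_path: str) -> duckdb.DuckDBPyConnection:
    _evict_idle()
    if index_path in _pool:
        con, _ = _pool[index_path]
    else:
        if len(_pool) >= MAX_OPEN:
            # Evict the least recently used connection to bound memory usage.
            oldest = min(_pool, key=lambda p: _pool[p][1])
            _pool.pop(oldest)[0].close()
        con = duckdb.connect(index_path, read_only=True)
    _pool[index_path] = (con, time.monotonic())
    return con
```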

change the type of network storage

I'm not sure how difficult this is to address on the infra side. Maybe increase the network bandwidth? Change EFS from General Purpose to Max I/O mode? I would really go for this approach since it looks like the least invasive option for our current architecture, but I'm not sure how much it would affect costs. Can you advise @rtrompier @XciD ?

(or use local storage? we would lose the ability to share the "downloaded" index files though)

Maybe copy the index file from shared to local storage and run duckdb on the local copy (it might be faster than downloading it each time, but it will still take time to copy). This would also imply doing some cleanup of the local storage. By the way, what is the size of the pod storage?
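A rough sketch of this copy-then-query approach with a naive cleanup (paths and the retention period are placeholders, not actual deployment values):

```python
# Sketch: copy the duckdb index from shared storage to a fast local disk,
# query the local copy, and periodically remove stale local copies.
import os
import shutil
import time

LOCAL_CACHE_DIR = "/tmp/duckdb-index-cache"   # placeholder: local fast disk
MAX_AGE_SECONDS = 6 * 3600                    # placeholder: drop copies after 6h

def get_local_copy(shared_path: str) -> str:
    os.makedirs(LOCAL_CACHE_DIR, exist_ok=True)
    local_path = os.path.join(LOCAL_CACHE_DIR, os.path.basename(shared_path))
    # Copy only if there is no local copy yet, or the shared file is newer.
    if not os.path.exists(local_path) or os.path.getmtime(local_path) < os.path.getmtime(shared_path):
        shutil.copy2(shared_path, local_path)
    return local_path

def cleanup_local_cache() -> None:
    # Remove local copies that haven't been accessed recently to free space.
    if not os.path.isdir(LOCAL_CACHE_DIR):
        return
    now = time.time()
    for name in os.listdir(LOCAL_CACHE_DIR):
        path = os.path.join(LOCAL_CACHE_DIR, name)
        if now - os.path.getatime(path) > MAX_AGE_SECONDS:
            os.remove(path)

# Usage: duckdb.connect(get_local_copy("/storage/duckdb-index/<dataset>/index.duckdb"), read_only=True)
```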

AndreaFrancis avatar Apr 05 '24 14:04 AndreaFrancis

By the way, what is the size of the pod storage?

Nearly nothing, but we can do the same as for the workers (NVMe)

severo avatar Apr 05 '24 15:04 severo

Just did a quick test from one of the current prod machines on this C4 index.duckdb file (8GB)

Download from HF using hf_transfer to local disk: ~40sec

Copy from shared disk to local disk: ~1min

So maybe we should use fast local disks only, and cache the files locally so we don't have to redownload them all the time?
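A small sketch of that download-and-cache path with hf_transfer enabled (the repo id, file path, and cache directory are placeholders; the hf_transfer package must be installed):

```python
# Sketch: download the duckdb index file to a fast local disk with hf_transfer,
# relying on huggingface_hub's cache so the file isn't re-downloaded while the
# local copy is still there.
import os

# Must be set before importing huggingface_hub for hf_transfer to be used.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import hf_hub_download

index_path = hf_hub_download(
    repo_id="some-user/some-dataset",        # placeholder dataset repo
    filename="default/train/index.duckdb",   # placeholder path inside the repo
    repo_type="dataset",
    cache_dir="/local-nvme/hf-cache",        # placeholder: fast local disk
)
print(index_path)  # repeated calls reuse the cached local file
```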

lhoestq avatar Apr 05 '24 15:04 lhoestq

I am not an infra expert, but I was thinking of changing our current instance to something like this:

Our current capacity

  • Instance type m6a.4xlarge: 16 vCPU - 64 GiB
  • 7 replicas with shared storage
  • 2 Uvicorn workers per pod
  • Capacity per process: 8 vCPU (of 16), 32 GiB

Proposed

  • Instance type m6a.16xlarge: 64 vCPU - 256 GiB
  • Only 2 replicas, each with a 1 TB NVMe (considering reducing the cache files' TTL from 3 days to 6 hours)
  • 7 Uvicorn workers per pod
  • Capacity per process: ~9 vCPU (of 64), ~36 GiB

Tasks for migration:

  • [infra] Create the NVMe and attach it to the search pods
  • [infra] Change the instance type from m6a.4xlarge to m6a.16xlarge
  • Change the values.duckDBIndex.cacheDirectory variable to point to /tmp instead of the shared instance
  • Change the chart to have 7 uvicorn workers, 2 replicas, and allocate up to 36 GiB of memory? (I don't understand if this is correct. Why can requests/limits be different from what is available rather than the same?)
  • Delete all references to the "/storage/duckdb-index" shared instance
  • Configure duckdb to not use all the threads / evaluate what is optimal; maybe set a maximum of 16 threads to avoid using all 64 in each uvicorn worker (see the sketch below)
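For the last task, a minimal sketch of capping DuckDB's resources per connection (the values are illustrative, not a validated configuration):

```python
# Sketch: limit threads and memory per DuckDB connection so one uvicorn worker
# doesn't grab all 64 vCPUs or all the pod memory.
import duckdb

con = duckdb.connect(
    "/tmp/duckdb-index/index.duckdb",   # placeholder: NVMe-backed local path
    read_only=True,
    config={
        "threads": 16,                  # cap per-connection parallelism
        "memory_limit": "36GB",         # roughly the per-process budget above
    },
)
# The same settings can also be applied on an existing connection:
# con.execute("SET threads TO 16")
# con.execute("SET memory_limit = '36GB'")
```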

AndreaFrancis avatar Apr 08 '24 12:04 AndreaFrancis

Sounds good, this would provide a local fast disk and fix the /search speed issues. Let's check with someone from infra to validate and check the prices

lhoestq avatar Apr 08 '24 13:04 lhoestq

For some reason, running index files bigger than 10 GB in the current search pod with the EBS local storage is too slow (even slower than the shared disk). I tried running the same queries on the heavy-worker (r6id.4xlarge) machine, and it is also faster there. Maybe an r6id.16xlarge instance would be better?

AndreaFrancis avatar Apr 10 '24 17:04 AndreaFrancis

Benchmark results from running in different machines/storage:

| machine | AWS instance type | storage | time (seconds) | dataset index file | size |
| --- | --- | --- | --- | --- | --- |
| search pod | m6a.4xlarge | shared | 17 | wikimedia/wikipedia-20231101.en-train | 9.1 GB |
| search pod | m6a.4xlarge | local | 3 | wikimedia/wikipedia-20231101.en-train | 9.1 GB |
| search pod | m6a.4xlarge | shared | 4 | MohamedRashad/Arabic-CivitAi-Images - default - train | 580 MB |
| search pod | m6a.4xlarge | local | 1.5 | MohamedRashad/Arabic-CivitAi-Images - default - train | 580 MB |
| search pod | m6a.4xlarge | shared | 42 | jp1924/VIsualQuestionAnswering - default - train | 23 GB |
| worker pod | r6id.4xlarge | local | 11.38 | jp1924/VIsualQuestionAnswering - default - train | 23 GB |
| search pod | m6a.4xlarge | local | 108 | jp1924/VIsualQuestionAnswering - default - train | 23 GB |
| search pod | m6a.4xlarge | local | 2 | asoria/search_test - default - train | 27 MB |
| search pod | m6a.4xlarge | shared | 0.58 | asoria/search_test - default - train | 27 MB |
| search pod | m6a.4xlarge | shared | 44 | ad6398/Deepmind-CodeContest-Unrolled-default-train | 20 GB |
| worker pod | r6id.4xlarge | local | 9.3 | ad6398/Deepmind-CodeContest-Unrolled-default-train | 20 GB |
| search pod | m6a.4xlarge | local | 80 | ad6398/Deepmind-CodeContest-Unrolled-default-train | 20 GB |

AndreaFrancis avatar Apr 10 '24 18:04 AndreaFrancis

Maybe the fact that search pods have EBS while worker pods have NVMe explains the difference (I don't understand much, but I believe they're not the same, right @rtrompier?)


confirmed today: EBS uses the network, NVMe is local to the machine

severo avatar Apr 10 '24 19:04 severo

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 05 '24 15:05 github-actions[bot]

Fixed, right @AndreaFrancis? Feel free to close

severo avatar Jul 30 '24 16:07 severo