zed icon indicating copy to clipboard operation
zed copied to clipboard

Searching multiple pools with "from *" is much slower

Open SoftTools59654 opened this issue 1 year ago • 4 comments

When searching a text in a data about 100 GB and 80 GB When using the from* command to read both files, the speed of reading data from the m2 hard drive is about 40MB. But when the same text is read without the from command, the speed of reading the file reaches about 2 GB. While the CPU is fully engaged, the file loading speed from the hard drive is very low There is a lot of time difference in using the from command and searching one by one in the file. I tested this case several times and this case was there.

SoftTools59654 avatar Mar 11 '24 18:03 SoftTools59654

Repro is with Zed commit 38763f8.

While I don't have the benefit of the user's specific data to test with, I've been able to reproduce a similar effect locally on my Macbook using the data from s3://brim-sampledata/wrccdc/zng/wrccdc.zng.

I first created separate pools to each hold a subset of the wrccdc data.

$ zed -version
Version: v1.14.0-16-g38763f82

$ zed create wrccdc1
pool created: wrccdc1 2det49I4UKZauguZUCoKy3PyKKl

$ zq '_path=="conn"' wrccdc.zng | zed load -use wrccdc1 -
(1/1) 20.82GB 0B/s
2deuG2mkjlka5VL3kOThO7wHfO2 committed

$ zed create wrccdc2
pool created: wrccdc2 2deuL3vhFQikxnty2CdIZsIFUkr

$ zq '_path!="conn"' wrccdc.zng | zed load -use wrccdc2 -
(1/1) 2.98GB 0B/s
2deuWwPS3FaaS5qHhZ5RGjFs6x5 committed

FWIW the wrccdc1 pool is 520132400 records and 20 GB in size and the wrccdc2 pool is 52937660 and 3.0 GB in size.

Doing a "needle in a haystack" search for a value that I know is present in both pools, we get fast search performance on each individually.

$ time zed query 'from wrccdc1 | CuYcyl29a2g8pYV1s' 
{_path:"conn",ts:2017-03-24T19:59:23.231562Z,uid:"CuYcyl29a2g8pYV1s",id:{orig_h:10.10.7.65,orig_p:64135(port=uint16),resp_h:184.72.208.63,resp_p:80(port)},proto:"tcp"(=zenum),service:"http",duration:182.422ms,orig_bytes:438(uint64),resp_bytes:588(uint64),conn_state:"SF",local_orig:null(bool),local_resp:null(bool),missed_bytes:0(uint64),history:"ShADadfF",orig_pkts:5(uint64),orig_ip_bytes:650(uint64),resp_pkts:5(uint64),resp_ip_bytes:800(uint64),tunnel_parents:null(|[string]|)}

real	0m13.831s
user	0m0.037s
sys	0m0.027s

$ time zed query 'from wrccdc2 | CuYcyl29a2g8pYV1s' 
{_path:"files",ts:2017-03-24T19:59:23.399396Z,fuid:"F4NAir1FuTH6Oax2J4",tx_hosts:|[184.72.208.63]|,rx_hosts:|[10.10.7.65]|,conn_uids:|["CuYcyl29a2g8pYV1s"]|,source:"HTTP",depth:0(uint64),analyzers:|["MD5","SHA1","SHA256"]|,mime_type:"text/json",filename:null(string),duration:0s,local_orig:null(bool),is_orig:false,seen_bytes:35(uint64),total_bytes:null(uint64),missing_bytes:0(uint64),overflow_bytes:0(uint64),timedout:false,parent_fuid:null(string),md5:"dd82bbeb20a441bf55e62d42255de7f4",sha1:"90b01d6f0c6891651e8a729892e92592dd581762",sha256:"da1791b1b8361af6383790c6e77ef6eb149b22ed7e2d59b61ac2deb7884edf4d",extracted:null(string),extracted_cutoff:null(bool),extracted_size:null(uint64)}
{_path:"http",ts:2017-03-24T19:59:23.306076Z,uid:"CuYcyl29a2g8pYV1s",id:{orig_h:10.10.7.65,orig_p:64135(port=uint16),resp_h:184.72.208.63,resp_p:80(port)},trans_depth:1(uint64),method:"GET",host:"lc68.dsr.livefyre.com",uri:"/livecountping/98236508/663740012478?__=bye2msvgpn91&routed=1&jid&siteId=366439&networkId=livefyre.com",referrer:"http://www.asigra.com/blog/what-multi-tenancy-how-secure-it",version:"1.1",user_agent:"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",request_body_len:0(uint64),response_body_len:35(uint64),status_code:200(uint64),status_msg:"OK",info_code:null(uint64),info_msg:null(string),tags:|[]|(|[zenum=string]|),username:null(string),password:null(string),proxied:null(|[string]|),orig_fuids:null([string]),orig_filenames:null([string]),orig_mime_types:null([string]),resp_fuids:["F4NAir1FuTH6Oax2J4"],resp_filenames:null([string]),resp_mime_types:["text/json"]}

real	0m2.133s
user	0m0.033s
sys	0m0.021s

However, doing the search across both pools at once with from * as described by the user does indeed result in a significant slow-down bringing the search time from a combined ~15 seconds to 11+ minutes.

$ time zed query 'from * | CuYcyl29a2g8pYV1s'
{_path:"conn",ts:2017-03-24T19:59:23.231562Z,uid:"CuYcyl29a2g8pYV1s",id:{orig_h:10.10.7.65,orig_p:64135(port=uint16),resp_h:184.72.208.63,resp_p:80(port)},proto:"tcp"(=zenum),service:"http",duration:182.422ms,orig_bytes:438(uint64),resp_bytes:588(uint64),conn_state:"SF",local_orig:null(bool),local_resp:null(bool),missed_bytes:0(uint64),history:"ShADadfF",orig_pkts:5(uint64),orig_ip_bytes:650(uint64),resp_pkts:5(uint64),resp_ip_bytes:800(uint64),tunnel_parents:null(|[string]|)}
{_path:"files",ts:2017-03-24T19:59:23.399396Z,fuid:"F4NAir1FuTH6Oax2J4",tx_hosts:|[184.72.208.63]|,rx_hosts:|[10.10.7.65]|,conn_uids:|["CuYcyl29a2g8pYV1s"]|,source:"HTTP",depth:0(uint64),analyzers:|["MD5","SHA1","SHA256"]|,mime_type:"text/json",filename:null(string),duration:0s,local_orig:null(bool),is_orig:false,seen_bytes:35(uint64),total_bytes:null(uint64),missing_bytes:0(uint64),overflow_bytes:0(uint64),timedout:false,parent_fuid:null(string),md5:"dd82bbeb20a441bf55e62d42255de7f4",sha1:"90b01d6f0c6891651e8a729892e92592dd581762",sha256:"da1791b1b8361af6383790c6e77ef6eb149b22ed7e2d59b61ac2deb7884edf4d",extracted:null(string),extracted_cutoff:null(bool),extracted_size:null(uint64)}
{_path:"http",ts:2017-03-24T19:59:23.306076Z,uid:"CuYcyl29a2g8pYV1s",id:{orig_h:10.10.7.65,orig_p:64135(port=uint16),resp_h:184.72.208.63,resp_p:80(port)},trans_depth:1(uint64),method:"GET",host:"lc68.dsr.livefyre.com",uri:"/livecountping/98236508/663740012478?__=bye2msvgpn91&routed=1&jid&siteId=366439&networkId=livefyre.com",referrer:"http://www.asigra.com/blog/what-multi-tenancy-how-secure-it",version:"1.1",user_agent:"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",request_body_len:0(uint64),response_body_len:35(uint64),status_code:200(uint64),status_msg:"OK",info_code:null(uint64),info_msg:null(string),tags:|[]|(|[zenum=string]|),username:null(string),password:null(string),proxied:null(|[string]|),orig_fuids:null([string]),orig_filenames:null([string]),orig_mime_types:null([string]),resp_fuids:["F4NAir1FuTH6Oax2J4"],resp_filenames:null([string]),resp_mime_types:["text/json"]}

real	11m10.353s
user	0m0.213s
sys	0m0.082s

And if I drop those pools and load the entire data set into one pool, it runs fast whether I target it directly or via from *.

$ zed create wrccdc
pool created: wrccdc 2dew93Xeb0JfVr9l8cZXtcbtVKE

$ zed load -use wrccdc wrccdc.zng 
(1/1) 23.80GB/23.80GB 0B/s 100.00%
2dexIZCPsegXBLF2yGT2ICu3vq0 committed

$ time zed query 'from wrccdc | CuYcyl29a2g8pYV1s'
{_path:"files",ts:2017-03-24T19:59:23.399396Z,fuid:"F4NAir1FuTH6Oax2J4",tx_hosts:|[184.72.208.63]|,rx_hosts:|[10.10.7.65]|,conn_uids:|["CuYcyl29a2g8pYV1s"]|,source:"HTTP",depth:0(uint64),analyzers:|["MD5","SHA1","SHA256"]|,mime_type:"text/json",filename:null(string),duration:0s,local_orig:null(bool),is_orig:false,seen_bytes:35(uint64),total_bytes:null(uint64),missing_bytes:0(uint64),overflow_bytes:0(uint64),timedout:false,parent_fuid:null(string),md5:"dd82bbeb20a441bf55e62d42255de7f4",sha1:"90b01d6f0c6891651e8a729892e92592dd581762",sha256:"da1791b1b8361af6383790c6e77ef6eb149b22ed7e2d59b61ac2deb7884edf4d",extracted:null(string),extracted_cutoff:null(bool),extracted_size:null(uint64)}
{_path:"http",ts:2017-03-24T19:59:23.306076Z,uid:"CuYcyl29a2g8pYV1s",id:{orig_h:10.10.7.65,orig_p:64135(port=uint16),resp_h:184.72.208.63,resp_p:80(port)},trans_depth:1(uint64),method:"GET",host:"lc68.dsr.livefyre.com",uri:"/livecountping/98236508/663740012478?__=bye2msvgpn91&routed=1&jid&siteId=366439&networkId=livefyre.com",referrer:"http://www.asigra.com/blog/what-multi-tenancy-how-secure-it",version:"1.1",user_agent:"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",request_body_len:0(uint64),response_body_len:35(uint64),status_code:200(uint64),status_msg:"OK",info_code:null(uint64),info_msg:null(string),tags:|[]|(|[zenum=string]|),username:null(string),password:null(string),proxied:null(|[string]|),orig_fuids:null([string]),orig_filenames:null([string]),orig_mime_types:null([string]),resp_fuids:["F4NAir1FuTH6Oax2J4"],resp_filenames:null([string]),resp_mime_types:["text/json"]}
{_path:"conn",ts:2017-03-24T19:59:23.231562Z,uid:"CuYcyl29a2g8pYV1s",id:{orig_h:10.10.7.65,orig_p:64135(port=uint16),resp_h:184.72.208.63,resp_p:80(port)},proto:"tcp"(=zenum),service:"http",duration:182.422ms,orig_bytes:438(uint64),resp_bytes:588(uint64),conn_state:"SF",local_orig:null(bool),local_resp:null(bool),missed_bytes:0(uint64),history:"ShADadfF",orig_pkts:5(uint64),orig_ip_bytes:650(uint64),resp_pkts:5(uint64),resp_ip_bytes:800(uint64),tunnel_parents:null(|[string]|)}

real	0m17.494s
user	0m0.040s
sys	0m0.078s

$ time zed query 'from * | CuYcyl29a2g8pYV1s'
{_path:"files",ts:2017-03-24T19:59:23.399396Z,fuid:"F4NAir1FuTH6Oax2J4",tx_hosts:|[184.72.208.63]|,rx_hosts:|[10.10.7.65]|,conn_uids:|["CuYcyl29a2g8pYV1s"]|,source:"HTTP",depth:0(uint64),analyzers:|["MD5","SHA1","SHA256"]|,mime_type:"text/json",filename:null(string),duration:0s,local_orig:null(bool),is_orig:false,seen_bytes:35(uint64),total_bytes:null(uint64),missing_bytes:0(uint64),overflow_bytes:0(uint64),timedout:false,parent_fuid:null(string),md5:"dd82bbeb20a441bf55e62d42255de7f4",sha1:"90b01d6f0c6891651e8a729892e92592dd581762",sha256:"da1791b1b8361af6383790c6e77ef6eb149b22ed7e2d59b61ac2deb7884edf4d",extracted:null(string),extracted_cutoff:null(bool),extracted_size:null(uint64)}
{_path:"http",ts:2017-03-24T19:59:23.306076Z,uid:"CuYcyl29a2g8pYV1s",id:{orig_h:10.10.7.65,orig_p:64135(port=uint16),resp_h:184.72.208.63,resp_p:80(port)},trans_depth:1(uint64),method:"GET",host:"lc68.dsr.livefyre.com",uri:"/livecountping/98236508/663740012478?__=bye2msvgpn91&routed=1&jid&siteId=366439&networkId=livefyre.com",referrer:"http://www.asigra.com/blog/what-multi-tenancy-how-secure-it",version:"1.1",user_agent:"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0",request_body_len:0(uint64),response_body_len:35(uint64),status_code:200(uint64),status_msg:"OK",info_code:null(uint64),info_msg:null(string),tags:|[]|(|[zenum=string]|),username:null(string),password:null(string),proxied:null(|[string]|),orig_fuids:null([string]),orig_filenames:null([string]),orig_mime_types:null([string]),resp_fuids:["F4NAir1FuTH6Oax2J4"],resp_filenames:null([string]),resp_mime_types:["text/json"]}
{_path:"conn",ts:2017-03-24T19:59:23.231562Z,uid:"CuYcyl29a2g8pYV1s",id:{orig_h:10.10.7.65,orig_p:64135(port=uint16),resp_h:184.72.208.63,resp_p:80(port)},proto:"tcp"(=zenum),service:"http",duration:182.422ms,orig_bytes:438(uint64),resp_bytes:588(uint64),conn_state:"SF",local_orig:null(bool),local_resp:null(bool),missed_bytes:0(uint64),history:"ShADadfF",orig_pkts:5(uint64),orig_ip_bytes:650(uint64),resp_pkts:5(uint64),resp_ip_bytes:800(uint64),tunnel_parents:null(|[string]|)}

real	0m18.229s
user	0m0.042s
sys	0m0.047s

philrz avatar Mar 14 '24 15:03 philrz

Thank you for the perfect text you wrote

Even this problem exists during import and export

Maximum hard drive reading speed of 120 MB from 2GB In most cases, CPU consumption is less than 40% of the total power The amount of RAM used will gradually increase until it occupies the entire RAM

which increases the time of import and export

SoftTools59654 avatar Mar 16 '24 16:03 SoftTools59654