devextreme-query-mongodb icon indicating copy to clipboard operation
devextreme-query-mongodb copied to clipboard

Research perf improvements for "contains" searches

Open oliversturm opened this issue 5 years ago • 11 comments

contains currently uses a regex. I don't remember precisely why I did it that way in the past, but I believe I researched other options back then and found that they lacked some functionality. However, regex searches are obviously not the most performant ones with large data volumes.

This StackOverflow discussion mentions the option of using $indexOfCP:

db.test.aggregate([{$match: 
    {$expr: { $gt: [{ $indexOfCP: [ "$A", "Star Wars" ] }, -1]}}
    }, {$project: {A:1}}])

There is also a suggestion of adding $toUpper as needed for case insensitive searches. Current MongoDB also supports $substrBytes and $substrCP. (Edit: these two have completely different purposes.)

oliversturm avatar Nov 03 '20 15:11 oliversturm

Strangely, the docs do not mention whether any of these functions take advantage of existing indexes. The regex based query does not, understandably, but do these other functions? Is there a MongoDB docs page that describes index use per operator?

oliversturm avatar Nov 03 '20 15:11 oliversturm

A simple $regex query with a basic collection, unchanged after a data import, looks like this:

> db.data.explain("executionStats").aggregate([{"$match": { "ProductName": { "$regex": "^contoso battery charger", "$options": "i" } } }])
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "demo.data",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"ProductName" : {
				"$regex" : "^contoso battery charger",
				"$options" : "i"
			}
		},
		"queryHash" : "0D240FCD",
		"planCacheKey" : "0D240FCD",
		"optimizedPipeline" : true,
		"winningPlan" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"ProductName" : {
					"$regex" : "^contoso battery charger",
					"$options" : "i"
				}
			},
			"direction" : "forward"
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 1563,
		"executionTimeMillis" : 543,
		"totalKeysExamined" : 0,
		"totalDocsExamined" : 1000000,
		"executionStages" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"ProductName" : {
					"$regex" : "^contoso battery charger",
					"$options" : "i"
				}
			},
			"nReturned" : 1563,
			"executionTimeMillisEstimate" : 1,
			"works" : 1000002,
			"advanced" : 1563,
			"needTime" : 998438,
			"needYield" : 0,
			"saveState" : 7812,
			"restoreState" : 7812,
			"isEOF" : 1,
			"direction" : "forward",
			"docsExamined" : 1000000
		}
	},
	"serverInfo" : {
		"host" : "b1c70a1e87a3",
		"port" : 27017,
		"version" : "4.2.0",
		"gitVersion" : "a4b751dcf51dd249c5865812b390cfd1c0129c30"
	},
	"ok" : 1
}

oliversturm avatar Nov 03 '20 17:11 oliversturm

A search for a specific field also doesn't use an index - obviously, we don't have one.

> db.data.explain("executionStats").aggregate([{"$match": { "ProductName": "Contoso Battery charger - bike E200 Black" } }])
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "demo.data",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"ProductName" : {
				"$eq" : "Contoso Battery charger - bike E200 Black"
			}
		},
		"queryHash" : "C070B184",
		"planCacheKey" : "C070B184",
		"optimizedPipeline" : true,
		"winningPlan" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"ProductName" : {
					"$eq" : "Contoso Battery charger - bike E200 Black"
				}
			},
			"direction" : "forward"
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 492,
		"executionTimeMillis" : 426,
		"totalKeysExamined" : 0,
		"totalDocsExamined" : 1000000,
		"executionStages" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"ProductName" : {
					"$eq" : "Contoso Battery charger - bike E200 Black"
				}
			},
			"nReturned" : 492,
			"executionTimeMillisEstimate" : 1,
			"works" : 1000002,
			"advanced" : 492,
			"needTime" : 999509,
			"needYield" : 0,
			"saveState" : 7812,
			"restoreState" : 7812,
			"isEOF" : 1,
			"direction" : "forward",
			"docsExamined" : 1000000
		}
	},
	"serverInfo" : {
		"host" : "b1c70a1e87a3",
		"port" : 27017,
		"version" : "4.2.0",
		"gitVersion" : "a4b751dcf51dd249c5865812b390cfd1c0129c30"
	},
	"ok" : 1
}

oliversturm avatar Nov 03 '20 17:11 oliversturm

With an index, the specific query obviously uses it:

> db.data.createIndex({ProductName:1})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}
> db.data.explain("executionStats").aggregate([{"$match": { "ProductName": "Contoso Battery charger - bike E200 Black" } }])
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "demo.data",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"ProductName" : {
				"$eq" : "Contoso Battery charger - bike E200 Black"
			}
		},
		"queryHash" : "C070B184",
		"planCacheKey" : "55C5EB1A",
		"optimizedPipeline" : true,
		"winningPlan" : {
			"stage" : "FETCH",
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"ProductName" : 1
				},
				"indexName" : "ProductName_1",
				"isMultiKey" : false,
				"multiKeyPaths" : {
					"ProductName" : [ ]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"ProductName" : [
						"[\"Contoso Battery charger - bike E200 Black\", \"Contoso Battery charger - bike E200 Black\"]"
					]
				}
			}
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 492,
		"executionTimeMillis" : 4,
		"totalKeysExamined" : 492,
		"totalDocsExamined" : 492,
		"executionStages" : {
			"stage" : "FETCH",
			"nReturned" : 492,
			"executionTimeMillisEstimate" : 0,
			"works" : 493,
			"advanced" : 492,
			"needTime" : 0,
			"needYield" : 0,
			"saveState" : 3,
			"restoreState" : 3,
			"isEOF" : 1,
			"docsExamined" : 492,
			"alreadyHasObj" : 0,
			"inputStage" : {
				"stage" : "IXSCAN",
				"nReturned" : 492,
				"executionTimeMillisEstimate" : 0,
				"works" : 493,
				"advanced" : 492,
				"needTime" : 0,
				"needYield" : 0,
				"saveState" : 3,
				"restoreState" : 3,
				"isEOF" : 1,
				"keyPattern" : {
					"ProductName" : 1
				},
				"indexName" : "ProductName_1",
				"isMultiKey" : false,
				"multiKeyPaths" : {
					"ProductName" : [ ]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"ProductName" : [
						"[\"Contoso Battery charger - bike E200 Black\", \"Contoso Battery charger - bike E200 Black\"]"
					]
				},
				"keysExamined" : 492,
				"seeks" : 1,
				"dupsTested" : 0,
				"dupsDropped" : 0
			}
		}
	},
	"serverInfo" : {
		"host" : "b1c70a1e87a3",
		"port" : 27017,
		"version" : "4.2.0",
		"gitVersion" : "a4b751dcf51dd249c5865812b390cfd1c0129c30"
	},
	"ok" : 1
}

oliversturm avatar Nov 03 '20 17:11 oliversturm

... and so does the regex query:

> db.data.explain("executionStats").aggregate([{"$match": { "ProductName": { "$regex": "^contoso battery charger", "$options": "i" } } }])
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "demo.data",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"ProductName" : {
				"$regex" : "^contoso battery charger",
				"$options" : "i"
			}
		},
		"queryHash" : "0D240FCD",
		"planCacheKey" : "829CC92A",
		"optimizedPipeline" : true,
		"winningPlan" : {
			"stage" : "FETCH",
			"inputStage" : {
				"stage" : "IXSCAN",
				"filter" : {
					"ProductName" : {
						"$regex" : "^contoso battery charger",
						"$options" : "i"
					}
				},
				"keyPattern" : {
					"ProductName" : 1
				},
				"indexName" : "ProductName_1",
				"isMultiKey" : false,
				"multiKeyPaths" : {
					"ProductName" : [ ]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"ProductName" : [
						"[\"\", {})",
						"[/^contoso battery charger/i, /^contoso battery charger/i]"
					]
				}
			}
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 1563,
		"executionTimeMillis" : 703,
		"totalKeysExamined" : 1000000,
		"totalDocsExamined" : 1563,
		"executionStages" : {
			"stage" : "FETCH",
			"nReturned" : 1563,
			"executionTimeMillisEstimate" : 7,
			"works" : 1000001,
			"advanced" : 1563,
			"needTime" : 998437,
			"needYield" : 0,
			"saveState" : 7812,
			"restoreState" : 7812,
			"isEOF" : 1,
			"docsExamined" : 1563,
			"alreadyHasObj" : 0,
			"inputStage" : {
				"stage" : "IXSCAN",
				"filter" : {
					"ProductName" : {
						"$regex" : "^contoso battery charger",
						"$options" : "i"
					}
				},
				"nReturned" : 1563,
				"executionTimeMillisEstimate" : 6,
				"works" : 1000001,
				"advanced" : 1563,
				"needTime" : 998437,
				"needYield" : 0,
				"saveState" : 7812,
				"restoreState" : 7812,
				"isEOF" : 1,
				"keyPattern" : {
					"ProductName" : 1
				},
				"indexName" : "ProductName_1",
				"isMultiKey" : false,
				"multiKeyPaths" : {
					"ProductName" : [ ]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"ProductName" : [
						"[\"\", {})",
						"[/^contoso battery charger/i, /^contoso battery charger/i]"
					]
				},
				"keysExamined" : 1000000,
				"seeks" : 1,
				"dupsTested" : 0,
				"dupsDropped" : 0
			}
		}
	},
	"serverInfo" : {
		"host" : "b1c70a1e87a3",
		"port" : 27017,
		"version" : "4.2.0",
		"gitVersion" : "a4b751dcf51dd249c5865812b390cfd1c0129c30"
	},
	"ok" : 1
}

Even without the '^', the index is still being used. This is unexpected - I need to check this further.

oliversturm avatar Nov 03 '20 17:11 oliversturm

Ah yes... it's a bit strange. Even with the $regex condition, the execution plan mentions the index. It appears to use it for the query as well. However, when the index is used with a specific field query, I see these stats:

	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 492,
		"executionTimeMillis" : 4,
		"totalKeysExamined" : 492,
		"totalDocsExamined" : 492,

The number of docs examined went down from 1 million, and the keys are exactly the same number. The execution time is down to about 1% of the non-index query.

When using the $regex query, the stats show this:

	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 1563,
		"executionTimeMillis" : 703,
		"totalKeysExamined" : 1000000,
		"totalDocsExamined" : 1563,

The regex case "examines" a million (all) keys, not just a few - so the index doesn't actually do anything to speed up the regex case and the execution time remains high. In fact, in several runs it was never faster than the non-index search I documented above (543 millis).

It is not surprising to me that the index doesn't improve performance for the $regex query. It is only surprising that the chosen plan uses that index, especially since it appears to slow down performance. I would have expected MongoDB to realize that the index won't improve things for the $regex query, and to decide against using it.

oliversturm avatar Nov 04 '20 10:11 oliversturm

Now testing $indexOfCP:

> db.data.explain("executionStats").aggregate([{"$match": { $expr: {$gt:[ {$indexOfCP: ["$ProductName", "Contoso Battery charger"] } , -1 ] } } }])
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "demo.data",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"$expr" : {
				"$gt" : [
					{
						"$indexOfCP" : [
							"$ProductName",
							{
								"$const" : "Contoso Battery charger"
							}
						]
					},
					{
						"$const" : -1
					}
				]
			}
		},
		"queryHash" : "0C022C46",
		"planCacheKey" : "0C022C46",
		"optimizedPipeline" : true,
		"winningPlan" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"$expr" : {
					"$gt" : [
						{
							"$indexOfCP" : [
								"$ProductName",
								{
									"$const" : "Contoso Battery charger"
								}
							]
						},
						{
							"$const" : -1
						}
					]
				}
			},
			"direction" : "forward"
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 1563,
		"executionTimeMillis" : 2038,
		"totalKeysExamined" : 0,
		"totalDocsExamined" : 1000000,
		"executionStages" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"$expr" : {
					"$gt" : [
						{
							"$indexOfCP" : [
								"$ProductName",
								{
									"$const" : "Contoso Battery charger"
								}
							]
						},
						{
							"$const" : -1
						}
					]
				}
			},
			"nReturned" : 1563,
			"executionTimeMillisEstimate" : 140,
			"works" : 1000002,
			"advanced" : 1563,
			"needTime" : 998438,
			"needYield" : 0,
			"saveState" : 7812,
			"restoreState" : 7812,
			"isEOF" : 1,
			"direction" : "forward",
			"docsExamined" : 1000000
		}
	},
	"serverInfo" : {
		"host" : "b1c70a1e87a3",
		"port" : 27017,
		"version" : "4.2.0",
		"gitVersion" : "a4b751dcf51dd249c5865812b390cfd1c0129c30"
	},
	"ok" : 1
}

It does not appear that this query uses the existing index on the field. I guess it could, so not sure why this doesn't happen. Also notable: the execution takes more than three times as long as the $regex method.

oliversturm avatar Nov 04 '20 11:11 oliversturm

I think the reason why $indexOfCP doesn't use the index is that its first parameter is an arbitrary string expression. MongoDB does not recognize the fact that I'm "just" pulling in a value from an indexed field - it could, since the Expression syntax $FieldName is a specific thing according to docs, but clearly it does not treat this as a special case where an index could be used.

oliversturm avatar Nov 04 '20 12:11 oliversturm

Additional idea: MongoDB's built-in text search. This is based on a special index with a special search construct.

> db.data.createIndex({ProductName: "text"})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 2,
	"numIndexesAfter" : 3,
	"ok" : 1
}
> db.data.explain("executionStats").aggregate([{"$match": { $text: { $search: "\"contoso battery charger\"", $caseSensitive: false }  } }])
{
	"stages" : [
		{
			"$cursor" : {
				"query" : {
					"$text" : {
						"$search" : "\"contoso battery charger\"",
						"$caseSensitive" : false
					}
				},
				"fields" : {
					"$textScore" : {
						"$meta" : "textScore"
					}
				},
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "demo.data",
					"indexFilterSet" : false,
					"parsedQuery" : {
						"$text" : {
							"$search" : "\"contoso battery charger\"",
							"$language" : "english",
							"$caseSensitive" : false,
							"$diacriticSensitive" : false
						}
					},
					"queryHash" : "83098EE1",
					"planCacheKey" : "7E2D582B",
					"winningPlan" : {
						"stage" : "PROJECTION_DEFAULT",
						"transformBy" : {
							"$textScore" : {
								"$meta" : "textScore"
							}
						},
						"inputStage" : {
							"stage" : "TEXT",
							"indexPrefix" : {
								
							},
							"indexName" : "ProductName_text",
							"parsedTextQuery" : {
								"terms" : [
									"batteri",
									"charger",
									"contoso"
								],
								"negatedTerms" : [ ],
								"phrases" : [
									"contoso battery charger"
								],
								"negatedPhrases" : [ ]
							},
							"textIndexVersion" : 3,
							"inputStage" : {
								"stage" : "TEXT_MATCH",
								"inputStage" : {
									"stage" : "TEXT_OR",
									"inputStages" : [
										{
											"stage" : "IXSCAN",
											"keyPattern" : {
												"_fts" : "text",
												"_ftsx" : 1
											},
											"indexName" : "ProductName_text",
											"isMultiKey" : true,
											"isUnique" : false,
											"isSparse" : false,
											"isPartial" : false,
											"indexVersion" : 2,
											"direction" : "backward",
											"indexBounds" : {
												
											}
										},
										{
											"stage" : "IXSCAN",
											"keyPattern" : {
												"_fts" : "text",
												"_ftsx" : 1
											},
											"indexName" : "ProductName_text",
											"isMultiKey" : true,
											"isUnique" : false,
											"isSparse" : false,
											"isPartial" : false,
											"indexVersion" : 2,
											"direction" : "backward",
											"indexBounds" : {
												
											}
										},
										{
											"stage" : "IXSCAN",
											"keyPattern" : {
												"_fts" : "text",
												"_ftsx" : 1
											},
											"indexName" : "ProductName_text",
											"isMultiKey" : true,
											"isUnique" : false,
											"isSparse" : false,
											"isPartial" : false,
											"indexVersion" : 2,
											"direction" : "backward",
											"indexBounds" : {
												
											}
										}
									]
								}
							}
						}
					},
					"rejectedPlans" : [ ]
				},
				"executionStats" : {
					"executionSuccess" : true,
					"nReturned" : 1563,
					"executionTimeMillis" : 1413,
					"totalKeysExamined" : 301234,
					"totalDocsExamined" : 279712,
					"executionStages" : {
						"stage" : "PROJECTION_DEFAULT",
						"nReturned" : 1563,
						"executionTimeMillisEstimate" : 121,
						"works" : 580951,
						"advanced" : 1563,
						"needTime" : 579387,
						"needYield" : 0,
						"saveState" : 4540,
						"restoreState" : 4540,
						"isEOF" : 1,
						"transformBy" : {
							"$textScore" : {
								"$meta" : "textScore"
							}
						},
						"inputStage" : {
							"stage" : "TEXT",
							"nReturned" : 1563,
							"executionTimeMillisEstimate" : 121,
							"works" : 580951,
							"advanced" : 1563,
							"needTime" : 579387,
							"needYield" : 0,
							"saveState" : 4540,
							"restoreState" : 4540,
							"isEOF" : 1,
							"indexPrefix" : {
								
							},
							"indexName" : "ProductName_text",
							"parsedTextQuery" : {
								"terms" : [
									"batteri",
									"charger",
									"contoso"
								],
								"negatedTerms" : [ ],
								"phrases" : [
									"contoso battery charger"
								],
								"negatedPhrases" : [ ]
							},
							"textIndexVersion" : 3,
							"inputStage" : {
								"stage" : "TEXT_MATCH",
								"nReturned" : 1563,
								"executionTimeMillisEstimate" : 120,
								"works" : 580951,
								"advanced" : 1563,
								"needTime" : 579387,
								"needYield" : 0,
								"saveState" : 4540,
								"restoreState" : 4540,
								"isEOF" : 1,
								"docsRejected" : 278149,
								"inputStage" : {
									"stage" : "TEXT_OR",
									"nReturned" : 279712,
									"executionTimeMillisEstimate" : 88,
									"works" : 580951,
									"advanced" : 279712,
									"needTime" : 301238,
									"needYield" : 0,
									"saveState" : 4540,
									"restoreState" : 4540,
									"isEOF" : 1,
									"docsExamined" : 279712,
									"inputStages" : [
										{
											"stage" : "IXSCAN",
											"nReturned" : 15519,
											"executionTimeMillisEstimate" : 1,
											"works" : 15520,
											"advanced" : 15519,
											"needTime" : 0,
											"needYield" : 0,
											"saveState" : 4540,
											"restoreState" : 4540,
											"isEOF" : 1,
											"keyPattern" : {
												"_fts" : "text",
												"_ftsx" : 1
											},
											"indexName" : "ProductName_text",
											"isMultiKey" : true,
											"isUnique" : false,
											"isSparse" : false,
											"isPartial" : false,
											"indexVersion" : 2,
											"direction" : "backward",
											"indexBounds" : {
												
											},
											"keysExamined" : 15519,
											"seeks" : 1,
											"dupsTested" : 15519,
											"dupsDropped" : 0
										},
										{
											"stage" : "IXSCAN",
											"nReturned" : 6003,
											"executionTimeMillisEstimate" : 0,
											"works" : 6004,
											"advanced" : 6003,
											"needTime" : 0,
											"needYield" : 0,
											"saveState" : 4540,
											"restoreState" : 4540,
											"isEOF" : 1,
											"keyPattern" : {
												"_fts" : "text",
												"_ftsx" : 1
											},
											"indexName" : "ProductName_text",
											"isMultiKey" : true,
											"isUnique" : false,
											"isSparse" : false,
											"isPartial" : false,
											"indexVersion" : 2,
											"direction" : "backward",
											"indexBounds" : {
												
											},
											"keysExamined" : 6003,
											"seeks" : 1,
											"dupsTested" : 6003,
											"dupsDropped" : 0
										},
										{
											"stage" : "IXSCAN",
											"nReturned" : 279712,
											"executionTimeMillisEstimate" : 22,
											"works" : 279713,
											"advanced" : 279712,
											"needTime" : 0,
											"needYield" : 0,
											"saveState" : 4540,
											"restoreState" : 4540,
											"isEOF" : 1,
											"keyPattern" : {
												"_fts" : "text",
												"_ftsx" : 1
											},
											"indexName" : "ProductName_text",
											"isMultiKey" : true,
											"isUnique" : false,
											"isSparse" : false,
											"isPartial" : false,
											"indexVersion" : 2,
											"direction" : "backward",
											"indexBounds" : {
												
											},
											"keysExamined" : 279712,
											"seeks" : 1,
											"dupsTested" : 279712,
											"dupsDropped" : 0
										}
									]
								}
							}
						}
					}
				}
			}
		}
	],
	"ok" : 1
}

oliversturm avatar Nov 04 '20 12:11 oliversturm

Relevant part is this:

				"executionStats" : {
					"executionSuccess" : true,
					"nReturned" : 1563,
					"executionTimeMillis" : 1413,
					"totalKeysExamined" : 301234,
					"totalDocsExamined" : 279712,

Short: this method takes at least twice as long as the regex search. It enables complex search functionality, but it's certainly not very performant in the test case.

oliversturm avatar Nov 04 '20 12:11 oliversturm

At this point I'm at a loss. It appears that MongoDB does not support any substring search mechanism that actually uses an index efficiently. The suggested method $indexOfCP is much slower than regular expressions and so is the specialized text search index. I wonder if the balance would shift with larger document counts, perhaps I'll take the time to test this. Meanwhile I think a collection of 1 million documents is a good test case - just large enough for significant time to be spent, but small enough to still be representative of a "typical" application use case.

I have not found a search method that could be used as a better alternative for the current regex-based implementations of the contains, startsWith and endsWith operators.

oliversturm avatar Nov 04 '20 12:11 oliversturm