Spark-MongoDB

Question - Applying query filters while loading collections

gohilankit opened this issue 8 years ago • 2 comments

I am using this API to query a large MongoDB collection. Is there any way to specify query filters so that only selected documents are loaded as a DataFrame, rather than the whole collection? I am looking for something equivalent to find({'key':'value'}), or more complex MongoDB queries. I am currently using spark-mongodb_2.10:0.11.0 and querying in PySpark with the command below; the load() method loads the whole collection, which takes a lot of time.

reader = sqlContext.read.format("com.stratio.datasource.mongodb")
data = reader.options(host='10.219.51.10:27017', database='ProductionEvents', collection='srDates').load()

gohilankit avatar Jun 10 '16 22:06 gohilankit

Hi @gohilankit

There are several ways to specify filters; the easiest is to use the filter function.

Thus, you should do something like data.filter(data.age > 3).collect(). When possible, the filter is pushed down to MongoDB.

DataFrames are lazy, so the query is only executed when a Spark action (collect, first, take, ...) is performed.
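To make the pushdown point concrete, here is a minimal plain-Python sketch (not PySpark, and not the connector's real API; the collection, fetch_all, and fetch_matching names are hypothetical). It contrasts filtering after loading everything with applying the predicate at the source, which is what pushdown does for a filter like find({'key':'value'}):

```python
# Sketch of why predicate pushdown matters, in plain Python.
# A list of dicts stands in for a MongoDB collection;
# fetch_all / fetch_matching are illustrative names, not a real API.

collection = [{"age": a} for a in range(10)]

def fetch_all(coll):
    """No pushdown: every document is loaded, the filter runs client-side."""
    scanned = list(coll)  # whole collection crosses the wire
    return scanned, [d for d in scanned if d["age"] > 3]

def fetch_matching(coll, predicate):
    """With pushdown: the predicate is applied at the source."""
    scanned = [d for d in coll if predicate(d)]  # only matches are loaded
    return scanned, scanned

all_scanned, result_a = fetch_all(collection)
pushed_scanned, result_b = fetch_matching(collection, lambda d: d["age"] > 3)

assert result_a == result_b  # same answer either way
print(len(all_scanned), len(pushed_scanned))  # 10 vs 6 documents moved
```

The results are identical, but the pushed-down version moves only the matching documents, which is the difference the original question is asking about for a large collection.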

darroyocazorla avatar Jun 14 '16 07:06 darroyocazorla

Do you mean that, with this approach, Spark will load only the filtered data instead of the whole target collection? Is that how the Spark SQL query is performed?

DeeeFOX avatar Sep 30 '16 07:09 DeeeFOX