Support list of filters

Open bbotella opened this issue 8 years ago • 12 comments

Right now, the library's base export manager only supports one filter before transforming and one filter after transforming. I think we should support lists of filters both before and after transforming. This would allow us to combine different filtering approaches.

For example, right now we can't filter all the items with "country_code"=="us" and remove duplicates at the same time.

bbotella avatar Jun 24 '16 12:06 bbotella

I think this could be supported by adding yet another filter class, like MultipleFilter, that allows you to specify combinations of filters.

eliasdorneles avatar Jun 24 '16 12:06 eliasdorneles

I agree, this would be a very useful thing for some exports.

tsrdatatech avatar Jun 24 '16 13:06 tsrdatatech

@eliasdorneles What about always treating them as a "list of filters", just like we do with notifications? Of course, we would still allow single-filter configurations as we do now.

Filters would be applied FIFO, and an item only passes if all the filters pass. Thoughts @tsrdatatech?
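A minimal sketch of that behaviour, assuming each filter is just a callable that takes an item and returns True or False. The names `apply_filters`, `is_us` and `make_dedup_filter` are hypothetical and not part of the exporters API:

```python
def apply_filters(items, filters):
    """Yield the items that pass every filter, checking filters in list order."""
    for item in items:
        # all() short-circuits, so later filters are skipped once one fails.
        if all(f(item) for f in filters):
            yield item


def is_us(item):
    # Hypothetical filter: keep only items with country_code == "us".
    return item.get("country_code") == "us"


def make_dedup_filter(key):
    """Return a stateful filter that rejects items whose `key` was already seen."""
    seen = set()

    def dedup(item):
        value = item.get(key)
        if value in seen:
            return False
        seen.add(value)
        return True

    return dedup


items = [
    {"id": 1, "country_code": "us"},
    {"id": 1, "country_code": "us"},
    {"id": 2, "country_code": "es"},
    {"id": 3, "country_code": "us"},
]
filtered = list(apply_filters(items, [is_us, make_dedup_filter("id")]))
```

Order matters once a filter keeps state: with the duplicates filter placed after `is_us`, items rejected by the country check are never recorded as seen.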

bbotella avatar Jul 04 '16 21:07 bbotella

My problem with a list of filters is that it's ambiguous (e.g. will the result be an AND or an OR?) -- essentially, the same problem we currently have with the KeyValue filters. I prefer a new MultipleFilter class that allows you to choose how filters are combined, so that users can write something like (FilterA AND FilterB) OR FilterC.

A good source of inspiration is the MongoDB filtering API.
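To make the (FilterA AND FilterB) OR FilterC idea concrete, here is a rough sketch using two small combinator classes. `And`, `Or` and the three example filters are hypothetical names for illustration only. They are not the MultipleFilter class itself and not existing exporters code, and filters are assumed to be plain callables returning True or False:

```python
class And:
    """Passes an item only if every wrapped filter passes it."""

    def __init__(self, *filters):
        self.filters = filters

    def __call__(self, item):
        return all(f(item) for f in self.filters)


class Or:
    """Passes an item if at least one wrapped filter passes it."""

    def __init__(self, *filters):
        self.filters = filters

    def __call__(self, item):
        return any(f(item) for f in self.filters)


# Hypothetical example filters, each a callable returning a bool.
def filter_a(item):
    return item.get("country") == "United States"


def filter_b(item):
    return item.get("city") == "New York"


def filter_c(item):
    return item.get("status") == "Checked"


# (FilterA AND FilterB) OR FilterC
combined = Or(And(filter_a, filter_b), filter_c)
```

Because the combinators are themselves callables, they nest to any depth, which removes the AND/OR ambiguity of a flat list.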

eliasdorneles avatar Jul 04 '16 21:07 eliasdorneles

Loving that "filter composition" proposal. Will go with it.

bbotella avatar Jul 04 '16 21:07 bbotella

There is some discussion in https://github.com/scrapinghub/exporters/pull/312

eliasdorneles avatar Jul 21 '16 17:07 eliasdorneles

My approach is mostly based on Mongo Query.

First of all, I think that the filter key in the config should be a list, named filters to be clear. After that, I think a filter composition could be as simple as:

"filters": [
    {"name": "exporters.filters.key_value_regex_filter.KeyValueRegexFilter",
     "options": {
         "keys": [
             {"name": "country", "value": "United States"}
         ]
     }},
    {"name": "exporters.filters.key_value_regex_filter.KeyValueRegexFilter",
     "options": {
         "keys": [
             {"name": "city", "value": "New York"}
         ]
     }},
    {"or": [
        {"name": "exporters.filters.key_value_regex_filter.KeyValueRegexFilter",
         "options": {
             "keys": [
                 {"name": "country", "value": "Canada"}
             ]
         }},
        {"name": "exporters.filters.key_value_regex_filter.KeyValueRegexFilter",
         "options": {
             "keys": [
                 {"name": "city", "value": "Montreal"}
             ]
         }}
    ]}
]

The config above could be translated to: (country == 'United States' and city == 'New York') or (country == 'Canada' and city == 'Montreal')

You could also nest two or filters:

"filters": [
    {"name": "exporters.filters.key_value_regex_filter.KeyValueRegexFilter",
     "options": {
         "keys": [
             {"name": "country", "value": "United States"}
         ]
     }},
    {"name": "exporters.filters.key_value_regex_filter.KeyValueRegexFilter",
     "options": {
         "keys": [
             {"name": "city", "value": "New York"}
         ]
     }},
    {"or": [
        {"name": "exporters.filters.key_value_regex_filter.KeyValueRegexFilter",
         "options": {
             "keys": [
                 {"name": "country", "value": "Canada"}
             ]
         }},
        {"or": [
            {"name": "exporters.filters.key_value_regex_filter.KeyValueRegexFilter",
             "options": {
                 "keys": [
                     {"name": "city", "value": "Montreal"}
                 ]
             }},
            {"name": "exporters.filters.key_value_regex_filter.KeyValueRegexFilter",
             "options": {
                 "keys": [
                     {"name": "status", "value": "Checked"}
                 ]
             }}
        ]}
    ]}
]

The config above could be translated to: (country == 'United States' and city == 'New York') or (country == 'Canada' or (city == 'Montreal' and status == 'Checked'))
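For reference, a rough evaluator for this nested format might look like the sketch below. It assumes the semantics implied by the two translations above: plain filter entries in a list are ANDed together, and each "or" entry is an alternative that is ORed with that group. `matches`, `build_filter` and `kv` are hypothetical helpers, and `build_filter` does an exact key/value comparison as a stand-in for loading the real KeyValueRegexFilter class:

```python
KV = "exporters.filters.key_value_regex_filter.KeyValueRegexFilter"


def kv(name, value):
    """Build one filter entry in the config format shown above."""
    return {"name": KV, "options": {"keys": [{"name": name, "value": value}]}}


def build_filter(node):
    """Stand-in for instantiating node["name"]: exact match on every key."""
    keys = node["options"]["keys"]
    return lambda item: all(item.get(k["name"]) == k["value"] for k in keys)


def matches(item, nodes, build_filter):
    """Evaluate a list of filter entries against an item.

    Assumed semantics: plain entries are ANDed; every {"or": [...]} entry is
    an alternative ORed with that AND group. An empty list matches nothing.
    """
    plain = [n for n in nodes if "or" not in n]
    alternatives = [n["or"] for n in nodes if "or" in n]
    if plain and all(build_filter(n)(item) for n in plain):
        return True
    return any(matches(item, alt, build_filter) for alt in alternatives)


# Same structure as the second config above.
filters = [
    kv("country", "United States"),
    kv("city", "New York"),
    {"or": [
        kv("country", "Canada"),
        {"or": [kv("city", "Montreal"), kv("status", "Checked")]},
    ]},
]

in_new_york = matches({"country": "United States", "city": "New York"},
                      filters, build_filter)
in_canada = matches({"country": "Canada"}, filters, build_filter)
```

Treating an empty list as "matches nothing" is a choice made only for this sketch; a real implementation would need to decide that case explicitly.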

This is for sure tricky to implement, but I think that it's flexible and user-friendly enough. What do you guys think of this approach? cc: @eliasdorneles, @bbotella, @tsrdatatech

raphapassini avatar Jul 26 '16 19:07 raphapassini

Hey, that's pretty cool -- I like it! :+1: :+1:

eliasdorneles avatar Jul 26 '16 19:07 eliasdorneles

Well, I'll do a proof-of-concept of this idea. Please let me know if you guys have comments about that! :)

raphapassini avatar Jul 26 '16 19:07 raphapassini

I have some work done in the multiple_fitlers branch at my fork. Most of the code and tests are done. @eliasdorneles, I would appreciate it if you could take a look and share your ideas with me. =)

I think tomorrow I'll update the docs and make the PR.

raphapassini avatar Jul 27 '16 21:07 raphapassini

@raphapassini please create a PR, it's hard to review & discuss in branches or lone commits.

eliasdorneles avatar Jul 27 '16 22:07 eliasdorneles

PR - https://github.com/scrapinghub/exporters/pull/325

raphapassini avatar Aug 01 '16 17:08 raphapassini