Record overlap: `rewe-shop`, `rewe-group-com`

Open mal-tee opened this issue 2 years ago • 7 comments

Both records have "Rewe Markt GmbH" in their `runs` array. Seems like a mistake we should resolve?

Mar 11 '23 11:03 mal-tee

Thank you for opening this issue (based on my email).

Mar 11 '23 12:03 WebworkrNet

Should we turn this into a test? @baltpeter

Mar 24 '23 12:03 mal-tee

I haven't looked into that particular case yet. Are we sure that that is a mistake?

But, either way, we can't generally forbid two records having identical runs entries. There are already valid records where that is the case, e.g. the Amazon records for different companies:

https://github.com/datenanfragen/data/blob/master/companies/amazon-de.json
https://github.com/datenanfragen/data/blob/master/companies/amazon-es.json

Mar 24 '23 13:03 baltpeter

> I haven't looked into that particular case yet. Are we sure that that is a mistake?

Haven't looked either. :sweat_smile:

> But, either way, we can't generally forbid two records having identical runs entries. There are already valid records where that is the case, e.g. the Amazon records for different companies:
>
> master/companies/amazon-de.json master/companies/amazon-es.json

Yeah, we should only do that test if there is no overlap in the countries. :thinking:

Mar 24 '23 13:03 mal-tee

> Yeah, we should only do that test if there is no overlap in the countries. :thinking:

If there is overlap in the countries, you mean, right?

But even then, I'm not sure whether there can never be a case where that is valid…
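
The country condition discussed here could be sketched roughly as follows. This is only an illustration, not the project's actual test: the function name `countries_overlap` and the example records are hypothetical, and treating a missing `relevant-countries` key like `["all"]` is an assumption.

```python
def countries_overlap(a: dict, b: dict) -> bool:
    """Return True if two company records can apply to the same country."""
    # Assumption: a record without "relevant-countries" applies everywhere.
    countries_a = set(a.get("relevant-countries", ["all"]))
    countries_b = set(b.get("relevant-countries", ["all"]))
    # "all" overlaps with every other country list.
    if "all" in countries_a or "all" in countries_b:
        return True
    return bool(countries_a & countries_b)


# Hypothetical records, only for illustration.
example_de = {"slug": "example-de", "relevant-countries": ["de"]}
example_es = {"slug": "example-es", "relevant-countries": ["es"]}
example_all = {"slug": "example-all", "relevant-countries": ["all"]}

print(countries_overlap(example_de, example_es))   # False
print(countries_overlap(example_de, example_all))  # True
```

Two records that share a `runs` entry would then only be candidates for a closer look if this returns `True`. As noted, even that may not always indicate a mistake.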

Mar 24 '23 13:03 baltpeter

> If there is overlap in the countries, you mean, right?

Yes, oops.

I wrote a little script to implement this:

```python
from collections import defaultdict
import json
import os

# Map every company name and every "runs" entry to the slugs of the records using it.
hashmap = defaultdict(list)

for file in os.listdir("companies/"):
    with open("companies/" + file, "r") as f:
        company = json.load(f)
        slug = company["slug"]
        hashmap[company["name"]].append(slug)
        if "runs" in company:
            for run in company["runs"]:
                hashmap[run].append(slug)

# Names and runs entries that were collected more than once.
simple_overlap = {k: v for k, v in hashmap.items() if len(v) > 1}
print("simple", len(simple_overlap))
for name, slugs in simple_overlap.items():
    used_rvs = defaultdict(list)  # country code -> slugs relevant for that country
    alls = set()  # contains `name` if one of its records applies to all countries
    for slug in slugs:
        with open("companies/" + slug + ".json", "r") as f:
            company = json.load(f)
            if "relevant-countries" in company:
                if company["relevant-countries"] == ["all"]:
                    alls.add(name)
                else:
                    for rv in company["relevant-countries"]:
                        used_rvs[rv].append(slug)
    # Keep countries listed by more than two records, or every listed country
    # if one of the records applies to all countries.
    filtered_overlap = {k: v for k, v in used_rvs.items() if len(v) > 2 or name in alls}
    if filtered_overlap:
        print(name, filtered_overlap, alls)
```

```
simple 38
REWE Markt GmbH {'de': ['rewe-shop']} {'REWE Markt GmbH'}
Ideawise Limited {'de': ['gay-de', 'fetisch-de', 'poppen-de', 'kaufmich-com']} set()
Seven.One Entertainment Group GmbH {'de': ['sat1gold', 'prosieben', 'kabeleinsdoku', 'kabeleins']} set()
cpx online active AG {'de': ['optivel'], 'ch': ['optivel'], 'fr': ['optivel'], 'at': ['optivel']} {'cpx online active AG'}
Ingenico Payment Services GmbH {'de': ['ingenico-de']} {'Ingenico Payment Services GmbH'}
Ingenico Healthcare GmbH {'de': ['ingenico-de']} {'Ingenico Healthcare GmbH'}
```
  1. The initial case for this issue. Seems legit, since the websites are different.
  2. The websites are different.
  3. Same.
  4. ...

Yeah, we'd also have to check if the websites are different. And probably every other key as well.
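
A website comparison could look roughly like this. It is a sketch under assumptions rather than a finished check: the helper names `normalized_host` and `same_website` are made up, it assumes each record stores its website URL under a `web` key, and it treats two URLs as the same website when their host names match after dropping a leading `www.`.

```python
from urllib.parse import urlsplit


def normalized_host(url: str) -> str:
    """Return the lower-cased host name of a URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def same_website(a: dict, b: dict) -> bool:
    """Return True if both records point to the same host."""
    # Assumption: the website URL is stored under the "web" key.
    web_a, web_b = a.get("web"), b.get("web")
    if not web_a or not web_b:
        return False  # nothing to compare, so do not flag the pair
    host_a, host_b = normalized_host(web_a), normalized_host(web_b)
    # An empty host (e.g. a URL without a scheme) never counts as a match.
    return bool(host_a) and host_a == host_b


# Hypothetical records, only for illustration.
shop = {"slug": "example-shop", "web": "https://www.example.com/shop"}
group = {"slug": "example-group", "web": "https://example.org"}
print(same_website(shop, group))  # False
```

Comparing only the host is a deliberate simplification. Records that live under different paths of the same domain would be flagged as identical, so a stricter or looser rule may fit the data better.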


However, we can close this issue: the REWE Group collision is okay, since the websites are different.

Mar 24 '23 23:03 mal-tee

I consider my original concern unresolved. The database currently lists two responsible entities for REWE Markt GmbH:

  • REWE Markt GmbH
  • REWE Zentralfinanz eG

As I understand it, this cannot be correct, because the assignment is then ambiguous. Which sources indicate that REWE Zentralfinanz eG is also responsible for REWE Markt GmbH? I have not been able to verify this so far.

Mar 26 '23 02:03 WebworkrNet