
Exclude company's own projects filter

Open vallode opened this issue 3 years ago • 13 comments

I think it would be pertinent to include a filter that excludes contributions to the company's own open source projects. As much as I enjoy seeing the numbers, it would be amazing to see which companies contribute the most outside of their own circle of influence. This could shift the rankings somewhat and showcase a bit more of the open source community on the top lists.

Throwing this out there as an idea, absolutely understand if this is not relevant to this project but maybe something worth thinking about!

vallode avatar Mar 27 '21 15:03 vallode

The idea is pretty interesting.

A number of basic questions arise before this idea can be implemented.

Main question:

How to identify the repositories in relation to the company (a company's own repository or not)?

One option is to use information about the organization (see OrgId). However, this requires a mapping between each company and the organizations that belong to it. Such a list would have to be created by hand for each company and constantly kept up to date. And even then, there is no certainty that this criterion is 100% valid.
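A minimal sketch of what such a hand-maintained mapping could look like. This is not OSCI code: the `COMPANY_ORGS` table, its entries, and the `is_own_repo` helper are hypothetical, and it matches on the organization login taken from an `owner/name` string rather than on a numeric OrgId.

```python
# Hypothetical, hand-maintained mapping: company name -> GitHub org logins.
# The entries are illustrative placeholders, not real OSCI data.
COMPANY_ORGS = {
    "ACME": {"acme", "acme-labs"},
    "Globex": {"globex"},
}


def is_own_repo(company: str, repo_full_name: str) -> bool:
    """Return True if 'owner/name' belongs to an org mapped to the company."""
    owner = repo_full_name.split("/", 1)[0].lower()
    # Unknown companies have no mapped orgs, so nothing counts as "own".
    return owner in COMPANY_ORGS.get(company, set())
```

The weakness described above shows up directly: every new organization a company creates needs a new entry in the table, and a project run outside the mapped organizations is classified as external.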

Do you have any ideas on this?

vlad-isayko avatar Mar 29 '21 10:03 vlad-isayko

How to identify the repositories in relation to the company (a company's own repository or not)?

  1. If the source repo belongs to the company. Maintaining official repo status is no different than maintaining an official list of domains.
  2. If all commits and merge requests are from the company

abitrolly avatar Mar 29 '21 12:03 abitrolly

1. If the source repo belongs to the company. Maintaining official repo status is no different than maintaining an official list of domains.

I agree that at first glance, maintaining a list of repositories does not differ much from maintaining a list of domains. But there are significantly more repositories than companies, and the list of repositories changes much more often than the list of domains.

2. If all commits and merge requests are from the company

I didn't quite understand what it meant. Could you explain a little more broadly?

Are you suggesting that a company's own repositories are those in which all commits come from the company? Is this a necessary and/or sufficient condition?

vlad-isayko avatar Mar 29 '21 12:03 vlad-isayko

But there are significantly more repositories than companies, and the list of repositories changes much more often than the list of domains.

It could turn out that the number of non-owned repositories that companies commit to is insignificant.

I didn't quite understand what it meant. Could you explain a little more broadly?

A repo where all commits are from corporate emails is definitely owned by the company. That's a sufficient condition for a filter. )

abitrolly avatar Mar 29 '21 13:03 abitrolly

Sorry for taking a while to respond, I simply don't have enough information on the workflow that OSCI uses (my bad) to elaborate further than what @abitrolly said. I would only ever consider a contribution to be in the company's full self-interest if the contribution landed on a repository that was owned by the company itself.

Is this a trivial task? Very unlikely, I think a "repo where all commits are from corporate emails" is too specific of a scenario and wouldn't affect the dataset very much (especially for the top dogs which is where my interest lies the most)...

We'd need a way to filter out contributions made from the organisation's own authors into the organisation's own repositories.
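A rough sketch of that filter, assuming the commit data can be reduced to `(company, repo_owner)` pairs and that a company-to-organizations mapping already exists. Both the input shape and the `external_commit_counts` name are assumptions for illustration, not part of OSCI.

```python
from collections import Counter


def external_commit_counts(commits, company_orgs):
    """Count commits per company, skipping commits into the company's own orgs.

    commits: iterable of (company, repo_owner) pairs (assumed input shape)
    company_orgs: dict mapping company -> set of owner logins it controls
    """
    counts = Counter()
    for company, repo_owner in commits:
        # Keep only commits that land outside the author's own company orgs.
        if repo_owner not in company_orgs.get(company, set()):
            counts[company] += 1
    return counts
```

A ranking built from these counts would only reflect work done outside each company's own repositories, which is the shift in the top lists the original request was after.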

vallode avatar Mar 31 '21 14:03 vallode

We'd need a way to filter out contributions made from the organisation's own authors into the organisation's own repositories.

I agree. That would be sufficient.

abitrolly avatar Mar 31 '21 14:03 abitrolly

... or you could start small and at least list the number of repositories that each organization's contributors contribute to. If most of an organization's contributors contribute to a single repository or only a few, this is a good indication of where their efforts go. :)
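As a sketch of this "start small" idea, the number of distinct repositories per organization could be computed like this. The `(company, author, repo)` tuple shape and the `repo_spread` name are assumptions made for the example.

```python
from collections import defaultdict


def repo_spread(contributions):
    """Return {company: number of distinct repos its contributors touched}.

    contributions: iterable of (company, author, repo) tuples (assumed shape)
    """
    repos = defaultdict(set)
    for company, _author, repo in contributions:
        repos[company].add(repo)
    return {company: len(names) for company, names in repos.items()}
```

This needs no definition of a "company repo" at all, so it sidesteps the classification problem, at the cost of only hinting at how concentrated the contributions are.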

dzintars avatar May 01 '21 17:05 dzintars

I suggest the way to move forward on this issue is:

  1. pick a company at random
  2. look at the list of repos which OSCI is showing their employees contribute to
  3. try to define some logic (algorithm) defining which of these repos are "company repos" vs "non-company repos". As part of this task you will have to define what is a "company repo", that in itself will be challenging.
  4. Now pick another company at random and test the logic you came up with, refine it.
  5. And so on with additional companies until you have logic which appears to manage the general case.

It's important to understand that a perfect algorithm for this does not exist, just different directions to go, each with pros and cons. An empirical approach (if that's the right term) like I suggest above is necessary rather than defining a theoretical approach. Your goal has to be to iterate until you reach a logic which is "good enough" to show a general picture of activity across organizations. This was our experience defining the logic for OSCI itself. What looks easy at a high level gets very challenging when one tries to define the detail and algorithmize it.
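One way to make this iteration concrete is to score each candidate rule against a small set of repositories labeled by hand. The sketch below is only an illustration: the `accuracy` helper, the `owner_in_company_orgs` rule, and the sample data are all hypothetical.

```python
def accuracy(heuristic, labeled_repos):
    """Share of hand-labeled repos that the heuristic classifies correctly.

    labeled_repos: list of (repo_info, is_company_repo) pairs labeled by hand
    """
    if not labeled_repos:
        raise ValueError("need at least one labeled repo")
    hits = sum(1 for info, label in labeled_repos if heuristic(info) == label)
    return hits / len(labeled_repos)


def owner_in_company_orgs(info):
    """Candidate rule: a repo is a company repo if its owner is a company org."""
    return info["owner"] in info["company_orgs"]


# Hypothetical labels for one company; the last repo is run by the company
# but lives outside its GitHub organization, so the rule gets it wrong.
sample = [
    ({"owner": "acme", "company_orgs": {"acme"}}, True),
    ({"owner": "kubernetes", "company_orgs": {"acme"}}, False),
    ({"owner": "project-x", "company_orgs": {"acme"}}, True),
]
score = accuracy(owner_in_company_orgs, sample)
```

Repeating this with labels from a second and third company shows whether a refinement helps in general or only fits the first company.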

patrickstephens2 avatar Sep 27 '21 14:09 patrickstephens2

As part of this task you will have to define what is a "company repo", that in itself will be challenging.

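def outside_contributions(total_committed, employees_committed,
                          contractors_committed, robots_committed):
    """Return True if any commits came from outside the company."""
    # Whatever is left after removing employees, contractors, and bots
    # was committed by outsiders.
    outside = (total_committed - employees_committed
               - contractors_committed - robots_committed)
    return outside > 0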

abitrolly avatar Sep 28 '21 07:09 abitrolly

Let's take company ACME. It creates and runs project X. This project is not under the ACME org on GitHub, so it cannot be connected to the company directly by programmatic means. The project has 100 contributors: 99 who work at ACME and 1 who is outside (perhaps an ex-employee who worked on this before leaving the company and continued after... I have seen such examples). Is this a company project?
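One possible answer is to look at the share of contributors rather than requiring all of them to be from the company. The sketch below assumes each contributor has already been mapped to a company (or `None`). The 0.9 threshold is an arbitrary placeholder that would need tuning, and the function names are hypothetical.

```python
def company_share(contributor_companies, company):
    """Fraction of contributors affiliated with `company`."""
    if not contributor_companies:
        return 0.0
    matching = sum(1 for c in contributor_companies if c == company)
    return matching / len(contributor_companies)


def looks_company_run(contributor_companies, company, threshold=0.9):
    """Treat a project as company-run if the share reaches the threshold."""
    return company_share(contributor_companies, company) >= threshold


# The ACME example: 99 employees and 1 outside contributor.
project_x = ["ACME"] * 99 + [None]
is_acme_project = looks_company_run(project_x, "ACME")
```

With these numbers the share is 99/100, so project X counts as an ACME project under a 0.9 threshold, whereas an "all commits are from the company" rule would classify it as external.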

patrickstephens2 avatar Sep 29 '21 11:09 patrickstephens2

What could be the simplest, even if not the most accurate, insight? While perfect stats sound sweet, most likely we will not get there right away. So what could be done right now to make the index 1% better? How about CLAs? Could those be considered an indication? If a repo requires contributors to submit a CLA, could it be considered org X's repository? Could a manual PR process be implemented to meta-tag the repos? For example, the community could submit PRs to this repository to add indexed repos to one category or the other, and even augment the metadata. While a fully automated process is neat, I think we are mostly interested in something like 2-5K public repositories, and those could definitely be meta-tagged manually over time.
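A sketch of how community-submitted tags could be combined with an automated guess. The JSON layout, the `example-org/project-x` entry, and the function names are hypothetical; no such file exists in the repository as far as this thread shows.

```python
import json

# Hypothetical tag file that contributors would edit through PRs.
TAGS_JSON = """
{
  "example-org/project-x": {"company": "ACME", "source": "manual"}
}
"""


def load_tags(text):
    """Parse the tag file into {repo_full_name: metadata}."""
    return json.loads(text)


def resolve_company(repo, tags, auto_guess=None):
    """Prefer a manual tag; otherwise fall back to the automated guess."""
    entry = tags.get(repo)
    return entry["company"] if entry else auto_guess
```

Letting manual tags override the automated result means a few thousand high-traffic repositories can be corrected by hand while everything else still gets a best-effort answer.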

dzintars avatar Sep 29 '21 11:09 dzintars

Maybe the priority should be to publish the data that would make different kinds of filters possible. Right now the site https://opensourceindex.io/ just links to this repo, with no diagrams of the DB schema and no information on whether the BigQuery datasets are public.

abitrolly avatar Sep 29 '21 12:09 abitrolly

At our company, we internally gather public data on GitHub activity and contributions from employees who choose to opt in, with the goal of identifying trends in contributions to projects outside of Microsoft's governance. Our data is skewed differently than this index, however, since we have an internal indicator of who our employees are on GitHub once they opt in to tell us, vs having to determine it from profiles.

Our numbers for December 2021, for example, are significantly higher for 'total community' and other figures as a result of so many people being e-mail private on GitHub... but of that specific month's contributions, I tried pulling equivalent data, and around a third of our actively-open-contributing employees contributed to projects not governed by our company, yielding a number higher than the index but not majorly larger.

While the data is interesting, our key reason for differentiating "is it controlled by Microsoft or not" is to help encourage our employees' participation in communities to become eligible in our FOSS Fund and to evolve the culture.

I agree slicing off a company's controlled projects is an interesting pivot, but a murky gray area, especially given foundations and cross-industry collaborations and so on.

jeffwilcox avatar Feb 25 '22 12:02 jeffwilcox