What I see in these graphs of Github contribution

Context: Last week I shared a few graphs (1, 2, 3, 4) looking at data from our repositories on Github, extracted using this Gitribution app thing, as part of our work to dashboard contributor numbers for the Mozilla Foundation.

I didn’t comment on the graphs at the time because I wanted time for others to look at them without my opinions skewing what they might see. This follow up post is a walk-through of some things I see in the graphs/data.

The real value in looking at data is finding ways to make things better by challenging ourselves, and being honest about what the numbers show, so this will be as much about questions as answers…

Also, publishing this last week flagged up some missing repositories and identified some other members of staff, so these graphs are based on the latest version of the data (there was no impact on shapes, but some numbers will be different).

What time of day do people contribute (UTC)?

By Hour of Day

Our paid staff who are committing code are mostly in US/Canadian timezones and it makes sense that most of their commits are during these hours (graphed by UTC). But what caught my attention here is that the volunteer contribution times follow the same shape.

Questions to ask:

  • Do volunteer contributions follow the same shape because contributing code has a dependency on being able to talk in real time with staff? For example in IRC. If so, is this a bottleneck for contributing code?
  • If not, what is creating this shape for volunteer contributors? Perhaps it’s biased to timezones where more people are interested in the things we are building, and potentially biased by language? But looking at support for Maker Party and other activities there is a global audience for our tools.
  • What does a code contribution pathway look like for people in the 0300-1300UTC times? Is there anything we can do to make things easier or more appealing?
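For anyone wanting to recreate a cut like the hour-of-day graph, here is a minimal Python sketch of the bucketing. The `(timestamp, is_staff)` input shape and the function name are assumptions for illustration only; they are not Gitribution's schema or the Tableau workbook's logic.

```python
from collections import Counter
from datetime import datetime, timezone


def commits_by_utc_hour(events):
    """Count events per UTC hour, split into staff and volunteers.

    `events` is an iterable of (iso_timestamp, is_staff) pairs.
    This shape is hypothetical, chosen to keep the sketch short.
    """
    counts = {"staff": Counter(), "volunteer": Counter()}
    for iso_timestamp, is_staff in events:
        # Timestamps such as 2013-06-01T17:45:00Z end in "Z"; swap it for
        # an explicit offset so older Python versions can parse it too.
        moment = datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))
        hour = moment.astimezone(timezone.utc).hour
        counts["staff" if is_staff else "volunteer"][hour] += 1
    return counts


sample = [
    ("2013-06-01T17:45:00Z", True),
    ("2013-06-01T17:05:00Z", False),
    ("2013-06-02T04:30:00+02:00", False),  # local time, 02:30 in UTC
]
hourly = commits_by_utc_hour(sample)
```

Converting everything to UTC before bucketing is what makes the staff and volunteer curves comparable on one axis.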

The shape of volunteer contributions

Shape

The shape of this graph is pretty typical for any kind of volunteering or activity involving human interactions. It’s close to a power law graph with a long tail.

If you’ve not looked at a data set like this before, don’t panic that so many people only make a single contribution. At the same time, don’t treat the fact that this is typical as an excuse not to ask questions about how we can be better.
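As a rough illustration of how a long-tail shape like this falls out of raw activity data, the sketch below counts contributions per person and then counts how many people made exactly n contributions. The usernames are made up and the function name is hypothetical.

```python
from collections import Counter


def contribution_distribution(usernames):
    """Summarize a list with one username per contribution."""
    per_person = Counter(usernames)  # username -> number of contributions
    # Sorted busiest-first, these counts trace the long-tail curve.
    ranked = sorted(per_person.values(), reverse=True)
    # How many people made exactly n contributions.
    people_per_count = Counter(per_person.values())
    return ranked, people_per_count


# Invented sample data: one entry per contribution.
sample = ["ana", "ana", "ana", "ana", "ben", "ben", "cal", "dee", "eli"]
ranked, people_per_count = contribution_distribution(sample)
```

In this toy sample three of the five people contribute exactly once, which is the same pattern as the long tail on the right of the graph.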

Lots of people want to get involved in volunteering projects but often their good intentions don’t align with their actual available free time. I say this as someone who signs up for more things than fit into my available hours for personal projects.

The two questions I want to ask of this graph are:

  1. Where could our efforts to support contributors best influence the overall shape?
  2. What does this look like at 10 x scale?

So, starting with where we could influence shape… my opinion (no data here) says to think about people in this range.

Shape Highlight

To the left of this highlighted area people are already making code contributions over and above even many staff. Shower them in endless gratitude! But I don’t think they need practical help from us.

To the right of this highlighted area is the natural long tail. Supporting that bigger group of people for single-touch interactions is about clear documentation and easy to follow processes.

But I think the group of people roughly highlighted in that graph are people we can reach out to. These people potentially have capacity to do more. We should find out what they are interested in, what they want to get out of contribution and build relationships with them. In practical terms, we have finite time to invest in direct relationships with contributors. I think this is an effective place to invest some of that time.

I think the second question is more challenging. What does this look like at 10 x scale?

In 2013, ~50 people made a one-time contribution.

  • What do we need in place for 500 people to make a one-time code contribution?
  • Do we have 500 suitable ‘first’ bugs for 2014?
  • Is the amount of setup work required to contribute to our tools appropriate for people making a single contribution?
  • If not, is that a blocker to growing contributor numbers?

In 2013, there were ~1,500 code commits by volunteers.

  • What do we need in place for 15,000 activities on top of planned staff activity?
  • How does this much activity align towards a common product roadmap?
  • How is it scheduled, allocated, reviewed and shipped?

When planning to work with 10 x contributor numbers, possibly the biggest shift to consider is the ratio of staff to volunteers:

ContributorRatio

  • How does this impact the time allocated for code reviews?
  • How do we write bugs?
  • How do we prioritize bugs? Etc.
  • Even, what does an IRC channel or a dev mailing list look like after this change?

Other questions to ask:

  • What do we think is the current ‘ceiling’ on our contributor numbers for people writing code?
    • Is it the number of developers who know about our tools and want to help? (i.e. a ‘marketing’ challenge to inspire more people)
    • Is it the amount of suitable work ready and available for people who want to help? (are we losing people who want to help because it’s too hard to get involved?)
    • Both? With any bias?

What do you think?

I’m only one set of eyes on this, so please challenge my observations and feel free to build on this too.

Also, as the data in here is publicly accessible already I think I can publish this Tableau view as an interactive tool you can play with, but I need to check the terms first.

Contribution Graphs part 4: Contributions by Contributors over time

I’m posting a quick series of these without much comment on my part as I’d love to know what you see in each of them.

This is looking at activity in Github (commits and issues), for the repositories listed here. It’s an initial dive into the data, so don’t be afraid to ask questions of it, or request other cuts of this. In the not so distant future, we’ll be able to look at this kind of data across our combined contribution activities, so this is a bit of a taster.

Click for the full-size images.

Contributions by Contributors over time

Last but not least for today, I think there are some stories in this one…

Contributions by Contributors over Time

Is anything here a surprise? What do you see in this?

Contribution Graphs part 3: Distribution of contributions

Distribution of contributions (excluding staff work)

Here are a couple of ways of visualizing this same data.

Distribution 2

Distribution 1

Is anything here a surprise? What do you see in this?

Contribution Graphs part 2: By hour of the day

By hour of the day

By hour of the day

Is anything here a surprise? What do you see in this?

Contribution Graphs part 1: Contributions over time

Contributions over time

1 combined Over time

Broken down by teams

2 By team

Broken down further by repository

3 By Repo

Is anything here a surprise? What do you see in this?

Is being a member of the mozilla ‘organization’ on github a good proxy indicator of being staff?

Following on from the post about Gitribution, these are my notes around my initial exploration of the data extracted from Github.

One of the challenges of counting volunteer contributors to Mozilla is working out who is a volunteer and who is paid-staff. The concept of a volunteer contributor in itself is full of complications, as paid staff will volunteer their free time on other projects they care about, and contributors become employees, or employees will work using their personal email addresses and so on. The fidelity of tracking that would be required to *perfectly* identify when someone does something on a ‘voluntary’ basis would not be proportionate to the impact this would have on the usefulness of the final reporting. So perfect tracking is not the goal here.

My first pass at filtering out staff from contributor counts on github was to look at whether someone is a member of the mozilla organization on github. I thought this would be a good proxy for ‘staff’, and doing this gave us this breakdown:

Without manually checking usernames, this is how the contribution counts are split between staff and contributors

However, in this non-staff contributor segment of the data, there are a few names I know are definitely staff, and as I don’t know all of Mozilla’s staff I assume others in here are staff too.

Some names here are definitely in the wrong buckets, with significant contribution numbers linked to them

So, it’s safe to say that the inverse of our question is false. That is: not being a member of the org on github is not a good enough proxy to say someone is not a paid member of staff.

This is less critical when counting the number of people. For example this is the split of volunteers to staff using this github membership status as the proxy measure:

There might be ~10 people who technically need to move from the blue to the orange bar, but that’s not important if the aim is growing the blue bar 10x without much change to the orange bar.

But if we want to analyze contribution activity (we do!) I need to manually (with a little automation in Tableau) check these github accounts, and add those who are staff to an extra list within Gitribution to cross check when saving the data:

4 Manually Check
These are the most significant accounts to check for people who are staff

Getting back to the original question… Is being a member of the mozilla ‘organization’ on github a good proxy indicator of being staff? 

The quick no-data-query-required test is to click through to a few profiles and look for examples of people who are not staff: https://github.com/orgs/mozilla/members. I found a few on the first page alone. But as stated earlier, it can also be hard to tell! Mozillians are a connected bunch who often work on other projects too. However, I found enough people in that list employed at other organizations to assume they are not all staff (though in some cases they used to be staff but are not now).

So to answer the question in its strictest sense, the answer is no. Being a member of the github organisation is not a certain indicator of being a paid member of staff.

But our context is more specific than this, so I need to refine the question: Is being a member of the mozilla ‘organization’ on github a good proxy indicator of being staff with regards to people actively contributing to Foundation projects on Github?

For this we go back to the data to check the most significant buckets of activity…

These are the priority accounts to manually check as they could skew the overall stats

I can manually check this list of usernames making up the biggest chunks of contribution activity from those marked as ‘staff’.

There are a couple of people in here who are not current staff (and some former staff with less than 100 activities), but this would not skew the data enough that we should need to maintain yet another list of exceptions. There is also a further ‘grey area’ in the overlap between Mozilla contribution, and CDOT-supported/funded contribution to Mozilla.

I think for now at least, I will leave this list as it is, and say that the check against membership of the github organization is a meaningful filter, but we also need to maintain an extra list of ‘further people who are staff’.
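The resulting rule (organization membership plus a hand-maintained list of further staff) can be sketched as below. This is an illustration under assumptions, not Gitribution's actual code: the function names, the sample usernames and the choice to compare names case-insensitively are all hypothetical.

```python
def is_staff(username, org_members, extra_staff):
    """Staff if in the github organization OR on the manual extras list."""
    name = username.lower()  # assumption: compare usernames case-insensitively
    return name in org_members or name in extra_staff


def split_contributions(contributions, org_members, extra_staff):
    """Total contribution counts per bucket from a {username: count} mapping."""
    totals = {"staff": 0, "volunteer": 0}
    for username, count in contributions.items():
        bucket = "staff" if is_staff(username, org_members, extra_staff) else "volunteer"
        totals[bucket] += count
    return totals


# Invented sample data for illustration only.
org_members = {"org-member-1"}
extra_staff = {"staff-not-in-org"}
contributions = {"org-member-1": 400, "Staff-Not-In-Org": 900, "volunteer-1": 12}

before = split_contributions(contributions, org_members, set())
after = split_contributions(contributions, org_members, extra_staff)
```

With these invented numbers, one busy misclassified account moves 900 contributions between buckets while the head count changes by only one person, which is why the extra list matters more for activity analysis than for counting people.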

So, I made these amends to Gitribution, rebuilt the database and ran the queries again, which gets us to here:

Comparing contributor numbers of staff to volunteers has barely changed, but the contribution activity is significantly different and will make our next analysis phase more accurate.

With the data in reasonable shape, we can do some more interesting analysis, which we’ll save for another post.

Gitribution

Click to embiggen. How well does being a member of a github organisation flag someone as being staff?

Over the last week or so I’ve been building a thing: Gitribution. It’s an attempt to understand contributions to Mozilla Foundation work that happen on Github. It’s not perfect yet, but it’s in a state to get feedback on now.

Why did I build this?

For these reasons (in this order):

  1. Counting: To extract counts of contributor numbers from Github across Foundation projects on an automated ongoing basis
  2. Testing: To demo the API format we need for other sources of data to power our interim contributor dashboard
  3. Learning: To learn a bit about node.js so I can support metrics work on other projects more directly when it’s helpful to (i.e. submitting pull-requests rather than just opening bugs)

1. Counting

The data in this tool is all public data from the Github API, but it’s been restructured so it can be queried in ways that answer questions specific to our goals this year, and has some additional categorization of repositories to query against individual teams. The Github API on its own couldn’t answer our questions directly.

This also gives me data in a format that can be explored visually in Tableau (I’ll share this in a follow up blog post). We can now count Github contributors, and also analyze contributions.

2. Testing

Part of our interim dashboard plans include a standard format for reporting on numbers of active and new contributors for a given activity. Building this tool was a way to test if that format makes sense. The output is an API that you can ping with a date and see:

  1. The number of unique usernames to contribute in the 12 months prior (excluding those users who are members of the Github organization that owns the repositories, i.e. Mozilla or openNews)
  2. The number of those who only contributed in the 7 days prior (i.e. new contributors)
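The two numbers above can be sketched in a few lines of Python. This is an illustration of the counting rule rather than the API's implementation: the function name, the input shape and the reading of "new" as "every contribution in the 12-month window falls within the last 7 days" are assumptions.

```python
from datetime import date, timedelta


def contributor_counts(contributions, as_of, staff=frozenset()):
    """Count active and new contributors for a reporting date.

    `contributions` is an iterable of (username, date) pairs and `staff`
    is the set of usernames to exclude. Both shapes are hypothetical.
    """
    year_start = as_of - timedelta(days=365)
    week_start = as_of - timedelta(days=7)
    first_seen = {}  # username -> earliest contribution inside the window
    for username, day in contributions:
        if username in staff or not (year_start < day <= as_of):
            continue
        if username not in first_seen or day < first_seen[username]:
            first_seen[username] = day
    active = len(first_seen)
    # New: nothing in the window before the final 7 days.
    new = sum(1 for day in first_seen.values() if day > week_start)
    return {"active": active, "new": new}


# Invented sample data.
sample = [
    ("ana", date(2013, 9, 1)),
    ("ana", date(2014, 1, 28)),
    ("ben", date(2014, 1, 27)),
    ("cal", date(2012, 6, 1)),       # outside the 12-month window
    ("staffer", date(2014, 1, 30)),  # excluded as staff
]
counts = contributor_counts(sample, as_of=date(2014, 1, 31), staff={"staffer"})
```

One consequence of this reading is that someone returning after more than a year away is counted as new again; whether that is desirable depends on how the dashboard defines a new contributor.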

You can test the API here (change the date, or the team names – currently webmaker, openbadge, openNews)

We can use this in the dashboard soon.

3. Learning

I know a lot more about node.js than I did last week. So that’s something :)

I started out writing this as though it was a Python app using JavaScript syntax before grasping the full implications of node’s non-blocking model.

I descended into what I later found out is called callback hell and felt much better when I learned that callback hell is a shared experience!

I tried an extreme escape from callback hell by re-building the app in a fire-and-forget process that kicked off several thousand pings to the Github API and paid no attention to whether or not they succeeded (clearly not a winning solution).

And I’ve ended up with something that isn’t too hellish but uses callbacks to manage the process flow. The current process is pretty linear, so I was able to sense check what it’s doing but it also works mostly on one task at a time so isn’t getting the potential value out of node’s non-blocking model.
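A middle ground between fire-and-forget and one task at a time is to run a bounded number of requests concurrently and record each outcome. Gitribution itself is node.js; the sketch below is a Python `asyncio` analogue of that idea, not the app's code, and `fetch_all`, `fake_fetch` and the repository names are hypothetical stand-ins for real API calls.

```python
import asyncio


async def fetch_all(items, fetch, limit=5):
    """Run `fetch` for every item, at most `limit` at a time.

    Returns (item, result, error) tuples in input order so that failures
    are recorded rather than silently dropped.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:  # wait here while `limit` calls are in flight
            try:
                return item, await fetch(item), None
            except Exception as error:
                return item, None, error

    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_fetch(repo):
    """Stand-in for a network call; fails for one made-up repository."""
    await asyncio.sleep(0.01)
    if repo == "broken-repo":
        raise RuntimeError("simulated API failure")
    return f"{repo}: ok"


results = asyncio.run(
    fetch_all(["repo-a", "broken-repo", "repo-b"], fake_fetch, limit=2)
)
failed = [item for item, _, error in results if error is not None]
```

The limit keeps the number of simultaneous requests predictable, and the list of failures gives something concrete to retry or report.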

Next steps

  • Tweaks to the categorization of ‘members=staff’
    • See the attached image of contributions by username. There are some members of staff with many contributions who are not members of Mozilla on Github. This is not material when counting number of contributors in relation to targets, but when we analyze contribution activity those users with a lot of contributions skew the data significantly.
  • Check and correct the list of repos assigned to each team
    • Currently a best guess based on my limited knowledge and some time trawling through all the repos on the main Mozilla Github page
  • Work out how to use this with Science Lab projects
    • As Software Carpentry use Github as part of their training (which I love), the data in their account doesn’t represent the same kinds of activities as in the other repos. I need to think about this.
  • Pick the brains of my knowledgeable colleagues and get a review of this code

What else is this good for?