3 simple things GitHub can do for science


The topic “GitHub for Science” has been explored quite a few times before (1, 2, 3, 4), and with good reason: it is exciting to envision what breakthroughs in scientific collaboration could come from GitHub-backed explorations, with substantial capital to invest and a formidable team to execute.

With GitHub’s founder saying that:

In science, I think there’s huge changes that can be made there as well. — Tom Preston-Werner

And with their recent hire of Arfon Smith, known for co-founding Zooniverse, the largest citizen science hub with just under 900,000 users, it’s obvious that GitHub has big plans for science. But are there things they could be doing now, without much hassle? Yes. In fact, here are three:

1. Assign DOIs to scientific code

It’s easy to use URLs for citing resources, but URLs are very ephemeral. One day that wonderful piece of code is there, and the next day it’s gone. Disappearing code is bad if it happens to regular old code, but can be downright tragic if science software is involved. A recent paper in Current Biology has shown that availability of data (which includes software) drops drastically after a few years, something we’ve sensed for a while, but it hurts even more to have it in black and white. You want that piece of code one undergrad wrote 10 years ago? Good luck, pal.

The Availability of Research Data Declines Rapidly with Article Age

What does this have to do with GitHub? Well, it’s in a perfect position to battle source-code rot. As it is, you can delete a repository on GitHub at any time and the code will be irreversibly lost. This is very bad if the repository contains scientific software. To preserve their code, researchers have started putting it on FigShare, which as a bonus also gives them a DOI that can be used for citations, bibliographies and impact tracking. But FigShare, as great as it is, is no match for GitHub when it comes to reading, discussing and using code.

Alright, so what could GitHub do? They could add an option to enable Science Mode, which would: a) mint a DOI for a selected commit and b) permanently disable repository deletion.

Science Mode

2. Make repository traffic stats public

OK, so now we have a DOI that’s citable and allows easier impact tracking, and a guarantee that code will not *poof* go missing overnight and severely hamper attempts to reproduce that crucial paper’s findings. We’re on the right path, but we can do more!
There are already ways of measuring the impact a repository has had (the number of stars, forks, tweets, etc.), and ImpactStory already tracks those.

But GitHub knows much more than that! They recently launched GitHub Traffic analytics, which shows you how many page views your repository has had (a very good measure of reach). I imagine they also know how many times your repository has been cloned (a very good measure of use). Neither of these is publicly accessible: you can only see Traffic analytics for repositories you own, and cloning statistics (if they exist) are still lurking in a Redis database somewhere. In terms of figuring out how important or impactful a piece of code is, more(data) == better!
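To make the gap concrete, here is a rough sketch of what an outside metrics tool can read today: the star and fork counts that GitHub's public repository endpoint returns. The helper names `fetch_repo` and `public_metrics` are mine, made up for illustration, and this is not how ImpactStory or anyone else actually does it. Notice there is simply no field to ask for views or clones.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_repo(owner, repo):
    """Fetch the public JSON description of a repository (needs network access)."""
    url = f"{API_ROOT}/repos/{owner}/{repo}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def public_metrics(repo_json):
    """Pick out the impact-related counts that are already public.

    Hypothetical helper: it only reshapes fields from the repository payload.
    """
    return {
        "name": repo_json["full_name"],
        "stars": repo_json["stargazers_count"],
        "forks": repo_json["forks_count"],
    }
```

Calling `public_metrics(fetch_repo("some-owner", "some-repo"))` with a real owner and repository name gives you a small dictionary of name, stars and forks. Unauthenticated requests are rate limited, so a real tool would want to authenticate and cache.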

When you enable Science Mode, both traffic and clone analytics for that repository could become public and accessible through the API.

This would enable alternative metrics tools like ImpactStory, AltMetric and ORCID to get a better idea about the use and reach of a piece of code. Talking about ORCID brings me to the last suggestion.

3. Integrate with ORCID

If GitHub wants to be serious about science, it needs to be able to connect the dots and answer questions like: Which prominent researcher is this prolific GitHub user? Half a million researchers (and growing very quickly) already use a non-proprietary, non-profit unique identifier named ORCID (Open Researcher and Contributor ID). Integrating with it would allow GitHub to uniquely identify its users and connect them to individual researchers. This opens up a lot of options for both alternative metrics and discovery, as external services could also use this information in potentially very surprising and rewarding ways.

Further down the line this would enable GitHub to become much more “social” for scientists. When a researcher you cited several times in your papers creates a GitHub account, connects it to ORCID and uploads a novel piece of software that potentially saves you days of lab work, GitHub could put that in your feed. When you are reading remarkable code on GitHub and it belongs to a researcher, their papers, alternative metrics or bibliography could be just a click away. When you publish your in-depth analysis in Nature with tags “doge, meme, culture” with lots of code on GitHub, and someone else publishes an in-depth analysis in Science with tags “y u no, meme, society” with lots of code, GitHub could recommend you follow each other (on the off chance you great minds don’t know each other). But I digress.

GitHub could add ORCID to the user profile settings and set itself up for a lot of beautiful sciency things in the future, but also immediately benefit from uniquely identifying researchers.
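A profile field like that would want at least a sanity check before storing anything. ORCID iDs are 16 characters, usually written in four hyphenated groups, and the last character is a check character computed with ISO 7064 MOD 11-2. Here is a minimal sketch of that check. The function names are hypothetical and this says nothing about how GitHub would actually implement it. It also only proves an iD is well formed, not that it belongs to the person typing it in (that would take ORCID's own sign-in flow).

```python
def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if the string is a well-formed ORCID iD with a correct checksum."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    base, check = compact[:15], compact[15]
    # isascii() guards against non-ASCII digit characters that int() would accept
    if not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_character(base) == check


# 0000-0002-1825-0097 is the iD ORCID uses for its fictitious sample researcher.
print(is_valid_orcid("0000-0002-1825-0097"))  # True
print(is_valid_orcid("0000-0002-1825-0098"))  # False, wrong check character
```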


That’s it. Three things GitHub could easily do to fortify its position in science:
1. Add Science Mode with ability to mint DOIs and prevent repository deletion,
2. Make traffic and clone stats public and accessible through the API,
3. Integrate with ORCID.

It will be very interesting to see what GitHubbers actually have in store for us in terms of science. Fingers crossed for mind == blown.


Do you agree, disagree, have comments, questions, suggestions, ideas? Tweet away at @juretriglav.