Thoughts on reproducibility of open scientific software

Years ago (yes, it’s been years, don’t remind me), I wrote an application for modeling the laser ablation ICP-MS mapping process. This is a thing where you make microscopic explosions with a laser, and the little bits of stuff get blown into a blazing torch that is as hot as the surface of the sun. The torch breaks the little bits into bits tinier still, and these atom-sized bits (ions, really) get their mass measured in the mass spectrometer. If you do this right thousands of times, moving in steps of ten-thousandths of a centimeter and blasting holes in your sample as you go, you can make pretty elemental maps:

Above is a 2D map of a rat’s hippocampus, where hotter colors represent a higher concentration of zinc. Beautiful, no?

But I digress. Back to the software! The laser and the mass spectrometer are very complex pieces of machinery, with many, many parameters to fiddle with. Getting these parameters just right is mostly trial and error, which is expensive and time-consuming, given that a single map similar to the one above can take a day or more to complete. We tried to make it less fiddly by, first, coming up with a (very) simple mathematical model of the processes involved and, second, developing a GUI application to make this model more intuitive and usable.

It’s not a user experience marvel, but it worked! This was only 4 years ago. Where is this software today? Sadly, it’s not as openly available or reproducible as I would like. We do send binaries to anyone who requests them, free of charge, but that introduces a completely needless barrier for something that should have been openly available from the start, source code included. I confess, I was young and foolish, and I made a few wrong decisions back then that severely limit its reusability today. You don’t have to repeat my mistakes, so read on.

First, it’s written for a closed platform, i.e. using C# and .NET on Windows. Ugh.

Scientific software should always be written for an open platform

Writing your open scientific software for any closed platform is a problem, of course, because you’re effectively excluding a very large portion of scientists who might want to run it but are not on your chosen platform. Write it in Python, Ruby, C, Go, JavaScript, Brainfuck (ok, maybe not), but make sure it can run on Linux, Windows and OS X without any hassle. If you write your software so that it runs on Linux, you’re pretty much already covered, thanks to the great free virtual machine tools available.

Scientific software should be open source

If the purpose of your scientific software is to help the cause of your paper or discovery, how can it be anything other than open source? If it’s a closed, opaque executable binary, how can it be appropriately scrutinized or appreciated? Packaging your paper with a binary executable, without any source code, is akin to drawing a black square over your Methods section and writing “It works, trust me.” instead.

Furthermore, you would be surprised how many people are willing to help you develop your software once it is open source under a permissive license (e.g. the MIT license). The power of social coding is coming to science too, or rather, it’s already here.

Scientific software should follow good software development practices

This is a massive subtopic, but it is true that a lot of software in academia is written poorly, my software included. I don’t mean the “it doesn’t work” kind of poorly. I mean that it’s missing automated tests, doesn’t follow software design principles, isn’t version controlled, isn’t documented well, isn’t maintainable because of 5000+ line classes, etc. It’s not anyone’s fault that this is the case, as scientists mostly don’t have a strong developer background, nor did they need one in the past. But the tide of scientific software is definitely approaching, and there’s a lot we can do to educate scientists about good practices in software development now, before it’s too late and we’re all swimming in a sea of unmaintainable code. One of the best things to happen in this area in a long while is the Software Carpentry project, which aims to teach scientists about much of what I’ve just talked about. Software Carpentry, if you’re reading this: You guys are freakin’ awesome!

Making it right

Knowing what I know now, I would have written my software differently. But given that it is not going to be rewritten anytime soon, what can we do now to make it a better member of the community?

First, let’s make it open source under a permissive license: Boom, it is done.

Second, we need to make it run anywhere. This is a tough one, since our software is written for Windows. How do you run Windows apps on another platform? Easy, use Wine. “Oh,” I hear you say, “but I need to install and compile Wine, and what if I change my operating system from OS X to Ubuntu to Red Hat, how will it work then? I’ll need to set it up all over again. This is worse than that one time I had to set up LaTeX!” I hear you. If only there were tools available that specialize in consistent environments for software... Ladies and gentlemen, I’m sure you’ve heard of Vagrant. In combination with VirtualBox, Vagrant offers an open, free, fast and reproducible way of creating virtual machines with specified properties. In our case, what we need is an Ubuntu virtual machine with Wine installed. And because our project includes a Vagrantfile, running it on virtually any platform out there is a matter of:

$> git clone git@github.com:jure/LAICPMSmodel.git
$> cd LAICPMSmodel
$> vagrant up
$> vagrant ssh -- -X
vagrant@precise64:~$> wine /vagrant/bin/Release/LAICPMSmodel.exe
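
For a rough idea of what provisioning such a machine involves, here is a minimal sketch of a shell script that a Vagrantfile could run inside the Ubuntu guest. It is not the project's actual setup: the file name `bootstrap.sh`, the `DRY_RUN` switch and the package list are assumptions made for illustration. In particular, `xauth` is included only because X11 forwarding over `ssh -X` generally needs it on the guest, and whether a .NET application needs extra runtime pieces on top of plain Wine is not covered here.

```shell
#!/usr/bin/env bash
# bootstrap.sh (hypothetical name): illustrative provisioning sketch,
# NOT the script or Vagrantfile shipped with LAICPMSmodel.
# Assumes an Ubuntu guest and that the script runs as root, which is the
# default for Vagrant's shell provisioner.
set -euo pipefail

# Safe by default: only print the commands. Set DRY_RUN=0 inside the
# virtual machine to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or just echo it when in dry-run mode.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Install Wine (to run the Windows binary) and xauth (assumed to be
# needed so that `vagrant ssh -- -X` can forward the GUI to the host).
provision() {
  run apt-get update
  run apt-get install -y wine xauth
}

provision
```

Run as is, the script only prints the two `apt-get` commands it would execute, which makes it easy to review before letting it loose on a real machine. Hooking it into a Vagrantfile would be a matter of pointing Vagrant's shell provisioner at the script and setting `DRY_RUN=0` for it.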

And voila:

My software has come back to life!

TL;DR:

  1. Scientific software should always be written for an open platform
  2. Scientific software should be open source
  3. Scientific software should follow good software development practices

In another post coming soon, I’ll return to the topic of running scientific software in consistent frictionless environments using Vagrant, because oh boy! is that exciting for reproducibility.

Do you agree, disagree, have comments, questions, suggestions, ideas? Tweet away at @juretriglav.