Git workflows for scientific projects and when we use them
Writing source code is a significant part of data-intensive biomedical research. Everything from cleaning and pre-processing data to generating publication figures can be accomplished programmatically. Increasingly, funding agencies and journals require researchers to share their code. To pick a few examples, the Data Lab’s parent organization, Alex’s Lemonade Stand Foundation (ALSF), has such a requirement for awardees, and PLoS Computational Biology requires authors to make code underlying results and conclusions available.
It can feel scary to share your code! I know I felt that way as a graduate student and postdoc. A practice called code review, which I was introduced to as a postdoc in the Greene Lab, can make it less scary. Code review is a quality assurance practice in software engineering where someone who did not author code reviews it (i.e., reads it or runs it) to check for things like readability and correctness. We practice it daily at the Data Lab. If you’re considering adopting code review in your own academic research or collaborative work environment, I want to tell you about some workflows that I wish I had known about ~5 years ago.
⚠️ This blog post assumes some familiarity with Git basics like branches.
Preparing your source code to say Hello World
Our science team practices code review for our exploratory analyses, processing workflows, and even documentation. Smart folks have written about analytical code review – we recommend Parker (2017) in particular. To borrow an idea from Parker that is borne out by my experience, when you know your code will be reviewed, you tend to write it as if other people will look at it. Usually, this means more robust documentation and more readable code. Code that’s publicly available, documented, and easy to follow is more likely to benefit others.
Lucky for us, there’s tooling out there that is geared towards code review. We use the version control system Git and the cloud-based platform GitHub at the Data Lab. GitHub has a feature called pull requests, which enables an author to post their code for reviewers to ask questions and make suggestions. Once your reviewer agrees with your implementation, they approve your pull request, and you can add your code to the default branch or “main copy” of the project.
Code review doesn’t just benefit the community. It also benefits practitioners! These benefits tend to show up in the medium to long term when you can more easily extend or adapt your code to do something new. However, some shorter-term risks are associated with code review, such as grinding things to a halt and overburdening your team. So, I want to share some ways of thinking about science team projects that I’ve found helpful for reaping the benefits of code review while still getting things done.
Project archetypes and when you’ll find them
If you’ve worked on scientific or analytical projects, chances are you know they can vary in their setup and scope. There are, however, recurrent patterns in my experience, which I call “project archetypes.” (No, not these project archetypes.) They overlap with software development projects from other domains, but this framing works for me!
As a bit of background, GitHub has functionality for releases – stopping points where development is frozen and code can be archived. A release can serve as a stopping point for a project’s developers but a starting point for others who want to use their code.
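Under the hood, a GitHub release is built on a Git tag. Here is a minimal, self-contained sketch of tagging a stopping point, run in a throwaway local repository; the repository contents, author details, and version number are all illustrative:

```shell
set -eu
# Throwaway repository standing in for a real project:
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo && cd demo
git config user.email "[email protected]"
git config user.name "Demo Author"
git commit -q --allow-empty -m "Analysis code at submission time"
git branch -M main

# An annotated tag freezes this point in the project's history:
git tag -a v1.0.0 -m "Code at time of preprint submission"
git tag --list
```

With a GitHub remote, `git push origin v1.0.0` would publish the tag, and a release (with notes and archived source) can then be drafted from it on GitHub.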
Analysis project
Repositories that contain processing and analysis code intended for a future manuscript typically fall under the analysis project archetype. It’s wonderful if code in analysis project repositories is consumed and reused by others, but the top priorities are correctness and “showing your work” – ensuring traceability and providing a record of the processing and analysis steps underlying your results (Peng and Hicks, 2021). This is why ALSF’s policy on source code sharing includes all source code (not just code for new methods), and some recent policy recommendations about data sharing include code sharing as an extension of data sharing requirements.
Typically, we’re not too worried about releases for analysis project repositories until it’s time to submit a preprint, complete a round of revisions, etc., and we want to archive the code at the point in time associated with a set of results.
Packages and workflows
Sometimes, we write code that we expect others might use regularly, such as an R package or Nextflow workflow. I call this the packages and workflows archetype because I’m not creative. When we’re working in a package or workflow repository, we need to be able to develop new functionality without putting out something that’s not ready for general usage. This allows people to use the latest release – a version of the code with a particular set of (usually) stable features – while development continues in the repository. (People can also use older releases to maintain reproducibility in their projects!)
Documentation
Most documentation we write here is in plain text files under version control, so it’s practically code! Repositories that exclusively contain user-facing documentation and whatever files are required to serve it fall under the documentation archetype. The considerations for this archetype overlap substantially with the packages and workflows archetype. Namely, we want to be able to write documentation at roughly the same time new features get implemented but before they are released. (If you have ever waited too long to write the methods section of your paper, you know about the headache we’re trying to avoid.) Again, releases are important for documentation – the latest release is what is public-facing, and we can continue to write new sections as needed.
Git workflows for project archetypes
Now that we’ve introduced the types of repositories we encounter, let’s talk about the Git workflows or strategies that we implement for them. As a reminder: all of our code and documentation goes through code review, and we need to be mindful of keeping things moving sustainably. If I wrote all of the documentation associated with a new feature and asked someone to review it all at once, they would probably get worn out before they finished reading it. Furthermore, making people wait a long time for feedback on an implementation can work against your goals and slow down progress overall if you then need to ask them to make substantial changes. Every strategy we use is geared toward allowing for iterative development while meeting a project’s requirements.
This is an area where, in my opinion, it’s easy to get caught up in different terminology for strategies. So, below, I’m opting for terms I can tie to concrete examples from Data Lab repositories that the science team maintains.
Feature branch workflow
In a feature branch workflow, there’s one long-lived branch (often called [.inline-snippet]main[.inline-snippet]) that serves as the “official copy” of the project. When someone wants to add to or update the project code base, they create a new branch for development off of [.inline-snippet]main[.inline-snippet]. They file a pull request requesting to merge their changes into the official copy of the project.
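The steps above can be sketched with a few Git commands. This is a self-contained demo: a throwaway bare repository stands in for the GitHub remote, and the repository, branch, file, and author names are all illustrative.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"
git init -q --bare origin.git               # stand-in for the GitHub remote
git clone -q origin.git project
cd project
git config user.email "[email protected]"
git config user.name "Demo Author"
git commit -q --allow-empty -m "Initial commit"
git branch -M main
git push -q -u origin main

# 1. Branch off main for a new piece of work:
git switch -q -c feature/add-qc-plots

# 2. Do the work and commit it:
echo "plot(qc_metrics)" > qc_plots.R
git add qc_plots.R
git commit -q -m "Add QC plotting script"

# 3. Push the branch. On GitHub, you would now open a pull request
#    targeting main and merge it once a reviewer approves:
git push -q -u origin feature/add-qc-plots
```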
We find this most useful for analysis projects, where keeping a record of the code underlying results is most important. Even without code review, using feature branches can help organize your work.
We use the feature branch workflow in the [.inline-snippet]sc-data-integration[.inline-snippet] repository, which contains analyses that support decisions we’ve made around cell typing that will soon land in the Single-cell Pediatric Cancer Atlas (ScPCA) Portal. We don’t use this code in production; we just want to track it internally and surface this work to others.
Development and main workflow
The development and main workflow consists of two long-lived branches that we’ll call [.inline-snippet]development[.inline-snippet] and [.inline-snippet]main[.inline-snippet], but this naming convention isn’t a strict requirement. The [.inline-snippet]development[.inline-snippet] branch is the project’s working copy, and the [.inline-snippet]main[.inline-snippet] branch is public-facing or “ready for prime time.” When someone wants to add to or update the project code base, they create a new branch for development off of [.inline-snippet]development[.inline-snippet]. They file a pull request requesting to merge their changes into [.inline-snippet]development[.inline-snippet] when it is ready for review. When changes that have accumulated in [.inline-snippet]development[.inline-snippet] are ready to be “published,” we file a pull request from [.inline-snippet]development[.inline-snippet] to [.inline-snippet]main[.inline-snippet]. Sometimes, we explicitly follow this merge into [.inline-snippet]main[.inline-snippet] with a release, but not always.
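Here is the same flow as a self-contained sketch in a throwaway local repository (on GitHub, each merge below would happen via an approved pull request; the branch, file, and author names are illustrative):

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"
git init -q repo && cd repo
git config user.email "[email protected]"
git config user.name "Demo Author"
git commit -q --allow-empty -m "Initial commit"
git branch -M main

# The long-lived working copy branches off main:
git switch -q -c development

# Feature branches come off development, not main:
git switch -q -c docs/new-filtering-step
echo "Describe the new filtering step" >> docs.md
git add docs.md
git commit -q -m "Document the new filtering step"

# After review, the feature branch merges into development:
git switch -q development
git merge -q --no-ff -m "Merge docs/new-filtering-step" docs/new-filtering-step

# When accumulated changes are ready for prime time, development
# merges into main, optionally followed by a release:
git switch -q main
git merge -q --no-ff -m "Publish accumulated changes" development
```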
This workflow is most useful for packages, workflows, or documentation. Including a [.inline-snippet]development[.inline-snippet] branch means we can iteratively add or update code (which helps with speedy review) without putting something half-baked out into the world.
We use this workflow for the [.inline-snippet]scpca-nf[.inline-snippet] repository, which contains the production workflow for the ScPCA Portal, and the [.inline-snippet]scpca-docs[.inline-snippet] repository, which contains our user-facing documentation for the ScPCA Portal.
Forking workflow
A forking workflow is usually only necessary when people with read-only access to a project’s repository contribute to it. That can happen, for example, if you’re fixing a typo in someone else’s public documentation project. Contributors work in their own fork, or copy, of the official project repository. The forking strategy can be used with any other workflow – it’s just a matter of what branch a pull request coming from a fork targets ([.inline-snippet]development[.inline-snippet], [.inline-snippet]main[.inline-snippet], or something else entirely).
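From the contributor’s side, a forking workflow looks like the following self-contained sketch. Two bare local repositories stand in for GitHub here: `official.git` plays the project you have read-only access to, and `fork.git` plays your fork; all names and file contents are illustrative.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"
git init -q --bare official.git
git init -q --bare fork.git

# Seed the official project with a main branch:
git clone -q official.git seed && cd seed
git config user.email "[email protected]"
git config user.name "Maintainer"
echo "Welcom to the docs" > README.md     # note the typo we'll fix
git add README.md
git commit -q -m "Initial docs"
git branch -M main
git push -q origin main
cd ..

# As a contributor, clone your fork, then track the official
# repository as a second remote, conventionally named upstream:
git clone -q fork.git contribution && cd contribution
git config user.email "[email protected]"
git config user.name "Contributor"
git remote add upstream ../official.git
git fetch -q upstream

# Branch off the branch your pull request will target:
git switch -q -c fix/readme-typo upstream/main
printf 'Welcome to the docs\n' > README.md
git add README.md
git commit -q -m "Fix typo in README"

# Push to *your* fork. On GitHub, you would open a pull request from
# your fork's branch into the official repository's target branch:
git push -q -u origin fix/readme-typo
```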
We used this model and a feature branch workflow for the OpenPBTA project in both the [.inline-snippet]OpenPBTA-analysis[.inline-snippet] and [.inline-snippet]OpenPBTA-manuscript[.inline-snippet] repositories, which are most similar to the analysis project archetype.
We’d love to hear from you
We recently covered these Git workflows and much more in a workshop on Git and GitHub for researchers from the Children’s Hospital of Philadelphia. If you’re interested in a similar workshop or chatting with the Data Lab about your own Git practices, reach out to us at [email protected].