Working reproducibly with others on OpenScPCA
Earlier this year, we launched the Open Single-cell Pediatric Cancer Atlas (OpenScPCA) project, a collaborative effort to openly analyze the data in the Single-cell Pediatric Cancer Atlas Portal on GitHub. We hope this project will bring transparently and expertly assigned cell type labels to the data in the Portal, help the community understand the strengths and limitations of applying existing single-cell methods to pediatric cancer data, and, frankly, allow us to meet more scientists in our community working with single-cell data (maybe you?).
From experience, we know the importance of ensuring reproducibility in an open analysis project. We're also aware that this is challenging. Luckily, we weren't starting entirely from scratch. OpenScPCA builds on the success of a project we published last year: the Open Pediatric Brain Tumor Atlas (OpenPBTA) project, which was co-organized with the Center for Data-Driven Discovery in Biomedicine (D3b) at the Children's Hospital of Philadelphia. We're not satisfied with just building on our success here; we want to make sure we're applying the lessons learned along the way. Jo Lynne Rokita's blog post details what D3b learned, and I wrote about what the Data Lab learned in a blog post of my own.
Besides taking contributor-facing documentation more seriously (you can read more about that in Deepa Prasad's most recent blog post) and making grants available, we think a few key differences in our approach to OpenScPCA vs. OpenPBTA are worth sharing. This blog post focuses on the techniques we're using to ensure reproducibility over time and on our testing strategy, which only checks that code can execute, not that it is correct (we use peer review to help with that!).
If it ain't broken… wait, are you sure it's not broken?
If an analysis project is around long enough, something is bound to break. In our experience, open, collaborative projects are long-lived: we want the code to be useful for a while! Combine that with multiple people from different labs writing code to analyze the data, and you have created the perfect conditions for things to break in expected and unexpected ways. In this respect, OpenScPCA and OpenPBTA are no different: we need to make sure the analysis code still runs as new data releases come out and that the underlying results are reproducible over time. (Reproducibility and its benefits and goals are discussed more extensively in Peng and Hicks, 2021.)
Are we doomed to play (and hopefully win) bug whack-a-mole when we go to publish parts of OpenScPCA? No! We can borrow a collection of strategies from our colleagues in software development to mitigate these problems, namely continuous integration/continuous delivery (CI/CD) and dependency management (i.e., tracking and pinning the versions of the libraries or packages our code depends on).
Ensuring reproducibility in OpenPBTA
In OpenPBTA, our strategy for ensuring reproducibility throughout the project was simple. We had one monolithic Docker image containing all the dependencies for every analysis and one CI/CD workflow to check that all code could be run. We created test data that included all features for a subset of samples. The workflow would first build the Docker image, start a container from the freshly built image, download the test data, and then run every analysis module on the test data within the container. The workflow had to pass before a pull request could be merged, regardless of which files the pull request altered.
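To make this concrete, the core logic of that single workflow looked roughly like the sketch below. This is a minimal Python rendering of the idea, not the actual OpenPBTA CI configuration; the image tag, helper script names, and paths are hypothetical.

```python
# Sketch of the OpenPBTA-style single-workflow approach: build one monolithic
# image, fetch the test data, then run every analysis module inside a container.
# Names and paths here are illustrative, not the project's real ones.
import subprocess
from pathlib import Path

IMAGE = "open-pbta-analysis:ci"      # hypothetical tag for the monolithic image
MODULES_DIR = Path("analyses")       # hypothetical directory holding all modules

def run(cmd):
    """Run a command and fail the whole workflow on any non-zero exit code."""
    subprocess.run(cmd, check=True)

# 1. Build the single project-wide Docker image.
run(["docker", "build", "-t", IMAGE, "."])

# 2. Download the test data (all features for a subset of samples).
run(["bash", "scripts/download-testing-data.sh"])  # hypothetical helper script

# 3. Run every analysis module on the test data inside the container.
for module in sorted(MODULES_DIR.iterdir()):
    run_script = module / "run-module.sh"          # hypothetical entry point
    if run_script.exists():
        run([
            "docker", "run", "--rm",
            "-v", f"{Path.cwd()}:/home/analysis",
            IMAGE,
            "bash", f"/home/analysis/{run_script}",
        ])
```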

Our approach mitigated, or at least made us aware of, some problems that arise in analytical projects. For example, running code on test data ensured that the syntax was correct, and if code made assumptions about the presence of specific samples (e.g., by hardcoding sample identifiers), a workflow failure would let us know we needed to make the code more general. If an analysis module's dependencies were missing from the Docker container, the workflow would fail, and we would know we needed to add them. Ultimately, we could successfully run all the steps contributing to the OpenPBTA publication in the project Docker container using the final data release.
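As a small illustration of the kind of assumption that test data exposes, consider filtering on specific sample identifiers versus filtering on sample attributes. The file and column names below are hypothetical; only the pattern matters.

```python
# Illustrative only: hardcoded sample identifiers break as soon as those exact
# samples are absent from a test (or future) data release; selecting samples by
# their attributes keeps the code general.
import pandas as pd

metadata = pd.read_csv("histologies.tsv", sep="\t")  # hypothetical metadata file

# Brittle: returns an empty table (and breaks downstream steps) when these
# exact identifiers are missing from a release.
# samples = metadata[metadata["sample_id"].isin(["SAMPLE_0001", "SAMPLE_0002"])]

# More general: select samples by a metadata column instead.
samples = metadata[metadata["experimental_strategy"] == "RNA-Seq"]
```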
The approach's simplicity was appealing, but as the project continued, it created pain points. Large Docker images take a long time to build and download. As the number of analysis modules grew, the number of steps in the workflow also increased. In the latter stages of the project, the workflow would sometimes time out (i.e., fail because it took too long), and even when it finished in the required time, it could be frustrating to wait over an hour to find out that there was a syntax error in the pull request. Heading into OpenScPCA, we wanted to retain the benefits of our OpenPBTA reproducibility strategy while alleviating these pain points.
Ensuring reproducibility in OpenScPCA
As in OpenPBTA, we expect analyses to be organized into subdirectories, or modules. Often, these modules will be entirely independent of each other. For example, if we are adding cell type labels to two projects (particularly if we use different methods), the code can be organized into separate modules. If Project A's cell typing code is broken, that has no bearing on Project B's cell typing code. We can use this independence to our advantage when designing our dependency management and testing strategy.
In OpenScPCA, every module has its own environment (i.e., the set of dependencies required to run its code) and its own CI/CD workflows: one to run the module's code on the test data and another to test building the module-specific Docker image and push it to the AWS Elastic Container Registry. To help contributors and project organizers with setup, we use a script to initialize new modules with the files required for managing environments and testing. We use a mix of renv and conda to manage dependencies, depending on the language used in the module, and we can use both of these technologies in conjunction with Docker. When we test a module's code using the test data (a set of 100 simulated cells for each library), we activate the module's environment or, when applicable, run the code in the module's container. These workflows are only triggered when a pull request changes relevant files: if Project A's cell typing code changes, only Project A's cell typing code gets tested. We hope this will allow us to develop and merge code more efficiently.
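To give a flavor of what module initialization sets up, here is a minimal Python sketch assuming a layout like OpenScPCA's and GitHub Actions-style workflow files. It is not the project's actual script or templates; the file names, pinned versions, and workflow stub are hypothetical, and a real workflow file would also define the jobs to run.

```python
# Sketch of a module-initialization script: create the module directory, write a
# pinned per-module conda environment spec (renv plays the same role for R), and
# stub out a CI workflow that runs only when the module's own files change, plus
# on a monthly schedule to catch silent breakage. Contents are illustrative only.
import sys
from pathlib import Path

module_name = sys.argv[1]                     # e.g., "cell-type-project-a"
module_dir = Path("analyses") / module_name
module_dir.mkdir(parents=True, exist_ok=True)

# Per-module, pinned dependency specification.
(module_dir / "environment.yml").write_text(
    f"name: openscpca-{module_name}\n"
    "channels: [conda-forge, bioconda]\n"
    "dependencies:\n"
    "  - python=3.11\n"
    "  - pandas=2.2\n"
)

# CI workflow stub: path-filtered pull request trigger plus a monthly schedule.
workflow_dir = Path(".github/workflows")
workflow_dir.mkdir(parents=True, exist_ok=True)
(workflow_dir / f"run_{module_name}.yml").write_text(
    "on:\n"
    "  pull_request:\n"
    f"    paths:\n      - analyses/{module_name}/**\n"
    "  schedule:\n"
    "    - cron: '0 0 1 * *'  # re-run against test data monthly\n"
)
```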

Running all the code on every pull request in OpenPBTA did have an advantage: if a module hadn't been touched in a while but no longer worked with a new data release, we'd find out pretty quickly. If we only test OpenScPCA code when it changes, we could end up with code that has been broken for months without our being aware of it. To guard against this possibility, we also build the Docker images and run the module code on the test data on a monthly schedule. That gives us an opportunity to fix any problems before we attempt to rerun something far down the line and are unpleasantly surprised.

Of course, given our focus on cell type labels, we must also prepare for the possibility that modules will build off one another (e.g., an analysis that uses cell type labels someone else generated as input). Keep your eyes peeled for Josh Shapiro's part II of this blog post for more on how we're thinking about this problem!
Join us for OpenScPCA!
Are you a researcher with experience or expertise in pediatric cancer, single-cell data, labeling cell types or cell states, and/or pan-cancer analyses? We invite you to explore and contribute your ideas to OpenScPCA!
- Read our documentation for more on ensuring reproducibility.
- Explore the OpenScPCA-analysis repository on GitHub.
- Join the conversation on GitHub Discussions, the community forum for OpenScPCA.
- Let us know you're interested by filling out the OpenScPCA intake form.