Creating an open source workflow to uniformly process data for the Single-cell Pediatric Cancer Atlas portal

March 14, 2023

ALLY HAWKINS

Last year, the Data Lab launched the Single-cell Pediatric Cancer Atlas (ScPCA) Portal, which today holds uniformly processed single-cell gene expression data from over 500 samples, contributed by 8 separate labs and representing 52 cancer types. The portal is still growing as we continue to receive and process raw data from ScPCA investigators! All uniformly processed data is available for download on the ScPCA Portal, giving researchers easy access to a growing database of summarized gene expression data and metadata to use in their own research.

But how exactly did we make sure that all of the data was uniformly processed? And how can we ensure uniform processing for incoming samples as the portal continues to grow? We invested time up front to build scpca-nf, a reusable and reproducible pipeline for processing single-cell RNA-seq data that is built with the Nextflow workflow manager. In this post, we’ll go behind the scenes and tell you a little about scpca-nf – how it works, why we like it, and how you can use it too!

What is scpca-nf?

scpca-nf is a reproducible Nextflow-based workflow that takes FASTQ files from single-cell or single-nuclei RNA-seq samples and returns a gene by cell count matrix and associated metadata stored within a SingleCellExperiment object. For each input sample, the workflow performs the following steps:

Gene expression quantification – Salmon is used to map reads to a reference transcriptome and alevin-fry is used to generate gene by cell count matrices. For more insight into why we use alevin-fry for gene expression quantification, see the FAQ section of the ScPCA documentation.
Removal of empty droplets – The gene by cell count matrix is imported into R to create a SingleCellExperiment object, which is then filtered to remove any droplets that do not contain cells. Two different objects are saved: one containing the unfiltered count matrix and one with the empty droplets removed.
Filtering low quality cells – Further filtering removes cells that have a low number of genes detected or a high mitochondrial count, either of which suggests that the cell is likely to be dead or dying. Here, we remove cells that miQC estimates have a greater than 75% chance of being poor quality, as well as cells with fewer than 100 genes detected.
Post-processing – The counts for cells that remain after removal of empty droplets and low quality cells then undergo normalization and log-transformation. Finally, dimensionality reductions are calculated using both principal component analysis (PCA) and UMAP and are stored in the final processed object.
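As a rough illustration of the low quality cell filter described above, the two thresholds can be written as a single keep-or-drop rule. This is a minimal Python sketch, not the scpca-nf implementation (which runs in R): the `CellQC` class and its `prob_compromised` field are hypothetical names standing in for the per-cell probability that miQC calculates.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    barcode: str
    genes_detected: int      # number of genes with at least one count
    prob_compromised: float  # hypothetical field: miQC-style probability of poor quality


def keep_cell(cell: CellQC, max_prob: float = 0.75, min_genes: int = 100) -> bool:
    """Keep a cell only if it passes both thresholds named in the post."""
    # Dropped: cells with a greater than 75% chance of being poor quality
    # Dropped: cells with fewer than 100 genes detected
    return cell.prob_compromised <= max_prob and cell.genes_detected >= min_genes


def filter_cells(cells: list[CellQC]) -> list[CellQC]:
    """Return the cells that survive the low quality filter."""
    return [cell for cell in cells if keep_cell(cell)]


# Made-up example values, for illustration only
cells = [
    CellQC("cell_a", genes_detected=500, prob_compromised=0.10),
    CellQC("cell_b", genes_detected=50, prob_compromised=0.10),   # too few genes
    CellQC("cell_c", genes_detected=500, prob_compromised=0.90),  # likely poor quality
]
kept = filter_cells(cells)
print([cell.barcode for cell in kept])
```

Both conditions must hold for a cell to be kept, so a cell with many genes detected is still removed if its poor quality probability is above the cutoff.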

For more detailed information on the individual steps performed in the workflow, see the processing information page in the ScPCA documentation.

For each sample in the portal, the scpca-nf workflow produces three versions of the count matrix that can be downloaded from the portal as SingleCellExperiment objects: the unfiltered count matrix, the filtered count matrix, and the final processed count matrix. After download, any of the three objects can be read directly into R to begin analysis (see Getting started with an ScPCA dataset). The processed objects can be particularly useful as they contain normalized data and dimensionality reduction results, allowing researchers to jump straight into asking important biological questions and performing downstream analyses.
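To give a feel for what "normalized and log-transformed" means for a count matrix, here is a small Python sketch. It assumes a simple library-size scaling, which this post does not specify and which may differ from the normalization scpca-nf actually applies in R; the function name `log_normalize` is hypothetical.

```python
import math


def log_normalize(counts: list[list[float]]) -> list[list[float]]:
    """Scale each cell by its library size, then log2-transform.

    counts: one inner list of raw gene counts per cell (cells x genes).
    Assumption: size factors are library sizes divided by the mean library size.
    """
    totals = [sum(cell) for cell in counts]
    if any(total == 0 for total in totals):
        raise ValueError("cells with zero total counts should be filtered out first")
    mean_total = sum(totals) / len(totals)
    size_factors = [total / mean_total for total in totals]
    # A pseudocount of 1 keeps zero counts at zero after the log transform
    return [
        [math.log2(count / sf + 1) for count in cell]
        for cell, sf in zip(counts, size_factors)
    ]


# Two made-up cells with the same expression profile at different sequencing depths
raw = [[1, 3], [2, 6]]
normalized = log_normalize(raw)
print(normalized)
```

In this toy example the second cell has twice the total counts of the first, and after scaling the two cells have the same normalized values, which is the point of correcting for library size before comparing cells.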

Why we like scpca-nf

Having a workflow like scpca-nf has allowed us to maintain a reproducible and uniform way to process the data available on the ScPCA portal. But what are some of our favorite features of this workflow, and what benefits have we seen over the lifetime of the project?

Reproducible – Due to the number of participating labs and our hope to expand the portal in the future, we needed a workflow that would allow us to process data intermittently over a long period of time in exactly the same way. Using Nextflow helps us accomplish this goal by performing all processing within container environments (Docker or Singularity), ensuring consistent versions of tools and dependencies.
Fast and efficient – Using Nextflow means we can take advantage of parallel processing, and process more samples faster. For gene expression quantification we use alevin-fry, which requires less computing power and memory than tools like Cell Ranger without compromising accuracy. The combined speed of parallel processing and efficiency of alevin-fry means samples are processed quickly and efficiently.
Modular setup – We can easily add new features. When we want to make changes or additions to the analysis steps, we don’t need to create a brand new workflow – we can simply add new steps to our existing workflow. This was particularly helpful when we added support for the other data types found on the portal (bulk RNA-seq, CITE-seq, spatial transcriptomics, and multiplexed data). The scpca-nf workflow can process any of these data types, all within the same workflow. We can also re-process samples to incorporate changes or new features, and with a few tricks we make sure that Nextflow skips repeating any time-consuming steps, saving us even more time!

Read more about workflow managers, like Nextflow, and their benefits in our blog post on Automating analyses with workflow managers.

How can you get started using scpca-nf to analyze your own samples?

Does scpca-nf sound like it could be useful to you and your research? You can try it out for yourself! Our pipeline is open source and available for all to use to analyze their own set of single-cell and single-nuclei RNA-seq samples. By using scpca-nf, you will be able to take advantage of the tools we have set up to automate the initial pre-processing steps typically performed when analyzing single-cell data, decreasing the overall time you spend analyzing each sample. Additionally, you can take advantage of the low computational cost of alevin-fry and Nextflow’s ability to parallelize jobs, and with one command analyze multiple samples at once in a reproducible way!

So what do you need to get started analyzing your single-cell data faster and more reproducibly?

Organize your single-cell data and create a metadata table in which each row contains the required information for one sample to be processed (e.g., sample identifier, path to FASTQ files).
Install Nextflow and Docker.
Test it out on your samples!
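The metadata table in the first step could be assembled with a short script like the Python sketch below. The column names `sample_id` and `fastq_dir` and the example paths are hypothetical, chosen only to mirror the examples mentioned above; the actual required columns are listed in the instructions in the scpca-nf repository.

```python
import csv
import io

# Hypothetical column names; check the scpca-nf instructions for the required ones
FIELDNAMES = ["sample_id", "fastq_dir"]


def build_metadata_table(samples: list[dict[str, str]]) -> str:
    """Return a tab-separated table with one row per sample."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDNAMES, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for sample in samples:
        writer.writerow(sample)
    return buffer.getvalue()


# Made-up samples and paths, for illustration only
samples = [
    {"sample_id": "sample01", "fastq_dir": "data/sample01"},
    {"sample_id": "sample02", "fastq_dir": "data/sample02"},
]
table = build_metadata_table(samples)
print(table)
```

Building the table from a list of samples, rather than editing it by hand, makes it easier to keep one consistent row per sample as more samples are added.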

For step-by-step instructions on how to get started using scpca-nf to analyze your own data, see the instructions for processing your samples in the scpca-nf GitHub repository.

Have any questions or run into trouble trying to use scpca-nf? Tell us about it by filing an issue in the scpca-nf GitHub repository. If you have questions about the ScPCA Portal, or if you have single-cell data related to childhood cancer that you are interested in including in the ScPCA Portal, you can reach out to us at scpca@ccdatalab.org. Happy processing!
