Downstream Analysis Workflows – do you have a list of genes whose expression you are particularly interested in?

April 11, 2023

The Childhood Cancer Data Lab maintains a collection of uniformly processed single-cell data from pediatric cancer clinical samples and xenografts in the Single-cell Pediatric Cancer Atlas (ScPCA) Portal. Although access to preprocessed data saves researchers time, we know that the downloads from the ScPCA Portal are only the starting point. That’s why we’ve created downstream analysis workflows for commonly performed analyses. Instead of writing analysis code from scratch, you can configure these workflows and start analyzing your data.

Previously, we introduced you to the core downstream analysis workflow that performs filtering, normalization, and dimensionality reduction. We have since developed a clustering analysis workflow that performs graph-based clustering across various parameters in parallel to help identify optimal clustering for each library in a dataset. 

In this blog post, we’ll introduce you to the recently added genes of interest analysis workflow, which evaluates the expression of a provided list of genes in a sample. Both the clustering and genes of interest workflows can be applied to preprocessed data downloaded from the ScPCA Portal or to any other single-cell or single-nuclei datasets that have undergone filtering, normalization, and dimensionality reduction. Like the core workflow, these additional downstream workflows mean less time writing scripts to perform these tasks and more time interpreting your results.

What is the “genes of interest” analysis workflow?

Do you have a list of genes whose expression you are particularly interested in? Will visualizing the expression levels of a specific list of genes help identify key features of your dataset? The genes of interest workflow can help you evaluate the expression of a specific list of genes in an individual sample. 

Here’s what you can expect from this module

The genes of interest workflow accepts a list of genes as input, in addition to a preprocessed [.inline-snippet]SingleCellExperiment[.inline-snippet] object stored as an RDS file. Then, the workflow performs three main steps:

  1. The first step is confirming that the genes of interest are in your data! A common obstacle in data analysis is that different objects use different types of gene identifiers, which creates the need to map the identifiers to each other. You can provide gene symbols, Ensembl IDs, Entrez IDs, etc., and the workflow will perform the gene identifier mapping for you!
  2. Hierarchical clustering is then performed on the normalized data associated with the genes of interest. These clustering results are used to generate a heatmap that can help identify any clear structure or groupings of cells with similar gene expression patterns, if present.
  3. The expression levels of the genes of interest are then visualized and included in an HTML report. The normalized and transformed expression for each of the provided genes of interest is compared to the mean expression of all genes in the dataset. An example of this visualization is shown below.

The output of this workflow includes files containing statistics and an HTML file with visualizations like the one above that can help you identify key features of your dataset using the expression levels of the genes you are interested in. We will talk more about the expected output in the next section!
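
If you're curious what steps 1 and 3 boil down to, here is a minimal Python sketch of the idea. It is not the workflow's own code (the workflow operates on SingleCellExperiment objects stored as RDS files), and everything in it is an assumption made for illustration: the gene names, the placeholder identifiers, the toy expression values, and the helper names [.inline-snippet]map_genes[.inline-snippet] and [.inline-snippet]compare_to_overall_mean[.inline-snippet] are all hypothetical.

```python
from statistics import mean

# Hypothetical lookup table: gene symbol -> identifier used in the dataset.
# These are placeholders for illustration, not real gene symbols or Ensembl IDs.
SYMBOL_TO_DATASET_ID = {
    "GENE_A": "ID_0001",
    "GENE_B": "ID_0002",
    "GENE_C": "ID_0003",
}

# Toy normalized expression: dataset identifier -> one value per cell.
NORMALIZED = {
    "ID_0001": [2.0, 4.0, 6.0],
    "ID_0002": [0.0, 1.0, 2.0],
    "ID_0003": [1.0, 1.0, 1.0],
}


def map_genes(genes, lookup, dataset_ids):
    """Split the provided genes into those found in the dataset and those missing."""
    mapped, missing = {}, []
    for gene in genes:
        # Accept an identifier the dataset already uses, or translate a symbol.
        gene_id = gene if gene in dataset_ids else lookup.get(gene)
        if gene_id in dataset_ids:
            mapped[gene] = gene_id
        else:
            missing.append(gene)
    return mapped, missing


def compare_to_overall_mean(mapped, expression):
    """Return each mapped gene's mean expression minus the mean across all genes."""
    overall = mean(mean(values) for values in expression.values())
    return {
        gene: mean(expression[gene_id]) - overall
        for gene, gene_id in mapped.items()
    }


if __name__ == "__main__":
    mapped, missing = map_genes(
        ["GENE_A", "ID_0002", "GENE_X"], SYMBOL_TO_DATASET_ID, NORMALIZED
    )
    print("mapped:", mapped)
    print("missing:", missing)
    print("difference from overall mean:", compare_to_overall_mean(mapped, NORMALIZED))
```

In this toy example, a gene that is neither a known symbol nor a dataset identifier ends up in the missing list rather than silently dropping out, which is the kind of check step 1 describes.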

What is the output of the genes of interest module?

Upon each successful run of the genes of interest workflow, you can expect the following output for each library ID:

  1. A [.inline-snippet]_mapped_genes.tsv[.inline-snippet] file with mapped genes of interest if gene identifier mapping was necessary. Otherwise, this file will store just the provided genes of interest.
  2. A [.inline-snippet]_normalized_zscores.mtx[.inline-snippet] matrix file with the z-scored matrix calculated using the normalized data specific to the provided genes of interest.
  3. A [.inline-snippet]_heatmap_annotation.rds[.inline-snippet] file with the annotations to be used when plotting the heatmap for the HTML report.
  4. A [.inline-snippet]_goi_report.html[.inline-snippet] file, the summary HTML report containing relevant statistics and plots that are useful for interpreting results and sharing them with collaborators. This HTML report can be opened in a web browser of your choice!
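
To give a feel for what a z-scored matrix holds and how the files might be named, here is a small Python sketch. It rests on assumptions that the post does not confirm: that each file name is the library ID followed by the suffix listed above, that z-scores are computed per gene across cells, and that the population standard deviation is used (the workflow may well use the sample standard deviation instead). The library ID [.inline-snippet]LIB1[.inline-snippet] and both helper names are hypothetical.

```python
from statistics import mean, pstdev

# Suffixes named in the list above; prefixing them with the library ID
# is an assumption, not something the post states.
OUTPUT_SUFFIXES = (
    "_mapped_genes.tsv",
    "_normalized_zscores.mtx",
    "_heatmap_annotation.rds",
    "_goi_report.html",
)


def expected_outputs(library_id):
    """Build the output file names we would look for, given a library ID."""
    return [f"{library_id}{suffix}" for suffix in OUTPUT_SUFFIXES]


def zscore(values):
    """Center values on their mean and scale by the population standard deviation."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # A gene with identical expression in every cell has no variation to scale.
        return [0.0 for _ in values]
    return [(value - center) / spread for value in values]


if __name__ == "__main__":
    for name in expected_outputs("LIB1"):  # "LIB1" is a made-up library ID
        print(name)
    # One gene's normalized expression across two cells (toy values).
    print(zscore([2.0, 4.0]))
```

Z-scoring puts every gene on a comparable scale, so a heatmap is not dominated by whichever gene happens to have the highest overall expression.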

You can share these output files with collaborators or use them in further analyses of your own. You can also readily run the genes of interest workflow on additional datasets! Learn more about the expected input and output and running the genes of interest workflow on the GitHub repository.

Coming soon!

We hope to add a data integration module soon! If you want to integrate your sample datasets, stay tuned for our ready-to-go workflow that will allow you to do just that.

The Data Lab continues to enhance the portal, and we appreciate your feedback. Currently, we are conducting usability testing for the downstream analysis workflow and are looking for more childhood cancer researchers to participate. Fill out this form if you’re interested in learning more.

If you have questions about the ScPCA Portal, you can contact us at scpca@ccdatalab.org.
