OpenPBTA: Someone is wrong on the internet and it’s probably us (updated 9-9-2020)

May 29, 2020

Also authored by Chante Bethell, Candace Savonen, and Josh Shapiro

Here at the Childhood Cancer Data Lab, we value transparency and the practice of open science. Much of the work we’ve done and the products that we build hinge on the generosity and openness of other scientists. In this post, as part of National Brain Tumor Awareness month, we want to talk about a project that our science team has been working on over the last few months (and to do so in a way that aligns with our values).

Since September of last year, we have been organizing the Open Pediatric Brain Tumor Atlas (OpenPBTA) project with the team at Center for Data-Driven Discovery in Biomedicine at Children’s Hospital of Philadelphia. The ideas behind the OpenPBTA are straightforward – analyze one of the largest collections of somatic pediatric brain tumor data in the open, on GitHub, and crowdsource expertise from all over the world to better understand one of the leading causes of death from disease for children and young adults. Beyond better molecular characterization of pediatric brain tumors, there are added benefits to how we’ve structured (and licensed!) the OpenPBTA project. The robust, reusable workflows that we build can continue to be applied as the PBTA data grows to include additional samples and reused to study other pediatric cancers.

The implementation of the project has been decidedly less straightforward. To our knowledge, this is the first time something quite like this has been tried.  We focus a lot on the tools that underlie reproducible data science in our day-to-day. We’ve built infrastructure that lets us automatically test new contributions, including our own. This has revealed opportunities to improve the underlying data, compare concordance of software, and incorrect simplifying assumptions.

We need your help

So why this blog post now? Well, quite simply, we need the help of other scientists to keep moving the project forward. We at the CCDL are looking to wind down our role in actively analyzing the OpenPBTA data so we can focus on our virtual training workshops and the Alex’s Lemonade Stand Foundation Single Cell Pediatric Cancer Atlas project.

We’ve been preparing for this transition since we began our involvement in the process. We’ve focused on getting the manuscript’s methods section updated, certain figures included, and other portions outlined.

But before we go (and by go, we mean stay but play a different role that is more community-focused), we want to share some of the things that keep us up at night. You see, someone is wrong on the internet and we think it’s probably us.

Webcomic from xkcd
Webcomic from xkcd (https://xkcd.com/386/)


We hope you find our figures aesthetically pleasing, but we think if you spend some time with them you’ll probably agree that we’re wrong. That’s okay, though–to be wrong is to be alive and in the business of science. We want to work together to be less wrong! Only then will we reach the full potential and promise of the OpenPBTA project.

Below, we’ll talk about 3 areas where we think the project could use some input from you, dear reader. We will highlight what we’ve done, what we're concerned about, and some thoughts about moving forward. We’ll also link to relevant GitHub issues, pull requests, and analysis modules so you can take a closer look.

Comparing tumor mutation burden across datasets: WGS vs. WXS

Previous studies comparing pediatric and adult tumor datasets report lower tumor mutational burden (TMB) in pediatric cancers (Vogelstein et al. 2013., Lawrence et al. 2013., and Grobner et al. 2018.). A challenge in comparing TMB results between publications, and in some cases datasets, is the lack of consensus about how to calculate TMB  (TMB Harmonization Project and Merino et al. 2020.).

In this project, we’ve defined TMB as

(total # coding sequence SNVs called both Strelka2 and Mutect2 ) / size of the effectively surveyed coding region of the genome

We considered base pairs to be effectively surveyed if they were in the intersection of the genomic ranges considered by the SNV callers and coding sequences of the genome.

Overview of TMB methodology. a and c) Mutations within the effectively surveyed portion of the genome and will be counted in TMB calculations if they are the same allele called by both callers and within the coding regions. The total length of the effectively surveyed genome is used as the denominator. b) A mutation is called by one but not both callers, so is not included in TMB calculations d) A mutation is called by both callers, but is not within the coding regions, so is not included in TMB. e) A mutation in the same location is called by both callers, but a different allele is called, so this is not included in TMB. f) A mutation is outside the observable range for one of the callers, neither this region nor mutations in it are used for TMB.

To minimize potential sources of artificial TMB variance, we started from the sequence level and used the same SNV caller algorithms, variant allele frequency cutoff minimums, consensus SNV decisions, and coding regions considered for both adult and pediatric datasets.  See the code for these processing steps here: https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers

TMB before, when we were wrong on the internet (5-2-2020)

Our TMB comparisons of pediatric and adult run counter to previous literature:

Tumor mutation burden in the OpenPBTA dataset (pediatric) and the TCGA dataset (adult) — as of 5-2-2020
Tumor mutation burden in the OpenPBTA dataset (pediatric) and the TCGA dataset (adult) — as of 5-2-2020

We suspect this result may be a limitation of sequencing strategy; all adult cancer data used here are whole exome sequenced (WXS), while nearly all the pediatric cancers were whole genome sequenced (WGS). TMB comparison between pediatric and adults is highly influenced by the SNV callers used and some SNV callers like Lancet perform differently on paired WXS and WGS data.

TMB after we fixed what was wrong on the internet in an extremely narrow sense (9-9-20)

Tumor mutation burden in the OpenPBTA dataset (pediatric) and the TCGA dataset (adult) — as of 9-9-2020
Tumor mutation burden in the OpenPBTA dataset (pediatric) and the TCGA dataset (adult) — as of 9-9-2020

A bug in our filtering step was found and addressed!  This bug left in some mutations that were outside of the ranges where mutations are able to be called (akin to situation f on the first figure). These regions we filter by (but weren’t doing so correctly) are mainly coding regions. This disproportionally increased WGS samples’ number of mutations since there are, by design, more noncoding regions sequenced in WGS than WXS samples. This made our PBTA pediatric samples (which are almost all WGS samples) falsely seem like they had higher TMB than TCGA adult samples (which are mostly WXS samples).

We have also explored updating our analysis to include only nonsynonymous mutations, but this has little effect on the TMB results, especially not comparisons among samples.

Structural variants, ploidy, and the challenges of consistency

Copy number alterations and other structural variants are often highly consequential mutations in cancers, but they are notoriously difficult to identify correctly from short read data, and even more so in heterogeneous cell populations like tumors. We started with somatic copy number variant (CNV) and somatic structural variant (SV) calls from Control-FREEC, CNVkit, and Manta. Each of these methods has false positives and false negatives, so we combined their powers by identifying consensus copy number variant segments that were present in at least two callers’ results. Because we expect the precise copy number to have the greatest variation among callers, we only require the direction of a call segment (loss or gain) to match. We also included a number of other filters in the creation of this consensus set, including requiring copy number variant sets to be greater than a certain length, and excluding regions that are known to be problematic for these kinds of analyses, including centromeres, telomeres, and segmentally duplicated regions.

This set of consensus CNV segments was used in downstream analyses, but it seems quite likely that this is an area ripe for improvement. For example, our initial consensus calls largely ignored the inferred ploidy of each tumor due to variation in how ploidy is handled by the underlying callers.  We would love to update these calls to more accurately reflect copy number variation in non-diploid tumors, and to better reflect quantitative variation in copy number to capture the difference between single duplication events and more extensive amplification.

As a first pass at this, we compared the copy number of each segment as defined by CNVkit to the ploidy inferred by Control-FREEC to assign losses, gains, and amplifications (where copy number was more than twice the inferred ploidy).

We also wanted to define “focal” copy number alterations to avoid calling strings of adjacent genes as separate lesions where it would be more appropriate to call a broader region. We tried using GISTIC 2.0 (Mermel et al. 2011.) to differentiate between broad and focal copy number alterations but it proved challenging in this dataset (more on that below). Instead, we took a more manual approach, starting with calling alterations that affected the vast majority of a chromosome arm, then alterations not captured by arm calls that affected the vast majority of a cytoband. Finally, alterations of genes that were not captured in arm or cytoband calls were included individually.  We think this approach is reasonable as a start, but is there a better way to classify status across segments and select focal lesions? Are our cutoffs well-chosen?

What does it mean for alterations or lesions to be “interesting” in a dataset with multiple histologies, some of which have small sample sizes?

The OpenPBTA dataset is an excellent resource with over 30 different disease types represented. A mix of disease types can pose an analytical challenge, especially considering some groups are composed of very few samples; others have hundreds of samples.

Sample distribution across cancer types
Distribution of sample size across cancer types in the OpenPBTA dataset.

Many approaches to identifying “significant” lesions rely in part on the notion of recurrence–if a particular alteration is observed in many samples, it is more likely to be of interest and perhaps to confer some kind of selective advantage for tumor cells (e.g., be a driver). We’ve struggled with selecting appropriate methodologies and, more generally, what it means for an alteration to be interesting in this many-disease context.

For example, we started by running GISTIC 2.0 on the entire cohort to identify somatic copy number alterations of interest (Mermel et al. 2011.). When we compare the GISTIC results from an individual histology to the results from the entire cohort, we see considerable differences. This is not unexpected–the largest individual histology (low-grade glioma) only makes up about a quarter of the samples and it stands to reason that not all copy number alterations will be shared (Extended Data Figure 8 from Grobner et al. 2018.). For histologies with small sample sizes, we ran into errors running GISTIC that we haven’t resolved yet. So now, we count how many times a somatic copy number alteration (gene-level, cytoband-level, or chromosome arm-level) occurs and apply a cutoff that ignores the majority of genes, instead.


We’d love to include methods that identify drivers and account for things like the background rate of somatic alterations. How do you approach these problems in a multi-disease context when sometimes n is small?

OF COURSE, THESE ARE NOT THE ONLY WAYS WE MIGHT BE WRONG, BUT THEY ARE A FEW THAT COME TO MIND!

Contribute to the OpenPBTA project

Did you spot a way that we were wrong on the internet? Are you ready to contribute to the OpenPBTA project? We’d love to have you!

You can find all the analyses on GitHub at https://github.com/AlexsLemonade/OpenPBTA-analysis and a draft of the manuscript at https://github.com/AlexsLemonade/OpenPBTA-manuscript. GitHub is the main way to discuss and contribute analyses to the project; other technologies like Docker and CircleCI for continuous integration round out the project infrastructure (see an excerpt from the OpenPBTA presentation at Rocky 2019 below).

 An excerpt from the OpenPBTA presentation at Rocky 2019

In the analysis repository README, we ask folks who are new to the project to read documentation about the data, look at the analyses that have been performed so far, and check out our instructions for contributing.

We realize this can be a lot of information for first-time contributors and for folks who do not regularly use all these features of GitHub or technologies. We are here to help! Feel free to start small: File a new issue suggesting ways to improve an analysis, or comment on existing issues where you’d like to contribute! (Here’s a good example.) If you have general questions, we encourage you to sign up for the Cancer Data Science Slack, join the #open-pbta channel, and introduce yourself! Organizers will direct you to different parts of the project or tag other contributors as needed. That’s what we’re here for!

In the future, we plan to blog about the CCDL’s internal processes for analytical code review and some of the lessons we’ve learned from the OpenPBTA project so far.

Bye for now 👋 and we hope to see you on GitHub soon!

Also authored by Chante Bethell, Candace Savonen, and Josh Shapiro

Here at the Childhood Cancer Data Lab, we value transparency and the practice of open science. Much of the work we’ve done and the products that we build hinge on the generosity and openness of other scientists. In this post, as part of National Brain Tumor Awareness month, we want to talk about a project that our science team has been working on over the last few months (and to do so in a way that aligns with our values).

Since September of last year, we have been organizing the Open Pediatric Brain Tumor Atlas (OpenPBTA) project with the team at Center for Data-Driven Discovery in Biomedicine at Children’s Hospital of Philadelphia. The ideas behind the OpenPBTA are straightforward – analyze one of the largest collections of somatic pediatric brain tumor data in the open, on GitHub, and crowdsource expertise from all over the world to better understand one of the leading causes of death from disease for children and young adults. Beyond better molecular characterization of pediatric brain tumors, there are added benefits to how we’ve structured (and licensed!) the OpenPBTA project. The robust, reusable workflows that we build can continue to be applied as the PBTA data grows to include additional samples and reused to study other pediatric cancers.

The implementation of the project has been decidedly less straightforward. To our knowledge, this is the first time something quite like this has been tried.  We focus a lot on the tools that underlie reproducible data science in our day-to-day. We’ve built infrastructure that lets us automatically test new contributions, including our own. This has revealed opportunities to improve the underlying data, compare concordance of software, and incorrect simplifying assumptions.

We need your help

So why this blog post now? Well, quite simply, we need the help of other scientists to keep moving the project forward. We at the CCDL are looking to wind down our role in actively analyzing the OpenPBTA data so we can focus on our virtual training workshops and the Alex’s Lemonade Stand Foundation Single Cell Pediatric Cancer Atlas project.

We’ve been preparing for this transition since we began our involvement in the process. We’ve focused on getting the manuscript’s methods section updated, certain figures included, and other portions outlined.

But before we go (and by go, we mean stay but play a different role that is more community-focused), we want to share some of the things that keep us up at night. You see, someone is wrong on the internet and we think it’s probably us.

Webcomic from xkcd
Webcomic from xkcd (https://xkcd.com/386/)


We hope you find our figures aesthetically pleasing, but we think if you spend some time with them you’ll probably agree that we’re wrong. That’s okay, though–to be wrong is to be alive and in the business of science. We want to work together to be less wrong! Only then will we reach the full potential and promise of the OpenPBTA project.

Below, we’ll talk about 3 areas where we think the project could use some input from you, dear reader. We will highlight what we’ve done, what we're concerned about, and some thoughts about moving forward. We’ll also link to relevant GitHub issues, pull requests, and analysis modules so you can take a closer look.

Comparing tumor mutation burden across datasets: WGS vs. WXS

Previous studies comparing pediatric and adult tumor datasets report lower tumor mutational burden (TMB) in pediatric cancers (Vogelstein et al. 2013., Lawrence et al. 2013., and Grobner et al. 2018.). A challenge in comparing TMB results between publications, and in some cases datasets, is the lack of consensus about how to calculate TMB  (TMB Harmonization Project and Merino et al. 2020.).

In this project, we’ve defined TMB as

(total # coding sequence SNVs called both Strelka2 and Mutect2 ) / size of the effectively surveyed coding region of the genome

We considered base pairs to be effectively surveyed if they were in the intersection of the genomic ranges considered by the SNV callers and coding sequences of the genome.

Overview of TMB methodology. a and c) Mutations within the effectively surveyed portion of the genome and will be counted in TMB calculations if they are the same allele called by both callers and within the coding regions. The total length of the effectively surveyed genome is used as the denominator. b) A mutation is called by one but not both callers, so is not included in TMB calculations d) A mutation is called by both callers, but is not within the coding regions, so is not included in TMB. e) A mutation in the same location is called by both callers, but a different allele is called, so this is not included in TMB. f) A mutation is outside the observable range for one of the callers, neither this region nor mutations in it are used for TMB.

To minimize potential sources of artificial TMB variance, we started from the sequence level and used the same SNV caller algorithms, variant allele frequency cutoff minimums, consensus SNV decisions, and coding regions considered for both adult and pediatric datasets.  See the code for these processing steps here: https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers

TMB before, when we were wrong on the internet (5-2-2020)

Our TMB comparisons of pediatric and adult run counter to previous literature:

Tumor mutation burden in the OpenPBTA dataset (pediatric) and the TCGA dataset (adult) — as of 5-2-2020
Tumor mutation burden in the OpenPBTA dataset (pediatric) and the TCGA dataset (adult) — as of 5-2-2020

We suspect this result may be a limitation of sequencing strategy; all adult cancer data used here are whole exome sequenced (WXS), while nearly all the pediatric cancers were whole genome sequenced (WGS). TMB comparison between pediatric and adults is highly influenced by the SNV callers used and some SNV callers like Lancet perform differently on paired WXS and WGS data.

TMB after we fixed what was wrong on the internet in an extremely narrow sense (9-9-20)

Tumor mutation burden in the OpenPBTA dataset (pediatric) and the TCGA dataset (adult) — as of 9-9-2020
Tumor mutation burden in the OpenPBTA dataset (pediatric) and the TCGA dataset (adult) — as of 9-9-2020

A bug in our filtering step was found and addressed!  This bug left in some mutations that were outside of the ranges where mutations are able to be called (akin to situation f on the first figure). These regions we filter by (but weren’t doing so correctly) are mainly coding regions. This disproportionally increased WGS samples’ number of mutations since there are, by design, more noncoding regions sequenced in WGS than WXS samples. This made our PBTA pediatric samples (which are almost all WGS samples) falsely seem like they had higher TMB than TCGA adult samples (which are mostly WXS samples).

We have also explored updating our analysis to include only nonsynonymous mutations, but this has little effect on the TMB results, especially not comparisons among samples.

Structural variants, ploidy, and the challenges of consistency

Copy number alterations and other structural variants are often highly consequential mutations in cancers, but they are notoriously difficult to identify correctly from short read data, and even more so in heterogeneous cell populations like tumors. We started with somatic copy number variant (CNV) and somatic structural variant (SV) calls from Control-FREEC, CNVkit, and Manta. Each of these methods has false positives and false negatives, so we combined their powers by identifying consensus copy number variant segments that were present in at least two callers’ results. Because we expect the precise copy number to have the greatest variation among callers, we only require the direction of a call segment (loss or gain) to match. We also included a number of other filters in the creation of this consensus set, including requiring copy number variant sets to be greater than a certain length, and excluding regions that are known to be problematic for these kinds of analyses, including centromeres, telomeres, and segmentally duplicated regions.

This set of consensus CNV segments was used in downstream analyses, but it seems quite likely that this is an area ripe for improvement. For example, our initial consensus calls largely ignored the inferred ploidy of each tumor due to variation in how ploidy is handled by the underlying callers.  We would love to update these calls to more accurately reflect copy number variation in non-diploid tumors, and to better reflect quantitative variation in copy number to capture the difference between single duplication events and more extensive amplification.

As a first pass at this, we compared the copy number of each segment as defined by CNVkit to the ploidy inferred by Control-FREEC to assign losses, gains, and amplifications (where copy number was more than twice the inferred ploidy).

We also wanted to define “focal” copy number alterations to avoid calling strings of adjacent genes as separate lesions where it would be more appropriate to call a broader region. We tried using GISTIC 2.0 (Mermel et al. 2011.) to differentiate between broad and focal copy number alterations but it proved challenging in this dataset (more on that below). Instead, we took a more manual approach, starting with calling alterations that affected the vast majority of a chromosome arm, then alterations not captured by arm calls that affected the vast majority of a cytoband. Finally, alterations of genes that were not captured in arm or cytoband calls were included individually.  We think this approach is reasonable as a start, but is there a better way to classify status across segments and select focal lesions? Are our cutoffs well-chosen?

What does it mean for alterations or lesions to be “interesting” in a dataset with multiple histologies, some of which have small sample sizes?

The OpenPBTA dataset is an excellent resource with over 30 different disease types represented. A mix of disease types can pose an analytical challenge, especially considering some groups are composed of very few samples; others have hundreds of samples.

Sample distribution across cancer types
Distribution of sample size across cancer types in the OpenPBTA dataset.

Many approaches to identifying “significant” lesions rely in part on the notion of recurrence–if a particular alteration is observed in many samples, it is more likely to be of interest and perhaps to confer some kind of selective advantage for tumor cells (e.g., be a driver). We’ve struggled with selecting appropriate methodologies and, more generally, what it means for an alteration to be interesting in this many-disease context.

For example, we started by running GISTIC 2.0 on the entire cohort to identify somatic copy number alterations of interest (Mermel et al. 2011.). When we compare the GISTIC results from an individual histology to the results from the entire cohort, we see considerable differences. This is not unexpected–the largest individual histology (low-grade glioma) only makes up about a quarter of the samples and it stands to reason that not all copy number alterations will be shared (Extended Data Figure 8 from Grobner et al. 2018.). For histologies with small sample sizes, we ran into errors running GISTIC that we haven’t resolved yet. So now, we count how many times a somatic copy number alteration (gene-level, cytoband-level, or chromosome arm-level) occurs and apply a cutoff that ignores the majority of genes, instead.


We’d love to include methods that identify drivers and account for things like the background rate of somatic alterations. How do you approach these problems in a multi-disease context when sometimes n is small?

OF COURSE, THESE ARE NOT THE ONLY WAYS WE MIGHT BE WRONG, BUT THEY ARE A FEW THAT COME TO MIND!

Contribute to the OpenPBTA project

Did you spot a way that we were wrong on the internet? Are you ready to contribute to the OpenPBTA project? We’d love to have you!

You can find all the analyses on GitHub at https://github.com/AlexsLemonade/OpenPBTA-analysis and a draft of the manuscript at https://github.com/AlexsLemonade/OpenPBTA-manuscript. GitHub is the main way to discuss and contribute analyses to the project; other technologies like Docker and CircleCI for continuous integration round out the project infrastructure (see an excerpt from the OpenPBTA presentation at Rocky 2019 below).

 An excerpt from the OpenPBTA presentation at Rocky 2019

In the analysis repository README, we ask folks who are new to the project to read documentation about the data, look at the analyses that have been performed so far, and check out our instructions for contributing.

We realize this can be a lot of information for first-time contributors and for folks who do not regularly use all these features of GitHub or technologies. We are here to help! Feel free to start small: File a new issue suggesting ways to improve an analysis, or comment on existing issues where you’d like to contribute! (Here’s a good example.) If you have general questions, we encourage you to sign up for the Cancer Data Science Slack, join the #open-pbta channel, and introduce yourself! Organizers will direct you to different parts of the project or tag other contributors as needed. That’s what we’re here for!

In the future, we plan to blog about the CCDL’s internal processes for analytical code review and some of the lessons we’ve learned from the OpenPBTA project so far.

Bye for now 👋 and we hope to see you on GitHub soon!

Back To Blog