, Part 1


This post introduces the first home-grown project from the Childhood Cancer Data Lab. At the highest level, its goal is to build a large and consistently updated compendium of harmonized genomic data. In this blog post I will discuss why we’re building it and what it is.

The Why

There is a lot of publicly available genomic data, and it keeps growing. The first type of genomic data that we’re refining is gene expression data. Each human genome contains approximately 20,000 genes that code for proteins. Each cell in our body has every gene in the genome, but not all of those genes are expressed equally. Data that measures how strongly genes are expressed is called gene expression data. There are a couple of major repositories for this data. The one from the US is the NCBI’s Gene Expression Omnibus and the one from Europe is EBI’s ArrayExpress. In aggregate, we estimate that there are more than two million genome-wide assays available through these resources.

We want to put this data into the hands of researchers. This type of resource empowers researchers to answer many types of outstanding questions. For example, a researcher might ask, “Which samples across all public data express genes most similarly to cancer type X?” What this researcher finds may open the door to subsequent hypotheses that require some subset of the available data for further experimentation. At the same time, new machine learning methods are redefining what tasks computers can solve and what questions biological researchers can answer. However, these methods require lots of training data. We aim to equip researchers with the data that they need for their own analyses and for new computational approaches. Any researcher can, in theory, download and harmonize all two million genome-wide assays. However, doing so is expensive in three ways:

  • The researcher must invest time to learn about and implement data downloading and processing. The less time highly trained scientists spend being data janitors, the more time they can spend doing actual science.
  • The researcher must pay the cost of data processing. The cost of downloading and harmonizing datasets at the scale of the completed compendium can be substantial.
  • The researcher must wait for data processing to complete. This opportunity cost imposes delays on the hypothesize, experiment, analyze, and repeat cycle.

A compendium of harmonized gene expression data will allow researchers to design an experiment that relies on a subset of that data, mine the data, reevaluate which subset of data they want to use, and repeat in a much quicker cycle. Additionally, it empowers researchers who may not have the technical skills to pre-process data from many different platforms themselves, but who can analyze harmonized data. A public and consistently maintained compendium also means that instead of each researcher needing to master each processing technique that they would like to use, only one researcher needs to. We designed the system to run analyses via Processors, which are templated workflows to harmonize public datasets. Once a Processor has been written, it can be applied to downloadable data and the resulting harmonized data can be accessed by scientists everywhere.
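To make the idea of a templated workflow concrete, here is a minimal sketch in Python. It is not the project's actual code: the `Processor` class, the `Sample` type, and the two example steps are hypothetical names chosen for illustration. The point is only that a workflow written once as an ordered list of steps can be applied identically to every sample.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical types for illustration only; not the project's real API.
Sample = Dict[str, Optional[float]]  # gene identifier -> expression value
Step = Callable[[Sample], Sample]    # one stage of a workflow


@dataclass
class Processor:
    """A named, ordered list of steps applied the same way to every sample."""

    name: str
    steps: List[Step]

    def run(self, sample: Sample) -> Sample:
        # Feed the output of each step into the next one.
        for step in self.steps:
            sample = step(sample)
        return sample


def drop_missing(sample: Sample) -> Sample:
    """Remove genes that have no measured value."""
    return {gene: value for gene, value in sample.items() if value is not None}


def log2_transform(sample: Sample) -> Sample:
    """Move values to a log2 scale, adding 1 so that zero stays finite."""
    return {gene: math.log2(value + 1.0) for gene, value in sample.items()}


# An assumed example workflow: clean the sample, then transform it.
example_processor = Processor(
    name="example_log2",
    steps=[drop_missing, log2_transform],
)

harmonized = example_processor.run({"GENE_A": 3.0, "GENE_B": None, "GENE_C": 0.0})
print(harmonized)
```

Because the steps live in one place, a fix or improvement to a step benefits every dataset that is processed with it afterwards.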

A permanent compendium also has implications for collaboration and reproducibility. We’re designing tools to interact with persistent repositories, such as Zenodo, to allow researchers to store and share the dataset that they are analyzing. Researchers will be able to send links to their collaborators so everyone is working with the same data. A set of well-characterized processors also reduces the potential for errors that may arise due to skipped pre-processing steps. These processors are portable and can be run outside of the system as well, so there’s nothing stopping scientists from running a processor themselves and checking their pre-processed data against what is stored in the compendium.

The What

So far I’ve told you why we need this compendium, but not what kind of system is capable of fulfilling those needs. To be truly useful for everything explained above, we’ve designed the system to be:

  • Scalable
  • Reliable
  • Searchable
  • Transparent
  • Extendable
  • Cost Effective

Scalable, Reliable, Searchable

The first three properties may seem fairly obvious. To be able to handle the large and ever-growing volume of genomic data, the system will need to scale out to a large size. This will be achieved by running processor code in a cloud cluster. It will also need to be reliable. Once data is processed, it needs to be immutable and must never be lost. Fortunately, cloud providers such as Amazon, Google, and Microsoft have put a lot of work into data stores that seek to provide this guarantee. A large compendium of processed data won’t do anyone any good if they cannot find the data they want in it. Therefore, the system collects, stores, and exposes as much metadata as is available about the data it has processed.

Transparent

Metadata will also include details about every bit of work the system performs on any given data. Scientists will be able to inspect what techniques were applied, what the code that applied them looked like, what versions of packages were used, and when the data was processed. This will enable researchers to have confidence that the data they’re using was processed in the correct manner for their needs.
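As a rough illustration of what such a processing record could contain, the sketch below builds one in Python. The function name `provenance_record` and the field names are assumptions made for this example, not the system's actual schema. It captures the four things listed above: the technique applied, a fingerprint of the code, the package versions, and the processing time.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def provenance_record(processor_name, source_code, packages):
    """Build a JSON-serializable record of how a piece of data was processed.

    All field names here are hypothetical and chosen for illustration.
    """
    versions = {}
    for package in packages:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly rather than failing.
            versions[package] = None

    return {
        "processor": processor_name,
        # A hash lets anyone confirm they are looking at the same code.
        "code_sha256": hashlib.sha256(source_code.encode("utf-8")).hexdigest(),
        "python_version": platform.python_version(),
        "package_versions": versions,
        "processed_at": datetime.now(timezone.utc).isoformat(),
    }


record = provenance_record(
    processor_name="example_log2",
    source_code="def log2_transform(sample): ...",
    packages=["pip", "a-package-that-is-not-installed"],
)
print(json.dumps(record, indent=2))
```

Storing a record like this alongside each processed sample is what would let a researcher check, after the fact, exactly what was done to the data they downloaded.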

Extendable

There is not only a lot of data publicly available, but also a lot of kinds of data. There are many formats that data is collected in, along with many formats that it can be processed to. We do not expect that our small team will be able to write processors for all of these. So not only is the project open source, but it has also been designed to enable contributors to add a new Processor module without a comprehensive understanding of the system.
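One common way to get this kind of extensibility is a registry that new modules plug into. The Python sketch below shows that general pattern; it is an assumption about how such a design could look, not a description of the project's real contribution mechanism, and `register`, `PROCESSORS`, and `process` are hypothetical names.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a processor name to the function that runs it.
PROCESSORS: Dict[str, Callable[[dict], dict]] = {}


def register(name: str):
    """Decorator that adds a function to the registry under the given name."""

    def decorator(func: Callable[[dict], dict]) -> Callable[[dict], dict]:
        if name in PROCESSORS:
            raise ValueError(f"A processor named {name!r} is already registered")
        PROCESSORS[name] = func
        return func

    return decorator


# A contributor only needs to write this function and decorate it.
@register("uppercase_gene_ids")
def uppercase_gene_ids(sample: dict) -> dict:
    """Normalize gene identifiers to upper case."""
    return {gene.upper(): value for gene, value in sample.items()}


def process(name: str, sample: dict) -> dict:
    """Look up a processor by name and apply it to one sample."""
    return PROCESSORS[name](sample)


print(process("uppercase_gene_ids", {"tp53": 1.5, "brca1": 0.2}))
```

With a pattern like this, the rest of the system only ever talks to the registry, so a contributor can add a processor without reading the scheduling, storage, or search code.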

Cost Effective

Finally, the system will need to be cost effective. With the goal of processing all public genomic data, the cost could become considerable. We’re designing it to take advantage of cost-saving approaches. For example, we’ve designed the system to work with AWS Spot Instances, which substantially reduce the cost of compute.


The service is supported by Alex’s Lemonade Stand Foundation through the Childhood Cancer Data Lab. It is still under development, but we are well on our way to realizing the vision outlined above. If you’re interested in the progress of the project, you can see all of the code and progress here. In a couple of weeks we will share another post that will describe the system in more depth. If you’d like to get in contact with me to discuss this project or anything else, send an email or tweet at @datawheeler.

Kurt Wheeler