tl;dr: I work at the Childhood Cancer Data Lab, where we use very big data to find cures for childhood cancers. To move data around the internet at very high speeds, we are forced to use a proprietary software suite called Aspera. If somebody could make a Free Software alternative, the future of the internet would be way more awesome! Best of all, you can be the one to do it!
For the past thirty years, the web’s HTTP protocol has been really great at serving things like web pages, chat messages, images, and other objects at the kilobyte to megabyte scale. Unfortunately, the web was not designed for the “big data” objects of today - genetic sequences, large databases, high definition media, and other files at the terabyte to petabyte scale. HTTP transfers over the open internet are hampered by constant connection re-establishment and TCP overhead. Some reconnection problems have been addressed in HTTP/2, but not the overall throughput issues.
The fact is, for very large files, HTTP is slow.
However, there is an alternative, proprietary protocol - fasp, the Fast and Secure Protocol - more commonly known by the client/server software product name Aspera, which avoids these problems. Aspera can deliver speeds up to 1,000 times faster than HTTP/FTP. I honestly didn’t believe it until I saw it for myself. In fact, I had never even heard of Aspera before I started this job, and there wasn’t even an English-language Wikipedia article for FASP yet - although I’ve since written one.
Out of necessity, many organizations are now using this proprietary software for large data transfers, including public institutions such as the National Institutes of Health, the European Genome Archive/Sequence Read Archive, the National Cancer Institute/cancer.gov, and FEMA, and private companies such as Amazon, Netflix and the owner of Aspera, IBM. Basically, if you’re working at terabyte scale, you probably need this software.
Let me reproduce an experiment for you. We’re going to download an experiment’s worth of RNA-seq expression data from the Sequence Read Archive, which offers both FTP and Aspera downloads. On the left is FTP with wget, which starts first, and on the right the Aspera client, ascp. This test was performed on two different EC2 instances in us-east-1, fetching a 343MB FASTQ file from the Sequence Read Archive, which is in the UK.

As you can see, the ascp client performs over 200x faster than FTP, and can download the whole file in 9 seconds. It’s pretty magical.
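For a rough sense of scale, here’s the back-of-envelope arithmetic on that transfer (the FTP time is extrapolated from the ~200x figure above, not separately measured):

```python
# Back-of-envelope throughput for a 343 MB file transferred in 9 seconds.
size_mb = 343
ascp_seconds = 9

mb_per_s = size_mb / ascp_seconds      # sustained throughput in MB/s
mbit_per_s = mb_per_s * 8              # same figure in megabits per second
print(f"{mb_per_s:.1f} MB/s ≈ {mbit_per_s:.0f} Mbit/s")

# At 200x slower, the FTP download would have needed roughly half an hour:
ftp_minutes = ascp_seconds * 200 / 60
print(f"FTP at 200x slower: ~{ftp_minutes:.0f} minutes")
```

That’s around 305 Mbit/s sustained across the Atlantic, from a single client, over the open internet.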
It’s also necessary for our project’s basic feasibility. Without Aspera, it would take us months to years to download the millions of samples like this that we need to process. We’d have to fly all over the world with suitcases full of hard drives to be able to collect our data!
In theory, the protocol is simple-ish. The client establishes an SSH connection to the server to negotiate control of the data flow; a range of UDP ports above 33001, one per connection thread, then carries a unidirectional stream of data, while requests for retransmission are sent back over the SSH connection.
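To make the retransmission idea concrete, here’s a toy sketch of the receiver-side logic - entirely hypothetical, and not based on the actual fasp wire format. The receiver tracks which numbered datagrams have arrived and computes the gaps to request again over the control channel:

```python
def missing_ranges(received, highest_seen):
    """Return (start, end) ranges of sequence numbers that never arrived,
    to be sent back to the sender as retransmission requests (NACKs)."""
    gaps = []
    gap_start = None
    for seq in range(highest_seen + 1):
        if seq in received:
            if gap_start is not None:
                gaps.append((gap_start, seq - 1))
                gap_start = None
        elif gap_start is None:
            gap_start = seq
    if gap_start is not None:
        gaps.append((gap_start, highest_seen))
    return gaps

# Datagrams 2 and 5-6 were lost in flight; request only those again.
print(missing_ranges({0, 1, 3, 4, 7}, 7))  # [(2, 2), (5, 6)]
```

The key design point: because the data stream never waits for acknowledgements, a lost packet costs only its own retransmission, rather than stalling the whole window the way TCP does.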
I say in theory, however, because the software is proprietary. So, I have no way of knowing what the protocol is actually doing during a file transfer, just as I have no way of knowing what the client software is doing to my laptop, or what the server software is doing to my servers.
Aspera is owned by IBM - does their client spy on what else is running on my system and report it back to their headquarters as business intelligence? If asked, would they share this information with the government? Without the source code, it’s impossible to tell!
Those were things people used to worry about in the very early days of the internet, until the invention of the Apache web server, which, along with GNU/Linux, truly democratized the web and allowed the explosion of growth which we all benefitted from.
I think a Free Software implementation of Aspera, or something similar, will cause a similar revolution for big data.
There are obviously many immediate practical implications of having a Free fasp client/server/library implementation.
Firstly, and most relevantly to our work at the Childhood Cancer Data Lab, biologists, physicists and other scientists who deal with “big data” won’t have to use proprietary software to retrieve and share the data they need in a timely fashion. That alone would be a massive boost for science, for software freedom and for data sharing everywhere. But the benefits don’t stop there!
There are also massive implications for system administrators and software developers. For example, many organizations have moved to using Docker images as part of their system deployments, and although the best practice is to keep these images as small as possible, in practice this is rarely the case, and images can balloon up to many tens or even hundreds of gigabytes, especially if they contain large machine learning models or training datasets.
How many thousands of hours of engineer time have already been wasted waiting for slow pulls from Docker Hub? A switch to FASP could cut hours down to seconds.
There are also large business implications. Global businesses that need to share large amounts of data across different regions, especially if that data has to be transferred over the open internet, could massively benefit from radically faster transfers - and in fact, this is IBM/Aspera’s business model!
Although the nature of the protocol makes it difficult to use on some personal networks, I do think that elements of the technology could benefit consumer file-sharing, which has not had any major revolutions since the adoption of BitTorrent in the early 2000s - nearly 20 years ago! BitTorrent isn’t actually too dissimilar to Aspera, so maybe a free FASP implementation would end up replacing or augmenting file-locker services like Mega, which serve massive files over HTTP - do you want to be the next Kim Dotcom? A detailed showdown of BitTorrent versus Aspera would also be really useful for understanding this problem space.
Lastly, and definitely not least, I think this project could significantly benefit journalism and publishing.
We are entering the age of big data journalism, and I think the start of that era has been marked by the publishing of the Panama Papers, which was 2.7 terabytes of leaked information. Leaking that over HTTP/FTP upload could take weeks!
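To put a number on “weeks”: here’s the arithmetic, assuming a 10 Mbit/s residential uplink (my assumption for a typical home connection of that era, not a figure from the leak reporting):

```python
leak_bytes = 2.7e12          # 2.7 TB, the size of the Panama Papers leak
uplink_bits_per_s = 10e6     # assumed 10 Mbit/s sustained residential uplink

seconds = leak_bytes * 8 / uplink_bits_per_s
days = seconds / 86400
print(f"~{days:.0f} days of continuous, uninterrupted upload")  # ~25 days
```

And that’s assuming the connection never drops once over more than three weeks - over a flaky HTTP/FTP upload, a single interruption without resume support means starting over.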
In the future, more and more investigative journalism will be driven by leaks of databases and the publishing of massive caches of documents. Currently, the best way that I know of to transfer 2.7TB is to send a bunch of external hard drives in the post. With better FASP Free infrastructure, that won’t be the case forever.
I see two approaches to solving this problem - a reverse-engineering solution and a “clean room” solution. The FASP protocol is not publicly documented, and the technology itself is patented (with a misspelling in the patent title), which may scare off some of the usual suspects like Red Hat or Google from developing this technology further.
So, it’s up to us hackers!
The first approach would be to reverse engineer the protocol to create a Free client that can speak to existing servers. This would be immediately useful to us at the CCDL, as we’re not able to distribute proprietary software in our Docker images, so we have to re-run the Aspera installer for every server we instantiate, which costs us time and money.
From there, a server could be created to work in tandem with the Free client. This would benefit us at the CCDL as well, as we want to share our massive dataset with other cancer researchers and machine learning researchers without having to use AWS S3’s HTTP methods.
Although this method is probably easier, and perhaps more immediately useful, it may make the lawyers nervous, which could hinder business adoption in the long term, and it also leaves the broader community without a Free protocol specification.
An alternative approach would be to ignore the existing Aspera server and client implementations and to try to define a new transmission protocol which solves the same problem at the same level of performance without directly copying the methods.
This method will probably have a higher risk of overall project failure, as the timeframe will be longer and the amount of cooperation required will be a lot higher, but the ultimate product would be better: a new, Free protocol and reference client/server implementations to power the future of big data file transfers. This way, we’re not limited by the restrictions of Aspera as it stands, and we can potentially add new features, such as parallel endpoints and content discovery.
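As a tiny illustration of the “parallel endpoints” idea - a sketch of one possible design, not anything Aspera actually does - a sender could split a file into contiguous byte ranges, one per UDP stream or per peer:

```python
def chunk_ranges(file_size, n_streams):
    """Split file_size bytes into n_streams contiguous (offset, length)
    ranges, so each stream can transfer its slice independently."""
    base, extra = divmod(file_size, n_streams)
    ranges, offset = [], 0
    for i in range(n_streams):
        length = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((offset, length))
        offset += length
    return ranges

# A 10-byte "file" over 3 streams: slices of 4, 3 and 3 bytes.
print(chunk_ranges(10, 3))  # [(0, 4), (4, 3), (7, 3)]
```

Each slice could then be served from a different endpoint entirely, BitTorrent-style, while still keeping fasp-style UDP blasting within each stream.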
Now, here’s the really exciting part: You could be the one to do it!
If you’re a great developer who loves free software but has never been a leader of a major Free software project, I strongly encourage you to try it! It won’t lead you to riches (outside of healthy consulting fees), but it can certainly lead you to awesome grant opportunities, chances to travel internationally for free to speak at conferences, lots of new friends, and it will greatly improve your programming, project management and team leadership skills. Give it a shot!
Maybe you’re working on a CS masters degree and are looking for a thesis project? Maybe your company uses Big Data and gives you 20% time to work on whatever you like? Or maybe you’re just looking for a new hobby project! Give it a go!
It’ll also be a great way to show off the awesome benefits of [[your-favorite-language-of-choice]] - look at you, Gophers and Rustaceans! Python - do you have it on you? Prove your language is the best for big data!
I really think that as data becomes bigger and bandwidth becomes more plentiful over the next 5, 10 and 20 years, whatever Free software suite fills this void will become as important as Apache, wget and BitTorrent are today. Be part of the future!
So, are you interested? Shoot me an email at miserlou [at without parenthesis] gmail [dot without parenthesis] com! If there’s interest, I’ll put together a Slack or a mailing list or a GitHub project where we can discuss a plan of attack!
Let’s make big data Free again!