Data explosion: genetic research could soon become the biggest player in the ocean of big data. DNA sequencing of plants, animals and humans is rapidly increasing, and with it the amount of data produced. In just ten years it could amount to several exabytes per year and leave YouTube and other data giants behind, as researchers report in the journal PLoS Biology. It is therefore urgent to prepare technically for this challenge.
When it comes to sheer data volume, YouTube, Twitter and, in research, astronomy and particle physics are currently far ahead. On YouTube alone, 300 hours of video are uploaded every minute, and up to an exabyte of new data has to be stored every year. That corresponds to a quintillion (10^18) bytes, or a million terabytes.
DNA sequencing explosion
But in ten years, a new player in the field of big data could surpass all previous ones: genetic research. “Because sequencing technology is getting better and better and costs are coming down, we expect an explosion of DNA sequencing that will generate a huge flood of data,” says study leader Gene Robinson of the University of Illinois at Urbana-Champaign.
The amount of genetic data is doubling every seven months, according to the researchers, and this trend will continue: in biology alone, several projects aim to decode the genomes of as many known plant and animal species as possible. “By 2025, we estimate that there will be at least 2.5 million fully sequenced genomes of animal and plant organisms,” the scientists say. Several million microbial genomes will be added to this number.
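To get a feel for what a seven-month doubling time means over the article's ten-year horizon, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: it assumes the doubling time stays constant for the whole decade, which is an extrapolation rather than a figure from the study, and the function name is made up for this example.

```python
import math

DOUBLING_MONTHS = 7   # doubling time reported by the researchers
HORIZON_YEARS = 10    # the article's ten-year outlook


def growth_factor(years: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Growth factor of a quantity that doubles every `doubling_months` months.

    Assumption (not from the study): the doubling time is constant.
    """
    doublings = years * 12 / doubling_months
    return 2.0 ** doublings


factor = growth_factor(HORIZON_YEARS)
orders = math.log10(factor)  # growth expressed in orders of magnitude
print(f"{factor:,.0f}-fold growth, about {orders:.1f} orders of magnitude")
```

Under that assumption, ten years contain a little over 17 doublings, which works out to a growth factor on the order of 10^5, or roughly five orders of magnitude.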
Up to 40 exabytes of genetic data – per year
Human DNA testing is an even bigger growth market. It is now possible to examine a person’s genome for certain genes and complete sequencing is becoming cheaper. Several countries, including China, the United States, Great Britain and Saudi Arabia, have already announced that they intend to decode the DNA of between 100,000 and one million of their citizens and register it for research purposes.
“Therefore, by 2025, it is estimated that between 100 million and two billion human genomes could have been sequenced, which corresponds to a growth of four to five orders of magnitude in ten years,” the researchers say. This means that by 2025 between two and 40 exabytes of new genetic data could be generated each year for human genomes alone. “This far exceeds the growth of today’s major data producers like YouTube or astronomy.”
Clouds for genetic data
And raw DNA sequences are not everything: “The sequence data must be analyzed using complex and often computer-designed algorithms,” explains co-author Saurabh Sinha of the University of Illinois at Urbana-Champaign. Only then can biologically relevant patterns be found in the genetic data and comparisons be made, which can help, for example, in diagnosing diseases.
In addition, genetic data accumulates in many places and is distributed and accessed worldwide through the Internet. In the future, more and more DNA data will therefore be stored in the cloud to facilitate exchange. “All this is even more difficult than sequencing the DNA and storing this information,” says Sinha.
A “four-headed beast”
In view of this flood of data, the researchers are urgently calling for the digital infrastructure to be adapted to these challenges in time. “Genomics is a four-headed beast,” say Robinson and colleagues. “The requirements for data acquisition, storage, distribution and analysis will be among the largest in the entire big data field.”
According to the researchers, genomics could be one of the greatest computer technology challenges humanity has ever had to overcome. While the amount of data on YouTube and the like has grown gradually and at the pace of advances in data transmission and storage technology, genomics could outpace technical development. “We need dramatic innovations in computer technology,” says Robinson. “Now is the time to start.”