Future trends for e-infrastructures and life sciences

Bioinformatics is evolving. Life sciences are undoubtedly at the frontier of Big Data, as data, algorithms and knowledge become increasingly available to all. Bioinformatics is rapidly transitioning from hypothesis-driven studies to data-driven simulations of whole systems, leading to the emergence of a field aptly named “integrative information biology”. As such, the analysis of massive data sets and the extraction of meaningful information and knowledge are becoming the driving force of cutting-edge research. E-infrastructures, such as the European Grid Infrastructure (EGI), have been a key factor in this transition and are set to play a critical role in shaping research in the life sciences [1].

EGI is the result of pioneering work that has, over the last decade, built a collaborative production infrastructure of uniform services through the federation of national resource providers, supporting multi-disciplinary science across Europe and around the world. An ecosystem of national and European funding agencies, research communities, technology providers, technology integrators, resource providers, operations centres, over 350 resource centres, coordinating bodies and other functions has now emerged to serve over 21,000 researchers across more than 15 research disciplines, whose data-intensive analyses are carried out by over 1.4 million computing jobs a day.

The current trend in life science ICT infrastructure uptake is to use the most diverse and state-of-the-art computing technologies (Grid, Cloud and now Mist, among others) to access storage, memory and compute capacity in whichever country or resource federation can provide these services. The computational requirements of most standard bioinformatics workflows continue to rise. A characteristic case is the analysis of NGS data: READemption, a pipeline for the computational evaluation of RNA-Seq data [2], requires 2-3 VMs in normal use, rising to around 30 VMs at peak, with more than 20 cores, more than 70 GB of memory and around 1 TB of storage. Chipster, user-friendly analysis software for high-throughput data [3], has similar requirements: each server needs over four virtual CPUs, over 16 GB of RAM and around 500 GB of storage at each site. These requirements can be met through ICT infrastructures such as EGI and made available to the wider scientific community.
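To illustrate how such figures translate into a concrete capacity request on a federated cloud, the short Python sketch below aggregates them into a single resource envelope. The data structure, the helper function and the reading of the peak figures as a per-deployment total are illustrative assumptions made here, not part of either tool's documentation.

```python
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    """Rough resource envelope for one deployment (illustrative values only)."""
    name: str
    vms: int          # number of virtual machines
    cores: int        # total virtual CPU cores
    memory_gb: int    # total RAM in GB
    storage_gb: int   # total storage in GB

# Figures taken from the text above; treat them as order-of-magnitude estimates.
reademption_peak = ResourceRequest("READemption (peak)", vms=30, cores=20,
                                   memory_gb=70, storage_gb=1000)
chipster_per_site = ResourceRequest("Chipster (per site)", vms=1, cores=4,
                                    memory_gb=16, storage_gb=500)

def total(requests):
    """Sum the envelopes to size an allocation request to a resource federation."""
    return ResourceRequest(
        name="total",
        vms=sum(r.vms for r in requests),
        cores=sum(r.cores for r in requests),
        memory_gb=sum(r.memory_gb for r in requests),
        storage_gb=sum(r.storage_gb for r in requests),
    )

if __name__ == "__main__":
    # A hypothetical request covering one READemption peak run and three Chipster sites.
    deployments = [reademption_peak] + [chipster_per_site] * 3
    print(total(deployments))
```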

The emergence of new tools and e-infrastructures such as EGI has given rise to a substantial number of publications, ranging from large-scale simulations to disease-oriented structural models in proteomics. Although these results are far from new, they point to a critical change: research has shifted radically. It is evident that, besides developing new applications and improving existing ones, the life sciences and ICT communities need to capitalize on their synergies to strengthen the disciplines at their border.

References

[1] Duarte, A. M. S., Psomopoulos, F. E., Blanchet, C., Bonvin, A. M. J. J., Corpas, M., Franc, A., et al. (2015). Future opportunities and trends for e-infrastructures and life sciences: going beyond the grid to enable life science data analysis. Frontiers in Genetics 6:197. doi: 10.3389/fgene.2015.00197

[2] Förstner, K. U., Vogel, J., and Sharma, C. M. (2014). READemption – a tool for the computational analysis of deep-sequencing-based transcriptome data. Bioinformatics 30, 3421–3423. doi: 10.1093/bioinformatics/btu533

[3] Kallio, M. A., Tuimala, J. T., Hupponen, T., Klemelä, P., Gentile, M., Scheinin, I., et al. (2011). Chipster: user-friendly analysis software for microarray and other high-throughput data. BMC Genomics 12:507. doi: 10.1186/1471-2164-12-507