EOSCpilot Science Demonstrator: Pan-Cancer Analyses: portable federated cloud-based solution

CONTEXT

The Pan-Cancer Analysis of Whole Genomes Project (PCAWG)

The pan-cancer analysis of whole genomes project (PCAWG) is analysing large cohorts of cancer genomes and pursuing so-called pan-cancer studies to identify factors that may be involved in tumour formation and disease progression across multiple cancer types. PCAWG is currently analysing >2800 cancer whole genomes, largely on academic and public clouds, and is also developing approaches for integrating these data with transcriptome and clinical data to address specific hypotheses.

The EOSCpilot science demonstrator aims to establish a portable cloud-based federated solution for collaborative cancer genomics and associated health data management, and an environment accessible to European scientists for analysis.

OBJECTIVES

The aim of PanCancer is to develop interoperable IT frameworks that enable standardized sharing and large-scale processing of cancer genomes together with other molecular and clinical data, in support of biological and translational breakthroughs. To do so, interoperable frameworks have been employed to process ~10,000 cancer Whole Genome Sequenced (WGS) tumour-normal pairs from the 20 most common cancer types. The main research objective is to uncover genome-wide patterns of different types of genetic variation, which requires the availability of WGS data and their integration with molecular, demographic and clinical data. PanCancer will also make it possible to proactively create suitable standards and interoperability.

MAIN ACHIEVEMENTS

The Butler scientific workflow framework has been set up and tested in three globally distributed cloud computing environments based on the OpenStack platform: the EMBL-EBI Embassy Cloud in the UK, the Cyfronet cloud in Poland, and the ComputeCanada cloud in Canada.

  • >400 high-coverage whole genome samples (~60 TB of data) from the ICGC pediatric brain cancer cohort were downloaded to the ComputeCanada cloud.
  • ~50 TB of public data from the 1000 Genomes project was loaded onto the Cyfronet environment from the EMBL-EBI data servers using Cyfronet’s Oneprovider software.
  • Butler was used to run a genomic alignment workflow (based on BWA and developed at The Sanger Institute) on >400 samples at ComputeCanada and >400 samples at the EMBL-EBI Embassy Cloud, with over 100 TB of data processed to date. Proper operation of the infrastructure was ensured by Butler’s detailed monitoring and self-healing capabilities.

IMPACT

Engagement in the EOSC will enable Europe to maintain its current global competitive advantage in pan-cancer analyses and foster cloud-based solutions for genomic data analysis. Broader impact and the ability to reuse solutions in other areas (e.g. for cardiovascular and neurodegenerative diseases) are also foreseen.

RECOMMENDATIONS FOR THE IMPLEMENTATION

  • Improve the availability of computational resources
  • Improve systems’ stability and storage capabilities
  • Introduce clear Service Level Agreements between service providers and service consumers to ensure that the consumers’ requirements can be met

Learn more about this Science Demonstrator on the EOSCpilot website.


The EOSC portal has been jointly developed and maintained by the eInfraCentral, EOSC-hub, EOSCpilot and OpenAIRE-Advance projects, funded by the European Union’s Horizon 2020 research and innovation programme, with the contribution of the European Commission.