The Cancer Genomics Linkage Application will enable the integration and re-use of cancer genomics data available from public repositories such as the International Cancer Genome Consortium (ICGC). This will be accomplished through the capability being developed by the “Early Activity” of the Genomics Virtual Laboratory (GVL-EA). It will enable researchers, such as Professors Andrew Biankin and John Mattick (Garvan Institute for Medical Research) or Sean Grimmond (Queensland Centre for Medical Genomics), to access genomic datasets of international importance and to integrate them with their own clinical and genomic datasets in order to explore, discover and validate the key genomic abnormalities that cause cancer. The product will further provide a mechanism for such researchers to publish their analyses and make them available for re-use by the community.

The product aims to let biologists and clinicians easily integrate their own research data with datasets from multiple data sources. Integrating the datasets into a common location, and enabling access and mining through best-practice workflow tools, will enable Australian cancer researchers to accelerate their discovery processes and to be internationally competitive. Although this project has a particular focus on pancreatic cancer research as carried out by the Australian Pancreatic Cancer Genome Initiative (APGI), the application can also support the wider cancer research community.

Download the application from here.

Friday 15 March 2013

Final Product



The Cancer Genomics Linkage Application, funded by ANDS, enables the re-use and integration of data available from public repositories such as the ICGC variant database or the DrugBank drug and drug target database by leveraging the Genomics Virtual Lab capability on the research cloud. Researchers, such as Professor Andrew Biankin and colleagues from the Garvan Institute for Medical Research, are now able to access genomic datasets of international importance and to integrate them with their own clinical and genomic datasets in order to explore, discover and validate the key genomic abnormalities that cause cancer, using user-friendly computational workflows. The project further provides a mechanism for such researchers to publish their analyses and make them available for re-use by the community.

The solution developed for this project consists of the following components, which are shown in Figure 1 "Overall Overview" and described in detail below:
  1. Collection Manager
  2. Third Party Solution BioMAJ
  3. Data Galaxy Servlet
  4. Galaxy Data Link Tool
  5. Galaxy Server
  6. Workflow Galaxy Servlet
  7. Automatic generation of collections descriptions and their submission to RDA
  8. OAI-PMH Server
Figure 1: Overall Overview



1. Collection Manager

The Collection Manager is a web interface to a MySQL database that allows Galaxy administrators and users to edit and curate collection (data), service (Galaxy instance) and workflow descriptions. The metadata of collection and workflow descriptions (e.g. title, list of associated sites, collection rights, ANZSRC codes) can be modified. The Collection Manager also allows them to publish workflow descriptions to RDA. A detailed user guide can be found here.

Figure 2: Welcome to the Collection Manager Interface
 

2. Integration of Third Party Solution BioMAJ for Data Synchronisation


The download scheduler BioMAJ is used to mirror reference datasets such as ICGC and DrugBank from public repositories. BioMAJ Watcher is its web interface. A shell script has been developed that automatically sends a POST request to the Data Galaxy Servlet whenever BioMAJ downloads or updates a data library. The Data Galaxy Servlet is described in the following section.
Figure 3: BioMAJ Watcher web Interface - general functionalities
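
The notification step described above can be sketched as follows. This is only an illustration in Python, not the project's shell script: the servlet URL, the field names (bank, path, release) and the example values are all hypothetical, since the post does not document the actual request format.

```python
import urllib.parse
import urllib.request


def build_notification(servlet_url, bank_name, data_path, release):
    """Build (but do not send) a form-encoded POST request.

    The field names used here are assumptions made for illustration;
    the real Data Galaxy Servlet may expect different parameters.
    """
    fields = {"bank": bank_name, "path": data_path, "release": release}
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        servlet_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical endpoint and values, for illustration only.
request = build_notification(
    "http://localhost:8080/data-galaxy-servlet",
    "icgc",
    "/data/biomaj/icgc/current",
    "example-release",
)
# urllib.request.urlopen(request) would actually send the notification.
```

Building the request separately from sending it makes the payload easy to inspect before it is wired into a BioMAJ post-process step.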

3. Data Galaxy Servlet

The Data Galaxy Servlet is a Java servlet that creates a new record accessible by the Collection Manager containing all the information and metadata related to the mirrored reference dataset (Figure 4). An email informs the owner that the reference dataset description is ready to be modified and published to RDA.

Figure 4: Collection Manager - Data Library Overview

4. Galaxy Data Link Tool

The Galaxy Data Link Tool is a Python script that links the reference datasets downloaded by BioMAJ with Galaxy (Figure 5). The tool registers the specified data files as a Galaxy Data Library resource; the files are linked to the specified data path rather than copied. If the specified path is a directory of files, its directory structure is automatically mirrored in Galaxy. The tool can be used as a post-process step in conjunction with the data synchronisation tool BioMAJ.

Figure 5: Data Libraries in Galaxy linked by Galaxy Data Link Tool
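
The directory mirroring described above can be sketched with the Python standard library alone. This is not the project's tool: the function name plan_library_links is hypothetical, and the sketch only computes which library folder each file would be linked under, leaving out the call that registers the link with Galaxy.

```python
import os


def plan_library_links(data_path):
    """Return sorted (library_folder, absolute_file_path) pairs.

    A single file maps to the library root "/"; for a directory, each
    file maps to a folder that mirrors its location below data_path.
    """
    data_path = os.path.abspath(data_path)
    if os.path.isfile(data_path):
        return [("/", data_path)]
    links = []
    for root, _dirs, files in os.walk(data_path):
        rel = os.path.relpath(root, data_path)
        # Library folders use "/" separators regardless of the host OS.
        folder = "/" if rel == "." else "/" + rel.replace(os.sep, "/")
        for name in files:
            links.append((folder, os.path.join(root, name)))
    return sorted(links)
```

Each returned pair could then be handed to whatever Galaxy API call performs the link, so the planning logic can be checked without a running Galaxy server.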




5. Galaxy Server

To allow automated feeds of workflow RIF-CS records from Galaxy to RDA, an extra button has been added to Galaxy, as shown in Figure 6. To implement this button, the galaxy-dist code hosted on BitBucket was forked and modified. The button initiates a POST request to the Workflow Galaxy Servlet; the request contains all the information and metadata related to the published workflow. The Workflow Galaxy Servlet is described in the following section.

Figure 6: New Galaxy feature: Publish workflow to RDA


6. Workflow Galaxy Servlet


The Workflow Galaxy Servlet is a Java servlet that handles the requests initiated from the Galaxy server described above. The servlet creates a new record, accessible via the Collection Manager (Figure 7), that contains all the information and metadata related to the Galaxy workflow (Figure 8). An email informs the owner that the workflow description is ready to be modified and published to RDA.

Figure 7: Workflow description entry with the Collection Manager
Figure 8: Galaxy Workflow

7. Automatic generation of collections descriptions and their submission to RDA

Metadata stored in the MySQL database are aggregated to form a collection description and written to a compliant RIF-CS XML file using the ANDS-supplied RIF-CS Java library. Persistent record identifiers are assigned and the RIF-CS files are made accessible to an RDA harvest data source.
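
To give a feel for the kind of record this step produces, here is a minimal Python sketch that serialises one collection description as RIF-CS. It is a stand-in for illustration, not the ANDS Java library used by the project. The dictionary keys and the example values are hypothetical, and the element names reflect my reading of the RIF-CS schema, so they should be checked against the schema before use.

```python
import xml.etree.ElementTree as ET

RIFCS_NS = "http://ands.org.au/standards/rif-cs/registryObjects"


def collection_to_rifcs(record):
    """Serialise one collection description as a RIF-CS XML string.

    `record` is a plain dict; its keys are assumptions made for this
    sketch, not the Collection Manager's actual database columns.
    """
    ET.register_namespace("", RIFCS_NS)

    def q(tag):
        # Qualify a tag name with the RIF-CS namespace.
        return "{%s}%s" % (RIFCS_NS, tag)

    root = ET.Element(q("registryObjects"))
    obj = ET.SubElement(root, q("registryObject"), {"group": record["group"]})
    ET.SubElement(obj, q("key")).text = record["key"]
    ET.SubElement(obj, q("originatingSource")).text = record["source"]
    coll = ET.SubElement(obj, q("collection"), {"type": "dataset"})
    name = ET.SubElement(coll, q("name"), {"type": "primary"})
    ET.SubElement(name, q("namePart")).text = record["title"]
    desc = ET.SubElement(coll, q("description"), {"type": "brief"})
    desc.text = record["description"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only.
xml_text = collection_to_rifcs({
    "group": "Example Group",
    "key": "example.org/collection/1",
    "source": "http://example.org/collection-manager",
    "title": "International Cancer Genome Consortium",
    "description": "Mirrored reference dataset.",
})
```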

8. OAI-PMH Server (Records indexing)

The OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) server exposes the RIF-CS XML files generated by the Collection Manager as items in an OAI data repository and makes them available for harvesting by RDA. Once the collection descriptions have been harvested, they are available through RDA as shown in Figure 9 below.

Figure 9: Research Data Australia - Published Data Library "International Cancer Genome Consortium"
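
From the harvester's side, an OAI-PMH harvest is an HTTP request carrying a verb and a metadata format. The sketch below builds such a ListRecords request URL in Python. The base URL is hypothetical, and the default metadata prefix "rif" is an assumption about what the RDA harvester requests; the verb and parameter names come from the OAI-PMH specification.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="rif", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    `metadata_prefix` defaults to "rif", which is an assumption about
    the RDA harvester; `set_spec` optionally restricts the harvest.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical repository address, for illustration only.
url = list_records_url("http://example.org/oai")
```

Fetching this URL from a running OAI-PMH server would return the exposed records wrapped in an OAI-PMH response envelope.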
  
The Cancer Genomics Linkage Application has been developed and tested on a development server from the Genomics Virtual Laboratory Project, while the production servers are being deployed, tuned and configured on the Research Cloud. Some of the data sources and tools are currently available at the Garvan Institute and will be deployed to the other GVL nodes as they become available during the year 2013. The variant detection workflow developed for the Garvan Institute will be made publicly available and published on the RDA as soon as the tools developed by Professor Sean Grimmond’s group are published in the research literature.

Licensing

All documents and source code are made available under the GPLv3 licence via Google Code - Project AP27 Cancer Genomics Linkage Application.
 
