Interoperability
Bioinformatics resources (tools and databases) currently available are focused on particular types of biological entities (genes, proteins, mRNA, lncRNA, small RNA, etc.) or interactions (protein-protein, metabolic reactions, transcriptional regulation, transport, etc.). Exploiting these heterogeneous resources jointly requires both interoperability solutions, which aim at providing uniform access to diverse and distributed resources, and integration processes, which establish physical and interpretable relationships between complex datasets. Making biological resources more interoperable is an essential requirement for taking full advantage of their obvious complementarity and gaining new insights in integrative biology. Interoperability is closely linked to the FAIR principles (Wilkinson et al., 2016). The IFB has worked to clarify the needs of the communities involved in the selected pilot projects and to identify existing standards and interoperability solutions within IFB and at the European level.
The aim is threefold:
- Identify solutions for interoperability and integration, based on those already implemented for French and European biological resources.
- Support the development of missing solutions (in the broadest sense of the term) when a need is identified.
- Promote these state-of-the-art solutions and facilitate their adoption by the community of bioinformatics resource developers and providers.
Several levels of interoperability and integration are usually considered, each associated with families of solutions and good practices:
- Physical integration, relying on data warehousing strategies, either based on generic tools such as BioMart or InterMine, or implemented with domain-specific platforms such as i2b2/tranSMART for biomedical data, which are the environments promoted within EU IMI translational medicine programs.
- Technical interoperability, relying on standard data exchange protocols that make it possible to develop simple and pragmatic programmatic interfaces for accessing databases (a minimal REST/JSON sketch is given after this list).
- Syntactic interoperability, relying on standard exchange formats for data or metadata (the most commonly used today in bioinformatics being XML, RDF, OWL, JSON).
- Semantic interoperability, relying on standards describing the meaning of the data. This may be achieved by agreeing on standard terminologies to annotate the data and by exploiting ontologies, whose terms describe the concepts of a field and the semantic relations between them. A major challenge in this respect is to establish mappings between similar concepts defined in different ontologies (a minimal ontology-mapping sketch is given after this list). Deploying interoperable resources will require ensuring consistency with ontologies defined by international resources, but also stimulating the participation of French communities in international consortia (e.g., ELIXIR) to foster the evolution of the standards according to the life science community's needs.
- Tool interoperability, relying on standards to uniformly represent and combine heterogeneous tools. Workflow environments, which make it possible to chain various bioinformatics tools, play a key role at this level.
- Workflow interoperability, relying on standards to uniformly represent complete analysis pipelines designed in different workflow systems. In particular, this enables workflows designed in environments with graphical user interfaces (typically Galaxy) to be translated into workflow languages that can be run on the command line (e.g., the Common Workflow Language, Snakemake); a minimal CWL sketch is given after this list.
- Statistical integration, relying on advanced multivariate methodologies based on dimension reduction and variable selection strategies (e.g., Partial Least Squares regression, generalized canonical correlation analysis, multiple co-inertia analysis, together with their sparse counterparts), able to cope with the multidimensionality of omics and other biological data such as imaging, immunophenotyping, electronic health records, etc. (a minimal sketch of two-block integration is given after this list).
- Specialized visualization, relying on dedicated components for the representation of single-level information (e.g., genome browsers) or for aggregated views of multi-scale data (e.g., network connectivity or multivariate statistical analysis representations), both of which offer dynamic human-computer interactions enabling interactive exploration of complex data. Most current visualization tools are fully integrated into web browsers. Recent developments such as BioJS, based on improvements in JavaScript technology, make it possible to rapidly set up high-level web interfaces. Providing software solutions able to interact with database and tool APIs will let bioinformaticians focus on designing user-friendly interfaces that meet end-users' needs.
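As an illustration of the technical and syntactic levels, the minimal Python sketch below queries a public bioinformatics REST service and parses its JSON response. The Ensembl REST API and the gene identifier are purely illustrative assumptions; the same pattern applies to any database exposing a documented REST interface.

```python
"""Minimal sketch of technical/syntactic interoperability: querying a public
REST endpoint and parsing the standard JSON exchange format it returns.
The Ensembl service and identifier below are illustrative choices only."""
import json
import urllib.request

SERVER = "https://rest.ensembl.org"          # example public REST service
ENDPOINT = "/lookup/id/ENSG00000157764"      # example gene identifier (human BRAF)

request = urllib.request.Request(
    SERVER + ENDPOINT,
    headers={"Accept": "application/json"},  # request a standard exchange format
)
with urllib.request.urlopen(request, timeout=30) as response:
    record = json.load(response)             # syntactic layer: plain JSON parsing

# The resulting dictionary can be combined with records from other services.
print(record.get("display_name"), record.get("species"), record.get("biotype"))
```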
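At the semantic level, mappings between similar concepts from different ontologies can be expressed with standard vocabularies such as SKOS and queried with SPARQL. The sketch below, which assumes the rdflib Python package and uses placeholder term identifiers, shows the principle rather than an actual curated mapping.

```python
"""Minimal sketch of semantic interoperability: recording a mapping between
two concepts defined in different ontologies with SKOS, then querying it.
Requires the rdflib package; the term identifiers below are placeholders."""
from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

# Namespaces of two ontology providers (term identifiers below are placeholders).
EFO = Namespace("http://www.ebi.ac.uk/efo/")
OBO = Namespace("http://purl.obolibrary.org/obo/")

g = Graph()
# Assert that two concepts from different ontologies denote the same notion.
g.add((EFO["EFO_0000001"], SKOS.exactMatch, OBO["NCIT_C0000001"]))  # hypothetical mapping

# A SPARQL query can then resolve annotations across both vocabularies.
results = g.query(
    "SELECT ?source ?target WHERE { ?source skos:exactMatch ?target . }",
    initNs={"skos": SKOS},
)
for source, target in results:
    print(source, "maps to", target)
```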
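For workflow interoperability, the Common Workflow Language provides an engine-agnostic description of tools and pipelines. The sketch below builds a minimal CWL tool description from Python and serialises it to YAML (assuming the PyYAML package); the wrapped command (grep -c) is a toy example, not one of the pilot projects' tools.

```python
"""Minimal sketch of workflow interoperability: a tool description expressed
in the Common Workflow Language (CWL), built as a Python structure and
serialised to YAML so any CWL-compliant engine can run it.
Requires the PyYAML package; the wrapped command is a toy example."""
import yaml

cwl_tool = {
    "cwlVersion": "v1.2",
    "class": "CommandLineTool",
    "baseCommand": ["grep", "-c"],           # count lines matching a pattern
    "inputs": {
        "pattern": {"type": "string", "inputBinding": {"position": 1}},
        "infile": {"type": "File", "inputBinding": {"position": 2}},
    },
    "stdout": "count.txt",
    "outputs": {
        "count": {"type": "File", "outputBinding": {"glob": "count.txt"}},
    },
}

with open("count_matches.cwl", "w") as handle:
    yaml.safe_dump(cwl_tool, handle, sort_keys=False)

# The resulting file is engine-agnostic and can be executed, for instance, with:
#   cwltool count_matches.cwl --pattern ATG --infile reads.fasta
```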
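For statistical integration, dimension-reduction methods such as PLS relate two or more data blocks measured on the same samples. The sketch below uses scikit-learn on simulated data as an assumed setting; the sparse and multi-block variants mentioned above are available in dedicated packages.

```python
"""Minimal sketch of statistical integration of two omics blocks with
partial least squares (PLS), using scikit-learn on simulated data.
Block sizes and the number of components are arbitrary illustrative choices."""
import numpy as np
from sklearn.cross_decomposition import PLSCanonical

rng = np.random.default_rng(0)
n_samples = 40

# Two simulated omics blocks sharing a common latent signal.
latent = rng.normal(size=(n_samples, 1))
proteomics = latent @ rng.normal(size=(1, 200)) + rng.normal(size=(n_samples, 200))
metabolomics = latent @ rng.normal(size=(1, 80)) + rng.normal(size=(n_samples, 80))

# Fit a two-component PLS model linking the two blocks.
pls = PLSCanonical(n_components=2)
x_scores, y_scores = pls.fit_transform(proteomics, metabolomics)

# Correlation of the first pair of latent variables summarises the shared structure.
r = np.corrcoef(x_scores[:, 0], y_scores[:, 0])[0, 1]
print(f"Correlation of the first latent variables: {r:.2f}")
```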
A concern shared by three pilot projects is the semantic interoperability of metadata describing datasets or data resources. Metadata should be expressed with standard terms derived from available ontologies and according to a standard format. Teams involved in the INEX-MED and IntegrParkinson projects are working on the design of knowledge graphs suited to integrating omics data with medical images for machine learning applications. In the PhenoMeta project, bioinformaticians already active at the European level in the construction of plant standards such as MIAPPE and BrAPI want to develop a plant phenotyping ontology to make plant datasets interoperable and their processing reproducible.
Another need, identified in three other pilot projects, relates to tool interoperability. The ProMetIS and MS2MODELS projects wish to develop software based on the flexible combination of interoperable tools in the field of multi-omics analyses (proteomics and metabolomics for the former, proteomics and 3D-interactomics for the latter). Similarly, but in the context of sensitive health data, the B2SH project aims to combine biostatistics tools with bioinformatics tools for genome sequence analysis. Interoperability challenges will concern workflow description, provenance and reproducibility, as well as REST APIs and containerisation.