Results of the 2015 call for projects
The analysis of deep sequencing data is both a bottleneck and a major issue for the life sciences. Deep sequencing is commonly used to study genomic mechanisms as different as structural variation and the fitness contribution of genes, among many others. Sequencing technologies are evolving fast, and the third generation is now available on the market. Third generation sequencing provides longer reads at the expense of higher error rates, which make them difficult to incorporate into a classical analysis pipeline. Their length carries the promise of overcoming two major problems in genome and transcriptome assembly: the presence of (i) long genomic repeats and (ii) distinct alleles in diploid or polyploid genomes. We propose to set up a service to perform de novo genome and transcriptome assembly of third generation reads, possibly combined with accurate second generation reads. This service will be developed, evaluated and evolved as both the technique and the bioinformatics solutions improve. Long read sequencing will be increasingly used in assembly projects, but to our knowledge no such service is available in an academic environment. Our group has acquired some expertise in error correction of third generation reads (see LoRDEC) and aims at exploiting high performance computing devices for this assembly service.
It is necessary to start a new phase of development to 1) improve the code and make it more maintainable; 2) adapt the tool to the new needs of biology. For example, with the arrival of new sequencing technologies, data are produced in much larger volumes, and the analysis phases, including metagenomics, could benefit from more precise management of the data retrieved by BioMAJ.
Capitalizing on previous work carried out by GenOuest on the indexing of data banks, as well as on the ongoing work on storing bank metadata in a graph-oriented database, we propose extra functionalities that allow users to build their own banks in order to optimize the processing time of their computations.
The new BioMAJ version will be a tool able to create a custom data infrastructure. To carry out this project, we therefore wish to recruit a software developer for 24 months.
In this context, we propose the development of a new service, called BISTAR, that aims to provide the first pipeline for the analysis of targeted bisulfite-sequencing data. BISTAR will cover all the steps of the analysis, from sequencing reads to differential methylation analysis, allele-specific methylation and SNP analyses. Technically, BISTAR will offer a practical solution for biomedical research, since it will combine time-efficient execution, achieved by parallelising the stages of the pipeline, with user-friendly deployment using virtual appliances that run locally and through the IFB infrastructure. BISTAR will thus provide cloud access with elastic resource provision, and both command-line and graphical (Galaxy) user interfaces.
BISTAR will fill the gap in bisulfite-sequencing analyses and will provide a flexible and user-friendly tool for a broad spectrum of researchers in the life science community.
Seven tools are already available in Galaxy tool sheds. This mississippi tool suite makes it possible to analyze small RNA sequencing datasets in order to annotate, align and visualize small RNAs and their meta-properties.
In addition to upgrading existing tools through a continuous development process, we propose here to extend the Galaxy mississippi tool suite with tools and workflows that provide further support for small RNA biology. We will thus develop and release tools and workflows to (i) analyze small RNA phasing and editing, (ii) profile miRNAs and their differential expression, (iii) discover new miRNAs from sequencing datasets, and (iv) diagnose/discover viruses through metagenomic analyses of viral siRNAs.
Our service deployment plan includes the release of high-quality tools and workflows in Galaxy tool sheds and in Galaxy server instances at the IBPS bioinformatics platform, as well as in Docker containers as an additional option for accessibility and reproducibility. In addition, we wish to benefit from the IFB cloud infrastructure to deploy and provide access to our small RNA-oriented Galaxy server instances.
Our project will benefit both the small RNA and Galaxy communities.
- Specific aim 1: Improvement of the robustness of the system. We will set up automated backup procedures, and automated testing procedures of system functionality. We will extend the user management system to define user groups and private spaces, allowing the sharing of data between collaborators. We will introduce an archiving/tracking procedure for successive versions of gene models across assemblies.
- Specific aim 2: Extension to new data types. We will adapt the schema to host and represent new types of genomic data including: RNA-seq, ChIP-seq, ATAC-seq, SELEX-seq, and transgenic lines. We will develop corresponding back-end management and biocuration interfaces.
- Specific aim 3: Development of new user interfaces. We will increase the flexibility of the search interfaces by: i) supporting complex sequential queries, ii) setting up a BioMart server for the extraction of large datasets, iii) developing an API for programmatic access to the database. We will extend the display interfaces, and in particular introduce reasoning engines to compute and display the genetic regulation relationships taking place in each embryonic territory.
Here, we propose to convert the TAGC internal expertise in bioinformatics analyses into a service offered to customers, coupled to our existing TGML Next Generation Sequencing facility. The T5 project therefore consists of (1) gathering and installing all the tools developed in-house on an integrative server, (2) generalizing programmatic access to all tools, and (3) developing ready-to-use “backbone pipelines” for the analysis of datasets that can be tailored to the specificities of each project.
solutions adopted so far (UNIX, R packages, etc.) clearly show their limits. Bottlenecks affect unified access to core applications as well as computing infrastructure and storage. In the context of a collaboration between the two national infrastructures in metabolomics and bioinformatics, we have developed a Virtual Research Environment (VRE) for data analysis based on the Galaxy framework: workflow4metabolomics.org (W4M). This modular and extensible VRE includes existing components (XCMS functions, etc.) as well as a whole suite of complementary statistical and annotation tools. The implementation is accessible through a web interface, which guarantees the completeness of the parameters. The advanced features of Galaxy have made it possible to integrate components from different environments and written in different languages. Finally, an extensible environment is offered to the metabolomics community, enabling the sharing of preconfigured workflows for new users as well as for experts in the field. The aim of this proposal is to build new functionalities that take the interactive user experience into account (e.g. visualization) and to extend system interoperability with external data resources (e.g. reference databases, external repositories, websites). These developments will therefore address the requirements of the experimental community and position W4M as the key resource for open-source computational metabolomics in Europe.