Project

Keywords:

Text mining: Techniques for the extraction of information from unstructured text.
Data mining: Techniques for the discovery and identification of knowledge in large databases.
Bioinformatics: Application of information science and computer science techniques in molecular biology.
Knowledge management: Strategies and practices for the storage, organization and accessibility of information.
Molecular genomics: Study of the structure, organization and processes of genomes at the molecular level.

BIOGRAPH is a project funded by the Bijzonder Onderzoeksfonds of the University of Antwerp (GOA BOF UA) that aims to put forward a new methodology for text mining from heterogeneous information sources. The final goal of the project is to show new results in mining for previously unknown relations between genes and phenotypes, and improved gene prioritisation that catches non-obvious disease-causing genes. It is a multidisciplinary project within the University of Antwerp, carried out by three research groups.

Project description

The growing overload of textual information available to organizations and professionals hampers effective knowledge management and discovery: it increases the time needed to find relevant information and causes crucial information to be missed. This is seen as an especially vexing problem in the health sciences, where the huge and largely unexplored volume of published literature, in combination with structured databases representing experimental data and background knowledge, might lead to new discoveries.

This project proposes the development of a methodology for combined text analysis and data mining (text mining) from such heterogeneous information sources, and its application in molecular genetics/genomics and in knowledge management in general. The proposed approach relies on progress on fundamental research issues in text analysis and data mining. For text analysis, we will investigate semi-automatic adaptation of existing text analysis tools to biomedical language, and develop a limited but robust and accurate handling of negation, modality, and quantification in medical language. We will use this information to provide accurate relations that are automatically extracted from text and weighted according to their reliability. For data mining, we will modify and extend existing graph-based data mining algorithms, especially with regard to scalability and the dynamic nature of the graph that needs to be explored. We will also investigate principled ways of integrating the reliability measures of the output of text analysis with reliability measures for the structured information.
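As an illustration only, the idea of integrating reliability measures from text analysis and from structured sources can be sketched with a simple noisy-OR combination. This is not the project's method: the `Evidence` class, the source labels, the numeric scores, and the choice to simply skip negated statements are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import prod
from typing import Iterable


@dataclass(frozen=True)
class Evidence:
    """One piece of evidence for a relation (hypothetical structure)."""
    source: str             # e.g. "text" or "database" (illustrative labels)
    reliability: float      # assumed to be a probability-like score in [0, 1]
    negated: bool = False   # True if the text states the relation does NOT hold


def combine_reliability(evidence: Iterable[Evidence]) -> float:
    """Noisy-OR over the non-negated evidence: 1 - prod(1 - r_i).

    Negated statements are ignored here; a real system would need a more
    careful treatment of negation, modality, and quantification.
    """
    supporting = [e.reliability for e in evidence if not e.negated]
    return 1.0 - prod(1.0 - r for r in supporting)


# Made-up evidence for a single gene-phenotype relation.
evidence = [
    Evidence("text", 0.6),
    Evidence("database", 0.9),
    Evidence("text", 0.8, negated=True),  # skipped by the combination
]
score = combine_reliability(evidence)
```

Noisy-OR treats the pieces of evidence as independent, which is rarely true when several sentences cite the same experiment; it is used here only because it is easy to read.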

These developments will lead to a new methodology for text mining with heterogeneous information sources that will be tested in two application areas: biomedical text mining and knowledge management. For biomedical text mining, the methodology will be used to assist researchers in ranking candidate disease-causing genes. A number of test cases of increasing complexity will be defined (both with known outcome and with unknown outcome), and the results of the methodology will be compared to the literature (for the cases with known outcome) and experimentally validated (for the cases with unknown outcome). The application in knowledge management addresses the collection of information about persons (person profiling) from information on the web. It will be smaller in scale than the biomedical application, and is intended to show the general applicability of the developed text mining approach.
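To make the gene-ranking use case concrete, the following minimal sketch ranks candidate genes by the most reliable path that connects them to a phenotype in a weighted graph. It is a generic technique (Dijkstra's algorithm on negative log weights), not the algorithm developed in the project, and the node names, edge weights, and function names are hypothetical.

```python
import heapq
from math import exp, log


def best_path_reliability(graph, source):
    """Return, for each reachable node, the highest product of edge
    reliabilities over any path from `source`.

    `graph` maps node -> {neighbour: reliability}, with reliabilities
    assumed to lie in (0, 1]. Maximising a product of such weights is the
    same as minimising the sum of their negative logarithms.
    """
    cost = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        c, node = heapq.heappop(heap)
        if c > cost.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, reliability in graph.get(node, {}).items():
            new_cost = c - log(reliability)
            if new_cost < cost.get(neighbour, float("inf")):
                cost[neighbour] = new_cost
                heapq.heappush(heap, (new_cost, neighbour))
    return {node: exp(-c) for node, c in cost.items()}


def rank_candidates(graph, phenotype, candidates):
    """Sort candidate genes by decreasing best-path reliability.

    Candidates that cannot be reached from the phenotype score 0.0.
    """
    scores = best_path_reliability(graph, phenotype)
    return sorted(candidates, key=lambda g: scores.get(g, 0.0), reverse=True)


# Toy graph with made-up nodes and weights (directed edges).
graph = {
    "phenotype_X": {"gene_A": 0.9, "pathway_P": 0.8},
    "pathway_P": {"gene_B": 0.7},
    "gene_A": {"gene_C": 0.5},
}
ranking = rank_candidates(graph, "phenotype_X", ["gene_C", "gene_B", "gene_D"])
```

In this toy graph `gene_B` is reached through `pathway_P` and `gene_C` through `gene_A`, while `gene_D` is not connected at all and therefore ranks last. Scoring by the single best path ignores corroborating paths; it is chosen here only for brevity.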

The project will provide improved text mining tools (adaptable and with deeper semantic analysis), new graph-based data mining methods, and progress both in non-trivial text mining using heterogeneous information sources and in reliability assessment of mined knowledge. Beyond that, we hope that, through the applications, the project will show new results in mining for previously unknown relations between genes and phenotypes, and improved gene prioritisation that catches non-obvious disease-causing genes.