Validating Semantic Similarity in Biological Data


Managing and analyzing the very large volumes of biological data produced by high-throughput experimental methods, such as next-generation sequencing, is a demanding challenge. A common approach is to filter the data by statistical criteria to reduce their volume. The filtered data are associated with a range of biological information, e.g. gene expression, participation in biological pathways, mutations, chromosome regions and proteins. In most cases, the output of high-throughput experiments is linked to known genes (about 21,000 genes according to the Human Genome Project) in order to improve the understanding of biological processes. These genes carry significant biological meaning that has to be decoded in order to reach primary scientific conclusions.


  • Use machine learning and text-mining methodologies to construct a map of gene clusters from the text information available in PubMed.
  • Validate the emerging clustering within a biological framework.
  • For each gene of interest, extract all the available associated information that lies within its cluster.
  • Reveal information that is not otherwise obvious and/or available.
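
The clustering and per-gene lookup steps listed above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the gene names are placeholders, the texts are invented strings rather than PubMed records, and TF-IDF with cosine similarity and threshold-based single-linkage grouping stands in for whatever text-mining and machine learning methods the project actually uses.

```python
import math
from collections import Counter

# Hypothetical toy corpus: one short text per placeholder gene.
# These strings are invented for illustration; they are not PubMed records.
DOCS = {
    "GENE_A": "dna repair damage response checkpoint",
    "GENE_B": "dna damage repair recombination",
    "GENE_C": "lipid metabolism transport cholesterol",
    "GENE_D": "cholesterol lipid transport receptor",
}


def tfidf_vectors(docs):
    """Return {gene: {term: weight}} using term counts and a smoothed idf."""
    tokens = {gene: text.lower().split() for gene, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many gene texts each term appears.
    df = Counter(t for words in tokens.values() for t in set(words))
    vectors = {}
    for gene, words in tokens.items():
        tf = Counter(words)
        vectors[gene] = {
            t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(vectors, threshold=0.3):
    """Link genes whose similarity reaches the threshold and return the
    connected components (single-linkage grouping) as sorted lists."""
    genes = sorted(vectors)
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the component representative.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    return sorted(groups.values())


def cluster_of(gene, clusters):
    """Return the cluster that contains the gene of interest."""
    return next(c for c in clusters if gene in c)


def shared_terms(members, docs):
    """Terms that occur in the texts of at least two cluster members."""
    counts = Counter(t for g in members for t in set(docs[g].lower().split()))
    return sorted(t for t, c in counts.items() if c >= 2)


clusters = cluster(tfidf_vectors(DOCS))
members = cluster_of("GENE_C", clusters)
print(clusters)
print(members, shared_terms(members, DOCS))
```

The threshold of 0.3 is an arbitrary choice for this toy corpus. On real literature-derived text, the similarity measure and grouping method would need tuning, and the resulting clusters would still have to be validated against biological knowledge as described in the second aim.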