An XML-based Database of Molecular Pathways

Research on protein-protein interactions produces vast quantities of data, and a large number of databases hold data from this research. Many of these databases offer the data for download on the web in a number of different formats, many of them XML-based. With the arrival of these XML-based formats, and especially the standardized formats such as PSI-MI, SBML and BioPAX, there is a need for searching in data represented in XML. We wanted to investigate the capabilities of XML query tools when it comes to searching this data. Because the datasets are large, we concentrated on native XML database systems, which in addition to searching XML data also offer storage and indexing specially suited to XML documents. A number of queries were tested on data exported from the databases IntAct and Reactome using the XQuery language…

Contents

1 Introduction
1.1 Background
1.2 Problem overview
1.3 Purpose
1.4 Thesis outline
1.5 Document conventions
2 XML – Extensible Markup Language
2.1 Background
2.1.1 Advantages of XML
2.1.2 Drawbacks with XML
2.1.3 Data vs. document
2.2 Validation
2.2.1 DTD
2.2.2 XML Schema
2.2.3 Relax NG
2.3 Namespaces
2.4 Meta-data
2.4.1 RDF
2.4.2 OWL
2.5 Links
2.6 XML APIs
2.7 Transforms
2.7.1 XSLT
2.8 Query
2.8.1 XPath
2.8.2 History
2.8.3 Update capabilities
2.8.4 XQuery
2.9 Databases in XML
2.9.1 Different types of XML databases
2.9.2 Native XML databases
2.9.3 Indices
2.9.4 Normalization
2.9.5 Referential integrity
2.9.6 Performance
2.9.7 Output/API
2.9.8 NXD Models
2.9.9 Implementations of native XML databases
2.10 Summary
3 Bioinformatics
3.1 Genes
3.2 Proteins
3.3 Pathways
3.4 Experimental methods
3.4.1 Two-hybrid systems
3.4.2 Phage-display systems
3.4.3 Curated data
3.5 Databases
3.5.1 KEGG
3.5.2 DIP
3.5.3 MINT
3.5.4 BIND
3.5.5 Reactome
3.5.6 IntAct
3.6 Proposed standard formats
3.6.1 SBML
3.6.2 PSI MI
3.6.3 BioPAX
3.7 Proprietary exchange formats
3.7.1 KGML
3.7.2 XIN
3.7.3 BIND
3.8 Summary
4 Problem analysis
4.1 Questions to be answered
4.1.1 Query capability
4.1.2 Efficiency
4.2 Chosen datasets
4.2.1 Databases
4.2.2 Queries
4.3 Chosen technologies
4.3.1 Native XML databases and XQuery
4.3.2 The Graph Template Library
5 Native XML database setup
5.1 Native XML databases
5.1.1 Exist
5.1.2 Sedna
5.1.3 X-Hive
5.1.4 Qizx/open
5.1.5 Java
5.1.6 Machine setup
5.2 Queries
5.2.1 Type of queries and efficiency
5.2.2 Description of queries
5.2.3 XML serialization
5.3 Test framework
5.4 Benchmarking
6 GTL test setup
6.1 The GTL package
6.2 Transformation
6.3 The program
6.3.1 Removal of extraneous edges
6.3.2 Control of reachability and leaf deletion
6.3.3 Path search
6.4 Benchmark methods
7 Results
7.1 Queries on IntAct data
7.2 Queries on Reactome data
7.3 Premature technique
8 Discussion
8.1 Conclusions
8.2 Future work
8.2.1 More formats
8.2.2 Data integration
8.2.3 Data integration with OWL
8.2.4 Using live remote data
8.2.5 XQuery graph support
8.2.6 User interface
Bibliography
Appendix

Author: Hall, David

Source: Linköping University
