Speaker: John Tate, European Bioinformatics Institute (SDSC, 2003/02/12)
Abstract
Since 1979 the Protein Data Bank (PDB) has been the central repository of macromolecular structure data, but the present flat file archive is incapable of supporting the complex tools that are required for drug discovery, molecular medicine and bioinformatics. In order to fully exploit the volume of structural data that will soon become available, new technologies must be employed. The Macromolecular Structure Database (MSD) group has developed a relational database for storing, validating, searching and retrieving the complex structural information in the PDB. A comprehensive cleaning procedure is under way, to ensure data uniformity across the whole archive, and an extensive set of derived properties and goodness-of-fit indicators will be added. The MSD includes links to many other bioinformatics databases including InterPro, SwissProt, SCOP, CATH, PFAM and PROSITE.
We have developed a flexible search system which exposes the power of the relational database without requiring the user to understand the complexity of the underlying schema. This search system provides a single access point for the MSD and associated databases, allowing searches on a wide range of bio-molecular properties, such as sequence, structure similarity and active site conformation. The database, and several network-based services that are built on top of it, will be available by the end of April 2003. This talk will describe the basic design of the database, outline some of the improvements in data quality that it provides, and describe the services and search systems that are currently available and planned for the near future.
The E-MSD relational database and search system at EBI
- MSD URL: http://www.ebi.ac.uk/msd/
- MSD (aims)
- Deposition site via which a molecular structure can be deposited to PDB (Autodep)
- A stable repository of macromolecular structural data
- Services to allow access, search and retrieval of structural data
- The PDB format is not extensible, and cannot even describe some existing structures. The archive is non-uniform, and searching the flat files is difficult and inaccurate
- MSD design requirements
- Robust: analysis and consistency checks
- Clean
- Maintainable
- Open interface: API, SOAP, JDBC, Perl-DBI
- Extensible
- MSD uses Oracle as the back end, and encompasses 2 databases
- Deposition database: highly normalized, approx. 400 tables
- Search database: denormalized, approx. 40 tables
- PQS: Protein Quaternary Structure Server (http://www.ebi.ac.uk/msd-srv/pqs)
- It generates a macromolecular assembly from a PDB structure
- Derived data:
- Ligand binding sites
- Secondary structure information (DSSP): treated consistently across the repository, while still allowing for the author’s annotation
- Mapping to other databases such as SwissProt and SCOP
- Data cleanup
- Spelling errors, e.g. 20 different ways of referring to E. coli
- Chain name consistency, e.g. the chain IDs in the header must match those in the actual records
- Ligand nomenclature (HET compounds), using a graph-based algorithm to disambiguate
- MSD Services
- Chem PDB: search the HET compound list, using an interactive Java applet to draw the structure
- Secondary Structure Matching (SSM):
- Interactive 3D comparison
- Compares against the whole PDB in approx. 30 seconds
- Allows for substructure searches too
- Search@MSD
- Uses an applet to build a query
- Applet generates an XML representation of the query
- Server turns the XML into SQL
- Applet uses an XML file for interface description
- Applet is not yet available for other people to use, possibly because of licensing issues
- 3D structure viewer: AstexViewer, developed jointly by EBI and Astex
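
The Search@MSD flow noted above (applet builds a query, query is serialized as XML, server turns the XML into SQL) can be illustrated with a minimal sketch. This is not the MSD implementation: the XML element and attribute names, the `entry_search` table, its columns and the sample rows are all hypothetical, and SQLite stands in for the Oracle back end. The sketch only shows one plausible way a server might map a small XML query onto a parameterized SQL statement against a denormalized search table.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical whitelist: query field name -> (column name, value converter).
# The real MSD search schema is not described in these notes.
FIELD_TO_COLUMN = {
    "organism": ("organism", str),
    "resolution": ("resolution", float),
    "ligand": ("ligand_code", str),
}
# Hypothetical operator names; named operators avoid escaping "<" in XML.
OPERATORS = {"eq": "=", "lt": "<", "gt": ">"}


def xml_to_sql(query_xml):
    """Translate a small XML query into (sql, params) using bound parameters."""
    root = ET.fromstring(query_xml)
    clauses, params = [], []
    for cond in root.findall("condition"):
        # Unknown fields or operators raise KeyError instead of reaching SQL.
        column, convert = FIELD_TO_COLUMN[cond.get("field")]
        operator = OPERATORS[cond.get("op", "eq")]
        clauses.append(f"{column} {operator} ?")
        params.append(convert(cond.get("value")))
    sql = "SELECT entry_id FROM entry_search"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY entry_id", params


def build_demo_db():
    """Create an in-memory table with invented rows (not real PDB entries)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry_search ("
        "entry_id TEXT, organism TEXT, resolution REAL, ligand_code TEXT)"
    )
    conn.executemany(
        "INSERT INTO entry_search VALUES (?, ?, ?, ?)",
        [
            ("demo1", "Escherichia coli", 1.8, "ATP"),
            ("demo2", "Escherichia coli", 2.5, "HEM"),
            ("demo3", "Homo sapiens", 1.5, "ATP"),
        ],
    )
    return conn


def run_query(conn, query_xml):
    """Translate the XML query and return the matching entry ids."""
    sql, params = xml_to_sql(query_xml)
    return [row[0] for row in conn.execute(sql, params)]


# Hypothetical XML of the kind a query-building client might send.
DEMO_QUERY = """
<query>
  <condition field="organism" op="eq" value="Escherichia coli"/>
  <condition field="resolution" op="lt" value="2.0"/>
</query>
"""
```

Calling `run_query(build_demo_db(), DEMO_QUERY)` runs the translated statement against the invented rows. Keeping field names and operators in whitelists, and passing values as bound parameters, means user input never becomes part of the SQL text itself, which is one reason a server might prefer an intermediate XML representation over accepting SQL directly.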