Conference: Bioinformatics Technology Conference 2003

BioCon2003

These are my notes from BioCon2003, organized by O’Reilly in San Diego in early 2003. This was the second (and last) of these conferences that they organized.

You can see the archived web site for the conference at http://conferences.oreillynet.com/bio2003/, and the presentations can be downloaded from http://conferences.oreillynet.com/pub/w/21/presentations.html. There you will find a talk I gave on “Creating and Using Multi-protocol Bioinformatic Web Services”, which is also online at http://talks.php.net/show/mdb_biocon2003

2003-02-03

XSLT for Bioinformatics (Doug Tidwell, IBM, dtidwell at us dot ibm dot com)

Extension functions and extension elements
Toot-O-Matic: to make tutorials available on the web, downloadable from IBM’s web site
Doug’s page at IBM

XML Web Services BOF

check JXObjects (sp?), Javascript/DHTML components

2003-02-04

The Genes, the Whole Genes, and Nothing But the Genes (Jim Kent)

HMMs can be used to find/identify genes
Programs that use a composite approach to finding genes (prediction plus experimental data): fgenesh++, GenomeScan, Ensembl, Genie, twinscan, fgenesh2, SLAM, SGP.
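
To make the HMM idea concrete, here is a toy sketch (not from the talk): a two-state model, coding vs. noncoding, decoded with the Viterbi algorithm. The states, all probabilities, and the test sequence are assumptions made up for illustration; real gene finders use far richer models.

```python
import math

# Toy two-state HMM; every number below is an illustrative assumption.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},     # GC-rich
}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][base]))
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi("AT" * 5 + "GC" * 8 + "AT" * 5)
```

With these made-up parameters the GC-rich stretch has to be fairly long before switching states pays off, which is the kind of trade-off an HMM encodes.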

The MetaGraph Framework (John McNeil, Isis Pharmaceuticals)

Why?
- need to have a hierarchy of hypotheses
- also a management of levels of detail
- derive knowledge through statistical association
- need to deal w/ different naming conventions
What
- Defined a semantic network (to store explicit relationships)
- Graphs define also linkages to experimental data, which can create edges pointing to other edges (a hypergraph)
- Hypotheses are groupings of edges (relationships), and they can complement, encompass, or contradict other hypotheses
- Nodes can abstract/aggregate other subgraphs
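
A minimal sketch of the data structure described above, not MetaGraph's actual object model (which is Java and was not shown in detail): edges are first-class objects, so an edge can point at another edge, and a hypothesis is a named set of edges. All class, field, and example names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str


@dataclass(frozen=True)
class Edge:
    source: object   # a Node or another Edge
    relation: str
    target: object   # a Node or another Edge


@dataclass
class Hypothesis:
    name: str
    edges: set = field(default_factory=set)

    def encompasses(self, other):
        """True if every relationship in `other` is also asserted here."""
        return other.edges <= self.edges


gene = Node("gene X")
protein = Node("protein X")
experiment = Node("microarray run")

encodes = Edge(gene, "encodes", protein)
# An edge whose target is another edge: evidence attached to a relationship.
evidence = Edge(experiment, "supports", encodes)

small = Hypothesis("X encodes protein", {encodes})
large = Hypothesis("X encodes protein, with evidence", {encodes, evidence})
```

Here `large.encompasses(small)` holds while the reverse does not; "complement" and "contradict" would need extra semantics that the notes do not spell out.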
Implementation
- Java persistence framework
- Query toolkit: allows using a graph-like structure to perform queries, and it does the grunt work of generating the appropriate SQL mappings
- Templates used for JDBC, Object-relational, etc.
- Metagraph: defines an object model for biology
Clusters
- MetaGraph is too fine-grained, so wrapping graphs into clusters makes them easier to re-use.
- Two clusters can have a different view of the same underlying graph w/o modifying the graph itself
- A cluster can be a sub-graph, rule based, or hybrid
- Clusters are associated to a UI component in most cases
- At the moment, clusters are hand-coded; they are working on an XML representation of clusters
Viewer
- Graphs: non-tree topology, cyclic topology, hypergraph
- Details: interactive drill-down to more information
- Has a Pluggable Object Viewer (POV), wraps any Java object, any persistence mechanism, object-relational, MetaGraph cluster, etc.
- POV has filters, computational app wraps (e.g. BLAST), etc.
They have used their system to integrate GO (Gene Ontology)
Status
- MG and POV deployed w/ GeneTrove hGF Db
- MG is LGPL, http://www.metagraph.org/ (persistence and object-relational model)
- POV and MeshView jars are also released on the MG site; source code will be released after the Isis proprietary code is extracted.
- Contact: jmcneil at isisph dot com

Graphical interfaces for composing analyses (Tom Hudson)

Dataflow programming (sci viz)
BioMoby, Isys (exploratory tools)
Science factory (dataflow for bioinformatics)
http://www.uncw.edu/csc/bioinformatics/

Vector space methods for structural bioinformatics (Per Jambeck)

Number of novel folds grows more slowly than the total number of structures.
Most structure comparison algos rely on dynamic programming
Need a BLAST-like tool for 3D comparison
Related talks: “Using latent semantic analysis in bioinformatics”
Vector Space Methods
- Common in information retrieval
- Use to represent high-dimensional data (microarrays)
- Also for pattern recognition-based approaches
Approach
- Given a set of docs and a list of keywords.
- Parse docs into words.
- Identify keywords in docs.
- Represent docs as a list
  - each entry corresponds to a keyword
  - each entry contains the number of occurrences
- Keywords chosen are specific to categories.
- Tries to develop an abstract representation of the problem in terms of a finite list of features
- Features can be: continuous, discrete, binary
- Identify features that address the problem at hand
- Define a distance metric between the list of features
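
The approach above can be sketched in a few lines. This is only an illustration: the keyword list, the example documents, and the choice of Euclidean distance are my assumptions, not something prescribed in the talk.

```python
import math
import re


def keyword_vector(doc, keywords):
    """Count how often each keyword occurs in doc (case-insensitive)."""
    words = re.findall(r"[a-z0-9]+", doc.lower())
    return [words.count(k) for k in keywords]


def euclidean(u, v):
    """One possible distance metric between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


keywords = ["kinase", "helix", "zinc"]          # hypothetical keyword list
doc_a = "Zinc finger binds zinc; the helix is short."
doc_b = "The kinase domain contains one helix."

vec_a = keyword_vector(doc_a, keywords)
vec_b = keyword_vector(doc_b, keywords)
distance = euclidean(vec_a, vec_b)
```

Once every document is a fixed-length vector, any clustering or classification method that works on vectors can be applied.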
Benefits
- Common representation for comparing diverse data
- Allows data analysis using viz and multivariate statistical methods
  - Clustering, classification (discriminant analysis or nearest neighbour methods), principal components
- Results in an interpretable model
Protein Structure Domains
- Domains: compact, recurring substructures
- Domain segmentation is similar to clustering
- Domain boundaries depend on algo or person that defines them
- Fold taxonomies. Examples: SCOP, CATH, FSSP, CE
- CE http://cl.sdsc.edu/subdomains/subdomains.html
Domain classification
- Automated method needed to fit into SCOP taxonomy
- Benefits
  - Facilitate taxonomy and db curation
  - Help recognize organizing principles
  - Find features to help search of structural data
SCOP has 4 levels: Class -> Fold -> Superfamily -> Family
Representing Domains as vectors
- AAs
- Use backbone fragment libraries (http://csb.stanford.edu/rachel/fragments)
- For a set of proteins
  - Break backbone into short, contiguous fragments
  - Cluster fragments into groups of similar shape
  - Extract representative fragment from each cluster
- Create a vector representation of the protein domain using the fragment library (min RMSD for each k-residue fragment against known fragment set)
- Features: Freq with which each fragment occurs in domain
- Distance: Chi-square distance
- Algo performs well at pairwise categorization of structures at the Class level, but accuracy drops at the Fold and Superfamily levels
- Refined the approach by taking into account 3D relationships between fragments: Partition fragment frequencies into distance bins.
- Also partition according to whether the fragment is on the outside or the inside of the protein.
- This made it possible to find nearest neighbors that are members of the same Superfamily -> will allow us to mine positional preferences of fragments.
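
A rough sketch of the feature vector and distance described above. It assumes the hard part is already done: each k-residue window of a domain has been assigned to its closest library fragment (by minimum RMSD), so a domain arrives as a list of library indices. The function names and the particular chi-square form, the sum of (p_i - q_i)^2 / (p_i + q_i), are my assumptions; the talk did not give the exact formula.

```python
from collections import Counter


def fragment_frequencies(assignments, library_size):
    """Turn per-window library assignments into a normalized frequency vector."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts[i] / total for i in range(library_size)]


def chi_square_distance(p, q):
    """Sum of (p_i - q_i)^2 / (p_i + q_i), skipping bins empty in both."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


# Hypothetical domains, each a list of library fragment indices (library of 4).
domain_a = fragment_frequencies([0, 0, 1, 2], library_size=4)
domain_b = fragment_frequencies([0, 1, 0, 2], library_size=4)
domain_c = fragment_frequencies([3, 3, 3, 3], library_size=4)

same = chi_square_distance(domain_a, domain_b)
different = chi_square_distance(domain_a, domain_c)
```

Note that plain frequencies ignore where a fragment sits in the structure, which is what the distance-bin and inside/outside refinements are meant to address.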

My own TODO

Use this in classification of metal-sites?
Possible vector components:
Ligand pattern (1st and/or 2nd shell)
Geom params (proper/improper torsions)
Email Per sometime to sit down and talk about his classification approach

2003-02-05

Integrating distributed bioinformatics data using data webs (Robert Grossman, U Illinois at Chicago and Open Data Partners)

Trends driving the emergence of biowebs
1. Proliferation of bio dbs (usually in the wrong format)
2. Near a trifurcation point (how to use someone else’s data)
  - Biowebs (remote data analysis and distributed mining)
  - Biogrids (transparent high end computing)
  - Biological semantic webs (for bio knowledge)
3. More and more discoveries will be across dbs
  - “Pearson’s Law”: the usefulness of a column of data varies as the square of the number of columns it is compared to.
Comparing data makes it useful, e.g.: microarray and clinical data
“Conservation of difficulty: When something is hard, it remains hard no matter what you do about it”
Emergence of Open Data requires:
- Open source infrastructure
- Open protocols and data stacks
- Open access to data (no intellectual property restrictions)
Internet Infrastructures for Data
- Data Grids: interoperate distributed supercomputers. Ideal when you need large remote computing resources, e.g. simulation.
- Semantic Web: how can we extend the web to use other people’s knowledge
Biowebs
Molecular DataSpace: example - chemical libraries, 3D structures
Key data web protocol and services
- Data and metadata selection (DWTP, SQL)
  - Data Web transport protocol (DWTP)
  - XML metadata
- Data transport (DWTP)
  - DWTP and XML/SOAP
- Data merging by universal key
  - Globally unique distributed keys (UCKs) for joining distributed data
- Data analysis and mining (PMML)
  - Using algorithms for clustering, regression, etc.
Data and metadata are separated.
No reconciliation of ontologies is needed, only keys need to be exchanged.
Data needs specialized protocol, because it has metadata, missing values, keys, etc.: DWTP
DWTP separates control from data channels
URL: http://www.rgrossman.com
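
A minimal sketch of the "merging by universal key" idea from this talk: two tables from different sources can be joined without reconciling their schemas, as long as both use the same globally unique keys. This says nothing about DWTP itself; the key format, field names, and values are invented for illustration.

```python
def merge_by_key(left, right):
    """Join two {key: record} tables on the keys they share."""
    return {k: {**left[k], **right[k]} for k in left.keys() & right.keys()}


# Hypothetical data from two independent sources sharing one key space.
expression = {
    "UCK:0001": {"expression": 2.4},
    "UCK:0002": {"expression": 0.7},
}
clinical = {
    "UCK:0001": {"outcome": "responder"},
    "UCK:0003": {"outcome": "non-responder"},
}

merged = merge_by_key(expression, clinical)
```

Only the key appearing in both sources survives the join, which is the microarray-plus-clinical-data case mentioned earlier.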

From bioinformatics in the small to bioinformatics in the large (Christopher Lee, UCLA)

Three stages: discovery, analysis, and integration.
Integration is different
“Programming in the large vs programming in the small” DeRemer, 1976.
Bioinfo in the small: Focus on analysis.
There is a need for data integration: Protocols and tools need to be created
Using make as a query language for directed graphs of data. make can do O(n^2) operations with an O(n) set of commands. A Makefile can be considered a knowledge map/graph. It is a tool for the large scale.
Tools:
- ssh to authenticate requests
- make to deal w/ the distributed issues
- MySQL to deal w/ the centralized aspects
Flow:
- DB creates a job table
- ssh agent launches agent to a set of cluster nodes
- agent asks DB for next task, processes it, and sends results to DB
Using all of these you can easily make web interfaces to see what is happening (errors, etc.)
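
The flow above can be sketched as a small job queue. This is a single-process illustration under stated assumptions: sqlite3 stands in for MySQL, the agent is called directly instead of being launched over ssh, and the table layout and function names are hypothetical.

```python
import sqlite3


def create_job_table(db, commands):
    """DB side: create the job table and fill it with pending jobs."""
    db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, command TEXT, "
               "status TEXT DEFAULT 'pending', result TEXT)")
    db.executemany("INSERT INTO jobs (command) VALUES (?)",
                   [(c,) for c in commands])
    db.commit()


def next_task(db):
    """Claim the next pending job, or return None when the queue is empty."""
    row = db.execute("SELECT id, command FROM jobs WHERE status = 'pending' "
                     "ORDER BY id LIMIT 1").fetchone()
    if row is not None:
        db.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
        db.commit()
    return row


def run_agent(db, process):
    """Agent loop: ask for a task, process it, send the result back."""
    while (task := next_task(db)) is not None:
        job_id, command = task
        try:
            result, status = process(command), "done"
        except Exception as exc:  # record failures so a web UI can show them
            result, status = str(exc), "error"
        db.execute("UPDATE jobs SET status = ?, result = ? WHERE id = ?",
                   (status, result, job_id))
        db.commit()


db = sqlite3.connect(":memory:")
create_job_table(db, ["align chr1", "align chr2"])
run_agent(db, lambda command: command.upper())  # stand-in for real work
rows = db.execute("SELECT command, status, result FROM jobs ORDER BY id").fetchall()
```

With several agents running at once, claiming a job would need to be atomic (a transaction or a conditional UPDATE); this sketch skips that. The status and result columns are what a monitoring web page would read.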

2003-02-06

Java for Numerical Computing: A Tour Through Issues and Directions in the Use and Evolution of Java for Numerical Computing (James Gosling, Sun Microsystems)

Gaming drives the high end numeric computing at this moment
Java Grande (http://www.javagrande.org) - Numerical computing in Java
Plenty of Java APIs, even one for interplanetary navigation which focuses more on precision and accuracy than performance
Issues in Java for Sci/Numerical computing
- Notation
- Access to legacy code
- Performance
- Native support for Complex Numbers, Matrices, etc.
Notation
- Operator overloading: widely loathed, the legacy of C++
- Limited set of operators
- Gosling wrote (in Java) an editor that understands the language semantics, and can represent code in different ways, such as regular code listings, syntactic tree, even mathematical equations
Floating Point (JSR84, withdrawn)
- JSR84 wanted to relax floating point accuracy in exchange for higher performance
- Flip side of strict floating point -> to do floating point operations exactly
IEEE754
- Java omits performance-damaging aspects
  - Global modes (changes flush pipeline)
  - Exceptions (inhibits optimizations)
- IEEE working on revision
Multidimensional arrays (JSR83)
- Pure API
- Should there be a language-level version?
- Build on API via operator overloading?
  - a[i] -> a.get(i)
  - a[i] = v -> a.put(i, v)
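
To illustrate the "syntax on top of a pure API" idea, here is a sketch in Python rather than Java, since Python already works this way: indexing is sugar for method calls. The class and its get/put methods are hypothetical and are not the JSR83 API.

```python
class Array2D:
    """Tiny dense 2-D array with an explicit get/put API."""

    def __init__(self, rows, cols, fill=0.0):
        self.rows, self.cols = rows, cols
        self._data = [fill] * (rows * cols)  # row-major storage

    def _offset(self, index):
        i, j = index
        if not (0 <= i < self.rows and 0 <= j < self.cols):
            raise IndexError(index)
        return i * self.cols + j

    def get(self, index):
        return self._data[self._offset(index)]

    def put(self, index, value):
        self._data[self._offset(index)] = value

    # The "language-level" layer: a[i, j] is just sugar for get/put.
    __getitem__ = get
    __setitem__ = put


a = Array2D(2, 3)
a[1, 2] = 4.5          # same as a.put((1, 2), 4.5)
value = a[1, 2]        # same as a.get((1, 2))
```

The API stays the single source of behavior; the operator layer adds notation only, which is the design question raised in the talk.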
Units (JSR108)
- Pure API
- Language-level implementation?
  - Part of type system?
  - For primitives, e.g. double@m@/s velocity = 23@m/17.2@s;
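
A small sketch of what carrying units with a value buys you. It is written in Python with checks at run time, so it only approximates the "part of the type system" idea, where errors would be caught at compile time. The class and the helper functions `m` and `s` are hypothetical and unrelated to the JSR108 API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    value: float
    metres: int = 0   # exponent of the length dimension
    seconds: int = 0  # exponent of the time dimension

    def __truediv__(self, other):
        # Dividing quantities subtracts the unit exponents.
        return Quantity(self.value / other.value,
                        self.metres - other.metres,
                        self.seconds - other.seconds)

    def __add__(self, other):
        # Adding only makes sense for identical units.
        if (self.metres, self.seconds) != (other.metres, other.seconds):
            raise TypeError("cannot add quantities with different units")
        return Quantity(self.value + other.value, self.metres, self.seconds)


def m(x):
    return Quantity(x, metres=1)


def s(x):
    return Quantity(x, seconds=1)


velocity = m(23) / s(17.2)   # units: m^1 s^-1
```

Trying `m(1) + s(1)` raises a TypeError here; a language-level version would reject it before the program runs.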
Performance (Java)
- On numeric benchmarks it matches gcc on Linux
- About 50-60% performance when compared to optimized Fortran code (Los Alamos)
- Non-numeric benchmarks: Dynamic optimization, and good performance.
- Issues
  - Small (cheap) objects needed (for complex numbers, etc.)
  - Memory hierarchy

Best Practices for Designing XML Schemas for Bioinformatics (Ayesha Malik, Object Machines)

Use design patterns (increasing level of granularity for data types)
- Russian Doll
- Salami Slice
- Venetian Blind
EBI makes gene data available as BSML
UML can be used to model schemas; some tools, like ArgoUML, can convert from UML to XSD
XML Schemas support
- abstract data types
- inheritance (via xs:extension element)
- type restriction depending on context (xs:restriction)
Design patterns address decoupling and cohesiveness in XML Schemas
Make good use of namespaces to group schema definitions
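
A small sketch tying a few of these points together: a schema in the Venetian Blind style (named global types, a single global element), type inheritance via xs:extension, and a target namespace. The namespace URI and all type and element names are made up. Python's standard library does not validate XSD, so the code only parses the schema and inspects it.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: GeneType extends FeatureType with one extra element.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:bio="http://example.org/bio"
           targetNamespace="http://example.org/bio"
           elementFormDefault="qualified">
  <xs:complexType name="FeatureType">
    <xs:sequence>
      <xs:element name="start" type="xs:positiveInteger"/>
      <xs:element name="end" type="xs:positiveInteger"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="GeneType">
    <xs:complexContent>
      <xs:extension base="bio:FeatureType">
        <xs:sequence>
          <xs:element name="symbol" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:element name="gene" type="bio:GeneType"/>
</xs:schema>
"""

root = ET.fromstring(SCHEMA)
complex_types = root.findall(f"{{{XS}}}complexType")

# Names of the reusable global types.
named_types = [t.get("name") for t in complex_types]

# Map each derived type to the base type it extends.
extensions = {
    t.get("name"): t.find(f".//{{{XS}}}extension").get("base")
    for t in complex_types
    if t.find(f".//{{{XS}}}extension") is not None
}
```

Because the types are global and named, another schema could import the namespace and reuse or extend FeatureType without touching the gene element.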