Please copy and paste this embed script to where you want to embed

C E N T R E F O R I N T E G R A T I V E

Introduction to Bioinformatics B I O I N F O R M A T I C S V U

Lecture 16 Intracellular Networks Graph theory

High-throughput Biological Data

Enormous amounts of biological data are being generated by high-throughput capabilities; even more are coming – – – – – –

genomic sequences gene expression data mass spectrometry data protein-protein interaction data protein structures ......

Hidden in these data is information that reflects – existence, organization, activity, functionality …… of biological machineries at different levels in living organisms

Bio-Data Analysis and Data Mining

Existing/emerging bio-data analysis and mining tools for – – – – – – – – –

DNA sequence assembly Genetic map construction Sequence comparison and database search Gene finding …. Gene expression data analysis Phylogenetic tree analysis to infer horizontally-transferred genes Mass spec. data analysis for protein complex characterization ……

Current prevailing mode of work

Developing ad hoc tools for each individual application

Bio-Data Analysis and Data Mining

As the amount and types of data and the needs to establish connections across multi-data sources increase rapidly, the number of analysis tools needed will go up “exponentially” – blast, blastp, blastx, blastn, … from BLAST family of tools – gene finding tools for human, mouse, fly, rice, cyanobacteria, ….. – tools for finding various signals in genomic sequences, protein-binding sites, splice junction sites, translation start sites, …..

Many of these data analysis problems are fundamentally the same problem(s) and can be solved using the same set of tools Developing ad hoc tools for each application problem (by each group of individual researchers) may soon become inadequate as bio-data production capabilities further ramp up

Data Clustering

Many biological data analysis problems can be formulated as clustering problems – microarray gene expression data analysis – arrayCGH data (chromosomal gains and losses) – identification of regulatory binding sites (similarly, splice junction sites, translation start sites, ......) – (yeast) two-hybrid data analysis (for inference of protein complexes) – phylogenetic tree clustering (for inference of horizontally transferred genes) – protein domain identification – identification of structural motifs – prediction reliability assessment of protein structures – NMR peak assignments – ......

Data Clustering: an example

Regulatory binding-sites are short conserved sequence fragments in promoter regions ........acgtttataatggcg ...... ........ggctttatattcgtc ...... ........ccgaatataatcta .......

Solving binding-site identification as a clustering problem – Project all fragments into Euclidean space so that similar fragments are projected to nearby positions and dissimilar fragments to far positions – Observation: conserved fragments form “clusters” in a noisy background

Data Clustering Problems

Clustering: partition a data set into clusters so that data points of the same cluster are “similar” and points of different clusters are “dissimilar”

Cluster identification -- identifying clusters with significantly different features than the background

Multivariate statistics – Cluster analysis C1 C2 C3 C4 C5 C6 .. 1 2 3 4 5

Raw table Any set of numbers per column

•Multi-dimensional problems •Objects can be viewed as a cloud of points in a multidimensional space

•Need ways to group the data

Multivariate statistics – Cluster analysis 1 2 3 4 5

C1 C2 C3 C4 C5 C6 ..

Raw table Any set of numbers per column Similarity criterion Scores

5×5

Similarity matrix

Cluster criterion

Dendrogram

Cluster analysis – data normalisation/weighting 1 2 3 4 5

C1 C2 C3 C4 C5 C6 ..

Raw table Normalisation criterion

1 2 3 4 5

C1 C2 C3 C4 C5 C6 ..

Column normalisation

Normalised table x/max

Column range normalisation (x-min)/(max-min)

Cluster analysis – (dis)similarity matrix 1 2 3 4 5

C1 C2 C3 C4 C5 C6 ..

Raw table Similarity criterion Scores

5×5

Similarity matrix

Di,j = (k | xik – xjk|r)1/r Minkowski metrics r = 2 Euclidean distance r = 1 City block distance

Cluster analysis – Clustering criteria Scores

5×5

Similarity matrix

Cluster criterion

Dendrogram (tree) Single linkage - Nearest neighbour Complete linkage – Furthest neighbour Group averaging – UPGMA (phylogeny) Ward

Neighbour joining – global measure (phylogeny)

Cluster analysis – Clustering criteria 1. Start with N clusters of 1 object each 2. Apply clustering distance criterion iteratively until you have 1 cluster of N objects

3. Most interesting clustering somewhere in between

distance

Dendrogram (tree)

1 cluster

N clusters

Single linkage clustering (nearest neighbour) Char 2

Char 1

Single linkage clustering (nearest neighbour) Char 2

Char 1

Single linkage clustering (nearest neighbour) Char 2

Char 1

Single linkage clustering (nearest neighbour) Char 2

Char 1

Single linkage clustering (nearest neighbour) Char 2

Char 1

Single linkage clustering (nearest neighbour) Char 2

Char 1

Distance from point to cluster is defined as the smallest distance between that point and any point in the cluster

Single linkage clustering (nearest neighbour) Let Ci and Cj be two disjoint clusters: di,j = Min(dp,q), where p Ci and q Cj

Single linkage dendrograms typically show chaining behaviour (i.e., all the time a single object is added to existing cluster)

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Distance from point to cluster is defined as the largest distance between that point and any point in the cluster

Complete linkage clustering (furthest neighbour) Let Ci and Cj be two disjoint clusters: di,j = Max(dp,q), where p Ci and q Cj

More ‘structured’ clusters than with single linkage clustering

Clustering algorithm 1. 2. 3. 4.

Initialise (dis)similarity matrix Take two points with smallest distance as first cluster Merge corresponding rows/columns in (dis)similarity matrix Repeat steps 2. and 3. using appropriate cluster measure until last two clusters are merged

Average linkage clustering

(Unweighted Pair Group Mean Averaging -UPGMA) Char 2

Char 1

Distance from cluster to cluster is defined as the average distance over all within-cluster distances

UPGMA Let Ci and Cj be two disjoint clusters: 1

di,j = ———————— pq dp,q, where p Ci and q Cj

|Ci| × |Cj|

Ci

Cj

In words: calculate the average over all pairwise inter-cluster distances

Multivariate statistics – Cluster analysis 1 2 3 4 5

C1 C2 C3 C4 C5 C6 ..

Data table Similarity criterion Scores

Similarity matrix

5×5

Cluster criterion

Phylogenetic tree

Multivariate statistics – Cluster analysis 1 2 3 4 5

C1 C2 C3 C4 C5 C6 Similarity criterion

Scores

6×6 Cluster criterion

Scores

5×5 Cluster criterion

Make two-way ordered table using dendrograms

Multivariate statistics – Two-way cluster analysis

C4 C3 C6 C1 C2 C5 1 4 2 5 3

Make two-way (rows, columns) ordered table using dendrograms; This shows ‘blocks’ of numbers that are similar

Multivariate statistics – Two-way cluster analysis

Graph theory The river Pregal in Königsberg – the Königsberg bridge problem and Euler’s graph

Can you start at some land area (S1, S2, I1, I2) and walk each bridge exactly once returning to the starting land area?

Graphs - definition

Digraphs: Directed graphs

Complete graphs: have all possible edges

Planar graphs: can be presented in 2D and have no crossing edges (e.g. chip design)

Graphs - definition 0 1 1.5 2 5 6 7 9 1 0 2 1 6.5 6 8 8 1.5 2 0 1 4 4 6 5.5 . .

.

Graph

Adjacency matrix

An undirected graph has a symmetric adjacency matrix A digraph typically has a non-symmetric adjacency matrix

Example application – OBSTRUCT: creating non-redundant datasets of protein structures

Based on all-against-all global sequence alignment Create all-against-all sequence similarity matrix Filter matrix based on desired similarity range (convert to ‘0’ and ‘1’ values) Form maximal clique (largest complete subgraph) by ordering rows and columns This is an NP-complete problem (NP = nonpolynomial) and thus problem scales exponentially with number of vertices (proteins)

Example application 1 – OBSTRUCT: creating non-redundant datasets of protein structures •

Statistical research on protein structures typically requires a database of a maximum number of nonredundant (i.e. non-homologous) structures

•

Often, two structures that have a sequence identity of less than 25% are taken as non-redundant

•

Given an initial set of N structures (with corresponding sequences) and all-against-all pair-wise alignments: •

Find the largest possible subset where each sequence has

View more...
Introduction to Bioinformatics B I O I N F O R M A T I C S V U

Lecture 16 Intracellular Networks Graph theory

High-throughput Biological Data

Enormous amounts of biological data are being generated by high-throughput capabilities; even more are coming – – – – – –

genomic sequences gene expression data mass spectrometry data protein-protein interaction data protein structures ......

Hidden in these data is information that reflects – existence, organization, activity, functionality …… of biological machineries at different levels in living organisms

Bio-Data Analysis and Data Mining

Existing/emerging bio-data analysis and mining tools for – – – – – – – – –

DNA sequence assembly Genetic map construction Sequence comparison and database search Gene finding …. Gene expression data analysis Phylogenetic tree analysis to infer horizontally-transferred genes Mass spec. data analysis for protein complex characterization ……

Current prevailing mode of work

Developing ad hoc tools for each individual application

Bio-Data Analysis and Data Mining

As the amount and types of data and the needs to establish connections across multi-data sources increase rapidly, the number of analysis tools needed will go up “exponentially” – blast, blastp, blastx, blastn, … from BLAST family of tools – gene finding tools for human, mouse, fly, rice, cyanobacteria, ….. – tools for finding various signals in genomic sequences, protein-binding sites, splice junction sites, translation start sites, …..

Many of these data analysis problems are fundamentally the same problem(s) and can be solved using the same set of tools Developing ad hoc tools for each application problem (by each group of individual researchers) may soon become inadequate as bio-data production capabilities further ramp up

Data Clustering

Many biological data analysis problems can be formulated as clustering problems – microarray gene expression data analysis – arrayCGH data (chromosomal gains and losses) – identification of regulatory binding sites (similarly, splice junction sites, translation start sites, ......) – (yeast) two-hybrid data analysis (for inference of protein complexes) – phylogenetic tree clustering (for inference of horizontally transferred genes) – protein domain identification – identification of structural motifs – prediction reliability assessment of protein structures – NMR peak assignments – ......

Data Clustering: an example

Regulatory binding-sites are short conserved sequence fragments in promoter regions ........acgtttataatggcg ...... ........ggctttatattcgtc ...... ........ccgaatataatcta .......

Solving binding-site identification as a clustering problem – Project all fragments into Euclidean space so that similar fragments are projected to nearby positions and dissimilar fragments to far positions – Observation: conserved fragments form “clusters” in a noisy background

Data Clustering Problems

Clustering: partition a data set into clusters so that data points of the same cluster are “similar” and points of different clusters are “dissimilar”

Cluster identification -- identifying clusters with significantly different features than the background

Multivariate statistics – Cluster analysis C1 C2 C3 C4 C5 C6 .. 1 2 3 4 5

Raw table Any set of numbers per column

•Multi-dimensional problems •Objects can be viewed as a cloud of points in a multidimensional space

•Need ways to group the data

Multivariate statistics – Cluster analysis 1 2 3 4 5

C1 C2 C3 C4 C5 C6 ..

Raw table Any set of numbers per column Similarity criterion Scores

5×5

Similarity matrix

Cluster criterion

Dendrogram

Cluster analysis – data normalisation/weighting 1 2 3 4 5

C1 C2 C3 C4 C5 C6 ..

Raw table Normalisation criterion

1 2 3 4 5

C1 C2 C3 C4 C5 C6 ..

Column normalisation

Normalised table x/max

Column range normalisation (x-min)/(max-min)

Cluster analysis – (dis)similarity matrix 1 2 3 4 5

C1 C2 C3 C4 C5 C6 ..

Raw table Similarity criterion Scores

5×5

Similarity matrix

Di,j = (k | xik – xjk|r)1/r Minkowski metrics r = 2 Euclidean distance r = 1 City block distance

Cluster analysis – Clustering criteria Scores

5×5

Similarity matrix

Cluster criterion

Dendrogram (tree) Single linkage - Nearest neighbour Complete linkage – Furthest neighbour Group averaging – UPGMA (phylogeny) Ward

Neighbour joining – global measure (phylogeny)

Cluster analysis – Clustering criteria 1. Start with N clusters of 1 object each 2. Apply clustering distance criterion iteratively until you have 1 cluster of N objects

3. Most interesting clustering somewhere in between

distance

Dendrogram (tree)

1 cluster

N clusters

Single linkage clustering (nearest neighbour) Char 2

Char 1

Single linkage clustering (nearest neighbour) Char 2

Char 1

Single linkage clustering (nearest neighbour) Char 2

Char 1

Single linkage clustering (nearest neighbour) Char 2

Char 1

Single linkage clustering (nearest neighbour) Char 2

Char 1

Single linkage clustering (nearest neighbour) Char 2

Char 1

Distance from point to cluster is defined as the smallest distance between that point and any point in the cluster

Single linkage clustering (nearest neighbour) Let Ci and Cj be two disjoint clusters: di,j = Min(dp,q), where p Ci and q Cj

Single linkage dendrograms typically show chaining behaviour (i.e., all the time a single object is added to existing cluster)

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Complete linkage clustering (furthest neighbour) Char 2

Char 1

Distance from point to cluster is defined as the largest distance between that point and any point in the cluster

Complete linkage clustering (furthest neighbour) Let Ci and Cj be two disjoint clusters: di,j = Max(dp,q), where p Ci and q Cj

More ‘structured’ clusters than with single linkage clustering

Clustering algorithm 1. 2. 3. 4.

Initialise (dis)similarity matrix Take two points with smallest distance as first cluster Merge corresponding rows/columns in (dis)similarity matrix Repeat steps 2. and 3. using appropriate cluster measure until last two clusters are merged

Average linkage clustering

(Unweighted Pair Group Mean Averaging -UPGMA) Char 2

Char 1

Distance from cluster to cluster is defined as the average distance over all within-cluster distances

UPGMA Let Ci and Cj be two disjoint clusters: 1

di,j = ———————— pq dp,q, where p Ci and q Cj

|Ci| × |Cj|

Ci

Cj

In words: calculate the average over all pairwise inter-cluster distances

Multivariate statistics – Cluster analysis 1 2 3 4 5

C1 C2 C3 C4 C5 C6 ..

Data table Similarity criterion Scores

Similarity matrix

5×5

Cluster criterion

Phylogenetic tree

Multivariate statistics – Cluster analysis 1 2 3 4 5

C1 C2 C3 C4 C5 C6 Similarity criterion

Scores

6×6 Cluster criterion

Scores

5×5 Cluster criterion

Make two-way ordered table using dendrograms

Multivariate statistics – Two-way cluster analysis

C4 C3 C6 C1 C2 C5 1 4 2 5 3

Make two-way (rows, columns) ordered table using dendrograms; This shows ‘blocks’ of numbers that are similar

Multivariate statistics – Two-way cluster analysis

Graph theory The river Pregal in Königsberg – the Königsberg bridge problem and Euler’s graph

Can you start at some land area (S1, S2, I1, I2) and walk each bridge exactly once returning to the starting land area?

Graphs - definition

Digraphs: Directed graphs

Complete graphs: have all possible edges

Planar graphs: can be presented in 2D and have no crossing edges (e.g. chip design)

Graphs - definition 0 1 1.5 2 5 6 7 9 1 0 2 1 6.5 6 8 8 1.5 2 0 1 4 4 6 5.5 . .

.

Graph

Adjacency matrix

An undirected graph has a symmetric adjacency matrix A digraph typically has a non-symmetric adjacency matrix

Example application – OBSTRUCT: creating non-redundant datasets of protein structures

Based on all-against-all global sequence alignment Create all-against-all sequence similarity matrix Filter matrix based on desired similarity range (convert to ‘0’ and ‘1’ values) Form maximal clique (largest complete subgraph) by ordering rows and columns This is an NP-complete problem (NP = nonpolynomial) and thus problem scales exponentially with number of vertices (proteins)

Example application 1 – OBSTRUCT: creating non-redundant datasets of protein structures •

Statistical research on protein structures typically requires a database of a maximum number of nonredundant (i.e. non-homologous) structures

•

Often, two structures that have a sequence identity of less than 25% are taken as non-redundant

•

Given an initial set of N structures (with corresponding sequences) and all-against-all pair-wise alignments: •

Find the largest possible subset where each sequence has