November 2006 Archives

2006-11-29 16:32:50

Complete Mining of Frequent Patterns Graphs: Mining Graph Data

Ref.: Inokuchi, A.; Washio, T. & Motoda, H. Complete Mining of Frequent Patterns from Graphs: Mining Graph Data Machine Learning, 2003, V50, 321-354.

I found this paper on CiteULike. I am interested in it because it might has an application on network motif, although their algorithm is obtained by the extension of Apriori algorithm which is an algorithm of Data mining.

In this paper, we propose a novel principle and an algorithm that can capture all ``frequent induced subgraphs'' and ``association rules among induced subgraphs'' which are observed in many transactions where each is a general graph.

We proposed a new algorithm by extending the Apriori algorithm to mine frequent induced subgraph structures in graph transaction data. The algorithm can handle general graphs and perform a complete search efficiently.Agraph can be either directed or undirected, and have any number of labels for nodes and links and any loops including self-loops.

However the problem to search induced subgraph isomorphism is very hard to do.....

Posted by Abner Chih Yi Huang | Permanent Link

2006-11-24 22:07:37

Graphs Connectivity

I reported tonight on Graph Meeting. The selected paper is ``Testing Connectivity of Graphs'', Hsuan Chang Chen, Master thesis, Graduate Institute of Computer Science and Information Engineering , NTU, 2006.

Here is my slides.

Prof. Tang was too tired so that he nodded almost whole report. Therefore I thought it is hard to discriminate that Prof. Tang like my report or not, although he said it is good after meeting.

Posted by Abner Chih Yi Huang | Permanent Link | Categories: Seminar

2006-11-23 14:48:54

Parameterized Algorithms and Complexity at CCU

There was a six-hour free tutorial course on "Parameterized Algorithms and Complexity" to be given at Computer Science and Information Engineering, National Chung Cheng University, ChiaYi.

Speaker:Prof. Peter Rossmanith

Theoretical Computer Science, Department of Computer Science, RWTH Aachen University

I can't attend this mini-course, but Joseph gave me the slides of this course. Thanks a lot!

Prof. Rossmanith started at figures about NP-Complete, I liked it very much at that time I read Garey and Johnson. Computers and Intractability.

And he quoted the words of Molecular biologist Joseph Felsenstein:

About ten years ago, some computer scientists came by and said they heard we have some really cool problems. They showed that the problems are NP-complete and went away!

That's why we should study the Parameterized Algorithms.

Posted by Abner Chih Yi Huang | Permanent Link | Categories: Seminar

2006-11-22 15:50:46

Slides for Graph Meeting

I have finished the slides for graph meeting at this friday.

Topic: Graphs Connectivity

Paper: Testing Connectivity of Graphs, Hsuan Chang Chen, Master thesis, Graduate Institute of Computer Science and Information Engineering , NTU, 2006.

Now, I would like to extend this slides for proposal.

Posted by Abner Chih Yi Huang | Permanent Link

2006-11-21 23:46:34

Two researchers

I found the webpages of two researchers, Asaf Shapira and Louis Ibarra, respectively. Their works might helpful to me.

Posted by Abner Chih Yi Huang | Permanent Link

2006-11-17 21:13:46

A survey of the role of multicommodity flow and randomization in network design and routing

Ref.: Srinivasan, A. Pardalos, P. & Rajasekaran, S. (ed.) A survey of the role of multicommodity flow and randomization in network design and routing Randomization Methods in Algorithm Design, American Mathematical Society, 1997.

Ref.: Srinivasan, A. An Extension of the Lov{\'a}sz Local Lemma, and its Applications to Integer Programming. SIAM Journal on Computing, Vol. 36, 609-634, 2006. [ArXiv][SODA1996]

I knew this book at last semester. In this semester I need to report Lovasz Local Lemma, so that I found this paper again.

2.4 Flow-decomposition and randomized rounding

Ref. :Network flow From Wikipedia, the free encyclopedia

Flow Decomposition is a simple and useful process of transforminh ``edge flows'' to ``path flow''.

According to the definition on Wikipedia. A network flow is a real function $f:V \times V \rightarrow R$ with the following three properties for all nodes u and v: (1)Capacity constraints (2)Skew symmetry (3)Flow conservation, where u,v are the nodes different to the sink t and source s.

The procedure to transform f into a set of ``path flow'':

find any path P_1 from s to t that is composited by only positive-flow edges;
find the minimum flow value b_1 of edges of P_1;
substract the b_1 from all edges of P_1, and declare P_1 to be a flow-path with flow value b_1, then remove the edges of P_1 which are with zero flow value.
Repeat!

Eventually we will get a set of flow-path { P_i } with flow value b_i>0.

Moreover, we could generalize it into a multi-commodity flow problem, you have multiple sources and sinks, and various ``commodities'' which are to flow from a given source to a given sink. This could be for example various goods that are produced at various factories, and are to be delivered to various given customers through the same transportation network, i.e., to find each flow-path P_{i,j} on (s_i,t_i).

There are serval ways with randomized rounding. First one is to send a unit flow from s_i to t_i. Since such paths P_{i,j} will be convex combination, we could pick P_{i,j} indepednet at ramdon for each i, and the probability of each P_{i,j} is set as b_{i,j}. The second one is like first one but the probability of each P_{i,j} is set as proportional to b_{i,j}.

3. Low congestion rounding, minimax integer programs, and Lovasz local lemma

Note: the explaination of minimax integer program is not lucid in this paper.

The congestion is defined as the maximum number of chosen paths using any edge of G which is minimized. This problem, that choose one P_{i,j} for each i to minimize congestion, could be formulated by minimax integer problems.

Definition An MIP (minimax integer program) has the variables W and $\{x_{i,j}\,|\,i\in [n],j\in [l_i]\}$ for some integers {l_i}. Let $N=\sum_{i\in [n]}l_i$ and let x denote the N-dimenional vector of the variables x_{i,j}. An MIP seeks to minimize a real W subject to:

$\forall i\in [n] \sum_{j\in [l_i]}x_{i,j}=1;$
a system of linear inequalities $Ax\leq W'$, where $A\in [0,1]^{m\times N}$ and W' is the m-dimenional vector of the variable W in each component,
$\forall i,j, x_{i,j}\in\{0,1\}.$

End of Definition

Note, from constraints (i) and (iii) of MIPs, that for all i, any feasible solution will make the set $\{x_{i,j}\,|\,j\in [l_i]\}$ has prescisely one 1, with all other elements being 0.

The first approach for this problem is LP relaxation of MIP by modifying contraints to non-negative reals. Let $x^*,y^*$ denote the optimal solution and optimal value of LP relaxation of MIP respectively, thus $y^*\leq OPT$. Then one independently picks exact one x_{i,j} rounding to 1 at random like the first approach of maximal flow with the probability x^*_{i,j}.

(Not finish)

Ref.: Network Flows: Theory, Algorithms, and Applications, by Ravindra K. Ahuja, Thomas L. Magnanti, James B. Orlin, Prentice Hall (February 18, 1993)

Posted by Abner Chih Yi Huang | Permanent Link | Categories: Papers

2006-11-17 00:23:33

Predicting the Effectiveness of Naive Data fusion on the Basis of System Characteristic

Ref. : Predicting the Effectiveness of Naive Data fusion on the Basis of System Characteristic, Kwong Bor Ng,Paul B Kantor, Journal of American Society for Information Science, vol 51, no 13, 2000 November, pp. 1177-1189.

My Labmate, PhD candidate, Chun Yi Liu gave this talk in Lab Meeting at 2006-11-02.

The main ideas of Data Fusion are socring and ranking. Data fusion is useful for many fields. It could merge the result of different methods (or databases).

There are some main questions for data fusion.

Under what condition , performance of C is better than performance of D ?(D.F.Hsu and I. Taksa,Information Retrieval 8,(2005))
What kind of combination E of A and B is guaranteed to be better than A and B (or under what condition)?
Under what condition , performace of C (or D)is better than th performace of A and B?

This paper try to anser the third question.

They proposed two conditions for effective data fusion, i.e., Diversity and Similarity. And they gvae the experimental results based on Parametric Statical Methods.

Diversity and Similarity denoted as z and r respectively, then by Multiple Liner Regression Analysis, they got formula 1 and by Linear Discriminant Analysis they got formula 2.

$\begin{eqnarray}E=0.408r+0.164z-0.418\\d=3.159z+4.643r-4.36\end{eqnarray}$

It implies that the effectiveness of data fusion is improved if Diversity and Similarity is improved.

My idea is that (not finish)

Posted by Abner Chih Yi Huang | Permanent Link | Categories: Seminar

2006-11-16 22:54:28

20061110 Graph Meeting Power domination in block graphs

Joyce Liu gave a talk about ``Power domination in block graphs,'' based on Guangjun Xu et. al. paper of TCS v.359.

Power domination: Let G = (V,E) be a graph representing an electric power system. A PMU measures the state variable (voltage and phase angle) for the vertex at which it is placed and its incident edges and their endvertices. (These vertices and edges are said to be observed.) We will introduce the observation rules[1] in the following.

1. Any vertex that is incident to an observed edge is observed.

2. Any edge joining two observed vertices is observed.

3. If a vertex is incident to a total of k >1 edges and k-1 of these edges are observed, then all k of these edges are observed.

This paper gave algorithm based on cut-tree. Since they could build cut-tree of block graph, they try to classfy it into sub-problems, i.e., path, spider, and strong support vertex which adjacent to two or more leaves is called a strong support vertex.

Since it is easy to solve the cases of path and spider on cut-tree, the remaining is to solve the case strong support vertex. It is also easy if we solve it base on bottom-up approach that computed MPD by deciding sub-MPD of path, spider, and strong support vertex iteratively.

After the meeting, Prof. Sheng-Lung Peng also mentioned that Parameterized Algorithms and Exponential algorithms might be the good research direction.

The former is like the graph algorithms, it try to introduce some more general parameters to intactable problem and to find the tractable cases with certain parameters of such problems, but graph algorithms classfy the by sub-graphs.

The later is for some situiations that does not allow the approximation solution. They faced the intractability of problems and tried to devise the algorithms with acceptable time complexity, such as $(1.2)^n$.

Posted by Abner Chih Yi Huang | Permanent Link | Categories: Seminar

2006-11-02 10:25:53

2006 Computational Systems Biology Lecture Series I: Hamid Bolouri

The lecture at 10/27 was the first one of 2006 Computational Systems Biology Lecture Series. The lecturer, Hamid Bolouri, gave two talks, ``Four computational tools for reconstruction and analysis of genetic regulatory networks'' and ``Understanding the dynamic behavior of large scale genetic regulatory networks by functional decomposition''.

The first one is the introduction of tools they developed. The second introduce the electric circuit model for genetic regulatory networks. They adapt the electric circuit model because electric circuit is nonlinear and with feedback. Actually I am interested in network motifs because of this talk.

After browsing some results in networks motifs, I quested that ``what is the meaning of netwotks motifs and what could we do on biological networks by it?'' U. Alon, et. al. said

The fact that the network motifs appear at frequencies much higher than expected at random suggests that they may have specific functions in the information processing performed by the network.

(Network motifs in the transcriptional regulation network of Escherichia coli Nat Genet, 2002, 31, 64-68.)

Is it the FACT? It might be a seminal idea, but not a fact.

Hamid Bolouri answered the question in this talk, ``we cannot guess function from topology alone.''

There are many examples, one is the electric circuits which consisted of transistors have simple topology but complex and nonlinear phenomena.

He also intrduced the recent studys of networks motifs. I thought that is a good reference.

Posted by Abner Chih Yi Huang | Permanent Link | Categories: Seminar

2006-11-01 21:55:31

Regression Analysis

Ref.: Regression analysis - Wikipedia, the free encyclopedia

Regression in Wikipeida.

In statistics: * The phenomenon, that, when two related variables are measured, the expected value of the second is closer to the mean than the measured value of the first. See regression toward the mean. * A method where the mean of one or more random variables is predicted conditioned on other (measured) random variables. See regression analysis. * A method of curve fitting which uses least squares: See Linear regression and Nonlinear regression

For ``regression toward the mean'' also refer here correlation coefficient which indicates the strength and direction of a linear relationship between two random variables. Sometimes we will find the low correlation between two random variables. How do we measure the credibility of such relation? That is, if we got a very low correlation coefficient like 0.2, then the assumed relation of the two random variables is sufficient or not?

However I don't take those questions too serious. In TCS, statistics is not so useful. Nevertheless I need to know it, since I decided to study Macroeconomic Analysis.

In linear case, correlation coefficient = 0 means only ``no linear correlation'', and it does not mean ``no correlation.''

Even though one got the high linear correlation coefficient, it still needs to check something more. The first one should be test of significant of linear correlation, the second we could do linear regession, if we got the linear regression we also need to test it!

$\rho_{X,Y}={\mathrm{cov}(X,Y) \over \sigma_X \sigma_Y} ={E((X-\mu_X)(Y-\mu_Y)) \over \sigma_X\sigma_Y}$

Above equation is Pearson's correlation coefficient which is a parametric statistic, that assumed the distribution of random variables such as normal, gamma, or possion. For Pearson's correlation coefficient, it is assumed to be normal distribution. For some cases, non-parametric correlation methods, such as Spearman's and Kendall's correlation coefficient may be useful when distributions are unknown (or don't know how to guess the distribution.)

The idea of statistical significance is that

$\mbox{confidence}={\mbox{value of samples}-\mbox{value of population}\over\mbox{standard error of samples}}$

Thus let null hypothesis H0: rho=0, alternative hypothesis H1: rho is not 0, significance level is alpha. When the sample size n of the paired data is large (n>30 for each variable), the standard error s of the linear correlation r is approximately $(1-r^2)/\sqrt{n}$ Thus test statistic value is $T=\frac{r\times\sqrt{n}}{(1-r^2)}$

Posted by Abner Chih Yi Huang | Permanent Link | Categories: Notes

2006-11-01 19:15:34

Consecutive One Property

Consecutive One Property (C1P) is a sub-problem of Consecutive One Submatrix (C1S) which is a decision problem of the existance of Consecutive One Property for any submatrix of given matrix.

In Physical Mapping, people use Shotgun Sequencing in genetics for sequencing long DNA strands, by cloning strands then using to divide sequence-tagged site (or STS) with enzymes. Eventually they adopted the idea of Fragment Assembly to rebuild the original sequence, i.e., Shortest Common Superstring or ``Given a (0,1) matrix of probes versus clones, to reconstruct the relative places of clones or probes.'' And they are both in NP-C.

Ref. Technique: Physical Mapping of the Genome. It introduced some biological techniques in Physical Mapping.

However, if we just consider the problem, the existance of solution for permuting the columns so that the ones in each row are consecutive. It is C1P problem and is in linear time.(However I don't know why it could be linear since matrix is m*n. Transform matrxi to graph also costs m*n!)

This problem is in linear time, but even slight modeified version become in NP-C, like consecutive one submatrix, ``Given a matrix, find a column permutation such that in each row, there are at most k blocks of consecutive 1's,'' ``Given a matrix, find a column permutation such that the total number of blocks of consecutive 1's is minimized. '' (i.e., C1P with some error entries)

Mohammadtaghi Hajiaghayi wrote a paper about C1S, A Note on the Consecutive Ones Submatrix Problem, and a mannual, Consecutive Ones Property, about C1P and C1S at 2002.

In the mannual, he introduce an (new?) algorithm for solving C1P.

In the algorithm, A is a clique-vertex incident matrix. A' is the matrix modified from A, and each column of A' is maxiaml, and A' is C1P iff A is C1P. The A' could be obtained by add an identity matrix below A, i.e., append I, a n*n matrix, to m*n matrix, such that A' is (m+n)*n matrix.

G[A'] means the adjancy matrix of A'. CLex-BFS is an algorithm, that is a modifed Lex-BFS perfering chosing nodes with largest degreee, in Construction of a simple elimination scheme for a chordal comparability graph in linear time, Discrete Applied Mathematics, Volume 91, Issues 1-3, 26 January 1999, Pages 287-292, Richard B. Borie and Jeremy P. Spinrad.

This algorithm is almost linear besides construct the A'.

Wen-Lian Hsu wrote some papers in this problems. W. L. Hsu, "A simple test for the consecutive ones property", Journal of Algorithms 43, (2002), 1-16. W. F. Lu and W. L. Hsu, A Test for the Consecutive Ones Property on Noisy Data - Application to Physical Mapping and Sequence Assembly, Journal of Computational Biology 10(5), (2003), 709-735.

Ron Shamir also wrote some good notes about it, Algorithms in Molecular Biology: Physical Mapping, and Advanced Topics in Graph Algorithms: Lecture #11 Interval Graphs.

Posted by Abner Chih Yi Huang | Permanent Link | Categories: Notes

2006-11-01 15:39:39

Specificity and Stability in Topology of Protein Networks

Ref.: Maslov, S. & Sneppen, K. Specificity and Stability in Topology of Protein Networks, Science, 2002, 296, 910-91

To address topological properties of these two networks, we quantified correlations between connectivities of interacting nodes and compared them to a null model of a network, in which all links were randomly rewired.

To test for correlations in connectivities of nodes for each of the above two networks, we calculated the likelihood P(K0,K1) that two proteins with connectivities K0 and K1 are connected to each other by a link and compared it to the same quantity Pr(K0,K1) measured in a randomized version of the same network.

To further quantify and compare correlation patterns in interaction and regulatory networks, we calculated the average connectivity of nearest neighbors of a node, as a function of its own connectivity K0 (Fig. 3A).

Their method is calculating the likehood of any two interacting proteins with degree preserving, i.e., construct random graph which has the same vertices set and degree of each vertex, and calculate the ratio $P(K_0,K_1)/P_r(K_0,K_1)$ where P(K_0,K_1) is the probability that a pair of proteins with total numbers of interaction partners given by K_0,K_1, correspondingly, directly interact with each other in the full data set, and P_r(K0,K1) is the same probability in a randomized version of the same network.

However, I don't know the meaning of results.

Above graph is the physical interactions of proteins.

Posted by Abner Chih Yi Huang | Permanent Link | Categories: Papers

2006-11-01 14:43:48

Network motifs in the transcriptional regulation network of Escherichia coli

Ref. : hen-Orr, S.S.; Milo, R.; Mangan, S. & Alon, U. Network motifs in the transcriptional regulation network of Escherichia coli Nat Genet, 2002, 31, 64-68.

We evaluated the statistical significance of the network motifs by comparison with randomized networks having the same characteristics as the real E. coli network. The probability that a randomized network had an equal or greater number of each of the motifs than the E. coli network was determined by enumerating the motifs found in 1,000 randomized networks.

Rroughly speaking, their method is based on comparing with randomized network. What do them mean and could we do with those network motifs?

The authors gave two instructions, the first is

The fact that the network motifs appear at frequencies much higher than expected at random suggests that they may have specific functions in the information processing performed by the network.

Is it the FACT? It might be a seminal idea, but not a fact. The second one seems much useful.

Additional insight may be gained by mathematical analysis of their dynamics.

Mathematical analysis suggests that the coherent feedforward loop can act as a circuit that rejects transient activation signals from the general transcription factor and responds only to persistent signals, while allowing a rapid system shutdown. This can occur when X and Y act in an 'AND-gate'−like manner to control operon Z (Fig. 2a), as is the case in the araBAD operon in the arabinose feedforward loop (Fig. 1b)15.

mathematical analysis suggests that SIMs can show a detailed temporal program of expression resulting from differences in the activation thresholds of the different genes (Fig. 2b). Built into this design is a pattern in which the first gene activated is the last one to be deactivated.

The motifs allow a representation of the E. coli transcriptional network (Fig. 3) in a compact, modular form (for an image of the full network, see Web Fig. A online). By using symbols to represent the different motifs (Fig. 1), the network is broken down to its basic building blocks.

Mathematical model of network motif dynamics

I don't know what Boolean kinetics is? But it seems interesting!

Ref.: McAdams, H.H. & Arkin, A. Simulation of prokaryotic genetic circuits. Annu. Rev. Biophys. Biomol. Struct. 27, 199–224

Ref.: Operon From Wikipedia, the free encyclopedia

An operon is a group of key nucleotide sequences including an operator, a common promoter, and one or more structural genes that are controlled as a unit to produce messenger RNA (mRNA).

Ref.: Cell cycle From Wikipedia, the free encyclopedia

The cell cycle, or cell-division cycle (CDC), is the series of events in a eukaryotic cell between one cell division and the next.

Roughly speaking, it is body clock of cells.

Posted by Abner Chih Yi Huang | Permanent Link | Categories: Papers