Thursday, January 15, 2015

Deep Belief Networks for Genomic Analysis

Massive amounts of biological data are being produced (citation). To derive meaning from this new information, new tools must be developed. Biology can be described at many levels: genomic sequence, gene expression levels, protein concentrations, and more. Machine learning models could therefore be incredibly powerful tools for finding meaning in the high dimensionality of biological data.

Hierarchical learning models have recently generated interest because of their ability to make accurate predictions from large amounts of data. One application of these models has been image recognition, where they have been used to successfully reconstruct images (citation) and infer properties of images such as location (citation).

Supervised machine learning begins with a training phase, in which the model learns from a sample of data that is meant to represent the problem. During training, the model infers rules about the structure of the data that map it to a desired output; when that output is a discrete label, the task is called classification. After training, the model is applied to new data in order to predict its properties. In short, supervised learning aims to use known data to make predictions about new data (Hastie et al., 2005). Unsupervised machine learning, on the other hand, groups data based on their similarity according to chosen measures; in this setting, the user does not specify the desired outcome for the model.
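
To make the distinction concrete, here is a minimal sketch (my own toy example on synthetic data, not taken from Hastie et al.) contrasting the two settings with scikit-learn: a supervised classifier is trained on labelled examples and scored on held-out data, while a clustering algorithm groups the same samples without ever seeing the labels.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, expression-like data: 200 samples, 50 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=50, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised: learn a mapping from features to known labels, then predict.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("supervised test accuracy:", clf.score(X_test, y_test))

# Unsupervised: group samples by similarity without using the labels at all.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))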

Hierarchical learning models are composed of feature-detecting layers. The lower layers detect simple features of the data, which help the higher layers find more complex features. In particular, deep belief networks (DBNs) (Hinton et al., 2006) are multilayered models in which each layer consists of stochastic variables whose values represent interdependencies with the layer below. During the learning process, the likelihood that the variables in the lower layers predict the higher-layer variables is maximized. DBNs have been successfully used to detect handwritten digits (Hinton et al., 2006) and human motion (Taylor et al., 2007).

DBNs are probabilistic models composed of several layers of stochastic, binary variables. The top layers form an associative memory of the model, while the lower layers receive connections from the layer above. 

DBNs are characterized by two properties:

1. Layered learning

- An efficient learning procedure that trains the network one layer at a time
- The learned top-down weights determine the relationship between the variables of one layer and the layer above
- After the learning process, the hidden variables of each layer can be inferred from the data at the bottom by using the learned weights in the reverse direction

2. Fine-tuning

- To improve the predictive power of a DBN, the weights can be fine-tuned
- Fine-tuning consists of creating a final layer of variables representing the desired output and back-propagating errors across the network

In both stages, DBNs seek to infer the states of the unobserved variables while adjusting the interactions between the layers to increase the likelihood that the network will generate the observed data; fine-tuning then nudges those interactions toward a specific predictive output.
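
To make these two steps concrete, here is a rough sketch (my own illustration, not from any of the cited papers) using scikit-learn: BernoulliRBM layers are stacked for greedy, layer-wise learning, and a logistic-regression output layer stands in for the fine-tuning stage. Note that scikit-learn does not back-propagate through the pretrained RBM layers, so this only approximates a full DBN, and the layer sizes and learning rates are arbitrary.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Handwritten digits, scaled into [0, 1] so they behave like binary-ish inputs.
X, y = load_digits(return_X_y=True)
X = X / 16.0
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dbn_like = Pipeline([
    # Greedy layered learning: each RBM models the output of the layer below it.
    ("rbm1", BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=20, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
    # Stand-in for fine-tuning: a supervised output layer on the top features.
    ("out", LogisticRegression(max_iter=1000)),
])

dbn_like.fit(X_train, y_train)
print("test accuracy:", dbn_like.score(X_test, y_test))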

The learning and fine-tuning process faces problems in networks with many hidden layers. As the number of layers increases, the significance of the inference drawn from an input data vector decreases. This is a problem when trying to analyze genetic interactions within biological networks: the problem scales with 2^N, where N is the number of genes. As N increases, the network becomes more densely connected and the usefulness of DBNs decreases.
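
To put that exponential scaling in perspective (a back-of-the-envelope calculation, not a result from any paper), treating each gene as a binary on/off variable gives a joint state space that doubles with every gene added:

# Back-of-the-envelope illustration of the 2^N scaling of binary gene states.
for n_genes in (10, 100, 1000):
    print(f"{n_genes:>5} genes -> {2 ** n_genes:.2e} possible joint on/off states")
# At genome scale (roughly 20,000 protein-coding genes), the number 2**20000
# has over 6,000 digits, far beyond anything a model could enumerate.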

Measuring gene expression levels provides a connection between genotype and phenotype. Many studies use the correlation between gene expression data and pathology to determine the causes of diseases (Tan & Gilbert, 2003). A major problem in the field is determining the significance of changes in gene expression relative to controls. Machine learning techniques offer new tools for finding significant disease genes.
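
For comparison, the classical approach to the significance problem tests each gene separately for a difference in mean expression between disease and control samples, for example with a t-test and a multiple-testing correction. The sketch below is a toy version of that idea on synthetic data (the sample sizes and the planted gene are made up):

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_genes, n_disease, n_control = 1000, 20, 20

# Synthetic expression matrices (genes x samples); gene 0 is truly shifted.
disease = rng.normal(0.0, 1.0, size=(n_genes, n_disease))
control = rng.normal(0.0, 1.0, size=(n_genes, n_control))
disease[0] += 2.0  # a planted differentially expressed gene

# Per-gene t-test of disease vs. control expression.
_, pvals = ttest_ind(disease, control, axis=1)

# Crude Bonferroni correction for testing all genes at once.
significant = np.where(pvals < 0.05 / n_genes)[0]
print("significant genes:", significant)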


A problem with using machine learning in biology is that as dimensionality increases, predictions get worse. Furthermore, there may not be enough data sets to successfully train a model. One way to address these problems is to use unsupervised and deep learning methods. A good start would be to use DBNs in genomic analysis.
