Multiscale bootstrap clustering with Python and R
By einar
While reading the statistics for my blog, I noticed that a number of searches looked for hierarchical clustering with Python, which [I covered quite a while ago]({{ site.url }}/2007/11/data-clustering-with-python). Today I’d like to present an updated version which uses more robust techniques.
Defining the problem
Since Eisen’s original paper on clustering, this form of analysis has been widely used by researchers. However, as is well known, such methods can be susceptible to an ordering bias: in other words, the order of the samples and/or genes might influence the final result. That’s why popular software such as TMeV offers alternative approaches based on bootstrapping.
In this specific form of bootstrapping, the samples and/or genes are randomly resampled a number of times (1,000 or more iterations are a good starting point) and the resulting dendrograms are checked for consistency and robustness of partitioning. In other words, a p-value is calculated, the null hypothesis being that the arrangement of samples/genes is merely due to chance. Depending on the software, this value might be expressed either as a p-value or as a percentage (TMeV calls it support).
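To make the idea concrete, here is a small, purely illustrative sketch of how an ordinary (non-multiscale) bootstrap support value could be computed, written with numpy/scipy rather than the R machinery we’ll actually use below; the function name and parameters are my own and not part of any package.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def bootstrap_support(data, sample_a, sample_b, n_boot=1000, k=2):
    # data is a genes x samples array; we resample genes with replacement,
    # re-cluster the samples each time, and count how often the two given
    # samples land in the same group when the tree is cut into k clusters
    n_genes = data.shape[0]
    together = 0
    for _ in range(n_boot):
        idx = np.random.randint(0, n_genes, n_genes)
        z = linkage(data[idx].T, method="average", metric="correlation")
        labels = fcluster(z, t=k, criterion="maxclust")
        if labels[sample_a] == labels[sample_b]:
            together += 1
    return together / float(n_boot)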
Some years ago I came across an interesting method developed by Hidetoshi Shimodaira: the technique, called multiscale bootstrap resampling, aims at obtaining more accurate p-values from the bootstrap. Shimodaira calls the resulting p-value an AU value, where AU stands for “approximately unbiased”: it is more accurate than the p-value obtained through ordinary bootstrapping alone.
In addition to this nice algorithm, an R package named _pvclust_ was also provided (it’s available on your favorite CRAN mirror). And that’s exactly what we’ll use for this exercise.
Prerequisites
Some readers of this blog might remember my disdain for R: while I need to use it for Bioconductor, I’m often annoyed by its weird syntax and difficult-to-understand error messages. Luckily, thanks to the hard work of Laurent Gautier and contributors, there’s rpy2, a nice R-to-Python bridge. All the examples here require this package, version 2.1 or newer (I’d recommend the release candidate of 2.2, it’s really nice). Unfortunately, this means that Windows users are out of luck, as there’s no version of rpy2 2.1 or 2.2 available for that platform.
Also, don’t forget to have the pvclust package installed in R.
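If you haven’t installed it yet, one convenient option is to do it through rpy2 itself (a minimal sketch; running install.packages from an R prompt works just as well):

from rpy2.robjects.packages import importr

utils = importr("utils")
utils.chooseCRANmirror(ind=1)        # pick the first CRAN mirror in the list
utils.install_packages("pvclust")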
Loading and preparing the data
Let’s start by importing the necessary bits:
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
The second line is important, because it lets us play with R libraries as if they were Python packages. Case in point, we’ll load the “base” and “pvclust” libraries:
base = importr("base")
pvclust = importr("pvclust")
Now we can manipulate them as if they were modules, and most of R’s dotted function names are converted to use underscores, since the dot is the attribute access operator in Python. Example: as.data.frame becomes as_data_frame.
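For instance, R’s file.path becomes file_path (a throwaway example, just to show the naming convention):

# file.path("some", "path") in R is called as file_path("some", "path")
print(base.file_path("some", "path"))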
Next, we’ll load the data into a data.frame. rpy2 conveniently gives us the _DataFrame_ class, which is a no-nonsense wrapper around R’s data.frames. For this exercise, we’ll load a set of normalized data from GSE4984, a microarray experiment with dendritic cells exposed to different stimuli. It’s just a matter of downloading the data from Array Express (if you ask why from AE and not GEO: the latter doesn’t have a clearly-identified link for normalized data) and then loading it into a data.frame as:
dataframe = robjects.DataFrame.from_csvfile("GSE4984.txt", sep="\t", row_names=1)
The resulting Python object has all the attributes of an R data.frame but with added Python goodness. We can use the colnames and rownames attributes to access the column and row names (if set) of the object, and likewise we can use nrow and ncol to quickly check the number of rows and columns.
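For example, a quick sanity check on the data we just loaded could look like this:

print(dataframe.nrow)      # number of probes/genes (rows)
print(dataframe.ncol)      # number of samples (columns)
print(dataframe.colnames)  # the sample names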
Since a full array has a lot of genes, we’re going to choose only the first 500 genes:
rows = robjects.IntVector(range(1,501))
subset = dataframe.rx(rows, True)
An IntVector is an rpy2 object which replicates R’s vectors of integers: there are variants for strings, floats, lists (R lists, not the Python type) and factors. rx is an accessor that mimics R’s [ ] item access: in short, the line above is equivalent to
subset <- dataframe[rows, ]
rpy2 has another accessor, rx2, which mimics [[ ]] access on R lists and data.frames.
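For example, on our data.frame (just to illustrate the difference, this isn’t needed for the clustering):

first_column = dataframe.rx2(1)    # the first column as a vector, like [[1]] in R
first_column_df = dataframe.rx(1)  # a one-column data.frame, like [1] in R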
Clustering
Once we have the data, it’s time to do some serious clustering on it:
result = pvclust.pvclust(subset, nboot=100, method_dist="correlation", method_hclust="average")
We’re using a small number of bootstrap iterations (100) because the computation takes a long time. You can change the distance metric and the linkage type using method_dist and method_hclust. Internally the data.frame is converted to a matrix, so make sure your data are valid (i.e. numeric) before proceeding.
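If you’re not sure the subset is entirely numeric, a quick check along these lines (my own addition, not something pvclust requires you to run) can save a confusing error later:

# prints TRUE only if the data.frame converts cleanly to a numeric matrix
print(base.is_numeric(base.as_matrix(subset)))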
Notice that this only clusters the columns (in our case, the samples). If we want to cluster the genes, we have to transpose the data: we first convert the data.frame to a matrix, then transpose it:
matrix = base.as_matrix(subset)
subset_transposed = matrix.transpose()
result_rows = pvclust.pvclust(subset_transposed, nboot=100, method_dist="correlation", method_hclust="average")
Once the computation is done, we have a pvclust object which holds information on the results. What we’re most interested in is the hclust attribute, as it holds the hierarchical clustering we can use for plotting (either standalone or via a heat map). We can also query the object with the pvpick function, for example to extract the clusters of the dendrogram whose AU values pass a given threshold.
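For instance, with pvclust’s default threshold:

# clusters whose AU value is at least 0.95
picked = pvclust.pvpick(result, alpha=0.95)
print(picked.rx2("clusters"))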
To get a quick overview, we can just plot the object as it is: the resulting dendrogram shows the AU and BP (ordinary bootstrap probability) values for each edge of the cluster:
graphics = importr("graphics")
graphics.plot(result)
Or we can do the same, but to a PDF:
graphics = importr("graphics")
grdevices = importr("grDevices")
grdevices.pdf("myresult.pdf", paper="a4")
graphics.plot(result)
grdevices.dev_off()
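If you want to highlight the clusters whose AU value passes a threshold directly on the dendrogram (the coloring mentioned above), pvclust also provides pvrect; call it right after graphics.plot(result), and before dev_off when writing to a PDF:

# draw rectangles around the clusters with AU >= 0.95
pvclust.pvrect(result, alpha=0.95)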
Of course, we might want a heat map (everyone wants pretty heat maps, right?). In that case we extract both hierarchical clusterings and use something like gplots’ _heatmap.2_ function to draw it (you will need the gplots package installed for the following to work):
stats = importr("stats")
gplots = importr("gplots")
# heatmap.2 expects dendrogram objects, so we convert the hclust results
row_dendrogram = stats.as_dendrogram(result_rows.rx2("hclust"))
column_dendrogram = stats.as_dendrogram(result.rx2("hclust"))
# heatmap.2 also wants a numeric matrix, so we reuse the matrix created earlier
gplots.heatmap_2(matrix, Rowv=row_dendrogram, Colv=column_dendrogram, col=gplots.greenred(255), density_info="none")
You can add the grdevices lines shown above to save the plot to a PDF. Notice that we used the rx2 accessor here, as described above, to access the hclust attribute of the pvclust objects.
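In full, the PDF version would look something like this (the file name is of course up to you):

grdevices.pdf("heatmap.pdf", paper="a4")
gplots.heatmap_2(matrix, Rowv=row_dendrogram, Colv=column_dendrogram, col=gplots.greenred(255), density_info="none")
grdevices.dev_off()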
Moving further
pvclust as-is is quite slow. There is however a parallelized version, called parPvclust, which uses snow to distribute the clustering either across multiple machines or across multiple cores. Setting snow up properly is beyond the scope of this tutorial, but it may be worth the investment if you cluster a lot of data.
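Just to give an idea of what that looks like, here is a minimal sketch assuming a local socket cluster with four workers (the exact snow setup will depend on your environment):

snow = importr("snow")

cl = snow.makeCluster(4, type="SOCK")   # four local workers; adjust as needed
result = pvclust.parPvclust(cl, subset, nboot=1000, method_dist="correlation", method_hclust="average")
snow.stopCluster(cl)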