November 2, 2008

The plague of cross-database annotations

Recently I had to annotate a large number of genes (10,000+) identified by Entrez Gene IDs. My goal was to avoid the “annotation files” (basically CSV files) that part of the wet lab group likes, because I wanted to stay up to date without having to remember to update them. So the obvious solution was to use a service available on the web, in an automated way. For reference, I only tried to attach gene symbol, gene name, chromosome and cytoband.

April 5, 2008

Performance and R

I often wonder why people resort only to R when working with microarrays. I can understand that Bioconductor offers a plethora of different packages and that R’s statistical functions come in handy for many applications, but still, I think people underestimate the impact of performance. R is not a performant language at all: it doesn’t parallelize well when using HPC (at least judging from the talks I’ve had with people studying the matter), and in general it is a memory and resource hog.

February 28, 2008

Follow up on meta-analysis

Fourteen days since my last post. Quite a while, indeed. Mostly I’ve been swamped with work and some health-related issues. Anyway, I thought I’d follow up on the meta-analysis matter I discussed in my last post. It turns out that the fault lies with both limma and the data sets, because apparently the raw data found in the Stanford Microarray Database have different lengths, gene-wise (a result of not all spots on the array being good?

February 14, 2008

Meta analysis difficulty increasing

Again in the past days I’ve been banging my head against the wall, because doing meta-analysis with microarray data is more difficult than it seems. The problem sometimes lies in the data, sometimes in the analysis software, and sometimes in a combination of factors. When doing work on a public data set (Zhao et al., 2005), I had to start the analysis from raw data. Now, I tried using both the limma and marray Bioconductor packages, but both of them bail out with cryptic error messages.

November 15, 2007

Gene identifiers

While working today on an annotation class in Python I stumbled on a problem. Normally I work with lists of genes that are consistent, i.e. all Entrez Gene IDs (or RefSeq IDs, or Genome Browser IDs…), but today I had a list of mixed identifiers. The subsequent idea was “let’s implement auto-detection of common identifiers in the class”. The problem is… is there any actual documentation on how identifiers are made?
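
To make the auto-detection idea concrete, here is a minimal sketch in Python. It is not the actual annotation class: the function names and the regular expressions are my own assumptions, based only on the shapes these identifiers commonly take (plain integers for Entrez Gene IDs, a two-letter prefix plus underscore for RefSeq accessions, an ENSG prefix for human Ensembl genes), not on any official specification.

```python
import re

# Assumed patterns based on commonly seen identifier shapes.
# They are illustrative guesses, not an official specification.
PATTERNS = [
    ("entrez_gene", re.compile(r"^[1-9]\d*$")),               # e.g. 7157
    ("refseq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),        # e.g. NM_000546.5
    ("ensembl_gene", re.compile(r"^ENSG\d{11}(\.\d+)?$")),    # e.g. ENSG00000141510
]


def guess_identifier_type(identifier):
    """Return the name of the first pattern that matches, or 'unknown'."""
    token = identifier.strip()
    for name, pattern in PATTERNS:
        if pattern.match(token):
            return name
    return "unknown"


def group_by_type(identifiers):
    """Split a mixed list of identifiers into per-type lists."""
    groups = {}
    for identifier in identifiers:
        groups.setdefault(guess_identifier_type(identifier), []).append(identifier)
    return groups
```

Calling `group_by_type` on a mixed list gives one list per guessed type, so each group can then be sent to the appropriate lookup. Anything the patterns don't recognize (gene symbols, for instance) ends up under `"unknown"` rather than being guessed at.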

October 4, 2007

Easy RMA: RMAExpress

Today I was looking for an easy way to do some calculations on raw expression data from Affymetrix arrays, but I didn’t want to use R: I have already mentioned how I don’t like its design and implementation. While looking for some documentation, I stumbled upon this nifty little program called RMAExpress.