
Tuesday, August 04, 2015

Hacking Zebrafish thoughts

The last lab from Scalable Machine Learning with Spark features a guest lecture by Jeremy Freeman, a neuroscientist at the Janelia Research Campus.

His group produced this gorgeous video of a living zebrafish brain. Little fish thoughts sparkle away, made visible by a technique called light-sheet fluorescence microscopy, in which proteins that light up when neurons fire are engineered into the fish.

The lab covers principal component analysis in a lively way. Principal components are extracted from time-series data, mapped onto an HSV color wheel, and used to color an image of the zebrafish brain. In the process, we use some fun matrix manipulation to aggregate the time series data in two different ways - by time relative to the start of a visual stimulus and by the directionality of the stimulus (shown below).
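
The color-mapping trick is easy to sketch outside of Spark. This isn't the lab's code - the array shapes and names are made up - but it shows the idea: project each pixel's time series onto the top two principal components, then turn the angle and radius in that plane into hue and saturation.

import colorsys
import numpy as np

# toy data: one row per pixel, one column per time point
X = np.random.rand(500, 240)
X = X - X.mean(axis=1, keepdims=True)

# top two principal components via SVD of the centered data
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, :2] * s[:2]    # each pixel projected onto PC1 and PC2

# angle in the (PC1, PC2) plane becomes hue, distance from the origin becomes saturation
hue = (np.arctan2(scores[:, 1], scores[:, 0]) + np.pi) / (2 * np.pi)
sat = np.sqrt((scores ** 2).sum(axis=1))
sat = sat / sat.max()
rgb = [colorsys.hsv_to_rgb(h, s2, 1.0) for h, s2 in zip(hue, sat)]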

The whole series of labs from the Spark classes was nicely done, but this was an especially fun way to finish it out.

Check out the Freeman Lab's papers:

Wednesday, July 08, 2015

Scalable Machine Learning with Spark class on edX

Introduction to Big Data with Apache Spark is an online class hosted on edX that just finished. Its follow-up Scalable Machine Learning with Spark just got started.

If you want to learn Spark - and who doesn't? - sign up.

Spark is a successor to Hadoop that comes out of the AMPLab at Berkeley. It's faster for many operations because it keeps data in memory, and its programming model feels more flexible in comparison to Hadoop's rigid framework. The AMPLab provides a suite of related tools including support for machine learning, graphs, SQL and streaming. While Hadoop is most at home with batch processing, Spark is a little better suited to interactive work.

The first class was quick and easy, covering Spark and RDDs through PySpark. No brain stretching on the order of Daphne Koller's Probabilistic Graphical Models to be found here. The lectures stuck to the "applied" aspects, but that's OK. You can always hit the papers to go deeper. The labs were fun and effective at getting you up to speed.

Labs for the first class:

  • Word count, the hello world of map-reduce (a sketch follows this list)
  • Analysis of web server log files
  • Entity resolution using a bag-of-words approach
  • Collaborative filtering on a movie ratings database. Apparently, I should watch these: Seven Samurai, Annie Hall, Akira, Stop Making Sense, Chungking Express.
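
For reference, the word count pattern is only a few lines of PySpark. This is a generic sketch rather than the lab's solution, and the input file name is made up:

from pyspark import SparkContext

sc = SparkContext("local", "wordcount")
lines = sc.textFile("shakespeare.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
top10 = counts.takeOrdered(10, key=lambda pair: -pair[1])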

The second installment looks to be very cool, delving deeper into MLlib, the AMPLab's machine learning library for Spark. Its labs cover:

  • Musicology: predict the release year of a song given a set of audio features (a sketch of the approach follows this list)
  • Prediction of click-through rates
  • Neuroimaging Analysis on brain activity of zebrafish (which I suspect is the phrase "Just keep swimming" over and over) done in collaboration with Jeremy Freeman of the Janelia Research Campus.
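
The release-year problem has the general shape of plain MLlib linear regression. The snippet below is only a sketch of that idea, not the lab's notebook - the file name and its comma-separated year-then-features layout are assumptions, and sc is an existing SparkContext:

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parse_point(line):
    # assumed layout: year, feature1, feature2, ...
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[0], values[1:])

data = sc.textFile("songs.csv").map(parse_point)
train, test = data.randomSplit([0.8, 0.2], seed=42)

model = LinearRegressionWithSGD.train(train, iterations=100, step=1e-3)
rmse = test.map(lambda p: (model.predict(p.features) - p.label) ** 2).mean() ** 0.5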

The labs for both classes are authored as IPython notebooks in the amazingly cool Jupyter framework, where prose, graphics and executable code combine to make a really nice learning environment.

Echoing my own digital hoarder tendencies, the first course is liberally peppered with links, which I've dutifully culled and categorized for your clicking compulsion:

Big Data Hype

Papers

Data Cleaning

Background

etc

The Data Science Process

In case you're still wondering what data scientists actually do, here it is according to...

Jim Gray

  • Capture
  • Curate
  • Communicate

Ben Fry

  • Acquire
  • Parse
  • Filter
  • Mine
  • Represent
  • Refine
  • Interact

Jeff Hammerbacher

  • Identify problem
  • Instrument data sources
  • Collect data
  • Prepare data (integrate, transform, clean, filter, aggregate)
  • Build model
  • Evaluate model
  • Communicate results

...and don't forget: Jeffrey Leek and Hadley Wickham.

Thursday, July 03, 2014

Galaxy Community Conference

The 2014 Galaxy Community Conference (GCC2014) wrapped up yesterday at Johns Hopkins University in Baltimore. The best part for me was the pre-conference hackathon.

If you're not familiar with Galaxy, it's a framework for wrapping command line bioinformatics tools in a web UI. On top of that, Galaxy adds lots of sophistication around job scheduling, configurable to run jobs on SGE, SLURM, and other queuing systems and to run on virtual clusters with Cloudman. Galaxy users can design reproducible workflows and manage tool versioning and dependencies through the Toolshed.

The project has attracted a vibrant community supported by an active Q&A site and an IRC channel. Vendors offer Galaxy appliances, cloud deployments and consulting.

Hackathon

For the hackathon, 40 developers gathered in James Taylor's new computing lab in one of the red brick buildings interspersed with lawns and groves of trees that make up the Hopkins campus. This was a great setting for an accelerated ramp-up on the Galaxy codebase. Out of 18 proposed projects, 9 got far enough to be called finished. I learned how to write tool wrappers and how to access the Galaxy API.
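
One way to script against the Galaxy API from Python is the BioBlend client library. A minimal sketch - the server URL, API key and file name here are placeholders:

from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url='https://usegalaxy.org', key='YOUR_API_KEY')

# list the histories visible to this account
histories = gi.histories.get_histories()
names = [h['name'] for h in histories]

# upload a local file into the most recent history
if histories:
    gi.tools.upload_file('data.txt', histories[0]['id'])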

Photo credit - someone on the Galaxy team

Galaxy + Synapse

Specifically, I was there to put together some integration between Synapse and Galaxy, building on what Kyle Ellrott had already started. Reproducible analysis, provenance and annotation are core concerns of both projects, so it seems like a good fit. After data exchange, which Kyle got working, exchanging provenance and metadata seems like a logical next step.

Galaxy histories should be able to refer to Synapse entities and receive annotations along with data objects.

Serializing Galaxy histories, workflows etc. out to Synapse might allow a usage model where cloud instances of Galaxy can be spun up and shut down on demand, with all the important details preserved in Synapse in between.

Portable provenance might be a longer term goal. It would be neat to be able to thread provenance through some arbitrary combination of data sources and analysis platforms. All of these provenance-aware platforms - Figshare, Dryad, Synapse, Arvados, Galaxy, etc - ought to be able to share provenance between themselves.

Next year

The whole event was well put together in general. Next year's Galaxy Community Conference is scheduled for 6-8 July 2015 in Norwich, England.

Tuesday, June 25, 2013

The Dream 8 Challenges

The 8th iteration of the DREAM Challenges are underway.

DREAM is something like the Kaggle of computational biology with an open science bent. Participating teams apply machine learning and statistical modeling methods to biological problems, competing to achieve the best predictive accuracy.

This year's three challenges focus on reverse engineering cancer, toxicology and the kinetics of the cell.

Sage Bionetworks (my employer) has teamed up with DREAM to offer our Synapse platform as an integral part of the challenges. Synapse is targeted at providing a platform for hosting data analysis projects, much like GitHub is a platform for software development projects.

My own part in Synapse is on the Python client and a bit on the R client. I expect to get totally pummeled by bug reports once participation in the challenges really gets going.
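
For the curious, the basic shape of the Python client looks roughly like this - the login credentials, Synapse IDs and file names are placeholders:

import synapseclient

syn = synapseclient.Synapse()
syn.login('me@example.com', 'secret')

# fetch an entity by its Synapse ID; file entities are downloaded locally
entity = syn.get('syn1234567')

# store a result file, recording which entity it was derived from
results = synapseclient.File('predictions.csv', parent='syn7654321')
results = syn.store(results, used=['syn1234567'])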

Open data, collaborative innovation and reproducible science

The goal of Synapse is to enable scientists to combine data, source code, provenance, prose and figures to tell a story with data. The emphasis is on open data and collaboration, but full access control and governance for restricted access is built in.

In contrast to Kaggle, the DREAM Challenges are run in the spirit of open science. Winning models become part of the scientific record rather than the intellectual property of the organizers. Sharing code and building on other contestants' models is encouraged, with hopes of forming networks of collaborative innovation.

Aside from lively competition, these challenges are a great way to compare the virtues of various methods on a standardized problem. Synapse is aiming to become an environment for hosting standard open data sets and documenting reproducible methods for deriving models from them.

Winning methods will be presented at the RECOMB/ISCB Conference in Toronto this fall.

So, if you want to sharpen your data science chops on some important open biological problems, check out the DREAM8 challenges.

More on DREAM, Sage Bionetworks, and Challenges

Wednesday, January 30, 2013

Javascript style objects in Python

As I've gotten older and crankier, one of the things I've gotten cranky about is Object Oriented Programming. Sometimes, it seems like rigid class hierarchies just get in the way. Sure, it's cool that the compiler can prove type-safety properties about your program, but whether the boilerplate is worth the benefit is situation dependent. When it just has to work once, coding for maintenance is a poor investment - quick-n-dirty is the way to go.

I've found class hierarchies in dynamic scripting languages to be especially unnecessary. Why use a language where you're not supposed to worry about typing to build elaborate user-defined type hierarchies?

In those quick-n-dirty scenarios, what I want is a big stinking bag of properties. Ideally, I don't care whether the values are data or functions. In other words, just like it's done in Javascript.

Here's one way to get something like that, in Python:
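
A minimal version might look something like this, with the instance __dict__ serving as the property bag (the class and property names are just for illustration):

class JSObject(object):
    """A bag of properties, vaguely like a Javascript object."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)
    def __repr__(self):
        return "JSObject(%s)" % ", ".join(
            "%s=%r" % (k, v) for k, v in self.__dict__.items())

obj = JSObject(name="foo", size=123)
obj.color = "green"                     # add new properties at any time
obj.shout = lambda: obj.name.upper()    # values can be functions, too
obj.shout()                             # 'FOO'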

Monday, January 14, 2013

VCF: Variant Call Format

VCF, for Variant Call Format, is a tab-delimited file format for storing genetic sequence variations, developed in the context of the 1000 Genomes Project. There's a suite of open source vcftools and a pleasantly brief paper.

What follows serves as a dumping ground for knowledge and links about VCF.

VCF is a highly flexible format. There are 9 standard columns plus an additional column for each sample. The contents in the sample column are a colon-separated list of values. The FORMAT column specifies the layout of the sample columns, so that putting the two together yields a matching list of key/value pairs for each variant in each sample. A header section specifies metadata about the sample columns and the key/value properties.
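
In Python terms, pairing the FORMAT column with one sample column is just a zip - the values below are made up:

format_col = "GT:GQ:DP"    # layout of the per-sample values
sample_col = "0|1:48:8"    # one sample's values for one variant

call = dict(zip(format_col.split(':'), sample_col.split(':')))
# call == {'GT': '0|1', 'GQ': '48', 'DP': '8'}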

A line of VCF is essentially a precomputed two-level join associating variants with samples and samples with key/value properties. On top of that, the INFO column associates key/value properties with the variant, so I suppose it's a 4-way join. Nice!

Terminology

Genotype: VCF handles genotypes of different ploidy. The genotype can contain just one allele (haploid), two alleles separated by '|' or '/' (diploid), or potentially more. Human VCF files will likely be mixed haploid and diploid due to the X and Y chromosomes in males and mitochondrial DNA.

Phasing: If the reference sequence assembly is reasonably complete, it may be possible to map the variant calls to a consistent strand. The term for this is phasing. A VCF file may be completely unphased or each call can indicate phased or unphased. The degree of completeness depends on how well and how unambiguously the calls map to the reference.

Filters: FILTER is one of the standard VCF columns. Valid values for the FILTER column are "PASS" or a code for the filters that the variant call fails.

Tools

PyVCF is a nice python library for reading and writing VCF. It comes with a handy little utility vcf_filter, which does what it says. The reading part seems to be the most solid, with writing and filtering a little green, yet. But the code is nicely readable, and its proprietors are friendly to contributions.
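
Reading a file with PyVCF looks roughly like this - the file name is a placeholder, and which per-sample keys are available depends on the FORMAT column:

import vcf

reader = vcf.Reader(open('example.vcf', 'r'))
for record in reader:
    print record.CHROM, record.POS, record.REF, record.ALT, record.FILTER
    for call in record.samples:
        print ' ', call.sample, call['GT']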

SnpEff enhances VCF files by annotating variants and calculating the effects they produce on known genes, for example, amino acid changes, frame shifts and splice site changes.

VCF resources

Sunday, August 19, 2012

Scientific Python

Astronomer Joshua Bloom gave a talk titled Python as Super Glue for the Modern Scientific Workflow at SciPy2012. Bloom teaches Python for Scientific Computing at Berkeley (available as a podcast). He showcased Pythonic contributions to work on supernovae and machine learning in astronomy.

Python has solid support for data analysis and scientific computing, starting with NumPy for matrix manipulation, SciPy, which adds differential equations, optimization, statistics and more, and matplotlib for graphics. I keep meaning to check out Pandas, scikit-learn, and Sage.
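
The division of labor shows up in even a tiny made-up example: NumPy holds the arrays, SciPy does the fitting, matplotlib draws the picture.

import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt

# a noisy sine curve
x = np.linspace(0, 4 * np.pi, 200)
y = np.sin(x) + np.random.normal(scale=0.2, size=x.shape)

# least-squares fit of amplitude and frequency
model = lambda t, a, b: a * np.sin(b * t)
params, cov = optimize.curve_fit(model, x, y, p0=[1.0, 1.0])

plt.plot(x, y, '.', x, model(x, *params), '-')
plt.show()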

IPython keeps getting more impressive and appears to be evolving in the direction of Mathematica's Live Notebooks. I have a long-standing thing for mixing prose and code. Fernando Perez’s talk on IPython at SciPy2012 and the longer version IPython in-depth are in my viewing queue.

Part of the beauty of Python is its breadth as a general purpose programming language. Libraries for everyday programming tasks like web applications and database interaction are well developed in the Python world. There are good arguments in favor of DSLs for math and statistics, the approach embodied by R, Matlab, Mathematica and Julia. On the other hand, some may agree with John Cook, who puts it like this: "I’d rather do math in a general-purpose language than try to do general-purpose programming in a math language."

More

Tuesday, October 18, 2011

Python cheat sheet

The most important docs at python.org are the tutorial and library reference.

Pointers to the docs

Important modules: sys, os, os.path, re, math, io

Inspecting objects

>>> help(obj)
>>> dir(obj)

List comprehensions and generators

numbers = [2, 3, 4, 5, 6, 7, 8, 9]
[x * x for x in numbers if x % 2 == 0]

Generators might be thought of as lazy list comprehensions.

def generate_squares():
  i = 1
  while True:
    yield i*i
    i += 1
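
Pulling values out of the generator on demand, for example with itertools:

from itertools import islice

squares = generate_squares()
list(islice(squares, 5))   # [1, 4, 9, 16, 25]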

Main method

def main():
    # do something here
    pass

if __name__ == "__main__":
    main()

See Guido's advice on main methods. To parse command line arguments use argparse instead of the older optparse or getopt.

Classes

The tutorial covers classes, but know that there are old-style classes and new-style classes.

class Foo(object):
  'Doc string for class'

  def __init__(self, a, b):
    'Doc string for constructor'
    self.a = a
    self.b = b
  
  def square_plus_a(self, x):
    'Doc string for a useless method'
    return x * x + self.a
    
  def __str__(self):
    return "Foo: a=%d, b=%d" % (self.a, self.b)

Preoccupation with classes is a bit passé these days. Javascript objects are just bags of properties to which you can add arbitrary properties whenever you feel like it. In Ruby, you might use OpenStruct. It's quite easy in Python. You just have to define your own class. I'll follow the convention I've seen elsewhere of creating an empty class called Object derived from the base object. Why you can't set attributes on an object instance is something I'll leave to the Python gurus.

class Object(object):
    pass

obj = Object()
obj.foo = 123
obj.bar = "A super secret message"
dir(obj)
['__doc__', '__module__', 'bar', 'foo']

You can add functions, too, but they act a little funny: self doesn't get bound automatically.
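
That's because assigning a plain function to an instance attribute doesn't create a bound method. Wrapping it in types.MethodType does the binding explicitly - the describe function here is just an example:

import types

def describe(self):
    return "foo=%s, bar=%s" % (self.foo, self.bar)

obj.describe = describe
# obj.describe()    # TypeError: describe() takes exactly 1 argument (0 given)

obj.describe = types.MethodType(describe, obj)
obj.describe()      # 'foo=123, bar=A super secret message'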

Files

Reading text files line by line can be done like so:

with open(filename, 'r') as f:
    for line in f:
        dosomething(line)

Be careful not to mix iteration over lines in a file with readline().

Exceptions

try:
  raise Exception('My spammy exception!', 1234, 'zot')
except Exception as e:
  print type(e)
  print e
finally:
  print "cleanup in finally clause!"

The traceback module prints stack traces.
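
Typical use is inside an except block (do_something_risky is hypothetical):

import traceback

try:
    do_something_risky()
except Exception:
    traceback.print_exc()   # writes the stack trace to stderr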

Logging

import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

Conditional Expressions

Finally added in Python 2.5:

x = true_value if condition else false_value

Packages, Libraries, Modules

What do they call them in Python? Here's a quick tip for finding out where installed packages are:

python -c 'import sys, pprint; pprint.pprint(sys.path)'

To find out what packages are installed, open a python shell and type:

help('modules')

Magic methods

Python's magic methods are the source of much confusion. Rafe Kettler's Guide to Python's Magic Methods sorts things out beautifully.

Tuesday, October 19, 2010

Message queues with Python

A while back, I wanted to build a web front end for a long-running python script. I started with a basic front end using Django. Django is a pleasantly straightforward web framework, quite similar to Rails, easy to learn (with the help of the excellent and free Django book), and generally trouble-free. Pylons is an alternate choice.

Because the computation was fairly resource intensive, I thought to introduce a queue. The web-app could then concentrate on collecting the necessary input from the user and dumping a job on the queue, leaving the heavy lifting to a worker process (or several). We'd redirect the user to a status page where s/he could monitor progress and get results upon completion. Sounds simple enough, right? I figured my worker processes would look something like this:

big_chunka_data = load_big_chunka_data()
mo_data = load_mo_data()
queue = init_queue("http://myserver.com:12345", "user", "pw", "etc")

while <not-done>:
    try:
        message = queue.block_until_we_can_take_a_message()
        if message says shutdown: shutdown
        big_computation(message['param'],
                        message['foo'],
                        big_chunka_data,
                        mo_data)
    except e:
        log_errors(e)

...and the whole pile-o-junk would look like this:

Start up a handful of workers and a nice load balancing effect comes for free. Slow heavily loaded workers will take fewer jobs, while faster workers take more. I was also hoping for a good answer to the question, "What happens when one of our workers dies?"

Options

There are a ridiculous number of message queues to choose from. I looked at beanstalk, which is nice and simple, but its Python binding, pybeanstalk, seems to be out of date. There's gearman, from Danga, the source of memcached. That looked fairly straightforward as well, although be careful to get the newer Python binding. Python itself now offers the multiprocessing module, which has a queue.

One intriguing option is ZeroMQ (aka 0MQ). It's message queueing without a queue. It's brokerless, meaning there's no external queue server process. Messages are routed in common MQ patterns right down at the network level. Of course, if you want store and forward, you're on your own for the persistence part. Still, very cool... Python bindings for ZeroMQ are found in pyzmq.

Several people on the Seattle Python mailing list recommended Celery. After a (superficial) look, Celery seemed too RPC-ish for my taste. I'm probably being up-tight, but when using a queue, I'd rather think in terms of sending a message than calling a function. That seems more decoupled and avoids making assumptions about the structure of the conversation and what's on the other side. I should probably lighten up. Celery is built on top of RabbitMQ, although it supports other options.

RabbitMQ and Carrot

RabbitMQ, now part of the SpringSource empire (in turn owned by VMWare), aims to compete with Apache ActiveMQ as a full-on enterprise messaging system based on the AMQP spec. I installed RabbitMQ using MacPorts, where you'll notice that it pulls in an absurd number of dependencies.

sudo port selfupdate
sudo port install rabbitmq-server

For getting python to talk to RabbitMQ, Carrot is a nice option. It was a bit confusing at first, but some nice folks on the carrot-users mailing list set me straight. Apparently, Carrot's author is working on a rewrite called Kombu.

Here's what worked for me. A producer sends Python dictionary objects, which get turned into JSON. My example code is only slightly modified from Creating a Connection in the Carrot documentation. You'll need a little RabbitMQ terminology to understand the connection methods.

  • queues are addresses of receivers
  • exchanges are routers with their own process
  • virtual hosts are the unit of security

Producer

from carrot.connection import BrokerConnection
from carrot.messaging import Publisher

conn = BrokerConnection(hostname="localhost", port=5672,
                          userid="guest", password="guest",
                          virtual_host="/")

publisher = Publisher(connection=conn,
                    exchange="feed", routing_key="importer")

for i in range(30):
   publisher.send({"name":"foo", "i":i})
publisher.close()

The consumers print out the messages as they arrive, then sleep for a bit to simulate long-running tasks. I tested by starting two consumers, one with a longer sleep time. Then I started a producer and saw that the slower consumer got fewer messages, which is what I expected. Note that setting prefetch_count to 1 is necessary to achieve this low-budget load balancing effect.

Consumer

import time
import sys
from carrot.connection import BrokerConnection
from carrot.messaging import Consumer

# supply an integer argument for sleep time to simulate long-running tasks
if (len(sys.argv) > 1):
    sleep_time = int(sys.argv[1])
else:
    sleep_time = 1

connection = BrokerConnection(hostname="localhost", port=5672,
                          userid="guest", password="guest",
                          virtual_host="/")

consumer = Consumer(connection=connection, queue="feed",
                    exchange="feed", routing_key="importer")

def import_feed_callback(message_data, message):
    print "-" * 80
    print message_data
    print message
    message.ack()
    print "-" * 80
    time.sleep(sleep_time)

consumer.register_callback(import_feed_callback)
consumer.qos(prefetch_count=1)

consumer.consume() 
while True: 
    connection.drain_events()

The project remains incomplete and I'm not at all ready to say this is the best way to go about it. It's just the first thing I got working. RabbitMQ seems maybe a little heavy for a simple task queue, but it's also well supported and documented.

It seems like this sort of thing is less mature in the Python world than in Java. It's moving fast though.

Links, links, links

Obsolete note: The version of RabbitMQ in MacPorts (1.7.2) at the time was a version behind and broken. I had to dig through the compiler error log and add a closing paren in line 100 of rabbit_exchange.erl.

Tuesday, March 23, 2010

How to read a file line by line in Python

The preferred way of reading text files in Python has changed several times over the years, so it's hard to google up a current solution. I think this would be the preferred way as of Python version 2.6.x and 3.1.x. Please comment if this is out of date.

import sys

filename = sys.argv[1]

with open(filename, 'r') as f:
    for line in f:
        dosomething(line)

...or even better, use fileinput which takes the files given as command-line arguments, or if missing, the standard input.

import fileinput

for line in fileinput.input():
    process(line)

Friday, July 17, 2009

Parsing GEO SOFT files with Python and Sqlite

NCBI's GEO database of gene expression data is a great resource, but its records are very open ended. This lack of rigidity was perhaps necessary to accommodate the variety of measurement technologies, but all that flexibility makes getting data out a little tricky. The scripts I end up with are not general parsers for GEO data and will need to be adapted to the specifics of other datasets.

Note: It could be that I'm doing things the hard way. Maybe there's an easier way.

A GEO record consists of a platform, which describes (for example) a microarray and its probes, and a series of samples. In this example, we need to do a join between the platform and the sample records to end up with a matrix of the form (seq, strand, start, end, value1, value2, ..., valueN), where the value1 column holds measurements from the first sample and so on. If we do that, we'll have coordinates on the genome and values for each measurement. My goal is to feed data into a genome browser known as HeebieGB with a stop-over in R along the way.

Merging on a common key is only slightly complicated, but tiling arrays are big (~244,000 probes in this case). I hesitate to try merging several 244K row tables in memory. Database engines are made for this sort of thing, so I'll use SQLite to get this done and Python to script the whole process.

I like to start python scripts with a template similar to Guido's main function, except that I prefer optparse to getopt. An --overwrite option will force the user to be conscious of overwriting files.

import sys
from optparse import OptionParser

def main():
 usage = "%prog [options] input_file sqlite_db_file"
 parser = OptionParser(usage=usage)
 parser.add_option("-o", "--overwrite", dest="overwrite", default=False, action="store_true", 
  help="if output db file exists, overwrite it")
 (options, args) = parser.parse_args()

 if len(args) < 2:
  parser.error("missing required arguments.")
  exit(2)

 input_filename = args[0]
 db_filename = args[1]

if __name__ == "__main__":
 sys.exit(main())

GEO records a bunch of descriptive data about each sample, some of which we want. I've read that storing arbitrary key-value pairs in a relational DB is considered bad by some. But, I'm going to do it anyway. The entity attributes will go in a table called attributes whose schema is (entity_id, key, value).

The function parse_platform_table pulls the platform data from a tab-separated section in the SOFT file into a table with a schema something like this: (id, sequence, strand, start, end). There's also a tab-separated section for each of the samples that refers back to its platform, so I extract that in a similar manner in parse_sample_table. It's easiest to start out with each sample in its own table, even though that's not really what we want.

The complete script - also available from SVN here - ends up like this:

import sys
from optparse import OptionParser
import re
import os
import os.path
import sqlite3

# GEO SOFT format is documented here:
# http://www.ncbi.nlm.nih.gov/projects/geo/info/soft2.html#SOFTformat

# ID field in platform joins with ID_REF field in samples

entity       = re.compile(r'\^(\S+) = (.+)')
kvp          = re.compile(r'!(\S+) = (.+)')

STATE_START = 0
STATE_IN_SERIES = 1001
STATE_IN_PLATFORM = 1002
STATE_IN_SAMPLE = 1003


def overwrite(name):
 if os.path.exists(name):
  os.remove(name)
  return True
 return False

def parse_series_file(file, conn):
 entity_id = None
 state = STATE_START

 # create an attributes table
 cursor = conn.cursor()
 try:
  cursor.execute('create table attributes (entity_id, key, value);')
  conn.commit()
 finally:
  cursor.close()

 for line in file:
  line = line.strip()

  # read entity tags
  if line.startswith('^'):
   m = entity.match(line)
   if m:
    entity_type = m.group(1)
    entity_id = m.group(2)
    print(entity_id)
    if entity_type == 'SERIES':
     state = STATE_IN_SERIES
    elif entity_type == 'PLATFORM':
     state = STATE_IN_PLATFORM
    elif entity_type == 'SAMPLE':
     state = STATE_IN_SAMPLE

  # read attribute key-value pairs and tab-separated tables
  elif line.startswith('!'):
   m = kvp.match(line)
   if m:
    key = m.group(1)
    value = m.group(2)
    handle_attribute(conn, entity_id, key, value)
   elif state==STATE_IN_PLATFORM and line=='!platform_table_begin':
    parse_platform_table(file, conn, entity_id)
   elif state==STATE_IN_SAMPLE and line=='!sample_table_begin':
    parse_sample_table(file, conn, entity_id)

def parse_platform_table(file, conn, platform_id):
 """
 Read the tab-separated platform section of a SOFT file and store the ID,
 sequence, strand, start, and end columns in a SQLite database.

 file: a file object open for reading
 conn: a SQLite database connection
 platform_id: a string identifying a GEO platform
 """
 cursor = conn.cursor()
 try:
  # throw away line containing column headers
  file.next()
  # create platform table
  cursor.execute('create table %s (id integer primary key not null, sequence text not null, strand not null, start integer not null, end integer not null, control_type integer);' % (platform_id))
  conn.commit()
  sql = 'insert into %s values(?,?,?,?,?,?)' % (platform_id)
  for line in file:
   line = line.strip('\n')
   if (line.strip() == '!platform_table_end'):
    break
   fields = line.split("\t")
   cursor.execute(sql, (int(fields[0]), fields[6], fields[10], fields[7], fields[8], fields[4]))
  conn.commit()
 finally:
  cursor.close()

def parse_sample_table(file, conn, sample_id):
 """
 Read a tab separated sample section from a SOFT file and store ID_REF and
 value in a SQLite DB.

 file: a file object open for reading
 conn: a SQLite database connection
 sample_id: a string identifying a GEO sample
 """
 cursor = conn.cursor()
 try:
  # throw away line containing column headers
  file.next()
  # create sample table
  cursor.execute('create table %s (id_ref integer not null, value numeric not null);' % (sample_id))
  conn.commit()
  sql = 'insert into %s values(?,?)' % (sample_id)
  for line in file:
   line = line.strip('\n')
   if (line.strip() == '!sample_table_end'):
    break
   fields = line.split("\t")
   cursor.execute(sql, (int(fields[0]), float(fields[1])))
  conn.commit()
 finally:
  cursor.close()

def handle_attribute(conn, entity_id, key, value):
 """
 Store an entity attribute in the attributes table
 """
 cursor = None
 try:
  cursor = conn.cursor()
  cursor.execute("insert into attributes values(?,?,?);", (entity_id, key, value))
  conn.commit()
 finally:
  if cursor:
   cursor.close()


def main():
 usage = "%prog [options] input_file"
 parser = OptionParser(usage=usage)
 parser.add_option("-o", "--overwrite", dest="overwrite", default=False, action="store_true", 
  help="if output db file exists, overwrite it")
 (options, args) = parser.parse_args()

 if len(args) < 2:
  parser.error("missing required arguments.")
  exit(2)

 input_filename = args[0]
 db_filename = args[1]

 if options.overwrite:
  overwrite(db_filename)

 input_file = None
 conn = None
 try:
  conn = sqlite3.connect(db_filename)
  input_file = open(input_filename, 'r')
  parse_series_file(input_file, conn)
 finally:
  if input_file:
   input_file.close()
  if conn:
   conn.close()


if __name__ == "__main__":
 sys.exit(main())

The specific series I'm interested in (GSE12923) has 53 samples. The platform (GPL7255) is a custom array on Agilent's 244k-feature microarrays, which works out to just short of 13 million individual measurements. The SOFT file is 708 MB and the script takes a good 5 or 6 minutes to ingest all that data. The next step is merging all the data into a single matrix.

This turned out to be harder than I thought. At first, I naively tried to do a big 54-way join between the platform table and all the sample tables, with an order-by to sort by chromosomal location. I let this run for a couple hours, then gave up. Sure, a big join on unindexed tables was bound to be ugly, but it only had to run once. I'm still surprised that this choked; after all, it's not that much data.

There are two ways around it. One is to index the sample tables by ID_REF and the platform table by (sequence, strand, start, end). The other is to do the big join then sort into a second table. Either takes several minutes, but it's just a one-off, so that's OK.
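
The indexing route is just a few more statements on the same sqlite3 connection (the index names here are invented); the big join below is the other option.

cursor = conn.cursor()
cursor.execute('create index idx_platform on GPL7255 (sequence, strand, start, end);')
for sample_id in ('GSM320660', 'GSM320661'):   # ...and so on for all 53 samples
    cursor.execute('create index idx_%s on %s (id_ref);' % (sample_id, sample_id))
conn.commit()
cursor.close()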

insert into matrix
select GPL7255.sequence, GPL7255.strand, GPL7255.start, GPL7255.end,
GSM320660.VALUE as GSM320660,
GSM320661.VALUE as GSM320661,
...GSM320712.VALUE as GSM320712
from GPL7255
join GSM320660 on GPL7255.ID = GSM320660.ID_REF
join GSM320661 on GPL7255.ID = GSM320661.ID_REF
...join GSM320712 on GPL7255.ID = GSM320712.ID_REF
where GPL7255.control_type==0 and sequence!='NA'
order by sequence, strand, start, end;

Now that we've done that, do you ever find data that doesn't need to be cleaned up a little bit?

-- swap mislabeled + and - strands (how embarrassing!)
update matrix set strand='Z' where strand='-';
update matrix set strand='-' where strand='+';
update matrix set strand='+' where strand='Z';

-- fix up sequence names
update matrix set sequence='chromosome' where sequence='chr1';
update matrix set sequence='pNRC200' where sequence='chr2';
update matrix set sequence='pNRC100' where sequence='chr3';

-- fix probes crossing the "zero" point
update matrix set start=end, end=start where end-start > 60;

That's about all the data munging I can stand for now. The rest, I'll leave for Part 2.

Thursday, July 02, 2009

R String processing

Note: Nowadays, stringr's str_match solves this problem nicely. Another option is gsubfn's very R-ish strapply.

Here's a little vignette of data munging using the regular expression facilities of R (aka the R-project for statistical computing). Let's say I have a vector of strings that looks like this:

> coords
[1] "chromosome+:157470-158370" "chromosome+:158370-158450" "chromosome+:158450-158510"
[4] "chromosome+:158510-159330" "chromosome-:157460-158560" "chromosome-:158560-158920"

What I'd like to do is parse these out into a data.frame with a column for each of sequence, strand, start, end. A regex that would do that kind of thing looks like this: (.*)([+-]):(\d+)-(\d+). R does regular expressions, but it's missing a few pieces. For example, in python you might say:

import re

coords = """
chromosome+:157470-158370
chromosome+:158370-158450
chromosome+:158450-158510
chromosome+:158510-159330
chromosome-:157460-158560
chromosome-:158560-158920
"""

regex = re.compile("(.*)([+-]):(\\d+)-(\\d+)")

for line in coords.split("\n"):
 line = line.strip()
 if (len(line)==0): continue
 m = regex.match(line)
 if (m):
  seq = m.group(1)
  strand = m.group(2)
  start = int(m.group(3))
  end = int(m.group(4))
  print "%s\t%s\t%d\t%d" % (seq, strand, start, end)

As far as I've found, there doesn't seem to be an equivalent in R to regex.match, which is a shame. The gsub function supports capturing groups in regular expressions, but isn't very flexible about what you do with them. One way to solve this problem is to use gsub to pull out each individual column. Not efficient, but it works:

> coords.df = data.frame(
 seq=gsub("(.*)([+-]):(\\d+)-(\\d+)", "\\1", row.names(m), perl=T),
 strand=gsub("(.*)([+-]):(\\d+)-(\\d+)", "\\2", row.names(m), perl=T),
 start=as.integer(gsub("(.*)([+-]):(\\d+)-(\\d+)", "\\3", row.names(m), perl=T)),
 end=as.integer(gsub("(.*)([+-]):(\\d+)-(\\d+)", "\\4", row.names(m), perl=T)))
> coords.df
         seq strand  start    end
1 chromosome      + 157470 158370
2 chromosome      + 158370 158450
3 chromosome      + 158450 158510
4 chromosome      + 158510 159330
5 chromosome      - 157460 158560
6 chromosome      - 158560 158920

It seems strange that R doesn't have a more direct way of accomplishing this. I'm not an R expert, so maybe it's there and I'm missing it. I guess it's not called the R project for string processing, but still... By the way, if you're ever tempted to name a project with a single letter, consider the poor schmucks trying to google for help.

Thursday, March 12, 2009

Split split

Apparently, there's some disagreement about what it means to split a string into substrings. Biological data frequently comes in good old fashioned tab-delimited text files. That's OK 'cause they're easily parsed in the language and platform of your choice. Most languages with any pretension of string processing offer a split function. So, you read the files line-by-line and split each line on the tab character to get an array of fields.

The disagreement comes about when there are empty fields. Since we're talking text files, there's no saying, "NOT NULL", so it's my presumption that empty fields are possible. Consider the following JUnit test.

import org.apache.log4j.Logger;
import static org.junit.Assert.*;
import org.junit.Test;

public class TestSplit {
  private static final Logger log = Logger.getLogger("unit-test");

  @Test
  public void test1() {
    String[] fields = "foo\t\t\t\t\t\t\tbar".split("\t");
    log.info("fields.length = " + fields.length);
    assertEquals(fields.length, 8);
  }

  @Test
  public void test2() {
    // 7 tabs
    String[] fields = "\t\t\t\t\t\t\t".split("\t");
    log.info("fields.length = " + fields.length);
    assertEquals(fields.length, 8);
  }
}

The first test works. You end up with 8 fields, of which the middle 6 are empty. The second test fails. You get an empty array. I expected this to return an array of 8 empty strings. Java's mutant cousin, Javascript, gets this right, as does Python.

Rhino 1.6 release 5 2006 11 18
js> a = "\t\t\t\t\t\t\t";
js> fields = a.split("\t")
,,,,,,,
js> fields.length
8
Python 2.5.1 (r251:54863, Jan 17 2008, 19:35:17)
>>> a = "\t\t\t\t\t\t\t"
>>> fields = a.split("\t")
['', '', '', '', '', '', '', '']
>>> len(fields)
8

Oddly enough, Ruby agrees with Java, as does Perl.

>> str = "\t\t\t\t\t\t\t"
=> "\t\t\t\t\t\t\t"
>> fields = str.split("\t")
=> []
>> fields.length
=> 0

My perl is way rusty, so sue me. But I think this is more or less it:

$str = "\t\t\t\t\t\t\t";
@fields = split(/\t/, $str);
print("fields = [@fields]\n");
$len = @fields;
print("length = $len\n");

Which yields:

fields = []
length = 0

How totally annoying!

Tuesday, December 09, 2008

Bioinformatics as a Queryable Knowledge Map: the Pygr Project

Pygr is a hypergraph database in Python with applications in bioinformatics written by Christopher Lee, a faculty member at UCLA. There's a 30 minute video of talk about Pygr and a bunch of other resources on the Lee Lab website and Lee's thinking bioinformatics blog.

Thesis: Hypergraphs are a general model for bioinformatics and Python’s core models are already a good model of Bioinformatics Data
  • Sequence: protein and nucleic acid sequences 
  • Mapping / Graphs: alignment, annotation 
  • Attributes: schema, i.e. relations between data 
  • Namespace (import): the ontology of all bioinformatics data 
Pygr aims to show that these Pythonic patterns are a general and scalable solution for bioinformatics.

The general idea is not entirely different from the data types behind Gaggle, especially in the emphasis on basic data structures without a heavy semantic component.

Dr. Lee is also writing a textbook on probabilistic inference.