
Friday, February 19, 2016

How to setup an IPython parallel cluster in your LAN via SSH

It has been a long hiatus since I last posted anything here, so it is time to get back to it.

Today I will describe the following scenario: you have two or more machines (Linux boxes or OS X) available in your LAN, and you would like to harness the power of their CPUs to perform parallel computing with Python. This is possible with IPython parallel, and there are several ways to accomplish it.

I will describe the steps required to configure a private IPython parallel cluster in your LAN using SSH. If everything works well, this should take about 30 minutes to one hour, depending on your level of experience.

1. Command to create an IPython parallel profile:


ipython profile create --parallel --profile=ssh


2. Edit the config file .ipython/profile_ssh/ipcluster_config.py in your home directory.


Specify the number of hosts and cores to be used:

c.SSHEngineSetLauncher.engines = {
 'macnemmen' : 4,
 'pcnemmen' : 8,
 'pcraniere' : 4,
}
where you specify the appropriate names of the machines in your LAN.

Specify the IP of the controller (the main machine you use to launch ipcluster):

c.LocalControllerLauncher.controller_args = ["--ip=xx.xxx.x.xxx"]
where you make sure the IP is correct.

3. Make sure python, jupyter, ipython and ipyparallel are installed in each computer.


In my case, I use the Anaconda distribution in all machines.

4. Setup SSH


Create a .ssh/config file such that all hosts and corresponding usernames that will run the servers are conveniently referenced.
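For example, the entries below sketch what ~/.ssh/config on the controller could look like. The host aliases match the hypothetical machine names used in the engine configuration above; the HostName values and usernames are placeholders you must adapt to your LAN:

```
# ~/.ssh/config on the controller machine
Host macnemmen
    HostName macnemmen.local
    User youruser

Host pcnemmen
    HostName 10.0.0.12
    User youruser
```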

Setup passwordless SSH login in each client machine from the controller machine.
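One common way of setting this up is sketched below (run on the controller, using the host aliases from your ~/.ssh/config). These commands assume OpenSSH is installed on all machines:

```
# generate a key pair if you do not have one yet (accept the defaults)
ssh-keygen -t rsa
# copy your public key to each client host (you will be asked the password once)
ssh-copy-id macnemmen
ssh-copy-id pcnemmen
# test: this should log in without asking for a password
ssh macnemmen true
```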

5. Create a common alias in each client host pointing to the engine launcher binary: 


This prevents the clients from failing to find the binary when it is installed in a nonstandard location; e.g., create the alias at /opt/ipengine.
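For instance, if ipengine lives inside an Anaconda install (the source path below is a guess; check the real location with `which ipengine` on each client):

```
# run on each client host; adjust the source path as needed
sudo ln -s /home/youruser/anaconda/bin/ipengine /opt/ipengine
```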

6. Edit the config file .ipython/profile_ssh/ipcluster_config.py so that it points to the launcher alias


c.SSHEngineSetLauncher.engine_cmd = ['/opt/ipengine']


7. Launch the engines in all machines in your "cluster" 


ipcluster start --profile='ssh' --debug


Testing the engines


Now it is time to test if the engines were launched successfully.

Test that they are active in IPython:

import ipyparallel
c=ipyparallel.Client(profile='ssh')
c.ids

The output of the last command should be a list with the number of elements matching the number of engines you launched. Otherwise, something went wrong.

Don't forget that the configuration files are located in .ipython/profile_ssh.

To learn how to use such a cluster in practice for CPU-intensive tasks, you can read this tutorial.



Thursday, October 16, 2014

Parallel computing with IPython and Python

I uploaded to github a quick tutorial on how to parallelize easy computing tasks. I have chosen embarrassingly parallel examples which illustrate some of the powerful features of IPython.parallel and the multiprocessing module.

Examples included:

  1. Parallel function mapping to a list of arguments (multiprocessing module)
  2. Parallel execution of array function (scatter/gather) + parallel execution of scripts
  3. Easy parallel Monte Carlo (parallel magics)
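To give a flavor of the first example (parallel function mapping with the multiprocessing module), here is a minimal sketch. Note that the 'fork' start method used below is Unix-only:

```python
import multiprocessing

def square(x):
    """The function to be mapped in parallel over a list of arguments."""
    return x * x

# 'fork' is Unix-only; on Windows use the default start method and
# wrap the Pool code in an `if __name__ == "__main__":` guard.
ctx = multiprocessing.get_context('fork')
pool = ctx.Pool(2)                       # pool of 2 worker processes
result = pool.map(square, [1, 2, 3, 4])  # parallel equivalent of map()
pool.close()
pool.join()
print(result)                            # [1, 4, 9, 16]
```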


Parallel computing with Python. 

Wednesday, October 8, 2014

Python Installation instructions (including IPython / IPython Notebook)

This page describes how to install Python and the other packages (Numpy, Scipy, IPython, IPython Notebook, Matplotlib) required for the course for Mac OS X, Linux and Windows.

Linux

In Linux, the installation instructions are pretty straightforward. Assuming that you are running Debian or Ubuntu, you just need to execute the following command in the terminal:

sudo apt-get install python-numpy python-scipy python-matplotlib ipython-notebook

For Fedora users, you can use the yum tool.

Mac OS X, Linux, Windows

We recommend downloading and installing the Anaconda Python distribution. The installation instructions are available here.

Just download the installer and execute it with bash.

Anaconda includes most of the packages we will use and it is pretty easy to install additional packages if required, using the conda or pip command-line tools.
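For example (the package names below are just illustrative):

```
conda install astropy    # for packages available in the conda repositories
pip install requests     # for packages conda does not carry
```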


If the above two methods do not work for OS X

The MacPorts way

You can try installing everything using MacPorts. First download and install macports and then issue the following command in a terminal:

sudo port install py27-zmq py27-tornado py27-nose

The above dependencies are required in order to run IPython notebook. Then run:

sudo port install py27-numpy py27-matplotlib py27-scipy py27-ipython

The advantage of this method is that it is easy to do. The downsides:

  • It can take a couple of hours to finish the installation depending on your machine and internet connection, since macports downloads and compiles everything as it goes.
  • If you like having bleeding-edge versions, note that it can take a while for them to be released on macports.
  • Finally, macports can create conflicts between different python interpreters installed on your system.

Using Apple’s Python interpreter and pip

If you feel adventurous, you can use Apple’s built-in python interpreter and install everything using pip. Please follow the instructions described in this blog.

If you run into trouble

Leave a comment here with the issue you found.

Thursday, January 16, 2014

Video tutorial: Bayesian data analysis with PyMC 3

I highly recommend the tutorial by Thomas Wiecki on using PyMC 3 to perform Bayesian data analysis. Some of the cool things he demonstrates in this ~50 min video:

  • summary of Bayesian analysis and Bayesian theorem
  • application for parameter estimation for a few introductory examples: coin-flipping experiment, simple linear regression
  • new features of PyMC 3 with respect to v2
  • how to construct a model in PyMC 3 and a few notes on samplers
  • more advanced example: application to time series of correlated stocks, then linear regression of correlated stocks with a time-dependent slope (!!)

Saturday, July 6, 2013

Parallel Bayes with IPython

Excellent tutorial from the folks at Continuum Analytics/Wakari on how to parallelize bayesian parameter estimation using IPython (IPcluster) and pyMC.

Following this tutorial, I was able to make my bayesian spectral fitting code (for astronomical purposes) parallel -- and much faster -- very quickly.

Bayesian Estimation with PyMC and IPCluster


Thursday, May 30, 2013

Installation instructions (from a Python Boot Camp course)

I wrote some instructions on how to install python and relevant dependencies (numpy, matplotlib, ipython, ipython notebook etc) for OS X, linux and windows. These instructions were written for an upcoming Python Boot Camp at my workplace, aimed at students, researchers and engineers (mostly in the Earth and physical sciences).

I wanted to share these instructions since they may be useful to more people. If you have been keeping up with this blog, note that there is some repetition here.


Python Installation Instructions

This page describes how to install Python and the other packages (Numpy, Scipy, IPython, Matplotlib) required for the course for Mac OS X, Linux and Windows.

Linux

In Linux, the installation instructions are pretty straightforward. Assuming that you are running Debian or Ubuntu, you just need to execute the following command in the terminal:

sudo apt-get install python-numpy python-scipy python-matplotlib ipython-notebook

For Fedora users, you can use the yum tool.

Mac OS X, Windows

If you are affiliated with an academic institution

Then the easiest way to install Python and the other packages is to request an academic license and download the Enthought Canopy Python distribution. Enthought includes all the packages we will use during the course.

The installation instructions are available here; the page has installers for Mac OS X and Windows.

Note that you need the academic license in order to install the 64-bit (recommended) version. The 32-bit version is free for all.

If you are not affiliated with an university

This is the case, for example, for GSFC employees and many GSFC postdocs. In this case, we recommend downloading and installing the Anaconda Python distribution. The installation instructions are available here.

Just download the installer and execute it.

Anaconda includes most of the packages we will use and it is pretty easy to install additional packages if required.

If the above two methods do not work for OS X

The MacPorts way

You can try installing everything using MacPorts. First download and install macports and then issue the following command in a terminal:

sudo port install py27-zmq py27-tornado py27-nose

The above dependencies are required in order to run IPython notebook. Then run:

sudo port install py27-numpy py27-matplotlib py27-scipy py27-ipython

The advantage of this method is that it is easy to do. The downsides:

  • It can take a couple of hours to finish the installation depending on your machine and internet connection, since macports downloads and compiles everything as it goes.
  • If you like having bleeding-edge versions, note that it can take a while for them to be released on macports.
  • Finally, macports can create conflicts between different python interpreters installed on your system.

Using Apple’s Python interpreter and pip

If you feel adventurous, you can use Apple’s built-in python interpreter and install everything using pip. Please follow the instructions described in this blog.

If you run into trouble

Feel free to contact us. Leave a comment here with the issue you found.

Thursday, May 23, 2013

Easy way of installing Python and scientific packages for OS X (non-academic users): Anaconda

Introduction

Mac OS X users know that getting python and its scientific packages (numpy, scipy, ipython etc) installed properly can be a tricky job.

If you are affiliated with a university, you can try the freely available and easy to install Enthought distribution (if you are not affiliated with a university, you can still get the Enthought distribution for free but you are stuck with 32-bit binaries).

One way of getting things installed properly (even if you are not affiliated with a university) is by using the native OS X python interpreter and installing everything via pip as I describe in this post. This method is a little tricky but works well for me. One advantage is that it is easy to install new packages in this way: just use pip.

Another way is using macports. This is relatively straightforward, but it can take a long time to compile all the dependencies (hours) and, worst of all, can potentially create conflicts between different libraries and different python interpreters.

Anaconda

I recently came across an easy way of getting python and the scientific packages installed. It also provides a convenient framework for installing extra packages. This is the Anaconda python distribution provided by Continuum Analytics. Anaconda provides the most popular packages: numpy, scipy, ipython (+notebook) and even more - astropy, spyder, pandas etc. And 64-bit binaries! For everybody. For free.

How to install it? 

Pretty easy. Go to the downloads page. Download the installer for your operating system. After downloading the file, for OS X you just need to issue the command

 sh <downloaded file.sh>  

Piece of cake. You can choose to let the installer change your PATH variable. After that, when you invoke python or ipython, it will automatically call the appropriate binaries and you will have available all the important packages.

To install additional packages, you can use their conda package manager or the usual pip.

Monday, January 14, 2013

Nice matplotlib tutorial

While I was searching for a solution on how to plot correctly the ticks in the x axis of a plot, I came across this very nice and clear matplotlib tutorial.

Definitely check it out.


Thursday, July 19, 2012

How to switch from IDL to Python

There is a new DIY tutorial on how to switch from IDL to Python hosted by AstroBetter. It is especially useful if you are an IDL programmer and want to grasp the basic concepts of python.

Go check it out.

Tuesday, June 12, 2012

Parallel computing in Python for the masses


Case scenario: you wrote a python routine that does some kind of time-consuming computation. Then you think, wow, my computer has N cores but my program is using only one of them at a time. What a waste of computing resources. Is there a reasonably easy way of modifying my code to make it exploit all the cores of my multicore machine?

The answer is yes and there are different ways of doing it. It depends on how complex your code is and which method you choose to parallelize your computation.

I will talk here about one relatively easy way of speeding up your code using the multiprocessing python package. I should mention that there are many other options out there but the multiprocessing package comes pre-installed with any python distribution by default.

I am assuming that you really need to make your code parallel. You will have to stop and spend time thinking about how to break your computation in smaller parts that will be sent to the different cores. And I should mention that debugging is harder for parallel code compared to serial code, obviously.

Parallelization is one way of optimizing your code. Other ideas for optimizing your code are using Cython or f2py. Both approaches can yield >10x speedups and are worth exploring depending on your situation. But both involve using the C or Fortran languages along with your python code.

The ideal case is when your problem is "embarrassingly parallel". What I mean by this is: your problem can be made parallel in a reasonably easy way, since the computations which correspond to the bottleneck of the code can be carried out independently and do not need to communicate with each other. Examples:

  • You have a "grid" of parameters that you need to pass to a time-consuming model (e.g., a 1000x1000 matrix with the values of two parameters). Your model needs to evaluate those parameters and provide some output.
  • Your code performs a Monte Carlo simulation with 100000 trials which are carried out in a loop. You can then easily "dismember" this loop and send it to be computed independently by the cores in your machine.

Instead of giving code examples myself, I will point out the material I used to learn parallelization. I learned the basics of parallel programming by reading the excellent tutorial "introduction to parallel programming" written by Blaise Barney.

The next step was learning how to use the multiprocessing package. I learned this with the examples posted in the AstroBetter blog. I began by reading the example implemented with the pprocess package. The caveat here is that 'pprocess' is a non-standard package. The multiprocessing package which comes with python should be used instead. Somebody posted the original example discussed in the blog ported to the multiprocessing package.

As the posts above explain, the basic idea behind using 'multiprocessing' is to use the parallel map method to evaluate your time-consuming function using the many cores in your machine. Once you figure out a way of expressing your calculation in terms of the 'map' method, the rest is easy.

In my experience doing parallel programming in python using 'multiprocessing' I learned a few things which I want to share:

  1. Do not forget to close the parallel engine with the close() method after your computation is done! If you do not do this, you will end up leaving a lot of orphan processes which can quickly consume the available memory in your machine.
  2. Avoid using lambda functions when passing arguments to the parallel 'map' at all costs! Trust me, multiprocessing does not play well with lambda constructs.
  3. Finally, as I mentioned before, parallelizing a code increases development time and the complexity of debugging. Only resort to parallelization if you really need it, i.e. if you expect a big speedup in your code execution. For example, if your code takes 24 hours to execute and you can get a 6x speedup with 'multiprocessing', the execution time drops to 4 hours, which is not bad.
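To illustrate points 1 and 2 with a toy example (a named top-level function instead of a lambda, and an explicit close() when the work is done), here is a sketch of a parallel Monte Carlo estimate of pi. The 'fork' start method used below is Unix-only:

```python
import multiprocessing
import random

def count_hits(n):
    """Count how many of n random points fall inside the unit quarter circle."""
    random.seed()  # reseed each worker; forked children inherit the parent state
    hits = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x*x + y*y <= 1.0:
            hits += 1
    return hits

# 'fork' is Unix-only; on Windows use the default start method plus an
# `if __name__ == "__main__":` guard around the Pool code.
ctx = multiprocessing.get_context('fork')
pool = ctx.Pool()
# 100000 Monte Carlo trials split into 4 chunks, one chunk per worker;
# a named function is passed to map, never a lambda (point 2)
hits = pool.map(count_hits, [25000] * 4)
pool.close()   # point 1: always close the pool when the computation is done
pool.join()
pi_estimate = 4.0 * sum(hits) / 100000
print(pi_estimate)
```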

Thursday, April 26, 2012

Using git to manage source code and more

Recently I learned how to use Git to manage source code (thanks to this guy). Let me tell you, it is such a fantastic tool! Especially when you have thousands of lines of source code constantly evolving and you need to keep track of what changes.

In my case, I have been using it to manage the source code I wrote for my different scientific projects. And I will soon begin using it even to manage the writing of one paper.

Let me list the tutorials that I read and have been very useful in getting me started quickly:

  • Git Magic: I began learning git with this one. It goes straight to the point and illustrates the most important commands.
  • Pro Git: need more detailed information and have more time to spend learning git? Have a look at this one.

Quick reference for git:

I use SourceTree, a GUI on Mac, to check the evolution of the source code.



Changelog
May 24th 2012: replaced suggestion of GUI GitX -> SourceTree.

Wednesday, March 7, 2012

How to install a scientific Python environment on Mac OS X Lion

My Mac OS X workstation was just updated to Lion and I had to reinstall Python and associated scientific tools for plotting, statistics etc. What a pain.

After many trial-and-error procedures I finally found a way to get a scientific Python environment (Python + Scipy + iPython + Numpy + matplotlib) working correctly on Mac OS X Lion. I am reporting the steps I carried out, hoping that they will help other people.

You will need a Python installation (in my example I use the one that comes by default with OS X), gfortran and Xcode. Here are the steps:
  1. Install the requirements: Xcode (via the App Store), Python (which comes pre-installed with OS X), gfortran (via MacPorts), virtualenv, and the additional libraries required by matplotlib
  2. Create a python environment with virtualenv
  3. Install Numpy, Scipy, matplotlib, ipython with pip. Install readline with easy_install
  4. Create an alias in .profile or .bash_profile (depending on your shell) to run ipython
After these steps are completed, you will get a working Python environment for scientific analysis, visualization and statistics with Mac OS X Lion. 

Requirements
  1. Xcode
  2. Python 2.7, which comes pre-installed by default with OS X
  3. gfortran
  4. virtualenv
  5. additional libraries required by matplotlib (optional)

1. How to get the requirements working

gfortran

In my case, I installed it by installing MacPorts and installing GCC which comes with gfortran:

 sudo port install gcc44  

To make gfortran visible to the system I created an alias in /usr/local/bin:

 cd /usr/local/bin/  
 sudo ln -s /opt/local/bin/gfortran-mp-4.4 gfortran  

virtualenv

I went to the web page that hosts virtualenv and downloaded virtualenv.py. You will use virtualenv.py below.


Additional libraries required by matplotlib (optional)

I use the graphical backend TkAgg, which requires the following additional libraries for matplotlib to work: tk, freetype, libpng. I installed them using macports:

sudo port install tk
sudo port install freetype
sudo port install libpng


2. Create a python environment with virtualenv

Create a directory stdpy (in my example) somewhere and issue the command

 /usr/bin/python virtualenv.py stdpy  

to create an isolated python environment based on the python provided by default with Mac OS X. This avoids trouble with mixing libraries. Activate the environment by running

 source stdpy/bin/activate  

You should now see a (stdpy) prefix in your terminal prompt.

3. Install Numpy, Scipy, matplotlib, ipython with pip and readline with easy_install

After activating the python environment, let's proceed and install the additional modules with pip and easy_install:

 pip install numpy  
 pip install scipy  
 pip install matplotlib  
 pip install ipython  
 easy_install readline  

You may need to install additional libraries in order to get matplotlib compiled, depending on the kind of graphical backend that you choose. In my case, I use TkAgg which depends on Tk, freetype and libpng libraries which I installed via macports.

4. Create an alias in .profile or .bash_profile (depending on your shell) to run python

In my case I use Bash and I added the following line to the file .bash_profile in my home directory:

 alias ipy='source ~/stdpy/bin/activate && ipython --pylab'   

Now, when I open the terminal and issue the command

 ipy  

it will automatically activate the python environment and run ipython.





Changelog:

  • Aug. 18th 2012: added instructions about additional libraries in matplotlib
  • Sep. 1st 2012: made explanation about matplotlib dependencies clearer (hopefully)

Wednesday, February 8, 2012

How to begin learning Python


Many (perhaps most) people who want to learn Python get confused by the overwhelming number of reference sources available. Where to start? So many options!

Motivated by this, I list in this post the references that I used to learn Python (and object-oriented programming as well), which can serve as a starting point for other people. This post is biased towards the scientists interested in learning Python.

Beginner material

I learned the basic syntax and capabilities of the language with the official Python tutorial. You can download all of it as PDF files. I suggest this for people with previous programming experience. For absolute beginners, have a look at the Think Python book below.

Introductory lecture about Python, its syntax and science applications. It shows what Python is capable of for data analysis and plotting. Inspiring. The audio is also available for download as a MP3 file.

Tutorial on using Python for data analysis! How to replace IDL/Matlab with Python. Includes: plotting, FITS files, signal processing.

I learned object-oriented programming using this material. Very clear and "application-oriented" approach. You don't need to be a biologist to understand this.

Longer introduction for people with no previous extensive programming experience.

Quick reference

This is a cheat sheet with the basic commands needed for data analysis, array processing and plotting.

Migrating from IDL/Matlab to Python.

If you are going to do serious stuff with Python, I suggest using the enhanced interactive Python terminal IPython.


Note: this post is a revised version of the text originally posted here.

Friday, December 16, 2011

Script illustrating how to do a linear regression and plot the confidence band of the fit

The script below illustrates how to carry out a simple linear regression of data stored in an ASCII file, plot the linear fit and the 2 sigma confidence band.

The script invokes the confband function described here to plot the confidence bands. confband is assumed to live inside the "nemmen" module (not yet publicly available, sorry) but you can place it in any module you want.

I got the test data here to perform the fitting.

After running the script you should get a plot where the best-fit line is displayed in green and the confidence band appears as a gray shaded area.

The script below is also available at Github Gist.

   
 import numpy, pylab
 import scipy.stats  # importing scipy alone does not expose scipy.stats
 import nemmen

 # Data taken from http://orion.math.iastate.edu/burkardt/data/regression/x01.txt.
 # I removed the header from the file and left only the data in 'testdata.dat'.
 xdata, ydata = numpy.loadtxt('testdata.dat', unpack=True, usecols=(1,2))
 xdata = numpy.log10(xdata)  # take the logs
 ydata = numpy.log10(ydata)

 # Linear fit
 a, b, r, p, err = scipy.stats.linregress(xdata, ydata)

 # Generate arrays with the fit
 x = numpy.linspace(xdata.min(), xdata.max(), 100)
 y = a*x + b

 # Calculate the 2 sigma confidence band contours for the fit
 lcb, ucb, xcb = nemmen.confband(xdata, ydata, a, b, conf=0.95)

 # Plot the fit and data
 pylab.clf()
 pylab.plot(xdata, ydata, 'o')
 pylab.plot(x, y)

 # Plot the confidence band as a shaded area
 pylab.fill_between(xcb, lcb, ucb, alpha=0.3, facecolor='gray')

 pylab.show()

Sunday, August 21, 2011

References for learning Python

I list in this post the reading material that I used to learn Python. When I started learning it I had considerable previous experience with several programming languages. So the reading list below is targeted at people with some experience. In addition, my goal was to apply Python to science and data analysis.

If you don't have previous programming experience, one suggested starting point is the book Think Python: How to think like a computer scientist. It is available for free.

I learned the basic Python syntax with the official tutorial at http://docs.python.org/.

Fabricio Ferrari's lecture (in portuguese) illustrates what python is capable of for scientific data analysis and plotting. Inspiring.

If you don't learn object-oriented programming (OOP) you will not be able to explore the full potential of Python. I learned OOP from Introduction to Programming using Python, Programming Course for Biologists at the Pasteur Institute, by Schuerer et al. Very clear exposition.

Tutorial on using Python for astronomical data analysis. How to plot spectra, read and manipulate FITS files and much more. Highly recommended.

Monday, August 15, 2011

How to install python, ipython, numpy, scipy etc on Mac OS X Lion: The MacPorts way

Note added on March 7th 2012: This tutorial is deprecated. Please refer instead to this updated tutorial.

I realized that by using MacPorts to install Python, as described in this tutorial, I am mixing libraries installed via MacPorts and the ones installed with easy_install which can lead to horrible incompatibility issues.

The updated tutorial describes how to get a working Scipy + Numpy + iPython + matplotlib installation using the built-in OS X Python and pip/easy_install.