Liang Bo Wang (亮亮), 2014-05-16
From Taiwan R User Group, more info on Meetup.
CC BY 4.0
From a Python developer and R user
Not the case in Taiwan: we have Python talks at our R user group,
and PyCon APAC has even recognized us as well.
Some say we are R on the cover but Python under the skin (hmm...) Nope!
Some of the above are myths about R. Please read R: the Good Parts.
Statistics is about
The Vagrantfile in the repo can set up an Ubuntu 14.04 VM with the above packages.
Distributions live under scipy.stats. Take the normal distribution as an example: it is norm. More on the SciPy stats tutorial.
import numpy as np
from scipy.stats import norm, binom

rv = norm(loc=0, scale=1) # fix the shape of the normal dist.
rv.cdf(0)        # 0.5 = Pr(X ≤ 0)
rv.pdf(-np.inf)  # 0 = f(X = -inf)
rv.rvs(size=10)  # generate 10 random variates
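binom is imported above but not used; a parallel sketch for a discrete distribution works the same way (the parameter values here are just illustrative):

```python
from scipy.stats import binom

rv = binom(n=10, p=0.5)  # fix the shape: 10 trials, success prob 0.5
rv.pmf(5)                # Pr(X = 5) ≈ 0.2461 (discrete: pmf, not pdf)
rv.cdf(10)               # 1.0 = Pr(X ≤ 10)
rv.rvs(size=10)          # generate 10 random variates
```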
Patsy provides a powerful R-like formula syntax and supports data transformation via custom functions.
from patsy import dmatrices, dmatrix, demo_data
data = demo_data("a", "x1", "x2", "y", min_rows=20)
# generate a dict-like structure
{'a': ['a1', 'a2'],
'x2': array([ 0.97873798, 2.2408932 ]),
'x1': array([ 1.76405235, 0.40015721]),
'y': array([ 1.86755799, -0.97727788])}
# Y depends on x1 and interaction between a and x2
In [2]: dmatrices("y ~ x1 + a:x2", data=data) # syntax similar to R
Out[2]: # specify LHS = RHS two DesignMatrices
(DesignMatrix with shape (2, 1)
y
1.86756
-0.97728
Terms:
'y' (column 0),
DesignMatrix with shape (2, 4)
Intercept x1 a[a1]:x2 a[a2]:x2
1 1.76405 0.97874 0.00000
1 0.40016 0.00000 2.24089
Terms:
'Intercept' (column 0)
'x1' (column 1)
'a:x2' (columns 2:4))
In [3]:
dmatrix("I(x1 * x2) + np.log(x2 + 10)", data)
Out[3]:
DesignMatrix with shape (2, 3)
Intercept I(x1 * x2) np.log(x2 + 10)
1 1.72655 2.39596
1 0.89671 2.50478
Terms:
'Intercept' (column 0)
'I(x1 * x2)' (column 1)
'np.log(x2 + 10)' (column 2)
More on patsy quickstart
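Patsy also ships stateful transforms such as center() and standardize(), which remember the statistics they compute. A small sketch (column names follow demo_data; the values are randomly generated):

```python
import numpy as np
from patsy import dmatrix, demo_data

data = demo_data("x1", "x2")  # two numeric columns of random values
# center() subtracts the column mean; standardize() also divides by std
mat = dmatrix("center(x1) + standardize(x2)", data)
np.asarray(mat)[:, 1].mean()  # ~0 after centering
```

Because the transforms are stateful, the means and stds learned from the training data are reused when building design matrices for new data.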
Like glm() and lm()
in R. To fit a generalized linear model (GLM), supply a formula and data; statsmodels uses patsy to construct the design matrices as shown previously. More on the statsmodels homepage.
import statsmodels.api as sm
import statsmodels.formula.api as smf
mod1 = smf.glm(
formula="y ~ x1 + x2", data=data,
family=sm.families.Gaussian(sm.families.links.identity)
).fit()
mod1.summary()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 20
Model: GLM Df Residuals: 17
Model Family: Gaussian Df Model: 2
Link Function: identity Scale: 0.862442022663
Method: IRLS Log-Likelihood: -25.274
Date: Sat, 17 May 2014 Deviance: 14.662
Time: 03:13:52 Pearson chi2: 14.7
No. Iterations: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -0.5220 0.250 -2.088 0.037 -1.012 -0.032
x1 0.2262 0.245 0.923 0.356 -0.254 0.707
x2 -0.0342 0.175 -0.195 0.846 -0.378 0.310
==============================================================================
These packages are under active development and are expected to gain further functionality. However, R still takes over when it comes to serious statistics (e.g., a stratified Cox proportional hazards model in survival analysis).
rpy2.robjects.r at a high level
from rpy2 import robjects as ro
pi_py = ro.r['pi'] # R's pi comes back as an R vector
# <FloatVector - Python:0x101974288 / R:0x7fa65157ba78>
# [3.141593]
pi_py[0] # 3.14159
pi_py + 1    # [3.14, 1] Python list-style concatenation
pi_py.ro + 1 # [4.14] R element-wise addition
res = ro.IntVector([1, 2, 3, 4]) # make an R integer vector
ro.r['sum'](res) # fetch R's sum() and call it
# <IntVector - Python:0x1072c3fc8 / R:0x7fa653269b88>
# [ 10]
Using R functions from packages
# rank(0, na.last = TRUE) from base package
from rpy2.robjects.packages import importr
base = importr('base')
base.rank(0, na_last=True) # replace '.' with '_'
base.rank(0, **{'na.last': True}) # or the **kwargs form
rank = ro.r['rank']
rank(0, na_last = True)
Pandas has its own rpy2 support
# get R datasets (e.g., iris) into a pandas DataFrame
import pandas.rpy.common as com
iris_df = com.load_data('iris')
type(iris_df) # pandas.core.frame.DataFrame
# Py DataFrame to R data.frame
r_iris = com.convert_to_r_dataframe(iris_df)
# From rpy2 to numpy
int_vec = ro.IntVector(list(range(10)))
int_np = np.array(int_vec) # conversion by copy
int_npnc = np.asarray(int_vec) # conversion by reference
int_np[0] = 123
int_vec.rx(1)[0] # 0
int_npnc[0] = 123
int_vec.rx(1)[0] # 123
IPython Integration
In [1]: %load_ext rpy2.ipython # ipython magic
In [2]: %R c(1, 2) + 1
Out[2]:
<FloatVector - Python:0x1031b9748 / R:0x7fa6532c37f0>
[2.000000, 3.000000]
In [3]: a = %R c(1, 2) + 1
In [4]: %%R
....: print(head(iris)) # print() shows intermediate output
....: rnorm(10)         # only the last expression is auto-printed
IPython Integration (cont'd)
In [5]: df = pd.DataFrame({'a': [1], 'b': [2]})
In [6]: %Rpush df # push Py obj, numpy.array also
In [7]: %%R
....: print(class(df)) # only the last expression is auto-printed
....: df
# [1] "data.frame"
#   a b
# 1 1 2
In [8]: df_r = %Rget df # get R obj back into Python
In this form of data, the outcome has two components: the time to event and whether the event was actually observed (the censoring status).
Commonly used in the medical industry and clinical trials (e.g., to determine the effect of a treatment on a disease).
Python has the lifelines package, but it lacks the flexibility to combine with a semi-parametric Cox regression model.
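The core idea behind such packages, the Kaplan-Meier estimator, is simple enough to sketch in plain Python. This is an illustration of the estimator itself with toy data, not the lifelines implementation:

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate: S(t) = prod over event times t_i <= t
    of (1 - d_i / n_i), with d_i events among n_i subjects at risk."""
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)  # everyone is at risk at t = 0
    surv = 1.0
    curve = []  # (event time, survival estimate) steps
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        # group all subjects with the same time t
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]  # 1 = event observed, 0 = censored
            removed += 1
            i += 1
        if deaths:  # censored-only times do not change the estimate
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed  # events and censored both leave the risk set
    return curve

# toy data: times, with 1 = event observed, 0 = censored
kaplan_meier([2, 4, 4, 5, 6, 6], [1, 1, 1, 1, 0, 0])
# steps at t = 2, 4, 5; S drops only at observed events
```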