 # Statistics in Python with R

Liang Bo Wang (亮亮), 2014-05-16

## Statistics in Python with R

Liang Bo Wang (亮亮)
CC 4.0 BY

## About Me

• Master's student at Bioinfo & Biostat Core Lab, NTU CGM
• R / Python
• Trying to learn how to speak DNA
• Taiwan R co-organizer
• PyCon APAC 2014/TW staff

## Today's Topic

• Why Python / R? Thoughts about data analysis and statistics
• Statistics in Python (overview)
• scipy.stats - probability distribution and utils (or d__, p__, r__ in R)
• patsy - model description (or formula() in R)
• statsmodels - statistical analysis (or lm(), glm() ... in R)
• rpy2 - R-powered Python
• Demo - survival analysis

## Today's Viewpoint

From a Python developer and R user

• Writes his own packages
• Knows roughly how Python works
• Though not all the implementation details
• Uses R for specific purposes
• Views R's internals as a black box
• Which is how most statisticians use R

## War between Python and R ... really?

Not the case in Taiwan. We get Python talks in the R user group, and PyCon APAC even recognized us as well. Some said we are R on the cover but Python under the skin (hmm...). Noooope!

## Why R?

• When it comes to serious statistics, people think of R, SAS, SPSS, and Stata. New methods (e.g., in JASA and Biometrika) often ship with an R package.
• R is designed for statistics. Python is learning from all R's good parts (pandas for R's data.frame; patsy for R's formula)

## Why Python (or maybe not R)?

• You should know the answer otherwise you wouldn't come here!
• R is hard to learn (more functional, multiple OO systems, ...)
• R is not designed for general tasks
• Python is easier to optimize (Cython, cffi, ...)
• Better at data preprocessing, networking, and many other aspects

## Statistics in Python

## Python Ecosystem for Scientific Computing

• Python 3.4 (give it a try!)
• Numpy 1.8, Scipy 0.13, Matplotlib 1.3.1
• pandas 0.13+
• IPython (Notebook) 2.0
• Cython 0.20
• (needed for this talk) patsy 0.2.1, statsmodels 0.6dev, rpy2 2.4dev

The Vagrantfile in the repo can set up an Ubuntu 14.04 VM with the above packages.

## Probability Distribution

Under scipy.stats. Take the normal distribution as an example: it is norm. More in the Scipy stats tutorial.

import numpy as np
from scipy.stats import norm, binom

rv = norm(loc=0, scale=1)  # fix the shape of the normal dist.
rv.cdf(0)        # 0.5 = Pr(X ≤ 0)
rv.pdf(-np.inf)  # 0.0 = f(-inf), the density vanishes in the tail
rv.rvs(size=10)  # draw 10 random samples
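For R users, the method names map directly onto R's d/p/q/r prefixes. A quick sketch of the correspondence (shown for norm, but it holds for any scipy.stats distribution):

```python
from scipy.stats import norm

# R's dnorm/pnorm/qnorm/rnorm correspond to scipy.stats methods:
# dnorm(x) -> norm.pdf(x)   density
# pnorm(q) -> norm.cdf(q)   cumulative probability
# qnorm(p) -> norm.ppf(p)   quantile (inverse CDF)
# rnorm(n) -> norm.rvs(n)   random draws

print(norm.pdf(0.0))    # dnorm(0)     ≈ 0.3989
print(norm.cdf(1.96))   # pnorm(1.96)  ≈ 0.975
print(norm.ppf(0.975))  # qnorm(0.975) ≈ 1.96
```

Note that discrete distributions (e.g. binom) use pmf instead of pdf, just as R uses dbinom for the mass function.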



## Model construction by patsy

Powerful R-like formula statements, with data transformations via arbitrary Python functions.

from patsy import dmatrices, dmatrix, demo_data
data = demo_data("a", "x1", "x2", "y", min_rows=20)
# generates a dict-like structure (output truncated below)
{'a': ['a1', 'a2'],
'x2': array([ 0.97873798,  2.2408932 ]),
'x1': array([ 1.76405235,  0.40015721]),
'y': array([ 1.86755799, -0.97727788])}

# Y depends on x1 and interaction between a and x2
In : dmatrices("y ~ x1 + a:x2", data=data)  # syntax similar to R
Out: # specify LHS = RHS two DesignMatrices
(DesignMatrix with shape (2, 1)
y
1.86756
-0.97728
Terms:
'y' (column 0),
DesignMatrix with shape (2, 4)
Intercept       x1  a[a1]:x2  a[a2]:x2
1  1.76405   0.97874   0.00000
1  0.40016   0.00000   2.24089
Terms:
'Intercept' (column 0)
'x1' (column 1)
'a:x2' (columns 2:4))

In : dmatrix("I(x1 * x2) + np.log(x2 + 10)", data)
Out:
DesignMatrix with shape (2, 3)
Intercept  I(x1 * x2)  np.log(x2 + 10)
1     1.72655          2.39596
1     0.89671          2.50478
Terms:
'Intercept' (column 0)
'I(x1 * x2)' (column 1)
'np.log(x2 + 10)' (column 2)


More on patsy quickstart
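Under the hood, a term like a:x2 expands the categorical a into indicator columns and multiplies each by x2. A minimal hand-rolled numpy sketch of that expansion (an illustration of the idea, not patsy's actual implementation):

```python
import numpy as np

a  = np.array(["a1", "a2", "a1", "a2"])  # categorical factor
x2 = np.array([0.5, 1.0, 1.5, 2.0])      # numeric covariate

# one-hot encode a, then multiply each indicator column by x2
levels = ["a1", "a2"]
onehot = np.stack([(a == lv).astype(float) for lv in levels], axis=1)
interaction = onehot * x2[:, None]  # columns a[a1]:x2 and a[a2]:x2

print(interaction)
# rows where a == "a1" keep x2 in the first column and zero in the second
```

This is why the design matrix above gains one a[level]:x2 column per level of a.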

## Model fitting by statsmodels

Like glm() and lm() in R. To run a Generalized Linear Model (GLM), supply the model's formula (parsed by patsy) and the data constructed previously. More on the statsmodels homepage.

import statsmodels.api as sm
import statsmodels.formula.api as smf
mod1 = smf.glm(formula="y ~ x1 + x2", data=data).fit()
mod1.summary()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                   20
Model:                            GLM   Df Residuals:                       17
Model Family:                Gaussian   Df Model:                            2
Method:                          IRLS   Log-Likelihood:                -25.274
Date:                Sat, 17 May 2014   Deviance:                       14.662
Time:                        03:13:52   Pearson chi2:                     14.7
No. Iterations:                     3
==============================================================================
coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.5220      0.250     -2.088      0.037        -1.012    -0.032
x1             0.2262      0.245      0.923      0.356        -0.254     0.707
x2            -0.0342      0.175     -0.195      0.846        -0.378     0.310
==============================================================================
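A Gaussian-family GLM with the identity link is ordinary least squares, so coefficients like those above can be reproduced with plain linear algebra. A minimal numpy sketch (using synthetic data, not the demo_data values):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 20
x1, x2 = rng.randn(n), rng.randn(n)
y = -0.5 + 0.2 * x1 + rng.randn(n) * 0.5

# design matrix with intercept, as patsy's "y ~ x1 + x2" would build it
X = np.column_stack([np.ones(n), x1, x2])

# least-squares solve: equivalent to the Gaussian GLM fit above
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # [intercept, x1, x2] estimates
```

statsmodels reports the same point estimates for this case; its added value is the inference machinery (standard errors, deviance, other families and links).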


## Python is also good at

• Heavy computation - Theano (GPU computation)
• Simulation - PyMC (MCMC simulation)
• Machine learning - scikit-learn, Orange (friendly vis interaction)

These packages are under active development and are expected to gain further functionality. However, R still takes over when it comes to serious statistics (e.g., a stratified Cox proportional hazards regression model in survival analysis).

## Statistics in R through RPy2

## Quick RPy2 Intro

• As one runs rpy2, it starts an embedded R process.
• Communicate with the R process through R's C API
• Provide different levels of abstraction
• Integrate with pandas and IPython
• Graphic passing is possible
• Control the embedded R via rpy2.robjects.r at a high level

from rpy2 import robjects as ro
pi_py = ro.r['pi']  # R's pi, wrapped as an rpy2 FloatVector
# <FloatVector - Python:0x101974288 / R:0x7fa65157ba78>
# [3.141593]
pi_py[0]      # 3.14159, extract the Python float
pi_py + 1     # [3.14, 1]: Python-level sequence concatenation
pi_py.ro + 1  # [4.14]: R vector addition
res = ro.IntVector([1, 2, 3, 4])  # make an R integer vector
ro.r['sum'](res)  # fetch and call R's sum() function
# <IntVector - Python:0x1072c3fc8 / R:0x7fa653269b88>
# [      10]

Use R functions from packages

# rank(0, na.last = TRUE) from the base package
from rpy2.robjects.packages import importr
base = importr('base')
base.rank(0, na_last=True)  # '.' in R argument names becomes '_'
base.rank(0, **{'na.last': True})  # or pass the literal name via **kwargs

rank = ro.r['rank']
rank(0, na_last=True)

pandas ships its own rpy2 conversion helpers

# get an R dataset (e.g. iris) into a pandas DataFrame
import pandas.rpy.common as com
iris_df = com.load_data('iris')
type(iris_df)  # pandas.core.frame.DataFrame

# pandas DataFrame to R data.frame
r_iris = com.convert_to_r_dataframe(iris_df)

# from rpy2 to numpy
int_vec = ro.IntVector(list(range(10)))

int_np = np.array(int_vec)      # conversion by copy
int_npnc = np.asarray(int_vec)  # conversion by reference (shared memory)

int_np[0] = 123
int_vec.rx(1)   # 0: the copy leaves the R vector untouched

int_npnc[0] = 123
int_vec.rx(1)   # 123: writing through the view changes the R vector
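The same copy-versus-view distinction can be seen with plain numpy, no R required: np.array always copies its input, while np.asarray returns the input unchanged (sharing its buffer) when it is already an ndarray:

```python
import numpy as np

src = np.arange(10)

copied = np.array(src)    # always makes a fresh copy
viewed = np.asarray(src)  # src is already an ndarray: no copy, shared buffer

copied[0] = 123
print(src[0])   # 0: the copy is independent of src

viewed[0] = 123
print(src[0])   # 123: the write goes through to src
```

This is why np.asarray(int_vec) is the zero-copy path between rpy2 vectors and numpy, and why writes through it mutate the underlying R data.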


IPython Integration

In : %load_ext rpy2.ipython  # ipython magic
In : %R c(1, 2) + 1
Out:
<FloatVector - Python:0x1031b9748 / R:0x7fa6532c37f0>
[2.000000, 3.000000]
In : a = %R c(1, 2) + 1
In : %%R
....: print(head(iris))  # only show last output
....: rnorm(10)

IPython Integration (cont'd)

In : df = pd.DataFrame({'a': [1], 'b': [2]})
In : %Rpush df  # push a Python object; numpy.array works too
In : %%R
....: print(class(df))  # only the last output is auto-shown
....: df
#  "data.frame"
#   a b
# 1 1 2
In : df_r = %Rget df   # pull an R object back

## Demo: Survival Analysis

## What's Survival Analysis

Survival data has an outcome with two components:

• Event outcome (whether an event occurs)
• Event time (how long it takes for the event to occur)

It is commonly used in medical research and clinical trials (e.g., to determine the effect of a treatment on some disease).
Python has the package lifelines, but it lacks the flexibility to combine with semi-parametric Cox regression models.
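To make "event outcome + event time" concrete, here is a minimal hand-rolled Kaplan-Meier estimator in numpy (an illustration of the product-limit idea, not a substitute for R's survival package or lifelines):

```python
import numpy as np

def kaplan_meier(times, events):
    """Survival probability after each distinct event time.

    times  : duration until the event or until censoring
    events : 1 if the event occurred, 0 if the subject was censored
    """
    times, events = np.asarray(times), np.asarray(events)
    surv = 1.0
    steps = []
    for t in np.unique(times[events == 1]):
        at_risk = np.sum(times >= t)               # subjects still in the study
        died = np.sum((times == t) & (events == 1))
        surv *= 1.0 - died / at_risk               # product-limit step
        steps.append((t, surv))
    return steps

# 3 subjects: events at t=1 and t=2, one censored at t=3
print(kaplan_meier([1, 2, 3], [1, 1, 0]))
# S(1) ≈ 0.667, S(2) ≈ 0.333; the censored subject only affects the risk set
```

The Cox model mentioned above goes further, relating covariates to the hazard rate while leaving the baseline hazard unspecified, which is where R's tooling is hard to match.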

## Summary

• Python + R is a good combination: you get both the power of statistics and the power of computation.
• RPy2 provides a solid R binding that plays well with the Python scientific ecosystem.