Statistics in Python with R

Liang Bo Wang (亮亮), 2014-05-16

From Taiwan R User Group, more info on Meetup.

PyCon APAC 2014/TW, 2014-05-16

CC BY 4.0


About Me

Today's Topic

Today's Viewpoint

From a Python developer and R user

War between Python and R ... really?

Not the case in Taiwan. We have Python talks in our R user group!

and PyCon APAC has even recognized us as well

Some say we are R on the cover but Python under the skin (hmm...). Nooope!

Why R?

It's hard to ignore the wealth of statistical packages available in R/CRAN ...

Python as a statistics workbench, Stack Exchange

Why Python (or maybe not R)?

Some of the points above are myths about R. Please read R: the Good Parts.

Before we start talking about statistics

Statistics is about

Essentially, all models are wrong, but some are useful.

George E. P. Box (1987)

Statistics
in Python

Python Ecosystem for Scientific Computing

The Vagrantfile in the repo can set up an Ubuntu 14.04 VM with the above packages.

Probability Distribution

Distributions live under scipy.stats. Take the normal distribution as an example: it is norm. More in the SciPy stats tutorial.

import numpy as np
from scipy.stats import norm, binom

rv = norm(loc=0, scale=1)  # fix the shape of a Normal dist. (mean 0, sd 1)
rv.cdf(0)        # 0.5 = Pr(X ≤ 0)
rv.pdf(-np.inf)  # 0 = f(-inf), the density vanishes at -infinity
rv.rvs(size=10)  # generate 10 random variates

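The same frozen-distribution API works for discrete distributions such as the binom imported above; a minimal sketch (the n and p values here are only an illustration):

from scipy.stats import binom

rv = binom(n=10, p=0.5)  # fix the shape of a Binomial dist. (10 fair coin flips)
rv.pmf(5)                # Pr(X = 5), the probability mass function
rv.cdf(5)                # Pr(X ≤ 5)
rv.ppf(0.5)              # median, via the inverse CDF
rv.rvs(size=10)          # generate 10 random variates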

Model construction by patsy

A powerful R-like formula syntax. Data transformations can be applied with custom functions inside the formula.

from patsy import dmatrices, dmatrix, demo_data
data = demo_data("a", "x1", "x2", "y", min_rows=20)
# generate a dict-like structure
{'a': ['a1', 'a2'],
 'x2': array([ 0.97873798,  2.2408932 ]),
 'x1': array([ 1.76405235,  0.40015721]),
 'y': array([ 1.86755799, -0.97727788])}
# y depends on x1 and the interaction between a and x2
In [2]: dmatrices("y ~ x1 + a:x2", data=data)  # syntax similar to R
Out[2]: # returns two DesignMatrices: LHS (y) and RHS
(DesignMatrix with shape (2, 1)
          y
    1.86756
   -0.97728
   Terms:
     'y' (column 0),
 DesignMatrix with shape (2, 4)
   Intercept       x1  a[a1]:x2  a[a2]:x2
           1  1.76405   0.97874   0.00000
           1  0.40016   0.00000   2.24089
   Terms:
     'Intercept' (column 0)
     'x1' (column 1)
     'a:x2' (columns 2:4))
In [3]:
dmatrix("I(x1 * x2) + np.log(x2 + 10)", data)
Out[3]:
DesignMatrix with shape (2, 3)
  Intercept  I(x1 * x2)  np.log(x2 + 10)
          1     1.72655          2.39596
          1     0.89671          2.50478
  Terms:
    'Intercept' (column 0)
    'I(x1 * x2)' (column 1)
    'np.log(x2 + 10)' (column 2)
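Since a patsy DesignMatrix is a NumPy ndarray subclass, the matrices above can be passed straight to NumPy (or scikit-learn style) routines; a minimal sketch, using np.linalg.lstsq purely as an illustration:

import numpy as np
from patsy import dmatrices, demo_data

data = demo_data("a", "x1", "x2", "y", min_rows=20)
y, X = dmatrices("y ~ x1 + a:x2", data=data)  # LHS, RHS design matrices
beta, _, _, _ = np.linalg.lstsq(X, y)         # ordinary least-squares fit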

More on patsy quickstart

Model fitting by statsmodels

Like glm() and lm() in R. To fit a generalized linear model (GLM), one can supply a model formula and data just as constructed with patsy above. More on the statsmodels homepage.

import statsmodels.api as sm
import statsmodels.formula.api as smf
mod1 = smf.glm(
    formula="y ~ x1 + x2", data=data,
    family=sm.families.Gaussian(sm.families.links.identity)
).fit()
mod1.summary()
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                   20
Model:                            GLM   Df Residuals:                       17
Model Family:                Gaussian   Df Model:                            2
Link Function:               identity   Scale:                  0.862442022663
Method:                          IRLS   Log-Likelihood:                -25.274
Date:                Sat, 17 May 2014   Deviance:                       14.662
Time:                        03:13:52   Pearson chi2:                     14.7
No. Iterations:                     3
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.5220      0.250     -2.088      0.037        -1.012    -0.032
x1             0.2262      0.245      0.923      0.356        -0.254     0.707
x2            -0.0342      0.175     -0.195      0.846        -0.378     0.310
==============================================================================
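The fitted result object also exposes the numbers in the table above programmatically (standard statsmodels results attributes):

mod1.params        # fitted coefficients (Intercept, x1, x2)
mod1.bse           # standard errors
mod1.pvalues       # p-values
mod1.conf_int()    # 95% confidence intervals
mod1.fittedvalues  # fitted values on the training data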

Python is also good at

These packages are under active development and are expected to gain further functionality. However, R still takes over when it comes to serious statistics (e.g., a stratified Cox proportional hazards regression model in survival analysis).

Statistics in R
through RPy2

Quick RPy2 Intro

from rpy2 import robjects as ro
pi_py = ro.r['pi']  # R's pi comes back as an R vector
# <FloatVector - Python:0x101974288 / R:0x7fa65157ba78>
# [3.141593]
pi_py[0]      # 3.14159
pi_py + 1     # [3.14, 1] Pythonic list addition
pi_py.ro + 1  # [4.14] R vector addition via the .ro delegate
res = ro.IntVector([1, 2, 3, 4])  # make an R int vector
ro.r['sum'](res)  # look up R's sum() function and call it
# <IntVector - Python:0x1072c3fc8 / R:0x7fa653269b88>
# [      10]

Use R functions from packages

# rank(0, na.last = TRUE) from the base package
from rpy2.robjects.packages import importr
base = importr('base')
base.rank(0, na_last=True)  # '.' in R argument names becomes '_'
base.rank(0, **{'na.last': True})  # or use the **kwargs form

rank = ro.r['rank']  # or look the function up directly
rank(0, na_last=True)

Pandas has its own rpy2 support

# load an R dataset (e.g. iris) into a pandas DataFrame
import numpy as np
import pandas.rpy.common as com
iris_df = com.load_data('iris')
type(iris_df)  # pandas.core.frame.DataFrame

# pandas DataFrame to R data.frame
r_iris = com.convert_to_r_dataframe(iris_df)

# from rpy2 to numpy
int_vec = ro.IntVector(list(range(10)))

int_np = np.array(int_vec)      # conversion by copy
int_npnc = np.asarray(int_vec)  # conversion by reference (a view)

int_np[0] = 123
int_vec.rx(1)[0]  # still 0: the copy does not touch the R vector

int_npnc[0] = 123
int_vec.rx(1)[0]  # 123: the view writes through to the R vector

IPython Integration

In [1]: %load_ext rpy2.ipython  # ipython magic
In [2]: %R c(1, 2) + 1
Out[2]:
<FloatVector - Python:0x1031b9748 / R:0x7fa6532c37f0>
[2.000000, 3.000000]
In [3]: a = %R c(1, 2) + 1
In [4]: %%R
  ....: print(head(iris))  # print() is needed; only the last output is shown
  ....: rnorm(10)

IPython Integration (cont'd)

In [5]: df = pd.DataFrame({'a': [1], 'b': [2]})
In [6]: %Rpush df  # push a Python obj into R; numpy arrays work too
In [7]: %%R
  ....: print(class(df))
  ....: df
      # [1] "data.frame"
      #   a b
      # 1 1 2
In [8]: df_r = %Rget df   # pull an R obj back into Python

Demo:
Survival Analysis

What's Survival Analysis

In this form of data, the outcome has both a time-to-event measurement and an event status (did the event occur, or was the observation censored?).

Commonly used in the medical industry and in clinical trials (e.g., to determine the effect of a treatment on some disease).
Python has the lifelines package, but it lacks the flexibility of R when it comes to semi-parametric Cox regression models.
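As a taste of what the demos do, here is a hedged sketch of fitting a Cox model with R's survival package through rpy2 (Surv(), coxph(), strata(), and the lung dataset are standard parts of the survival package; the actual demo notebooks may differ):

from rpy2 import robjects as ro
from rpy2.robjects.packages import importr

survival = importr('survival')
lung = ro.r('survival::lung')  # example lung cancer dataset

# semi-parametric Cox proportional hazards model
fit = survival.coxph(ro.Formula('Surv(time, status) ~ age + sex'), data=lung)
print(ro.r['summary'](fit))

# a stratified model just adds a strata() term to the formula
fit2 = survival.coxph(ro.Formula('Surv(time, status) ~ age + strata(sex)'),
                      data=lung)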

Demo 1

Demo 2

Summary

Thank You!

Fork me on GitHub