Liang Bo Wang (亮亮), 2014-05-16

From Taiwan R User Group, more info on Meetup.

Liang Bo Wang (亮亮)

CC BY 4.0


- Master's student at Bioinfo & Biostat Core Lab, NTU CGM
- R / Python
- Try to learn how to speak DNA
- Taiwan R co-organizer
- PyCon APAC 2014/TW staff

- Why Python / R? Thoughts about data analysis and statistics
- Statistics in Python (overview)
- *scipy.stats* - probability distributions and utils (or d__, p__, r__ in R)
- *patsy* - model description (or formula() in R)
- *statsmodels* - statistical analysis (or lm(), glm(), ... in R)

- *rpy2* - R-powered Python
- Demo - survival analysis

*From a Python developer and R user*

- Writes own packages
- Knows roughly how Python works
- Though without knowing all the implementation details

- Uses R for specific purposes
- Views R's internals as a black box
- How most statisticians use R

Not the case in Taiwan: we get Python talks in the R user group!

and even PyCon APAC recognized us as well

Some say we are *R on the cover but Python under the skin* (hmm...) Nooope!

- When it comes to *serious* statistics, people think of R, SAS, SPSS, and Stata. New methods (e.g., in JASA and Biometrika) often ship with an R package.
- R is designed for statistics. Python is learning from all of R's good parts (pandas for R's data.frame; patsy for R's formula).

- You should know the answer otherwise you wouldn't come here!
- R is hard to learn (more functional, multiple OO systems, ...)
- R is not designed for general tasks
- Python is easier to optimize (Cython, cffi, ...)
- Better at data preprocessing, networking, and many other aspects

Some of the above are myths about R. Please read *R: the Good Parts*.

Statistics is about

- *Not* merely mean and variance (also known as descriptive statistics)
- *Sampling* and finding representative samples (otherwise, biased)
- *Modeling* for the specific data structure
- *Avoiding overfitting* or overly complicated models (model simplification)
- *Insights* about the data's mean and variance (plus domain knowledge)

in Python

- Python 3.4 (give it a try!)
- Numpy 1.8, Scipy 0.13, Matplotlib 1.3.1
- pandas 0.13+
- IPython (Notebook) 2.0
- Cython 0.20
- (needed for this talk) patsy 0.2.1, statsmodels 0.6dev, rpy2 2.4dev

The Vagrantfile in the repo can set up an Ubuntu 14.04 VM with the above packages.

Distributions live under scipy.stats. Take the Normal distribution as an example: it is `norm`. More on the SciPy stats tutorial.

```
import numpy as np
from scipy.stats import norm, binom

rv = norm(loc=0, scale=1)  # fix the shape of the Normal dist.
rv.cdf(0)          # 0.5 = Pr(X ≤ 0)
rv.pdf(-np.inf)    # 0 = f(X = -inf)
rv.rvs(size=10)    # generate 10 random samples
```
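To make the correspondence with R's d__, p__, q__, r__ naming explicit, a frozen distribution (or the distribution object itself) exposes the same four operations; a minimal sketch:

```python
from scipy.stats import norm

# scipy.stats method  <->  R function (Normal distribution)
norm.pdf(0)       # density,   like R's dnorm(0)
norm.cdf(1.96)    # CDF,       like R's pnorm(1.96)
norm.ppf(0.975)   # quantile,  like R's qnorm(0.975)
norm.rvs(size=3)  # sampling,  like R's rnorm(3)
```

`ppf` (percent point function) is the inverse of `cdf`, so `norm.ppf(norm.cdf(x)) == x`.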

A powerful formula mini-language. It also provides data transformation through custom functions.

```
from patsy import dmatrices, dmatrix, demo_data
data = demo_data("a", "x1", "x2", "y", min_rows=20)
# generates a dict-like structure:
# {'a': ['a1', 'a2'],
#  'x1': array([ 1.76405235, 0.40015721]),
#  'x2': array([ 0.97873798, 2.2408932 ]),
#  'y': array([ 1.86755799, -0.97727788])}
```

```
# Y depends on x1 and interaction between a and x2
In [2]: dmatrices("y ~ x1 + a:x2", data=data) # syntax similar to R
Out[2]: # specify LHS = RHS two DesignMatrices
(DesignMatrix with shape (2, 1)
y
1.86756
-0.97728
Terms:
'y' (column 0),
DesignMatrix with shape (2, 4)
Intercept x1 a[a1]:x2 a[a2]:x2
1 1.76405 0.97874 0.00000
1 0.40016 0.00000 2.24089
Terms:
'Intercept' (column 0)
'x1' (column 1)
'a:x2' (columns 2:4))
```

```
In [3]:
dmatrix("I(x1 * x2) + np.log(x2 + 10)", data)
Out[3]:
DesignMatrix with shape (2, 3)
Intercept I(x1 * x2) np.log(x2 + 10)
1 1.72655 2.39596
1 0.89671 2.50478
Terms:
'Intercept' (column 0)
'I(x1 * x2)' (column 1)
'np.log(x2 + 10)' (column 2)
```
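Besides arbitrary Python calls like `np.log` above, patsy also ships built-in *stateful* transforms such as `center()` and `standardize()`; a minimal sketch (variable names are ours, reusing `demo_data` from the earlier example):

```python
import numpy as np
from patsy import dmatrix, demo_data

data = demo_data("x1", "x2", min_rows=20)
# center() and standardize() are patsy's built-in stateful transforms;
# they remember the training data's mean/sd so new data can be
# transformed consistently at prediction time
dm = dmatrix("center(x1) + standardize(x2)", data)
# the transformed columns (after the intercept) have near-zero mean
print(np.asarray(dm)[:, 1:].mean(axis=0))
```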

More on the patsy quickstart

Like `glm()` and `lm()` in R. To fit a Generalized Linear Model (GLM), one can supply the model formula and the data constructed previously with patsy. More on the statsmodels homepage.

```
import statsmodels.api as sm
import statsmodels.formula.api as smf

mod1 = smf.glm(
    formula="y ~ x1 + x2", data=data,
    family=sm.families.Gaussian(sm.families.links.identity)
).fit()
```

```
mod1.summary()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 20
Model: GLM Df Residuals: 17
Model Family: Gaussian Df Model: 2
Link Function: identity Scale: 0.862442022663
Method: IRLS Log-Likelihood: -25.274
Date: Sat, 17 May 2014 Deviance: 14.662
Time: 03:13:52 Pearson chi2: 14.7
No. Iterations: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -0.5220 0.250 -2.088 0.037 -1.012 -0.032
x1 0.2262 0.245 0.923 0.356 -0.254 0.707
x2 -0.0342 0.175 -0.195 0.846 -0.378 0.310
==============================================================================
```

- Heavy computation - Theano (GPU computation)
- Simulation - PyMC (MCMC simulation)
- Machine learning - scikit-learn, Orange (friendly vis interaction)

These packages are under active development and are expected to gain further functionality. However, R takes over when it comes to serious statistics (e.g., the *stratified Cox proportional hazards regression model* in survival analysis).

through RPy2

- As one runs rpy2, it starts an embedded R process.
- Communicate with the R process through R's C API
- Provide different levels of abstraction
- Integrate with pandas and IPython
- Graphic passing is possible
- Control the embedded R via `rpy2.robjects.r`

at high level

```
from rpy2 import robjects as ro
pi_py = ro.r['pi']  # the original pi is an R vector
# <FloatVector - Python:0x101974288 / R:0x7fa65157ba78>
# [3.141593]
pi_py[0]      # 3.14159
pi_py + 1     # [3.14, 1] Pythonic list concatenation
pi_py.ro + 1  # [4.14] R vector addition
res = ro.IntVector([1, 2, 3, 4])  # make an R int vector
ro.r['sum'](res)  # get R's sum() function and call it
# <IntVector - Python:0x1072c3fc8 / R:0x7fa653269b88>
# [ 10]
```

Using R functions from packages

```
# rank(0, na.last = TRUE) from the base package
from rpy2.robjects.packages import importr
base = importr('base')
base.rank(0, na_last=True)         # '.' in R argument names becomes '_'
base.rank(0, **{'na.last': True})  # or use the **kwargs form
rank = ro.r['rank']
rank(0, na_last=True)
```

pandas has its own rpy2 support

```
# get R datasets (e.g., iris) into a pandas DataFrame
import pandas.rpy.common as com
iris_df = com.load_data('iris')
type(iris_df) # pandas.core.frame.DataFrame
# Py DataFrame to R data.frame
r_iris = com.convert_to_r_dataframe(iris_df)
```

```
# From rpy2 to numpy
int_vec = ro.IntVector(list(range(10)))
int_np = np.array(int_vec) # conversion by copy
int_npnc = np.asarray(int_vec) # conversion by reference
int_np[0] = 123
int_vec.rx(1)[0] # 0
int_npnc[0] = 123
int_vec.rx(1)[0] # 123
```

IPython Integration

```
In [1]: %load_ext rpy2.ipython # ipython magic
In [2]: %R c(1, 2) + 1
Out[2]:
<FloatVector - Python:0x1031b9748 / R:0x7fa6532c37f0>
[2.000000, 3.000000]
In [3]: a = %R c(1, 2) + 1
In [4]: %%R
....: print(head(iris)) # only show last output
....: rnorm(10)
```

IPython Integration (cont'd)

```
In [5]: df = pd.DataFrame({'a': [1], 'b': [2]})
In [6]: %Rpush df # push Py obj, numpy.array also
In [7]: %%R
   ....: print(class(df)) # only show last output
   ....: df
# [1] "data.frame"
#   a b
# 1 1 2
In [8]: df_r = %Rget df # get R obj
```

Survival Analysis

This form of data has an outcome with two parts:

- Event outcome (whether an event occurs)
- Event time (how long takes an event to occur)

Commonly used in the medical industry and clinical trials (e.g., to determine the effect of a treatment on some disease).

Python has the *lifelines* package, but it lacks the flexibility of models like the semi-parametric Cox regression.
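The actual demo drives R's survival package through rpy2; as a pure-Python illustration of what survival data looks like, here is a minimal Kaplan-Meier estimator sketch (the function name and toy data are ours, not from the talk):

```python
import numpy as np

def kaplan_meier(time, event):
    """Minimal Kaplan-Meier survival estimator (illustrative sketch).

    time  : follow-up times
    event : 1 if the event occurred, 0 if the subject was censored
    Returns the distinct event times and the survival estimate S(t).
    """
    time, event = np.asarray(time), np.asarray(event)
    uniq = np.unique(time[event == 1])   # distinct observed event times
    surv, s = [], 1.0
    for t in uniq:
        at_risk = np.sum(time >= t)      # subjects still under observation
        died = np.sum((time == t) & (event == 1))
        s *= 1.0 - died / at_risk        # product-limit update
        surv.append(s)
    return uniq, np.array(surv)

# toy data: events at t=1,2,3,5; censored at t=2 and t=4
times, surv = kaplan_meier([1, 2, 2, 3, 4, 5], [1, 1, 0, 1, 0, 1])
```

Note how the censored subjects still count as "at risk" up to their censoring time; this is exactly the structure a Cox model builds on.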

- Python + R is a good way to get the power of both statistics and general computation.
- rpy2 provides a solid R binding with good integration into the Python scientific ecosystem.
- Statistics broadens one's view of data.
- PyCon rocks, and so does the Taiwan R / MLDM Meetup :)