Liang Bo Wang (亮亮), 2014-05-16
From Taiwan R User Group, more info on Meetup.
CC BY 4.0
From a Python developer and R user
Not the case in Taiwan: we have Python talks at our R user group,
and PyCon APAC has even recognized us as well.
Some say we are R on the cover but Python under the skin (hmm...) Nope!
Some of the above are myths about R. Please read R: the Good Parts.
Statistics is about
The Vagrantfile in the repo can set up an Ubuntu 14.04 VM with the above packages.
Distributions live under scipy.stats. Take the normal distribution as an example: it is norm. More on the SciPy stats tutorial.
import numpy as np
from scipy.stats import norm, binom

rv = norm(loc=0, scale=1) # fix the shape of the normal dist.
rv.cdf(0)        # 0.5 = Pr(X ≤ 0)
rv.pdf(-np.inf)  # 0 = f(X = -inf)
rv.rvs(size=10)  # generate 10 random variates
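binom is imported above but not used; a parallel sketch for a discrete distribution works the same way (the parameter values here are just illustrative):

```python
from scipy.stats import binom

rv = binom(n=10, p=0.5)  # fix the shape: 10 trials, success prob 0.5
rv.pmf(5)                # Pr(X = 5) ≈ 0.2461 (discrete: pmf, not pdf)
rv.cdf(10)               # 1.0 = Pr(X ≤ 10)
rv.rvs(size=10)          # generate 10 random variates
```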
Patsy provides a powerful R-like formula syntax and supports data transformation via custom functions.
from patsy import dmatrices, dmatrix, demo_data
data = demo_data("a", "x1", "x2", "y", min_rows=20)
# generate a dict-like structure
{'a': ['a1', 'a2'],
'x2': array([ 0.97873798, 2.2408932 ]),
'x1': array([ 1.76405235, 0.40015721]),
'y': array([ 1.86755799, -0.97727788])}
# Y depends on x1 and interaction between a and x2
In [2]: dmatrices("y ~ x1 + a:x2", data=data) # syntax similar to R
Out[2]: # specify LHS = RHS two DesignMatrices
(DesignMatrix with shape (2, 1)
y
1.86756
-0.97728
Terms:
'y' (column 0),
DesignMatrix with shape (2, 4)
Intercept x1 a[a1]:x2 a[a2]:x2
1 1.76405 0.97874 0.00000
1 0.40016 0.00000 2.24089
Terms:
'Intercept' (column 0)
'x1' (column 1)
'a:x2' (columns 2:4))
In [3]:
dmatrix("I(x1 * x2) + np.log(x2 + 10)", data)
Out[3]:
DesignMatrix with shape (2, 3)
Intercept I(x1 * x2) np.log(x2 + 10)
1 1.72655 2.39596
1 0.89671 2.50478
Terms:
'Intercept' (column 0)
'I(x1 * x2)' (column 1)
'np.log(x2 + 10)' (column 2)
More on patsy quickstart
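Patsy also ships stateful transforms such as center() and standardize(), which remember the statistics they compute. A small sketch (column names follow demo_data; the values are randomly generated):

```python
import numpy as np
from patsy import dmatrix, demo_data

data = demo_data("x1", "x2")  # two numeric columns of random values
# center() subtracts the column mean; standardize() also divides by std
mat = dmatrix("center(x1) + standardize(x2)", data)
np.asarray(mat)[:, 1].mean()  # ~0 after centering
```

Because the transforms are stateful, the means and stds learned from the training data are reused when building design matrices for new data.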
Like glm() and lm()
in R. To fit a generalized linear model (GLM), supply a formula and data; statsmodels uses patsy to construct the design matrices as shown previously. More on the statsmodels homepage.
import statsmodels.api as sm
import statsmodels.formula.api as smf
mod1 = smf.glm(
formula="y ~ x1 + x2", data=data,
family=sm.families.Gaussian(sm.families.links.identity)
).fit()
mod1.summary()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 20
Model: GLM Df Residuals: 17
Model Family: Gaussian Df Model: 2
Link Function: identity Scale: 0.862442022663
Method: IRLS Log-Likelihood: -25.274
Date: Sat, 17 May 2014 Deviance: 14.662
Time: 03:13:52 Pearson chi2: 14.7
No. Iterations: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -0.5220 0.250 -2.088 0.037 -1.012 -0.032
x1 0.2262 0.245 0.923 0.356 -0.254 0.707
x2 -0.0342 0.175 -0.195 0.846 -0.378 0.310
==============================================================================
These packages are under active development and are expected to gain further functionality. However, R still takes over when it comes to serious statistics (e.g., a stratified Cox proportional hazards model in survival analysis).
rpy2.robjects.r at a high level
from rpy2 import robjects as ro
pi_py = ro.r['pi'] # R's pi comes back as an R vector
# <FloatVector - Python:0x101974288 / R:0x7fa65157ba78>
# [3.141593]
pi_py[0] # 3.14159
pi_py + 1    # [3.14, 1] Python list-style concatenation
pi_py.ro + 1 # [4.14] R element-wise addition
res = ro.IntVector([1, 2, 3, 4]) # make an R integer vector
ro.r['sum'](res) # fetch R's sum() and call it
# <IntVector - Python:0x1072c3fc8 / R:0x7fa653269b88>
# [ 10]
Using R functions from packages
# rank(0, na.last = TRUE) from base package
from rpy2.robjects.packages import importr
base = importr('base')
base.rank(0, na_last=True) # replace '.' with '_'
base.rank(0, **{'na.last': True}) # or the **kwargs form
rank = ro.r['rank']
rank(0, na_last = True)
Pandas has its own rpy2 support
# get R datasets (e.g., iris) into a pandas DataFrame
import pandas.rpy.common as com
iris_df = com.load_data('iris')
type(iris_df) # pandas.core.frame.DataFrame
# Py DataFrame to R data.frame
r_iris = com.convert_to_r_dataframe(iris_df)
# From rpy2 to numpy
int_vec = ro.IntVector(list(range(10)))
int_np = np.array(int_vec) # conversion by copy
int_npnc = np.asarray(int_vec) # conversion by reference
int_np[0] = 123
int_vec.rx(1)[0] # 0
int_npnc[0] = 123
int_vec.rx(1)[0] # 123
IPython Integration
In [1]: %load_ext rpy2.ipython # ipython magic
In [2]: %R c(1, 2) + 1
Out[2]:
<FloatVector - Python:0x1031b9748 / R:0x7fa6532c37f0>
[2.000000, 3.000000]
In [3]: a = %R c(1, 2) + 1
In [4]: %%R
....: print(head(iris)) # print() shows intermediate output
....: rnorm(10)         # only the last expression is auto-printed
IPython Integration (cont'd)
In [5]: df = pd.DataFrame({'a': [1], 'b': [2]})
In [6]: %Rpush df # push Py obj, numpy.array also
In [7]: %%R
....: print(class(df)) # only the last expression is auto-printed
....: df
# [1] "data.frame"
#   a b
# 1 1 2
In [8]: df_r = %Rget df # get R obj back into Python
In this form of data, the outcome has two components: the time to event and whether the event was actually observed (the censoring status).
Commonly used in the medical industry and clinical trials (e.g., to determine the effect of a treatment on a disease).
Python has the lifelines package, but it lacks the flexibility to combine with a semi-parametric Cox regression model.
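The core idea behind such packages, the Kaplan-Meier estimator, is simple enough to sketch in plain Python. This is an illustration of the estimator itself with toy data, not the lifelines implementation:

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate: S(t) = prod over event times t_i <= t
    of (1 - d_i / n_i), with d_i events among n_i subjects at risk."""
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)  # everyone is at risk at t = 0
    surv = 1.0
    curve = []  # (event time, survival estimate) steps
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        # group all subjects with the same time t
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]  # 1 = event observed, 0 = censored
            removed += 1
            i += 1
        if deaths:  # censored-only times do not change the estimate
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed  # events and censored both leave the risk set
    return curve

# toy data: times, with 1 = event observed, 0 = censored
kaplan_meier([2, 4, 4, 5, 6, 6], [1, 1, 1, 1, 0, 0])
# steps at t = 2, 4, 5; S drops only at observed events
```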