Data Science Introduction

王亮博 (亮亮), 2014-06-25

From Taiwan R User Group, more info on Meetup.

Foxconn, 2014-06-25

R/Python/Vis

Shared under the CC BY 4.0 license

About Me

Slides at http://blog.liang2.tw/2014-FXN-datasci/

  • Currently collaborating with FXN on DNA Cloud
  • So I am pretty much just a beginner-level
    researcher
  • My views on some topics may differ quite a bit from yours

Agenda

I: Data Science Intro

This part is heavily adapted from Johnson's talk at DSP.

Explained by examples

What can you find in this helicopter video of a car chase?

Discovery by Oona Räisänen

From http://www.windytan.com/2014/02/mystery-signal-from-helicopter.html

The wave pattern turns out to be encoded data, most of it repeating. Location? Video timestamp? Camera direction?

From http://www.windytan.com/2014/02/mystery-signal-from-helicopter.html

Plot the full trace on a map. The data later decodes into exact GPS locations.

From http://www.windytan.com/2014/02/mystery-signal-from-helicopter.html

Another example: World Cup prediction

http://grollchristian.wordpress.com/2014/06/12/world-cup-2014-prediction/

What can we learn from these examples?

Anyway, that is pretty much how to conduct a data analysis

From http://www.slideshare.net/euler96/ss-35513599

  • Know your problem
  • Get your hands dirty
  • Make a trained, logical inference

From http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

That's the whole picture

We then dive into each of the following subjects one by one.

Data ETL

So true

R? Python? PHP? C/Java? JavaScript? Shown by examples.

G0V serves as one of the best examples

And the best place to learn data science as well

Check out http://g0v.tw/

ETL data type

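Crowdsourced OCR by netizens (鄉民 OCR)

Relies heavily on "Human Learning" (工人智慧, a pun on 人工智慧, "artificial intelligence"). How?

From 政治獻金數位化 (Political Donation Digitization)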

From http://ronnywang.pixnet.net/blog/post/40488349

He is using PHP and C :)

Using the Hough transform in OpenCV

Divide the table into separate cells

From http://ronnywang.pixnet.net/blog/post/40488349
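
The idea of splitting a scanned table into cells can be sketched roughly as follows. This is not ronnywang's PHP/C code, and it does not call OpenCV (which exposes the technique as `cv2.HoughLines`). It is a toy pure-Python Hough vote, restricted to horizontal and vertical lines on a tiny binary image. The image, the 0.8 threshold, and all function names are hypothetical and chosen only for illustration.

```python
import math

def draw_table(width, height, row_lines, col_lines):
    """Build a toy binary image (1 = ink) containing ruled table lines."""
    image = [[0] * width for _ in range(height)]
    for y in row_lines:
        image[y] = [1] * width
    for x in col_lines:
        for row in image:
            row[x] = 1
    return image

def hough_votes(image, thetas_deg=(0, 90)):
    """Each ink pixel votes for every line rho = x*cos(t) + y*sin(t) through it."""
    votes = {}
    for y, row in enumerate(image):
        for x, ink in enumerate(row):
            if not ink:
                continue
            for theta in thetas_deg:
                t = math.radians(theta)
                rho = round(x * math.cos(t) + y * math.sin(t))
                votes[(rho, theta)] = votes.get((rho, theta), 0) + 1
    return votes

def find_grid(image, min_fraction=0.8):
    """Keep lines whose votes cover most of the image height or width."""
    height, width = len(image), len(image[0])
    votes = hough_votes(image)
    # theta = 0 gives rho = x (a vertical line); theta = 90 gives rho = y.
    cols = sorted(rho for (rho, theta), n in votes.items()
                  if theta == 0 and n >= min_fraction * height)
    rows = sorted(rho for (rho, theta), n in votes.items()
                  if theta == 90 and n >= min_fraction * width)
    return rows, cols

def cells(rows, cols):
    """Cell boxes (top, left, bottom, right) between neighbouring lines."""
    return [(top, left, bottom, right)
            for top, bottom in zip(rows, rows[1:])
            for left, right in zip(cols, cols[1:])]

# Hypothetical 9x7 "scan" with a 2x2 table drawn on it.
image = draw_table(width=9, height=7, row_lines=[0, 3, 6], col_lines=[0, 4, 8])
rows, cols = find_grid(image)
print(rows, cols)
print(cells(rows, cols))
```

Each detected cell box could then be cropped and handed to volunteers or an OCR engine. A real scan would also need thresholding and tolerance for skewed lines, which this sketch leaves out.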

G0V even has its own data portal now

It seems to be just crawling data. What's special?

From http://data.g0v.tw/

Jump to a more recent event

  • He used data from g0v.today
  • Showed some differences in political position between media outlets

From 服貿事件 X 資料科學 (The Service Trade Agreement Incident x Data Science), Johnson Hsieh

Not us, but in fact someone did

  • No comment on whether 自由 (Liberty Times) and 蘋果 (Apple Daily) are on the people's side
  • But they managed to capture the online news traffic
  • One may think of this as a coincidence
  • 蘋果 is in fact making some promising attempts (to become the New York Times?)

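From 「太陽花運動」蘋果、中時閱讀率高下立判 ("Sunflower Movement": a clear gap between Apple Daily and China Times readership), TechNews

From 新聞打卡地圖,蘋果互動新聞圖表 (News Check-in Map, Apple Daily interactive news graphics)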

Rapidly adopting data visualization technology

Data ETL summary and notes

Analysis (Model)

Some terms you need to know

Before we start talking statistics

Statistics is about

Essentially, all models are wrong, but some are useful.

George E. P. Box (1987)

Choose a model

From Machine Learning Cheat Sheet (for scikit-learn), Peekaboo

Based on their properties

From Classifier comparison, scikit-learn

From Cluster comparison, scikit-learn

Data a classmate asked me to analyze

Comparison between models

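# Linear model: Speed explained by HP_10, Time, and their interaction
model.full <- lm(Speed ~ HP_10 * Time, data = df_sim)
# Higher-order model: add a quadratic Time term for each HP_10 group
model.hi <- update(
  model.full,
  . ~ . + HP_10:I(Time^2),  # or try `HP_10 * I(Time^2)`
  data = df_sim
)
# F-test: does the quadratic term significantly reduce the residual error?
anova(model.full, model.hi)
summary(model.hi)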
      
anova(...)
# Analysis of Variance Table
#
# Model 1: Speed ~ HP_10 * Time
# Model 2: Speed ~ HP_10 + Time + HP_10:Time + HP_10:I(Time^2)
#   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
# 1    474 269.43
# 2    472 242.26  2    27.175 26.473 1.269e-11 ***
summary(...)
# Coefficients:
#                  Estimate Std. Error t value Pr(>|t|)
# (Intercept)      10.09593    0.10485  96.289  < 2e-16 ***
# HP_101            3.87381    0.15231  25.433  < 2e-16 ***
# Time              0.06238    0.14318   0.436    0.663
# HP_101:Time      -0.94099    0.20679  -4.551 6.81e-06 ***
# HP_100:I(Time^2) -0.18324    0.03914  -4.681 3.73e-06 ***
# HP_101:I(Time^2)  0.23082    0.04143   5.571 4.27e-08 ***
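
As a rough check on the anova() table above, the F statistic for two nested linear models can be recomputed by hand as F = ((RSS1 - RSS2) / (df1 - df2)) / (RSS2 / df2). The sketch below is in Python rather than R and uses only the rounded RSS and Res.Df values printed in the table, so it reproduces the reported F only approximately. The function name is made up for illustration.

```python
# Values copied from the anova() table (rounded as printed there).
rss_linear, df_linear = 269.43, 474        # Model 1: Speed ~ HP_10 * Time
rss_quadratic, df_quadratic = 242.26, 472  # Model 2: adds HP_10:I(Time^2)

def nested_f_statistic(rss_small, df_small, rss_big, df_big):
    """F statistic comparing a smaller model against a larger nested one."""
    extra_params = df_small - df_big       # parameters added by the larger model
    rss_drop = rss_small - rss_big         # residual error explained by them
    return (rss_drop / extra_params) / (rss_big / df_big)

f_stat = nested_f_statistic(rss_linear, df_linear, rss_quadratic, df_quadratic)
print(round(f_stat, 2))  # close to the 26.473 reported by anova()
```

The two extra parameters are the per-group quadratic coefficients (`HP_100:I(Time^2)` and `HP_101:I(Time^2)`), which is why the table shows Df = 2.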

      

linear model

quadratic model

Thoughts about the previous model

More examples (if time permits)

Visualization/Report

Plain words are hard to understand

... I think TED actually stands for: middlebrow megachurch infotainment. The key rhetorical device for TED talks is a combination of epiphany and personal testimony (an “epiphimony” if you like) through which the speaker shares a personal journey of insight and realization, its triumphs and tribulations.
What is it that the TED audience hopes to get from this? A vicarious insight, a fleeting moment of wonder, an inkling that maybe it’s all going to work out after all? A spiritual buzz?...

We Need to Talk About TED, Benjamin H. Bratton

Visualization is like expressing it this way

... WAIT! Are you sure that's what the article is about?

From http://www.naturalnews.com/042112_TED_conferences_pseudoscience_GMO.html

Thoughts about visualization

Visualization Example

Report example

Interactive Visualization

From publication to manipulation

Demo from http://timelyportfolio.github.io/gridSVG_intro/

Why interactive?

With today's methods, we can reveal more details, offer different viewpoints, and be fancier :)

Interactive visualization examples

The Largest Vocabulary in Hip Hop

From http://rappers.mdaniels.com.s3-website-us-east-1.amazonaws.com/

How interactive?

The web and browsers dominate our front-end world.

Almost every PC and mobile device has a modern browser today.

← SVG / HTML5 Canvas

Why SVG?

Scalable Vector Graphics (SVG) is an XML markup language for describing two-dimensional vector graphics.

Mozilla Developer Network
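
To make "SVG is XML" concrete, here is a minimal sketch that builds a tiny bar chart as an SVG string using only Python's standard XML library. The data values, sizes, fill colour, and function name are all hypothetical. A real interactive chart would more likely be built in the browser with JavaScript.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

def bar_chart_svg(values, bar_width=20, height=100):
    """Return a tiny bar chart as an SVG (XML) string."""
    svg = ET.Element("svg", {
        "xmlns": SVG_NS,
        "width": str(bar_width * len(values)),
        "height": str(height),
    })
    tallest = max(values)
    for i, value in enumerate(values):
        bar_height = round(height * value / tallest)
        ET.SubElement(svg, "rect", {
            "x": str(i * bar_width),
            "y": str(height - bar_height),  # SVG's y axis points down
            "width": str(bar_width - 2),    # leave a 2px gap between bars
            "height": str(bar_height),
            "fill": "steelblue",
        })
    return ET.tostring(svg, encoding="unicode")

svg_text = bar_chart_svg([3, 7, 5])  # hypothetical data
print(svg_text)
```

Saving the printed text to a `.svg` file and opening it in a browser shows three bars. Because each bar is an ordinary XML element, scripts and stylesheets can address it individually, which is what makes SVG convenient for interaction.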

Interactive visualization in web

My thoughts on front-end development

  • Debugging and testing are non-trivial
  • Mozilla Developer Network (MDN) is a good reference for web stuff
  • Not merely about programming, but more about design and UX

Hope you enjoy :)

Big Data Issue

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it ...

Dan Ariely

Do you really need map reduce?

Hadoop? Try Spark!

DataSci Communities

DSP 資料科學計畫 (Data Science Program)

About Taiwan R User Group

  • Better known for its weekly meetup, MLDM Monday (Machine Learning and Data Mining Monday)
  • Topics include
    • R lang: basic tutorial, Rcpp, quantmod, ggplot2, slidify, knitr, googleVis
    • Statistics, ML/DM: survival analysis, neural networks, SVM, regression, nonparametric statistics
    • Big Data: Hadoop, MPI
    • PyData: NumPy, scikit-learn, pandas

Meetup recordings on YouTube

台灣資料科學愛好者年會 2014 (Taiwan Data Science Enthusiasts Annual Conference 2014)

PyCon APAC 2015

Sponsor of PyCon APAC 2014

PyCon APAC 2014 Statistics

Part I ends

II: Data Science in Action

Thank You!