Introduction

Hui Lin and Ming Li

Course Website and GitHub

https://course2021.scientistcafe.com/

The term no one really defined

Data science is the discipline of making data useful. OK… so what is it?

The myth of data science

Three tracks of data science

(This is group work by the authors listed at https://github.com/brohrer/academic_advisory/blob/master/authors.md)

Engineering

  1. Data environment: data storage, Kafka platform, Hadoop and Spark clusters, etc.

  2. Data management: parsing the logs, web scraping, API queries, and interrogating data streams.

  3. Production: integrating models and analyses into the production system.
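
To make the "parsing the logs" item above concrete, here is a minimal sketch in Python. The log format, the sample lines, and the names `LOG_LINES`, `LINE_PATTERN`, and `parse_line` are all hypothetical and chosen only for illustration; real logs will need their own pattern.

```python
import re
from collections import Counter

# Hypothetical access-log lines; this format is an assumption for illustration.
LOG_LINES = [
    "2021-03-01T10:00:00 GET /home 200",
    "2021-03-01T10:00:02 GET /missing 404",
    "2021-03-01T10:00:05 POST /login 200",
    "malformed line",
]

# One named group per field: timestamp, HTTP method, path, status code.
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the pattern."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


# Keep only the lines that parsed, then count responses per status code.
records = [rec for rec in map(parse_line, LOG_LINES) if rec is not None]
status_counts = Counter(rec["status"] for rec in records)
print(status_counts)
```

Returning `None` for lines that do not match keeps malformed input from stopping the whole run, which is usually what you want when reading logs in bulk.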

Analysis

  1. Domain knowledge

  2. Exploratory analysis

  3. Storytelling

Modeling

Problem to solve:

🔑 Questions

💡 Waffle Houses and Divorce Rate

##     Location WaffleHouses South MedianAgeMarriage Marriage Divorce
## 1    Alabama          128     1              25.3     20.2    12.7
## 2     Alaska            0     0              25.2     26.0    12.5
## 3    Arizona           18     0              25.8     20.3    10.8
## 4   Arkansas           41     1              24.3     26.4    13.5
## 5 California            0     0              26.8     19.1     8.0
## 6   Colorado           11     0              25.7     23.5    11.6
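
As a first look at the table above, the sketch below computes the Pearson correlation between WaffleHouses and Divorce using only the six rows shown. The names `ROWS` and `pearson` are illustrative assumptions, and six rows are far too few to support any conclusion; the point is only to show the calculation.

```python
from math import sqrt

# The six rows printed above, reduced to (Location, WaffleHouses, Divorce).
ROWS = [
    ("Alabama", 128, 12.7),
    ("Alaska", 0, 12.5),
    ("Arizona", 18, 10.8),
    ("Arkansas", 41, 13.5),
    ("California", 0, 8.0),
    ("Colorado", 11, 11.6),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


waffle_houses = [row[1] for row in ROWS]
divorce = [row[2] for row in ROWS]
r = pearson(waffle_houses, divorce)
print(f"Pearson r (6 rows only): {r:.2f}")
```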

Some confusions and more to come

Data Science Types vs. Needs

Data Flow

Data Science Roles