Hui Lin and Ming Li
Data science is the discipline of making data useful. Ok…so what is it?
Engineering (infrastructure and production): the process of making everything else possible
Analysis: the process of quickly turning raw information into insights
Modeling/Inference: the process of diving deeper into the data to discover patterns that are not easy to see
(This is group work from https://github.com/brohrer/academic_advisory/blob/master/authors.md!)
Data environment: data storage, Kafka platform, Hadoop and Spark clusters, etc.
Data management: parsing the logs, web scraping, API queries, and interrogating data streams.
Production: integrating models and analyses into the production system
Domain knowledge
Exploratory analysis
Storytelling
Problem to solve:
Prediction/classification: image recognition, machine translation, spam/not_spam
Explanation: customer segmentation, feature prioritization
Causal inference: vaccine effectiveness, policy change
🔑 Questions
Do we want to intervene?
Is the cost of an error too high?
Does the problem have a simple objective?
##     Location WaffleHouses South MedianAgeMarriage Marriage Divorce
## 1    Alabama          128     1              25.3     20.2    12.7
## 2     Alaska            0     0              25.2     26.0    12.5
## 3    Arizona           18     0              25.8     20.3    10.8
## 4   Arkansas           41     1              24.3     26.4    13.5
## 5 California            0     0              26.8     19.1     8.0
## 6   Colorado           11     0              25.7     23.5    11.6
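As a rough illustration of how the same table can serve different questions, here is a minimal base R sketch. It assumes only the six rows printed above, re-entered by hand, so it is not the full data set. The object names `d`, `r_waffle`, and `fit` are hypothetical and chosen for this example only.

```r
# Re-enter the six rows shown in the printed output (not the full data set).
d <- data.frame(
  Location = c("Alabama", "Alaska", "Arizona",
               "Arkansas", "California", "Colorado"),
  WaffleHouses = c(128, 0, 18, 41, 0, 11),
  South = c(1, 0, 0, 1, 0, 0),
  MedianAgeMarriage = c(25.3, 25.2, 25.8, 24.3, 26.8, 25.7),
  Marriage = c(20.2, 26.0, 20.3, 26.4, 19.1, 23.5),
  Divorce = c(12.7, 12.5, 10.8, 13.5, 8.0, 11.6)
)

# Association only: how closely does the number of Waffle Houses
# track the divorce rate in these six rows?
r_waffle <- cor(d$WaffleHouses, d$Divorce)

# A simple linear model with two predictors. The coefficients describe
# association in this tiny sample; they are not causal effects.
fit <- lm(Divorce ~ MedianAgeMarriage + WaffleHouses, data = d)

print(round(r_waffle, 2))
print(coef(fit))
```

With only six rows the numbers are unstable, and a correlation or regression coefficient on its own does not tell us what would happen if we intervened. That is why the question "Do we want to intervene?" matters when choosing between prediction, explanation, and causal inference.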
caret package