My takeaways from A Practical Introduction To Data Science at NDC by @markawest
Data Science is a combination of computer science, machine learning and traditional research methods from math and statistics.
It follows the scientific process: hypothesis and analysis. Question -> Data -> Exploration & analysis (about 80% of the work) -> Modelling -> Interpretation ("what insight am I learning from this model?") -> Communication (storytelling) -> Result.
A successful data science team needs a wide range of competences:
- Whereas a data scientist uses models to try to read meaning out of the data at hand, they generally have very little experience with operationalizing the model - putting it into production for use in systems.
- A data engineer deals with data integration, building data-driven platforms and operationalizing models.
- A visualization expert can be good at storytelling and provide insight.
- Then, there's a process owner that deals with project management and communication.
Artificial Intelligence is a very general topic which, in turn, contains the subset Machine Learning (automatic generation of rules). Machine Learning, in turn, has a subset called Deep Learning (neural networks).
A set of training data is pushed through an Algorithm, which renders a Model (rules).
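The data -> algorithm -> model flow can be sketched with scikit-learn (the hours-studied/passed-exam toy data below is made up for illustration):

```python
# Sketch: training data pushed through an algorithm renders a model (rules).
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [8], [9], [10]]  # features: hours studied
y = [0, 0, 0, 1, 1, 1]               # labels: failed (0) / passed (1)

algorithm = LogisticRegression()
model = algorithm.fit(X, y)          # the "rules" learned from the data

print(model.predict([[7]]))          # apply the model to an unseen input
```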
Learning modes
Supervised learning is where you train the model on data you have already categorized (labelled).
- Linear regression tries to draw a line through the majority of the data points. Deviations (out-of-range values) are called outliers and might signal an unfit model (not enough labels?) or an error.
- Decision trees are intuitive for software developers, since they're essentially flowcharts. The deeper the tree, the more prone it is to overfitting (i.e. not predicting a result from new data, but matching your data to a result, because the tree has effectively memorized the training data).
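The depth/overfitting trade-off can be demonstrated by comparing an unlimited-depth tree with a shallow one on the same (synthetic, made-up) dataset:

```python
# An unpruned tree memorizes the training set (perfect training score),
# which is the overfitting symptom described above; a shallow tree cannot.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("deep tree:    train", deep.score(X_train, y_train), "test", deep.score(X_test, y_test))
print("shallow tree: train", shallow.score(X_train, y_train), "test", shallow.score(X_test, y_test))
```

A big gap between training and test accuracy for the deep tree is the tell-tale sign of overfitting.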
Unsupervised learning is where a machine clusters data by its properties (i.e. all green things in one lump and all blue in another)
- K-means clustering. Describing, not predicting. K = the number of clusters you want the algorithm to identify.
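A minimal K-means sketch: two obvious lumps of 2-D points, clustered with K=2 (the points are made up):

```python
# K-means groups points by proximity; here the two lumps end up in
# two separate clusters, without any labels being provided.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one lump
                   [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])  # another lump

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the centre of each identified cluster
```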
Reinforcement learning is a third mode: a continually learning model that learns from past mistakes.
A Feature is an input - something that I look for.
A Label is an output - a resulting value.
You need data science skills - how to select and tune algorithms - in order to be successful in this field.
Getting started: get scikit-learn and Jupyter notebooks. Kaggle provides datasets. Feature engineering - massaging data into a useful shape - is a vital skill. Domain knowledge is key. Split your dataset into training (80%) and testing (20%) sets.
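The 80/20 split mentioned above is a one-liner in scikit-learn (the feature and label arrays here are dummies):

```python
# Split a dataset into an 80% training set and a 20% testing set.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # dummy feature column
y = np.arange(100) % 2             # dummy labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 80 20
```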
Reflection: K-means clustering seems to be an applicable unsupervised learning thing for log analytics and such - to find patterns in your data.