My takeaways from A Practical Introduction To Data Science at NDC by @markawest

Data Science is a combination of computer science, machine learning, and traditional research methods from mathematics and statistics.

It follows the scientific process of hypothesis and analysis: Question -> Data -> Exploration & analysis (which is about 80% of the work) -> Modelling -> Interpretation ("what insight am I learning from this model?") -> Communication (storytelling) -> Result.

A successful data science team needs a wide range of competences:
  • A data scientist uses models to try to read meaning out of the data at hand, but generally has little experience with operationalizing a model - putting it into production for use in systems.
  • A data engineer deals with data integration, building data-driven platforms and operationalizing models.
  • A visualization expert is good at storytelling and providing insight.
  • A process owner deals with project management and communication.
Artificial Intelligence is a very general field which contains the subset Machine Learning (automatic generation of rules). Machine Learning, in turn, has a subset called Deep Learning (neural networks).

A set of training data is pushed through an Algorithm, which renders a Model (rules).
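
To make that concrete, here's a minimal sketch of the flow with scikit-learn (the library recommended further down); the tiny fruit dataset and its columns are made up for illustration:

```python
# Minimal sketch of the training data -> algorithm -> model flow in
# scikit-learn. The fruit dataset here is invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Training data: features (inputs) with known labels (outputs).
X_train = [[150, 0], [170, 0], [140, 1], [130, 1]]  # [weight in g, texture code]
y_train = ["orange", "orange", "apple", "apple"]

algorithm = DecisionTreeClassifier()     # the algorithm
model = algorithm.fit(X_train, y_train)  # training renders a model (rules)

print(model.predict([[160, 0]]))         # apply the learned rules to new input
```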

Learning modes

Supervised learning is where you train on data that is already labelled (categorized).
  • Linear regression tries to draw a line through the majority of the data points (see the first sketch after this list). Deviations (out-of-range values) are called outliers and might signal an unfit model (not enough labels?) or an error in the data.
  • Decision trees are intuitive for software developers, since they're essentially flowcharts. The deeper your tree, the more prone it is to overfitting (i.e. not predicting a result from your data, but matching your data to a result, because it has effectively memorized the training data) - see the second sketch below.
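
A minimal linear regression sketch with scikit-learn; the numbers are invented, with one deliberate outlier:

```python
# Linear regression sketch (scikit-learn); the data below is made up,
# with one deliberate outlier at x = 5.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])  # feature, e.g. years of experience
y = np.array([30, 35, 41, 44, 120])      # label, e.g. salary; 120 is the outlier

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # slope and intercept of the fitted line
print(model.predict(np.array([[6]])))  # prediction for a new input

# A point far from the fitted line (like 120 above) is an outlier: it may be
# a data error, or a sign that a straight line doesn't fit the problem.
```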
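
And a sketch of the depth/overfitting trade-off, again with scikit-learn on a synthetic dataset (all parameters are illustrative):

```python
# Deeper trees fit the training data better but can generalize worse.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, 5, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # A large gap between training and test accuracy suggests overfitting.
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
```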
Unsupervised learning is where a machine clusters data by its properties (e.g. all green things in one lump and all blue ones in another).
  • K-means clustering describes rather than predicts. K = the number of clusters you want the algorithm to identify (a minimal sketch follows below).
Reinforcement learning is a third learning mode: a continually learning model that learns from its past mistakes.
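
A minimal K-means sketch with scikit-learn; the six points are invented to form two obvious lumps:

```python
# K-means sketch: k (n_clusters) is the number of clusters you ask for.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],   # lump one
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])  # lump two

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # which cluster each point was assigned to
print(kmeans.cluster_centers_)  # the centre of each cluster
```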
A Feature is an input - something you look for in the data.
A Label is an output - the resulting value you want to predict.

You need data science skills - how to select and tune algorithms - in order to be successful in this field.
Getting started: get scikit-learn and Jupyter notebooks. Kaggle provides datasets. Feature engineering - massaging data into a useful shape - is a vital skill. Domain knowledge is key. Split your dataset into training (80%) and testing (20%) sets, as in the sketch below.
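
A minimal sketch of that 80/20 split, using scikit-learn's built-in iris dataset:

```python
# Hold back 20% of the data to test the trained model on unseen examples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 120 rows for training, 30 for testing
```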

Reflection: K-means clustering seems to be an applicable unsupervised learning technique for log analytics and similar tasks - finding patterns in your data.
