My takeaways from A Practical Introduction To Data Science at NDC by @markawest
Data Science is a combination of computer science, machine learning and traditional research methods from math and statistics.
It follows the scientific process: hypothesis and analysis. Question -> Data -> Exploration & analysis (about 80% of the work) -> Modelling -> Interpretation ("what insight am I learning from this model?") -> Communication (storytelling) -> Result.
A successful data science team needs a wide range of competences:
- Whereas a data scientist uses models to try to read meaning out of the data at hand, they generally have very little experience with operationalizing the model - putting it into production for use in systems.
- A data engineer deals with data integration, building data-driven platforms and operationalizing models.
- A visualization expert can be good at storytelling and provide insight.
- Then, there's a process owner that deals with project management and communication.
Artificial Intelligence is a very general topic which, in turn, contains the subset Machine Learning (automatic generation of rules). Machine Learning, in turn, has a subset called Deep Learning (neural networks).
A set of training data is pushed through an Algorithm, which renders a Model (rules).
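The data -> algorithm -> model flow can be sketched with scikit-learn (the hours-studied/passed-exam toy data below is made up for illustration):

```python
# Sketch: training data pushed through an algorithm renders a model (rules).
from sklearn.linear_model import LogisticRegression

X = [[1], [2], [3], [8], [9], [10]]  # features: hours studied
y = [0, 0, 0, 1, 1, 1]               # labels: failed (0) / passed (1)

algorithm = LogisticRegression()
model = algorithm.fit(X, y)          # the "rules" learned from the data

print(model.predict([[7]]))          # apply the model to an unseen input
```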
Learning modes
Supervised learning is where you train the model on data you have already categorized (labelled).
- Linear regression tries to draw a line through the majority of the data points. Deviations (out-of-range values) are called outliers and might signal an unfit model (not enough labels?) or an error.
- Decision trees are intuitive for software developers, since they're essentially flowcharts. The deeper the tree, the more prone it is to overfitting (i.e. not predicting a result from new data, but matching your data to a result, because the tree has effectively memorized the training data).
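The depth/overfitting trade-off can be demonstrated by comparing an unlimited-depth tree with a shallow one on the same (synthetic, made-up) dataset:

```python
# An unpruned tree memorizes the training set (perfect training score),
# which is the overfitting symptom described above; a shallow tree cannot.
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("deep tree:    train", deep.score(X_train, y_train), "test", deep.score(X_test, y_test))
print("shallow tree: train", shallow.score(X_train, y_train), "test", shallow.score(X_test, y_test))
```

A big gap between training and test accuracy for the deep tree is the tell-tale sign of overfitting.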
Unsupervised learning is where a machine clusters data by its properties (i.e. all green things in one lump and all blue in another)
- K-means clustering. Describing, not predicting. K = the number of clusters you want the algorithm to identify.
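A minimal K-means sketch: two obvious lumps of 2-D points, clustered with K=2 (the points are made up):

```python
# K-means groups points by proximity; here the two lumps end up in
# two separate clusters, without any labels being provided.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one lump
                   [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])  # another lump

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the centre of each identified cluster
```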
Reinforcement learning is a third mode: a continually learning model that learns from past mistakes.
A Feature is an input - something that I look for.
A Label is an output - a resulting value.
You need data science skills - how to select and tune algorithms - in order to be successful in this field.
Getting started: get scikit-learn and Jupyter notebooks. Kaggle provides datasets. Feature engineering - massaging data into a useful shape - is a vital skill. Domain knowledge is key. Split your dataset into training (80%) and testing (20%) sets.
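The 80/20 split mentioned above is a one-liner in scikit-learn (the feature and label arrays here are dummies):

```python
# Split a dataset into an 80% training set and a 20% testing set.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # dummy feature column
y = np.arange(100) % 2             # dummy labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 80 20
```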
Reflection: K-means clustering seems to be an applicable unsupervised learning thing for log analytics and such - to find patterns in your data.