Data Science

What is Data Science?

UNDER CONSTRUCTION

What isn't included in this area

A lot of very useful data science packages can be used without any knowledge of the underlying algorithms. Much of the technical detail of a machine learning model is hidden behind an app, user interface, or API endpoint. These solutions are typically referred to as machine learning as a service (MLaaS), and you may not even be aware when one is being used: for instance, when a website recommends another product to buy, when you order a taxi through an app, or when you apply the latest filter on Instagram. Building and implementing these systems requires detailed knowledge not only of machine learning algorithms but, crucially, of the context of the problem. The hope is that, after digesting the information in these topics, you will have the skills and knowledge needed to start solving real-world problems in your domain with machine learning algorithms. To that end, we will focus heavily on the underlying algorithms rather than on the MLaaS offerings that are available, although we would always recommend checking whether a solution to your problem already exists before embarking on producing your own.

A lot of data science techniques necessitate the use of a programming language, typically Python, SQL, or R. However, we won't be covering these languages here. We assume a knowledge of Python and the relevant data science packages that might be used with it: pandas, TensorFlow, and scikit-learn. We also assume that you have an understanding of the matplotlib and seaborn libraries for visualization, as well as the Python standard library and NumPy. Most of the practical examples given in these topics will be written in Python.

It is also assumed that you have the requisite mathematical and statistical knowledge to understand when and why to apply certain statistical tests, and to judge the validity of the assumptions they rely on.

01. Basic data science solutions in Python
In this section we'll go through some basic examples of implementing data science solutions with fully connected neural networks and an ensemble method (random forest). These will be written in Python using the TensorFlow and scikit-learn packages. The idea here is not to provide a detailed introduction to what these algorithms do, but to give a quick introduction to implementing them with common packages. We'll describe these algorithms in detail in later sections.

  1. Basic classification neural network
  2. Basic regression neural network
  3. Basic random forest
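As a taste of what this section covers, here is a minimal sketch of a random forest classifier in scikit-learn. The synthetic dataset and parameter values are illustrative choices, not the section's own example:

```python
# Minimal random forest classifier with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Generate a toy binary classification dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit an ensemble of 100 decision trees and evaluate on held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The same fit/score pattern applies to most scikit-learn estimators, which is why the section can move between algorithms with little boilerplate.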
02. Training concepts
The aim of this section is to provide background on important concepts encountered when training most, but not all, machine learning algorithms.

  1. Loss and accuracy
  2. Optimization algorithms
  3. Dataset splitting and validation of supervised models
  4. Hyper-parameter optimization
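To make the dataset-splitting idea concrete, here is a minimal sketch of a train/validation/test split with scikit-learn; the 60/20/20 ratios are an illustrative choice:

```python
# Split a dataset into train / validation / test sets for supervised learning.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # toy features: 50 samples, 2 features
y = np.arange(50) % 2               # toy binary labels

# First carve off a held-out test set, then split the remainder into
# training and validation sets (60% / 20% / 20% overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```

The validation set is used for model selection and hyper-parameter tuning; the test set is touched only once, for the final performance estimate.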
03. Preprocessing
Typically your dataset won't be perfect. It will require some amending, fixing, and finessing to get it into a format and structure that you can use in your models and that will help you optimize their performance.

  1. Normalization
  2. Encoding for categorical variables
  3. Embeddings
  4. Feature engineering
  5. Text preprocessing
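Two of the most common preprocessing steps, normalization and categorical encoding, can be sketched with scikit-learn as follows (the tiny arrays here are illustrative):

```python
# Standardize a numeric feature and one-hot encode a categorical one.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = np.array([[1.0], [2.0], [3.0], [4.0]])
categories = np.array([["red"], ["blue"], ["red"], ["green"]])

# Standardization rescales to zero mean and unit variance.
scaled = StandardScaler().fit_transform(numeric)

# One-hot encoding creates one binary column per category.
encoded = OneHotEncoder().fit_transform(categories).toarray()

print(scaled.ravel())
print(encoded)
```

Fitting the scaler and encoder on the training set only, then applying them to validation and test data, avoids leaking information between the splits.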
04. Classification methods
In this section we'll go through some common data science techniques for modelling data that has one or more target categories/classes/labels associated with each entry. A very popular approach to these problems is to use neural networks; however, neural networks are such a broad and important field that they deserve their own section, further down the page.

  1. Logistic regression
  2. Decision Trees
  3. Support Vector Machines (SVM)
  4. Naive Bayes
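As a preview of the first method in the list, here is a minimal logistic regression sketch on the classic Iris dataset; the split and solver settings are illustrative defaults:

```python
# Logistic regression classifier with scikit-learn on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter is raised so the solver converges on this multi-class problem.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```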
05. Regression methods
In this section we'll go through some common data science techniques for modelling continuous target variables, a task typically called regression.
A number of algorithms offer both classification and regression capabilities. However, some only approximate regression because they still produce a discrete set of output values; decision trees, for example, output values determined by the leaf nodes of the tree (see the decision tree section for more details). Those algorithms won't be covered here, and this is a non-exhaustive list of methods.
Neural networks can also be used for regression by, for example, omitting the activation function on the final output layer of a fully connected network (see the neural network section for more details).

  1. Linear and polynomial regression
  2. Support Vector Machines (SVM) for regression
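A minimal linear regression sketch shows the idea: fit a line to noisy data and recover the underlying slope and intercept (the synthetic data and its true parameters are illustrative):

```python
# Ordinary least-squares linear regression on a noisy straight line.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
# True relationship: y = 3x + 2, plus Gaussian noise.
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.5, size=100)

reg = LinearRegression().fit(X, y)
print(reg.coef_[0], reg.intercept_)  # close to the true slope 3 and intercept 2
```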
06. Clustering methods
  1. K-nearest neighbours
  2. Self Organizing Maps
  3. DBSCAN
  4. t-SNE
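As a taste of density-based clustering, here is a minimal DBSCAN sketch on two well-separated synthetic blobs; the `eps` and `min_samples` values are illustrative and would need tuning on real data:

```python
# Density-based clustering with DBSCAN on two well-separated blobs.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two Gaussian blobs, far enough apart that density-based
# clustering should separate them cleanly.
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]],
                  cluster_std=0.5, random_state=0)

# Points with at least min_samples neighbours within eps form clusters;
# isolated points are labelled -1 (noise).
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(sorted(set(labels)))
```

Unlike K-means, DBSCAN does not need the number of clusters in advance, which is why it appears alongside the other unsupervised methods in this section.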
07. Neural networks
  1. Basics of neural networks
  2. Introduction to convolutional neural networks
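Since this section focuses on the underlying algorithm rather than a framework, here is a minimal NumPy-only sketch of a single-hidden-layer network trained by gradient descent. The XOR task, network size, and learning rate are illustrative choices:

```python
# A single-hidden-layer neural network trained with plain gradient
# descent, written in NumPy only to expose the underlying algorithm.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: the classic example that a purely linear model cannot fit.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Weights and biases for a 2 -> 4 -> 1 network.
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)

lr = 1.0
for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass (backpropagation) for squared-error loss.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent updates.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(pred.ravel())  # should approach the targets 0, 1, 1, 0
```

Frameworks such as TensorFlow automate exactly these forward and backward passes; writing them out once makes the later sections' abstractions easier to follow.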
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.