Data Science
What is Data Science?
UNDER CONSTRUCTION
What isn't included in this area
A lot of very useful data science packages can be used without any knowledge of the underlying algorithms. Many
of the technical aspects of a machine learning model are hidden away behind an app, user interface, or
API endpoint. These solutions are typically referred to as machine learning as a service (MLaaS), and you may
not even be aware when one is being used: for instance, when a website recommends another product to buy,
when you order a taxi through an app, or when you use the latest filter on Instagram. Building and implementing
these systems requires detailed knowledge not only of machine learning algorithms but, crucially, of the context
of the problem. The hope is that, after digesting the information within these topics, you will have the
necessary skills and knowledge to start solving real-world problems in your domain with some machine
learning algorithms. To that end, we will focus heavily on the underlying algorithms and not on the MLaaS
offerings that are available, although we would always recommend checking whether a solution to your problem
already exists before embarking on producing your own.
A lot of data science techniques necessitate the use of a programming language, typically Python, SQL, or
R. However, we won't be covering these languages here. We assume a knowledge of Python and the relevant data
science packages that might be used in Python: pandas, TensorFlow, and scikit-learn. We also assume that you
have an understanding of the Matplotlib and seaborn libraries for visualization, as well as the standard
Python library and NumPy. Most of the practical examples given in these topics will be written in Python.
It is also assumed that you have the requisite mathematical and statistical knowledge to understand when and
why to apply certain statistical tests, and the validity of the assumptions used.
01
Basic data science solutions in python
In this section we'll go through some basic examples of implementing data science solutions via fully
connected neural networks and an ensemble method (random forest). These will be written in Python using
the TensorFlow and scikit-learn packages. The idea here is not to provide a detailed introduction to what
these algorithms do, but to give a quick introduction to implementing these solutions in
common packages. We'll go into a detailed description of these algorithms in later sections.
- Basic classification neural network
- Basic regression neural network
- Basic random forest
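As a taste of what this section covers, here is a minimal sketch of the random forest example using scikit-learn. The dataset (the built-in iris set) and the hyper-parameter values are illustrative only, not the exact code used later in the section.

```python
# Minimal random forest classification sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 100 trees is the scikit-learn default; shown explicitly for clarity.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # mean accuracy on the held-out set
```

The same fit/score pattern carries over to the neural network examples, with the model class swapped out.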
02
Training concepts
The aim of this section is to provide background in important concepts that are encountered when training
most, but not all, machine learning algorithms.
- Loss and accuracy
- Optimization algorithms
- Dataset splitting and validation of supervised models
- Hyper-parameter optimization
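The dataset-splitting idea can be sketched in a few lines, assuming scikit-learn; the 60/20/20 ratios and toy data are illustrative. The validation set is used for hyper-parameter tuning, and the test set is touched only for the final evaluation.

```python
# Sketch of a train/validation/test split with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 toy samples, 2 features each
y = np.arange(50)

# First carve off 20% as a held-out test set...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ...then split the remainder into training (60%) and validation (20%) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0
)
```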
03
Preprocessing
Typically your dataset won't be perfect. It will require some amending, fixing, and finessing to get it into
a format and structure that you can use in your models and that will help you optimize the performance of
those models.
- Normalization
- Encoding for categorical variables
- Embeddings
- Feature engineering
- Text Preprocessing
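Two of the steps above, normalization and categorical encoding, can be sketched with scikit-learn as follows; the toy data is illustrative only.

```python
# Sketch of z-score normalization and one-hot encoding with scikit-learn.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Normalization: rescale a numeric feature to mean 0 and unit variance.
numeric = np.array([[1.0], [2.0], [3.0], [4.0]])
numeric_scaled = StandardScaler().fit_transform(numeric)

# Encoding: turn a categorical feature into one binary column per category.
categories = np.array([["red"], ["green"], ["red"], ["blue"]])
one_hot = OneHotEncoder().fit_transform(categories).toarray()
```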
04
Classification methods
In this section we'll go through some common data science techniques for modelling data which has one or
multiple target categories/classes/labels associated with each entry. A very popular approach to solving
these kinds of problems is to use neural networks. However, neural networks are such a broad and important
field that they deserve their own section, see further down the page.
- Logistic regression
- Decision Trees
- Support Vector Machines (SVM)
- Naive Bayes
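The first method on the list, logistic regression, follows the same fit/predict pattern as the other classifiers covered here. A minimal sketch with scikit-learn, on the built-in breast cancer dataset (chosen only as a convenient binary example):

```python
# Minimal logistic regression sketch with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter is raised because the unscaled features slow convergence.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)  # one probability per class
```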
05
Regression methods
In this section we'll go through some common data science techniques for modelling continuous target
variables, a task typically called regression.
A number of algorithms offer both classification and regression capabilities. However, some algorithms only
approximate regression because they still output a discrete set of values; in decision trees, for example,
the output values are determined by the end nodes of the tree (see the decision tree section for more
details). These algorithms won't be covered here, and the list of methods below is non-exhaustive.
Neural networks can also be used for regression, for example by omitting the activation function on the
final output layer of a fully connected network (see the neural network section for more details).
- Linear and polynomial regression
- Support Vector Machines (SVM) for regression
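A minimal linear regression sketch with scikit-learn; the synthetic data has a known slope and intercept, so the fitted coefficients can be checked against the ground truth.

```python
# Sketch of ordinary least-squares linear regression with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate noisy samples from y = 3x + 2.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)

model = LinearRegression()
model.fit(X, y)
slope, intercept = model.coef_[0], model.intercept_  # close to 3 and 2
```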
06
Clustering methods
- K-nearest neighbours
- Self Organizing Maps
- DBSCAN
- t-SNE
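As a quick taste of one of the methods listed above, here is a DBSCAN sketch with scikit-learn. Two well-separated synthetic blobs are generated, so the algorithm should recover two clusters; the `eps` and `min_samples` values are illustrative only.

```python
# Sketch of density-based clustering with DBSCAN in scikit-learn.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two tight, well-separated blobs of points.
X, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [5, 5]], cluster_std=0.5, random_state=0
)

# Points within eps of enough neighbours form a cluster; -1 marks noise.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
n_clusters = len(set(labels) - {-1})
```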
07
Neural networks
- Basics of neural networks
- Introduction to convolutional neural networks