Data Science

What is Data Science?

UNDER CONSTRUCTION

What isn't included in this area

Many very useful data science packages can be used without any knowledge of the underlying algorithms. Much of the technical detail of a machine learning model is hidden behind an app, user interface, or API endpoint. These solutions are typically referred to as machine learning as a service (MLaaS), and you may not even be aware when one is being used: for instance, when a website recommends another product to buy, when you order a taxi through an app, or when you use the latest filter on Instagram. Building and implementing these systems requires detailed knowledge not only of machine learning algorithms but, crucially, of the context of the problem. The hope is that, after digesting the information within these topics, you will have the skills and knowledge needed to start solving real-world problems within your domain using machine learning algorithms. To that end, we will focus heavily on the underlying algorithms and not on the available MLaaS offerings. That said, we would always recommend checking whether a solution to your problem already exists before embarking on producing your own.

Many data science techniques require the use of a programming language, typically Python, SQL, or R. However, we won't be covering these languages here. We assume a knowledge of Python and the relevant data science packages that might be used with it: pandas, TensorFlow, and scikit-learn. We also assume that you have an understanding of the matplotlib and seaborn libraries for visualization, as well as the Python standard library and NumPy. Most of the practical examples given in these topics will be written in Python.

It is also assumed that you have the requisite mathematical and statistical knowledge to understand when and why to apply certain statistical tests, and to judge the validity of the assumptions they rely on.

01
Basic data science solutions in Python
In this section we'll go through some basic examples of implementing data science solutions via fully connected neural networks and an ensemble method (Random Forest). These will be written in Python using the TensorFlow and scikit-learn packages. The idea here is not to provide a detailed introduction to what these algorithms do, but to give a quick introduction to how to implement these solutions in common packages. We'll go into a detailed description of these algorithms in later sections.

  1. Basic classification neural network
  2. Basic regression neural networks
  3. Basic random forest
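
To give a flavour of what "implementing a solution in a common package" looks like, here is a minimal sketch of a random forest classifier in scikit-learn. The choice of the built-in iris dataset, the 75/25 split, the number of trees, and the variable names are all assumptions made for illustration and are not taken from the topics themselves.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative dataset (an assumption): the iris dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# An ensemble of 100 decision trees; random_state fixes the seed for repeatability.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict the held-out labels and compute the fraction predicted correctly.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

The same fit, predict, and score pattern applies to most scikit-learn estimators, which is why it is a convenient starting point before looking at how the algorithm works internally.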
02
Useful and foundational concepts
The aim of this section is to provide some basic background on important foundational concepts that you will encounter in almost any data science solution.

  1. Loss and accuracy
  2. Optimization algorithms
  3. Dataset splitting and validation of supervised models
03
Preprocessing
Typically your dataset won't be perfect. It will require some amending, some fixing, and some finessing to get it into a format and structure that you can use in your models and will help you optimize the performance of those models.

  1. Normalization
  2. Encoding for categorical variables
  3. Embeddings
  4. Feature engineering
  5. Text Preprocessing
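
As a rough illustration of two of the steps listed above, the sketch below applies min-max normalization to a numeric feature and one-hot encoding to a categorical feature using NumPy. The toy data and the helper names `min_max_normalize` and `one_hot_encode` are hypothetical, and min-max scaling is only one of several possible normalization schemes.

```python
import numpy as np

# Hypothetical toy data: one numeric feature and one categorical feature.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
colours = np.array(["red", "green", "blue", "green", "red"])


def min_max_normalize(values):
    """Rescale values linearly so that they lie in the range [0, 1]."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return np.zeros_like(values, dtype=float)
    return (values - lo) / (hi - lo)


def one_hot_encode(labels):
    """Return the sorted unique categories and a one-hot matrix of the labels."""
    categories, indices = np.unique(labels, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for category i.
    one_hot = np.eye(len(categories))[indices]
    return categories, one_hot


normalized_heights = min_max_normalize(heights_cm)
categories, encoded_colours = one_hot_encode(colours)

print(normalized_heights)
print(categories)
print(encoded_colours)
```

In practice you would usually compute the minimum and maximum on the training split only and reuse them for validation and test data, so that no information leaks from the held-out sets.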
04
Common supervised data science methods
In this section we'll go through some common data science techniques for modelling data, either for regression or classification purposes. Unsupervised learning and neural networks are large enough topics to be covered in their own sections.

  1. Linear, logistic, and polynomial regression
  2. Decision Trees
  3. Support Vector Machines (SVM)
  4. Naive Bayes
05
Common unsupervised data science methods
  1. K-nearest neighbours
  2. Self Organizing Maps
  3. DBSCAN
  4. t-SNE
06
Basic building blocks of neural network models
  1. Basics of Neural networks
  2. Hyper-parameter optimization
07
Convolutional neural networks
  1. Introduction to convolutional neural networks
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.