Data Science

What is Data Science?

UNDER CONSTRUCTION

What isn't included in this area

A lot of very useful data science packages can be used without any knowledge of the underlying algorithms. Much of the technical detail of a machine learning model is hidden behind an app, user interface, or API endpoint. These solutions are typically referred to as machine learning as a service (MLaaS), and you may not even be aware when one is being used: for instance, when a website recommends another product to buy, when you order a taxi through an app, or when you apply the latest filter on Instagram. Building and implementing these systems requires detailed knowledge not only of machine learning algorithms but, crucially, of the context of the problem. The hope is that, after digesting the information in these topics, you will have the skills and knowledge needed to start solving real-world problems in your domain with machine learning algorithms. To that end, we will focus heavily on the underlying algorithms rather than on the MLaaS offerings that are available, although we would always recommend checking whether a solution to your problem already exists before embarking on producing your own.

A lot of data science techniques necessitate the use of a programming language, typically Python, SQL, or R. However, we won't be covering these languages here. We assume a knowledge of Python and the relevant data science packages that might be used with it: pandas, TensorFlow, and scikit-learn. We also assume that you have an understanding of the matplotlib and seaborn libraries for visualization, as well as the Python standard library and NumPy. Most of the practical examples given in these topics will be written in Python.

It is also assumed that you have the requisite mathematical and statistical knowledge to understand when and why to apply certain statistical tests, and to judge the validity of the assumptions they rely on.

01. Basic data science solutions in Python
In this section we'll go through some basic examples of implementing data science solutions with fully connected neural networks and an ensemble method (random forest). These will be written in Python using the TensorFlow and scikit-learn packages. The idea here is not to provide a detailed introduction to what these algorithms do, but to give a quick introduction to implementing them with common packages. We'll describe these algorithms in detail in later sections.

  1. Basic classification neural network
  2. Basic regression neural network
  3. Basic random forest
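As a taste of what this section covers, here is a minimal sketch of a random forest classifier in scikit-learn. The synthetic dataset and parameter values are illustrative choices, not the section's own example:

```python
# Minimal random forest classifier with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Generate a toy binary classification dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit an ensemble of 100 decision trees and evaluate on held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The same fit/score pattern applies to most scikit-learn estimators, which is why the section can move between algorithms with little boilerplate.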
02. Training concepts
The aim of this section is to provide background on important concepts encountered when training most, but not all, machine learning algorithms.

  1. Loss and accuracy
  2. Optimization algorithms
  3. Dataset splitting and validation of supervised models
  4. Hyper-parameter optimization
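To make the dataset-splitting idea concrete, here is a minimal sketch of a train/validation/test split with scikit-learn; the 60/20/20 ratios are an illustrative choice:

```python
# Split a dataset into train / validation / test sets for supervised learning.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # toy features: 50 samples, 2 features
y = np.arange(50) % 2               # toy binary labels

# First carve off a held-out test set, then split the remainder into
# training and validation sets (60% / 20% / 20% overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```

The validation set is used for model selection and hyper-parameter tuning; the test set is touched only once, for the final performance estimate.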
03. Preprocessing
Typically your dataset won't be perfect. It will require some amending, fixing, and finessing to get it into a format and structure that you can use in your models and that will help you optimize their performance.

  1. Normalization
  2. Encoding for categorical variables
  3. Embeddings
  4. Feature engineering
  5. Text preprocessing
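Two of the most common preprocessing steps, normalization and categorical encoding, can be sketched with scikit-learn as follows (the tiny arrays here are illustrative):

```python
# Standardize a numeric feature and one-hot encode a categorical one.
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = np.array([[1.0], [2.0], [3.0], [4.0]])
categories = np.array([["red"], ["blue"], ["red"], ["green"]])

# Standardization rescales to zero mean and unit variance.
scaled = StandardScaler().fit_transform(numeric)

# One-hot encoding creates one binary column per category.
encoded = OneHotEncoder().fit_transform(categories).toarray()

print(scaled.ravel())
print(encoded)
```

Fitting the scaler and encoder on the training set only, then applying them to validation and test data, avoids leaking information between the splits.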
04. Classification methods
In this section we'll go through some common data science techniques for modelling data that has one or more target categories/classes/labels associated with each entry. A very popular approach to these problems is to use neural networks; however, neural networks are such a broad and important field that they deserve their own section, further down the page.

  1. Logistic regression
  2. Decision Trees
  3. Support Vector Machines (SVM)
  4. Naive Bayes
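As a preview of the first method in the list, here is a minimal logistic regression sketch on the classic Iris dataset; the split and solver settings are illustrative defaults:

```python
# Logistic regression classifier with scikit-learn on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter is raised so the solver converges on this multi-class problem.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```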
05. Regression methods
In this section we'll go through some common data science techniques for modelling continuous target variables, a task typically called regression.
A number of algorithms offer both classification and regression capabilities. However, some only approximate regression because they still produce a discrete set of output values; decision trees, for example, output values determined by the leaf nodes of the tree (see the decision tree section for more details). Those algorithms won't be covered here, and this is a non-exhaustive list of methods.
Neural networks can also be used for regression by, for example, omitting the activation function on the final output layer of a fully connected network (see the neural network section for more details).

  1. Linear and polynomial regression
  2. Support Vector Machines (SVM) for regression
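A minimal linear regression sketch shows the idea: fit a line to noisy data and recover the underlying slope and intercept (the synthetic data and its true parameters are illustrative):

```python
# Ordinary least-squares linear regression on a noisy straight line.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
# True relationship: y = 3x + 2, plus Gaussian noise.
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.5, size=100)

reg = LinearRegression().fit(X, y)
print(reg.coef_[0], reg.intercept_)  # close to the true slope 3 and intercept 2
```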
06. Clustering methods
  1. K-nearest neighbours
  2. Self Organizing Maps
  3. DBSCAN
  4. t-SNE
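As a taste of density-based clustering, here is a minimal DBSCAN sketch on two well-separated synthetic blobs; the `eps` and `min_samples` values are illustrative and would need tuning on real data:

```python
# Density-based clustering with DBSCAN on two well-separated blobs.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two Gaussian blobs, far enough apart that density-based
# clustering should separate them cleanly.
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]],
                  cluster_std=0.5, random_state=0)

# Points with at least min_samples neighbours within eps form clusters;
# isolated points are labelled -1 (noise).
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(sorted(set(labels)))
```

Unlike K-means, DBSCAN does not need the number of clusters in advance, which is why it appears alongside the other unsupervised methods in this section.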
07. Neural networks
  1. Basics of neural networks
  2. Introduction to convolutional neural networks
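Since this section focuses on the underlying algorithm rather than a framework, here is a minimal NumPy-only sketch of a single-hidden-layer network trained by gradient descent. The XOR task, network size, and learning rate are illustrative choices:

```python
# A single-hidden-layer neural network trained with plain gradient
# descent, written in NumPy only to expose the underlying algorithm.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: the classic example that a purely linear model cannot fit.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Weights and biases for a 2 -> 4 -> 1 network.
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)

lr = 1.0
for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass (backpropagation) for squared-error loss.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent updates.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

pred = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(pred.ravel())  # should approach the targets 0, 1, 1, 0
```

Frameworks such as TensorFlow automate exactly these forward and backward passes; writing them out once makes the later sections' abstractions easier to follow.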
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.