8 Algorithms to Build Machine Learning Products

Published in The Startup · 6 min read · Jun 29, 2020

Machine learning is a subfield of AI that uses statistics to solve user problems. So if you want to develop a machine learning product, you need to understand the underlying algorithms and, surprise, statistics. This article is a non-technical overview of 8 proven algorithms to use in machine learning products.

Algorithms

Generally speaking, an algorithm is a procedure in the form of program code. An algorithm is used to solve a problem using a proven procedure [1]. For illustration purposes, all algorithms [2] are explained below with examples from the online real estate market.

(Learning) Decision Trees

Decision trees are a popular method of decision making. In practice they are often used due to their good traceability. The result obtained can be checked by the user at any time since a decision tree consists of a hierarchical sequence of decision rules. In practical use, the decision tree takes values and returns a decision. Conceptually, a distinction must be made between classification trees (predict a label) and regression trees (predict a quantity).

A decision tree is useful for:

An application on the real estate market is deciding whether it is more advantageous for a private person to rent or buy a property.
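To make the idea of a "hierarchical sequence of decision rules" concrete, here is a minimal sketch of a hand-written classification tree for the rent-or-buy question. The function name, the three input values and all thresholds are my own illustrative assumptions, not rules taken from a real model. The tree also returns the path it followed, which is exactly what makes the result traceable for the user.

```python
def rent_or_buy(years_planned, equity_ratio, price_to_rent):
    """Walk a small hand-written classification tree.

    Returns the decision and the list of rules that led to it.
    All thresholds are illustrative assumptions.
    """
    path = []

    # Rule 1: short stays rarely justify the purchase costs (assumed cutoff).
    if years_planned < 5:
        path.append("planned stay < 5 years")
        return "rent", path
    path.append("planned stay >= 5 years")

    # Rule 2: too little equity makes financing expensive (assumed cutoff).
    if equity_ratio < 0.2:
        path.append("equity < 20%")
        return "rent", path
    path.append("equity >= 20%")

    # Rule 3: a high purchase price relative to annual rent favors renting.
    if price_to_rent > 25:
        path.append("price-to-rent ratio > 25")
        return "rent", path
    path.append("price-to-rent ratio <= 25")
    return "buy", path


decision, path = rent_or_buy(years_planned=10, equity_ratio=0.3, price_to_rent=20)
print(decision)
print(" -> ".join(path))
```

Because every answer comes with its path, the user can check at any time why the tree decided the way it did.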

Learning decision trees are a special form of decision tree. A decision tree is called learning if it is able to improve its efficiency: it uses training data to recognize which paths actually need to be checked. As a result, the tree becomes shallower and the calculation faster.

Random Forest

The Random Forest algorithm is one of the most popular algorithms for solving classification problems. Classifying means assigning a new data point to an existing category.

A random forest consists of a large number of independent decision trees. Each decision tree provides a class prediction. The most frequently predicted class is the overall result of the random forest. Conceptually, the algorithm distinguishes itself from other algorithms by a special feature: the final result is equivalent to a democratic majority decision, whereas the results of other algorithms tend to be viewed as individual opinions.

Random Forest is useful for:

Finding out which class (or category) an object belongs to. After each decision tree has reached a result, the Random Forest algorithm delivers the merged decision.
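The majority decision itself is easy to sketch. The three "trees" here are hypothetical hand-written rules with made-up thresholds and class names; in a real random forest each tree would be learned from a random sample of the training data. The sketch only shows the voting step, not the training.

```python
from collections import Counter

# Three hypothetical stand-in "trees". In a real random forest each tree
# would be learned from a random sample of the training data.
def tree_price(apartment):
    return "luxury" if apartment["price_per_m2"] > 6000 else "standard"

def tree_size(apartment):
    return "luxury" if apartment["living_space"] > 120 else "standard"

def tree_age(apartment):
    return "luxury" if apartment["year_built"] >= 2010 else "standard"


def forest_predict(trees, sample):
    """Collect one vote per tree and return the most frequent class."""
    votes = Counter(tree(sample) for tree in trees)
    winner, _ = votes.most_common(1)[0]
    return winner, dict(votes)


# Made-up apartment used only for illustration.
apartment = {"price_per_m2": 7000, "living_space": 60, "year_built": 2015}
label, votes = forest_predict([tree_price, tree_size, tree_age], apartment)
print(label, votes)
```

Two of the three trees vote for the same class here, so that class wins even though one tree disagrees.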

Linear Regression

Linear regression is often not the most precise algorithm, but it is a good introduction to machine learning. Linear regression can be used to check whether there is a linear relationship (a straight line) between variables. In the calculation, the procedure is always to explain an observed (dependent) variable with the help of independent variables.

Linear regression is useful for:

For example, on the real estate market I would like to know the relationship between living space (independent variable) and purchase price (dependent variable). My goal is to estimate the price of a 100 m² apartment. I do this by fitting a straight line to known sales and reading off the price at 100 m².
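As a rough sketch of that calculation, the following snippet fits a straight line with ordinary least squares and uses it to estimate the price of a 100 m² apartment. The four data points are made up for illustration, and the function name is my own.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable.

    Returns (slope, intercept) of the line y = slope * x + intercept.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: living space in m² and purchase price in euros.
living_space = [50, 70, 90, 120]
purchase_price = [210_000, 270_000, 365_000, 475_000]

slope, intercept = fit_line(living_space, purchase_price)
estimate = slope * 100 + intercept
print(f"price per additional m²: {slope:.0f}")
print(f"estimated price for 100 m²: {estimate:.0f}")
```

The slope can be read as the price of one additional square meter, which is often the number a product team is actually interested in.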

In reality, of course, other variables affect the purchase price in addition to the living space. But despite the limited data set, the (simple) linear regression already provides a useful approximation. A more realistic prediction is possible with multiple linear regression. The purchase price (dependent variable) can then be explained by several independent variables (living space, city, year of construction).

Logistic Regression

The goal of logistic regression is to determine a probability. A typical application example on the real estate market is the creditworthiness of the buyer and the associated probability of default. As in linear regression, the basic procedure is to explain a dependent variable (credit score or probability of default) with a number of independent variables (socio-demographic characteristics, payment history, existing loans). If the logistic regression returns a value of 0.651, this means, in the default example, that the borrower will fail to repay the property loan with a probability of about 65%. In addition, it is common practice to map the result of the logistic regression into defined rating classes (classification).
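Both steps, turning the independent variables into a probability and mapping that probability to a rating class, can be sketched in a few lines. The weights, the bias, the feature values and the class boundaries are all hypothetical; in a real product the weights would be learned from historical loan data.

```python
import math

def default_probability(features, weights, bias):
    """Logistic function applied to a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def rating_class(probability):
    """Map a probability of default to a rating class (assumed boundaries)."""
    if probability < 0.1:
        return "A"
    if probability < 0.3:
        return "B"
    if probability < 0.6:
        return "C"
    return "D"


# Hypothetical, already scaled features of one applicant, for example
# existing loans, payment history and age group.
applicant = [1.0, 0.5, 2.0]
weights = [0.8, -1.2, 0.5]   # assumed values, normally learned from data
bias = -0.3                  # assumed value

p = default_probability(applicant, weights, bias)
print(f"probability of default: {p:.3f}, rating class: {rating_class(p)}")
```

Whatever the weighted sum turns out to be, the logistic function squeezes it into a value between 0 and 1, which is why the result can be read as a probability.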

K-Nearest Neighbor (KNN)

KNN is a machine learning algorithm for solving classification problems. It assigns a new data point to a category. The algorithm works on the assumption that similar objects are close to each other (nearest neighbors). A new object is assigned to the class that is most common among its k nearest neighbors.

The k-nearest neighbor algorithm is useful for:

KNN is suitable for identifying similar objects. In the real estate environment, an object is e.g. an apartment. With the k-nearest neighbor algorithm it is possible to identify similar apartments and display them to the user in the form of item recommendations:

Each data point in the diagram corresponds to an apartment. The classification took place, for example, with the help of the real estate metadata “year of construction”, “living space”, “zip code” and “price per square meter”. Now the goal is to classify a newly advertised apartment: in the diagram, the variable k has the value 3 (the number of nearest neighbors). These 3 nearest neighbors are recommended to the user as item recommendations, since they are the most similar apartments in the database.
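Finding the 3 nearest neighbors of a new apartment can be sketched as follows. The apartments and their values are made up, and I assume only two features (living space in m² and price per m² in thousands of euros) that are already on roughly comparable scales. In practice the features would be normalized first so that no single one dominates the distance.

```python
import math

def k_nearest(query, items, k=3):
    """Return the ids of the k items closest to the query (Euclidean distance)."""
    ranked = sorted(items, key=lambda item: math.dist(query, item[1]))
    return [item_id for item_id, _ in ranked[:k]]


# Made-up database: (id, (living space in m², price per m² in thousand euros)).
apartments = [
    ("A", (50, 4.0)),
    ("B", (55, 4.2)),
    ("C", (120, 6.5)),
    ("D", (60, 3.9)),
    ("E", (200, 9.0)),
]

new_apartment = (58, 4.1)
recommendations = k_nearest(new_apartment, apartments, k=3)
print(recommendations)
```

The returned ids are ordered from most to least similar, so they can be shown to the user directly as item recommendations.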

Naive Bayes Classifier

The Naive Bayes classifier is a supervised machine learning algorithm. It is based on Bayes’ theorem, hence the name. It is called naive because it assumes that the measured variables (features) are independent of each other within a class. Although this assumption is rarely true in practice, the Naive Bayes classifier often returns good results.

The algorithm assigns an object to the class that it most likely belongs to. In order to carry out the classification, Naive Bayes has to be trained with a dataset, so the algorithm learns which class assignment will be expected.

The Naive Bayes algorithm is useful for:

Typically, spam detection and text classification problems. In online real estate portals, the Naive Bayes classifier can be used to identify incorrectly categorized apartments. If an apartment was advertised in the Frankfurt city area but the advertisement text points to Wiesbaden, the advertisement can be marked for revision by the advertiser.
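A minimal sketch of this check, assuming a simple word-count (multinomial) Naive Bayes with add-one smoothing: the classifier is trained on a handful of made-up advertisement texts and then predicts the city from the text of a new advertisement. If the prediction differs from the listed city, the advertisement is flagged. The function names and the training texts are my own illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count classes and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(text, model):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior of the class
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the class.
            likelihood = (word_counts[label][word] + 1) / (n_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training advertisements.
model = train([
    ("bright flat near frankfurt main station", "Frankfurt"),
    ("apartment in frankfurt westend with balcony", "Frankfurt"),
    ("quiet flat close to wiesbaden kurhaus", "Wiesbaden"),
    ("apartment in wiesbaden city centre", "Wiesbaden"),
])

listed_city = "Frankfurt"
predicted_city = predict("renovated flat near wiesbaden kurhaus", model)
needs_revision = predicted_city != listed_city
print(predicted_city, needs_revision)
```

The "naive" part is the inner loop: every word contributes to the score on its own, as if the words in the text had nothing to do with each other.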

Support Vector Machine (SVM)

The Support Vector Machine is commonly used to classify objects. SVM counts as a supervised learning algorithm: the class or grouping is specified by the user. The SVM algorithm has a special characteristic: it creates class boundaries with the widest possible margin between the existing data points of both classes, which is why it is also called a large margin classifier. By maximizing the distance, new data points can be classified with a higher probability of success.

Support Vector Machine is useful for:

Pattern, image or text recognition.
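The large margin idea is easiest to see in one dimension. As a deliberately simplified sketch (not a full SVM), assume two classes of apartments that can be separated cleanly by price per m² alone. The boundary with the widest margin is then the midpoint between the two closest points of the two classes. The function name, class names and prices are hypothetical.

```python
def max_margin_threshold(low_class, high_class):
    """One-dimensional large margin boundary for two separable classes.

    The boundary is the midpoint between the closest points of the two
    classes. These two points play the role of the support vectors.
    """
    upper_of_low = max(low_class)
    lower_of_high = min(high_class)
    if upper_of_low >= lower_of_high:
        raise ValueError("classes overlap, no clean separating boundary exists")
    threshold = (upper_of_low + lower_of_high) / 2
    margin = (lower_of_high - upper_of_low) / 2
    return threshold, margin


def classify(price_per_m2, threshold):
    return "luxury" if price_per_m2 > threshold else "standard"


# Made-up prices per m² in euros.
standard = [3000, 3500, 4200]
luxury = [6000, 7500, 9000]

threshold, margin = max_margin_threshold(standard, luxury)
print(threshold, margin, classify(5500, threshold))
```

Any threshold between the two closest points would separate the training data, but the midpoint leaves the most room on both sides for new apartments that look slightly different.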

k-means algorithm

k-means is an unsupervised machine learning algorithm for solving clustering problems. Unsupervised means that it works without human supervision. The algorithm learns by independently recognizing patterns in data. The goal of the k-means algorithm is to form clusters from a set of similar data points. The main human input is the number k of expected clusters (hence the name k-means).

The k-means algorithm is useful for:

K-means can be used to answer the question of whether objects form groups in terms of their characteristics. A typical application is customer clustering, including customer segmentation.

The k-means algorithm helps to answer the question of whether and which customer segments exist (not to assign customers to a segment).
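Here is a minimal sketch of how such segments can be discovered. It alternates between assigning each customer to the nearest cluster center and moving each center to the mean of its customers. The customer data (age, monthly budget in thousand euros) is made up, and as a simplification I use the first k points as starting centers, whereas real implementations usually pick them randomly.

```python
import math

def k_means(points, k, iterations=10):
    """Basic k-means: returns the cluster centers and the clusters."""
    centroids = list(points[:k])  # simplification: first k points as start
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(values) / len(cluster) for values in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's center

        if new_centroids == centroids:
            break  # nothing moved, the clusters are stable
        centroids = new_centroids
    return centroids, clusters


# Made-up customers: (age, monthly housing budget in thousand euros).
customers = [(25, 1.0), (27, 1.2), (26, 0.9), (55, 3.0), (58, 3.2), (60, 2.8)]

centroids, clusters = k_means(customers, k=2)
for center, members in zip(centroids, clusters):
    print(center, members)
```

Note that the only thing I told the algorithm is k=2. Which customers end up together, and what the segments look like, comes out of the data.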

For further information and calculation examples, I recommend the Medium publication Towards Data Science.

References:

[1] Stuart Russell and Peter Norvig, “Artificial Intelligence: A Modern Approach”, Fourth Edition (2020)
[2] Giuseppe Bonaccorso, “Machine Learning Algorithms: A reference guide to popular algorithms for data science and machine learning”, First Kindle Edition (2017)