# What is Machine Learning?
Machine Learning is the study of software algorithms/models for performing pattern recognition tasks without explicit instructions on how to perform the recognition.

It can be either supervised or unsupervised. Supervised learning requires a pre-classified (labelled) training set to create a model, which can then be evaluated on a testing set. Unsupervised learning does not need to know the labels of the instances in its training set; at most, as with some clustering algorithms, it needs to know how many classes (clusters) are contained in the set. It can then recognise the patterns in the data and organise it into clusters. The testing set is used to assess how well the clustering model performs on previously unseen data. Examples of supervised machine learning are Support Vector Machines (SVMs) and K-Nearest Neighbours (KNN).
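The difference between the two settings can be sketched in a few lines. The snippet below is only an illustration: the six data points are made up, and k-means is used as one possible clustering algorithm (it is not named in the text above). The supervised model is given the labels, while the clustering model sees only the features and the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up for illustration): two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, used only by the supervised model

# Supervised: the classifier is fitted on the features AND their labels.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
knn_pred = knn.predict(np.array([[0.1, 0.1], [5.1, 5.1]]))

# Unsupervised: k-means sees only the features and the number of clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print("KNN predictions:", knn_pred)
print("k-means cluster ids:", cluster_ids)
```

The cluster ids returned by k-means are arbitrary (cluster 0 is not "class 0"); only the grouping of the points is meaningful.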

# What is Cross Validation?
Although it is important for models to be trained and tested on different datasets, sometimes datasets do not come with separate training and test subsets. A process known as cross-validation is normally used in such cases to artificially create training and test subsets from the dataset. Cross-validation can be performed in an exhaustive or a non-exhaustive manner.

An example of exhaustive cross-validation is leave-p-out cross-validation, which involves using p observations as the validation set and the remaining observations as the training set. This is repeated for all possible ways of cutting the original sample (of n elements) into a validation set (of p observations) and a training set (of n − p elements). In leave-one-out cross-validation, p is set to 1.

Non-exhaustive methods include k-fold and holdout cross-validation. In k-fold cross-validation, the original sample is partitioned into k (commonly 10) equally sized random subsamples. One of these k subsamples is used as the validation set for validating the model, while the remaining k − 1 subsamples are combined into a training set. The cross-validation process is performed k times so that each of the k subsamples gets a chance to be used as the validation set. The final estimate of performance is calculated as the average of the k folds’ results. One benefit of this approach is that all observations get an opportunity to be used for both training and validation purposes. To improve the robustness of the evaluation, repeated k-fold cross-validation can be performed so that multiple partitions are considered, which reduces the influence of any single split. The final evaluation is the average of all the repeats.

The holdout method is known as the simplest form of cross-validation. It randomly partitions the dataset into training and validation sets using a predetermined split ratio. A single run can be executed or multiple iterations can be averaged.
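As a rough sketch of how these splits can be generated, the snippet below builds k-fold and leave-p-out index sets using only the Python standard library. The helper names `k_fold_indices` and `leave_p_out_indices` are hypothetical, chosen for this illustration; in practice a library implementation would normally be used instead.

```python
import random
from itertools import combinations

def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def leave_p_out_indices(n, p):
    """Yield (train, validation) index lists for every choice of p validation items."""
    for validation in combinations(range(n), p):
        held_out = set(validation)
        train = [idx for idx in range(n) if idx not in held_out]
        yield train, list(validation)

k_fold_splits = list(k_fold_indices(n=10, k=5))
lpo_splits = list(leave_p_out_indices(n=5, p=2))

print(len(k_fold_splits))  # 5 splits, one per fold
print(len(lpo_splits))     # 10 splits, i.e. "5 choose 2"
```

The leave-p-out count grows combinatorially with n, which is why the exhaustive methods are usually reserved for small datasets.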
A technique known as stratification can be used in cross-validation to ensure the class ratios in the entire dataset are preserved in both the training and validation sets.
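A minimal sketch of stratification, assuming scikit-learn's `StratifiedKFold` and a made-up, imbalanced label vector (90 negatives, 10 positives). The features are placeholders because only the labels influence how the folds are drawn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only the labels affect the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count the positives that land in each validation fold.
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(positives_per_fold)
```

With 10 positives spread over 5 folds, each validation fold should receive 2 of them, so every fold keeps the 9:1 class ratio of the full dataset.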

# What Metrics are used to Measure Machine Learning Performance?
Classification is normally binary or multi-class. Binary classification uses terminology from the medical field to distinguish between its two classes. It is assumed that one of the classes is the main one (positive) while the other is not of primary focus (negative). For example, when diagnosing a disease, positive and negative outcomes signify the actual presence or absence of the disease, respectively. Four possible outcomes can be derived from these two classes: true positives, false positives, true negatives and false negatives, depending on whether a model correctly (true) or incorrectly (false) predicts the outcome. Sensitivity is the fraction of true positives over all actual positives (true positives + false negatives), while specificity is the fraction of true negatives over all actual negatives (true negatives + false positives). The most common means of assessing classification is accuracy, which is the fraction of correct (true) classifications out of all classifications made. Other common evaluation tools include the confusion matrix, precision, recall (another name for sensitivity) and the F1 score.
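The definitions above translate directly into arithmetic on the four outcome counts. The sketch below uses made-up counts and a hypothetical helper, `binary_metrics`; precision and F1 are not defined in the text, so the standard definitions are assumed (precision is true positives over all predicted positives, and F1 is the harmonic mean of precision and recall).

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from the four outcome counts."""
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)     # assumed standard definition
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }

# Made-up counts for illustration: 100 cases in total.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts, accuracy is 85/100 = 0.85 and sensitivity is 40/45 ≈ 0.889. The helper does not guard against empty denominators (for example, no actual positives), which a fuller version would need to handle.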

# Interactive Dataset Manipulation Example
Adapted from: https://www.kdnuggets.com/2022/07/knearest-neighbors-scikitlearn.html 

In [1]:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

In [2]:
# url for Iris dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read in the dataset
df = pd.read_csv(url, names=names)

In [3]:
df.head()

   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [5]:
# Features: every column except the last one
X = df.iloc[:, :-1].values
# Labels: the 'Class' column (index 4)
y = df.iloc[:, 4].values

In [6]:
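from sklearn.model_selection import train_test_split
# Holdout split: 80% training, 20% testing. No random_state is set, so the split (and the results below) vary between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)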

In [7]:
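from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training data only, so no information from the test set leaks into training
scaler.fit(X_train)

# Apply the same scaling to both subsets
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)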

In [8]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

KNeighborsClassifier()

In [9]:
y_pred = classifier.predict(X_test)

In [10]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[ 8  0  0]
 [ 0 10  0]
 [ 0  1 11]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         8
Iris-versicolor       0.91      1.00      0.95        10
 Iris-virginica       1.00      0.92      0.96        12

       accuracy                           0.97        30
      macro avg       0.97      0.97      0.97        30
   weighted avg       0.97      0.97      0.97        30
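
The figures in the report can be reproduced by hand from the confusion matrix printed above, which is a useful check on the metric definitions. The sketch below copies that matrix into a NumPy array (rows are the actual classes, columns the predicted classes, as in scikit-learn's `confusion_matrix`) and recomputes the per-class values; the variable names are chosen for this illustration.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: actual class, columns: predicted class).
cm = np.array([[8, 0, 0],
               [0, 10, 0],
               [0, 1, 11]])

correct = np.diag(cm)                 # correctly classified instances per class
precision = correct / cm.sum(axis=0)  # divide by column totals (predicted counts)
recall = correct / cm.sum(axis=1)     # divide by row totals (actual counts)
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("f1-score: ", np.round(f1, 2))
print("accuracy: ", round(float(accuracy), 2))
```

The single misclassified Iris-virginica sample lowers that class's recall to 11/12 and Iris-versicolor's precision to 10/11, which matches the 0.92 and 0.91 shown in the report.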

