Software specification


Linear Regression

Class Diagram

classDiagram LinearModel <|-- LinearRegression class LinearModel { +isFit : bool = false } class LinearRegression { +m : float = 0 +b : float = 0 +fit(xTrain : float[], yTrain : float[]) +predict(xTest : float[]) float[] +mserror(yTrain : float[], yPredict : float[]) float }
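A minimal usage sketch of the class above, assuming a no-argument constructor (the constructor is not shown in the diagram) and purely hypothetical sample data:

    const lr = new LinearRegression();
    // Hypothetical training data for a roughly linear relation y ≈ 2x + 1
    const xTrain = [1, 2, 3, 4, 5];
    const yTrain = [3.1, 4.9, 7.2, 9.1, 10.8];
    lr.fit(xTrain, yTrain);                      // estimates the slope m and intercept b
    const yPredict = lr.predict(xTrain);         // predictions for the training inputs
    console.log(lr.mserror(yTrain, yPredict));   // mean squared error of the fit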

Polynomial Regression

Class diagram

classDiagram PolynomialModel <|-- PolynomialRegression class PolynomialModel { +isFit : bool = false } class PolynomialRegression { +solutions :[] +error :float +fit(xArray: float[], yArray: float[], degree: int) +predict(xArray: float[]) +calculateR2(xArray: float[], yArray: float[]) +getError() }

Methods and attributes description

# Method or Attribute Scope Description
1 isFit PolynomialModel Class Attribute that indicates whether the model has already been trained
2 solutions PolynomialRegression Class Array of floats that stores the coefficients of the polynomial model
3 fit PolynomialRegression Class Method that computes the solutions for the polynomial model based on the logic described below
4 xArray Method fit - PolynomialRegression Class Array parameter of x values used to train the model
5 yArray Method fit - PolynomialRegression Class Array parameter of y values used to train the model
6 degree Method fit - PolynomialRegression Class Parameter with the desired degree for the model
7 equationSize Method fit - PolynomialRegression Class Variable that defines the number of equations used in the model based on its degree
8 nElements Method fit - PolynomialRegression Class Variable that defines the number of columns used in the matrix for the model
9 equations Method fit - PolynomialRegression Class Matrix that stores the equations and solutions for the polynomial model
10 predict PolynomialRegression Class Function that returns the predicted values based on the training done previously by the model
11 xArray Method predict - PolynomialRegression Class Parameter with the x array values to be predicted
12 yArray Method predict - PolynomialRegression Class Array with the y values predicted by the model
13 calculateR2 PolynomialRegression Class Method that computes the r^2 value of the trained model
14 getError PolynomialRegression Class Method that returns the r^2 value of the trained model

Logic Used

To train the model and obtain an accurate prediction for the given input degree, several mathematical concepts were combined to build a precise function for the model. This library uses the following concepts to develop the solution:

  • Least Squares
  • Gauss-Jordan

Least Squares

This method is used to build a coefficient matrix for the solution of the equation system. The matrix is symmetric and is created based on an input degree that we call "m". The logic of least squares is simple: it consists of creating a matrix that contains a group of equations whose degree increases as new equations are added to the matrix. This matrix is later solved to obtain the coefficients of the solution of the regression model.

n        Σx         Σx^2       ...  Σx^m       | Σy
Σx       Σx^2       Σx^3       ...  Σx^(m+1)   | Σxy
Σx^2     Σx^3       Σx^4       ...  Σx^(m+2)   | Σx^2·y
...      ...        ...        ...  ...        | ...
Σx^m     Σx^(m+1)   Σx^(m+2)   ...  Σx^(2m)    | Σx^m·y
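The sketch below illustrates how such an augmented matrix can be built in JavaScript. It is only an approximation of the idea, not the library's internal implementation; the function name buildNormalEquations is hypothetical.

    // Sketch: augmented normal-equations matrix for a polynomial of degree m.
    function buildNormalEquations(xArray, yArray, m) {
      const rows = m + 1;
      // Each row has m + 2 entries: m + 1 coefficients plus the augmented column.
      const equations = Array.from({ length: rows }, () => new Array(rows + 1).fill(0));
      for (let row = 0; row < rows; row++) {
        for (let col = 0; col < rows; col++) {
          // Σ x^(row + col)
          equations[row][col] = xArray.reduce((sum, x) => sum + Math.pow(x, row + col), 0);
        }
        // Augmented column: Σ x^row · y
        equations[row][rows] = xArray.reduce((sum, x, i) => sum + Math.pow(x, row) * yArray[i], 0);
      }
      return equations;
    }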

Gauss-Jordan

This method is used to obtain the solution for the model: once the equation matrix has been built, Gauss-Jordan elimination is applied to find the coefficients of the equations, which are then stored in the array of solutions.

After the elimination, the left block of the augmented matrix becomes the identity and the augmented column contains the coefficients c_0 ... c_m of the polynomial:

1    0    0    ...  0  | c_0
0    1    0    ...  0  | c_1
0    0    1    ...  0  | c_2
...  ...  ...  ...  ...| ...
0    0    0    ...  1  | c_m
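A minimal sketch of Gauss-Jordan elimination over that augmented matrix, again as an illustration rather than the library's exact code:

    // Sketch: Gauss-Jordan elimination over the augmented normal-equations matrix.
    function gaussJordan(equations) {
      const n = equations.length;
      for (let pivot = 0; pivot < n; pivot++) {
        // Normalize the pivot row so the pivot element becomes 1.
        // (No partial pivoting: this assumes non-zero pivots.)
        const pivotValue = equations[pivot][pivot];
        for (let col = 0; col <= n; col++) {
          equations[pivot][col] /= pivotValue;
        }
        // Eliminate the pivot column from every other row.
        for (let row = 0; row < n; row++) {
          if (row === pivot) continue;
          const factor = equations[row][pivot];
          for (let col = 0; col <= n; col++) {
            equations[row][col] -= factor * equations[pivot][col];
          }
        }
      }
      // The augmented column now holds the solutions (the polynomial coefficients).
      return equations.map((row) => row[n]);
    }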

Usage

Using the PolynomialRegression library is simple; you just have to follow these steps.

Import the library

In order to use the library you must import it into your HTML code. The library can be found in the dist folder as the PolynomialRegression.js file.
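For example, assuming the page is served next to the dist folder, the import is a single script tag:

    <script src="dist/PolynomialRegression.js"></script>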

Training data

After importing the library into your HTML code you need to train the model. To do so, you must create two training data arrays that will be used to train the model.
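For example, with purely hypothetical sample values:

    // Hypothetical training data: y roughly follows a quadratic curve in x
    const xTrain = [1, 2, 3, 4, 5, 6, 7];
    const yTrain = [2.1, 4.3, 9.2, 16.5, 24.9, 36.3, 48.8];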

Training the model

To train the model you must create an instance of the PolynomialRegression class and call its fit method, passing the xArray and yArray of training data followed by the desired degree for the model.
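A minimal sketch, assuming the class exposes a no-argument constructor (the constructor is not shown in the class diagram above):

    const model = new PolynomialRegression();
    // fit(xArray, yArray, degree) as described in the class diagram; degree 2 is just an example
    model.fit(xTrain, yTrain, 2);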

Prediction

To create a prediction you must build an array of x values to be predicted; then you only have to call the predict function and pass it the values to predict.
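Continuing the sketch above:

    const xTest = [8, 9, 10];                 // hypothetical values to predict
    const yPredicted = model.predict(xTest);  // predicted y values for xTest
    // Optionally, evaluate the fit on the training data:
    model.calculateR2(xTrain, yTrain);
    console.log(model.getError());            // r^2 of the trained model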

Decision Tree - ID3

Class Diagram

classDiagram DecisionTreeID3 -- NodeTree DecisionTreeID3 -- Feature DecisionTreeID3 -- Atrribute class NodeTree { +id : string = "" +tag : string = "" +value : string = "" +child : NodeTree[] = [] } class Feature { +attribute : string = "" +entropy : Float = -1 +gain : Float = -1 +primaryCount : Integer = 0 +secondaryCount : Integer = 0 +primaryPosibility : string = "" +secondPosibility : string = "" +updateFeature(_posibility : string ) bool +calculateEntropy(_p : Integer, _n: Integer ) Float } class Atrribute { +attribute : string = "" +features : any[] = [] +infoEntropy : = -1 +gain : = -1 +index : = -1 } class DecisionTreeID3 { +dataset : any[] = [] +generalEntropy : float = 0 +primaryCount : float = 0 +secondaryCount : float = 0 +primaryPosibility : string = "" +secondaryPosibility : string = "" +root : NodeTree = null +calculateEntropy(_p : Integer, _n : Integer) Number +train(_dataset : any[], _start : Integer) NodeTree +predict(_predict : any[], _root : NodeTree) NodeTree +recursivePredict(_predict : any[],_node :NodeTree) NodeTree +calculateGeneralEntropy(_dataset : any[], indexResult : Integer) Float +classifierFeatures(_dataset : any[], indexFeature : Integer, indexResult : Integer) Feature[] +calculateInformationEntropy(_features : Feature[]) Number +calculateGain(_generalEntropy : Number, _infoEntropy : Number) Number +selectBestFeature(_attributes : Attribute[]) Integer +generateDotString(_root : NodeTree) string +recursiveDotString(_root : NodeTree, _idParent : string) string }

DecisionTreeID3 Class

Properties

Name Description
dataset 2-dimensional matrix that contains the data header in the first row and the last column contains the 2 possible classes.
generalEntropy Stores the overall entropy of the data set.
primaryCount Stores how many times the first class found appears in the data set.
secondaryCount Stores how many times the second class found appears in the data set.
primaryPosibility Stores the first class found in the data set.
secondaryPosibility Stores the second class found in the data set.
root It is the root of the tree resulting from training

Methods

Name Description
calculateEntropy Receives 2 parameters: the number of times the first label appears and the number of times the second label appears; it returns the result of the entropy equation (see the sketch after this table).
train This method is in charge of generating the decision tree from the data set it receives; it returns the root node of the generated tree.
predict This function classifies the 2xm matrix it receives as a parameter. It starts the search from the node it receives as a parameter and returns the node with the class the data belongs to, or null if it is not able to classify it.
recursivePredict This function is used as an aid to traversing the decision tree.
calculateGeneralEntropy This function analyzes the received data set, counting the number of times the first and second class appear; the index parameter it receives indicates the column in which the count must be performed. It makes use of the calculateEntropy function and returns the entropy of the entire data set.
classifierFeatures This function analyzes the received data set, using the index parameter, separates each data into the corresponding characteristic, and returns a list of characteristics.
calculateInformationEntropy This function receives a list of characteristics and returns the value of the entropy of the information for the received characteristic.
calculateGain This function receives the general entropy and the information entropy and returns the value of the gain.
selectBestFeature This function receives a set of characteristics and returns the index of the characteristic with the highest gain.
generateDotString This function is used to generate the string in the format that the visjs tool accepts to generate a tree type graph.
recursiveDotString This function is auxiliary to traverse the tree and generate the string for visjs
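For reference, calculateEntropy corresponds to the standard two-class entropy used by ID3. The following JavaScript is a sketch of that formula, not necessarily the library's exact code:

    // Standard ID3 two-class entropy, where p and n are the counts of each class:
    // H(p, n) = -(p/(p+n))·log2(p/(p+n)) - (n/(p+n))·log2(n/(p+n))
    function calculateEntropy(p, n) {
      const total = p + n;
      if (total === 0 || p === 0 || n === 0) return 0; // a pure (or empty) split has zero entropy
      const pp = p / total;
      const pn = n / total;
      return -(pp * Math.log2(pp)) - (pn * Math.log2(pn));
    }
    // Information gain of an attribute: gain = generalEntropy - informationEntropy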

Overview

This Machine Learning process works with the ID3 algorithm. It is a simple implementation to classify data matrices with the following characteristics:

  1. The data set is a matrix and contains the header.
  2. The result column must be the last column of the matrix.
  3. Each header entry is considered an Attribute.
  4. Each data value is considered a Feature.

Example Data Set:

[
    ["Attr1", "Attr2", "Attr3", "Result"],
    ["lorem", "lorem", "lorem", "Class1"],
    ["lorem", "lorem", "lorem", "Class2"],
    ["lorem", "lorem", "lorem", "Class1"],
    ["lorem", "lorem", "lorem", "Class2"],
    ["lorem", "lorem", "lorem", "Class1"],
]
Note: Attr#, lorem, Result and Class# can be any string.

Data to Predict:

The data to be predicted consists of a matrix with 2 rows and m columns. The first row will be the header, the second row will consist of the data for decision making.

[
    ["Attr1", "Attr2", "Attr3"],
    ["lorem", "lorem", "lorem"],
]
Note: The header must be in the data

Result of predict:

The result of the classification consists of an object of type NodeTree which has in its value attribute the classification of the data entered.

{
    childs: []
    id: "3eb4d4228163"
    tag: "Overcast"
    value: "Yes"
}

Usage

  1. Import the library inside the HTML page in a script tag.
  2. Prepare a matrix to use as a data set.
  3. Train the algorithm.
  4. Predict data!
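A minimal sketch of those steps with hypothetical data, assuming a no-argument constructor and that the second parameter of train is a starting index (its exact meaning is not documented above):

    const dataset = [
      ["Outlook",  "Windy", "Play"],   // header row (Attributes)
      ["Sunny",    "false", "No"],
      ["Overcast", "false", "Yes"],
      ["Rain",     "true",  "No"],
    ];

    const id3 = new DecisionTreeID3();
    const root = id3.train(dataset, 0);          // returns the root NodeTree

    // 2 x m matrix: header row plus the row to classify
    const toPredict = [
      ["Outlook",  "Windy"],
      ["Overcast", "false"],
    ];
    const result = id3.predict(toPredict, root);
    console.log(result ? result.value : "unable to classify");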

KMeans - Linear and 2D Data


Class Diagram

classDiagram KMeans<|-- LinearKMeans KMeans<|-- _2DKMeans class KMeans { +k : int = 3 } class LinearKMeans { +data : int[] = [] +clusterize(k : int, data : int[], iterations : int ) any[] +distance(point_a : int, point_b : int) int +calculateMeanVariance(arr : int) any[] } class _2DKMeans { +data : any[] = [] +clusterize(k : int, data : any[], iterations : int ) any[] +distance(point_a : any[], point_b : any[]) int +calculateMeanVariance(arr : int) any[] }

LinearKMeans Class

Properties

Name Description
k Number of clusters in which the points will be grouped.
data Array of numbers representing each point on the X axis.
iterations Number of repetitions that the algorithm will perform. The iterations apply specifically to the randomization of possible cluster points.

Methods

Name Parameters Description
clusterize
  • k
  • data
  • iterations
This is the main method of the KMeans algorithm, which follows these steps:
  1. k random points are selected from the data array without repetition. These are the potential clusters.
  2. The distance between each point in the data array and each potential cluster is calculated and stored in the same array.
  3. Each point is assigned to the closest (smallest distance) potential cluster point.
  4. The mean and variance of each cluster group are calculated and stored.
  5. The points are reassigned to the potential clusters, but this time the distance is calculated between the point and the cluster group mean. This step is repeated until there are no more changes in the cluster groups.
  6. The sum of each group's variance (total variance) of the potential cluster is stored.
  7. Steps 1 through 6 are repeated "iterations" times.
  8. The potential cluster set with the lowest total variance is selected as the optimal solution and returned.
distance
  • point_a
  • point_b
This method returns the distance between two points (point_b - point_a)
calculateMeanVariance
  • arr
This method returns the mean and variance of an array of values. The mean of arr is the sum of all the data divided by the item count. The variance of arr is the sum of the squared differences between each point and the mean, divided by the item count.
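As a reference, the calculation described above corresponds to the following sketch (not necessarily the library's exact code):

    // Population mean and variance of a numeric array, as described above
    function calculateMeanVariance(arr) {
      const mean = arr.reduce((sum, x) => sum + x, 0) / arr.length;
      const variance = arr.reduce((sum, x) => sum + (x - mean) ** 2, 0) / arr.length;
      return [mean, variance];
    }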

Example Data

data = [
    -99,  -92,  -89,  -87,  -83,  -82,  -78,
    -76,  -70,  -62,  -57,  -55,  -50,  -42,
    -35,  -33,  -32,  -30,  -27,  -17,  -12,
    -10,  0,  1,  2,  25,  29,  33,  39,
    41,  53,  54,  67 ]
k = 3
iterations = 3
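With that data, a minimal call could look like this (the no-argument constructor is an assumption; the diagram only declares clusterize(k, data, iterations)):

    const km = new LinearKMeans();
    const clusters = km.clusterize(k, data, iterations);
    console.log(clusters);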

Result of Linear KMeans clustering:

The result is shown on a graph; each point has a color which identifies the cluster group it is assigned to. The cluster points are drawn as red dots.

_2DKMeans Class

Properties

Name Description
k Number of clusters in which the points will be grouped.
data Array of coordinates [x,y] representing each point.
iterations Number of repetitions that the algorithm will perform. The iterations apply specifically to the randomization of possible cluster points.

Methods

Name Parameters Description
clusterize
  • k
  • data
  • iterations
This is the main method of the KMeans algorithm, which follows these steps:
  1. k random points are selected from the data array without repetition. These are the potential clusters.
  2. The distance between each point in the data array and each potential cluster is calculated and stored in the same array.
  3. Each point is assigned to the closest (smallest distance) potential cluster point.
  4. The mean and variance of each cluster group are calculated and stored.
  5. The points are reassigned to the potential clusters, but this time the distance is calculated between the point and the cluster group mean. This step is repeated until there are no more changes in the cluster groups.
  6. The sum of each group's variance (total variance) of the potential cluster is stored.
  7. Steps 1 through 6 are repeated "iterations" times.
  8. The potential cluster set with the lowest total variance is selected as the optimal solution and returned.
distance
  • point_a
  • point_b
This method returns the distance between two points
calculateMeanVariance
  • arr
This method returns the mean and variance of an array of values. The mean of arr is the sum of all the data divided by the item count. The variance of arr is the sum of the squared differences between each point and the mean, divided by the item count.

Example Data

data = [
    [11,6],  [4,2],  [15,0],  [10,6],  [7,8],
    [9,12],  [13,0],  [5,1],  [0,13],  [7,5],
    [6,1],  [3,6],  [0,10],  [14,10],  [6,14],
    [6,4],  [4,9],  [5,14],  [9,9],  [13,8] ]
k = 3
iterations = 10
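Analogously for the 2D case (again, the no-argument constructor is an assumption):

    const km2d = new _2DKMeans();
    const clusters2d = km2d.clusterize(k, data, iterations);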

Result of 2D KMeans clustering:

The result is shown on a graph; each point has a color which identifies the cluster group it is assigned to. The cluster points are drawn as red dots.


Bayes Method

classDiagram class MethodBayes { +attributes : any[] = [] +classes : any[] = [] +frecuencyTables : any[] = [] +attributesNames : any[] = [] +className : String = null +addAttribute(values : [], attributesNames : []) +addClass(values : [], className: String) +train() +probability(attributesName : String, cause: String, effect: String) +predict(causes : [], effect: String) +toFrecuencyTable(values:[]) }

Bayes Class

Properties

Name Description
attributes Attributes that the model to evaluate will contain
classes Classes that the model to evaluate will contain
frecuencyTables Table containing the data of the probabilities of each event
attributesNames Stores the name of each attribute registered in the model
className Stores the value of the last class that was added

Methods

Name Parameters Description
addAttribute
  • values
  • attributeName
Method used to add an attribute to the array of attributes that belong to a specific model. Its values parameter contains the values to add for the attribute, and attributeName is the name used to associate it.
addClass
  • values
  • className
Method used to add a class to the array of classes that belong to a specific model. Its values parameter contains the values to add for the class, and className is the name used to associate it.
train Method used to train our model with its corresponding attributes and classes. This method also makes use of the frequency table to store the probabilities found.
probability
  • attributeName
  • cause
  • effect
Method used to calculate the probabilities of an attribute using its cause and effect
predict
  • cause
  • effect
Method used to predict an outcome through a given event. To predict these events, the cause and effect parameters are used.
isModelValid Method used to validate that a model meets its minimum requirements, such as containing classes and attributes.
toFrecuencyTable
  • values
Method used to store and return the frequency table that contains the probabilities, frequencies and values of the model
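A minimal usage sketch based on the methods above (the data is hypothetical and the no-argument constructor is an assumption):

    const bayes = new MethodBayes();

    // Hypothetical attributes and class for a "play outside?" model
    bayes.addAttribute(["Sunny", "Rainy", "Sunny", "Overcast"], "Outlook");
    bayes.addAttribute(["Hot", "Cold", "Hot", "Mild"], "Temperature");
    bayes.addClass(["Yes", "No", "Yes", "Yes"], "Play");

    bayes.train();                                             // builds the frequency tables
    // probability(attributeName, cause, effect) as declared above
    console.log(bayes.probability("Outlook", "Sunny", "Yes"));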

Example Data

Result of Method Bayes:

The result is shown below


NaiveBayes


Class Diagram

classDiagram class NaiveBayes { +causes : any[] = [] +insertCause(name : string, array : any[]) +predict(effect : string, events : []) any[] +getSimpleProbability(event : []) +getConditionalProbability(eventA : [],eventB : []) +getCauseByName(cause_name:string) }

NaiveBayes Class

Properties

Name Description
causes Array of the causes to be taken into account for the prediction

Methods

Name Parameters Description
insertCause
  • effect (string)
  • events (array)
This method fills the data used to make the predictions. Each cause added must have an events array that matches the others already inserted; otherwise an error is reported in the console.
  • The effect parameter represents the name of the column/cause added, and it must be unique
  • The events parameter receives the data used for the column/cause
predict
  • effect (string)
  • events (array)
This is the main method that runs the prediction for the data entered through the parameters.
  • The effect parameter represents the name of the column/cause for which you want to make the prediction
  • The events parameter receives an array of tuples [column_name, value] for known events,
    e.g. [["name", value], ["name", value], ["name", value] ...]
getSimpleProbability
  • event (array)
This method returns the probability that an event occurs. Events are represented by tuples with the format ["column_name", value]. In other words, it returns the probability of finding "value" in cause "column_name".
getConditionalProbability
  • event_A (array)
  • event_B (array)
Returns probability of event_A occurring given that event_B occurs. Events are represented by tuples with the format ["column_name", value].
getCauseByName
  • cause_name (string)
Returns the array of values corresponding to the column/cause identified by the value received in the parameter.
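A minimal usage sketch based on the tuple format described above (hypothetical data; the no-argument constructor is an assumption):

    const nb = new NaiveBayes();

    // Hypothetical columns/causes; every insertCause call uses arrays of the same length
    nb.insertCause("Outlook", ["Sunny", "Rainy", "Sunny", "Overcast"]);
    nb.insertCause("Windy",   ["false", "true",  "false", "true"]);
    nb.insertCause("Play",    ["Yes",   "No",    "Yes",   "Yes"]);

    // Predict the "Play" column given the known events, using the tuple format above
    const prediction = nb.predict("Play", [["Outlook", "Sunny"], ["Windy", "false"]]);
    console.log(prediction);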

Logistic Regression


Class Diagram

classDiagram LogisticModel <|-- LogisticRegression LogisticModel <|-- MultiClassLogistic class LogisticModel { } class LogisticRegression { + alpha: int , 0.001 + lambda: int, 0 + iterations: int, 100 + fit(data) + computeThreshold(X,Y) + grad(X,Y,theta) + h(x_i, theta) + transform(x) + cost(X,Y,theta) } class MultiClassLogistic { + alpha: int , 0.001 + lambda: int, 0 + iterations: int, 100 + fit(data,classes) + transform(x) }

Logistic Regression

Properties

Name Description
alpha Learning rate used when updating the coefficients on each iteration
lambda Regularization parameter of the function
iterations Number of iterations that the algorithm must perform

Methods

Name Parameters Description
fit
  • data (matrix) for simple
  • classes (array) for multiclass
This method separates the input matrix into the respective arrays of X, Y, and the current value(s), and then carries out the transformation and evaluation of the function.
  • The data parameter represents the matrix of data to analyze
  • The classes parameter receives the array of classes for the multiclass case
computeThreshold
  • X (int)
  • Y (int)
This method obtains the threshold from the coordinates obtained in X and Y
grad
  • X (int)
  • Y (int)
  • theta (int)
This method calculates and obtains the gradient with which the function will be evaluated
h
  • x_i (int)
  • theta (int)
This method evaluates the X coordinate with the theta coefficient in the logistic regression function.
transform
  • X (int)
This method returns the transformation of the prediction: in the binary case it checks whether the result is 1 or 0, and in the multiclass case it compares the classes with the array.
cost
  • X (int)
  • Y (int)
  • theta (int)
This method returns the cost of the function, which is obtained by evaluating the function through the h() method.
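For reference, the hypothesis evaluated by h(x_i, theta) is the standard logistic (sigmoid) function. The sketch below shows that formula and a hypothetical call; the no-argument constructor and the layout of data (feature values followed by a 0/1 label in each row) are assumptions, not details stated above:

    // Standard logistic hypothesis: h = 1 / (1 + e^(-theta · x))
    function sigmoid(z) {
      return 1 / (1 + Math.exp(-z));
    }

    // Hypothetical usage of the class described above
    const logistic = new LogisticRegression();
    // Assumed layout: each row holds the feature values followed by its 0/1 label
    logistic.fit([
      [0.5, 1.2, 0],
      [1.5, 2.3, 1],
      [2.0, 3.1, 1],
    ]);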